CN116644296A - Data enhancement method and device - Google Patents
- Publication number
- CN116644296A (application CN202310928849.4A)
- Authority
- CN
- China
- Prior art keywords
- sample
- data
- sampling
- class
- category
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/20—Pattern recognition; Analysing
- G06F18/15—Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N20/00—Machine learning
Abstract
The application provides a data enhancement method and device, comprising the following steps: generating a sample data set based on at least one frame of point cloud data, wherein the sample data set comprises sample data corresponding to a plurality of sample categories; performing statistical analysis on the sample data set to determine sampling parameters for each sample category, wherein the sampling parameters comprise at least one of a sampling condition and a sampling number; for each sample category, screening, from the sample data corresponding to that category, target sample data that conforms to the category's sampling parameters; sampling each sample category based on the target sample data; and generating a training data set based on the sample data sampled for each sample category. In this way, a training data set of higher data quality can be generated through screening before model training, so that the training data samples are balanced while the problem of model overfitting is avoided.
Description
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a data enhancement method and apparatus.
Background
The effectiveness of machine learning depends greatly on the training data. Taking 3D object detection in the autonomous driving field as an example, when data are collected, the high cost of manually labeling data means that sufficient data can generally be gathered only for common objects, while enough data for unusual objects are hard to collect. The resulting sample imbalance causes the model to favor the majority categories and neglect the minority ones during training, impairing its generalization ability. For example, in real autonomous driving scenarios some object categories appear very infrequently; pedestrians, for instance, appear far less often than vehicles and road signs. Yet if the model's recognition accuracy for pedestrians is low, the autonomous vehicle may fail to avoid them in time, potentially causing accidents. Solving the data sample imbalance problem is therefore very important for machine learning.
To address this problem, the common industry approach today is resampling, which includes downsampling and upsampling. However, downsampling wastes a large amount of high-quality data samples, while upsampling occurs randomly during model training, offers no control over the quality of the sampled data, and cannot guarantee that the generated samples are reasonable, which can negatively affect training; sometimes the sample numbers are balanced, yet the model overfits. Neither existing scheme therefore meets the data enhancement requirements.
Disclosure of Invention
In view of the above, the present application is directed to a data enhancement method and apparatus that determine the sampling parameters of each sample category by performing statistical analysis on a sample data set, screen the sample data of each category in advance according to those parameters, and sample from the screened data to generate a training data set. A training data set of higher data quality can thus be generated before model training, balancing the training data samples while avoiding the problem of model overfitting.
The embodiment of the application provides a data enhancement method, which comprises the following steps:
generating a sample data set based on at least one frame of point cloud data, wherein the sample data set comprises sample data corresponding to a plurality of sample categories;
carrying out statistical analysis on the sample data set, and determining sampling parameters of each sample category, wherein the sampling parameters comprise at least one of sampling conditions and sampling quantity;
for each sample category, screening target sample data that conforms to the sampling parameters of the sample category from the sample data corresponding to that category;
sampling, for each sample class, the sample class based on the target sample data;
a training data set is generated based on the sample data sampled for each sample class.
Further, performing statistical analysis on the sample data set to determine sampling parameters of each sample class, including:
carrying out statistical analysis on the sample data set, and determining the sample number of each sample class and/or the characteristic distribution of each sample class on a data characteristic item;
determining the sampling number of each sample category according to the sample number of each sample category; and/or determining the sampling condition of each sample category according to the characteristic distribution of each sample category on the data characteristic item.
Further, determining the sampling number of each sample category according to the sample number of each sample category includes:
determining class balance reference quantity according to the number of the sample classes and the number of samples of each sample class;
determining the number grade of each sample class according to the sample number of each sample class and the class balance reference quantity;
and determining the sampling quantity of each sample class according to the sampling rule corresponding to the sample quantity and the quantity grade of each sample class.
Further, the number levels at least include a high frequency level, a medium frequency level and a low frequency level, and each number level corresponds to a sample number interval without overlapping ranges;
the sampling rule is as follows:
the sampling number corresponding to the high frequency level is equal to 0;
the sampling number corresponding to the intermediate frequency level enables the sampled sample number to reach the lower limit value of the sample number interval corresponding to the high frequency level;
and the sampling number corresponding to the low-frequency level enables the sampled sample number to reach the lower limit value of the sample number interval corresponding to the intermediate-frequency level.
Further, the data characteristic items comprise the number of point clouds corresponding to each sample and/or the distance distribution of each sample relative to the target position;
determining sampling conditions of each sample category according to the characteristic distribution of each sample category on the data characteristic item, wherein the sampling conditions comprise:
according to the characteristic distribution of each sample category on the data characteristic item, determining the distance distribution range and/or the point cloud quantity threshold value of the samples in each sample category;
and determining the sampling condition of each sample category according to the distance distribution range and/or the point cloud quantity threshold.
Further, for each sample class, sampling the sample class based on the target sample data, including:
determining the corresponding frame sampling quantity of the sample class in each frame of point cloud data according to the frame number of the at least one frame of point cloud data;
and for each frame of point cloud data, sampling from the target sample data according to the frame sampling number corresponding to the sample category, obtaining the sample data sampled for that category.
Further, generating a training data set based on the sample data sampled for each sample class includes:
and adding the sample data obtained after sampling each sample type into the frame point cloud data to obtain training frame data, wherein the training data set comprises at least one training frame data.
The embodiment of the application also provides a data enhancement device, which comprises:
the first generation module is used for generating a sample data set based on at least one frame of point cloud data, wherein the sample data set comprises sample data corresponding to a plurality of sample categories;
the analysis module is used for carrying out statistical analysis on the sample data set and determining sampling parameters of each sample category, wherein the sampling parameters comprise at least one of sampling conditions and sampling quantity;
the screening module is used for screening, for each sample category, target sample data that accords with the sampling parameters of the sample category from the sample data corresponding to that category;
the sampling module is used for sampling each sample category based on the target sample data;
and the second generation module is used for generating a training data set based on the sample data obtained by sampling for each sample category.
The embodiment of the application also provides electronic equipment, which comprises: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating over the bus when the electronic device is running, said machine readable instructions when executed by said processor performing the steps of a data enhancement method as described above.
The embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a data enhancement method as described above.
According to the data enhancement method and device provided by the embodiment of the application, the sampling parameters of each sample type are determined by carrying out statistical analysis on the sample data set; sample data of each sample category is screened in advance according to the sampling parameters, and sampling is carried out in the screened sample data to generate a training data set. Therefore, a training data set with higher data quality can be generated before model training is carried out, and the problem of model overfitting is avoided while the training data sample is balanced.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a data enhancement method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a data enhancement device according to an embodiment of the present application;
fig. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, every other embodiment obtained by a person skilled in the art without making any inventive effort falls within the scope of protection of the present application.
It was found that the effect of machine learning is greatly dependent on training data. The following description will take 3D object detection in the field of automatic driving as an example.
3D object detection is an important link in the autonomous driving field. Specifically, 3D object detection technology aims to identify roads, vehicles, pedestrians and other objects encountered while driving from the three-dimensional point cloud data produced by lidar, giving the autonomous vehicle environment perception capability and improving its safety and intelligence. However, when data are collected, the high cost of manually labeling data means that sufficient data can generally be gathered only for common objects, while enough data for unusual objects are hard to collect. The resulting sample imbalance causes the model to favor the majority categories and neglect the minority ones during training, impairing its generalization ability. For example, in real autonomous driving scenarios some object categories appear very infrequently; pedestrians, for instance, appear far less often than vehicles and road signs. Yet if the model's recognition accuracy for pedestrians is low, the autonomous vehicle may fail to avoid them in time during deployment, potentially causing accidents. Solving the data sample imbalance problem is therefore very important for 3D object detection.
To address this problem, the common industry approach today is resampling, which includes downsampling and upsampling. However, downsampling wastes a large amount of high-quality data samples, while upsampling occurs randomly during model training, offers no control over the quality of the sampled data, and cannot guarantee that the generated samples are reasonable, which can negatively affect training; sometimes the sample numbers are balanced, yet the model overfits. Neither existing scheme therefore meets the data enhancement requirements in 3D object detection.
Based on the above, the embodiment of the application provides a data enhancement method and a data enhancement device, which determine sampling parameters of each sample category by carrying out statistical analysis on a sample data set; sample data of each sample category is screened in advance according to the sampling parameters, and sampling is carried out in the screened sample data to generate a training data set. Therefore, a training data set with higher data quality can be generated before model training is performed through pre-screening, so that the problem of model overfitting is avoided while the training data sample is balanced.
Referring to fig. 1, fig. 1 is a flowchart of a data enhancement method according to an embodiment of the present application. As shown in fig. 1, a method provided by an embodiment of the present application includes:
s101, generating a sample data set based on at least one frame of point cloud data.
Here, at least one frame of point cloud data of the surrounding environment may be collected by a lidar in the perception system of the autonomous vehicle, forming an initial data set. Each frame of point cloud data comprises point cloud coordinate data and corresponding point cloud annotation data; the annotation data comprises at least one annotation box, and each annotation box together with the points inside it is regarded as one sample, so that each frame of point cloud data contains sample data of at least one sample. Different samples may belong to different sample categories, such as pedestrians, vehicles and other traffic categories.
In this step, each frame of point cloud data is statistically classified according to the sample category of each sample it contains, generating a sample data set organized in units of single samples. The sample data set comprises sample data corresponding to a plurality of sample categories.
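Step S101 can be sketched as grouping every annotation box, together with the points inside it, under its sample category. The following is a minimal Python sketch, not the patent's implementation; the "annotations", "category", "box" and "points" field names are assumptions.

```python
from collections import defaultdict

def build_sample_dataset(frames):
    """Group every annotation box (with its in-box points) by sample category.

    `frames` is a list of per-frame dicts; the field names used here are
    illustrative assumptions, not the patent's data format.
    """
    dataset = defaultdict(list)
    for frame_id, frame in enumerate(frames):
        for ann in frame["annotations"]:
            # one annotation box plus its in-box points = one sample
            dataset[ann["category"]].append({
                "frame_id": frame_id,
                "box": ann["box"],
                "points": ann["points"],
            })
    return dict(dataset)
```

Each sample keeps its source frame id, so later steps can paste sampled objects back into specific frames.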
S102, carrying out statistical analysis on the sample data set, and determining sampling parameters of each sample type.
By performing statistical analysis on the sample data set, the sampling parameters of each sample category can be determined, so that the sample data can be screened according to those parameters before sampling, balancing the sample data across the different sample categories. The sampling parameters include at least one of a sampling condition and a sampling number.
In one possible implementation, step S102 may include:
s1021, carrying out statistical analysis on the sample data set, and determining the sample number of each sample type and/or the characteristic distribution of each sample type on the data characteristic item.
The data feature items include the number of points corresponding to each sample and/or the distance distribution of each sample relative to the target position. In a specific implementation, the statistical analysis may cover information such as the number of samples in each sample category, the number of points in each sample box, the distribution of distances from the points to the target position, and the range of distance variation; from the analysis result, taken per sample category, the sample number of each category and/or its feature distribution on the data feature items can be determined.
S1022, determining the sampling number of each sample category according to the sample number of each sample category; and/or determining the sampling condition of each sample category according to the characteristic distribution of each sample category on the data characteristic item.
In a first possible implementation, determining the number of samples for each sample class in S1022 may include:
step 1, determining class balance reference quantity according to the number of sample classes and the number of samples of each sample class.
Specifically, the category balance reference is B = N / C, where N is the total number of samples in the sample data set, obtained by summing the sample numbers of all sample categories, and C is the number of sample categories.
And 2, determining the number grade of each sample class according to the sample number of each sample class and the class balance reference quantity.
Specifically, a grading threshold T may be set; the number level of each sample category is then determined from its sample number, the category balance reference B, and the threshold T. The number levels include at least a high frequency level, a medium frequency level and a low frequency level, and each level corresponds to a sample number interval, with no overlap between the intervals. The formula is:

g_i = high frequency, if n_i >= B + T; medium frequency, if B - T <= n_i < B + T; low frequency, if n_i < B - T

where g_i denotes the number level of the i-th sample category and n_i denotes the number of samples of the i-th sample category.
And step 3, determining the sampling quantity of each sample class according to the sampling rules corresponding to the sample quantity and the quantity grade of each sample class. Wherein, the sampling rule is:
the sampling number corresponding to the high frequency level is equal to 0; i.e. the high frequency level does not require sampling.
The sampling number corresponding to the medium frequency level brings the post-sampling sample number up to the lower limit of the sample number interval of the high frequency level; i.e. the number of additional samples for a medium-frequency category equals that lower limit minus its current sample number.

The sampling number corresponding to the low frequency level brings the post-sampling sample number up to the lower limit of the sample number interval of the medium frequency level; i.e. the number of additional samples for a low-frequency category equals that lower limit minus its current sample number.
In this way, for sample categories with few samples, determining the sampling number as above and sampling accordingly expands their sample numbers, avoids the sample imbalance problem, and indirectly improves the generalization ability of the model. Moreover, while the sample numbers are expanded, categories at different number levels still retain their relative ordering after sampling, so that the model can allocate its learning correctly during training.
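The three steps above can be put together in a short sketch. The piecewise grading is a reconstruction under the assumption that the balance reference is B = N / C and a single grading threshold T splits the levels at B + T and B - T:

```python
def sampling_numbers(class_counts, threshold):
    """Assign each class a frequency level and an extra-sample count.

    Reconstruction of the grading rule under the assumption that a single
    threshold T splits classes around the balance reference B = N / C.
    Medium classes are topped up to B + T (the lower limit of the
    high-frequency interval), low classes to B - T.
    """
    total = sum(class_counts.values())
    balance = total / len(class_counts)       # category balance reference B
    high_floor = balance + threshold          # lower limit of high interval
    mid_floor = balance - threshold           # lower limit of medium interval
    plan = {}
    for cls, n in class_counts.items():
        if n >= high_floor:
            plan[cls] = ("high", 0)                       # no sampling needed
        elif n >= mid_floor:
            plan[cls] = ("medium", int(high_floor - n))   # top up to B + T
        else:
            plan[cls] = ("low", int(mid_floor - n))       # top up to B - T
    return plan
```

With counts {car: 100, cone: 40, pedestrian: 10} and T = 20, the balance reference is 50: cars need no sampling, cones are topped up by 30 samples and pedestrians by 20, preserving the relative ordering of the levels.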
In a second possible implementation, the data feature items include the number of points corresponding to each sample and/or the distance distribution of each sample relative to the target position. In general, the target position may be the ego-vehicle position of the autonomous vehicle. Taking cone samples in a port sample data set as an example, statistical analysis shows that the cones are generally distributed on both sides of the vehicle, most densely around 15-20 metres away and extending to at most about 100 metres, with a relatively uniform distribution of point counts.
The determining of the sampling condition for each sample class in S1022 may include:
according to the characteristic distribution of each sample category on the data characteristic item, determining the distance distribution range and/or the point cloud quantity threshold value of the samples in each sample category; and determining the sampling condition of each sample category according to the distance distribution range and/or the point cloud quantity threshold.
For example, the sampling condition of a sample category may be that the distance from the centre point of the sample box to the ego vehicle lies within a preset distance distribution range and that the number of points inside the sample box is greater than the point-count threshold.
S103, for each sample category, screening target sample data that conforms to the sampling parameters of the sample category from the sample data corresponding to that category.
In this step, target sample data conforming to each sample category's sampling parameters can be screened in advance from the sample data corresponding to that category, so that target samples of high and reasonable data quality are retained and erroneous or noisy collected samples are filtered out.
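A minimal sketch of this screening step, assuming the ego vehicle sits at the origin of each frame's coordinate system and each sample records its box "center" and in-box "points" (both field names hypothetical):

```python
import math

def screen_samples(samples, dist_range, min_points):
    """Keep only samples whose box centre lies within `dist_range`
    (metres from the ego vehicle, assumed at the origin) and which
    contain more than `min_points` points."""
    lo, hi = dist_range
    kept = []
    for s in samples:
        cx, cy = s["center"][0], s["center"][1]
        dist = math.hypot(cx, cy)             # planar distance to ego vehicle
        if lo <= dist <= hi and len(s["points"]) > min_points:
            kept.append(s)
    return kept
```

Samples that are too far away or too sparse in points are treated as noise and dropped before any sampling takes place.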
S104, sampling each sample category based on the target sample data.
In one possible implementation, step S104 may include:
determining the frame sampling number of the sample category in each frame of point cloud data according to the number of frames of the at least one frame of point cloud data; and, for each frame of point cloud data, sampling from the target sample data according to the frame sampling number corresponding to the sample category, obtaining the sample data sampled for that category.
Specifically, for any sample category c, the frame sampling number, i.e. the number of samples of category c to be added to each frame of point cloud data, can be expressed as m_c = S_c / F, where S_c denotes the sampling number determined for category c and F denotes the number of frames. Thereafter, m_c samples are drawn for each frame from the target sample data corresponding to category c.
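A sketch of the per-frame sampling; the even split via ceiling division and the draw-with-replacement policy are assumptions, since the text only fixes the total number of samples to add per category:

```python
import math
import random

def per_frame_sampling(target_samples, total_to_add, num_frames, seed=0):
    """Distribute a category's extra samples over the point cloud frames.

    Each frame receives ceil(total_to_add / num_frames) samples drawn with
    replacement from the screened target pool (both policies assumed).
    """
    per_frame = math.ceil(total_to_add / num_frames)   # frame sampling number
    rng = random.Random(seed)                          # fixed seed for repeatability
    return {frame_id: [rng.choice(target_samples) for _ in range(per_frame)]
            for frame_id in range(num_frames)}
```

Fixing the seed makes the single pre-training sampling round reproducible, in line with the goal of reducing randomness.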
S105, generating a training data set based on sample data obtained by sampling for each sample type.
In one possible implementation, step S105 may include: adding the sample data obtained after sampling each sample category into the corresponding frame of point cloud data to obtain training frame data, wherein the training data set comprises at least one piece of training frame data. The training data set can then be used in the model training stage to obtain a 3D object detection model.
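The assembly of training frame data might look like the following sketch, assuming each frame is a dict with "points" and "annotations" lists and each sampled object carries its own "points" and "box" (all names hypothetical); the original frames are copied rather than mutated:

```python
def build_training_set(frames, sampled_by_class):
    """Paste the sampled objects for every category back into each frame's
    point cloud and annotation list, yielding the training frames.

    `sampled_by_class` maps category -> {frame_id: [sampled objects]}.
    """
    training = []
    for frame_id, frame in enumerate(frames):
        new_frame = {"points": list(frame["points"]),
                     "annotations": list(frame["annotations"])}
        for cls, per_frame in sampled_by_class.items():
            for sample in per_frame.get(frame_id, []):
                # merge the sampled object's points and add its annotation box
                new_frame["points"].extend(sample["points"])
                new_frame["annotations"].append(
                    {"category": cls, "box": sample["box"]})
        training.append(new_frame)
    return training
```

Copying keeps the initial data set intact, so the enhancement can be re-run with different sampling parameters if needed.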
In this way, with the pre-sampling approach used in the embodiments of the present application, the training data set is generated in the data preparation stage before model training; the model then trains on the regenerated training data set, and no further sampling-based data enhancement is needed during training. Moving the sampling process forward speeds up training, reduces randomness, and improves the stability of model performance.
The embodiment of the application provides a data enhancement method, which comprises the following steps: generating a sample data set based on at least one frame of point cloud data, wherein the sample data set comprises sample data corresponding to a plurality of sample categories; carrying out statistical analysis on the sample data set, and determining sampling parameters of each sample category, wherein the sampling parameters comprise at least one of sampling conditions and sampling quantity; screening target sample data conforming to sampling parameters of the sample category from sample data corresponding to the sample category aiming at each sample category; sampling, for each sample class, the sample class based on the target sample data; a training data set is generated based on the sample data sampled for each sample class.
The sampling parameters of each sample category are determined by performing statistical analysis on the sample data set; the sample data of each category are screened in advance according to those parameters, and sampling is carried out within the screened data to generate a training data set. A training data set of higher data quality can thus be generated through screening before model training, so that the training data samples are balanced while the problem of model overfitting is avoided. In addition, bringing the random sampling process forward converts the multiple rounds of random sampling during model training in the prior art into a single, rule-constrained round of random sampling before training, improving model training speed and reducing randomness.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a data enhancement device according to an embodiment of the application. As shown in fig. 2, the apparatus 300 includes:
a first generation module 310, configured to generate a sample data set based on at least one frame of point cloud data, where the sample data set includes sample data corresponding to a plurality of sample categories;
an analysis module 320, configured to perform statistical analysis on the sample data set, and determine a sampling parameter of each sample class, where the sampling parameter includes at least one of a sampling condition and a sampling number;
a screening module 330, configured to screen, for each sample class, target sample data that accords with sampling parameters of the sample class from sample data corresponding to the sample class;
a sampling module 340, configured to sample, for each sample class, the sample class based on the target sample data;
the second generating module 350 is configured to generate a training data set based on the sample data obtained by sampling for each sample class.
Further, when performing statistical analysis on the sample data set to determine the sampling parameters of each sample category, the analysis module 320 is configured to:
perform statistical analysis on the sample data set to determine the number of samples in each sample class and/or the feature distribution of each sample class over a data feature item;
determine the sampling number of each sample class according to its number of samples; and/or determine the sampling condition of each sample class according to its feature distribution over the data feature item.
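A minimal version of this analysis step might look as follows (the dictionary-based sample representation and the two feature items shown — per-sample point count and distance to the target position — are illustrative assumptions):

```python
from collections import Counter, defaultdict

def analyze(dataset):
    """Statistical analysis: per-class sample counts plus the per-class
    distribution of two data feature items (point count per sample and
    distance of each sample to the target position)."""
    counts = Counter(s["cls"] for s in dataset)
    feature_dist = defaultdict(lambda: {"num_points": [], "distance": []})
    for s in dataset:
        feature_dist[s["cls"]]["num_points"].append(s["num_points"])
        feature_dist[s["cls"]]["distance"].append(s["distance"])
    return counts, dict(feature_dist)
```

The counts feed the sampling-number rule, while the feature distributions feed the sampling conditions.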
Further, when determining the sampling number of each sample class according to the number of samples in each sample class, the analysis module 320 is configured to:
determine a class balance reference quantity according to the number of sample classes and the number of samples in each sample class;
determine the number level of each sample class according to its number of samples and the class balance reference quantity;
and determine the sampling number of each sample class according to the sampling rule corresponding to the number level of that class.
Further, the number levels at least include a high-frequency level, a medium-frequency level and a low-frequency level, each number level corresponding to a sample number interval, and the intervals do not overlap;
the sampling rule is:
the sampling number corresponding to the high-frequency level is 0;
the sampling number corresponding to the medium-frequency level brings the class's sample count after sampling up to the lower limit of the sample number interval of the high-frequency level;
and the sampling number corresponding to the low-frequency level brings the class's sample count after sampling up to the lower limit of the sample number interval of the medium-frequency level.
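In code, this rule might be expressed as follows (the interval boundaries passed in are hypothetical; the application only requires that the sample number intervals not overlap):

```python
def sampling_numbers(class_counts, high_lower, mid_lower):
    """Assign a per-class sampling number from its frequency level.

    high_lower / mid_lower are the lower limits of the sample-number
    intervals for the high- and medium-frequency levels.
    """
    numbers = {}
    for cls, n in class_counts.items():
        if n >= high_lower:                  # high-frequency: no sampling
            numbers[cls] = 0
        elif n >= mid_lower:                 # medium-frequency: top up to
            numbers[cls] = high_lower - n    # the high-frequency lower limit
        else:                                # low-frequency: top up to
            numbers[cls] = mid_lower - n     # the medium-frequency lower limit
    return numbers
```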
Further, the data characteristic items comprise the number of point clouds corresponding to each sample and/or the distance distribution of each sample relative to the target position;
when determining the sampling condition of each sample class according to the feature distribution of each sample class over the data feature item, the analysis module 320 is configured to:
according to the characteristic distribution of each sample category on the data characteristic item, determining the distance distribution range and/or the point cloud quantity threshold value of the samples in each sample category;
and determining the sampling condition of each sample category according to the distance distribution range and/or the point cloud quantity threshold.
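A sketch of this screening step, assuming each sample carries a precomputed `distance` and `num_points` field (the field names are illustrative):

```python
def screen(samples, dist_range, min_points):
    """Keep only the target sample data that satisfies the sampling
    condition: distance inside the class's distance distribution range
    and point count at or above the point-cloud-number threshold."""
    lo, hi = dist_range
    return [s for s in samples
            if lo <= s["distance"] <= hi and s["num_points"] >= min_points]
```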
Further, when sampling each sample class based on the target sample data, the sampling module 340 is configured to:
determine the frame sampling number of the sample class in each frame of point cloud data according to the number of frames of the at least one frame of point cloud data;
and, for each frame of point cloud data, sample from the target sample data according to the frame sampling number corresponding to the sample class to obtain the sampled data of that class.
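One plausible way to split a class's total sampling number across the frames (the even-split-with-remainder policy is an assumption; the application only states that the per-frame number is determined from the frame count):

```python
def frame_quotas(total, num_frames):
    """Split a class's sampling number over the frames: every frame
    gets the floor share, and the remainder goes to the first frames
    so the quotas sum exactly to the total."""
    base, rem = divmod(total, num_frames)
    return [base + (1 if i < rem else 0) for i in range(num_frames)]
```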
Further, when generating a training data set based on the sample data obtained by sampling each sample class, the second generating module 350 is configured to:
add the sample data obtained after sampling each sample class into the frame point cloud data to obtain training frame data, the training data set comprising at least one piece of training frame data.
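The final pasting step could be as simple as a concatenation (a sketch under assumed field names; real ground-truth-sampling pipelines typically also reject pasted samples whose boxes collide with existing objects):

```python
def make_training_frame(frame, sampled):
    """Add the sampled objects' points and labels to one frame of
    point cloud data to obtain a training frame; the original frame
    is left unmodified."""
    out = {"points": list(frame["points"]), "labels": list(frame["labels"])}
    for s in sampled:
        out["points"].extend(s["points"])
        out["labels"].append(s["label"])
    return out
```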
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the application. As shown in fig. 3, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.
The memory 420 stores machine-readable instructions executable by the processor 410. When the electronic device 400 runs, the processor 410 communicates with the memory 420 through the bus 430, and when the machine-readable instructions are executed by the processor 410, the steps of the data enhancement method in the method embodiment shown in fig. 1 may be performed. For the specific implementation, refer to the method embodiment; it is not repeated here.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the data enhancement method in the method embodiment shown in fig. 1. For the specific implementation, refer to the method embodiment; it is not repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units is merely a logical function division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If implemented as software functional units and sold or used as a stand-alone product, the functions may be stored in a non-volatile, processor-executable computer-readable storage medium. On this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that the above embodiments are only specific implementations of the present application, intended to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, within the technical scope disclosed herein; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be covered by it. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method of data enhancement, the method comprising:
generating a sample data set based on at least one frame of point cloud data, wherein the sample data set comprises sample data corresponding to a plurality of sample categories;
carrying out statistical analysis on the sample data set, and determining sampling parameters of each sample category, wherein the sampling parameters comprise at least one of sampling conditions and sampling quantity;
for each sample category, screening target sample data conforming to the sampling parameters of the sample category from the sample data corresponding to the sample category;
sampling, for each sample class, the sample class based on the target sample data;
generating a training data set based on the sample data obtained by sampling each sample class.
2. The method of claim 1, wherein statistically analyzing the sample dataset to determine sampling parameters for each sample class comprises:
carrying out statistical analysis on the sample data set, and determining the sample number of each sample class and/or the characteristic distribution of each sample class on a data characteristic item;
determining the sampling number of each sample category according to the sample number of each sample category; and/or determining the sampling condition of each sample category according to the characteristic distribution of each sample category on the data characteristic item.
3. The method of claim 2, wherein determining the number of samples for each sample class based on the number of samples for each sample class comprises:
determining class balance reference quantity according to the number of the sample classes and the number of samples of each sample class;
determining the number grade of each sample class according to the sample number of each sample class and the class balance reference quantity;
and determining the sampling quantity of each sample class according to the sampling rule corresponding to the sample quantity and the quantity grade of each sample class.
4. The method of claim 3, wherein:
the number levels at least comprise a high frequency level, a medium frequency level and a low frequency level, and each number level corresponds to a sample number interval without overlapping range;
the sampling rule is as follows:
the sampling number corresponding to the high frequency level is equal to 0;
the sampling number corresponding to the intermediate frequency level enables the sampled sample number to reach the lower limit value of the sample number interval corresponding to the high frequency level;
and the sampling number corresponding to the low-frequency level enables the sampled sample number to reach the lower limit value of the sample number interval corresponding to the intermediate-frequency level.
5. The method of claim 2, wherein:
the data characteristic items comprise the number of point clouds corresponding to each sample and/or the distance distribution of each sample relative to the target position;
determining sampling conditions of each sample category according to the characteristic distribution of each sample category on the data characteristic item, wherein the sampling conditions comprise:
according to the characteristic distribution of each sample category on the data characteristic item, determining the distance distribution range and/or the point cloud quantity threshold value of the samples in each sample category;
and determining the sampling condition of each sample category according to the distance distribution range and/or the point cloud quantity threshold.
6. The method of claim 1, wherein for each sample class, sampling that sample class based on the target sample data comprises:
determining the corresponding frame sampling quantity of the sample class in each frame of point cloud data according to the frame number of the at least one frame of point cloud data;
and, for each frame of point cloud data, sampling from the target sample data according to the frame sampling number corresponding to the sample class to obtain the sampled data of the sample class.
7. The method of claim 6, wherein generating a training data set based on the sampled sample data for each sample class comprises:
and adding the sample data obtained after sampling each sample type into the frame point cloud data to obtain training frame data, wherein the training data set comprises at least one training frame data.
8. A data enhancement device, the device comprising:
the first generation module is used for generating a sample data set based on at least one frame of point cloud data, wherein the sample data set comprises sample data corresponding to a plurality of sample categories;
the analysis module is used for carrying out statistical analysis on the sample data set and determining sampling parameters of each sample category, wherein the sampling parameters comprise at least one of sampling conditions and sampling quantity;
the screening module is used for screening target sample data which accords with the sampling parameters of the sample category from the sample data corresponding to the sample category aiming at each sample category;
the sampling module is used for sampling each sample category based on the target sample data;
and the second generation module is used for generating a training data set based on the sample data obtained by sampling for each sample category.
9. An electronic device, comprising: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating via said bus when the electronic device is running, said machine readable instructions when executed by said processor performing the steps of a data enhancement method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of a data enhancement method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310928849.4A CN116644296B (en) | 2023-07-27 | 2023-07-27 | Data enhancement method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116644296A true CN116644296A (en) | 2023-08-25 |
CN116644296B CN116644296B (en) | 2023-10-03 |
Family
ID=87643760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310928849.4A Active CN116644296B (en) | 2023-07-27 | 2023-07-27 | Data enhancement method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116644296B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220114805A1 (en) * | 2021-12-22 | 2022-04-14 | Julio Fernando Jarquin Arroyo | Autonomous vehicle perception multimodal sensor data management |
CN114419018A (en) * | 2022-01-25 | 2022-04-29 | 重庆紫光华山智安科技有限公司 | Image sampling method, system, device and medium |
CN114529778A (en) * | 2021-12-22 | 2022-05-24 | 武汉万集光电技术有限公司 | Data enhancement method, device, equipment and storage medium |
CN114881096A (en) * | 2021-02-05 | 2022-08-09 | 华为技术有限公司 | Multi-label class balancing method and device |
CN115222858A (en) * | 2022-07-27 | 2022-10-21 | 上海硬通网络科技有限公司 | Method and equipment for training animation reconstruction network and image reconstruction and video reconstruction thereof |
CN115346041A (en) * | 2022-09-05 | 2022-11-15 | 北京云迹科技股份有限公司 | Point position marking method, device and equipment based on deep learning and storage medium |
CN115512391A (en) * | 2022-09-29 | 2022-12-23 | 珠海视熙科技有限公司 | Target detection model training method, device and equipment for data adaptive resampling |
2023-07-27: CN202310928849.4A granted as CN116644296B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN116644296B (en) | 2023-10-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107169768B (en) | Method and device for acquiring abnormal transaction data | |
CN113688042B (en) | Determination method and device of test scene, electronic equipment and readable storage medium | |
CN111460312A (en) | Method and device for identifying empty-shell enterprise and computer equipment | |
CN107483451B (en) | Method and system for processing network security data based on serial-parallel structure and social network | |
CN114387591A (en) | License plate recognition method, system, equipment and storage medium | |
CN111723815A (en) | Model training method, image processing method, device, computer system, and medium | |
US20220358747A1 (en) | Method and Generator for Generating Disturbed Input Data for a Neural Network | |
CN113313479A (en) | Payment service big data processing method and system based on artificial intelligence | |
CN112884121A (en) | Traffic identification method based on generation of confrontation deep convolutional network | |
CN111753592A (en) | Traffic sign recognition method, traffic sign recognition device, computer equipment and storage medium | |
CN103258123A (en) | Steganalysis method based on blindness of steganalysis systems | |
CN113297939B (en) | Obstacle detection method, obstacle detection system, terminal device and storage medium | |
CN111178153A (en) | Traffic sign detection method and system | |
CN116644296B (en) | Data enhancement method and device | |
CN114024761A (en) | Network threat data detection method and device, storage medium and electronic equipment | |
CN113765850B (en) | Internet of things abnormality detection method and device, computing equipment and computer storage medium | |
Hashemi et al. | Runtime monitoring for out-of-distribution detection in object detection neural networks | |
CN114520775B (en) | Application control method and device, electronic equipment and storage medium | |
CN114970694B (en) | Network security situation assessment method and model training method thereof | |
CN115689946A (en) | Image restoration method, electronic device and computer program product | |
CN117391214A (en) | Model training method and device and related equipment | |
CN114707566A (en) | Intelligent networking automobile abnormity intelligent detection method and device and readable storage medium | |
CN115037790A (en) | Abnormal registration identification method, device, equipment and storage medium | |
CN113902999A (en) | Tracking method, device, equipment and medium | |
CN113553953A (en) | Vehicle parabolic detection method and device, electronic device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||