CN117195285A - Automatic driving data processing method

Info

Publication number: CN117195285A
Application number: CN202311122849.1A
Authority: CN
Inventors: 王炳伟; 徐健; 徐辉; 李徐钰
Original assignee: Changzhou Xingyu Automotive Lighting Systems Co Ltd
Current assignee: Changzhou Xingyu Automotive Lighting Systems Co Ltd
Priority date: 2023-09-01
Filing date: 2023-09-01
Publication date: 2023-12-08

Abstract

The invention discloses a method for processing automatic driving data, comprising the following steps: S1, acquiring a real data set of automatic driving; S2, preprocessing the real data set to obtain real characteristic data; S3, constructing a data generation model and training it with the real characteristic data; S4, using the trained data generation model to synthesize a synthetic data set that retains the characteristics of the real data. The data generation model produces synthetic data that has the characteristics of the real data. The synthetic data contains only data related to automatic driving and no personal privacy data of the user, so it protects user privacy while remaining usable for research on automatic driving systems.

Description

Automatic driving data processing method

Technical Field

The invention relates to the technical field of automatic driving, in particular to a processing method of automatic driving data.

Background

With the rapid development of automatic driving technology, collecting and using large amounts of driving data has become key to improving the performance and safety of automatic driving systems. However, driving data may include sensitive user information, such as vehicle position, driving path, and personal characteristics, which creates a risk of privacy disclosure. In the prior art, data privacy is commonly protected by methods such as data desensitization, data encryption, data masking, access control, data partitioning, differential privacy, and privacy protection protocols. These methods can be reversible (for example, private information can be recovered through decryption) and can lose part of the useful information in the data, which hampers the analysis of automatic driving performance. A balance therefore needs to be found between absolute privacy protection and practical data utility.

Disclosure of Invention

The technical problem to be solved by the invention is how to preserve the usability of data while protecting user privacy. To this end, the invention provides a method for processing automatic driving data that generates anonymous data to replace real data, effectively protecting user privacy while retaining the characteristics of the real data.

The technical solution adopted by the invention is a method for processing automatic driving data, comprising: S1, acquiring a real data set of automatic driving; S2, preprocessing the real data set to obtain real characteristic data; S3, constructing a data generation model and training it with the real characteristic data; S4, using the trained data generation model to synthesize a synthetic data set that retains the characteristics of the real data.

Further, the preprocessing of the real data set includes:

performing data cleaning, data screening, data encoding and data normalization on the real data set.

Further, the data encoding is: converting the text data in the real data into numerical values; the data normalization is: normalizing the numerical values in the real data.

Further, constructing a data generation model, including:

establishing an optimization objective function $\min_G \max_D V(D,G)$;

selecting an activation function and a loss function;

optimizing model parameters.

Further, the data generation model includes a generation network and a discrimination network, wherein the generation network is used for generating synthetic data, and the discrimination network is used for judging the authenticity of the generated synthetic data; the generation network comprises three hidden layers with 150, 200 and 150 nodes respectively; the discrimination network comprises two hidden layers with 200 nodes each.

Further, training the data generation model includes:

inputting the real characteristic data into the generation network, which outputs a group of synthetic data; the discrimination network judges the synthetic data and outputs a probability value; the discrimination network feeds the probability value back to the generation network;

if the probability value does not reach the set threshold, the generation network regenerates new synthetic data; the above process is repeated until the probability value output by the discrimination network reaches the set threshold.

Further, the data encoding includes:

dividing the real data set into M categories according to different characteristic items;

for each category, encoding N states with an N-bit state register, wherein if the category contains n features, then N = n, the state register has N bits, and each bit is represented by 0 or 1;

after each category is encoded, the M categories are combined.

Further, the data normalization uses min-max normalization to scale the real data to the range [-1, 1].

Further, the optimization objective function is:

$$\min_G \max_D V(D,G) = E_x[\log D(x \mid y)] + E_z[\log(1 - D(G(z \mid y)))]$$

wherein x represents real data, z represents noise data, $E_x$ represents the expectation over the real data, $E_z$ represents the expectation over the noise data, D represents the discrimination network, G represents the generation network, and y represents the discrimination condition.

Further, the method further comprises the following steps: s5, verifying and evaluating the synthesized data.

The beneficial effect of the invention is that synthetic data with the characteristics of real data is generated by the data generation model. The synthetic data contains only data related to automatic driving and no personal privacy data of the user, so it can be used for research on automatic driving systems while protecting user privacy.

Drawings

The invention will be further described with reference to the drawings and examples.

Fig. 1 is a flow chart of a processing method of the present invention.

Fig. 2 is a graph showing the change in loss values of the discrimination network and the generation network in the training process of the present invention.

Fig. 3 is a schematic diagram of the distribution of the synthesized data (black) and the real data (gray) of the present invention.

Fig. 4 is a graph comparing average accuracy of machine learning models of the present invention.

Detailed Description

The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.

In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc. indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be configured and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Furthermore, features defined as "first" or "second" may include one or more such features, either explicitly or implicitly. In the description of the present invention, unless otherwise indicated, "a plurality" means two or more.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted" and "connected" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be communication between the interiors of two elements. The specific meaning of the above terms in the present invention will be understood by those of ordinary skill in the art according to the specific case.

As shown in Fig. 1, the method for processing automatic driving data according to the present invention includes: S1, acquiring a real data set of automatic driving; S2, preprocessing the real data set to obtain real characteristic data; S3, constructing a data generation model and training it with the real characteristic data; S4, using the trained data generation model to synthesize a synthetic data set that retains the characteristics of the real data.

In other words, the invention generates synthetic data (that is, anonymous data) based on the real data. Data related to user privacy is hidden in the synthetic data, while data related to automatic driving is retained, so the synthetic data can be used to study the performance of automatic driving systems.

It should be noted that, after the real data of automatic driving is obtained, it needs to be preprocessed. The preprocessing includes performing data cleaning, data screening, data encoding and data normalization on the real data set.

Data cleaning comprises deleting missing values, invalid values, and data inconsistent with the expected data type from the real data. In this way, the cleaned real data can be used correctly in subsequent processing.

Data screening comprises selecting the characteristic data related to automatic driving from the cleaned real data set. Automatic driving relies on the cooperation of artificial intelligence, visual computing, radar, monitoring devices and a positioning system, so that the vehicle can run automatically and safely without any active human operation. The characteristic data related to automatic driving therefore includes, for example, semantic segmentation data, lane line detection data, radar data, driving behavior data, and positioning data. However, the obtained real data set may also contain data unrelated to automatic driving, such as passenger voice data and temperature adjustment data. To ensure the validity of the subsequent synthetic data and the efficiency of data processing, the real data set is screened during preprocessing.

Data encoding and normalization can then be carried out on the screened real data set. Data encoding converts the text data in the real data into numerical values. Data normalization normalizes the numerical values in the real data.
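The cleaning step can be illustrated with a minimal sketch. The field names and the `SCHEMA` mapping are hypothetical and chosen only for illustration; the source does not specify a record layout.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type (not from the source).
SCHEMA = {"speed_kmh": float, "steering_deg": float, "behavior": str}


def is_valid(record: dict[str, Any]) -> bool:
    """Return True if every schema field is present, non-empty and of the expected type."""
    for field, expected_type in SCHEMA.items():
        value = record.get(field)
        if value is None or value == "":
            return False  # missing value
        if not isinstance(value, expected_type):
            return False  # inconsistent with the expected data type
        if isinstance(value, float) and value != value:
            return False  # NaN is treated as an invalid value
    return True


def clean(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the records that pass all checks."""
    return [r for r in records if is_valid(r)]


raw = [
    {"speed_kmh": 42.0, "steering_deg": -3.5, "behavior": "straight"},
    {"speed_kmh": None, "steering_deg": 1.0, "behavior": "left turn"},           # missing
    {"speed_kmh": "fast", "steering_deg": 0.0, "behavior": "straight"},          # wrong type
    {"speed_kmh": float("nan"), "steering_deg": 0.0, "behavior": "right turn"},  # invalid
]
cleaned = clean(raw)
print(len(cleaned))
```

Only the first record survives here, because each of the other three fails one of the three checks named in the text.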
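Screening can be pictured as keeping a whitelist of driving-related fields. The field names below are hypothetical, since the text names only categories of data (radar, positioning, driving behavior, and so on).

```python
# Hypothetical field names; the source only names categories of data.
DRIVING_FIELDS = {"lane_offset_m", "radar_distance_m", "behavior", "gps_lat", "gps_lon"}


def screen(record: dict) -> dict:
    """Keep only the fields related to automatic driving."""
    return {key: value for key, value in record.items() if key in DRIVING_FIELDS}


record = {
    "lane_offset_m": 0.12,
    "radar_distance_m": 35.0,
    "behavior": "straight",
    "cabin_temperature_c": 22.5,      # temperature adjustment data: dropped
    "passenger_voice_clip": "a.wav",  # passenger voice data: dropped
}
screened = screen(record)
print(sorted(screened))
```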

Since the data generation model cannot recognize text information, data encoding is required. The data encoding includes: dividing the real data set into M categories according to different characteristic items; for each category, encoding N states with an N-bit state register, wherein if the category contains n features, then N = n, the state register has N bits, and each bit is represented by 0 or 1; and, after each category is encoded, combining the M categories. For example, assume that the real data set includes two categories: driving behavior data, with the three features "straight", "left turn" and "right turn", and radar data, with the two features "distance" and "azimuth". For the driving behavior data, n = 3, so the state register has 3 bits; after encoding, "straight" is denoted by 001, "left turn" by 010, and "right turn" by 100. For the radar data, n = 2, so the state register has 2 bits; after encoding, "distance" is denoted by 01 and "azimuth" by 10. Combining the two categories yields the following sample data: [straight, distance], [straight, azimuth], [left turn, distance], [left turn, azimuth], [right turn, distance], [right turn, azimuth]. The corresponding codes are [0,0,1,0,1], [0,0,1,1,0], [0,1,0,0,1], [0,1,0,1,0], [1,0,0,0,1], and [1,0,0,1,0], respectively. That is, after encoding, the text features are converted into numerical values.
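The worked example above can be reproduced with a short sketch. The categories, features and bit order are taken from the text; the function and variable names are illustrative only.

```python
from itertools import product

# Categories and features taken from the example in the text.
CATEGORIES = {
    "driving_behavior": ["straight", "left turn", "right turn"],
    "radar": ["distance", "azimuth"],
}


def one_hot(features: list[str], feature: str) -> list[int]:
    """Encode `feature` with an N-bit register, where N = len(features).

    The first feature sets the rightmost bit, matching the text
    ("straight" -> 001, "left turn" -> 010, "right turn" -> 100).
    """
    n = len(features)
    bits = [0] * n
    bits[n - 1 - features.index(feature)] = 1
    return bits


def encode_sample(sample: dict[str, str]) -> list[int]:
    """Concatenate the per-category codes in category order."""
    code: list[int] = []
    for category, features in CATEGORIES.items():
        code += one_hot(features, sample[category])
    return code


# Every combination of one feature per category, in the order used in the text.
samples = [dict(zip(CATEGORIES, combo)) for combo in product(*CATEGORIES.values())]
codes = [encode_sample(s) for s in samples]
for s, c in zip(samples, codes):
    print(list(s.values()), c)
```

The printed codes follow the same order as the six samples listed in the text.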

In the invention, data normalization uses min-max normalization to scale the real data to the range [-1, 1]. The min-max normalization formula is:

$$x' = 2 \cdot \frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1$$

where $x$ represents the true value, $x'$ represents the normalized value, $x_{\min}$ represents the minimum value of the feature, and $x_{\max}$ represents the maximum value of the feature. Since the numerical range accepted by the data generation model constructed here is [-1, 1], the real data must also be normalized to [-1, 1].
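A minimal sketch of min-max normalization to [-1, 1] for one feature column follows. The handling of a constant column is an assumption added to avoid division by zero; the text does not address that case.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale one feature column linearly so that min -> -1 and max -> 1."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [2.0 * (x - x_min) / (x_max - x_min) - 1.0 for x in values]


speeds = [0.0, 30.0, 60.0, 120.0]  # hypothetical speed values
normalized = min_max_normalize(speeds)
print(normalized)  # [-1.0, -0.5, 0.0, 1.0]
```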

Specifically, constructing a data generation model includes: establishing an optimization objective function $\min_G \max_D V(D,G)$; selecting an activation function and a loss function; and optimizing model parameters. The data generation model comprises a generation network and a discrimination network, wherein the generation network is used for generating the synthetic data, and the discrimination network is used for judging the authenticity of the generated synthetic data. The optimization objective function is:

$$\min_G \max_D V(D,G) = E_x[\log D(x \mid y)] + E_z[\log(1 - D(G(z \mid y)))]$$

wherein x represents real data, z represents noise data, $E_x$ represents the expectation over the real data, $E_z$ represents the expectation over the noise data, D represents the discrimination network, G represents the generation network, and y represents the discrimination condition. The noise data z follows a Gaussian distribution and is randomly generated.

In the data generation model, the discrimination network D tries to increase V while the generation network G tries to decrease V, so the two networks oppose each other. The loss is typically computed from the output of the discrimination network D, which classifies its input as true or false. The whole optimization objective function can be split into two parts:
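A common way to write this split for a conditional adversarial model, using only the symbols defined above, is sketched below. It is an assumed form given for illustration: the discrimination network maximizes both terms, while only the second term depends on the generation network.

```latex
\max_D V(D,G) = E_x\left[\log D(x \mid y)\right] + E_z\left[\log\left(1 - D(G(z \mid y))\right)\right]

\min_G V(D,G) = E_z\left[\log\left(1 - D(G(z \mid y))\right)\right]
```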

In the present invention, since the discrimination network D outputs true or false, sigmoid is selected as its activation function and Binary_Crossentropy (BCE) as its loss function. The input to the generation network G is the feature data; LeakyReLU is selected as its activation function and Binary_Crossentropy (BCE) as its loss function. LeakyReLU keeps a small gradient when x is less than 0 during model training, instead of setting the output directly to 0 as the ReLU function does. When optimizing the model parameters, initial values can be set first, and the parameters are then continuously optimized during training. For example, training the data generation model includes: inputting the real characteristic data into the generation network, which outputs a group of synthetic data; the discrimination network judges the synthetic data and outputs a probability value; the discrimination network feeds the probability value back to the generation network; if the probability value does not reach the set threshold, the generation network regenerates new synthetic data; and the process is repeated until the probability value output by the discrimination network reaches the set threshold. The key to training is that the generation network G and the discrimination network D are updated alternately in a loop. The optimized parameters of the model are as follows: the generation network comprises three hidden layers with 150, 200 and 150 nodes respectively; the discrimination network comprises two hidden layers with 200 nodes each; the optimizer is Adam, the learning rate is 0.0002, the batch size is 512, and the noise dimension is 99. A learning rate that is too large is likely to overshoot the optimal value, while one that is too small leads to low optimization efficiency and excessively long running time.
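The activation and loss functions named above can be written out in a few lines to show how they differ. This is a sketch in plain Python rather than the framework used by the inventors, which the text does not name. The negative slope of 0.2 is an assumption, and the `HYPERPARAMS` dictionary simply collects the values quoted in the text.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a real number into (0, 1); used as the discrimination network output."""
    return 1.0 / (1.0 + math.exp(-x))


def leaky_relu(x: float, negative_slope: float = 0.2) -> float:
    """Pass positive inputs through; scale negative inputs by a small slope.

    The slope 0.2 is an assumption; the text does not state a value.
    """
    return x if x >= 0 else negative_slope * x


def relu(x: float) -> float:
    """Plain ReLU for comparison: negative inputs become exactly 0."""
    return max(0.0, x)


def binary_cross_entropy(p: float, target: float, eps: float = 1e-12) -> float:
    """BCE for one prediction p in (0, 1) against a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Hyperparameters quoted in the text.
HYPERPARAMS = {
    "generator_hidden": [150, 200, 150],
    "discriminator_hidden": [200, 200],
    "optimizer": "Adam",
    "learning_rate": 0.0002,
    "batch_size": 512,
    "noise_dim": 99,
}

print(leaky_relu(-1.0), relu(-1.0))  # the negative input survives only in LeakyReLU
print(sigmoid(0.0))
```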

The purpose of the optimization objective function is to produce a distribution that is highly similar to the distribution of the real data. This is achieved through a minimax game between the generation network G and the discrimination network D. The discrimination network aims to learn to distinguish true samples from false ones, while the generation network learns and improves at generating false samples to deceive the discrimination network, until a Nash equilibrium is reached between the two networks. In this way, the data generation model can simulate the real data distribution and generate credible synthetic data. The optimization objective function of the invention also carries a discrimination condition y: generated data passes the discrimination network only when it is realistic enough and consistent with the condition y. The discrimination condition y can be set to, for example, whether the vehicle is accelerating during automatic driving or the gender of the vehicle owner. Setting the discrimination condition y converts unsupervised learning into supervised learning, so that the network learns better under control.

The relationship between the generation network and the discrimination network can be illustrated with the example of a counterfeiter and the police. The counterfeiter corresponds to the generation network, and the police correspond to the discrimination network. The counterfeiter makes counterfeit money modeled on genuine money, and the police then judge whether it is genuine. At first, the police recognize the counterfeit money immediately because the counterfeiter's skill is low. After failing, the counterfeiter improves the technique and produces more lifelike counterfeit money. At the same time, the police's ability to identify counterfeit money also improves, so the process is adversarial. In the end, the counterfeit money is so realistic that the police cannot tell genuine from fake at all, and the probability that the police guess correctly becomes 0.5. At this point the counterfeit money is a faithful imitation of real money, and the counterfeiter has mastered the various characteristics of the money. The generation network and the discrimination network go through the same adversarial process and eventually reach a balance point. When the probability value output by the discrimination network reaches 0.5, the training is considered to have achieved the expected effect.

During training, the effectiveness of the training can be assessed by the magnitude of the loss function. As shown in Fig. 2, as the number of training epochs increases (one epoch is one pass of the algorithm over the entire data set), the loss functions of the generation network and the discrimination network fluctuate continuously and eventually tend to 0.5. A better data generation model can thus be obtained.

In addition, the training effect of the model can be judged by comparing the distributions of the real data and the synthetic data. Since both the real data and the synthetic data are high-dimensional, their dimensionality is reduced by PCA (principal component analysis), after which they are visualized in three dimensions. As can be seen from Fig. 3, the distribution of the synthetic data (black) is substantially the same as that of the real data (gray).
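One common way to supply a condition y to a generation network is to concatenate it with the noise vector z. The sketch below illustrates that idea. Concatenation and the one-element condition vector are assumptions, since the text does not say how y is fed to the networks. The noise dimension of 99 and the Gaussian noise come from the text.

```python
import random

NOISE_DIM = 99  # noise dimension quoted in the text


def generator_input(condition: list[float], rng: random.Random) -> list[float]:
    """Concatenate Gaussian noise z with the condition vector y.

    Concatenation is an assumed (though common) way to feed a condition
    to a network; the text does not say how y is supplied.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return z + condition


rng = random.Random(0)   # fixed seed so the sketch is repeatable
y_accelerating = [1.0]   # hypothetical 1-element condition: 1 = accelerating
g_in = generator_input(y_accelerating, rng)
print(len(g_in))
```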
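The stopping rule described here can be sketched as a simple check on the discrimination network's outputs. The tolerance and the use of a batch mean are assumptions; the text only states that training is considered successful when the output probability reaches 0.5.

```python
def reached_equilibrium(
    probabilities: list[float], target: float = 0.5, tolerance: float = 0.05
) -> bool:
    """Return True when the mean discriminator output is within `tolerance` of `target`."""
    mean_p = sum(probabilities) / len(probabilities)
    return abs(mean_p - target) <= tolerance


# Hypothetical discriminator outputs on synthetic samples, early and late in training.
early = [0.95, 0.90, 0.97]  # far from the target of 0.5
late = [0.48, 0.52, 0.51]   # the discriminator can no longer separate the two sources
print(reached_equilibrium(early), reached_equilibrium(late))
```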

The method further comprises: S5, verifying and evaluating the synthetic data. The trained data generation model can output synthetic data close to the real data. To verify the usability of the synthetic data, the invention uses the prediction performance of machine learning models as the criterion. The real data and the synthetic data are taken as two samples, and each sample is split into a training set and a test set in a 2:1 ratio. The machine learning models (ISF, RF, KNN) are trained on the training set and tested on the test set; the prediction accuracy of the machine learning models is shown in Table 1. As Table 1 shows, the prediction accuracy is at substantially the same level whether the real data or the synthetic data is used. The synthetic data generated by the data generation model therefore has essentially the same characteristics as the real data and can be used for research and analysis of automatic driving.
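The 2:1 split can be sketched as follows. Shuffling before the split and the fixed seed are assumptions; the text states only the ratio.

```python
import random


def split_two_to_one(samples: list, seed: int = 0) -> tuple[list, list]:
    """Shuffle and split samples into training and test sets in a 2:1 ratio."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


data = list(range(300))  # stand-in for 300 real or synthetic records
train_set, test_set = split_two_to_one(data)
print(len(train_set), len(test_set))  # 200 100
```

The same function would be applied once to the real sample and once to the synthetic sample, so that each model is trained and tested on data from a single source.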

TABLE 1

| Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| ori-ISF | 67.56% | 67.48% | 67.39% | 68.10% | 68.58% | 67.76% | 67.61% | 67.21% | 67.76% | |
| ori-RF | 67.56% | 68.01% | 67.43% | 68.20% | 67.72% | 68.01% | 68.20% | 69.20% | 67.30% | 68.17% |
| ori-KNN | 67.10% | 67.90% | 68.20% | 68.01% | 67.03% | 69.11% | 68.31% | 69.23% | 69.10% | 67.58% |
| syn-ISF | 69.21% | 67.80% | 69.40% | 68.50% | 67.64% | 74.30% | 71.20% | 67.92% | 67.90% | 68.30% |
| syn-RF | 67.95% | 67.89% | 68.98% | 68.21% | 68.23% | 68.90% | 67.99% | 68.21% | 68.01% | 69.35% |
| syn-KNN | 69.21% | 69.13% | 69.58% | 69.21% | 71.11% | 68.21% | 69.90% | 68.81% | 68.31% | 68.47% |

The averages of the ten predictions in Table 1 were calculated. As shown in Fig. 4, the average accuracy of syn-KNN is 69.19%, that of ori-KNN is 68.26%, that of syn-RF is 68.37%, that of ori-RF is 67.98%, that of syn-ISF is 69.22%, and that of ori-ISF is 67.62%. The synthetic data slightly improves the prediction accuracy, and the verification result shows that the synthetic data output by the data generation model essentially has the characteristics of the real data and can replace the real data in research on automatic driving.
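The averaging step can be reproduced from the per-run values in Table 1. The sketch below covers the five rows for which all ten values are listed; ori-ISF is left out because only nine of its values appear in the table. The computed means can then be compared with the figures quoted above.

```python
# Per-run accuracies (%) copied from Table 1. ori-ISF is omitted because
# only nine of its ten values are listed.
TABLE_1 = {
    "ori-RF": [67.56, 68.01, 67.43, 68.20, 67.72, 68.01, 68.20, 69.20, 67.30, 68.17],
    "ori-KNN": [67.10, 67.90, 68.20, 68.01, 67.03, 69.11, 68.31, 69.23, 69.10, 67.58],
    "syn-ISF": [69.21, 67.80, 69.40, 68.50, 67.64, 74.30, 71.20, 67.92, 67.90, 68.30],
    "syn-RF": [67.95, 67.89, 68.98, 68.21, 68.23, 68.90, 67.99, 68.21, 68.01, 69.35],
    "syn-KNN": [69.21, 69.13, 69.58, 69.21, 71.11, 68.21, 69.90, 68.81, 68.31, 68.47],
}


def mean_accuracy(values: list[float]) -> float:
    """Arithmetic mean of the per-run accuracies, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


averages = {name: mean_accuracy(values) for name, values in TABLE_1.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}%")
```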

In summary, the method has at least the following advantages:

(1) Data security: the synthetic data generated by the method (e.g., automobile sensor data such as image, radar, and lidar data) can be substituted for the real data. The synthetic data can be used to train an automatic driving system without involving the personal privacy information of real users. In contrast, conventional privacy protection methods may desensitize real data, but the desensitized data may lose useful information, which affects the performance of the automatic driving system.

(2) Data sharing: the synthetic data generated by the method can be shared with multiple research institutions or vehicle manufacturers to advance the development of automatic driving technology. The real sensitive data therefore does not need to be shared, and the potential risk of privacy disclosure is avoided. Conventional privacy protection methods may limit the sharing of data, reducing the efficiency of collaboration and research.

(3) Data richness: the method can generate diversified synthetic data, including various traffic scenes and driving behaviors. This helps to increase the robustness and generalization capability of the autopilot system. Traditional privacy protection methods often only desensitize the data, but fail to increase the diversity of the data.

Taking the above preferred embodiments of the invention as a guide, persons skilled in the relevant art can make various changes and modifications on the basis of the above description without departing from the technical idea of the invention. The technical scope of the invention is not limited to the content of the description and must be determined according to the scope of the claims.

Claims

1. A method of processing autopilot data, comprising:

S1, acquiring a real data set of automatic driving;

S2, preprocessing the real data set to obtain real characteristic data;

S3, constructing a data generation model, and training the data generation model with the real characteristic data;

S4, using the trained data generation model to synthesize a synthetic data set that retains the characteristics of the real data.

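2. The method for processing automated driving data according to claim 1, wherein,

the preprocessing of the real data set comprises: performing data cleaning, data screening, data encoding and data normalization on the real data set.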

3. The method for processing automated driving data according to claim 2, wherein,

the data encoding is: converting the text data in the real data into numerical values;

the data normalization is: normalizing the numerical values in the real data.

4. The method for processing automated driving data according to claim 1, wherein,

constructing a data generation model, comprising:

establishing an optimization objective function $\min_G \max_D V(D,G)$;

selecting an activation function and a loss function;

optimizing model parameters.

5. The method for processing automated driving data according to claim 4, wherein,

the data generation model includes a generation network and a discrimination network, wherein the generation network is used for generating synthetic data, and the discrimination network is used for judging the authenticity of the generated synthetic data;

the generation network comprises three hidden layers with 150, 200 and 150 nodes respectively; the discrimination network comprises two hidden layers with 200 nodes each.

6. The method for processing automated driving data according to claim 5, wherein,

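training the data generation model includes:

inputting the real characteristic data into the generation network, which outputs a group of synthetic data; the discrimination network judges the synthetic data and outputs a probability value; the discrimination network feeds the probability value back to the generation network;

if the probability value does not reach the set threshold, the generation network regenerates new synthetic data; the above process is repeated until the probability value output by the discrimination network reaches the set threshold.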

7. The method for processing automated driving data according to claim 3, wherein,

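the data encoding includes:

dividing the real data set into M categories according to different characteristic items;

for each category, encoding N states with an N-bit state register, wherein if the category contains n features, then N = n, the state register has N bits, and each bit is represented by 0 or 1;

after each category is encoded, combining the M categories.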

8. The method for processing automated driving data according to claim 3, wherein,

the data normalization uses min-max normalization to scale the real data to the range [-1, 1].

9. The method for processing automated driving data according to claim 4, wherein the optimization objective function is:

$$\min_G \max_D V(D,G) = E_x[\log D(x \mid y)] + E_z[\log(1 - D(G(z \mid y)))]$$

wherein x represents real data, z represents noise data, $E_x$ represents the expectation over the real data, $E_z$ represents the expectation over the noise data, D represents the discrimination network, G represents the generation network, and y represents the discrimination condition.

10. The method for processing automatic driving data according to claim 1, further comprising: s5, verifying and evaluating the synthesized data.