WO2019171120A1

WO2019171120A1 - Method for controlling driving vehicle and method and device for inferring mislabeled data

Info

Publication number: WO2019171120A1
Application number: PCT/IB2018/051392
Authority: WO
Inventors: Hirotaka Wada; Yasuyo Kotake
Original assignee: Omron Corporation
Priority date: 2018-03-05
Filing date: 2018-03-05
Publication date: 2019-09-12

Abstract

The disclosure provides a method for controlling driving of a vehicle and a method and device for inferring mislabeled data. The method includes that: all data or part of the data in a driving information data set related to driving information of the vehicle is divided into a plurality of data subsets; a plurality of corresponding driving control models are generated on the basis of the plurality of data subsets respectively; each of the plurality of driving control models is evaluated by using the data subsets in the plurality of data subsets, and the data subset including mislabeled data in the plurality of data subsets is inferred according to generated evaluation results; the driving information data set is processed on the basis of the inferred data subset including the mislabeled data; and driving of the vehicle is controlled on the basis of the processed driving information data set.

Description

Method for Controlling Driving Vehicle and Method and Device for

Inferring Mislabeled Data

Technical Field

The disclosure relates to the field of automobiles, and more particularly to a method for controlling driving a vehicle and a method and device for inferring mislabeled data in a data set.

Background

When driving a vehicle, various kinds of driving information data probably influencing the driving of the vehicle may usually be acquired from a surrounding of the vehicle, the driving information data indicating information for driving the vehicle. However, driving information data which should be labeled correctly may be mislabeled sometimes, for example, a setting error made by a checking device or an input error made by a person. For obtaining a high-accuracy control result, a large amount of driving information data is usually required, but if there is a large amount of the driving information data, it is more difficult to find out the data mislabeled therein.

In the prior art, driving information data is usually checked to find out the data mislabeled manually or by an automated tool. However, there is still a problem in the prior art that the data mislabeled cannot be calculated accurately and rapidly.

Summary

^• The technical problem to be solved

A method for controlling driving a vehicle is provided according to an embodiment of the disclosure, so as to at least solve the problem in the prior that the mislabeled data cannot be calculated accurately and rapidly.

^• Means for solving the technical problem

According to an aspect of an embodiment of the disclosure, a method for controlling driving a vehicle is provided, which includes that all data or part of the data in a driving information data set related to driving information of the vehicle is divided into a plurality of data subsets, wherein each piece of data in the driving information data set is pre-pasted with a label; a plurality of corresponding driving control models are generated on the basis of the plurality of data subsets respectively; each of the plurality of driving control models is evaluated by using the data subsets, which are not used to generate the driving control model currently to be evaluated, in the plurality of data subsets, and the data subset including mislabeled data in the plurality of data subsets is inferred according to generated evaluation results; the driving information data set is processed on the basis of the inferred data subset including the mislabeled data; and the driving of the vehicle is controlled on the basis of the processed driving information data set.

By the foregoing method, all the data or part of the data in the data set is divided into N data subsets, N corresponding driving control models are generated on the basis of the N data subsets respectively, and each driving control model in the N driving control models is evaluated by using the data subsets which are not used to generate the driving control model currently to be evaluated, so that the data subset including mislabeled data in the N data subsets may be inferred more accurately and rapidly.

According to another aspect of the embodiment of the disclosure, a method for inferring mislabeled data in a data set is further provided, which includes that: all data or part of the data in the data set is divided into a plurality of data subsets, wherein each piece of data in the data set is pre-pasted with a label; a plurality of corresponding machine learning models are generated on the basis of the plurality of data subsets respectively; and each machine learning model in the plurality of machine learning models is evaluated by using the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets, and the data subset including the mislabeled data in the plurality of data subsets is inferred according to generated evaluation results.

In an exemplary embodiment, evaluating each of the plurality of machine learning models by using the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets comprises: taking each of the data subsets, which are not used to generate a machine learning model currently to be evaluated, in the plurality of data subsets as an evaluation data subset; inputting the data in the evaluation data subset into each machine learning model in the plurality of machine learning models to obtain detection labels corresponding to the input data; comparing the pre-pasted labels of the input data with the detection labels to determine types of the input data; and making statistics to the types of the input data, and determining the evaluation results of each of the plurality of machine learning models with respect to the evaluation data subset according to statistical results.

By the foregoing steps, the data in each data subset (except the data subset used to generate the driving control model currently to be evaluated) is input into the driving control models respectively to obtain the detection labels of the data, and the detection labels are compared with the pre-pasted labels, so that the evaluation results of each driving control model may be calculated for each data subset more easily.

In an exemplary embodiment, the types of the input data comprise: data of which the pre-pasted label is a first label and the detection label is also the first label; data of which the pre-pasted label is a second label and the detection label is also the second label; data of which the pre-pasted label is the second label and the detection label is the first label; and data of which the pre-pasted label is the first label and the detection label is the second label, wherein the first label and the second label have opposite meanings.

By the foregoing steps, after the detection labels are determined, the data is classified according to the determined detection labels and the pre-pasted labels, and then statistics may be made to amounts of data of different types more rapidly, so that the evaluation results may be calculated more rapidly according to the amounts of the data of different types.

In an exemplary embodiment, making the statistics to the types of the input data and determining the evaluation results of each of the plurality of machine learning models with respect to the evaluation data subset according to the statistical results comprises at least one of the following operations:

under the condition that the evaluation results are false negative rates (excessive detection rates), for each of the plurality of machine learning models, calculating, in the evaluation data subset, a ratio of an amount of data of which the pre-pasted labels are the first labels but the detection labels are the second labels to an amount of data of which the pre-pasted labels are the first labels, and taking the calculated ratio as a false negative rate of the current machine learning model;

under the condition that the evaluation results are false positive rates (omission rates) for each of the plurality of machine learning models, calculating, in the evaluation data subset, a ratio of an amount of data of which the pre-pasted labels are the second labels but the detection labels are the first labels to an amount of data of which the pre-pasted labels are the second labels in the evaluation data subsets, and taking the calculated ratio as a false positive rate of the current machine learning model; and

under the condition that the evaluation results are mislabeling rates, for each of the plurality of machine learning models, calculating, in the evaluation data subset, a ratio of an amount of data of which the pre-pasted labels and detection labels are inconsistent in the evaluation data subsets to a total amount of the data, and determining the calculated ratio as a mislabeling rate of the current driving control model.

By the foregoing steps, under the condition that the evaluation results are false negative rates, false positive rates or mislabeling rates, the false negative rates, the false positive rates and the mislabeling rates may be calculated by adopting different calculation formulae according to the amounts of the data of different types, so that the data subset including the mislabeled data may be accurately inferred according to the false negative rates, the false positive rates or the mislabeling rates.

In an exemplary embodiment, the false negative rate is a rate of excessive detection, and the false positive rate is a rate of omission.

In an exemplary embodiment, inferring the data subset containing the mislabeled data in the plurality of data subsets according to the generated evaluation results comprises: inferring the data subset containing the mislabeled data in the plurality of data subsets according to a distribution characteristic of evaluation results, which each is more than a first threshold value, in the generated evaluation results.

By the foregoing steps, after the evaluation result of each machine training model with respect to each data subset is calculated, whether the data subset corresponding to the driving control model includes the mislabeled data or the data subset used to evaluate the driving control model includes the mislabeled data may be inferred according to the distribution characteristic of the evaluation results.

According to another aspect of the embodiment of the disclosure, a device for inferring mislabeled data in a data set is further provided, which includes: a data division portion configured to divide all data or part of the data in the data set into a plurality of data subsets, wherein each piece of data in the data set is pre-pasted with a label; a model generating portion configured to generate a plurality of corresponding machine learning models on the basis of the plurality of data subsets respectively; and an inferring portion configured to evaluate each machine learning model in the plurality of machine learning models by using the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets, and infer the data subset including the mislabeled data in the plurality of data subsets according to generated evaluation results.

By the foregoing device, the data set may be divided into the plurality of data subsets, the corresponding machine learning models are generated by using the plurality of data subsets respectively, and each machine learning model is evaluated by using the other data subsets except the data subset used to generate the currently evaluated driving control model, so that the data subset including the mislabeled data may be inferred more accurately according to the evaluation results.

According to another aspect of the embodiment of the disclosure, a computer program is further provided, which is executed by a processor to implement the method of the foregoing technical solution.

According to another aspect of the embodiment of the disclosure, a computer-readable storage medium is further provided, in which a computer program is stored, the computer program being executed by a processor to implement the method of the foregoing technical solution.

^• Technical effect

According to the method provided by the embodiment of the disclosure for controlling driving the vehicle, the driving information data set is divided into the plurality of data subsets, the corresponding driving control models are generated, each driving control model is evaluated by using the other data subsets except the data subset used to generate the driving control model currently to be evaluated, then the data subset including the mislabeled data is inferred by using the obtained evaluation results, and finally the driving information data set is processed to control the driving of the vehicle on the basis of the inferred data subset including the mislabeled data, so that the problem of incapability of controlling the driving of the vehicle on time is solved, and a beneficial effect of controlling driving the vehicle more accurately is achieved.

According to the method provided by the embodiment of the disclosure for controlling driving the vehicle, the data set is divided into the plurality of data subsets, the corresponding machine training models are generated, each machine training model is evaluated by using the other data subsets except the data subset used to generate the machine training model currently to be evaluated, and then the data subset including the mislabeled data is inferred by using the obtained evaluation results, so that the problem of incapability of accurately and rapidly recognizing the mislabeled data when the mislabeled data in the data set is inferred is solved, and a beneficial effects of inferring the data subset including the mislabeled data more accurately and more rapidly is achieved.

In addition, by using the method or device for inferring the mislabeled data in the data set, the data subset including the mislabeled data may be inferred rapidly and accurately, the mislabeled data may finally be inferred by iteration, and the driving control models may be retrained by using the data set from which the mislabeled data is removed or updated, so that the machine training models generated by using training data are optimized by improving quality of the training data.

Brief Description of the Drawings

The drawings described herein are used to provide a further understanding of the disclosure and constitute a part of the present application. The schematic embodiments of the disclosure and the descriptions thereof are used to explain the disclosure, and do not constitute improper limitations to the disclosure. In the drawings:

Fig. 1 A is an exemplary flowchart of a method for controlling driving a vehicle according to an embodiment of the disclosure;

Fig. 1 B is an exemplary flowchart of a method for inferring mislabeled data in a data set according to an embodiment of the disclosure;

Fig. 2 is an exemplary flowchart of another method for inferring mislabeled data in a data set according to an embodiment of the disclosure;

Fig. 3 is an exemplary schematic diagram of dividing a data set into a plurality of data subsets according to an embodiment of the disclosure;

Fig. 4 is an exemplary flowchart of a method for inferring reliability of training data according to an embodiment of the disclosure;

Fig. 5 is a schematic view of an example of an information processing device according to an embodiment of the disclosure;

Fig. 6 is an exemplary structure diagram of a device for inferring mislabeled data in a data set according to an embodiment of the disclosure; and

Fig. 7 is an exemplary structure diagram of a device for inferring reliability of training data according to an embodiment of the disclosure.

Detailed Description of the Embodiments

In order to allow a person skilled in the art to have a better understanding of the present invention, the embodiments of the present invention will be clearly and completely described in conjunction with the drawings of the present invention in the following. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. On the basis of the embodiments of the present invention, all of the other embodiments which can be obtained by a person skilled in the art without involving any inventive efforts fall within the range set forth by the present invention.

In recent years, in some fields, a large amount of training data may be used for training a machine learning model, so that the application of machine learning technology is rapidly developed. However, in the prior art, training data used for training a machine learning model includes mislabeled data sometimes, for example, a setting error made by a checking device or a manmade input error, so that the machine learning model trained by using the mislabeled training data may not be optimal. In order to train an optimal machine learning model, a large amount of training data for machine learning model training is required. However, if there is a large amount of training data, it is more difficult to find mislabeled training data. In the prior art, training data is usually checked manually or the training data is checked through an automated tool to find out the mislabeled data. However, the problem that the mislabeled data may not be inferred accurately and rapidly is not solved in these methods.

In order to infer the mislabeled training data more accurately and more rapidly, in an exemplary embodiment of the disclosure, all data or part of the data in a data set is divided into a plurality of data subsets, a plurality of corresponding machine learning models are generated on the basis of the plurality of data subsets respectively, then each machine learning model in the plurality of machine learning models is evaluated by using the data subsets which are not used to generate the machine learning model currently to be evaluated, and the data subset including mislabeled data is inferred according to obtained evaluation results.

In such a manner, by the method for inferring the mislabeled data in the data set provided by the embodiment of the disclosure, the mislabeled training data in the data set for training machine learning models may be inferred more accurately and more rapidly.

After the data subset including the mislabeled data is inferred, the inferred data subset including the mislabeled data may be manually checked, labels of the mislabeled data are modified, and the machine learning models are updated with the modified data subsets. Or, the data subset including the mislabeled data may be removed from the data set, and the machine learning models are updated with the data set from which the data subset including the mislabeled data is removed. Or, the inferred data subset including the mislabeled data may further be taken as a data set, and the method for inferring the mislabeled data in the data set is repeatedly executed until the mislabeled data in the data set is inferred.

In such a manner, quality of the training data may be improved, so that an optimal machine learning model may be trained by using the high-quality training data.

The method for inferring the mislabeled data in the data set may be applied to various scenarios, and for example, may be applied to the fields of automatic driving, medical treatment and health, retailing, aerospace, transportation and the like.

In an exemplary embodiment of the disclosure, the method for inferring the mislabeled data in the data set is applied to an automatic driving system. Fig. 1 A is a flowchart of a method for controlling driving a vehicle according to an embodiment of the disclosure. As shown in Fig. 1 A, the method includes the following steps.

In Step S100, all data or part of the data in a driving information data set related to driving information of the vehicle is divided into a plurality of data subsets, wherein each piece of data in the driving information data set is pre-pasted with a label.

The automatic driving system may include a Central Processing Unit (CPU), a braking system, an acceleration system, a steering system, a navigation system and a sensing system. The navigation system is configured to receive data about geographical position information (for example, Global Positioning System (GPS) data, the received data may be used for determining a current position of the vehicle), and determine an overall driving route of the vehicle according to the current position of the vehicle and a target position set by a user. The sensing system includes more than one sensor, and is configured to sense sensing information of obstacles in front of, behind and on a left side and right side of the vehicle, a traffic signal in front of the vehicle, road signs in front of and on the right side of the vehicle and the like, and send the detected sensing information to the central processing unit.

After receiving the sensing information, the central processing unit divides all the data or part of the data in the driving information data set related to the driving information of the vehicle into the plurality of data subsets, wherein each piece of data in the driving information data set is pre-pasted with a label.

In Step S102, a plurality of corresponding driving control models are generated on the basis of the plurality of data subsets respectively.

A supervised learning may be a process for searching for a function complying with the known data in the whole mapping space according to the known data. Specifically, for each data subset, model parameters complying with the data subset is searched, and finally found function is the driving control model trained by the data subset.

Some machine learning algorithms listed in the prior art are adopted to solve the supervised learning problem, for example, Naive Bayes for a classification problem, logistic regression and a support vector machine, and will not be elaborated herein.

In Step S104, each of the plurality of driving control models is evaluated by using the data subsets, which are not used to generate the driving control model currently to be evaluated, in the plurality of data subsets, and the data subset including mislabeled data in the plurality of data subsets is inferred according to generated evaluation results.

In Step S106, the driving information data set is processed on the basis of the inferred data subset including the mislabeled data, and the driving of the vehicle is controlled on the basis of the processed driving information data set.

In an embodiment of the disclosure, the mislabeled data or the data subset including the mislabeled data is removed from the driving information data set, a control instruction is generated according to the sensing information and the driving information data set from which the mislabeled data (or the data subset) is removed, and the braking system, the steering system, the acceleration system and the like are controlled by means of the control instruction, that is, each part of the vehicle is controlled to control a direction and speed of the vehicle by means of a reliable control instruction.

It should be noted that it is the vehicle described in the embodiment, but the vehicle may include, but not limited to, any type of vehicle such as an automobile, a ship, an airplane and a train. In the embodiment, the driving information data with high reliability is used, so that the vehicle may be operated more accurately according to a calculated result.

In another exemplary embodiment, the method for inferring the mislabeled data in the data set may be applied to the field of medical treatment and health, for example, drug discovery, gene testing, personalized healthcare or precision surgical operations. In an embodiment, a surgical operation is mainly taken as an example. In clinical surgery, real-time interactive quantitative analyses are usually needed to be performed on the three-dimensional volume, distance, angle, blood vessel diameter etc. of human organs by using images, so as to perform a full quantitative three-dimensional assessment before surgery. However, in practice, deviations sometimes may occur to the accuracy of such three-dimensional assessment. When the method for inferring the mislabeled data in the data set provided in the embodiment of the disclosure is applied to three-dimensional evaluation of the organ, mislabeled data in three-dimensional data output by using image data may be inferred. Specifically, for data, acquired by using image data, of the organ of the human body, if it includes the mislabeled data, output of a machine (for example, a surgical operation robot) may be unexpected, and using the output as the action of the machine will be dangerous. In order to avoid such risk, under the condition that the output three-dimensional data includes the mislabeled data, it may be submitted to a doctor for final confirmation, thereby generating accurate three-dimensional evaluation and making the surgical operation more rapid, more accurate and safer.

Fig. 1 B is an exemplary flowchart of a method for inferring mislabeled data in a data set according to an embodiment of the disclosure. As shown in Fig. 1 B, the method includes the following steps.

In Step S10, all data or part of the data in a data set is divided into a plurality of data subsets, wherein each piece of data in the data set is pre-pasted with a label.

In Step S12, a plurality of corresponding machine learning models are generated on the basis of the plurality of data subsets respectively.

A machine learning model is generated according to training data in a data subset in the plurality of data subsets. With adoption of the same method, the machine learning models corresponding to the other data subsets are trained by using the other data subsets, respectively.

In Step S14, each of the plurality of machine learning models is evaluated by using the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets.

The plurality of data subsets are taken as a plurality of training data subsets, and are also taken as a plurality of evaluation data subsets. However, an evaluation data subset therein may not be used to evaluate the machine learning model generated by using the evaluation data subset itself, that is, each machine learning model in the plurality of machine learning models may be evaluated by using the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets.

Taking an evaluation data subset and any machine learning model, which is generated not by using the evaluation data subset, in the plurality of machine learning models as an example, the data in the evaluation data subset is input into the machine learning model respectively to obtain detection labels corresponding to the input data; and the pre-pasted labels of the input data are compared with the detection labels to determine types of the input data. For example, the data is inferred to be data of which the pre-pasted label is negative (i.e., the second label) and the detection label is also negative, or data of which the pre-pasted label is positive (i.e., the first label) and the detection label is also positive, or data of which the pre-pasted label is negative but the detection label is positive, or data of which the pre-pasted label is positive but the detection label is negative.

Statistics is made to the different types of data respectively according to the determined data types. Then, evaluation results are calculated according to statistical amounts of the different types of data. An example of the evaluation result may be a false negative rate, a false positive rate or a mislabeling rate.

Under the condition that the evaluation result is a false negative rate, a ratio of the amount of the data of which the pre-pasted labels are positive but the detection labels are negative to the amount of the data of which the pre-pasted labels are positive is calculated in the evaluation data subset, and the calculated ratio is taken as the false negative rate of the current machine learning model.

Under the condition that the evaluation result is a false positive rate, a ratio of the amount of the data of which the pre-pasted labels are negative but the detection labels are positive to the amount of the data of which the pre-pasted labels are negative is calculated in the evaluation data subset, and the calculated ratio is taken as the false positive rate of the current machine learning model.

Under the condition that the evaluation result is a mislabeling rate, a ratio of the amount of the data of which the pre-pasted labels and detection labels are inconsistent in the evaluation data subset to a total amount of the data is calculated in the evaluation data subset, and the calculated ratio is taken as the mislabeling rate of the current machine learning model.

By the method, the evaluation result of each machine learning model is calculated with respect to each evaluation data subset (except the machine learning model generated by using the evaluation data subset), thereby obtaining evaluation results.

In Step S16, the data subset including mislabeled data in the plurality of data subsets is inferred according to evaluation results.

In an exemplary embodiment of the disclosure, the data subset including the mislabeled data may be inferred by inferring an undesired machine learning model. Specifically, the undesired machine learning model in the plurality of machine learning models is inferred according to the evaluation results, and the training data subset used for generating the inferred undesired machine learning model is inferred to be the data subset including the mislabeled data. For example, in the plurality of machine learning models, the machine learning model with a highest frequency of occurrence of evaluation results, which each is more than a first threshold, is inferred as the undesired machine learning model.

In another exemplary embodiment of the disclosure, the data subset including the mislabeled data may be inferred by inferring an evaluation data subset including the mislabeled data. For example, in N data subsets as evaluation data subsets, the data subset with a highest frequency of occurrence of evaluation results, which each is more than a first threshold, is inferred as the wrong evaluation data subset.

In Step S18, corresponding processing is performed on the inferred data subset including the mislabeled data.

In an exemplary embodiment of the disclosure, the data subset including the mislabeled data may be removed from the data set, and the plurality of machine learning models are updated with the data set from which the data subset including the mislabeled data is removed. Or, the inferred data subset including the mislabeled data is manually checked, labels of the mislabeled data are modified, and the machine learning models are updated with the modified data subset. Or, the inferred data subset including the mislabeled data is taken as a new data set, and Step S10 to Step S16 are repeated until the mislabeled data is inferred. In the prior art, whether a label is mistakenly pasted or not is finally required to be determined by a user. However, if there exists a large amount of data in the data set, it is extremely difficult to find an error. By the method provided by the disclosure, the mislabeled data may be inferred easily and accurately.

Fig. 2 is an exemplary flowchart of another method for inferring mislabeled data in a data set according to an embodiment of the disclosure. As shown in Fig. 2, the method includes the following steps.

In Step S20, a data set is prepared.

A data set including a large amount of data is prepared, wherein the data in the data set may be in various forms, and for example, may be image data obtained by shooting an object (for example, a part or a product), or may be waveform data such as output of a motor and blood pressure of a person.

Each piece of data in the data set is pasted with a label, and a value of the label may be negative or positive in an example. Under the condition that the data is image data, the label “positive” may represent positive, and the label “negative” may represent no good. Under the condition that the data is waveform data, the label “positive” may represent that the waveform data is normal, and the label“negative” represents that the waveform data is abnormal. Of course, in another embodiment, the pasted label of the data may also be a label representing another meaning, and a specific content and meaning of the label will not be limited in the exemplary embodiment of the disclosure.

The data in the data set includes correctly labeled data and mislabeled data, that is, the labels assigned to the data through a certain recognition model or manually are positive and negative. Here, there is made such a hypothesis that most of the data is correctly labeled data and an amount of the mislabeled data is very small. A relationship between a true value as a reference and a pre-pasted label is shown in Table 1 .

Table 1

As shown in Table 1 , there exist four possibilities for the relationship between the true value of the data in the data set and the pre-pasted label: the first is that the true value of the data is positive and the pasted label is also positive; the second is that the true value of the data is positive but the pasted label is negative; the third is that the true value is negative and the pasted label is positive; and the fourth is that the true value of the data is negative but the pasted label is negative.

Under the conditions of the first and the fourth, the true value and the label are consistent, that is, the data is assigned with a correct label. On the other aspect, under the conditions of the second and the third, the true value and the label are inconsistent, that is, the data is assigned with a wrong label.

The second condition in Table 1 represents that data which is positive is determined to be no good, so that this means“false negative”, that is, an algorithm makes an excessively strict judgment. On the other aspect, the third condition represents that data which is no good is determined to be positive, so that this means “false positive”, that is, the algorithm does not notice the no good which should be noticed.

In Step S22, the data set is divided into a plurality of data subsets.

As shown in Fig. 3, the data set S is divided into a plurality of data subsets S1 , S2, S3...Sn which are mutually exclusive. As mentioned above, an amount of correctly labeled data in the data set should be much larger than an amount of mislabeled data, that is, there are a small number of data subsets including the data of the second type or the third type. However, the specific data subset including the data of the second type or the third type in S1 ~Sn is not known in advance, so that how to infer the data subset including the mislabeled data in S1 ~Sn is a key point of the disclosure, and will be described below in detail.

In Step S24, the plurality of data subsets are taken as a plurality of training data subsets, and a plurality of corresponding machine learning models are generated by using the plurality of training data subsets respectively.

The data subsets S1 ~Sn are used as the training data subsets, and n machine learning models are generated. The machine learning model generated by using training data in S1 is set to be M1 , the machine learning model generated by using training data in S2 is set to be M2, ..., and the machine learning model generated by using training data in Sn is set to be Mn. Here, a label (i.e., pre-pasted label) assigned to each piece of training data in the corresponding data subsets S1 ~Sn in advance is determined as a true value of the training data.

In Step S26, the plurality of data subsets are taken as a plurality of evaluation data subsets, each machine learning model in the plurality of machine learning models is evaluated by using the evaluation data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of evaluation data subsets, and the data subset including mislabeled data is inferred according to evaluation results.

In an exemplary embodiment of the disclosure, the n machine learning models are evaluated by utilizing the data subsets S1 ~Sn as the evaluation data subsets respectively. For example, the data subset S1 is input into the machine learning model M2 to evaluate an evaluation result of the machine learning model M2 with respect to the data subset S1 ; the data subset S1 is input into the machine learning model M3 to evaluate an evaluation result of the machine learning model M3 with respect to the data subset S1 ; ...; and the data subset Sn is input into the machine learning model M3 to evaluate an evaluation result of the machine learning model M3 with respect to the data subset S1. It is important to note that the data subset S1 may not be used to evaluate the machine learning model M1 generated according to the data subset S1. By parity of reasoning, the evaluation results of each machine training model with respect to each data subset are obtained.

False negative rates, false positive rates, mislabeling rates or the like may be calculated as the evaluation results.

In an exemplary embodiment of the disclosure, a false negative rate is taken as an example. A false negative rate refers to, for a machine learning model, in all output of the machine learning model with respect to an evaluation data subset, a ratio of a total amount of data of which true values are positive but labels are negative to a total amount of data of which true values are positive, i.e., a total amount of data of the second type/(a total amount of data of the first type + the total amount of the data of the second type). Specifically, data in an evaluation data subset is input into a machine learning model to be evaluated respectively to obtain labels (i.e., detection labels) of all the data in the evaluation data subset, and in the evaluation data, a ratio of a total amount of data of which true values are positive but the labels are negative to a total amount of data of which true values are positive subset is taken as a false negative rate of the machine learning model to be evaluated. If there is less mislabeled data, the false negative rate is closer to 0, and if there is more mislabeled data, the false negative rate is closer to 1.

False negative rates of all the machine learning models with respect to each evaluation data subset may be calculated by adopting the foregoing formula for calculating the false negative rate, and calculation results are shown in an exemplary Table 2.

Table 2

The ratio 0.53 in Table 2 represents a false negative rate when the machine learning model M2 generated by taking S2 as a training data subset is evaluated with the evaluation data subset S1 , i.e., a false negative rate of M2 with respect to S1 ; similarly, the ratio 0.11 represents a false negative rate when the machine learning model M3 generated by taking S3 as a training data subset is evaluated with the evaluation data subset S1 ; and the ratio 0.21 represents a false negative rate when the machine learning model M1 generated by taking S1 as a training data subset is evaluated with the evaluation data subset S2. It should be noted that the false negative rate mentioned herein is calculated by taking the labels assigned to the data in the data subsets S1 ~Sn in advance as true values and inferring whether the labels assigned by the machine learning models are correct or not according to the true values.

After the false negative rates of each machine learning model with respect to each evaluation data subset are calculated, the data subset including the mislabeled data may be inferred. Accuracy of a machine learning model generated by using data including mislabeled training data may usually be reduced. That is, under the condition that a machine learning model outputs many wrong results for data in an evaluation data subset, a probability that training data used to train the machine learning model is unsuitable is high. Therefore, in Table 2, a probability that the data subset S2 generating the machine learning model M2 includes the mislabeled data is high. In other words, compared with the other data subsets, a probability that the data subset with a very high false negative rate (i.e., the false negative rate is higher than a specified threshold value) includes excessively detected data is high.

How to calculate the false negative rates and infer the data subset including the excessively detected data according to calculation results is described above with the false negative rate as an example. A method for calculating false positive rates and a method for inferring the data subset including omitted data by using the false positive rates are similar to the method using the false negative rates. A false positive rate refers to, for a machine learning model, in all output of the machine learning model with respect to an evaluation data subset, a ratio of a total amount of data of which true values are negative but labels are positive to a total amount of data of which true values are negative, i.e. a total amount of data of the third data type/(a total amount of data of the first data type + the total amount of the data of the second data type). Similarly, compared with the other data subsets, a probability that the data subset with a very high false positive rate includes the omitted data is high. It is important to note that the data subset including the excessively detected data and the data subset including the omitted data are not always consistent.

Besides the false positive rates and the false negative rates for evaluation, mislabeling rates may also be calculated to evaluate the machine learning models. That is, the machine learning models are not always evaluated through false negative and false positive, and a proportion of the mislabeled data in the whole data set may also be inferred.

In Step S28, corresponding processing is performed on the inferred data subset including the mislabeled data.

In an exemplary embodiment of the disclosure, corresponding processing may be performed on the data subset including the mislabeled data in the following three manners: a first manner in which the data subset including the mislabeled data is manually checked, and the machine learning models are updated after the labels are modified; a second manner in which the data subset including the mislabeled data is removed from the data subsets S1 ~Sn, and the learning models are updated with the data set from which the data subset including the mislabeled data is removed; and a third manner in which the data subset including the mislabeled data is further divided into a plurality of data subsets with relatively small data amounts, and Step S20 to Step S26 are repeatedly executed until undesired data is inferred.

Of course, Step S20 to Step S26 may also be repeatedly executed sometimes only until an approximate position of the mislabeled data is estimated rather than the mislabeled data is inferred, that is, only a relatively small data subset (including a small amount of data) including the mislabeled data is inferred.

In another exemplary embodiment of the disclosure, the processing method in the first manner or the second manner may be executed when the mislabeled data is inferred so as to improve quality of the training data.

By the method provided by the embodiment, it is easy to infer the mislabeled data. For example, compared with the condition of checking the whole data set, an amount of data required to be checked is reduced. Moreover, in the embodiment, the machine learning models are further regenerated by using a data set obtained by modifying or deleting the mislabeled data; or the operation of further dividing the inferred data subset including the mislabeled data into a plurality of data subsets is repeatedly executed to infer the mislabeled data; or errors are classified (for example, false negative or false positive or mislabeling), and for each type of errors which are classified, data subsets including the mislabeled data are inferred. By these processing manners, the quality of the training data in the data set is improved.

An exemplary embodiment of the disclosure further provides a method for inferring reliability of training data.

When an algorithm of a machine learning model is applied, there may exist the condition that training data which should be correct is mislabeled sometimes, for example, a setting error by a checking device and an input error by a person. If a large amount of training data is used, even if individual data is mislabeled in the training data, an approximately correct result may be obtained for noise. However, in practice, application of obtaining a large amount of data as training data is limited. Therefore, how to infer mislabeled data in the training data and accordingly infer reliability of the training data is particularly important.

Fig. 4 is an exemplary flowchart of a method for inferring reliability of training data according to an embodiment of the disclosure. As shown in Fig. 4, the method includes the following steps.

In Step S40, a data set is prepared.

The data set is prepared, each piece of data in the data set is manually or automatically assigned with a label value as an initial value of a label of the data in advance, but the pre-assigned label value may be inconsistent with a true value of the data. A relationship between a true value taken as a reference and a pre-assigned value is shown in Table 3.

Table 3

As shown in Table 3, there exist four possibilities for the relationship between the true value of the training data in the data set and the pre-assigned value: the first is that the true value of the data is positive and the pre-assigned value is also positive; the second is that the true value of the data is positive but the pre-assigned value is negative; the third is that the true value is negative and the pre-assigned value is positive; and the fourth is that the true value of the data is negative but the pre-assigned value is negative.

Under the conditions of the first and the fourth, the true value and the pre-assigned value are consistent, that is, the data is assigned with a correct label. On the other aspect, under the conditions of the second and the third, the true value and the pre-assigned value are inconsistent, that is, the data is assigned with a wrong label.

In Step S42, the data set is randomly divided into a plurality of data subsets, and the data subset including mislabeled data in the plurality of data subsets is inferred.

The data set taken as a training data set and also taken as an evaluation data set is divided into the plurality of data subsets, wherein the training data set includes a plurality of data subsets which are taken as training data subsets, and the evaluation data set includes the same a plurality of data subsets which are taken as the evaluation data subsets.

In a case that training data in the training data set includes the mislabeled data, it leads to a result that: if there is a large amount of data of the second type in the training data set, a probability that a machine learning model generated by the training data is excessively detected is increased, that is, the data labeled with a label “positive” is likely to be determined to be the data labeled with a label“negative”; and if there is a large amount of data of the third type in the training data set, a probability that the machine learning model is omitted is increased, that is, the data labeled with NG is likely to be determined to be the data with a label“positive”. In a case that evaluation data in an evaluation data set is mislabeled but the machine learning model is correct, it leads to a result that: if there is a large amount of data of the second type in the training data set, the probability that the machine learning model is omitted is increased; and if there is a large amount of data of the third type in the training data set, the probability that the machine learning model is excessively detected is increased.

According to the foregoing principle, for each machine learning model, it is only necessary to calculate a false positive rate or false negative rate with respect to each evaluation data subset (except the evaluation data subset used to generate the machine learning model currently to be evaluated), and then the machine learning model with a high false positive rate or false negative rate may be determined, thereby inferring that the data subset corresponding to the machine learning model is a data subset including the mislabeled data.

Therefore, the method for inferring the mislabeled data in the data set in Fig. 1 or Fig. 2 may be executed to infer the data subset including the mislabeled data in the plurality of data subsets. That is, the label values, calculated by the machine learning models, of the data are compared with the pre-assigned label values, and the data subset including the mislabeled data is inferred according to a comparing result.

In Step S44, reliability of training data is inferred.

If more data subsets including the mislabeled data are inferred in Step S42, it is indicated that more data subsets are abnormal, and at this moment, it may be inferred that the quality of the training data in the data set is poorer, that is, the reliability of the training data is lower; and on the contrary, if fewer data subsets including the mislabeled data are inferred, it is indicated that fewer data subsets are abnormal, and at this moment, it may be inferred that the quality of the training data in the data set is higher, that is, the reliability of the training data is higher.

In an embodiment of the disclosure, the labels of the mislabeled data may also be removed or modified under the condition that the label values output by the machine learning models are inconsistent with the pre-assigned label values, that is, there exist errors, the machine learning models are re-updated with the modified data set until the errors between the label values output by the machine learning models and the pre-assigned label values are 0, and when the errors are 0, the data in the data set is taken as real training data. By the method, the reliability of the training data may be improved.

Fig. 5 is a schematic view of an example of an information processing device according to an embodiment of the disclosure. The information processing device may be, for example, a PC (Personal Computer) or an embedded device. As shown in Fig. 5, the PC 500 can include a CPU 510 for performing overall control, a read only memory (ROM) 520 for storing system software, a random access memory (RAM) 530 for storing written-in/read-out data, a storage unit 540 for storing various programs and data, an input/output unit 550 being used as an input/output interface, and a communication unit 560 for implementing a communication function. Alternatively, the CPU 510 can be replaced by a processor, for example a microprocessor MCU or a Field-Programmable Gate Array (FRGA). The input/output unit 550 can include various interfaces, such as an input/output interface (I/O interface), a universal serial bus (USB) port (can be included as one port of the ports of an I/O interface), and a network interface. It can be understood for a person skilled in the art that the structure shown in Fig. 5 is merely illustrative, and does not limit the hardware configuration of the system for inferring the reliability of data. For example, the PC 300 can further include more or fewer components than those shown in Fig. 5, or have a configuration different from that shown in Fig. 5.

It should be noted that the described CPU 510 can include one or more processor(s), the one or more processor(s) and/or other data processing circuits in the disclosure can generally be referred to as“data processing circuit”. The data processing circuit can be wholly or partly embodied as software, hardware, firmware or any other combinations. In addition, the data processing circuit can be a single independent processing module, or wholly or partly integrated into any one of the other components in the PC 500.

The storage unit 540 can be used for storing software programs of application software and modules, as a program instruction/data storage device described in the disclosure later, the program instruction/data storage device corresponding to the method for inferring the reliability of the data. The CPU 510 operates the software programs and modules stored in the storage unit 540 so as to implement the described method for inferring the reliability of data. The storage unit 540 can include a non-volatile memory, such as one or more magnetic memory, flash memory or other non-volatile solid state memory. In some examples, the storage unit 540 can further include memories which are remotely provided with respect to the CPU 510, and these remote memories can be connected to the PC 500 by means of a network. The examples of the described network include, but are not limited to, Internet, Intranet, LAN, mobile communication network, and the combinations thereof. The communication unit 560 is used for receiving or sending data through a network. The specific examples of the described network can include the wireless network provided by the communication provider of the PC 500. In an example, the communication unit 560 includes a network interface controller (NIC), and the NIC can be connected to other network devices by a base station so as to communicate with the Internet. In an example, the communication unit 560 can be a radio frequency (RF) module, which communicates with the Internet in a wireless manner.

Fig. 6 is an exemplary structure diagram of a device for inferring mislabeled data in a data set according to an embodiment of the disclosure. As shown in Fig. 6, the device includes a data division portion 60, a model generating portion 62 and an inferring portion 64. The data division portion 60, the model generating portion 62 and the inferring portion 64 may be implemented by executing the program stored on the PC 500 as shown in Fig.5.

The data division portion 60 is configured to divide all data or part of the data in a data set into a plurality of data subsets, wherein each piece of data in the data set is pre-pasted with a label; the model generating portion 62 is configured to generate a plurality of corresponding machine learning models on the basis of the plurality of data subsets respectively; and the inferring portion 64 is configured to evaluate each machine learning model in the plurality of machine learning models by using the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets, and infer the data subset including the mislabeled data in the plurality of data subsets according to generated evaluation results.

Fig. 7 is an exemplary structure diagram of a device for inferring reliability of training data according to an embodiment of the disclosure. As shown in Fig. 7, the device includes a division portion 70, a generation portion 72, a recognizer 74 and an updating portion 76.

The division portion 70 is configured to divide a data set into a plurality of data subsets, and the generation portion 72 generates a plurality of corresponding machine learning models according to the plurality of data subsets respectively. The division portion 70 inputs the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets generated into each machine learning model in the plurality of machine learning models generated by the generation portion 72 respectively, and thus, the generation portion 72 obtains label values of all the data in the data set according to output results of the machine learning models.

The recognizer 74 compares the calculated label value of each piece of data with a pre-assigned label value of the data, and calculates a false positive rate or false negative rate of each machine learning model with respect to each evaluation subset (except the data subset generating the machine learning model) according to a comparison result. Under the condition that all of the false positive rates or false negative rates of a machine learning model with respect to all the evaluation subsets (except the data subset generating the machine learning model) are relatively high, it is inferred that the data subset generating the machine learning model is an abnormal data set. Under the condition that a number of abnormal data subsets included in the data set is larger than a preset threshold value, it is inferred that reliability of training data in the data set is relatively low.

When comparing the calculated label value of each piece of data with the pre-assigned label value of the data, the recognizer 74 may further determine whether the two are consistent or not, and if NO, determines that there exists an error. Under the condition that there exists the error, the recognizer 74 may remove the data from the data set or modify a label of the data by manual checking.

The updating portion 76 is configured to update the plurality of machine learning models by using the updated data set.

The device may further repeatedly execute the functions executed by the foregoing parts by using the updated a plurality of machine learning models until errors are 0 (that is, there are no errors).

When realized in a form of a software functional unit and sold or used as an individual product, said device for inferring the mislabeled data or part of it can be stored in a computer readable storage medium. On the basis of this understanding, the technical solution of the disclosure essentially can be, or the part of the technical solution which makes a contribution over the prior art or the whole technical solution or a part of the technical solution can be embodied in a form of a software product, and such computer software program is stored in a storage medium and comprises several instructions for enabling a computer device (which may be a PC computer, a server, or Internet equipment) to perform all the steps or part of the steps of the method according to each embodiments of the disclosure. Said storage medium includes various media capable of storing program codes, such as a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk or a compact disk, and may also includes a data flow that can be downloaded from a server or a cloud.

The above are only the preferred embodiments of the present disclosure. For those skilled in the art, various improvement and modifications can be made without departing from the principle of the present disclosure, and such improvement and modification are intended to be included within the scope of protection of the present disclosure.

Reference signs in the accompanying drawings

30: Input portion 32: Calculation portion

500: PC 510: CPU

520: ROM 530: RAM

540: Storage unit 550: I/O unit

560: Communication unit 60: Data division portion

62: Model generating portion 64: Inferring portion

70: Division portion 72: Generation portion

74: Recognizer 76: Updating portion

Claims

CLAIMS:

1. A method for controlling driving a vehicle, comprising:

dividing all data or part of the data in a driving information data set related to driving information of the vehicle into a plurality of data subsets, wherein each piece of data in the driving information data set is pre-pasted with a label;

generating a plurality of corresponding driving control models on the basis of the plurality of data subsets respectively;

evaluating each of the plurality of driving control models by using data subsets, which are not used to generate a driving control model currently to be evaluated, in the plurality of data subsets, and inferring a data subset comprising mislabeled data in the plurality of data subsets according to generated evaluation results; and

processing the driving information data set on the basis of the inferred data subset containing the mislabeled data, and controlling driving the vehicle on the basis of the processed driving information data set.

2. A method for inferring mislabeled data in a data set, comprising:

dividing all data or part of the data in the data set into a plurality of data subsets, wherein each piece of data in the data set is pre-pasted with a label;

generating a plurality of corresponding machine learning models on the basis of the plurality of data subsets respectively; and

evaluating each of the plurality of machine learning models by using the data subsets, which are not used to generate a machine learning model currently to be evaluated, in the plurality of data subsets, and inferring a data subset containing the mislabeled data in the plurality of data subsets according to generated evaluation results.

3. The method of claim 1 , wherein evaluating each of the plurality of machine learning models by using the data subsets, which are not used to generate the machine learning model currently to be evaluated, in the plurality of data subsets comprises:

taking each of the data subsets, which are not used to generate a machine learning model currently to be evaluated, in the plurality of data subsets as an evaluation data subset;

inputting the data in the evaluation data subset into each machine learning model in the plurality of machine learning models to obtain detection labels corresponding to the input data; comparing the pre-pasted labels of the input data with the detection labels to determine types of the input data; and

making statistics to the types of the input data, and determining the evaluation results of each of the plurality of machine learning models with respect to the evaluation data subset according to statistical results.

4. The method of claim 3, wherein the types of the input data comprise:

data of which the pre-pasted label is a first label and the detection label is also the first label;

data of which the pre-pasted label is a second label and the detection label is also the second label;

data of which the pre-pasted label is the second label and the detection label is the first label; and

data of which the pre-pasted label is the first label and the detection label is the second label,

wherein the first label and the second label have opposite meanings.

5. The method of claim 3 or 4, wherein making the statistics to the types of the input data and determining the evaluation results of each of the plurality of machine learning models with respect to the evaluation data subset according to the statistical results comprises at least one of the following operations:

under the condition that the evaluation results are false negative rates, for each of the plurality of machine learning models, calculating, in the evaluation data subset, a ratio of an amount of data of which the pre-pasted labels are the first labels but the detection labels are the second labels to an amount of data of which the pre-pasted labels are the first labels, and taking the calculated ratio as a false negative rate of the current machine learning model;

under the condition that the evaluation results are false positive rates, for each of the plurality of machine learning models, calculating, in the evaluation data subset, a ratio of an amount of data of which the pre-pasted labels are the second labels but the detection labels are the first labels to an amount of data of which the pre-pasted labels are the second labels in the evaluation data subsets, and taking the calculated ratio as a false positive rate of the current machine learning model; and

6. The method of claim 5, wherein the false negative rate is a rate of excessive detection, and the false positive rate is a rate of omission.

7. The method of any one of claims 3 to 6, wherein inferring the data subset containing the mislabeled data in the plurality of data subsets according to the generated evaluation results comprises: inferring the data subset containing the mislabeled data in the plurality of data subsets according to a distribution characteristic of evaluation results, which each is more than a first threshold value, in the generated evaluation results.

8. A device for inferring mislabeled data in a data set, comprising:

a data division portion configured to divide all data or part of the data in the data set into a plurality of data subsets, wherein each piece of data in the data set is pre-pasted with a label;

a model generating portion configured to generate a plurality of corresponding machine learning models on the basis of the plurality of data subsets respectively; and

an inferring portion configured to evaluate each of the plurality of machine learning models by using the data subsets, which are not used to generate a machine learning model currently to be evaluated, in the plurality of data subsets, and to infer a data subset containing the mislabeled data in the plurality of data subsets according to generated evaluation results.

9. A computer program, executed by a processor to implement the method of any one of claims 1 -7.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the method of any one of claims 1 -7.