WO2023029065A1 - Method and apparatus for evaluating data set quality, computer device, and storage medium

Info

Publication number: WO2023029065A1
Application number: PCT/CN2021/117109
Authority: WO
Inventors: 马影; 周晓勇; 魏国富; 刘胜; 夏玉明
Original assignee: 上海观安信息技术股份有限公司
Priority date: 2021-08-30
Filing date: 2021-09-08
Publication date: 2023-03-09
Also published as: CN113448955B; CN113448955A

Abstract

The present invention relates to the field of information technology, and disclosed are a method and apparatus for evaluating data set quality, a computer device, and a storage medium, mainly aiming to improve the precision and efficiency in evaluating data set quality. The method comprises: acquiring data to be evaluated in a data set; compiling attribute features of said data under a plurality of evaluation dimensions; and, on the basis of the attribute features under the plurality of evaluation dimensions, evaluating the quality of said data, and obtaining quality evaluation results of said data under the plurality of evaluation dimensions. The present invention is applicable to the evaluation of data set quality.

Description

Data set quality assessment method, device, computer equipment and storage medium

Technical field

The present invention relates to the field of information technology, in particular to a data set quality evaluation method, device, computer equipment and storage medium.

Background art

Data is the foundation for developing and applying artificial intelligence, and data sets are crucial to artificial intelligence algorithms. Training on data sets of different quality yields different model parameters and therefore different execution results, which in turn affects the security of artificial intelligence algorithms. If an attacker maliciously modifies a data set or adds data to it, the model will make prediction errors. How to effectively detect and evaluate the quality of data sets has therefore become an urgent problem in artificial intelligence security.

Currently, the quality of data sets is usually assessed by technicians based on their experience. This approach depends heavily on the technicians' work experience, and the assessment results are strongly affected by subjective human factors. It may therefore fail to assess the quality of a data set accurately, which can lead to artificial intelligence security incidents. In addition, manual assessment of data set quality is inefficient and increases the workload of technical personnel.

Summary of the invention

The invention provides a data set quality assessment method, device, computer equipment and storage medium, mainly aiming at improving the assessment accuracy and assessment efficiency of the data set quality.

According to a first aspect of the present invention, a data set quality assessment method is provided, including:

obtaining the data to be evaluated in a data set;

separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and

performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.

According to a second aspect of the present invention, a data set quality assessment device is provided, including:

an acquisition unit, configured to acquire the data to be evaluated in a data set;

a statistical unit, configured to separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and

an evaluation unit, configured to perform quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.

According to a third aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:

obtaining the data to be evaluated in a data set;

separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and

performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.

According to a fourth aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and operable on the processor, and the processor implements the following steps when executing the program:

obtaining the data to be evaluated in a data set;

separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and

performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.

Compared with the current practice in which technicians evaluate data set quality based on their own experience, the data set quality assessment method, device, computer equipment and storage medium provided by the present invention obtain the data to be evaluated in a data set; separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and perform quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple evaluation dimensions, which improves the accuracy and efficiency of data set quality evaluation and effectively safeguards the security of data sets during artificial intelligence development.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and constitute a part of the application. The schematic embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

FIG. 1 shows a flowchart of a data set quality assessment method provided by an embodiment of the present invention;

FIG. 2 shows a flowchart of another data set quality assessment method provided by an embodiment of the present invention;

FIG. 3 shows a schematic structural diagram of a data set quality assessment device provided by an embodiment of the present invention;

FIG. 4 shows a schematic structural diagram of another data set quality assessment device provided by an embodiment of the present invention;

FIG. 5 shows a schematic diagram of a physical structure of a computer device provided by an embodiment of the present invention.

Detailed description

Hereinafter, the present invention will be described in detail with reference to the drawings and examples. It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

In order to solve the above problems, an embodiment of the present invention provides a data set quality assessment method. As shown in FIG. 1, the method includes:

101. Obtain the data to be evaluated in the data set.

The data set includes a training data set and a prediction data set, and the data to be evaluated may specifically be each sample in the training data set or each prediction data item in the prediction data set. To overcome the low accuracy and low efficiency of data set quality evaluation in the prior art, the embodiment of the present invention develops a data set quality evaluation tool that automatically evaluates the quality of a data set from multiple evaluation dimensions, thereby improving the accuracy and efficiency of data set quality evaluation while safeguarding the security of data sets during artificial intelligence development. The embodiments of the present invention are mainly applied to scenarios in which the quality of a data set is evaluated in multiple dimensions. The execution subject of the embodiment of the present invention is an apparatus or device capable of evaluating the quality of a data set, which may specifically be deployed on the server side.

For the embodiment of the present invention, the training data set and the prediction data set whose quality is to be evaluated are collected in advance. The data in the training data set and the prediction data set may be structured data or unstructured data, such as image data. After obtaining the training data set or prediction data set to be evaluated, the technician can click the file upload button on the interface of the data set quality evaluation tool to upload it to the tool, so that the tool performs a multi-dimensional quality assessment on the data set to be evaluated.

102. Collect the attribute characteristics of the data to be evaluated under multiple evaluation dimensions respectively.

The multiple evaluation dimensions include a data scale evaluation dimension, a data balance evaluation dimension, a data accuracy evaluation dimension, a data pollution evaluation dimension and a data bias evaluation dimension. It should be noted that the evaluation dimensions in the embodiments of the present invention are not limited to those listed above; other evaluation dimensions can also be included and can be set according to actual business needs. Further, the attribute characteristics of the data to be evaluated under the data scale evaluation dimension include: the total amount of data, the number of features, the memory size occupied by the data, whether the data has labels, and so on; the attribute characteristics under the data balance evaluation dimension include the proportion of the data volume under each label; the attribute characteristics under the data accuracy evaluation dimension include: the total amount of data, the total number of missing labels, whether each label is abnormal, and so on; the attribute characteristics under the data pollution evaluation dimension include: the amount of noise data, the amount of adversarial data, and so on; and the attribute characteristics under the data bias evaluation dimension include the bias features corresponding to the data to be evaluated.

For the embodiment of the present invention, the attribute characteristics of the data to be evaluated are counted separately under the data scale, data balance, data accuracy, data pollution and data bias evaluation dimensions. A different statistical method is used for each evaluation dimension; see steps 202-206 for details.

103. Perform quality assessment on the data to be assessed based on the attribute characteristics under the multiple assessment dimensions, and obtain quality assessment results of the data to be assessed respectively under the multiple assessment dimensions.

For the embodiment of the present invention, different evaluation dimensions have different evaluation standards. When the quality of the data to be evaluated is assessed using the attribute characteristics under the multiple evaluation dimensions, if the data to be evaluated fails to meet the evaluation standard of any evaluation dimension, it is determined that the data to be evaluated has a quality problem; the data set then cannot be used for model training or prediction, and technicians need to re-collect the data set or clean the data set that has the quality problem. For example, under the data balance evaluation dimension, the “Yes” label accounts for 90% of the data volume and the “No” label accounts for 10%, so the difference between the two proportions is 80%. Because this difference is greater than the preset proportion difference of 60%, it is determined that the data to be evaluated does not meet the data balance evaluation standard. Training a model with data that does not meet the data balance evaluation standard is likely to degrade the execution of the model and cannot guarantee the security of the artificial intelligence algorithm. Similarly, the quality of the data to be evaluated can also be assessed under the data scale, data accuracy, data pollution and data bias evaluation dimensions. If the data to be evaluated does not meet the evaluation standard of any of these dimensions, it is determined that the data has a quality problem and cannot be used for model training or prediction. For the quality evaluation process under the different evaluation dimensions, see steps 202-206.

Compared with the current practice in which technicians evaluate data set quality based on their own experience, the data set quality assessment method provided by the embodiment of the present invention obtains the data to be evaluated in a data set; separately counts the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performs quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple evaluation dimensions, which improves the accuracy and efficiency of data set quality evaluation and effectively safeguards the security of data sets during artificial intelligence development.

Further, to better illustrate the data set quality assessment process described above, as a refinement and extension of the above embodiment, an embodiment of the present invention provides another data set quality assessment method. As shown in FIG. 2, the method includes:
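The data balance check in the example above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the rule "largest label proportion minus smallest label proportion must not exceed the preset difference" is one reading of the example, not a definition given in the text.

```python
def label_proportions(labels):
    """Return the share of the data volume that falls under each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return {label: count / len(labels) for label, count in counts.items()}


def meets_balance_standard(labels, max_gap=0.60):
    """Assumed rule: the gap between the largest and smallest label
    proportion must not exceed the preset difference (60% in the example)."""
    proportions = label_proportions(labels)
    gap = max(proportions.values()) - min(proportions.values())
    return gap <= max_gap, gap


# Example from the text: 90% "Yes" labels and 10% "No" labels.
labels = ["Yes"] * 90 + ["No"] * 10
balanced, gap = meets_balance_standard(labels)
print(balanced, round(gap, 2))
```

With the 90%/10% split the gap is 80%, so the sketch reports that the balance standard is not met.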

201. Acquire the data to be evaluated in the data set.

Wherein, the data set includes a training data set and a prediction data set, and the data to be evaluated may specifically be each sample data in the training data set, or each prediction data in the prediction data set. For the embodiment of the present invention, before evaluating the quality of the data set, it is necessary to obtain the data set to be evaluated, and the specific method of obtaining the data set is exactly the same as that of step 101, and will not be repeated here.

202. Count the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension, and perform quality evaluation on the data to be evaluated based on the attribute characteristics under the data pollution evaluation dimension, to obtain the quality assessment result of the data to be evaluated under the data pollution evaluation dimension.

For the embodiment of the present invention, if the data to be evaluated is structured data in the training data set, then during the pollution evaluation it is necessary to detect whether the data to be evaluated contains noise data, because noise data strongly interferes with model training and easily degrades the execution of the model. Regarding the specific process of identifying noise data, as an optional implementation, step 202 specifically includes: fitting a function curve corresponding to the structured data using a preset interpolation algorithm, according to the structured data and its corresponding label categories; predicting the structured data using the function curve to obtain the predicted label category corresponding to the structured data; and determining whether the structured data is noise data based on the predicted label category and the label category corresponding to the structured data.

Specifically, the training data set contains a large amount of structured data. Each structured data item is treated as a sample point (x, y), and the sample points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) are used to fit a function curve y = f(x). Because the structured data to be evaluated may have missing data, the structured data needs to be interpolated with the preset interpolation algorithm before the sample points are used to fit the function curve. The preset interpolation algorithm may specifically be a Kriging interpolation algorithm. First, the structured data to be interpolated is determined, and the distances between the structured data to be interpolated and the structured data with known classification results are calculated. The data weight of each structured data item with a known classification result is then determined from its distance, and the classification result of the structured data to be interpolated is calculated from the weights and classification results of the structured data with known classification results. The classification result may specifically be a classification probability value corresponding to the structured data.

For example, suppose the structured data with known classification results are x1, x2 and x3, and the structured data to be interpolated is x4. Because the classification result of x4 is unknown, it can be estimated from the structured data x1, x2 and x3 with known classification results. Specifically, the distances between x4 and each of x1, x2 and x3 are calculated. The larger the distance, the farther the structured data with a known classification result is from the structured data to be interpolated and the smaller its influence, so its data weight is smaller; conversely, the smaller the distance, the closer it is to the structured data to be interpolated and the greater its influence, so its data weight is larger. After the data weights of the structured data with known classification results are determined, the classification results are weighted by the corresponding data weights to obtain the classification result of the structured data to be interpolated. The structured data to be interpolated, with its classification result now determined, is then inserted into the structured data with known classification results, which solves the problem of missing data.
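The distance-based weighting described above can be illustrated with inverse-distance weighting. This is a deliberately simplified stand-in, not Kriging itself: Kriging derives its weights from a fitted variogram model, whereas this sketch only reproduces the stated behaviour that a larger distance gives a smaller weight. The function name, the `power` parameter, the normalisation of the weights and the sample values are all assumptions made for illustration.

```python
import math


def idw_estimate(known_points, known_values, target, power=2.0):
    """Estimate the classification probability at `target` as a
    distance-weighted average of the known classification probabilities."""
    weights = []
    for point, value in zip(known_points, known_values):
        distance = math.dist(point, target)
        if distance == 0.0:
            # The target coincides with a known point: reuse its value.
            return value
        # Larger distance -> smaller weight, as described in the text.
        weights.append(1.0 / distance ** power)
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, known_values)) / total


# Hypothetical one-dimensional data: x1, x2, x3 are known, x4 is interpolated.
known_points = [(0.0,), (1.0,), (3.0,)]
known_values = [0.2, 0.4, 0.9]
x4 = (2.0,)
estimate = idw_estimate(known_points, known_values, x4)
print(round(estimate, 3))
```

Here x2 and x3 are both at distance 1 from x4 and dominate the estimate, while x1 at distance 2 contributes only a quarter of their weight.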

Furthermore, all of the above structured data are used as sample points for curve fitting to obtain a function curve y = f(x). Because the classification results of the structured data in the training data set are known, that is, the probability value of belonging to the real label category is known, the function curve can be used to predict the structured data and obtain its prediction result, that is, the probability value of belonging to the predicted label category. The probability value of belonging to the predicted label category is then subtracted from the probability value of belonging to the real label category to obtain the probability difference of the structured data. If the probability difference is greater than the preset probability difference, the structured data is determined to be noise data. For example, the probability value of structured data A belonging to the real label category is 0.87, the probability value of it belonging to the predicted label category is 0.27, and the probability difference is 0.87-0.27=0.60. Because this probability difference is greater than the preset probability difference of 0.2, structured data A is determined to be noise data. Whether each structured data item in the training data set is noise data can be determined in this manner.
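The comparison step can be sketched as follows, assuming the real and predicted probabilities are already available from the fitted curve. The function names are hypothetical, and taking the absolute value of the difference is an assumption: the text only describes subtracting one probability from the other.

```python
def is_noise(real_prob, predicted_prob, max_diff=0.2):
    """Flag a sample whose real and predicted probabilities differ by more
    than the preset probability difference (0.2 in the example)."""
    # abs() is an assumption so that deviations in either direction count.
    return abs(real_prob - predicted_prob) > max_diff


def find_noise(samples, max_diff=0.2):
    """Return the names of the samples flagged as noise.

    `samples` maps a sample name to (real_prob, predicted_prob)."""
    return [name for name, (real, predicted) in samples.items()
            if is_noise(real, predicted, max_diff)]


# Structured data A uses the values from the example; B is hypothetical.
samples = {"A": (0.87, 0.27), "B": (0.80, 0.75)}
noise = find_noise(samples)
print(noise)
```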

Further, after the noise data in the training data set is determined, a pollution assessment needs to be performed on the structured data in the training data set. Based on this, the method includes: separately counting a first data volume corresponding to the noise data and a second data volume corresponding to the structured data, and calculating a first data ratio between the noise data and the structured data from the first data volume and the second data volume; if the first data ratio is greater than a preset noise data ratio, it is determined that the structured data does not meet the data pollution evaluation standard. The preset noise data ratio may be set according to actual business requirements.

For example, suppose the preset noise data ratio is 10%. After the noise data in the training data set is determined, the total amount of noise data (the first data volume) is counted as 200, and the total amount of structured data in the training data set (the second data volume) is 1000. The first data ratio between the first data volume and the second data volume is therefore 200/1000=20%. Because the first data ratio of 20% is greater than the preset noise data ratio of 10%, it is determined that the training data set does not meet the data pollution evaluation standard, that is, the training data set has a quality problem and cannot be used for model training.
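The ratio check in this example is simple arithmetic and can be written as a small helper. The function name and return shape are hypothetical and chosen only for illustration.

```python
def exceeds_preset_ratio(flagged_count, total_count, preset_ratio):
    """Return (fails_standard, ratio), where ratio is the share of flagged
    data in the data set and fails_standard is True when it exceeds the
    preset ratio."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    ratio = flagged_count / total_count
    return ratio > preset_ratio, ratio


# Values from the example: 200 noise items out of 1000, preset ratio 10%.
fails, ratio = exceeds_preset_ratio(200, 1000, preset_ratio=0.10)
print(fails, ratio)
```

The same helper applies unchanged to any other count of flagged data measured against the size of its data set.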

In a specific application scenario, if the data to be evaluated is unstructured data in the prediction data set, then during the pollution evaluation it is necessary to detect whether the data to be evaluated contains adversarial data, that is, data maliciously crafted by an attacker, because once such adversarial data is mixed into the prediction data set it directly affects the prediction results of the model. Regarding the specific process of identifying adversarial data, as an optional implementation, step 202 specifically includes: compressing the unstructured data using a first preset compressor and a second preset compressor, respectively, to obtain first compressed data and second compressed data corresponding to the unstructured data; inputting the unstructured data, the first compressed data and the second compressed data into a pre-built model for prediction, to obtain an original prediction result corresponding to the unstructured data, a first prediction result corresponding to the first compressed data, and a second prediction result corresponding to the second compressed data; calculating a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result; and determining whether the unstructured data is adversarial data based on the first difference and the second difference.

The first preset compressor and the second preset compressor compress features in the unstructured data to reduce the input of unnecessary features and lower the dimensionality of the unstructured data. The features compressed by the first preset compressor differ from those compressed by the second preset compressor. For example, if the input unstructured data includes 10 features, that is, its input dimension is 10, the first preset compressor can compress the first and second features in the unstructured data, and the second preset compressor can compress the third and fourth features. It should be noted that the number of compressors used in the embodiment of the present invention is not limited to two; it can be set according to actual business requirements and the number of features. In addition, the original prediction result, the first prediction result and the second prediction result in the embodiment of the present invention are the probability values that the unstructured data belongs to the corresponding label category.

For example, unstructured data A is input into the first preset compressor and the second preset compressor for feature compression, yielding the first compressed data and the second compressed data corresponding to unstructured data A. Unstructured data A, the first compressed data and the second compressed data are then input into the pre-built model for prediction. The original prediction result corresponding to unstructured data A is 0.78, the first prediction result corresponding to the first compressed data is 0.56, and the second prediction result corresponding to the second compressed data is 0.63. Subtracting the first prediction result from the original prediction result gives a first difference of 0.22, and subtracting the second prediction result from the original prediction result gives a second difference of 0.15. The larger of the first difference and the second difference is then compared with the preset difference. If the maximum difference is greater than the preset difference, unstructured data A is determined to be adversarial data; if the maximum difference is less than the preset difference, unstructured data A is determined not to be adversarial data. For example, if the preset difference is 0.2, then because the maximum difference of 0.22 is greater than the preset difference of 0.2, unstructured data A is determined to be adversarial data. All adversarial data in the prediction data set can be identified in this manner.
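The compress, predict and compare procedure can be sketched as below. Everything concrete here is an assumption made for illustration: the text does not say how a compressor transforms features, so the hypothetical `squeeze_features` simply zeroes the selected positions, and `toy_model` is a stand-in scoring function rather than a trained model.

```python
def squeeze_features(indices):
    """Return a hypothetical compressor that zeroes the given feature
    positions (an assumed form of compression)."""
    def compress(sample):
        return [0.0 if i in indices else value
                for i, value in enumerate(sample)]
    return compress


def toy_model(sample):
    """Stand-in model: the mean of the features clipped to [0, 1], read as
    the probability of belonging to a label category."""
    return min(1.0, max(0.0, sum(sample) / len(sample)))


def is_adversarial(sample, model, compressors, preset_diff=0.2):
    """Flag the sample if the largest gap between the original prediction
    and any compressed-input prediction exceeds the preset difference."""
    original = model(sample)
    diffs = [abs(original - model(compress(sample)))
             for compress in compressors]
    return max(diffs) > preset_diff, diffs


# Two compressors acting on different features, as in the description.
compressors = [squeeze_features({0, 1}), squeeze_features({2, 3})]
suspicious = [2.0, 1.0, 0.1, 0.1] + [0.5] * 6   # hypothetical 10-feature input
benign = [0.5] * 10
flag_s, diffs_s = is_adversarial(suspicious, toy_model, compressors)
flag_b, diffs_b = is_adversarial(benign, toy_model, compressors)
print(flag_s, flag_b)
```

The idea the sketch tries to capture is that a prediction which swings sharply when a few features are compressed is treated as suspicious, while a stable prediction is not.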

Further, after the adversarial data in the prediction data set is determined, a pollution assessment needs to be performed on the unstructured data in the prediction data set. Based on this, the method includes: separately counting a third data volume corresponding to the adversarial data and a fourth data volume corresponding to the unstructured data, and calculating a second data ratio between the adversarial data and the unstructured data from the third data volume and the fourth data volume; if the second data ratio is greater than a preset adversarial data ratio, it is determined that the unstructured data does not meet the data pollution evaluation standard. The preset adversarial data ratio can be set according to actual business requirements.

For example, suppose the preset adversarial data ratio is 10%. After the adversarial data in the prediction data set is determined, the total amount of adversarial data (the third data volume) is counted as 300, and the total amount of unstructured data in the prediction data set (the fourth data volume) is 1000. The second data ratio between the third data volume and the fourth data volume is therefore 300/1000=30%. Because the second data ratio of 30% is greater than the preset adversarial data ratio of 10%, it is determined that the prediction data set does not meet the data pollution evaluation standard, that is, the prediction data set has a quality problem and cannot be used for model prediction.

203. Count the attribute characteristics of the data to be evaluated under the data bias evaluation dimension, and perform quality evaluation on the data to be evaluated based on the attribute characteristics under the data bias evaluation dimension, to obtain the quality assessment result of the data to be evaluated under the data bias evaluation dimension.

For the embodiment of the present invention, if the data to be evaluated is structured data in the training data set, then during the bias evaluation it is necessary to detect whether each feature of the data to be evaluated is a bias feature, because bias features may lead to discriminatory decisions by artificial intelligence. Regarding the specific process of determining bias features, as an optional implementation, step 203 specifically includes: determining each feature corresponding to the structured data; using a preset bias corpus to preliminarily detect the bias features among these features, and excluding the bias features and their corresponding structured data from the features and from the training data set, respectively, to obtain the remaining features and the remaining structured data after exclusion; and analyzing, according to the structured data after exclusion, whether each remaining feature is a bias feature. The preset bias corpus stores a large number of bias features, such as gender, age, region and income.

For example, the features corresponding to the structured data in the training data set include education level, work, income, medical history and gender. These features are matched against the features in the preset bias corpus. The matching shows that, among the features corresponding to the structured data, the income feature and the gender feature are bias features. To improve the accuracy of bias feature detection, it is necessary to further analyze whether the other features are bias features: the structured data corresponding to the bias features is excluded from the structured data, and the structured data after exclusion is then used to analyze whether the remaining features are bias features.
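The corpus matching and exclusion step can be sketched with plain set membership. The corpus contents come from the examples in the text; the function names, the case-insensitive comparison, and the reading of "excluding the corresponding structured data" as dropping the matched columns from each record are assumptions made for illustration.

```python
PRESET_BIAS_CORPUS = {"gender", "age", "region", "income"}


def split_bias_features(features, corpus=PRESET_BIAS_CORPUS):
    """Split feature names into those found in the bias corpus and the rest."""
    biased = [f for f in features if f.lower() in corpus]
    remaining = [f for f in features if f.lower() not in corpus]
    return biased, remaining


def exclude_features(records, biased):
    """Assumed form of exclusion: drop the biased columns from every record."""
    return [{k: v for k, v in record.items() if k not in biased}
            for record in records]


features = ["education level", "work", "income", "medical history", "gender"]
biased, remaining = split_bias_features(features)

# One hypothetical record, only to show the exclusion step.
records = [{"education level": "bachelor's degree", "work": "employed",
            "income": 5000, "medical history": "none", "gender": "F"}]
cleaned = exclude_features(records, biased)
print(biased, remaining)
```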

Further, regarding the specific process of analyzing whether the remaining features are bias features, as an optional implementation, the method includes: if the structured data after exclusion has corresponding label categories, combining each remaining feature with each label category to obtain multiple combination results; determining the feature values of each remaining feature, and analyzing, according to the multiple combination results, first data volume distribution information of each feature value under the different label categories; and determining, based on the first data volume distribution information, whether each remaining feature is a bias feature.

For example, the label categories of the structured data after exclusion include "yes" and "no", and the remaining features include "education level" and "work". Combining these features with the label categories gives "education level-yes", "education level-no", "work-yes" and "work-no". It is then determined that the feature values of education level include above bachelor's degree, bachelor's degree and below bachelor's degree, and the feature values of work include employed and unemployed. Next, for the structured data whose label category is "yes", the data volumes for the education levels above bachelor's degree, bachelor's degree and below bachelor's degree are counted. For example, the counted data volumes are 1000, 200 and 800 people respectively, and the total amount of structured data with the label category "yes" is 2000 people. The proportions of structured data with education levels above bachelor's degree, bachelor's degree and below bachelor's degree are therefore 50%, 10% and 40%. Because the difference between the proportions for above bachelor's degree and bachelor's degree is 40%, which exceeds the preset proportion difference of 20%, the education level feature is determined to be a bias feature. In this way, it can be determined whether each remaining feature of structured data that has label categories is a bias feature.
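The distribution analysis for one feature under one label category can be sketched as follows. The function names and record layout are hypothetical, and the decision rule (largest share minus smallest share compared with the preset difference) is an assumed reading of the example, which reports a 40% gap against a 20% limit.

```python
from collections import Counter


def value_proportions(records, feature, label_key, label):
    """Share of each value of `feature` among the records with `label`."""
    values = [r[feature] for r in records if r[label_key] == label]
    if not values:
        raise ValueError("no records carry the requested label")
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def is_bias_feature(proportions, preset_gap=0.20):
    """Assumed rule: the feature is biased when the gap between its most
    and least common values exceeds the preset proportion difference."""
    gap = max(proportions.values()) - min(proportions.values())
    return gap > preset_gap


# Counts from the example: 1000 / 200 / 800 records labelled "yes".
records = (
    [{"education level": "above bachelor", "label": "yes"}] * 1000
    + [{"education level": "bachelor", "label": "yes"}] * 200
    + [{"education level": "below bachelor", "label": "yes"}] * 800
)
proportions = value_proportions(records, "education level", "label", "yes")
biased = is_bias_feature(proportions)
print(proportions, biased)
```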

Further, regarding the specific process of analyzing whether the remaining features are biased features, as an optional implementation, the method further includes: if the structured data after exclusion has no corresponding label category, clustering the structured data after exclusion with a preset clustering algorithm to obtain structured data under different classifications; determining the feature values corresponding to each feature after the exclusion, and analyzing second data volume distribution information corresponding to each feature value under the different classifications; and determining, based on the second data volume distribution information, whether each feature after the exclusion is a biased feature. The preset clustering algorithm may specifically be the DBSCAN clustering algorithm.

Specifically, the structured data after exclusion may have no corresponding label categories. In that case, label categories and features cannot be combined to analyze the first data volume distribution of each feature value under different label categories. When the structured data has no label categories, the structured data after exclusion can instead be clustered, and the second data volume distribution of each feature value under the resulting classifications can then be analyzed. When the DBSCAN clustering algorithm is used to cluster the structured data after exclusion, a neighborhood radius and a threshold for the amount of structured data within a neighborhood are set first. A structured data item A is then selected arbitrarily, and the distance from each other structured data item to A is calculated. From the calculated distances, the structured data items B, C and D contained in the neighborhood of A are determined. If the amount of structured data contained in the neighborhood of A is greater than the threshold, A is determined to be a core point and cluster C1 is built with A as the core point. All points that are density-reachable from A are then found; all structured data in the neighborhood of A are density-reachable from A, and they all belong to C1.

In addition, the structured data contained in the neighborhood of structured data B is determined, for example structured data E. Because B is density-reachable from A and E is density-reachable from B, E is density-reachable from A, so E also belongs to C1. In this way, all the structured data in cluster C1 can be found. The search then continues over the remaining structured data after exclusion, and cluster C2 can be obtained in the same way. Once the data has been divided into multiple clusters, the clustering is complete, that is, the structured data after exclusion has been divided into multiple classifications. The second data volume distribution of each feature value under the different classifications is then analyzed; the analysis is the same as for the first data volume distribution and is not repeated here.
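
The clustering procedure described above can be sketched as a small, dependency-free DBSCAN. This is an illustrative sketch under stated assumptions: records are numeric tuples, distance is Euclidean, and a point counts as a core point when its neighborhood (itself included) holds at least `min_pts` points, which is the common textbook convention rather than something the description fixes. The names `region_query`, `dbscan`, `eps` and `min_pts` are hypothetical.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is also a core point: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
cluster_ids = dbscan(points, eps=1.5, min_pts=3)
```

Here the first four points form one cluster, the next four form a second, and the isolated point is left as noise. The per-cluster feature value distribution can then be analyzed in the same way as the per-label distribution.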

204. Count the attribute characteristics of the data to be evaluated under the data scale evaluation dimension, and perform quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation result of the data to be evaluated under the data scale evaluation dimension.

The attribute characteristics of the data to be evaluated under the data scale evaluation dimension include the total amount of data, the number of features, the memory size occupied by the data, whether the data has labels, and so on. In the embodiment of the present invention, evaluating the scale of the data to be evaluated requires counting the total amount of data, the number of features, the occupied memory size, and whether the data has labels. If labels are missing, the amount of data with missing labels is also counted. For example, the total amount of data to be evaluated is 300, the occupied memory size is 11.3KB, the number of features is 13, and the data has labels.

Further, after the total amount, the number of features, the occupied memory size, and the presence of labels have been counted for the data to be evaluated, the scale of the data to be evaluated is assessed. It should be noted that the data scale evaluation standard differs between model algorithms. For example, for a translation model, if the total amount of data to be evaluated is less than 10 million, it is determined that the data to be evaluated does not meet the data scale evaluation standard; that is, the data set cannot be used directly for training or prediction, and the amount of data needs to be increased.
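
A minimal sketch of the data scale statistics and check follows. The record layout (a list of dictionaries with a `label` key), the function names, and the use of `sys.getsizeof` as a rough memory estimate are all assumptions made for illustration; the description does not specify how memory size is measured.

```python
import sys

def scale_statistics(rows, label_key="label"):
    """Collect the data scale attributes named in the description."""
    feature_names = {k for row in rows for k in row if k != label_key}
    missing = sum(1 for row in rows if row.get(label_key) is None)
    return {
        "total": len(rows),
        "num_features": len(feature_names),
        # Shallow size of each record only: a rough estimate, not exact memory use.
        "approx_bytes": sum(sys.getsizeof(row) for row in rows),
        "has_labels": missing < len(rows),
        "missing_labels": missing,
    }

def meets_scale_standard(stats, min_total):
    """The standard is model-specific; here it is only a minimum record count."""
    return stats["total"] >= min_total

# 300 labeled records with 13 features each, as in the example above.
rows = [{**{f"f{i}": 0.0 for i in range(13)}, "label": "yes"} for _ in range(300)]
stats = scale_statistics(rows)
ok_for_translation = meets_scale_standard(stats, min_total=10_000_000)
```

With a 10 million record minimum, as in the translation model example, a 300-record data set does not meet the standard.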

205. Count the attribute characteristics of the data to be evaluated under the data balance evaluation dimension, and perform quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation result of the data to be evaluated under the data balance evaluation dimension.

In the embodiment of the present invention, the balance of the data set is an important factor affecting the performance of an artificial intelligence algorithm. The more balanced the data set, the smaller the distribution deviation of the data to be evaluated and the better the algorithm performs; conversely, the larger the distribution deviation of the data to be evaluated, the worse the algorithm performs. For example, the label categories of the data to be evaluated include "yes" and "no". The amount of data with the label category "yes" and the amount with the label category "no" are counted, and the proportion of each label category is calculated. Suppose the "yes" label accounts for 90% of the data and the "no" label for 10%, so that the difference between the two proportions is 80%. Because this difference is greater than the preset proportion difference of 60%, it is determined that the data to be evaluated does not meet the data balance evaluation standard.
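
The balance check above reduces to a short calculation, sketched here. The function names and the "largest share minus smallest share" formulation are assumptions; for two label categories this is exactly the difference used in the example.

```python
from collections import Counter

def label_shares(labels):
    """Proportion of the data carried by each label category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def meets_balance_standard(labels, max_gap=0.60):
    """Fail when the gap between the largest and smallest share exceeds max_gap."""
    shares = label_shares(labels)
    gap = max(shares.values()) - min(shares.values())
    return gap <= max_gap

labels = ["yes"] * 90 + ["no"] * 10
shares = label_shares(labels)
balanced = meets_balance_standard(labels)
```

For the 90% / 10% split the gap is 80%, which exceeds the 60% preset, so the check fails.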

206. Count the attribute characteristics of the data to be evaluated under the data accuracy evaluation dimension, and perform quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation result of the data to be evaluated under the data accuracy evaluation dimension.

The attribute characteristics of the data to be evaluated under the data accuracy evaluation dimension include the total amount of data, the total amount of data with missing labels, and whether any label is abnormal. For example, the total amount of data with missing labels is counted, and the ratio of that amount to the total amount of data to be evaluated is calculated. If the ratio is greater than the preset ratio, it is determined that the data to be evaluated does not meet the data accuracy evaluation standard. As another example, the amounts of data under the labels "yes" and "no" are counted. If the amount of data under a label is less than the preset data volume, the label is determined to be abnormal, so abnormal labels exist in the data to be evaluated and the data does not meet the data accuracy evaluation standard.
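
The two accuracy checks can be sketched together as follows. The function name, the use of `None` to represent a missing label, and the example thresholds are assumptions made for illustration.

```python
from collections import Counter

def accuracy_check(labels, max_missing_ratio, min_label_count):
    """Check the missing-label ratio and look for labels with too few records."""
    total = len(labels)
    missing = sum(1 for label in labels if label is None)
    counts = Counter(label for label in labels if label is not None)
    abnormal = sorted(label for label, n in counts.items() if n < min_label_count)
    missing_ratio = missing / total if total else 0.0
    return {
        "missing_ratio": missing_ratio,
        "abnormal_labels": abnormal,
        "meets_standard": missing_ratio <= max_missing_ratio and not abnormal,
    }

# 95 "yes", 3 "no", 2 records with no label (hypothetical data).
labels = ["yes"] * 95 + ["no"] * 3 + [None] * 2
result = accuracy_check(labels, max_missing_ratio=0.05, min_label_count=5)
```

In this data the missing-label ratio of 2% is within the assumed 5% limit, but the "no" label has only 3 records, so it is reported as abnormal and the standard is not met.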

After the data to be evaluated has been evaluated in multiple dimensions, the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions are obtained, and a quality evaluation report corresponding to the data to be evaluated is then generated for reference by technical personnel. It should be noted that the execution order of steps 202-206 above is not limited to the order shown in the figure; the steps may also be executed in parallel, which is not limited by the present invention.
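
Because the per-dimension evaluations are independent, they can be run in parallel and the results merged into a report. The sketch below assumes each evaluator is a function that takes the data and returns a pass/fail flag; the evaluator bodies, the function names and the report layout are placeholders, not part of the described method.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(data, evaluators):
    """Run every dimension's evaluator on the same data and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, data) for name, fn in evaluators.items()}
        return {name: future.result() for name, future in futures.items()}

def format_report(results):
    """Render one line per evaluation dimension."""
    lines = ["Data set quality evaluation report"]
    for name, passed in results.items():
        verdict = "meets standard" if passed else "does not meet standard"
        lines.append(f"- {name}: {verdict}")
    return "\n".join(lines)

# Placeholder evaluators standing in for the per-dimension steps.
evaluators = {
    "data scale": lambda rows: len(rows) >= 100,
    "data balance": lambda rows: True,
}
results = evaluate_all(list(range(300)), evaluators)
report = format_report(results)
```

A thread pool is used here only for brevity; running the evaluators sequentially would produce the same report.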

Compared with the current practice in which technicians assess data set quality based on their own experience, the other data set quality assessment method provided by this embodiment of the present invention obtains the data to be evaluated in the data set; separately counts the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performs quality evaluation on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality evaluation results of the data to be evaluated under each of the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple evaluation dimensions, which improves the accuracy and efficiency of data set quality evaluation and effectively ensures the security of data sets during artificial intelligence development.

Further, as a specific implementation of FIG. 1 , an embodiment of the present invention provides a data set quality assessment device. As shown in FIG. 3 , the device includes: an acquisition unit 31 , a statistical unit 32 and an evaluation unit 33 .

The acquisition unit 31 may be used to obtain the data to be evaluated in the data set.

The statistical unit 32 may be used to separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions.

The evaluation unit 33 may be configured to perform quality evaluation on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, and obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions respectively.

In a specific application scenario, the statistical unit 32 can specifically be used to separately count the attribute characteristics of the data to be evaluated under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension.

The evaluation unit 33 can be specifically configured to perform quality evaluation on the data to be evaluated based on the attribute characteristics under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension, and obtain the quality evaluation results of the data to be evaluated under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension.

Further, in order to count the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension, as shown in FIG. 4, the statistical unit 32 includes: a fitting module 321, a prediction module 322 and a determination module 323.

The fitting module 321 can be configured to use a preset interpolation algorithm to fit a function curve corresponding to the structured data according to the structured data and its corresponding label category.

The prediction module 322 may be configured to use the function curve to predict the structured data to obtain the predicted label category corresponding to the structured data.

The determination module 323 may be configured to determine whether the structured data is noise data based on the predicted label category and the label category corresponding to the structured data.

Based on this, the evaluation unit 33 includes: a first calculation module 331 and a first determination module 332 .

The first calculation module 331 may be configured to separately count the first data volume corresponding to the noise data and the second data volume corresponding to the structured data, and calculate, according to the first data volume and the second data volume, a first data ratio between the noise data and the structured data.

The first determining module 332 may be configured to determine that the structured data does not satisfy the data pollution evaluation standard when the first data ratio is greater than a preset noise data ratio.
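
The noise ratio check performed by these modules can be sketched as below. The predicted labels are taken as given (they would come from the fitted function curve), and the rule that a record is noise when its predicted label disagrees with its recorded label is an assumption; the function names and the 10% preset are hypothetical.

```python
def noise_flags(actual_labels, predicted_labels):
    """Assumed rule: a record is noise when the predicted label category
    differs from the label category recorded for it."""
    return [a != p for a, p in zip(actual_labels, predicted_labels)]

def pollution_check(flags, max_noise_ratio):
    """Return the first data ratio and whether the pollution standard is met."""
    first_volume = sum(flags)      # amount of noise data
    second_volume = len(flags)     # amount of structured data
    ratio = first_volume / second_volume if second_volume else 0.0
    return ratio, ratio <= max_noise_ratio

actual = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no"]   # hypothetical curve predictions
flags = noise_flags(actual, predicted)
ratio, meets_standard = pollution_check(flags, max_noise_ratio=0.10)
```

One record in five disagrees with its prediction, giving a ratio of 0.2, which exceeds the assumed 10% preset.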

Further, in order to count the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension, the statistical unit 32 further includes: a compression module 324 and a second calculation module 325.

The compression module 324 may be configured to compress the unstructured data by using a first preset compressor and a second preset compressor respectively, to obtain first compressed data and second compressed data corresponding to the unstructured data.

The prediction module 322 may be configured to respectively predict the unstructured data, the first compressed data, and the second compressed data to obtain an original prediction result corresponding to the unstructured data, a first prediction result corresponding to the first compressed data, and a second prediction result corresponding to the second compressed data.

The second calculation module 325 may be configured to calculate a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result, respectively.

The determination module 323 may be configured to determine whether the unstructured data is adversarial data based on the first difference and the second difference.

The first calculation module 331 can also be used to separately count the third data volume corresponding to the adversarial data and the fourth data volume corresponding to the unstructured data, and calculate, according to the third data volume and the fourth data volume, a second data ratio between the adversarial data and the unstructured data.

The first determining module 332 may also be configured to determine that the unstructured data does not meet the data pollution evaluation standard if the second data ratio is greater than a preset adversarial data ratio.
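
The two-compressor comparison can be sketched as follows. Everything concrete here is an assumption for illustration: the description does not say which compressors or which decision rule are used, so the sketch uses value quantization and neighbor averaging as stand-in compressors, a toy model that reacts to rapid value changes, an L1 distance between prediction vectors, and a rule that flags a sample when either difference exceeds a threshold.

```python
def l1_difference(p, q):
    """Sum of absolute differences between two prediction vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def is_adversarial(predict, sample, compress_1, compress_2, threshold):
    """Assumed rule: flag the sample when either compressed copy shifts the
    prediction by more than `threshold`."""
    original = predict(sample)
    first_difference = l1_difference(original, predict(compress_1(sample)))
    second_difference = l1_difference(original, predict(compress_2(sample)))
    return max(first_difference, second_difference) > threshold

# Hypothetical stand-ins, for illustration only.
def quantize(sample):       # first compressor: coarse value quantization
    return [round(x * 4) / 4 for x in sample]

def smooth(sample):         # second compressor: average with the next value
    last = len(sample) - 1
    return [(sample[i] + sample[min(i + 1, last)]) / 2 for i in range(len(sample))]

def toy_predict(sample):    # toy model that reacts to rapid value changes
    score = min(1.0, sum(abs(b - a) for a, b in zip(sample, sample[1:])))
    return [score, 1.0 - score]

clean = [0.5] * 8
perturbed = [0.5, 0.58] * 4
clean_flag = is_adversarial(toy_predict, clean, quantize, smooth, threshold=0.5)
perturbed_flag = is_adversarial(toy_predict, perturbed, quantize, smooth, threshold=0.5)
```

The flat sample predicts the same before and after compression, while the perturbed sample loses its small oscillation under compression and its prediction shifts sharply. The resulting flags can then be counted to form the second data ratio.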

In a specific application scenario, in order to count the attribute characteristics of the data to be evaluated under the data bias evaluation dimension, the statistical unit 32 further includes: a second determination module 326, an exclusion module 327 and an analysis module 328.

The second determination module 326 may be used to determine each feature corresponding to the structured data.

The exclusion module 327 can be used to preliminarily detect the biased features among the features by using a preset bias corpus, and exclude the biased features and their corresponding structured data from the features and the training data set, to obtain the features after exclusion and the structured data after exclusion.

The analysis module 328 can be configured to analyze whether each feature after exclusion is a bias feature according to the structured data after exclusion.
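
The preliminary exclusion step can be sketched as below. The description does not say how features are matched against the bias corpus, so matching on the feature name (case-insensitively) is an assumption, as are the function name, the record layout, and the sample corpus.

```python
def exclude_biased_features(rows, bias_corpus, label_key="label"):
    """Drop features whose names appear in the preset bias corpus (assumed match rule).

    Returns the detected biased features, the features left after exclusion,
    and the records with the biased columns removed.
    """
    corpus = {term.lower() for term in bias_corpus}
    features = {name for row in rows for name in row if name != label_key}
    biased = {name for name in features if name.lower() in corpus}
    remaining_rows = [
        {k: v for k, v in row.items() if k not in biased} for row in rows
    ]
    return sorted(biased), sorted(features - biased), remaining_rows

rows = [
    {"gender": "f", "education": "undergraduate", "job": "employed", "label": "yes"},
    {"gender": "m", "education": "below undergraduate", "job": "unemployed", "label": "no"},
]
biased, remaining, cleaned = exclude_biased_features(rows, {"gender", "race"})
```

The remaining features and cleaned records are what the later distribution analysis would operate on.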

In a specific application scenario, in order to analyze whether each feature after the exclusion is a biased feature, the analysis module 328 can be used to: if the structured data after exclusion has corresponding label categories, combine each feature after the exclusion with each label category to obtain multiple combination results; determine the feature values corresponding to each feature after the exclusion, and analyze, according to the multiple combination results, the first data volume distribution information corresponding to each feature value under the different label categories; and determine, based on the first data volume distribution information, whether each feature after the exclusion is a biased feature.

In a specific application scenario, in order to analyze whether each feature after the exclusion is a biased feature, the analysis module 328 can also be used to: if the structured data after exclusion has no corresponding label category, cluster the structured data after exclusion with a preset clustering algorithm to obtain structured data under different classifications; determine the feature values corresponding to each feature after the exclusion, and analyze the second data volume distribution information corresponding to each feature value under the different classifications; and determine, based on the second data volume distribution information, whether each feature after the exclusion is a biased feature.

It should be noted that, for other corresponding descriptions of the functional modules involved in the data set quality assessment apparatus provided by the embodiment of the present invention, reference may be made to the corresponding description of the method shown in FIG. 1 , which will not be repeated here.

Based on the method shown in FIG. 1 above, an embodiment of the present invention correspondingly also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the following steps are implemented: obtaining the data to be evaluated in the data set; separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions respectively.

Based on the above embodiments of the method shown in FIG. 1 and the device shown in FIG. 3, an embodiment of the present invention also provides a physical structure diagram of a computer device. As shown in FIG. 5, the computer device includes: a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor 41, wherein the memory 42 and the processor 41 are both arranged on a bus 43. When the processor 41 executes the program, the following steps are implemented: separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality assessment results of the data to be evaluated under the multiple evaluation dimensions respectively.

Through the technical solution of the present invention, the data to be evaluated in the data set can be obtained; the attribute characteristics of the data to be evaluated under multiple evaluation dimensions are counted separately; and quality assessment is performed on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple evaluation dimensions, thereby improving the accuracy and efficiency of data set quality evaluation and effectively ensuring the security of the data set during artificial intelligence development.

Obviously, those skilled in the art should understand that each module or step of the present invention described above can be realized by a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases the steps shown or described may be carried out in an order different from that given here. They may also be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. As such, the present invention is not limited to any specific combination of hardware and software.

The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims

A data set quality assessment method, characterized by comprising:

obtaining data to be evaluated in a data set;

separately counting attribute characteristics of the data to be evaluated under multiple evaluation dimensions;

performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under the multiple evaluation dimensions respectively.
The method according to claim 1, wherein the separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions includes:

separately counting the attribute characteristics of the data to be evaluated under a data scale evaluation dimension, a data balance evaluation dimension, a data accuracy evaluation dimension, a data pollution evaluation dimension, and a data bias evaluation dimension;

the performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under the multiple evaluation dimensions respectively, includes:

performing quality assessment on the data to be evaluated based on the attribute characteristics under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension and the data bias evaluation dimension, to obtain the quality assessment results of the data to be evaluated under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension and the data bias evaluation dimension.
The method according to claim 2, wherein the data to be evaluated is structured data in a training data set, and the statistics of the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension include:

fitting a function curve corresponding to the structured data by using a preset interpolation algorithm according to the structured data and its corresponding label category;

Predicting the structured data by using the function curve to obtain a predicted label category corresponding to the structured data;

Based on the predicted label category and the label category corresponding to the structured data, determine whether the structured data is noise data;

Based on the attribute characteristics under the data pollution assessment dimension, perform quality assessment on the data to be assessed, and obtain the quality assessment results of the data to be assessed under the data pollution assessment dimension, including:

separately counting the first data volume corresponding to the noise data and the second data volume corresponding to the structured data, and calculating, according to the first data volume and the second data volume, a first data ratio between the noise data and the structured data;

If the first data ratio is greater than the preset noise data ratio, it is determined that the structured data does not meet the data pollution evaluation standard.
The method according to claim 2, wherein the data to be evaluated is unstructured data in a prediction data set, and the statistics of the attribute characteristics of the data to be evaluated under the dimension of data pollution evaluation include:

performing compression processing on the unstructured data by using a first preset compressor and a second preset compressor respectively, to obtain first compressed data and second compressed data corresponding to the unstructured data;

Predicting the unstructured data, the first compressed data, and the second compressed data respectively, to obtain an original prediction result corresponding to the unstructured data, and a first prediction result corresponding to the first compressed data , and a second prediction result corresponding to the second compressed data;

calculating a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result, respectively;

determining whether the unstructured data is adversarial data based on the first difference and the second difference;

Based on the attribute characteristics under the data pollution assessment dimension, the quality assessment is performed on the data to be assessed, and the quality assessment result of the data to be assessed under the data pollution assessment dimension is obtained, including:

separately counting the third data volume corresponding to the adversarial data and the fourth data volume corresponding to the unstructured data, and calculating, according to the third data volume and the fourth data volume, a second data ratio between the adversarial data and the unstructured data;

if the second data ratio is greater than the preset adversarial data ratio, determining that the unstructured data does not meet the data pollution evaluation standard.
The method according to claim 2, wherein the data to be evaluated is structured data in a training data set, and the attribute characteristics of the data to be evaluated under the data bias evaluation dimension are counted, including:

determining each feature corresponding to the structured data;

preliminarily detecting the biased features among the features by using a preset bias corpus, and excluding the biased features and their corresponding structured data from the features and the training data set, to obtain the features after exclusion and the structured data after exclusion;

According to the excluded structured data, it is analyzed whether each feature after the exclusion is a bias feature.
The method according to claim 5, wherein, according to the excluded structured data, analyzing whether each feature after the exclusion is a bias feature comprises:

If there is a corresponding label category in the excluded structured data, combining each feature after the exclusion with each label category to obtain multiple combination results;

determining the eigenvalues corresponding to the excluded features, and analyzing the first data volume distribution information corresponding to the eigenvalues under different label classifications according to the plurality of combination results;

Based on the first data volume distribution information, it is determined whether each feature after the exclusion is a bias feature.
The method according to claim 5, wherein, according to the excluded structured data, analyzing whether each feature after the exclusion is a bias feature comprises:

If there is no corresponding label category for the excluded structured data, cluster processing is performed on the excluded structured data using a preset clustering algorithm to obtain structured data under different classifications;

Determining the eigenvalues corresponding to the excluded features, and analyzing the second data volume distribution information corresponding to the eigenvalues under different classifications;

Based on the second data volume distribution information, it is determined whether each feature after the exclusion is a bias feature.
A data set quality assessment device, characterized by comprising:

an acquisition unit, configured to acquire the data to be evaluated in the data set;

a statistical unit, configured to separately count attribute characteristics of the data to be evaluated under multiple evaluation dimensions;

an evaluation unit, configured to perform quality evaluation on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, and obtain quality evaluation results of the data to be evaluated under the multiple evaluation dimensions.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, characterized in that, when the computer program is executed by the processor, the steps of the method according to any one of claims 1 to 7 are implemented.
A computer-readable storage medium, on which a computer program is stored, wherein, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are realized.