WO2023029065A1 - Method and apparatus for evaluating data set quality, computer device, and storage medium - Google Patents

Method and apparatus for evaluating data set quality, computer device, and storage medium

Info

Publication number
WO2023029065A1
WO2023029065A1 PCT/CN2021/117109 CN2021117109W WO2023029065A1 WO 2023029065 A1 WO2023029065 A1 WO 2023029065A1 CN 2021117109 W CN2021117109 W CN 2021117109W WO 2023029065 A1 WO2023029065 A1 WO 2023029065A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
evaluation
evaluated
structured
dimension
Prior art date
Application number
PCT/CN2021/117109
Other languages
French (fr)
Chinese (zh)
Inventor
马影
周晓勇
魏国富
刘胜
夏玉明
Original Assignee
上海观安信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海观安信息技术股份有限公司 filed Critical 上海观安信息技术股份有限公司
Publication of WO2023029065A1 publication Critical patent/WO2023029065A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • The present invention relates to the field of information technology, and in particular to a data set quality evaluation method, apparatus, computer device, and storage medium.
  • Data is the basis for the development and application of artificial intelligence, and data sets are crucial to artificial intelligence algorithms. Data sets of different quality yield different model parameters after training and therefore different execution effects, which in turn affects the security of artificial intelligence algorithms. If attackers maliciously modify or add to a data set, the model will make prediction errors. How to effectively detect and evaluate the quality of data sets has therefore become an urgent problem in artificial intelligence security.
  • The invention provides a data set quality evaluation method, apparatus, computer device, and storage medium, mainly aimed at improving the accuracy and efficiency of data set quality evaluation.
  • a data set quality assessment method including:
  • a data set quality assessment device including:
  • an acquisition unit configured to acquire the data to be evaluated in the data set
  • a statistical unit configured to separately count attribute characteristics of the data to be evaluated under multiple evaluation dimensions
  • the evaluation unit is configured to perform quality evaluation on the data to be evaluated based on attribute characteristics under the multiple evaluation dimensions, and obtain quality evaluation results of the data to be evaluated under the multiple evaluation dimensions.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:
  • a computer device including a memory, a processor, and a computer program stored on the memory and operable on the processor, and the processor implements the following steps when executing the program:
  • Compared with the current approach, in which technicians evaluate data set quality based on their own experience, the data set quality evaluation method, apparatus, computer device, and storage medium provided by the present invention obtain the data to be evaluated in the data set, separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, and, based on those attribute characteristics, perform quality evaluation on the data to be evaluated to obtain its quality evaluation results under the multiple evaluation dimensions. By counting attribute characteristics under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple dimensions, which improves the accuracy and efficiency of data set quality evaluation and effectively safeguards data sets during artificial intelligence development.
  • FIG. 1 shows a flowchart of a data set quality assessment method provided by an embodiment of the present invention.
  • FIG. 2 shows a flowchart of another data set quality assessment method provided by an embodiment of the present invention.
  • FIG. 3 shows a schematic structural diagram of a data set quality assessment device provided by an embodiment of the present invention.
  • FIG. 4 shows a schematic structural diagram of another data set quality assessment device provided by an embodiment of the present invention.
  • FIG. 5 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of the present invention.
  • an embodiment of the present invention provides a data set quality assessment method, as shown in Figure 1, the method includes:
  • the data set includes a training data set and a prediction data set
  • the data to be evaluated may specifically be each sample data in the training data set, or each prediction data in the prediction data set.
  • The embodiment of the present invention develops a data set quality evaluation tool that can automatically evaluate the quality of a data set from multiple evaluation dimensions, thereby improving the accuracy and efficiency of data set quality evaluation while ensuring the safety of the data set during artificial intelligence development.
  • the embodiments of the present invention are mainly applied to scenarios where the quality of a data set is evaluated in multiple dimensions.
  • The execution subject of the embodiment of the present invention is an apparatus or device capable of evaluating the quality of a data set, which may specifically be deployed on the server side.
  • the training data set and the prediction data set that need to be evaluated for quality are collected in advance, and the data in the training data set and the prediction data set may specifically be structured data or unstructured data, such as image data.
  • The technician can click the file upload button on the interface of the data set quality evaluation tool to upload the training data set or prediction data set to be evaluated, so that the tool performs a multi-dimensional quality assessment on it.
  • The multiple assessment dimensions include a data scale assessment dimension, a data balance assessment dimension, a data accuracy assessment dimension, a data pollution assessment dimension and a data bias assessment dimension. It should be noted that the assessment dimensions in the embodiments of the present invention are not limited to those listed above; other assessment dimensions can also be included and set according to actual business needs.
  • The attribute characteristics of the data to be evaluated under the data size evaluation dimension include the total amount of data, the number of features, the memory size occupied by the data, whether the data has labels, and so on. Under the data balance evaluation dimension they include the proportion of the data volume under each label. Under the data accuracy evaluation dimension they include the total amount of data, the total amount of data with missing labels, whether any label is abnormal, and so on. Under the data pollution evaluation dimension they include the amount of noise data, the amount of adversarial data, and so on. Under the data bias evaluation dimension they include the bias features corresponding to the data to be evaluated.
  • Different statistical methods can be used to separately count the attribute characteristics of the data to be evaluated under the data scale evaluation dimension, data balance evaluation dimension, data accuracy evaluation dimension, data pollution evaluation dimension and data bias evaluation dimension. The specific statistical method differs for each evaluation dimension; see steps 202-206 for details.
  • the evaluation standards corresponding to different evaluation dimensions are different.
  • If the data to be evaluated does not meet the evaluation standard corresponding to any evaluation dimension, it is determined that the data to be evaluated has quality problems. The data set then cannot be used to train the model or to make predictions, and technicians need to re-collect the data set or clean the data set that has quality problems.
  • the “Yes” label corresponds to 90% of the data volume, and the “No” label corresponds to 10%.
  • The quality of the data to be evaluated can also be assessed from the data scale, data accuracy, data pollution, and data bias evaluation dimensions. If the data to be evaluated does not meet the evaluation standards corresponding to these dimensions, it is determined that the data has quality problems and cannot be used to train the model or to make predictions. For the quality evaluation process of each dimension, see steps 202-206.
  • Compared with the current approach, in which technicians assess data set quality based on their own experience, the data set quality assessment method provided by the embodiment of the present invention obtains the data to be evaluated in the data set, separately counts the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, and, based on those attribute characteristics, evaluates the quality of the data to be evaluated to obtain its quality evaluation results under the multiple evaluation dimensions. By counting attribute characteristics under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple dimensions, which improves evaluation accuracy and efficiency and effectively safeguards data sets during artificial intelligence development.
  • An embodiment of the present invention provides another method for detecting contaminated sample data. As shown in FIG. 2, the method includes:
  • the data set includes a training data set and a prediction data set
  • the data to be evaluated may specifically be each sample data in the training data set, or each prediction data in the prediction data set.
  • the specific method of obtaining the data set is exactly the same as that of step 101, and will not be repeated here.
  • Step 202 specifically includes: according to the structured data and its corresponding label category, using a preset interpolation algorithm to fit the function curve corresponding to the structured data; using the function curve to predict the structured data to obtain the predicted label category corresponding to the structured data; and determining, based on the predicted label category and the label category corresponding to the structured data, whether the structured data is noise data.
  • the preset interpolation algorithm performs interpolation processing on the structured data.
  • the preset interpolation algorithm may specifically be a preset Kriging interpolation algorithm.
  • the classification result may specifically be a classification probability value corresponding to the structured data.
  • For example, the structured data with known classification results are x1, x2 and x3, and the structured data to be interpolated is x4. Since the classification result of x4 is unknown, it can be estimated from the known classification results of x1, x2 and x3. Specifically, the distance between x4 and each of x1, x2 and x3 is calculated. The larger the distance, the farther the structured data with a known classification result is from the structured data to be interpolated and the smaller its influence, so the smaller its data weight. Conversely, the smaller the distance, the closer it is to the structured data to be interpolated and the greater its influence, so the greater its data weight.
  • the data weights corresponding to the structured data with known classification results are multiplied by the classification results to obtain the classification results corresponding to the structured data to be interpolated. Then, the structured data to be interpolated after the classification result is determined is inserted into the structured data of the known classification result, thereby solving the problem of missing data.
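The distance-weighted estimate described above can be sketched as follows. This is only an illustration under stated assumptions: it uses plain inverse-distance weights as a simple stand-in, whereas Kriging (the algorithm the text names) derives its weights from a fitted variogram. The function name `idw_estimate`, the `power` parameter, and the sample points are all hypothetical.

```python
import math

def idw_estimate(known, target, power=2.0):
    """Estimate a classification probability for `target` from known samples.

    known: list of (feature_vector, classification_probability) pairs.
    Closer samples get larger weights; weights are normalized to sum to 1.
    """
    weighted = []
    for features, prob in known:
        dist = math.dist(features, target)
        if dist == 0.0:
            return prob  # target coincides with a known sample
        weighted.append((1.0 / dist ** power, prob))
    total = sum(w for w, _ in weighted)
    return sum(w * p for w, p in weighted) / total

# Hypothetical x1, x2, x3 with known classification probabilities.
known = [((0.0, 0.0), 0.9), ((1.0, 0.0), 0.8), ((5.0, 5.0), 0.1)]
x4_estimate = idw_estimate(known, (0.5, 0.0))
```

Because x4 sits midway between the first two samples and far from the third, its estimate lands between 0.8 and 0.9 and is barely influenced by the distant sample.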
  • the probability value of structured data A belonging to the real label category is 0.87
  • the probability value of belonging to the predicted label category is 0.27
  • The method includes: separately counting the first data volume corresponding to the noise data and the second data volume corresponding to the structured data, and calculating, according to the first data volume and the second data volume, a first data ratio between the noise data and the structured data; if the first data ratio is greater than the preset noise data ratio, it is determined that the structured data does not meet the data pollution evaluation standard.
  • the preset noise data ratio may be set according to actual service requirements.
  • the preset noise data ratio is set to be 10%
  • the counted total amount of noise data (the first data volume) is 200
  • the total amount of structured data in the training data set (the second data volume) is 1000
  • the data ratio of 20% is greater than the preset noise data ratio of 10%, so it is determined that the training data set does not meet the data pollution evaluation criteria, that is, the training data set has quality problems and cannot be used for model training.
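The ratio check in this example can be written as a short sketch. The function name and the default threshold argument are illustrative assumptions; the numbers are the ones given in the text.

```python
def pollution_check(noise_count, total_count, max_noise_ratio=0.10):
    """Return (noise ratio, True if the data fails the pollution standard)."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    ratio = noise_count / total_count
    return ratio, ratio > max_noise_ratio

# 200 noise samples out of 1000 structured samples, preset ratio 10%.
ratio, failed = pollution_check(200, 1000)
```

With these inputs the ratio is 0.2, which exceeds 0.10, so the training data set is reported as failing the data pollution evaluation standard.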
  • the data to be evaluated is unstructured data in the prediction data set
  • It is necessary to detect whether there is adversarial data in the data to be evaluated, that is, data maliciously crafted by an attacker, because once such adversarial data is mixed into the prediction data set, it directly affects the prediction result of the model.
  • Step 202 specifically includes: using the first preset compressor and the second preset compressor to compress the unstructured data to obtain first compressed data and second compressed data corresponding to the unstructured data; inputting the unstructured data, the first compressed data, and the second compressed data into the built model for prediction to obtain the original prediction result corresponding to the unstructured data, the first prediction result corresponding to the first compressed data, and the second prediction result corresponding to the second compressed data; calculating a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result; and determining, based on the first difference and the second difference, whether the unstructured data is adversarial data.
  • the first preset compressor and the second preset compressor can compress the features in the unstructured data to reduce the input of unnecessary features and reduce the dimension corresponding to the unstructured data.
  • The features compressed by the first preset compressor are different from those compressed by the second preset compressor. For example, if the input unstructured data includes 10 features, that is, the input dimension corresponding to the unstructured data is 10, the first preset compressor can compress the first feature and the second feature in the unstructured data, and the second preset compressor can compress the third feature and the fourth feature.
  • the number of compressors used in the embodiment of the present invention is not limited to two, and the number of compressors can be set according to actual service requirements and the number of features.
  • the original prediction result, the first prediction result and the second prediction result in the embodiment of the present invention are the probability values that the unstructured data belongs to the corresponding label category.
  • For example, the unstructured data A is input into the first preset compressor and the second preset compressor for feature compression, yielding the first compressed data and the second compressed data corresponding to A. The unstructured data A, the first compressed data and the second compressed data are then each input into the built model for prediction. Suppose the original prediction result corresponding to A is 0.78, the first prediction result corresponding to the first compressed data is 0.56, and the second prediction result corresponding to the second compressed data is 0.63. Subtracting the first prediction result from the original prediction result gives a first difference of 0.22, and subtracting the second prediction result from the original prediction result gives a second difference of 0.15. The larger of the two differences is then compared with the preset difference. If the maximum difference is greater than the preset difference, it is determined that the unstructured data A is adversarial data; if the maximum difference is less than the preset difference, it is determined that A is not adversarial data. For example, if the preset difference is set to 0.2, then since the maximum difference 0.22 is greater than 0.2, the unstructured data A is determined to be adversarial data. All adversarial data present in the prediction data set can be determined in this manner.
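The comparison in this worked example can be sketched as below. The function name is hypothetical, and taking the absolute value of each difference is an assumption made here so that the check does not depend on which prediction is larger; the text itself only subtracts the compressed-data predictions from the original prediction.

```python
def is_adversarial(original_pred, compressed_preds, preset_diff=0.2):
    """Flag a sample when compression shifts the prediction too much.

    original_pred: model probability for the uncompressed sample.
    compressed_preds: model probabilities for each compressed variant.
    Returns (maximum difference, True if flagged as adversarial).
    """
    diffs = [abs(original_pred - p) for p in compressed_preds]
    max_diff = max(diffs)
    return max_diff, max_diff > preset_diff

# Values from the example: original 0.78, compressed variants 0.56 and 0.63.
max_diff, flagged = is_adversarial(0.78, [0.56, 0.63])
```

The scheme works with any number of compressors, since only the largest difference is compared against the preset difference.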
  • The method includes: separately counting the third data volume corresponding to the adversarial data and the fourth data volume corresponding to the unstructured data, and calculating, according to the third data volume and the fourth data volume, a second data ratio between the adversarial data and the unstructured data; if the second data ratio is greater than the preset adversarial data ratio, it is determined that the unstructured data does not meet the data pollution evaluation standard.
  • The preset adversarial data ratio can be set according to actual business requirements.
  • Step 203 specifically includes: determining each feature corresponding to the structured data; using the preset bias corpus to preliminarily detect the bias features among those features, and excluding the bias features and their corresponding structured data from the features and the training data set respectively, to obtain the features and structured data remaining after exclusion (hereinafter the excluded features and the excluded structured data); and analyzing, according to the excluded structured data, whether each excluded feature is a bias feature.
  • The preset bias corpus stores a large number of bias features, such as gender, age, region, income, etc.
  • For example, the features corresponding to the structured data in the training data set include education level, work, income, medical history, and gender. These features are matched against the features in the preset bias corpus, and the matching shows that the income feature and the gender feature are bias features. To improve the detection accuracy of bias features, it is necessary to further analyze whether the other features are bias features: the structured data corresponding to the detected bias features is excluded from the structured data, and the excluded structured data is then used to analyze whether the remaining features are bias features.
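The preliminary corpus match can be illustrated with a small sketch. The corpus contents are the four examples named in the text; the function name, the case-insensitive exact-match rule, and the record layout are assumptions for illustration only.

```python
PRESET_BIAS_CORPUS = {"gender", "age", "region", "income"}

def split_bias_features(features, corpus=PRESET_BIAS_CORPUS):
    """Split feature names into (matched bias features, remaining features)."""
    flagged = [f for f in features if f.lower() in corpus]
    remaining = [f for f in features if f.lower() not in corpus]
    return flagged, remaining

def drop_features(records, to_drop):
    """Remove the flagged feature columns from every record."""
    return [{k: v for k, v in r.items() if k not in to_drop} for r in records]

features = ["education level", "work", "income", "medical history", "gender"]
flagged, remaining = split_bias_features(features)

records = [{"education level": "undergraduate", "income": 5000, "gender": "F"}]
excluded_records = drop_features(records, set(flagged))
```

The remaining features and the reduced records are what the second-stage distribution analysis would then operate on.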
  • The method includes: if the excluded structured data has corresponding label categories, combining each excluded feature with each label category to obtain multiple combination results; determining the feature values corresponding to each excluded feature, and analyzing, according to the multiple combination results, first data volume distribution information of each feature value under the different label categories; and determining, based on the first data volume distribution information, whether each excluded feature is a bias feature.
  • For example, the label categories of the excluded structured data include "yes" and "no", and the excluded features include "education level" and "work". Combining the features with the label categories gives "education level-yes", "education level-no", "work-yes" and "work-no". The feature values of education level are above undergraduate, undergraduate, and below undergraduate, and the feature values of work are working and not working. First, among the structured data whose label category is "yes", the data volume for each education level is counted: 1000, 200 and 800 people respectively, out of a total of 2000 people with the label category "yes". The proportions of structured data for above undergraduate, undergraduate, and below undergraduate are therefore 50%, 10% and 40%. Because the difference between the proportions for above undergraduate and undergraduate is 40%, which exceeds the preset proportion difference of 20%, the feature "education level" is determined to be a bias feature. In the same way, it can be determined whether the remaining features of labeled structured data are bias features.
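The proportion comparison in this example can be sketched as follows. Measuring the gap as the largest share minus the smallest share is an assumption made here; the text compares two of the three shares and does not state a general rule. The function name and threshold argument are likewise illustrative.

```python
def is_bias_feature(value_counts, preset_gap=0.20):
    """Check whether one feature's values are unevenly spread within a label.

    value_counts: mapping of feature value -> data volume for one label category.
    Returns (share per value, largest gap between shares, True if biased).
    """
    total = sum(value_counts.values())
    shares = {value: count / total for value, count in value_counts.items()}
    gap = max(shares.values()) - min(shares.values())
    return shares, gap, gap > preset_gap

# Education level counts among records labeled "yes" (2000 in total).
counts = {"above undergraduate": 1000, "undergraduate": 200, "below undergraduate": 800}
shares, gap, biased = is_bias_feature(counts)
```

Here the shares are 50%, 10% and 40%, the gap is 40%, and the feature is flagged because 40% exceeds the preset 20%.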
  • The method further includes: if the excluded structured data does not have corresponding label categories, using the preset clustering algorithm to cluster the excluded structured data to obtain structured data under different classifications; determining the feature values corresponding to each excluded feature, and analyzing second data volume distribution information of each feature value under the different classifications; and determining, based on the second data volume distribution information, whether each excluded feature is a bias feature.
  • the preset clustering algorithm may specifically be a DBSCAN clustering algorithm.
  • The excluded structured data may not have corresponding label categories. In that case, label categories and features cannot be combined to analyze the first data volume distribution of each feature value under different label categories. Instead, the excluded structured data can be clustered, and the second data volume distribution of each feature value under the different classifications can then be analyzed.
  • When the DBSCAN clustering algorithm is used to cluster the excluded structured data, a neighborhood radius and a threshold for the amount of structured data in a neighborhood are set first. A piece of structured data A is then selected arbitrarily, and the distance from every other piece of structured data to A is calculated to determine the structured data included in the neighborhood of A, for example B, C and D. If the amount of structured data in the neighborhood of A is greater than the threshold, A is determined to be a core point and cluster C1 is built around it. All points density-reachable from A are then found: if B is density-reachable from A and E is density-reachable from B, then E is density-reachable from A, so E also belongs to C1. In this way all the structured data in cluster C1 can be found. The search then continues over the remaining excluded structured data, yielding cluster C2 and so on, until the excluded structured data has been divided into multiple clusters, that is, multiple classifications. The second data volume distribution of each feature value under the different classifications is then analyzed in the same way as the first data volume distribution, which is not repeated here.
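The clustering procedure described above can be sketched in plain Python. This is a minimal, unoptimized illustration rather than the tool's implementation; it follows the common DBSCAN convention that a point is a core point when its neighborhood (including itself) holds at least `min_pts` points, which differs slightly from the "greater than the threshold" wording in the text. The parameter names and sample points are assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within the neighborhood radius of point i.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(reachable)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this toy input the first four points form one cluster, the next three form a second cluster, and the isolated last point is marked as noise. Each cluster id can then serve as the classification under which feature value distributions are counted.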
  • the attribute characteristics of the data to be evaluated under the data size evaluation dimension include the total amount, the number of features, the memory size occupied by the data, and whether the data has labels, etc.
  • the amount of data corresponding to the data to be evaluated is 300
  • the occupied memory size is 11.3KB
  • the number of features is 13, and the data has labels.
  • the scale evaluation of the data to be evaluated is required.
  • The standards for data size evaluation differ between models. For example, for a translation model, if the total amount of data to be evaluated is less than 10 million, it is determined that the data to be evaluated does not meet the data size evaluation standard; that is, the data set cannot be used directly for training or prediction, and the amount of data needs to be increased.
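A data scale check along these lines might look like the sketch below. The record layout (a list of dictionaries with a `label` key), the function names, and the way the minimum total is passed in are assumptions; memory size is omitted because measuring it depends on how the data is stored.

```python
def scale_stats(rows, label_key="label"):
    """Collect scale attributes: total rows, feature count, label presence."""
    feature_names = {k for row in rows for k in row if k != label_key}
    return {
        "total": len(rows),
        "num_features": len(feature_names),
        "has_labels": bool(rows) and all(label_key in row for row in rows),
    }

def meets_scale_standard(stats, min_total):
    """The standard is met when the total is not below the model's minimum."""
    return stats["total"] >= min_total

# 300 rows with 13 features and a label, checked against a 10 million minimum.
rows = [{**{f"f{i}": 0 for i in range(13)}, "label": "yes"} for _ in range(300)]
stats = scale_stats(rows)
ok = meets_scale_standard(stats, min_total=10_000_000)
```

With 300 rows the check fails for the translation-model threshold, matching the conclusion that more data is needed.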
  • the balance of the data set is an important factor affecting the effect of the artificial intelligence algorithm.
  • The more uniform the data set, the smaller the distribution deviation of the data to be evaluated and the better the artificial intelligence algorithm performs.
  • Conversely, the larger the distribution deviation of the data to be evaluated, the worse the artificial intelligence algorithm performs.
  • For example, if the label categories of the data to be evaluated include "yes" and "no", the data volume with label category "yes" and the data volume with label category "no" are counted separately, and the proportion of each label category is calculated. Suppose the "yes" label accounts for 90% of the data volume and the "no" label for 10%, so that the difference between the two proportions is 80%.
  • This difference is greater than the preset data volume proportion difference of 60%, so it is determined that the data to be evaluated does not meet the data balance evaluation standard.
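Starting from raw labels, the balance check can be sketched with `collections.Counter`. The function name and the default 60% threshold argument are illustrative; for more than two label categories, using the largest minus the smallest proportion is an assumption.

```python
from collections import Counter

def balance_check(labels, preset_gap=0.60):
    """Return (share per label, proportion gap, True if the data is balanced)."""
    counts = Counter(labels)
    total = len(labels)
    shares = {label: n / total for label, n in counts.items()}
    gap = max(shares.values()) - min(shares.values())
    return shares, gap, gap <= preset_gap

# 90% "yes" and 10% "no", as in the example.
labels = ["yes"] * 90 + ["no"] * 10
shares, gap, balanced = balance_check(labels)
```

The 80% gap exceeds the preset 60%, so the sketch reports the data as not balanced.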
  • The attribute characteristics of the data to be evaluated under the data accuracy evaluation dimension include the total amount of data, the total amount of data with missing labels, and whether any label is abnormal. For example, the total amount of data with missing labels is counted, and the ratio of that amount to the total amount of data to be evaluated is calculated. If the ratio is greater than the preset ratio, it is determined that the data to be evaluated does not meet the data accuracy evaluation standard. As another example, the data volumes corresponding to the labels "yes" and "no" are counted. If the data volume under a label is less than the preset data volume, that label is determined to be abnormal, and it is then determined that the data to be evaluated contains abnormal labels and does not meet the data accuracy evaluation standard.
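Both accuracy checks can be combined in one short sketch. Representing a missing label as `None`, and the two default thresholds, are assumptions chosen for illustration; the text leaves the preset ratio and preset data volume to be configured.

```python
from collections import Counter

def accuracy_check(labels, max_missing_ratio=0.05, min_label_count=10):
    """Check the missing-label ratio and look for under-populated labels."""
    total = len(labels)
    missing = sum(1 for label in labels if label is None)
    counts = Counter(label for label in labels if label is not None)
    abnormal = sorted(label for label, n in counts.items() if n < min_label_count)
    missing_ratio = missing / total
    return {
        "missing_ratio": missing_ratio,
        "abnormal_labels": abnormal,
        "passed": missing_ratio <= max_missing_ratio and not abnormal,
    }

# Hypothetical data: 80 "yes", 5 "no", 15 records with no label.
labels = ["yes"] * 80 + ["no"] * 5 + [None] * 15
result = accuracy_check(labels)
```

In this hypothetical data set 15% of labels are missing and the "no" label has only 5 records, so both conditions fail.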
  • The quality evaluation results of the data to be evaluated under the multiple evaluation dimensions are obtained, and a quality evaluation report corresponding to the data to be evaluated is then generated for reference by technical personnel. It should be noted that the execution order of steps 202-206 is not limited to the order shown in the figure; the steps can also be executed in parallel, which is not limited by the present invention.
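One simple way to assemble such a report is sketched below. The report structure and dimension names are hypothetical; the rule that the data set is unusable when any single dimension fails follows the description given earlier in the document.

```python
def build_report(results):
    """Combine per-dimension results into a single report.

    results: mapping of evaluation dimension -> True if its standard is met.
    """
    failed = sorted(dim for dim, ok in results.items() if not ok)
    return {"dimensions": dict(results), "failed": failed, "usable": not failed}

report = build_report({
    "scale": True,
    "balance": False,
    "accuracy": True,
    "pollution": False,
    "bias": True,
})
```

Because each dimension contributes an independent entry, the per-dimension checks can be computed in any order or in parallel before the report is built.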
  • Compared with the current approach, in which technicians assess data set quality based on their own experience, this other data set quality assessment method provided by the embodiment of the present invention obtains the data to be evaluated in the data set, separately counts the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, and, based on those attribute characteristics, evaluates the quality of the data to be evaluated to obtain its quality evaluation results under the multiple evaluation dimensions. By counting attribute characteristics under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple dimensions, which improves evaluation accuracy and efficiency and effectively safeguards data sets during artificial intelligence development.
  • an embodiment of the present invention provides a data set quality assessment device. As shown in FIG. 3 , the device includes: an acquisition unit 31 , a statistical unit 32 and an evaluation unit 33 .
  • the obtaining unit 31 may be used to obtain the data to be evaluated in the data set.
  • the statistics unit 32 may be used to separately calculate attribute characteristics of the data to be evaluated under multiple evaluation dimensions.
  • The evaluation unit 33 may be configured to perform quality evaluation on the data to be evaluated based on attribute characteristics under the multiple evaluation dimensions, and obtain quality evaluation results of the data to be evaluated under the multiple evaluation dimensions respectively.
  • The statistical unit 32 can specifically be used to separately count the attribute characteristics of the data to be evaluated under the data scale evaluation dimension, data balance evaluation dimension, data accuracy evaluation dimension, data pollution evaluation dimension, and data bias evaluation dimension.
  • The evaluation unit 33 can be specifically configured to perform quality assessment on the data to be evaluated based on its attribute characteristics under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension, and obtain the quality assessment results of the data to be evaluated under each of these dimensions.
  • The statistical unit 32 includes a fitting module 321, a prediction module 322, and a judgment module 323.
  • The fitting module 321 may be configured to use a preset interpolation algorithm to fit a function curve corresponding to the structured data, according to the structured data and its corresponding label category.
  • The prediction module 322 may be configured to use the function curve to predict the structured data, obtaining the predicted label category corresponding to the structured data.
  • The judgment module 323 may be configured to determine whether the structured data is noise data based on the predicted label category and the label category corresponding to the structured data.
  • The evaluation unit 33 includes a first calculation module 331 and a first determination module 332.
  • The first calculation module 331 may be configured to separately count a first data volume corresponding to the noise data and a second data volume corresponding to the structured data, and to calculate, from the first data volume and the second data volume, a first data ratio between the noise data and the structured data.
  • The first determination module 332 may be configured to determine that the structured data does not satisfy the data pollution evaluation standard when the first data ratio is greater than a preset noise data ratio.
  • To count the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension, the statistical unit 32 further includes a compression module 324 and a second calculation module 325.
  • The compression module 324 may be configured to compress the unstructured data with a first preset compressor and a second preset compressor respectively, obtaining first compressed data and second compressed data corresponding to the unstructured data.
  • The prediction module 322 may be configured to predict the unstructured data, the first compressed data, and the second compressed data respectively, obtaining an original prediction result corresponding to the unstructured data, a first prediction result corresponding to the first compressed data, and a second prediction result corresponding to the second compressed data.
  • The second calculation module 325 may be configured to calculate a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result.
  • The judgment module 323 may be configured to determine whether the unstructured data is adversarial data based on the first difference and the second difference.
  • The first calculation module 331 may also be used to separately count a third data volume corresponding to the adversarial data and a fourth data volume corresponding to the unstructured data, and to calculate, from the third data volume and the fourth data volume, a second data ratio between the adversarial data and the unstructured data.
  • The first determination module 332 may also be configured to determine that the unstructured data does not meet the data pollution evaluation standard if the second data ratio is greater than a preset adversarial data ratio.
  • To count the attribute characteristics of the data to be evaluated under the data bias evaluation dimension, the statistical unit 32 further includes a second determination module 326, an exclusion module 327, and an analysis module 328.
  • The second determination module 326 may be used to determine each feature corresponding to the structured data.
  • The exclusion module 327 may be used to preliminarily detect biased features among those features by using a preset bias corpus, and to exclude the biased features and their corresponding structured data from the features and from the training data set, obtaining the features remaining after exclusion and the structured data remaining after exclusion.
  • The analysis module 328 may be configured to analyze, according to the structured data remaining after exclusion, whether each remaining feature is a biased feature.
  • The analysis module 328 may be used, if the structured data remaining after exclusion has corresponding label categories, to combine each remaining feature with each label category to obtain multiple combination results; to determine the feature values corresponding to each remaining feature and, according to the multiple combination results, analyze first data volume distribution information for each feature value under the different label categories; and to determine, based on the first data volume distribution information, whether each remaining feature is a biased feature.
  • The analysis module 328 may also be used to cluster the structured data remaining after exclusion with a preset clustering algorithm, obtaining structured data under different classifications; to determine the feature values corresponding to each remaining feature and analyze second data volume distribution information for each feature value under the different classifications; and to determine, based on the second data volume distribution information, whether each remaining feature is a biased feature.
  • An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the following steps are implemented: acquiring the data to be evaluated in the data set; separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, obtaining quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.
  • An embodiment of the present invention also provides a physical structure diagram of a computer device. As shown in FIG. 5, the computer device includes a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor 41, wherein the memory 42 and the processor 41 are both arranged on the bus 43.
  • When the processor 41 executes the program, the following steps are implemented: separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, obtaining quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.
  • The present invention can acquire the data to be evaluated in the data set, separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, and, based on those attribute characteristics, perform quality assessment on the data to obtain quality assessment results under each of the evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from each dimension, which improves the accuracy and efficiency of data set quality assessment and helps keep the data set secure during artificial intelligence development.
  • Each module or step of the present invention described above can be implemented by a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be carried out in an order different from the one given here. Alternatively, they can be fabricated as individual integrated circuit modules, or several of the modules or steps can be fabricated as a single integrated circuit module.
  • The present invention is not limited to any specific combination of hardware and software.
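
The adversarial ("confrontation") data check performed by the compression module 324, prediction module 322, second calculation module 325, judgment module 323, first calculation module 331, and first determination module 332 can be sketched roughly as follows. This is only an illustrative sketch: the use of an L1 distance between prediction vectors, the rule that a sample is flagged when either difference exceeds `threshold`, and the default values of `threshold` and `max_adversarial_ratio` are all assumptions, since the text above only states that the decision is based on the two differences and on a preset ratio. The prediction vectors are assumed to come from a model applied to the original and compressed inputs; the compressors themselves are not modelled here.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two prediction vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def is_adversarial(original_pred, first_pred, second_pred, threshold=0.5):
    """Flag a sample when compression changes the prediction too much.

    original_pred: prediction on the uncompressed sample
    first_pred / second_pred: predictions on the two compressed versions
    threshold: hypothetical cut-off, not taken from the source text
    """
    first_diff = l1_distance(original_pred, first_pred)
    second_diff = l1_distance(original_pred, second_pred)
    # Assumed decision rule: either difference exceeding the threshold
    # marks the sample as adversarial.
    return max(first_diff, second_diff) > threshold


def adversarial_pollution_check(flags, max_adversarial_ratio=0.05):
    """Return (ratio, passes) for a list of per-sample adversarial flags.

    The data fails the pollution standard when the share of flagged
    samples is greater than the preset ratio.
    """
    ratio = sum(flags) / len(flags) if flags else 0.0
    return ratio, ratio <= max_adversarial_ratio
```

A caller would compute one flag per unstructured sample with `is_adversarial` and then pass the list of flags to `adversarial_pollution_check`.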

Abstract

The present invention relates to the field of information technology, and disclosed are a method and apparatus for evaluating data set quality, a computer device, and a storage medium, mainly aiming to improve the precision and efficiency in evaluating data set quality. The method comprises: acquiring data to be evaluated in a data set; compiling attribute features of said data under a plurality of evaluation dimensions; and, on the basis of the attribute features under the plurality of evaluation dimensions, evaluating the quality of said data, and obtaining quality evaluation results of said data under the plurality of evaluation dimensions. The present invention is applicable to the evaluation of data set quality.

Description

Data set quality assessment method, apparatus, computer device, and storage medium
Technical Field
The present invention relates to the field of information technology, and in particular to a data set quality assessment method, apparatus, computer device, and storage medium.
Background
Data is the foundation for developing and applying artificial intelligence, and data sets are crucial to artificial intelligence algorithms. Data sets of different quality yield different model parameters after training and therefore different execution results, which in turn affects the security of artificial intelligence algorithms. If attackers maliciously modify or add to a data set, the model's predictions will be wrong. How to effectively detect and evaluate the quality of data sets has therefore become an urgent problem in artificial intelligence security.
At present, the quality of a data set is usually assessed by technicians based on their own experience. However, this approach relies heavily on the technicians' work experience, and the results are strongly influenced by subjective human factors, so the quality of the data set may well not be assessed accurately, which can lead to artificial intelligence security incidents. In addition, manual assessment is inefficient and increases the technicians' workload.
Summary of the Invention
The present invention provides a data set quality assessment method, apparatus, computer device, and storage medium, whose main purpose is to improve the accuracy and efficiency of data set quality assessment.
According to a first aspect of the present invention, a data set quality assessment method is provided, including:
acquiring the data to be evaluated in a data set;
separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions;
performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.
According to a second aspect of the present invention, a data set quality assessment apparatus is provided, including:
an acquisition unit, configured to acquire the data to be evaluated in a data set;
a statistical unit, configured to separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions;
an evaluation unit, configured to perform quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.
According to a third aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:
acquiring the data to be evaluated in a data set;
separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions;
performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.
According to a fourth aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the following steps when executing the program:
acquiring the data to be evaluated in a data set;
separately counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions;
performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.
Compared with the current practice in which technicians assess the quality of a data set based on their own experience, the data set quality assessment method, apparatus, computer device, and storage medium provided by the present invention can acquire the data to be evaluated in the data set, separately count the attribute characteristics of that data under multiple evaluation dimensions, and, based on those attribute characteristics, perform quality assessment on the data to obtain quality assessment results under each of the evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from each dimension, which improves the accuracy and efficiency of data set quality assessment and helps keep the data set secure during artificial intelligence development.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 shows a flowchart of a data set quality assessment method provided by an embodiment of the present invention;
FIG. 2 shows a flowchart of another data set quality assessment method provided by an embodiment of the present invention;
FIG. 3 shows a schematic structural diagram of a data set quality assessment apparatus provided by an embodiment of the present invention;
FIG. 4 shows a schematic structural diagram of another data set quality assessment apparatus provided by an embodiment of the present invention;
FIG. 5 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments can be combined with each other.
At present, the quality of a data set is usually assessed by technicians based on their own experience. However, this approach relies heavily on the technicians' work experience, and the results are strongly influenced by subjective human factors, so the quality of the data set may well not be assessed accurately, which can lead to artificial intelligence security incidents. In addition, manual assessment is inefficient and increases the technicians' workload.
To solve the above problems, an embodiment of the present invention provides a data set quality assessment method. As shown in FIG. 1, the method includes:
101. Acquire the data to be evaluated in the data set.
Here, the data set includes a training data set and a prediction data set, and the data to be evaluated may specifically be the individual samples in the training data set or the individual items of prediction data in the prediction data set. To overcome the low accuracy and efficiency of data set quality assessment in the prior art, the embodiment of the present invention develops a data set quality assessment tool. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the tool can automatically evaluate the quality of the data set from each dimension, which improves the accuracy and efficiency of the assessment and helps keep the data set secure during artificial intelligence development. The embodiment of the present invention is mainly applied to scenarios in which the quality of a data set is evaluated along multiple dimensions. The embodiment is executed by an apparatus or device capable of evaluating data set quality, which may specifically be deployed on the server side.
In the embodiment of the present invention, the training data set and prediction data set that require quality assessment are collected in advance. The data in these sets may be structured data or unstructured data such as image data. After obtaining the training data set or prediction data set to be evaluated, a technician can click the file upload button in the interface of the data set quality assessment tool to upload that data set to the tool, so that the tool can perform a multi-dimensional quality assessment on it.
102. Separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions.
Here, the multiple evaluation dimensions include a data scale evaluation dimension, a data balance evaluation dimension, a data accuracy evaluation dimension, a data pollution evaluation dimension, and a data bias evaluation dimension. It should be noted that the evaluation dimensions in the embodiment of the present invention are not limited to those listed above; other evaluation dimensions may be included and can be set according to actual business needs. Further, the attribute characteristics of the data to be evaluated under the data scale evaluation dimension include the total amount of data, the number of features, the amount of memory the data occupies, whether the data has labels, and so on. The attribute characteristics under the data balance evaluation dimension include the proportion of data under each label category. The attribute characteristics under the data accuracy evaluation dimension include the total amount of data, the total number of missing labels, whether any label category is abnormal, and so on. The attribute characteristics under the data pollution evaluation dimension include the amount of noise data, the amount of adversarial data, and so on. The attribute characteristics under the data bias evaluation dimension include the biased features corresponding to the data to be evaluated.
In the embodiment of the present invention, different statistical methods can be used to count the attribute characteristics of the data to be evaluated under the data scale, data balance, data accuracy, data pollution, and data bias evaluation dimensions respectively. The specific statistical method differs by evaluation dimension; see steps 202-206 for details.
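
As a rough illustration of step 102, the sketch below collects a few of the attribute characteristics listed above (total amount of data, number of features, approximate memory footprint, presence of labels, missing labels, and per-label proportions) for structured data. The representation of records as Python dictionaries, the function name `summarize_dataset`, and the `label_key` parameter are assumptions made for this example and are not taken from the patent; the memory figure is only a shallow estimate.

```python
import sys
from collections import Counter


def summarize_dataset(rows, label_key="label"):
    """Collect simple attribute characteristics for a list of dict records.

    rows: list of dicts, one per sample (hypothetical representation)
    label_key: name of the field holding the label (hypothetical)
    """
    total = len(rows)
    # Every key other than the label counts as a feature.
    feature_names = {k for row in rows for k in row if k != label_key}
    labels = [row.get(label_key) for row in rows]
    missing = sum(1 for value in labels if value is None)
    counts = Counter(value for value in labels if value is not None)
    labelled = total - missing
    proportions = (
        {k: c / labelled for k, c in counts.items()} if labelled else {}
    )
    return {
        "total_rows": total,                      # data scale
        "feature_count": len(feature_names),      # data scale
        # Shallow size of each record only; nested values are not measured.
        "approx_memory_bytes": sum(sys.getsizeof(r) for r in rows),
        "has_labels": labelled > 0,               # data scale
        "missing_labels": missing,                # data accuracy
        "label_proportions": proportions,         # data balance
    }
```

The resulting dictionary can then be handed to per-dimension checks, each of which compares one of these values with its own evaluation standard.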
103. Based on the attribute characteristics under the multiple evaluation dimensions, perform quality assessment on the data to be evaluated, and obtain quality assessment results of the data to be evaluated under each of the multiple evaluation dimensions.
In the embodiment of the present invention, different evaluation dimensions have different evaluation standards. When the attribute characteristics under the multiple evaluation dimensions are used to assess the quality of the data to be evaluated, if the data fails the evaluation standard of any one dimension, the data is determined to have a quality problem and the data set cannot be used to train a model or make predictions; technicians must then re-collect the data set or clean the data set that has the quality problem. For example, under the data balance evaluation dimension, data with the "yes" label accounts for 90% of the data to be evaluated and data with the "no" label accounts for 10%, so the gap between the two proportions is 80%. Because this gap exceeds the preset proportion gap of 60%, the data to be evaluated is determined not to meet the data balance evaluation standard. Training a model on data that fails the data balance evaluation standard is likely to harm the model's execution results, and the security of the artificial intelligence algorithm cannot be guaranteed. In the same way, the data to be evaluated can be assessed under the data scale, data accuracy, data pollution, and data bias evaluation dimensions. If the data fails the evaluation standard of any of these dimensions, it is determined to have a quality problem and cannot be used to train a model or make predictions. For the quality assessment process under each evaluation dimension, see steps 202-206.
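
The data balance example above (90% "yes" versus 10% "no", compared against a preset gap of 60%) can be written as a small check. In this sketch the function name `balance_check` and the parameter `max_gap` are hypothetical, and extending the two-label example to more than two labels by taking the largest proportion minus the smallest is an assumption, since the text only describes the two-label case.

```python
def balance_check(label_proportions, max_gap=0.60):
    """Return (gap, passes) for a mapping of label -> share of the data.

    gap is the difference between the largest and smallest share.
    The data fails the balance standard when gap is greater than max_gap.
    """
    if not label_proportions:
        raise ValueError("no labelled data to evaluate")
    shares = list(label_proportions.values())
    gap = max(shares) - min(shares)
    return gap, gap <= max_gap


# The worked example from the text: the gap is about 0.8, which is
# greater than 0.6, so the balance standard is not met.
gap, passes = balance_check({"yes": 0.90, "no": 0.10})
```

A data set with shares of 0.55 and 0.45 would give a gap of about 0.10 and pass under the same preset value.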
Compared with the current practice in which technicians assess the quality of a data set based on their own experience, the data set quality assessment method provided by the embodiment of the present invention can acquire the data to be evaluated in the data set, separately count the attribute characteristics of that data under multiple evaluation dimensions, and, based on those attribute characteristics, perform quality assessment on the data to obtain quality assessment results under each of the evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from each dimension, which improves the accuracy and efficiency of data set quality assessment and helps keep the data set secure during artificial intelligence development.
Further, to better illustrate the quality assessment process described above, and as a refinement and extension of the above embodiment, an embodiment of the present invention provides another data set quality assessment method. As shown in FIG. 2, the method includes:
201. Acquire the data to be evaluated in the data set.
Here, the data set includes a training data set and a prediction data set, and the data to be evaluated may specifically be the individual samples in the training data set or the individual items of prediction data in the prediction data set. In the embodiment of the present invention, the data set to be evaluated must be acquired before its quality is assessed. The data set is acquired in exactly the same way as in step 101, which is not repeated here.
202. Count the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension and, based on those attribute characteristics, perform quality assessment on the data to be evaluated to obtain the quality assessment result of the data under the data pollution evaluation dimension.
In the embodiment of the present invention, if the data to be evaluated is structured data in the training data set, the pollution assessment must detect whether the data contains noise data, because noise data interferes strongly with model training and easily degrades the model's execution results. For the specific process of identifying noise data, as an optional implementation, step 202 specifically includes: using a preset interpolation algorithm to fit a function curve corresponding to the structured data according to the structured data and its corresponding label category; using the function curve to predict the structured data, obtaining the predicted label category corresponding to the structured data; and determining whether the structured data is noise data based on the predicted label category and the label category corresponding to the structured data.
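
The comparison between predicted and recorded label categories, together with the noise ratio test applied afterwards, might look like the sketch below. Treating any sample whose predicted label differs from its recorded label as noise is an assumed decision rule, because the text only states that the judgment is based on the two labels. The function names and the default `max_noise_ratio` are likewise hypothetical, and the predicted labels are assumed to have been produced already by the fitted function curve.

```python
def flag_noise(labels, predicted_labels):
    """Mark each sample whose predicted label differs from its recorded label."""
    if len(labels) != len(predicted_labels):
        raise ValueError("labels and predictions must have the same length")
    return [actual != predicted
            for actual, predicted in zip(labels, predicted_labels)]


def noise_pollution_check(labels, predicted_labels, max_noise_ratio=0.10):
    """Return (ratio, passes).

    ratio is the amount of noise data divided by the amount of structured
    data; the data fails the pollution standard when ratio is greater than
    the preset noise ratio.
    """
    flags = flag_noise(labels, predicted_labels)
    ratio = sum(flags) / len(flags) if flags else 0.0
    return ratio, ratio <= max_noise_ratio
```

With five samples and one mismatch the ratio is 0.2, which would fail a preset noise ratio of 0.10.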
具体地,训练数据集中存在大量的结构化数据,将每一个结构化数据作为一个样本点(x,y),并利用这些样本点(x 1,y 1),(x 2,y 2),…(x n,y n),来拟合一个函数曲线y=f(x),由于待评估的结构化数据可能存在数据缺陷的现象,因此在利用大量样本点拟合函数曲线之前,需要利于预设插值算法对结构化数据进行插值处理,该预设插值算法具体可以为预设克里金插值算法,首先,确定待插值的结构化数据,并计算待插值的结构化数据与这些已知分类结果的结构化数据之间的距离,并根据该距离确定已知分类结构的结构化数据对应的数据权重,之后根据已知分类结果的结构化数据对应的权重和该分类结果,计算待插值的结构化数据对应的分类结果。其中,该分类结果具体可以为结构化数据对应的分类概率值。 Specifically, there is a large amount of structured data in the training data set, each structured data is used as a sample point (x, y), and these sample points (x 1 ,y 1 ),(x 2 ,y 2 ), …(x n ,y n ), to fit a function curve y=f(x), because the structured data to be evaluated may have data defects, so before using a large number of sample points to fit the function curve, it is necessary to facilitate The preset interpolation algorithm performs interpolation processing on the structured data. The preset interpolation algorithm may specifically be a preset Kriging interpolation algorithm. First, determine the structured data to be interpolated, and calculate the difference between the structured data to be interpolated and these known The distance between the structured data of the classification result, and determine the data weight corresponding to the structured data of the known classification structure according to the distance, and then calculate the value to be interpolated according to the weight corresponding to the structured data of the known classification result and the classification result The classification results corresponding to the structured data. Wherein, the classification result may specifically be a classification probability value corresponding to the structured data.
For example, suppose the structured data with known classification results are x1, x2 and x3, and the structured data to be interpolated is x4. Because the classification result of x4 is unknown, it can be estimated from x1, x2 and x3. Specifically, the distances between x4 and each of x1, x2 and x3 are calculated. The larger the distance, the farther that known piece of structured data is from x4, the smaller its influence on x4, and therefore the smaller its data weight. Conversely, the smaller the distance, the closer that known piece of structured data is to x4, the larger its influence, and therefore the larger its data weight. Once the data weights have been determined, the weight of each piece of structured data with a known classification result is multiplied by its classification result, and the products are combined to give the classification result of the structured data to be interpolated. The interpolated structured data, now carrying a classification result, is then inserted into the structured data with known classification results, which solves the problem of missing data.
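The distance-based weighting described above can be sketched as follows. This is a minimal illustration that uses inverse-distance weighting as a simpler stand-in for Kriging (Kriging derives its weights from a fitted variogram model, which is not reproduced here). The function name `idw_estimate`, the `power` parameter and the sample values are assumptions made for illustration only.

```python
def idw_estimate(known, x_new, power=2.0):
    """Estimate the classification probability at x_new.

    known: list of (x, probability) pairs with known classification results.
    Closer points receive larger weights (weight = 1 / distance**power).
    Inverse-distance weighting is used here as a simplified stand-in for
    Kriging; it is not the Kriging algorithm itself.
    """
    weighted = []
    for x, prob in known:
        distance = abs(x - x_new)
        if distance == 0:
            return prob  # x_new coincides with a known point
        weighted.append((1.0 / distance ** power, prob))
    total_weight = sum(w for w, _ in weighted)
    # Normalised weighted sum of the known classification results.
    return sum(w * p for w, p in weighted) / total_weight


# Hypothetical known points x1, x2, x3 and a point x4 to be interpolated.
known = [(1.0, 0.9), (2.0, 0.8), (4.0, 0.2)]
estimate = idw_estimate(known, 3.0)
```

Because the weights are normalised, the estimate always lies between the smallest and largest known classification results.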
Further, all of the above structured data are used together as sample points for curve fitting, which yields the function curve y = f(x). Because the classification result of each piece of structured data in the training data set is known, that is, the probability value of belonging to the real label category is known, the function curve can then be used to predict the structured data and obtain the corresponding prediction result, that is, the probability value of belonging to the predicted label category. The difference between the probability value of belonging to the real label category and the probability value of belonging to the predicted label category is then taken as the probability difference of the structured data. If the probability difference is greater than a preset probability difference, the structured data is determined to be noise data. For example, if structured data A has a probability value of 0.87 of belonging to the real label category and a probability value of 0.27 of belonging to the predicted label category, the probability difference is 0.87 - 0.27 = 0.60. Because this is greater than the preset probability difference of 0.2, structured data A is determined to be noise data. In this way it can be determined whether each piece of structured data in the training data set is noise data.
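The threshold test on the probability difference can be written in a few lines. This is only a sketch: the function name `is_noise` is hypothetical, and taking the absolute value of the difference is an assumption, since the text only says that the two probability values are subtracted.

```python
def is_noise(true_prob, predicted_prob, max_gap=0.2):
    """Flag a sample as noise when the gap between the probability of the
    real label and the probability given by the fitted curve exceeds
    max_gap (the preset probability difference)."""
    return abs(true_prob - predicted_prob) > max_gap


# Structured data A from the example: 0.87 vs. 0.27 with a preset gap of 0.2.
a_is_noise = is_noise(0.87, 0.27)
```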
Further, after the noise data in the training data set has been determined, pollution evaluation needs to be performed on the structured data in the training data set. On this basis, the method includes: counting a first data volume corresponding to the noise data and a second data volume corresponding to the structured data, and calculating a first data ratio of the noise data to the structured data from the first data volume and the second data volume; and, if the first data ratio is greater than a preset noise data ratio, determining that the structured data does not meet the data pollution evaluation standard. The preset noise data ratio may be set according to actual service requirements.
For example, suppose the preset noise data ratio is 10%. After the noise data in the training data set has been determined, the total amount of noise data (the first data volume) is counted as 200, and the total amount of structured data in the training data set (the second data volume) is 1000. The first data ratio is therefore 200/1000 = 20%. Because 20% is greater than the preset noise data ratio of 10%, the training data set is determined not to meet the data pollution evaluation standard; that is, the training data set has a quality problem and cannot be used for model training.
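The ratio check in this example can be sketched as below. The function name `pollution_check` and its parameter names are illustrative assumptions; the 10% default simply mirrors the example and would in practice be set according to service requirements.

```python
def pollution_check(flagged_count, total_count, max_ratio=0.10):
    """Return (ratio, failed), where failed is True when the share of
    flagged samples exceeds the preset ratio."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    ratio = flagged_count / total_count
    return ratio, ratio > max_ratio


# 200 noise samples out of 1000 structured samples, preset ratio 10%.
ratio, failed = pollution_check(200, 1000)
```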
In a specific application scenario, if the data to be evaluated is unstructured data in a prediction data set, the pollution evaluation needs to detect whether the data to be evaluated contains adversarial data, that is, data maliciously crafted by an attacker, because once such adversarial data is mixed into the prediction data set it directly affects the prediction results of the model. As an optional implementation of identifying adversarial data, step 202 specifically includes: compressing the unstructured data with a first preset compressor and a second preset compressor respectively, to obtain first compressed data and second compressed data corresponding to the unstructured data; predicting the unstructured data, the first compressed data and the second compressed data respectively, to obtain an original prediction result corresponding to the unstructured data, a first prediction result corresponding to the first compressed data, and a second prediction result corresponding to the second compressed data; calculating a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result; and determining, based on the first difference and the second difference, whether the unstructured data is adversarial data.
The first preset compressor and the second preset compressor compress features in the unstructured data, so as to reduce the input of unnecessary features and lower the dimension of the unstructured data. The features compressed by the first preset compressor differ from those compressed by the second preset compressor. For example, if the input unstructured data includes 10 features, that is, its input dimension is 10, the first preset compressor may compress the first and second features, and the second preset compressor may compress the third and fourth features. It should be noted that the number of compressors used in this embodiment of the present invention is not limited to two; it may be set according to actual service requirements and the number of features. In addition, the original prediction result, the first prediction result and the second prediction result in this embodiment are probability values that the unstructured data belongs to the corresponding label category.
For example, unstructured data A is input into the first preset compressor and the second preset compressor respectively for feature compression, which yields the first compressed data and the second compressed data corresponding to unstructured data A. Unstructured data A, the first compressed data and the second compressed data are then each input into the constructed model for prediction, giving an original prediction result of 0.78 for unstructured data A, a first prediction result of 0.56 for the first compressed data, and a second prediction result of 0.63 for the second compressed data. Subtracting the first prediction result from the original prediction result gives a first difference of 0.22, and subtracting the second prediction result from the original prediction result gives a second difference of 0.15. The larger of the two differences is then compared with a preset difference. If the maximum difference is greater than the preset difference, unstructured data A is determined to be adversarial data; if the maximum difference is smaller than the preset difference, unstructured data A is determined not to be adversarial data. With a preset difference of 0.2, the maximum difference of 0.22 is greater than the preset difference, so unstructured data A is determined to be adversarial data. In this way all adversarial data in the prediction data set can be identified.
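The comparison of the original prediction with the predictions on compressed inputs can be sketched as follows. The function `is_adversarial`, the lookup-table "model" and the string-suffix "compressors" are toy stand-ins chosen only to reproduce the numbers in the example; a real system would plug in the trained model and actual feature compressors.

```python
def is_adversarial(model, sample, compressors, max_gap=0.2):
    """Return (flag, gaps).

    model: callable mapping an input to a probability for its label.
    compressors: callables that each compress a different part of the input.
    The sample is flagged when the largest prediction gap exceeds max_gap.
    """
    original = model(sample)
    gaps = [abs(original - model(compress(sample))) for compress in compressors]
    return max(gaps) > max_gap, gaps


# Toy stand-ins (assumptions): predictions are looked up in a table, and
# "compression" just maps sample "A" to the keys "A1" and "A2".
predictions = {"A": 0.78, "A1": 0.56, "A2": 0.63}
model = predictions.get
compressors = [lambda s: s + "1", lambda s: s + "2"]

flag, gaps = is_adversarial(model, "A", compressors)
```

Taking the maximum gap means a single compressor that strongly changes the prediction is enough to flag the sample.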
Further, after the adversarial data in the prediction data set has been determined, pollution evaluation needs to be performed on the unstructured data in the prediction data set. On this basis, the method includes: counting a third data volume corresponding to the adversarial data and a fourth data volume corresponding to the unstructured data, and calculating a second data ratio of the adversarial data to the unstructured data from the third data volume and the fourth data volume; and, if the second data ratio is greater than a preset adversarial data ratio, determining that the unstructured data does not meet the data pollution evaluation standard. The preset adversarial data ratio may be set according to actual service requirements.
For example, suppose the preset adversarial data ratio is 10%. After the adversarial data in the prediction data set has been determined, the total amount of adversarial data (the third data volume) is counted as 300, and the total amount of unstructured data in the prediction data set (the fourth data volume) is 1000. The second data ratio is therefore 300/1000 = 30%. Because 30% is greater than the preset adversarial data ratio of 10%, the prediction data set is determined not to meet the data pollution evaluation standard; that is, the prediction data set has a quality problem and cannot be used for model prediction.
203. Collect statistics on the attribute characteristics of the data to be evaluated under the data bias evaluation dimension, and perform quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation result of the data to be evaluated under the data bias evaluation dimension.
In this embodiment of the present invention, if the data to be evaluated is structured data in a training data set, the bias evaluation needs to detect whether each feature of the data to be evaluated is a bias feature, because the presence of bias features may cause the decisions of the artificial intelligence to be discriminatory. As an optional implementation of determining bias features, step 203 specifically includes: determining each feature corresponding to the structured data; preliminarily detecting, with a preset bias corpus, the bias features among those features, and excluding the bias features and their corresponding structured data from the features and from the training data set respectively, to obtain the remaining features and the remaining structured data; and analyzing, according to the remaining structured data, whether each remaining feature is a bias feature. The preset bias corpus stores a large number of bias features, such as gender, age, region and income.
For example, suppose the features of the structured data in the training data set are education level, job, income, medical history and gender. These features are matched against the features in the preset bias corpus, and the matching shows that income and gender are bias features. To improve the accuracy of bias feature detection, the other features also need to be analyzed further. The structured data corresponding to the detected bias features is excluded from the structured data, and the remaining structured data is then used to analyze whether each remaining feature is a bias feature.
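The corpus matching step can be sketched as a simple set lookup. The contents of `BIAS_CORPUS` below are only the four examples named in the text, and the function name `split_features` and the case-insensitive exact match are assumptions; a real corpus would be much larger and might need fuzzier matching.

```python
# Hypothetical bias corpus holding only the examples named in the text.
BIAS_CORPUS = {"gender", "age", "region", "income"}


def split_features(features, corpus=BIAS_CORPUS):
    """Split feature names into (bias_features, remaining_features) by
    exact, case-insensitive matching against the corpus."""
    flagged = [f for f in features if f.lower() in corpus]
    remaining = [f for f in features if f.lower() not in corpus]
    return flagged, remaining


features = ["education level", "job", "income", "medical history", "gender"]
flagged, remaining = split_features(features)
```

The remaining features are the ones that go on to the distribution-based analysis.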
Further, as an optional implementation of analyzing whether each remaining feature is a bias feature, the method includes: if the remaining structured data has corresponding label categories, combining each remaining feature with each label category to obtain multiple combination results; determining the feature values of each remaining feature and analyzing, according to the combination results, first data volume distribution information of each feature value under the different label categories; and determining, based on the first data volume distribution information, whether each remaining feature is a bias feature.
For example, suppose the label categories of the remaining structured data are "yes" and "no", and the remaining features are "education level" and "job". Combining the features with the label categories gives "education level-yes", "education level-no", "job-yes" and "job-no". The feature values of education level are then determined to be above bachelor's degree, bachelor's degree and below bachelor's degree, and the feature values of job are employed and unemployed. Next, for the structured data whose label category is "yes", the amount of data at each education level is counted. Suppose the counts for above bachelor's degree, bachelor's degree and below bachelor's degree are 1000, 200 and 800 respectively, and the total amount of structured data labeled "yes" is 2000. The proportions of structured data for the three education levels are then 50%, 10% and 40%. Because the difference between the proportion for above bachelor's degree and the proportion for bachelor's degree is 40%, which exceeds the preset proportion difference of 20%, education level is determined to be a bias feature. In this way it can be determined whether each remaining feature of labeled structured data is a bias feature.
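The per-label distribution check in this example can be sketched as below. The function names, the dictionary-based row layout and the column name `education` are illustrative assumptions. The text compares two specific feature values; comparing the largest and smallest shares, as done here, is one possible generalisation rather than the stated rule.

```python
from collections import Counter


def value_shares(rows, feature, label_value, label_key="label"):
    """Share of each feature value among rows carrying the given label."""
    counts = Counter(r[feature] for r in rows if r[label_key] == label_value)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def is_bias_feature(shares, max_gap=0.20):
    """Assumption: a feature is biased when the gap between its largest
    and smallest value share exceeds the preset proportion difference."""
    if not shares:
        return False
    return max(shares.values()) - min(shares.values()) > max_gap


# Data mirroring the example: 1000 / 200 / 800 rows labelled "yes".
rows = (
    [{"education": "above bachelor", "label": "yes"}] * 1000
    + [{"education": "bachelor", "label": "yes"}] * 200
    + [{"education": "below bachelor", "label": "yes"}] * 800
    + [{"education": "bachelor", "label": "no"}] * 500
)
shares = value_shares(rows, "education", "yes")
biased = is_bias_feature(shares)
```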
Further, as another optional implementation of analyzing whether each remaining feature is a bias feature, the method also includes: if the remaining structured data has no corresponding label category, clustering the remaining structured data with a preset clustering algorithm to obtain structured data under different classes; determining the feature values of each remaining feature and analyzing second data volume distribution information of each feature value under the different classes; and determining, based on the second data volume distribution information, whether each remaining feature is a bias feature. The preset clustering algorithm may specifically be the DBSCAN clustering algorithm.
Specifically, the remaining structured data may have no corresponding label category. In that case label categories and features cannot be combined to analyze the first data volume distribution of each feature value under different label categories. When the label categories of the structured data are unknown, the remaining structured data can instead be clustered, and the second data volume distribution of each feature value under the resulting classes can then be analyzed. When the DBSCAN clustering algorithm is used to cluster the remaining structured data, a neighborhood radius and a threshold on the amount of structured data within a neighborhood are set first. A piece of structured data A is then selected arbitrarily, and the distance from every other piece of structured data to A is calculated. From these distances, the pieces of structured data B, C and D that fall within the neighborhood of A are determined. If the number of pieces of structured data within the neighborhood of A is greater than the threshold, A is determined to be a core point, and a cluster C1 is created with A as its core point. All points that are density-reachable from A are then found. Every piece of structured data within the neighborhood of A is density-reachable from A and therefore belongs to C1. In addition, the structured data within the neighborhood of B is determined, for example structured data E. Because B is density-reachable from A and E is density-reachable from B, E is density-reachable from A, so E also belongs to C1. In this way all of the structured data in cluster C1 can be found. The search then continues over the rest of the remaining structured data, which yields cluster C2 in the same manner. Dividing the data into multiple clusters completes the clustering of the remaining structured data, that is, the remaining structured data is divided into multiple classes. The second data volume distribution of each feature value under the different classes is then analyzed in the same way as the first data volume distribution, which is not repeated here.
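The clustering procedure described above can be sketched in plain Python as follows. This is a compact, unoptimised illustration of DBSCAN (every neighborhood query scans all points), not a production implementation. The function name `dbscan`, the parameter names and the sample points are assumptions. It uses the common convention that a point is a core point when its neighborhood, including itself, holds at least `min_pts` points, which may differ slightly from the "greater than the threshold" wording above.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks points left as noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core point; may later join as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

After clustering, the cluster ids can take the place of label categories when counting how each feature value is distributed across classes.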
204. Collect statistics on the attribute characteristics of the data to be evaluated under the data scale evaluation dimension, and perform quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation result of the data to be evaluated under the data scale evaluation dimension.
The attribute characteristics of the data to be evaluated under the data scale evaluation dimension include the total amount of data, the number of features, the memory occupied by the data, whether the data is labeled, and so on. In this embodiment of the present invention, the scale evaluation needs to count the total amount of data, the number of features and the memory occupied, and to check whether the data is labeled and, if labels are missing, how much data lacks a label. For example, the data to be evaluated may contain 300 pieces of data, occupy 11.3 KB of memory and have 13 features, with every piece of data labeled.
Further, once the total amount of data, the number of features, the memory occupied and the presence of labels have been counted, scale evaluation is performed on the data to be evaluated. It should be noted that the data scale evaluation standard differs between model algorithms. For example, for a translation model, if the total amount of data to be evaluated is less than 10 million, the data is determined not to meet the data scale evaluation standard; that is, the data set cannot be used directly for training or prediction, and more data needs to be added.
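A scale profile along these lines might be collected as in the sketch below. The function names, the dictionary-based row layout and the `label` key are assumptions. Memory size is deliberately left out, since measuring it depends on how the data set is stored.

```python
def scale_profile(rows, label_key="label"):
    """Count samples, distinct feature names and samples without a label."""
    feature_names = set()
    for row in rows:
        feature_names.update(k for k in row if k != label_key)
    missing = sum(1 for row in rows if row.get(label_key) is None)
    return {
        "total": len(rows),
        "feature_count": len(feature_names),
        "missing_labels": missing,
    }


def meets_scale_standard(profile, min_total):
    """min_total depends on the model algorithm, e.g. 10 million samples
    for the translation model mentioned in the text."""
    return profile["total"] >= min_total


rows = [
    {"age": 30, "income": 5000, "label": "yes"},
    {"age": 41, "income": 7200, "label": None},
    {"age": 25, "income": 3100, "label": "no"},
]
profile = scale_profile(rows)
```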
205. Collect statistics on the attribute characteristics of the data to be evaluated under the data balance evaluation dimension, and perform quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation result of the data to be evaluated under the data balance evaluation dimension.
In this embodiment of the present invention, the balance of the data set is an important factor affecting the performance of an artificial intelligence algorithm. The more balanced the data set, the smaller the distribution deviation of the data to be evaluated and the better the algorithm performs; conversely, the larger the distribution deviation, the worse the algorithm performs. For example, suppose the label categories of the data to be evaluated are "yes" and "no". The amount of data labeled "yes" and the amount labeled "no" are counted, and the proportion of each label category is calculated. If the "yes" label accounts for 90% of the data and the "no" label for 10%, the difference between the two proportions is 80%. Because this is greater than the preset proportion difference of 60%, the data to be evaluated is determined not to meet the data balance evaluation standard.
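The balance check in this example can be sketched as follows. The function name `label_share_gap` is hypothetical, and using the gap between the largest and smallest label shares is an assumption that coincides with the text's rule when there are exactly two label categories.

```python
from collections import Counter


def label_share_gap(labels):
    """Gap between the largest and smallest label proportions."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    shares = [count / len(labels) for count in counts.values()]
    return max(shares) - min(shares)


# 90% "yes" and 10% "no", checked against a preset gap of 60%.
labels = ["yes"] * 90 + ["no"] * 10
gap = label_share_gap(labels)
unbalanced = gap > 0.60
```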
206. Collect statistics on the attribute characteristics of the data to be evaluated under the data accuracy evaluation dimension, and perform quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation result of the data to be evaluated under the data accuracy evaluation dimension.
The attribute characteristics of the data to be evaluated under the data accuracy evaluation dimension include the total amount of data, the total number of missing labels, whether any label category is abnormal, and so on. For example, the total amount of data with missing labels is counted, and the ratio of that amount to the total amount of data to be evaluated is calculated. If the ratio is greater than a preset ratio, the data to be evaluated is determined not to meet the accuracy evaluation standard. As another example, the amounts of data under the labels "yes" and "no" are counted separately. If the amount of data under a label is less than a preset data volume, that label is determined to be abnormal, so the data to be evaluated is determined to contain an abnormal label and not to meet the accuracy evaluation standard.
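Both accuracy checks can be combined in one small function, sketched below. The function name `accuracy_issues`, the default thresholds and the sample label counts are assumptions chosen for illustration; the text leaves the preset ratio and preset data volume to be configured.

```python
from collections import Counter


def accuracy_issues(labels, max_missing_ratio=0.05, min_per_label=50):
    """Return a list of problems found; an empty list means the labels
    pass both checks. None stands for a missing label."""
    if not labels:
        raise ValueError("labels must not be empty")
    issues = []
    missing = sum(1 for label in labels if label is None)
    if missing / len(labels) > max_missing_ratio:
        issues.append("missing-label ratio too high")
    counts = Counter(label for label in labels if label is not None)
    for name, count in sorted(counts.items()):
        if count < min_per_label:
            issues.append(f"label {name!r} has too few samples")
    return issues


labels = ["yes"] * 900 + ["no"] * 20 + [None] * 80
issues = accuracy_issues(labels)
```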
After the multi-dimensional evaluation, the quality evaluation results of the data to be evaluated under each of the evaluation dimensions are obtained, and a quality evaluation report corresponding to the data to be evaluated is generated for reference by technical personnel. It should be noted that the execution order of steps 202-206 is not limited to the order shown in FIG. 2 and described above. In a specific application, steps 202-206 may be executed in another order according to the actual situation, and they may also be executed in parallel; the present invention places no limitation on this.
Compared with the current practice in which technicians evaluate the quality of a data set according to their own experience, this other data set quality evaluation method provided by the embodiment of the present invention obtains the data to be evaluated in the data set, collects statistics on the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, and performs quality evaluation on the data to be evaluated based on those attribute characteristics, to obtain the quality evaluation results of the data to be evaluated under each of the evaluation dimensions. By collecting statistics on the attribute characteristics under multiple evaluation dimensions, the quality of the data set can be evaluated automatically from multiple dimensions, which improves the accuracy and efficiency of data set quality evaluation and effectively ensures the security of data sets during artificial intelligence development.
Further, as a specific implementation of FIG. 1, an embodiment of the present invention provides a data set quality evaluation apparatus. As shown in FIG. 3, the apparatus includes an acquisition unit 31, a statistics unit 32 and an evaluation unit 33.
The acquisition unit 31 may be configured to obtain the data to be evaluated in the data set.
The statistics unit 32 may be configured to collect statistics on the attribute characteristics of the data to be evaluated under multiple evaluation dimensions.
The evaluation unit 33 may be configured to perform quality evaluation on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality evaluation results of the data to be evaluated under each of the evaluation dimensions.
In a specific application scenario, the statistics unit 32 may specifically be configured to separately collect statistics on the attribute characteristics of the data to be evaluated under a data scale evaluation dimension, a data balance evaluation dimension, a data accuracy evaluation dimension, a data pollution evaluation dimension, and a data bias evaluation dimension.
The evaluation unit 33 may specifically be configured to perform quality assessment on the data to be evaluated based on the attribute characteristics under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension, to obtain the quality assessment result of the data to be evaluated under each of the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension.
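The division of labour among the three units can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `DimensionResult`, `scale_attribute`, `balance_attribute`, and `evaluate` are hypothetical, only two of the five dimensions are shown, and the statistics chosen for them (record count, smallest-to-largest class ratio) are assumptions, since this passage does not define them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A record is (feature vector, label category); this layout is an assumption.
Record = Tuple[Tuple[float, ...], str]


@dataclass
class DimensionResult:
    dimension: str    # name of the evaluation dimension
    attribute: float  # attribute characteristic computed by the statistics step
    passed: bool      # quality assessment result for this dimension


def scale_attribute(records: List[Record]) -> float:
    """Data scale: here simply the number of records (assumed statistic)."""
    return float(len(records))


def balance_attribute(records: List[Record]) -> float:
    """Data balance: ratio of the rarest to the most common label (assumed).

    Requires at least one record.
    """
    counts = Counter(label for _, label in records)
    return min(counts.values()) / max(counts.values())


def evaluate(
    records: List[Record],
    dimensions: Dict[str, Tuple[Callable[[List[Record]], float],
                                Callable[[float], bool]]],
) -> List[DimensionResult]:
    """Run the statistics step, then the evaluation step, per dimension."""
    results = []
    for name, (statistic, criterion) in dimensions.items():
        value = statistic(records)                 # role of statistics unit 32
        results.append(DimensionResult(name, value, criterion(value)))  # unit 33
    return results
```

Each dimension is passed in as a pair of callables so that further dimensions (accuracy, pollution, bias) could be added without changing `evaluate`; the pass criteria are supplied by the caller because the thresholds are presets.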
Further, to collect statistics on the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension when the data to be evaluated is structured data in a training data set, as shown in FIG. 4, the statistics unit 32 includes a fitting module 321, a prediction module 322, and a judgment module 323.
The fitting module 321 may be configured to fit, using a preset interpolation algorithm, a function curve corresponding to the structured data according to the structured data and its corresponding label categories.
The prediction module 322 may be configured to predict the structured data using the function curve, to obtain the predicted label category corresponding to the structured data.
The judgment module 323 may be configured to determine, based on the predicted label category and the label category corresponding to the structured data, whether the structured data is noise data.
On this basis, the evaluation unit 33 includes a first calculation module 331 and a first determining module 332.
The first calculation module 331 may be configured to separately count a first data volume corresponding to the noise data and a second data volume corresponding to the structured data, and to calculate, from the first data volume and the second data volume, a first data ratio between the noise data and the structured data.
The first determining module 332 may be configured to determine that the structured data does not meet the data pollution evaluation standard when the first data ratio is greater than a preset noise data ratio.
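The noise check described above can be sketched on a toy case. The sketch rests on several assumptions that the text does not state: each record has a single numeric feature, the label category is binary and encoded as 0 or 1, the "preset interpolation algorithm" is linear interpolation between a record's two neighbours (leave-one-out, so that a record cannot trivially predict itself), and a record counts as noise when the interpolated value deviates from its recorded label by more than `tolerance`. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, int]  # (feature value, binary label encoded as 0 or 1)


def interpolate_label(points: List[Point], i: int) -> float:
    """Estimate the label of points[i] from its neighbours (leave-one-out).

    `points` must be sorted by feature value and hold at least two entries.
    """
    x = points[i][0]
    if i == 0:                      # no left neighbour: copy the right one
        return float(points[1][1])
    if i == len(points) - 1:        # no right neighbour: copy the left one
        return float(points[-2][1])
    (x0, y0), (x1, y1) = points[i - 1], points[i + 1]
    if x1 == x0:                    # coincident neighbours: average their labels
        return (y0 + y1) / 2
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def find_noise(points: List[Point], tolerance: float = 0.5) -> List[Point]:
    """Return records whose label disagrees with the interpolated prediction."""
    ordered = sorted(points)
    return [
        p for i, p in enumerate(ordered)
        if abs(interpolate_label(ordered, i) - p[1]) > tolerance
    ]


def meets_pollution_standard(points: List[Point],
                             max_noise_ratio: float) -> Tuple[bool, float]:
    """Compare the noise ratio (first data ratio) with the preset ratio."""
    ratio = len(find_noise(points)) / len(points)
    return ratio <= max_noise_ratio, ratio
```

In `meets_pollution_standard`, the numerator corresponds to the first data volume (noise records) and the denominator to the second data volume (all structured records); the data set fails when the ratio exceeds the preset value.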
Further, to collect statistics on the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension when the data to be evaluated is unstructured data in a prediction data set, the statistics unit 32 further includes a compression module 324 and a second calculation module 325.
The compression module 324 may be configured to compress the unstructured data using a first preset compressor and a second preset compressor respectively, to obtain first compressed data and second compressed data corresponding to the unstructured data.
The prediction module 322 may be configured to perform prediction on the unstructured data, the first compressed data, and the second compressed data respectively, to obtain an original prediction result corresponding to the unstructured data, a first prediction result corresponding to the first compressed data, and a second prediction result corresponding to the second compressed data.
The second calculation module 325 may be configured to calculate a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result.
The judgment module 323 may be configured to determine, based on the first difference and the second difference, whether the unstructured data is adversarial data.
The first calculation module 331 may further be configured to separately count a third data volume corresponding to the adversarial data and a fourth data volume corresponding to the unstructured data, and to calculate, from the third data volume and the fourth data volume, a second data ratio between the adversarial data and the unstructured data.
The first determining module 332 may further be configured to determine that the unstructured data does not meet the data pollution evaluation standard if the second data ratio is greater than a preset adversarial data ratio.
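The two-compressor check can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: samples are short lists of 8-bit intensities, the first compressor is bit-depth reduction, the second is a 3-point median filter, the model is any callable returning a score in [0, 1], and a sample is flagged when the larger of the two prediction differences exceeds `threshold`. The text only says the decision is "based on" the two differences, so the max-rule is one plausible choice, not the patented rule.

```python
from statistics import median
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[int]            # toy unstructured sample: 8-bit intensities
Model = Callable[[Sample], float]  # hypothetical model returning a score in [0, 1]


def reduce_bit_depth(sample: Sample, keep_bits: int = 4) -> List[int]:
    """First compressor (assumed): keep only the top `keep_bits` bits."""
    mask = 0xFF & ~((1 << (8 - keep_bits)) - 1)
    return [p & mask for p in sample]


def median_smooth(sample: Sample) -> List[int]:
    """Second compressor (assumed): 3-point median filter, edges replicated."""
    padded = [sample[0], *sample, sample[-1]]
    return [int(median(padded[i:i + 3])) for i in range(len(sample))]


def prediction_differences(sample: Sample, model: Model) -> Tuple[float, float]:
    """Return (first difference, second difference) against the original."""
    original = model(sample)
    first = abs(original - model(reduce_bit_depth(sample)))
    second = abs(original - model(median_smooth(sample)))
    return first, second


def is_adversarial(sample: Sample, model: Model, threshold: float = 0.2) -> bool:
    """Flag the sample if either compressor shifts the prediction strongly."""
    return max(prediction_differences(sample, model)) > threshold


def adversarial_ratio(samples: Sequence[Sample], model: Model,
                      threshold: float = 0.2) -> float:
    """Second data ratio: flagged samples over all unstructured samples."""
    flagged = sum(is_adversarial(s, model, threshold) for s in samples)
    return flagged / len(samples)
```

The intuition behind the design is that compression should barely change the prediction for a clean sample, whereas a perturbation crafted to mislead the model tends not to survive it. The resulting ratio is then compared with the preset adversarial data ratio, as the first determining module 332 does.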
In a specific application scenario, to collect statistics on the attribute characteristics of the data to be evaluated under the data bias evaluation dimension when the data to be evaluated is structured data in a training data set, the statistics unit 32 further includes a second determining module 326, an exclusion module 327, and an analysis module 328.
The second determining module 326 may be configured to determine the features corresponding to the structured data.
The exclusion module 327 may be configured to preliminarily detect, using a preset bias corpus, the biased features among those features, and to exclude the biased features and their corresponding structured data from the features and from the training data set respectively, to obtain the remaining features and the remaining structured data.
The analysis module 328 may be configured to analyze, according to the remaining structured data, whether each of the remaining features is a biased feature.
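The preliminary screening step might look like the sketch below. It assumes the structured data is a list of rows keyed by feature name, that the preset bias corpus is a set of lower-case feature names, and that "excluding the corresponding structured data" means dropping the flagged columns; the function name `exclude_biased_features` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]  # one structured record: feature name -> value


def exclude_biased_features(
    rows: List[Row], bias_corpus: Set[str]
) -> Tuple[List[str], List[str], List[Row]]:
    """Screen feature names against a bias corpus.

    Returns (flagged features, remaining features, rows without the flagged
    columns). Matching is by lower-cased feature name (an assumption).
    """
    features = list(rows[0].keys()) if rows else []
    flagged = [f for f in features if f.lower() in bias_corpus]
    remaining = [f for f in features if f not in flagged]
    cleaned = [{f: row[f] for f in remaining} for row in rows]
    return flagged, remaining, cleaned
```

The remaining features and rows are what the analysis module 328 would then inspect for bias that a name-based screen cannot catch.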
In a specific application scenario, to analyze whether each of the remaining features is a biased feature, the analysis module 328 may be configured to: if the remaining structured data has corresponding label categories, combine each remaining feature with each label category to obtain multiple combination results; determine the feature values corresponding to each remaining feature and, according to the multiple combination results, analyze first data volume distribution information for each feature value under the different label categories; and determine, based on the first data volume distribution information, whether each of the remaining features is a biased feature.
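For labelled data, the distribution analysis can be sketched as a contingency count of feature values against label categories. The decision rule below is an assumption, since the text does not give one: a feature is treated as biased when, for some feature value, the share of any label category differs from that category's overall share by more than `max_gap`. The names `value_distribution_by_label` and `is_biased` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List

Row = Dict[str, str]  # one structured record, including its label category


def value_distribution_by_label(
    rows: List[Row], feature: str, label_key: str = "label"
) -> Dict[str, Counter]:
    """Count records per (feature value, label category) combination."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        table[row[feature]][row[label_key]] += 1
    return dict(table)


def is_biased(rows: List[Row], feature: str, label_key: str = "label",
              max_gap: float = 0.3) -> bool:
    """Assumed rule: some feature value skews a label share beyond `max_gap`."""
    overall = Counter(row[label_key] for row in rows)
    total = len(rows)
    table = value_distribution_by_label(rows, feature, label_key)
    for counts in table.values():
        group_size = sum(counts.values())
        for label, n in overall.items():
            if abs(counts[label] / group_size - n / total) > max_gap:
                return True
    return False
```

A feature whose values split evenly across the label categories passes, while one whose values line up almost perfectly with a single category is flagged.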
In a specific application scenario, to analyze whether each of the remaining features is a biased feature, the analysis module 328 may further be configured to: if the remaining structured data has no corresponding label categories, cluster the remaining structured data using a preset clustering algorithm to obtain structured data under different classes; determine the feature values corresponding to each remaining feature and analyze second data volume distribution information for each feature value under the different classes; and determine, based on the second data volume distribution information, whether each of the remaining features is a biased feature.
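When no label categories exist, cluster assignments take their place. The sketch below is only illustrative: it assumes the "preset clustering algorithm" is a tiny two-cluster k-means over a single numeric column, and it reuses the same assumed rule as for labelled data (a feature value whose cluster shares deviate from the overall cluster shares by more than `max_gap`). The names `two_means` and `is_biased_without_labels` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List

Row = Dict[str, str]  # one structured record without a label category


def two_means(values: List[float], iterations: int = 20) -> List[int]:
    """Assign each value to cluster 0 or 1 with a minimal 1-D k-means (k=2)."""
    centers = [min(values), max(values)]  # deterministic initialisation
    assignment = [0] * len(values)
    for _ in range(iterations):
        assignment = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        for c in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:  # keep the old centre if a cluster is empty
                centers[c] = sum(members) / len(members)
    return assignment


def is_biased_without_labels(rows: List[Row], feature: str, numeric_key: str,
                             max_gap: float = 0.3) -> bool:
    """Cluster on `numeric_key`, then compare feature-value shares per cluster."""
    clusters = two_means([float(r[numeric_key]) for r in rows])
    overall = Counter(clusters)
    table: Dict[str, Counter] = defaultdict(Counter)
    for row, cluster in zip(rows, clusters):
        table[row[feature]][cluster] += 1
    for counts in table.values():
        group_size = sum(counts.values())
        for cluster, n in overall.items():
            if abs(counts[cluster] / group_size - n / len(rows)) > max_gap:
                return True
    return False
```

A production system would more likely cluster on the full feature vector with an established library; the single-column version keeps the example self-contained.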
It should be noted that, for other corresponding descriptions of the functional modules of the data set quality assessment apparatus provided by this embodiment of the present invention, reference may be made to the corresponding description of the method shown in FIG. 1; they are not repeated here.
Based on the method shown in FIG. 1, an embodiment of the present invention correspondingly further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the following steps: obtaining the data to be evaluated in a data set; separately collecting statistics on the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality assessment result of the data to be evaluated under each of the multiple evaluation dimensions.
Based on the embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 3, an embodiment of the present invention further provides a physical structure diagram of a computer device. As shown in FIG. 5, the computer device includes a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor, where the memory 42 and the processor 41 are both arranged on a bus 43. When executing the program, the processor 41 implements the following steps: separately collecting statistics on the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality assessment result of the data to be evaluated under each of the multiple evaluation dimensions.
With the technical solution of the present invention, the data to be evaluated in a data set can be obtained; statistics on the attribute characteristics of the data to be evaluated under multiple evaluation dimensions can be collected separately; and, based on the attribute characteristics under the multiple evaluation dimensions, quality assessment can be performed on the data to be evaluated to obtain its quality assessment result under each of the evaluation dimensions. By collecting statistics on the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be assessed automatically from multiple evaluation dimensions, which improves the accuracy and efficiency of data set quality assessment and effectively ensures the security of data sets during artificial intelligence development.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be performed in an order different from that given here; alternatively, they can be fabricated as individual integrated circuit modules, or several of the modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A data set quality assessment method, characterized by comprising:
    obtaining data to be evaluated in a data set;
    separately collecting statistics on attribute characteristics of the data to be evaluated under multiple evaluation dimensions;
    performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain a quality assessment result of the data to be evaluated under each of the multiple evaluation dimensions.
  2. The method according to claim 1, characterized in that separately collecting statistics on the attribute characteristics of the data to be evaluated under multiple evaluation dimensions comprises:
    separately collecting statistics on the attribute characteristics of the data to be evaluated under a data scale evaluation dimension, a data balance evaluation dimension, a data accuracy evaluation dimension, a data pollution evaluation dimension, and a data bias evaluation dimension;
    and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality assessment result of the data to be evaluated under each of the multiple evaluation dimensions, comprises:
    performing quality assessment on the data to be evaluated based on the attribute characteristics under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension, to obtain the quality assessment result of the data to be evaluated under each of the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension, and the data bias evaluation dimension.
  3. The method according to claim 2, characterized in that the data to be evaluated is structured data in a training data set, and collecting statistics on the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension comprises:
    fitting, using a preset interpolation algorithm, a function curve corresponding to the structured data according to the structured data and its corresponding label categories;
    predicting the structured data using the function curve, to obtain a predicted label category corresponding to the structured data;
    determining, based on the predicted label category and the label category corresponding to the structured data, whether the structured data is noise data;
    and performing quality assessment on the data to be evaluated based on the attribute characteristics under the data pollution evaluation dimension, to obtain the quality assessment result of the data to be evaluated under the data pollution evaluation dimension, comprises:
    separately counting a first data volume corresponding to the noise data and a second data volume corresponding to the structured data, and calculating, from the first data volume and the second data volume, a first data ratio between the noise data and the structured data;
    if the first data ratio is greater than a preset noise data ratio, determining that the structured data does not meet the data pollution evaluation standard.
  4. The method according to claim 2, characterized in that the data to be evaluated is unstructured data in a prediction data set, and collecting statistics on the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension comprises:
    compressing the unstructured data using a first preset compressor and a second preset compressor respectively, to obtain first compressed data and second compressed data corresponding to the unstructured data;
    performing prediction on the unstructured data, the first compressed data, and the second compressed data respectively, to obtain an original prediction result corresponding to the unstructured data, a first prediction result corresponding to the first compressed data, and a second prediction result corresponding to the second compressed data;
    calculating a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result;
    determining, based on the first difference and the second difference, whether the unstructured data is adversarial data;
    and performing quality assessment on the data to be evaluated based on the attribute characteristics under the data pollution evaluation dimension, to obtain the quality assessment result of the data to be evaluated under the data pollution evaluation dimension, comprises:
    separately counting a third data volume corresponding to the adversarial data and a fourth data volume corresponding to the unstructured data, and calculating, from the third data volume and the fourth data volume, a second data ratio between the adversarial data and the unstructured data;
    if the second data ratio is greater than a preset adversarial data ratio, determining that the unstructured data does not meet the data pollution evaluation standard.
  5. The method according to claim 2, characterized in that the data to be evaluated is structured data in a training data set, and collecting statistics on the attribute characteristics of the data to be evaluated under the data bias evaluation dimension comprises:
    determining the features corresponding to the structured data;
    preliminarily detecting, using a preset bias corpus, the biased features among those features, and excluding the biased features and their corresponding structured data from the features and from the training data set respectively, to obtain the remaining features and the remaining structured data;
    analyzing, according to the remaining structured data, whether each of the remaining features is a biased feature.
  6. The method according to claim 5, characterized in that analyzing, according to the remaining structured data, whether each of the remaining features is a biased feature comprises:
    if the remaining structured data has corresponding label categories, combining each remaining feature with each label category to obtain multiple combination results;
    determining the feature values corresponding to each remaining feature and, according to the multiple combination results, analyzing first data volume distribution information for each feature value under the different label categories;
    determining, based on the first data volume distribution information, whether each of the remaining features is a biased feature.
  7. The method according to claim 5, characterized in that analyzing, according to the remaining structured data, whether each of the remaining features is a biased feature comprises:
    if the remaining structured data has no corresponding label categories, clustering the remaining structured data using a preset clustering algorithm to obtain structured data under different classes;
    determining the feature values corresponding to each remaining feature, and analyzing second data volume distribution information for each feature value under the different classes;
    determining, based on the second data volume distribution information, whether each of the remaining features is a biased feature.
  8. A data set quality assessment apparatus, characterized by comprising:
    an acquisition unit, configured to obtain data to be evaluated in a data set;
    a statistics unit, configured to separately collect statistics on attribute characteristics of the data to be evaluated under multiple evaluation dimensions;
    an evaluation unit, configured to perform quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain a quality assessment result of the data to be evaluated under each of the multiple evaluation dimensions.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2021/117109 2021-08-30 2021-09-08 Method and apparatus for evaluating data set quality, computer device, and storage medium WO2023029065A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110999774.XA CN113448955B (en) 2021-08-30 2021-08-30 Data set quality evaluation method and device, computer equipment and storage medium
CN202110999774.X 2021-08-30

Publications (1)

Publication Number Publication Date
WO2023029065A1 true WO2023029065A1 (en) 2023-03-09

Family

ID=77818805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117109 WO2023029065A1 (en) 2021-08-30 2021-09-08 Method and apparatus for evaluating data set quality, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN113448955B (en)
WO (1) WO2023029065A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764705A (en) * 2018-05-24 2018-11-06 国信优易数据有限公司 A kind of data quality accessment platform and method
CN108764707A (en) * 2018-05-24 2018-11-06 国信优易数据有限公司 A kind of data assessment system and method
CN110705607A (en) * 2019-09-12 2020-01-17 西安交通大学 Industry multi-label noise reduction method based on cyclic re-labeling self-service method
US20200081865A1 (en) * 2018-09-10 2020-03-12 Google Llc Rejecting Biased Data Using a Machine Learning Model
CN112463773A (en) * 2019-09-06 2021-03-09 佛山市顺德区美的电热电器制造有限公司 Data quality determination method and device
CN112506904A (en) * 2020-12-02 2021-03-16 深圳市酷开网络科技股份有限公司 Data quality evaluation method and device, terminal equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE333694T1 (en) * 2003-01-18 2006-08-15 Psytechnics Ltd TOOL FOR NON-INVASIVELY DETERMINING THE QUALITY OF A VOICE SIGNAL
WO2019236560A1 (en) * 2018-06-04 2019-12-12 The Regents Of The University Of California Pair-wise or n-way learning framework for error and quality estimation
CN108960087A (en) * 2018-06-20 2018-12-07 中国科学院重庆绿色智能技术研究院 A kind of quality of human face image appraisal procedure and system based on various dimensions evaluation criteria
CN110121110B (en) * 2019-05-07 2021-05-25 北京奇艺世纪科技有限公司 Video quality evaluation method, video quality evaluation apparatus, video processing apparatus, and medium
CN111339215A (en) * 2019-05-31 2020-06-26 北京东方融信达软件技术有限公司 Structured data set quality evaluation model generation method, evaluation method and device
CN110956224B (en) * 2019-08-01 2024-03-08 平安科技(深圳)有限公司 Evaluation model generation and evaluation data processing method, device, equipment and medium
CN111881705B (en) * 2019-09-29 2023-12-12 深圳数字生命研究院 Data processing, training and identifying method, device and storage medium
CN110956613B (en) * 2019-11-07 2023-04-07 成都傅立叶电子科技有限公司 Image quality-based target detection algorithm performance normalization evaluation method and system
CN111523785A (en) * 2020-04-16 2020-08-11 三峡大学 Power system dynamic security assessment method based on generation countermeasure network
CN111639850A (en) * 2020-05-27 2020-09-08 中国电力科学研究院有限公司 Quality evaluation method and system for multi-source heterogeneous data
CN111652381A (en) * 2020-06-04 2020-09-11 深圳前海微众银行股份有限公司 Data set contribution degree evaluation method, device and equipment and readable storage medium
CN112465041B (en) * 2020-12-01 2024-01-05 大连海事大学 AIS data quality assessment method based on analytic hierarchy process
CN113254599B (en) * 2021-06-28 2021-10-08 浙江大学 Multi-label microblog text classification method based on semi-supervised learning

Also Published As

Publication number Publication date
CN113448955B (en) 2021-12-07
CN113448955A (en) 2021-09-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21955596

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE