WO2023029065A1 - Data set quality evaluation method and apparatus, computer device and storage medium - Google Patents

Data set quality evaluation method and apparatus, computer device and storage medium

Info

Publication number
WO2023029065A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
evaluation
evaluated
structured
dimension
Prior art date
Application number
PCT/CN2021/117109
Other languages
English (en)
Chinese (zh)
Inventor
马影
周晓勇
魏国富
刘胜
夏玉明
Original Assignee
上海观安信息技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 上海观安信息技术股份有限公司
Publication of WO2023029065A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • the present invention relates to the field of information technology, in particular to a data set quality evaluation method, device, computer equipment and storage medium.
  • Data is the basis for the development and application of artificial intelligence, and data sets are crucial to artificial intelligence algorithms. Using data sets of different quality results in different model parameters after training and therefore different execution effects, which in turn affect the security of artificial intelligence algorithms. If criminals use attack methods to maliciously modify data sets or add data to them, the model will make prediction errors. Therefore, how to effectively detect and evaluate the quality of data sets has become an urgent problem to be solved in artificial intelligence security.
  • the invention provides a data set quality assessment method, device, computer equipment and storage medium, mainly aiming at improving the assessment accuracy and assessment efficiency of the data set quality.
  • a data set quality assessment method including:
  • a data set quality assessment device including:
  • an acquisition unit configured to acquire the data to be evaluated in the data set
  • a statistical unit configured to separately count attribute characteristics of the data to be evaluated under multiple evaluation dimensions
  • the evaluation unit is configured to perform quality evaluation on the data to be evaluated based on attribute characteristics under the multiple evaluation dimensions, and obtain quality evaluation results of the data to be evaluated under the multiple evaluation dimensions.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:
  • a computer device including a memory, a processor, and a computer program stored on the memory and operable on the processor, and the processor implements the following steps when executing the program:
  • compared with the current method of evaluating the quality of the data set by technicians based on their own experience, the data set quality evaluation method, device, computer equipment and storage medium provided by the present invention can obtain the data to be evaluated in the data set; separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and, based on the attribute characteristics under the multiple evaluation dimensions, perform quality assessment on the data to be evaluated and obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under the multiple evaluation dimensions, the quality of the data set can be automatically evaluated from multiple evaluation dimensions, which improves the evaluation accuracy and evaluation efficiency of the data set quality and effectively guarantees the safety of the data set in the process of artificial intelligence development.
  • FIG. 1 shows a flowchart of a data set quality assessment method provided by an embodiment of the present invention
  • FIG. 2 shows a flowchart of another data set quality assessment method provided by an embodiment of the present invention
  • FIG. 3 shows a schematic structural diagram of a data set quality assessment device provided by an embodiment of the present invention
  • FIG. 4 shows a schematic structural diagram of another data set quality assessment device provided by an embodiment of the present invention.
  • FIG. 5 shows a schematic diagram of a physical structure of a computer device provided by an embodiment of the present invention.
  • an embodiment of the present invention provides a data set quality assessment method, as shown in Figure 1, the method includes:
  • the data set includes a training data set and a prediction data set
  • the data to be evaluated may specifically be each sample data in the training data set, or each prediction data in the prediction data set.
  • the embodiment of the present invention develops a set of data set quality evaluation tools that can automatically evaluate the quality of the data set from multiple evaluation dimensions, thereby improving the evaluation accuracy and efficiency of the data set quality, and at the same time ensuring the safety of the data set in the process of artificial intelligence development.
  • the embodiments of the present invention are mainly applied to scenarios where the quality of a data set is evaluated in multiple dimensions.
  • the execution subject of the embodiment of the present invention is an apparatus or device capable of evaluating the quality of a data set, which may specifically be set on the server side.
  • the training data set and the prediction data set that need to be evaluated for quality are collected in advance, and the data in the training data set and the prediction data set may specifically be structured data or unstructured data, such as image data.
  • the technician can click the file upload button on the data set quality evaluation tool interface to upload the training data set or prediction data set to be evaluated to the data set quality evaluation tool, so that the data set quality evaluation tool can perform multi-dimensional quality assessment on the data set to be assessed.
  • the multiple assessment dimensions include the data scale assessment dimension, data balance assessment dimension, data accuracy assessment dimension, data pollution assessment dimension and data bias assessment dimension. It should be noted that the assessment dimensions in the embodiments of the present invention are not limited to the evaluation dimensions listed above; other evaluation dimensions can also be included, which can be set according to actual business needs.
  • the attribute characteristics of the data to be evaluated under the data size evaluation dimension include: the total amount of data, the number of features, the memory size occupied by the data, whether the data has labels, etc.; the attribute characteristics of the data to be evaluated under the data balance evaluation dimension include: the proportion of data volume under the various labels; the attribute characteristics of the data to be evaluated under the data accuracy evaluation dimension include: the total amount of data, the total amount of missing labels, whether the various labels are abnormal, etc.; the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension include: the amount of noise data, the amount of adversarial data, etc.; and the attribute characteristics of the data to be evaluated under the data bias evaluation dimension include the bias features corresponding to the data to be evaluated.
  • different statistical methods can be used to separately count the attribute characteristics of the data to be evaluated under the data scale evaluation dimension, data balance evaluation dimension, data accuracy evaluation dimension, data pollution evaluation dimension and data bias evaluation dimension.
  • the specific statistical methods adopted for different evaluation dimensions are different, see steps 202-206 for details.
  • the evaluation standards corresponding to different evaluation dimensions are different.
  • if the data to be evaluated does not meet the evaluation standard corresponding to any evaluation dimension, it is determined that the data to be evaluated has quality problems, the data set cannot be used to train or predict the model, and technicians need to re-collect the data set or perform data cleaning on the data set with quality problems.
  • the “Yes” label corresponds to 90% of the data volume, and the “No” label corresponds to 10%.
  • the data to be evaluated can also be quality-assessed from the dimensions of data scale evaluation, data accuracy evaluation, data pollution evaluation, and data bias evaluation. If the data to be evaluated does not meet the evaluation standards corresponding to the above dimensions, it is determined that the data to be evaluated has quality problems and cannot be used to train or predict the model. For the quality evaluation process of the different evaluation dimensions, see steps 202-206.
  • compared with the current method of assessing the quality of the data set by technicians based on their own experience, the data set quality assessment method provided by the embodiment of the present invention can obtain the data to be evaluated in the data set; separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and, based on the attribute characteristics under the multiple evaluation dimensions, perform quality evaluation on the data to be evaluated and obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions, the quality of the data set can be automatically evaluated from multiple evaluation dimensions, thereby improving the evaluation accuracy and evaluation efficiency of the data set quality and effectively guaranteeing the safety of the data set in the process of artificial intelligence development.
  • the embodiment of the present invention provides another data set quality assessment method, as shown in FIG. 2; the method includes:
  • the data set includes a training data set and a prediction data set
  • the data to be evaluated may specifically be each sample data in the training data set, or each prediction data in the prediction data set.
  • the specific method of obtaining the data set is exactly the same as that of step 101, and will not be repeated here.
  • step 202 specifically includes: according to the structured data and its corresponding label category, using a preset interpolation algorithm to fit a function curve corresponding to the structured data; using the function curve to predict the structured data to obtain the predicted label category corresponding to the structured data; and, based on the predicted label category and the label category corresponding to the structured data, determining whether the structured data is noise data.
  • the preset interpolation algorithm performs interpolation processing on the structured data.
  • the preset interpolation algorithm may specifically be a preset Kriging interpolation algorithm.
  • the classification result may specifically be a classification probability value corresponding to the structured data.
  • suppose the structured data with known classification results are x1, x2 and x3, and the structured data to be interpolated is x4. Since the classification result corresponding to the structured data x4 to be interpolated is unknown, the structured data x1, x2 and x3 with known classification results can be used to estimate the classification result corresponding to the structured data x4 to be interpolated. Specifically, the distances between the structured data x4 to be interpolated and the structured data x1, x2 and x3 with known classification results are calculated. The larger the distance, the farther a structured data with a known classification result is from the structured data to be interpolated and the smaller its impact on the structured data to be interpolated, so the corresponding data weight is smaller; conversely, the smaller the distance, the closer the structured data with a known classification result is to the structured data to be interpolated and the greater its impact on the structured data to be interpolated, so the corresponding data weight is greater.
  • the data weights corresponding to the structured data with known classification results are multiplied by their classification results, and the weighted results are combined to obtain the classification result corresponding to the structured data to be interpolated. Then, the structured data to be interpolated, whose classification result has now been determined, is inserted into the structured data with known classification results, thereby solving the problem of missing data.
  • the probability value of structured data A belonging to the real label category is 0.87
  • the probability value of belonging to the predicted label category is 0.27
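  • as an illustrative sketch of the noise-data check described above, the following Python code estimates each sample's label by inverse-distance weighting of the other labelled samples (a simplified stand-in for the preset Kriging interpolation) and flags a sample as noise data when the predicted label category differs from its recorded label category; the function names and the leave-one-out scheme are assumptions for illustration only:

```python
import numpy as np

def predict_label_probs(x, known_X, known_y, n_classes, eps=1e-12):
    """Estimate class probabilities for x by inverse-distance weighting of the
    labelled samples (simplified stand-in for Kriging interpolation).
    known_X: np.ndarray (n, d); known_y: np.ndarray of int labels."""
    distances = np.linalg.norm(known_X - x, axis=1)   # distance to every labelled sample
    weights = 1.0 / (distances + eps)                 # closer samples get larger weights
    weights /= weights.sum()
    probs = np.zeros(n_classes)
    for w, label in zip(weights, known_y):
        probs[label] += w                             # weighted vote per label category
    return probs

def find_noise_indices(X, y, n_classes):
    """Flag a sample as noise when the label predicted from all other samples
    disagrees with its recorded label (leave-one-out)."""
    noise = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        probs = predict_label_probs(X[i], X[mask], y[mask], n_classes)
        if probs.argmax() != y[i]:                    # predicted label != real label
            noise.append(i)
    return noise
```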
  • the method includes: separately counting the first data volume corresponding to the noise data and the second data volume corresponding to the structured data, and calculating a first data ratio between the noise data and the structured data according to the first data volume and the second data volume; if the first data ratio is greater than the preset noise data ratio, it is determined that the structured data does not meet the data pollution evaluation standard.
  • the preset noise data ratio may be set according to actual service requirements.
  • the preset noise data ratio is set to be 10%
  • the total amount of statistical noise data (first data volume) is 200
  • the total amount of structured data in the training data set (the second data amount) is 1000
  • the data ratio of 20% is greater than the preset noise data ratio of 10%, so it is determined that the training data set does not meet the data pollution evaluation criteria, that is, the training data set has quality problems and cannot be used for model training.
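  • a minimal sketch of the corresponding pollution check, using the figures from the example above (the helper name and the default 10% threshold are illustrative):

```python
def meets_noise_pollution_standard(noise_count, total_count, preset_noise_ratio=0.10):
    """The data pollution standard is met when the noise-to-data ratio
    does not exceed the preset noise data ratio."""
    return (noise_count / total_count) <= preset_noise_ratio

# 200 noise samples out of 1000 structured samples -> 20% > 10%, standard not met
print(meets_noise_pollution_standard(200, 1000))  # False
```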
  • the data to be evaluated is unstructured data in the prediction data set
  • it is necessary to detect whether there is adversarial data in the data to be evaluated, that is, to detect whether there is data maliciously crafted by an attacker, because once such adversarial data is mixed into the prediction data set, it will directly affect the prediction results of the model.
  • step 202 specifically includes: using the first preset compressor and the second preset compressor respectively to compress the unstructured data to obtain first compressed data and second compressed data corresponding to the unstructured data; respectively predicting the unstructured data, the first compressed data and the second compressed data to obtain the original prediction result corresponding to the unstructured data, the first prediction result corresponding to the first compressed data, and the second prediction result corresponding to the second compressed data; respectively calculating a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result; and, based on the first difference and the second difference, determining whether the unstructured data is adversarial data.
  • the first preset compressor and the second preset compressor can compress the features in the unstructured data to reduce the input of unnecessary features and reduce the dimension corresponding to the unstructured data.
  • the features compressed by the first preset compressor are different from those compressed by the second preset compressor. For example, if the input unstructured data includes 10 features, that is, the input dimension corresponding to the unstructured data is 10, the first preset compressor can compress the first feature and the second feature in the unstructured data, and the second preset compressor can compress the third feature and the fourth feature in the unstructured data.
  • the number of compressors used in the embodiment of the present invention is not limited to two, and the number of compressors can be set according to actual service requirements and the number of features.
  • the original prediction result, the first prediction result and the second prediction result in the embodiment of the present invention are the probability values that the unstructured data belongs to the corresponding label category.
  • for example, the unstructured data A is respectively input into the first preset compressor and the second preset compressor for feature compression processing to obtain the first compressed data and the second compressed data corresponding to the unstructured data A, and then the unstructured data A, the first compressed data and the second compressed data are respectively input into the built model for prediction. The original prediction result corresponding to the unstructured data A is 0.78, the first prediction result corresponding to the first compressed data is 0.56, and the second prediction result corresponding to the second compressed data is 0.63. Further, the first prediction result is subtracted from the original prediction result to obtain the first difference of 0.22, and the second prediction result is subtracted from the original prediction result to obtain the second difference of 0.15. Then the maximum difference is selected from the first difference and the second difference and compared with the preset difference: if the maximum difference is greater than the preset difference, it is determined that the unstructured data A is adversarial data; if the maximum difference is less than the preset difference, it is determined that the unstructured data A is not adversarial data. For example, if the preset difference is set to 0.2, since the maximum difference 0.22 is greater than the preset difference 0.2, it is determined that the unstructured data is adversarial data. All adversarial data present in the prediction data set can thus be determined in the manner described above.
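  • the following sketch illustrates the two-compressor check in simplified form; here each "compressor" merely zeroes out a different subset of features (an assumption, since the disclosure does not specify the compressor internals), `predict` is any callable returning a probability, and a sample is treated as adversarial when the largest prediction shift exceeds the preset difference:

```python
import numpy as np

def compress(x, keep_mask):
    """Toy feature 'compressor': keep the selected features, zero out the rest."""
    return np.where(keep_mask, x, 0.0)

def is_adversarial(x, predict, mask1, mask2, preset_diff=0.2):
    """Flag x as adversarial when compressing different feature subsets shifts
    the model's predicted probability by more than preset_diff."""
    original = predict(x)                    # original prediction result
    first = predict(compress(x, mask1))      # first prediction result
    second = predict(compress(x, mask2))     # second prediction result
    diff1 = abs(original - first)            # e.g. 0.78 - 0.56 = 0.22
    diff2 = abs(original - second)           # e.g. 0.78 - 0.63 = 0.15
    return max(diff1, diff2) > preset_diff   # 0.22 > 0.2 -> adversarial
```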
  • the method includes: separately counting the third data volume corresponding to the adversarial data and the fourth data volume corresponding to the unstructured data, and calculating a second data ratio between the adversarial data and the unstructured data according to the third data volume and the fourth data volume; if the second data ratio is greater than the preset adversarial data ratio, it is determined that the unstructured data does not meet the data pollution evaluation standard.
  • the preset adversarial data ratio can be set according to actual business requirements.
  • step 203 specifically includes: determining each feature corresponding to the structured data; using the preset bias corpus to preliminarily detect the bias features existing among the various features, and respectively excluding the bias features and their corresponding structured data from the various features and the training data set, to obtain the excluded features and the excluded structured data; and, according to the excluded structured data, analyzing whether each excluded feature is a bias feature.
  • the preset bias corpus stores a large number of bias features, such as gender, age, region, income, etc.
  • the features corresponding to the structured data in the training data set include: education level, work, income, medical history, and gender. The above features corresponding to the structured data are matched against each feature in the preset bias corpus; through matching, it can be found that, among the features corresponding to the structured data, the income feature and the gender feature are bias features. In order to improve the detection accuracy of bias features, it is necessary to further analyze whether the other features are bias features, so the structured data corresponding to the bias features are excluded from the structured data, and the excluded structured data is then used to analyze whether the remaining features are bias features.
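  • a minimal sketch of the preliminary corpus match and exclusion described above (the corpus contents, helper names and column representation are assumptions for illustration):

```python
# Hypothetical bias corpus; in practice it would be a preset resource.
BIAS_CORPUS = {"gender", "age", "region", "income"}

def split_bias_features(features, columns):
    """Match features against the bias corpus, then drop the matched columns so the
    remaining features can be analysed with the excluded structured data."""
    bias_features = [f for f in features if f in BIAS_CORPUS]
    remaining_features = [f for f in features if f not in BIAS_CORPUS]
    excluded_data = {f: columns[f] for f in remaining_features}   # bias columns removed
    return bias_features, remaining_features, excluded_data

features = ["education level", "work", "income", "medical history", "gender"]
columns = {f: [] for f in features}   # placeholder column data
# -> bias_features: ['income', 'gender']; remaining: ['education level', 'work', 'medical history']
```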
  • the method includes: if the excluded structured data has corresponding label categories, combining each excluded feature with each label category to obtain multiple combination results; determining the feature values corresponding to each excluded feature and, according to the multiple combination results, analyzing the first data volume distribution information corresponding to each feature value under the different label categories; and, based on the first data volume distribution information, determining whether each excluded feature is a biased feature.
  • the label categories of the excluded structured data include “yes” and “no”
  • the excluded features include "education level" and "work".
  • by combining the above features with the label categories, "education level-Yes", "education level-No", "work-Yes" and "work-No" are obtained. It is then determined that the feature values corresponding to education level include above undergraduate, undergraduate and below undergraduate, and that the feature values corresponding to work include working and not working. Further, among the structured data whose label category is "Yes", the amounts of data whose education level is above undergraduate, undergraduate and below undergraduate are first analyzed.
  • for example, the counted data volumes for the three education levels are 1000, 200 and 800 people respectively, and the total amount of structured data with the label category "Yes" is 2000 people.
  • the proportions of structured data with education level above undergraduate, undergraduate and below undergraduate are therefore 50%, 10% and 40% respectively. Because the difference between these proportions reaches 40%, which exceeds the preset proportion of 20%, it can be determined that the education level feature is a biased feature. In this way, it can be determined whether the remaining features corresponding to the labeled structured data are biased features.
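  • the distribution analysis above can be sketched as follows; the 20% preset proportion comes from the example, while the function name and data layout are assumptions:

```python
from collections import Counter

def is_biased_feature(values, labels, preset_gap=0.20):
    """Within each label category, compute the share of records holding each feature
    value; flag the feature as biased when the largest gap between shares exceeds
    the preset proportion."""
    for label in set(labels):
        subset = [v for v, l in zip(values, labels) if l == label]
        shares = [count / len(subset) for count in Counter(subset).values()]
        if max(shares) - min(shares) > preset_gap:
            return True
    return False

# Education level within the "Yes" label: 1000, 200 and 800 of 2000 records
values = ["above"] * 1000 + ["undergraduate"] * 200 + ["below"] * 800
labels = ["Yes"] * 2000
print(is_biased_feature(values, labels))  # True: 0.50 - 0.10 = 0.40 > 0.20
```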
  • the method further includes: if the excluded structured data does not have corresponding label categories, using the preset clustering algorithm to cluster the excluded structured data to obtain structured data under different classifications; determining the feature values corresponding to each excluded feature, and analyzing the second data volume distribution information corresponding to each feature value under the different classifications; and, based on the second data volume distribution information, determining whether each excluded feature is a bias feature.
  • the preset clustering algorithm may specifically be a DBSCAN clustering algorithm.
  • the excluded structured data may not have corresponding label categories.
  • label categories and features cannot be combined to analyze the distribution of the first data volume corresponding to each feature value under different label categories.
  • you can cluster the excluded structured data and then analyze the distribution of the second data volume corresponding to each feature value under the different categories. In the process of clustering the excluded structured data with the DBSCAN clustering algorithm, first set the neighborhood radius corresponding to the structured data and the threshold on the amount of structured data within a neighborhood; then arbitrarily select a structured data A, calculate the distance from structured data A to each other structured data, and determine the structured data B, C and D included in the neighborhood corresponding to structured data A. If the amount of structured data included in the neighborhood corresponding to structured data A is greater than the structured data volume threshold, structured data A is determined to be a core point, and a cluster C1 is built with structured data A as the core point by finding all points that are density-reachable from structured data A. For example, structured data B is density-reachable from structured data A and structured data E is density-reachable from structured data B, so structured data E is density-reachable from structured data A, that is, structured data E also belongs to C1. According to the above method, all the structured data in cluster C1 can be found, and the search then continues among the other data in the excluded structured data; in this way cluster C2, and further clusters, can be obtained, and the clustering of the excluded structured data is completed. After the excluded structured data has been divided into multiple categories, the distribution of the second data volume corresponding to each feature value under the different classifications is analyzed; the method of analyzing the second data volume distribution is the same as that for the first data volume distribution and will not be repeated here.
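  • a hedged sketch of the unlabeled case: DBSCAN (with an assumed neighborhood radius and minimum neighborhood size) replaces the label categories with cluster assignments, after which the same distribution check is reused per cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def biased_feature_without_labels(X, values, eps=0.5, min_samples=5, preset_gap=0.20):
    """Cluster the excluded structured data with DBSCAN, then check whether the
    feature-value distribution is skewed within any cluster."""
    clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.asarray(X))
    for c in set(clusters):
        if c == -1:                          # -1 marks points DBSCAN leaves unclustered
            continue
        subset = [v for v, cl in zip(values, clusters) if cl == c]
        shares = [subset.count(v) / len(subset) for v in set(subset)]
        if max(shares) - min(shares) > preset_gap:
            return True                      # feature distribution is skewed in this cluster
    return False
```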
  • the attribute characteristics of the data to be evaluated under the data size evaluation dimension include the total amount, the number of features, the memory size occupied by the data, and whether the data has labels, etc.
  • the amount of data corresponding to the data to be evaluated is 300
  • the occupied memory size is 11.3KB
  • the number of features is 13, and the data has labels.
  • the scale evaluation of the data to be evaluated is required.
  • for different models, the standards for data size evaluation are different. For example, for translation models, if the total amount of data corresponding to the data to be evaluated is less than 10 million, it is determined that the data to be evaluated does not meet the data size evaluation standard, that is, this data set cannot be directly used for training or prediction, and the amount of data needs to be increased.
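  • the data size attributes listed above can be collected with a few lines of Python; the pandas-based layout and the label column name are assumptions, and the minimum-row threshold would be chosen per model (e.g. 10 million rows for a translation model):

```python
import pandas as pd

def size_attributes(df: pd.DataFrame, label_column="label"):
    """Attribute characteristics under the data size evaluation dimension."""
    return {
        "total_rows": len(df),                                  # total amount of data
        "feature_count": df.shape[1],                           # number of features
        "memory_kb": df.memory_usage(deep=True).sum() / 1024,   # memory occupied by the data
        "has_labels": label_column in df.columns,               # whether the data has labels
    }

def meets_size_standard(attrs, min_rows):
    return attrs["total_rows"] >= min_rows   # e.g. min_rows = 10_000_000 for translation
```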
  • the balance of the data set is an important factor affecting the effect of the artificial intelligence algorithm.
  • the more uniform the data set, the smaller the distribution deviation of the data to be evaluated, and the better the artificial intelligence algorithm performs.
  • the larger the distribution deviation of the data to be evaluated, the worse the artificial intelligence algorithm performs.
  • if the label categories corresponding to the data to be evaluated include "Yes" and "No", respectively count the data volume with the label category "Yes" and the data volume with the label category "No", and calculate the proportion of data volume under the different label categories. For example, the proportion of data volume corresponding to the "Yes" label is calculated to be 90% and the proportion corresponding to the "No" label to be 10%, so the difference between the data volume proportions of the two label categories reaches 80%.
  • this proportion difference is greater than the preset data volume proportion difference of 60%, so it is determined that the data to be evaluated does not meet the data balance evaluation standard.
  • the attribute characteristics of the data to be evaluated under the data accuracy evaluation dimension include the total amount of data, the total amount of missing labels, and whether the various labels are abnormal. For example, the total amount of data with missing labels in the data to be evaluated is counted, and the ratio of the total amount of data with missing labels to the total amount of data to be evaluated is calculated; if the ratio is greater than the preset ratio, it is determined that the data to be evaluated does not meet the data accuracy evaluation standard. As another example, the data volumes corresponding to the labels "Yes" and "No" are counted; if the data volume under a certain label is less than the preset data volume, it is determined that the label is abnormal, and it can then be determined that there are abnormal labels in the data to be evaluated, so the data to be evaluated does not meet the data accuracy evaluation standard.
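  • the balance and accuracy checks described above reduce to simple label statistics; the sketch below uses the 60% proportion-difference threshold from the balance example, while the missing-label ratio and minimum per-label count are illustrative assumptions:

```python
from collections import Counter

def meets_balance_standard(labels, preset_gap=0.60):
    """Data balance: fail when the gap between label-category proportions is too large
    (e.g. 90% "Yes" vs 10% "No" gives a gap of 0.8 > 0.6)."""
    shares = [c / len(labels) for c in Counter(labels).values()]
    return max(shares) - min(shares) <= preset_gap

def meets_accuracy_standard(labels, preset_missing_ratio=0.05, preset_min_per_label=10):
    """Data accuracy: fail when too many labels are missing or when a label category
    has so few samples that it is treated as abnormal."""
    missing = sum(1 for l in labels if l is None)
    if missing / len(labels) > preset_missing_ratio:
        return False
    counts = Counter(l for l in labels if l is not None)
    return all(c >= preset_min_per_label for c in counts.values())
```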
  • the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions are obtained, and then a quality evaluation report corresponding to the data to be evaluated is generated for reference by technical personnel. It should be noted that the execution order of the above steps 202-206 is not limited to the order shown in FIG. 2; the steps can also be executed in parallel, which is not limited by the present invention.
  • compared with the current method of assessing the quality of the data set by technicians based on their own experience, this further data set quality assessment method provided by the embodiment of the present invention can obtain the data to be evaluated in the data set; separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and, based on the attribute characteristics under the multiple evaluation dimensions, perform quality evaluation on the data to be evaluated and obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under the multiple evaluation dimensions, the quality of the data set can be automatically evaluated from multiple evaluation dimensions, thereby improving the evaluation accuracy and evaluation efficiency of the data set quality and effectively ensuring the safety of data sets in the process of artificial intelligence development.
  • an embodiment of the present invention provides a data set quality assessment device. As shown in FIG. 3 , the device includes: an acquisition unit 31 , a statistical unit 32 and an evaluation unit 33 .
  • the obtaining unit 31 may be used to obtain the data to be evaluated in the data set.
  • the statistics unit 32 may be used to separately calculate attribute characteristics of the data to be evaluated under multiple evaluation dimensions.
  • the evaluation unit 33 may be configured to perform quality evaluation on the data to be evaluated based on attribute characteristics under the multiple evaluation dimensions, and obtain quality evaluation results of the data to be evaluated under the multiple evaluation dimensions respectively .
  • the statistical unit 32 can specifically be used to separately count the attribute characteristics of the data to be evaluated under the data scale evaluation dimension, data balance evaluation dimension, data accuracy evaluation dimension, data pollution evaluation dimension, and data bias evaluation dimension.
  • the evaluation unit 33 can be specifically configured to perform quality evaluation on the data to be evaluated based on the attribute characteristics under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension and the data bias evaluation dimension, and to obtain the quality evaluation results of the data to be evaluated under the data scale evaluation dimension, the data balance evaluation dimension, the data accuracy evaluation dimension, the data pollution evaluation dimension and the data bias evaluation dimension.
  • the statistical unit 32 includes: a fitting module 321 , a prediction module 322 and a judgment module 323 .
  • the fitting module 321 can be configured to use a preset interpolation algorithm to fit a function curve corresponding to the structured data according to the structured data and its corresponding tag category.
  • the prediction module 322 may be configured to use the function curve to predict the structured data to obtain the predicted label category corresponding to the structured data.
  • the determination module 323 may be configured to determine whether the structured data is noise data based on the predicted label category and the label category corresponding to the structured data.
  • the evaluation unit 33 includes: a first calculation module 331 and a first determination module 332 .
  • the first calculation module 331 may be configured to separately count the first data volume corresponding to the noise data and the second data volume corresponding to the structured data, and, according to the first data volume and the second data volume, calculate a first data ratio between the noise data and the structured data.
  • the first determining module 332 may be configured to determine that the structured data does not satisfy the data pollution evaluation standard when the first data ratio is greater than a preset noise data ratio.
  • in order to count the attribute characteristics of the data to be evaluated under the data pollution evaluation dimension, the statistical unit 32 further includes: a compression module 324 and a second calculation module 325.
  • the compression module 324 may be configured to compress the unstructured data by using the first preset compressor and the second preset compressor respectively, to obtain the first compressed data and the second compressed data corresponding to the unstructured data. Two compressed data.
  • the prediction module 322 may be configured to respectively predict the unstructured data, the first compressed data, and the second compressed data to obtain an original prediction result corresponding to the unstructured data, and the first A first prediction result corresponding to the compressed data, and a second prediction result corresponding to the second compressed data.
  • the second calculation module 325 may be configured to respectively calculate a first difference between the original prediction result and the first prediction result, and a second difference between the original prediction result and the second prediction result.
  • the determining module 323 may be configured to determine whether the unstructured data is adversarial data based on the first difference and the second difference.
  • the first calculation module 331 can also be used to separately count the third data volume corresponding to the adversarial data and the fourth data volume corresponding to the unstructured data, and, according to the third data volume and the fourth data volume, calculate a second data ratio between the adversarial data and the unstructured data.
  • the first determining module 332 may also be configured to determine that the unstructured data does not meet the data pollution evaluation standard if the second data ratio is greater than the preset adversarial data ratio.
  • in order to count the attribute characteristics of the data to be evaluated under the data bias evaluation dimension, the statistics unit 32 further includes: a second determination module 326, an exclusion module 327 and an analysis module 328.
  • the second determination module 326 may be used to determine each feature corresponding to the structured data.
  • the exclusion module 327 can be used to preliminarily detect the bias features among the various features by using the preset bias corpus, and to exclude the bias features and their corresponding structured data from the various features and the training data set, to obtain the excluded features and the excluded structured data.
  • the analysis module 328 can be configured to analyze whether each feature after exclusion is a bias feature according to the structured data after exclusion.
  • the analysis module 328 can be used to, if the excluded structured data has corresponding label categories, combine each excluded feature with each label category to obtain multiple combination results; determine the feature value corresponding to each excluded feature and, according to the multiple combination results, analyze the first data volume distribution information corresponding to each feature value under the different label categories; and, based on the first data volume distribution information, determine whether each excluded feature is a biased feature.
  • the analysis module 328 can also be used to, if the excluded structured data does not have corresponding label categories, use the preset clustering algorithm to cluster the excluded structured data to obtain structured data under different classifications; determine the feature values corresponding to each excluded feature and analyze the second data volume distribution information corresponding to each feature value under the different classifications; and, based on the second data volume distribution information, determine whether each excluded feature is a biased feature.
  • an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented: obtaining the data to be evaluated in the data set; respectively counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and performing quality assessment on the data to be evaluated based on the attribute characteristics under the multiple evaluation dimensions, to obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions.
  • the embodiment of the present invention also provides a physical structure diagram of a computer device, as shown in FIG. 5; the computer device includes: a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are both arranged on the bus 43.
  • when the processor 41 executes the program, the following steps are implemented: respectively counting the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and, based on the attribute characteristics under the multiple evaluation dimensions, performing quality assessment on the data to be evaluated to obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions respectively.
  • the present invention can obtain the data to be evaluated in the data set; separately count the attribute characteristics of the data to be evaluated under multiple evaluation dimensions; and, based on the attribute characteristics under the multiple evaluation dimensions, perform quality assessment on the data to be evaluated and obtain the quality evaluation results of the data to be evaluated under the multiple evaluation dimensions. By counting the attribute characteristics of the data to be evaluated under the multiple evaluation dimensions, the quality of the data set can be automatically evaluated from multiple evaluation dimensions, thereby improving the evaluation accuracy and efficiency of the data set quality and effectively ensuring the safety of the data set in the process of artificial intelligence development.
  • each module or each step of the above-mentioned present invention can be realized by a general-purpose computing device; they can be concentrated on a single computing device or distributed in a network formed by multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device to be executed by a computing device, and in some cases the steps shown or described may be carried out in an order different from that shown here; or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module for implementation.
  • the present invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of information technology, and discloses a data set quality evaluation method and apparatus, a computer device and a storage medium, mainly aiming at improving the accuracy and efficiency of data set quality evaluation. The method comprises: acquiring the data to be evaluated in a data set; compiling attribute characteristics of said data under a plurality of evaluation dimensions; and, on the basis of the attribute characteristics under the plurality of evaluation dimensions, evaluating the quality of said data and obtaining quality evaluation results of said data under the plurality of evaluation dimensions. The present invention is applicable to the evaluation of data set quality.
PCT/CN2021/117109 2021-08-30 2021-09-08 Data set quality evaluation method and apparatus, computer device and storage medium WO2023029065A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110999774.XA CN113448955B (zh) 2021-08-30 2021-08-30 Data set quality evaluation method, apparatus, computer device and storage medium
CN202110999774.X 2021-08-30

Publications (1)

Publication Number Publication Date
WO2023029065A1 (fr)

Family

ID=77818805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117109 WO2023029065A1 (fr) 2021-08-30 2021-09-08 Procédé et appareil d'évaluation de qualité d'ensemble de données, dispositif informatique et support de stockage

Country Status (2)

Country Link
CN (1) CN113448955B (fr)
WO (1) WO2023029065A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118211040A (zh) * 2024-05-17 2024-06-18 全拓科技(杭州)股份有限公司 一种用于大数据分析的数据质量评估分析方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118297444A (zh) * 2024-06-06 2024-07-05 中国信息通信研究院 一种面向人工智能的数据集质量通用评估方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764705A (zh) * 2018-05-24 2018-11-06 国信优易数据有限公司 一种数据质量评估平台以及方法
CN108764707A (zh) * 2018-05-24 2018-11-06 国信优易数据有限公司 一种数据评估系统以及方法
CN110705607A (zh) * 2019-09-12 2020-01-17 西安交通大学 一种基于循环重标注自助法的行业多标签降噪方法
US20200081865A1 (en) * 2018-09-10 2020-03-12 Google Llc Rejecting Biased Data Using a Machine Learning Model
CN112463773A (zh) * 2019-09-06 2021-03-09 佛山市顺德区美的电热电器制造有限公司 数据质量确定方法及装置
CN112506904A (zh) * 2020-12-02 2021-03-16 深圳市酷开网络科技股份有限公司 数据质量评估方法、装置、终端设备以及存储介质

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE333694T1 (de) * 2003-01-18 2006-08-15 Psytechnics Ltd Werkzeug zur nicht invasiven bestimmung der qualität eines sprachsignals
WO2019236560A1 (fr) * 2018-06-04 2019-12-12 The Regents Of The University Of California Cadre d'apprentissage par paire ou à n voies pour une estimation d'erreur et de qualité
CN108960087A (zh) * 2018-06-20 2018-12-07 中国科学院重庆绿色智能技术研究院 一种基于多维度评估标准的人脸图像质量评估方法及系统
CN110121110B (zh) * 2019-05-07 2021-05-25 北京奇艺世纪科技有限公司 视频质量评估方法、设备、视频处理设备及介质
CN111339215A (zh) * 2019-05-31 2020-06-26 北京东方融信达软件技术有限公司 结构化数据集质量评价模型生成方法、评价方法及装置
CN110956224B (zh) * 2019-08-01 2024-03-08 平安科技(深圳)有限公司 评估模型生成、评估数据处理方法、装置、设备及介质
CN111881705B (zh) * 2019-09-29 2023-12-12 深圳数字生命研究院 数据处理、训练、识别方法、装置和存储介质
CN110956613B (zh) * 2019-11-07 2023-04-07 成都傅立叶电子科技有限公司 基于图像质量的目标检测算法性能归一化评价方法及系统
CN111523785A (zh) * 2020-04-16 2020-08-11 三峡大学 一种基于生成对抗网络的电力系统动态安全评估方法
CN111639850A (zh) * 2020-05-27 2020-09-08 中国电力科学研究院有限公司 多源异构数据的质量评估方法与系统
CN111652381A (zh) * 2020-06-04 2020-09-11 深圳前海微众银行股份有限公司 数据集贡献度评估方法、装置、设备及可读存储介质
CN112465041B (zh) * 2020-12-01 2024-01-05 大连海事大学 一种基于层次分析法的ais数据质量评估方法
CN113254599B (zh) * 2021-06-28 2021-10-08 浙江大学 一种基于半监督学习的多标签微博文本分类方法

Also Published As

Publication number Publication date
CN113448955B (zh) 2021-12-07
CN113448955A (zh) 2021-09-28

Similar Documents

Publication Publication Date Title
CN107633265B (zh) 用于优化信用评估模型的数据处理方法及装置
KR102061987B1 (ko) 위험 평가 방법 및 시스템
WO2019214248A1 (fr) Procédé et appareil d'évaluation de risque, dispositif terminal et support d'informations
WO2019184557A1 (fr) Procédé et dispositif de localisation d'une alarme de cause profonde, et support de stockage lisible par ordinateur
Kočišová et al. Discriminant analysis as a tool for forecasting company's financial health
WO2023029065A1 (fr) Procédé et appareil d'évaluation de qualité d'ensemble de données, dispositif informatique et support de stockage
TW201734837A (zh) 一種多重抽樣模型訓練方法及裝置
CN106549813A (zh) 一种网络性能的评估方法及系统
CN113298373B (zh) 一种金融风险评估方法、装置、存储介质和设备
CN107679734A (zh) 一种用于无标签数据分类预测的方法和系统
WO2022199185A1 (fr) Procédé d'inspection d'opération d'utilisateur et produit de programme
CN110348516B (zh) 数据处理方法、装置、存储介质及电子设备
EP4141693A1 (fr) Procédé et dispositif pour obtenir un ensemble de données généré avec un biais prédéterminé pour évaluer l'équité algorithmique d'un modèle d'apprentissage machine
WO2018036402A1 (fr) Procédé et dispositif permettant de déterminer une variable clé dans un modèle
CN114785616A (zh) 数据风险检测方法、装置、计算机设备及存储介质
CN115545103A (zh) 异常数据识别、标签识别方法和异常数据识别装置
CN115619430A (zh) 用户价值评估方法及装置
CN115237970A (zh) 数据预测方法、装置、设备、存储介质及程序产品
CN108629506A (zh) 风控模型的建模方法、装置、计算机设备和存储介质
CN111815442B (zh) 一种链接预测的方法、装置和电子设备
CN114936204A (zh) 一种特征筛选方法、装置、存储介质及电子设备
CN110570301B (zh) 风险识别方法、装置、设备及介质
CN114626940A (zh) 数据分析方法、装置及电子设备
CN111552814B (zh) 基于考核指标图谱的考核方案生成方法及装置
Wang et al. A knowledge discovery case study of software quality prediction: Isbsg database

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21955596; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21955596; Country of ref document: EP; Kind code of ref document: A1)