CN113515577A - Data preprocessing method and device - Google Patents

Publication number
CN113515577A
Authority
CN
China
Prior art keywords
feature
data
sample data
sample
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110789650.9A
Other languages
Chinese (zh)
Inventor
郝芳
李策
刘晏萁
杨晓然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110789650.9A
Publication of CN113515577A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a data preprocessing method and device, which can also be used in the financial field. The method comprises: performing feature classification on collected sample data, and determining the number of feature types and the overall feature missing rate of the sample data; performing feature exploration and anomaly detection on the sample data of each feature type, and determining the feature distribution, feature null-value rate, feature correlation and abnormal samples of the sample data of each feature type; and generating a data exploration summary, a data exploration result report and an anomaly detection result report for the sample data according to the number of feature types, the overall feature missing rate, and at least one of the feature distribution, the feature null-value rate, the feature correlation and the sample clustering result of the sample data of each feature type. The method and device can effectively improve the efficiency and accuracy of data preprocessing.

Description

Data preprocessing method and device
Technical Field
The application relates to the field of data processing and may also be applied in the field of finance; in particular, it relates to a data preprocessing method and device.
Background
Data preprocessing refers to the process of cleaning data according to the results of data exploration and anomaly detection. It is a prerequisite for data analysis and data modeling; fig. 10 shows a flow diagram of a data preprocessing method in the prior art.
A comprehensive and standardized data preprocessing process is necessary. On the one hand, during data analysis, comprehensive exploration of features and samples makes full use of the value of the data and avoids one-sided conclusions and wasted information. On the other hand, during data modeling, the data determines the upper limit of the model: features with large skewness or severely asymmetric distributions, and abnormal samples that lie far from the rest of the population, interfere with model fitting and degrade the accuracy, operating efficiency and stability of the model.
However, data preprocessing in real projects is often incomplete and unverified. Data exploration and anomaly detection are carried out by analysts according to experience and depend entirely on the analyst's knowledge and initiative. Some projects preprocess only part of the features, or skip the preprocessing process entirely and proceed directly to analysis and modeling. In addition, there is no supervision mechanism, so projects often suffer from excessive dirty data, data that deviates from reality, one-dimensional data analysis reports, and models with poor interpretability or poor stability, all of which greatly reduce the effectiveness of analysis and modeling.
The market lacks an automated, general-purpose preprocessing framework, and granted patents mostly focus on specific preprocessing dimensions in narrow fields, for example abnormal sample detection for the advertising field only. Data cleaning and transformation are closely tied to the actual business, but the content of data exploration and anomaly detection is essentially the same across fields: feature distribution exploration, missing value statistics, outlier judgment, abnormal sample detection and so on. Nevertheless, a technical solution that focuses on data exploration and anomaly detection in the data preprocessing stage is lacking.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a data preprocessing method and device, which can effectively improve the efficiency and accuracy of data preprocessing.
In order to solve at least one of the above problems, the present application provides the following technical solutions:
in a first aspect, the present application provides a data preprocessing method, including:
carrying out feature classification on the collected sample data, and determining the number of feature types and the overall feature missing rate of the sample data;
performing feature exploration and anomaly detection on the sample data of each feature type in the sample data, and determining feature distribution, a feature null value rate, a feature correlation relation and an anomaly sample of the sample data of each feature type;
and generating a data exploration overview, a data exploration result report and an abnormal detection result report of the sample data according to the number of the feature types of the sample data, the overall feature missing rate and at least one of the feature distribution, the feature null value rate, the feature correlation and the sample clustering result of the sample data of each feature type.
Further, the generating a data exploration summary and a data exploration result report of the sample data according to the number of the feature types of the sample data, the overall feature missing rate, and at least one of the feature distribution, the feature null value rate, the feature correlation relationship, and the sample clustering result of the sample data of each feature type includes:
determining sample data summary information in a data exploration summary according to the characteristic type quantity of the sample data;
performing variable-by-variable traversal output on the exploration results of the sample data whose feature types are continuous univariate features, discrete features and date features, and determining the feature distribution information and feature null-value rate information in the data exploration result report;
and calculating pairwise correlation coefficients of the sample data with the characteristic type being continuous multivariate characteristics, and determining characteristic correlation relation information in the data exploration result report.
Further, the generating an anomaly detection result report of the sample data according to at least one of the number of the feature types of the sample data, the overall feature missing rate, the feature distribution of the sample data of each feature type, the feature null value rate, the feature correlation relationship, and the sample clustering result includes:
determining the overall missing-value statistics in the anomaly detection result report according to all features whose missing ratio exceeds a threshold in the overall feature missing rate of the sample data, together with the corresponding missing-ratio values;
performing feature exploration and determining abnormal features, and determining an abnormal feature list in the abnormal detection result report according to the abnormal features;
and determining low-data-volume abnormal sample data in the sample data according to a preset clustering algorithm, and determining an abnormal sample list in an abnormal detection result report according to the low-data-volume abnormal sample data.
Further, the determining low data volume abnormal sample data in the sample data according to a preset clustering algorithm further includes:
and determining a corresponding clustering algorithm according to whether the data volume of the sample data exceeds a preset data volume threshold value.
Further, performing variable traversal output on the exploration results of the sample data whose feature type is continuous univariate, and determining the feature distribution information and feature null-value rate information in the data exploration result report, includes:
performing variable traversal output on the exploration results of the sample data whose feature type is continuous univariate, and determining, in the data exploration result report, the null-value count, null-value rate, negative-value count, zero-value count, positive-value count, kurtosis, skewness, mean, standard deviation, quartiles, maximum, minimum, variable frequency distribution histogram and kernel density curve.
Further, performing variable traversal output on the exploration results of the sample data whose feature type is discrete, and determining the feature distribution information and feature null-value rate information in the data exploration result report, includes:
performing variable traversal output on the exploration results of the sample data whose feature type is discrete, and determining, in the data exploration result report, the number of categories, the null-value rate, the category with the highest frequency and its frequency share, and the frequency distribution bar chart.
Further, performing variable traversal output on the exploration results of the sample data whose feature type is date, and determining the feature distribution information and feature null-value rate information in the data exploration result report, includes:
performing variable traversal output on the exploration results of the sample data whose feature type is date, and determining, in the data exploration result report, the null-value count, null-value rate, first time, last time, quartile dates, and the frequency distribution over dates.
In a second aspect, the present application provides a data preprocessing apparatus, comprising:
the feature pre-classification module is used for performing feature classification on the collected sample data and determining the number of feature types and the overall feature missing rate of the sample data;
the characteristic exploration and anomaly detection module is used for carrying out characteristic exploration and anomaly detection on the sample data of each characteristic category in the sample data and determining the characteristic distribution, the characteristic null value rate, the characteristic correlation relationship and the anomaly sample of the sample data of each characteristic category;
and the report generation module is used for generating a data exploration summary, a data exploration result report and an abnormal detection result report of the sample data according to the number of the feature types of the sample data, the overall feature missing rate and at least one of the feature distribution, the feature null value rate, the feature correlation and the sample clustering result of the sample data of each feature type.
Further, the report generation module includes:
the data summary processing unit is used for determining sample data summary information in a data exploration summary according to the characteristic type quantity of the sample data;
the data detail exploration unit is used for performing variable-by-variable traversal output on the exploration results of the sample data whose feature types are continuous univariate features, discrete features and date features, and determining the feature distribution information and feature null-value rate information in the data exploration result report;
and the correlation determining unit is used for calculating pairwise correlation coefficients of the sample data with the characteristic type being continuous multivariate characteristic and determining characteristic correlation information in the data search result report.
Further, the report generation module includes:
the summary anomaly detection unit is used for determining the overall missing-value statistics in the anomaly detection result report according to all features whose missing ratio exceeds a threshold in the overall feature missing rate of the sample data, together with the corresponding missing-ratio values;
the feature anomaly detection unit is used for performing feature exploration, determining abnormal features, and determining the abnormal feature list in the anomaly detection result report according to the abnormal features;
and the sample anomaly detection unit is used for determining low-data-volume abnormal sample data in the sample data according to a preset clustering algorithm, and determining the abnormal sample list in the anomaly detection result report according to the low-data-volume abnormal sample data.
Further, the report generation module further comprises:
and the clustering algorithm selecting unit is used for determining a corresponding clustering algorithm according to whether the data volume of the sample data exceeds a preset data volume threshold value.
In a third aspect, the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the data preprocessing method when executing the program.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the data pre-processing method described herein.
According to the above technical scheme, the data preprocessing method and device provided by the application automatically output data exploration and anomaly detection reports, which standardizes the data preprocessing flow and reduces the amount of preprocessing work. The reports can also serve as a basis for third-party acceptance testing and are applicable to different structured-data projects, so that the efficiency and accuracy of data preprocessing can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart illustrating a data preprocessing method according to an embodiment of the present application;
FIG. 2 is a second flowchart illustrating a data preprocessing method according to an embodiment of the present application;
FIG. 3 is a third schematic flowchart of a data preprocessing method according to an embodiment of the present application;
FIG. 4 is a block diagram of a data preprocessing apparatus according to an embodiment of the present application;
FIG. 5 is a second block diagram of a data preprocessing apparatus according to an embodiment of the present application;
FIG. 6 is a third block diagram of a data preprocessing apparatus according to an embodiment of the present invention;
FIG. 7 is a fourth block diagram of a data preprocessing device according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating an overall data preprocessing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating clustering algorithm selection in an embodiment of the present application;
FIG. 10 is a flow chart illustrating a data preprocessing method according to the prior art;
FIG. 11 is a flow chart illustrating a data preprocessing method of the present application in comparison with the prior art;
FIG. 12 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, data cleaning and transformation are closely tied to the actual business, whereas the content of data exploration and anomaly detection is essentially the same across fields (feature distribution exploration, missing value statistics, outlier judgment, abnormal sample detection and so on), yet a technical solution focusing on data exploration and anomaly detection in the data preprocessing stage is lacking. In view of this, the application provides a data preprocessing method and device. Referring to fig. 11, by automatically outputting data exploration and anomaly detection reports, the data preprocessing flow is standardized and the amount of preprocessing work is reduced; the reports can also serve as a basis for third-party acceptance testing and are applicable to different structured-data projects, so that the efficiency and accuracy of data preprocessing can be effectively improved.
In order to effectively improve the efficiency and accuracy of data preprocessing, the present application provides an embodiment of a data preprocessing method, and referring to fig. 1, the data preprocessing method specifically includes the following contents:
step S101: and carrying out feature classification on the collected sample data, and determining the number of feature types and the overall feature missing rate of the sample data.
Optionally, in the present application, the collected sample data may be pre-classified by data feature type, dividing the features into primary key features, label features, continuous features, discrete features and date features. A summary analysis is then performed to determine the number of feature types and the overall feature missing rate (that is, the null-value rate) of the sample data; the summary analysis mainly counts the overall data volume, the number of features in each pre-classified sub-category, and the overall missing-feature situation.
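The pre-classification and summary analysis of step S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the names `classify_feature` and `overall_missing_rate`, the value-type heuristic, the reading of the overall missing rate as the share of empty cells, and the sample rows are all hypothetical, because the source does not say how a feature's type is decided.

```python
from collections import Counter
from datetime import date

def classify_feature(name, values, primary_key=None, label=None):
    """Return one of the five pre-classification categories (assumed heuristic)."""
    if name == primary_key:
        return "primary_key"
    if name == label:
        return "label"
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, date) for v in present):
        return "date"
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                       for v in present):
        return "continuous"
    return "discrete"  # everything else is treated as categorical

def overall_missing_rate(rows):
    """Share of empty (None) cells over all rows and columns (assumed definition)."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is None for value in cells) / len(cells)

# Hypothetical sample data: None marks a missing value.
rows = [
    {"id": 1, "amount": 10.5, "channel": "web", "opened": date(2021, 1, 5), "y": 0},
    {"id": 2, "amount": None, "channel": "app", "opened": date(2021, 2, 1), "y": 1},
    {"id": 3, "amount": 7.0, "channel": "web", "opened": None, "y": 0},
]
types = {col: classify_feature(col, [r[col] for r in rows], "id", "y")
         for col in rows[0]}
type_counts = Counter(types.values())   # number of features per category
missing_rate = overall_missing_rate(rows)
```

In practice the category of each column would more likely come from a data dictionary or user configuration than from value types alone.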
Step S102: and performing feature exploration and anomaly detection on the sample data of each feature type in the sample data, and determining the feature distribution, the feature null value rate, the feature correlation relation and the anomaly sample of the sample data of each feature type.
Optionally, after the above summary analysis is completed, the present application may further perform detailed analysis, namely feature exploration and anomaly detection, on the sample data of each feature type. The detailed analysis proceeds along two dimensions, features and samples:
(1) In the feature dimension, feature exploration and anomaly detection are carried out separately for each pre-classified sub-category, focusing mainly on three aspects: feature distribution, feature null-value rate and feature correlation.
(2) In the sample dimension, anomaly detection is carried out and abnormal samples are screened by machine learning algorithms. The scheme finally outputs data exploration and anomaly detection reports, which embodies an automated data preprocessing process.
In this way, the feature distribution, feature null-value rate, feature correlation and abnormal samples of the sample data of each feature type can be determined; preferably, the feature correlation may be the correlation between continuous variables.
Step S103: and generating a data exploration overview, a data exploration result report and an abnormal detection result report of the sample data according to the number of the feature types of the sample data, the overall feature missing rate and at least one of the feature distribution, the feature null value rate, the feature correlation and the sample clustering result of the sample data of each feature type.
Optionally, a data exploration result report and/or an anomaly detection result report of the sample data may be generated according to the number of feature types of the sample data, the overall feature missing rate, and at least one of the feature distribution, the feature null-value rate, the feature correlation and the sample clustering result of the sample data of each feature type.
The data search result report may include data summary information, data search detail information for continuous features, data search detail information for discrete features, and data search detail information for date-type features.
The anomaly detection result report may include feature anomaly detection summary information (the number of features with anomalies, the number of features involved in each type of anomaly, and the detailed features) and sample anomaly detection summary information (the number and proportion of abnormal samples computed by each algorithm), and may further include the judgment logic for feature anomaly detection. For example: for the primary key in the data summary analysis module, an anomaly is reported when the number of primary keys after de-duplication differs from the number of data rows; for continuous univariate features, features with skewness > 3 are shown as distribution-abnormal features, and features whose maximum equals their minimum are shown; for continuous multivariate features, features exhibiting collinearity are shown; for discrete features, features whose number of categories is 1 are shown; for date features, features whose first time equals their last time are shown; and for sample anomaly detection, the number and proportion of abnormal samples under each algorithm are shown.
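The per-feature judgment rules listed above can be illustrated with a short sketch. The function names, the flag strings, and the choice of the population (Fisher-Pearson) skewness estimator are assumptions; the source gives only the threshold "skewness > 3" and does not name an estimator.

```python
from datetime import date

def skewness(values):
    """Population skewness g1 = m3 / m2**1.5; returns 0.0 for constant data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def feature_anomalies(kind, values, skew_threshold=3.0):
    """Return the anomaly flags for one feature of the given pre-classified kind."""
    present = [v for v in values if v is not None]
    if not present:
        return []
    flags = []
    if kind == "primary_key":
        # Duplicate keys: the de-duplicated count differs from the row count.
        if len(set(present)) != len(present):
            flags.append("duplicate_primary_key")
    elif kind == "continuous":
        if max(present) == min(present):
            flags.append("single_value")
        elif skewness(present) > skew_threshold:
            flags.append("skewed_distribution")
    elif kind == "discrete":
        if len(set(present)) == 1:
            flags.append("single_category")
    elif kind == "date":
        if min(present) == max(present):  # first time equals last time
            flags.append("single_date")
    return flags

# Hypothetical usage: 19 zeros and a single one give a skewness of about 4.13.
skewed_flags = feature_anomalies("continuous", [0] * 19 + [1])
```

Collinearity between continuous features is a pairwise property and is therefore not covered by this per-feature function.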
As can be seen from the above description, the data preprocessing method provided in the embodiment of the present application can normalize the data preprocessing flow and simplify the data preprocessing operation amount by automatically outputting the data exploration and the exception detection report, and meanwhile, can be used as a third party acceptance test basis and has universality for different structured data items, thereby effectively improving the efficiency and accuracy of data preprocessing.
In order to explore the sample data accurately, in an embodiment of the data preprocessing method of the present application, referring to fig. 2, the step S103 may further include the following steps:
Step S201: determining the sample data summary information in the data exploration summary according to the number of feature types of the sample data.
Specifically, the summary information includes:
① the data has m rows by n columns;
② whether the data has label features and, if so, the statistics of each label category;
③ whether the data has a primary key and, if so, the number of primary keys after de-duplication;
④ the numbers of continuous, discrete and date features among the data features (excluding label and primary key);
⑤ q of the data features (excluding label and primary key) have missing values.
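The five summary items above could be assembled roughly as follows. The function name `exploration_summary`, the dictionary keys, and the sample rows are hypothetical; the sketch assumes each column has already been assigned one of the five pre-classified categories.

```python
from collections import Counter

def exploration_summary(rows, types):
    """Build the summary items from rows (list of dicts) and a column -> type map."""
    columns = list(types)
    pk = next((c for c, t in types.items() if t == "primary_key"), None)
    label = next((c for c, t in types.items() if t == "label"), None)
    # Ordinary features exclude the label and the primary key.
    features = [c for c in columns if types[c] not in ("primary_key", "label")]
    return {
        "shape": (len(rows), len(columns)),                        # m rows, n columns
        "label_counts": dict(Counter(r[label] for r in rows)) if label else None,
        "distinct_primary_keys": len({r[pk] for r in rows}) if pk else None,
        "feature_type_counts": dict(Counter(types[c] for c in features)),
        "features_with_missing": [c for c in features
                                  if any(r[c] is None for r in rows)],
    }

# Hypothetical sample data: None marks a missing value.
rows = [
    {"id": 1, "amount": 10.5, "channel": "web", "opened": "2021-01-05", "y": 0},
    {"id": 2, "amount": None, "channel": "app", "opened": "2021-02-01", "y": 1},
    {"id": 3, "amount": 7.0, "channel": "web", "opened": None, "y": 0},
]
types = {"id": "primary_key", "amount": "continuous", "channel": "discrete",
         "opened": "date", "y": "label"}
summary = exploration_summary(rows, types)
```

Here q from item ⑤ corresponds to `len(summary["features_with_missing"])`.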
Step S202: performing variable-by-variable traversal output on the exploration results of the sample data whose feature types are continuous univariate features, discrete features and date features, and determining the feature distribution information and feature null-value rate information in the data exploration result report.
Specifically:
For continuous univariate features, each variable is traversed and the following are output:
① the null-value count, null-value rate, and the numbers of negative, zero and positive values;
② kurtosis, skewness, mean, standard deviation, quartiles, maximum and minimum;
③ the variable frequency distribution histogram and kernel density curve.
For discrete features, each variable is traversed and the following are output:
① the number of categories, the null-value rate, and the category with the highest frequency together with its frequency share;
② a frequency distribution bar chart of the TOP10 categories.
For date features, each variable is traversed and the following are output:
① the null-value count, null-value rate, first time, last time and quartile dates;
② the frequency distribution over dates.
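As a rough sketch of the per-variable statistics listed above (plots omitted), the following uses only the Python standard library. The function names, the moment-based skewness and excess-kurtosis formulas, the frequency share taken over non-null values, and the monthly bucketing of dates are assumptions; the source names the statistics but not how they are computed.

```python
import statistics
from collections import Counter
from datetime import date

def explore_continuous(values):
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    m2, m3, m4 = (sum((v - mean) ** k for v in present) / n for k in (2, 3, 4))
    return {
        "null_count": len(values) - n,
        "null_rate": (len(values) - n) / len(values),
        "negative": sum(v < 0 for v in present),
        "zero": sum(v == 0 for v in present),
        "positive": sum(v > 0 for v in present),
        "mean": mean,
        "std": statistics.stdev(present),            # sample standard deviation
        "quartiles": statistics.quantiles(present, n=4),
        "min": min(present),
        "max": max(present),
        "skewness": m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3 if m2 else 0.0,  # excess kurtosis
    }

def explore_discrete(values, top=10):
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top_category, top_count = counts.most_common(1)[0]
    return {
        "n_categories": len(counts),
        "null_rate": (len(values) - len(present)) / len(values),
        "top_category": top_category,
        "top_share": top_count / len(present),
        "top_frequencies": counts.most_common(top),   # data for the TOP10 bar chart
    }

def explore_date(values):
    present = sorted(v for v in values if v is not None)
    ordinals = [d.toordinal() for d in present]
    cuts = statistics.quantiles(ordinals, n=4, method="inclusive")
    return {
        "null_count": len(values) - len(present),
        "null_rate": (len(values) - len(present)) / len(values),
        "first": present[0],
        "last": present[-1],
        "quartile_dates": [date.fromordinal(round(q)) for q in cuts],
        "frequency": dict(Counter(d.strftime("%Y-%m") for d in present)),
    }

# Hypothetical columns: None marks a missing value.
cont = explore_continuous([-1.0, 0.0, 2.0, 3.0, None])
disc = explore_discrete(["a", "b", "a", None])
dates = explore_date([date(2021, 1, 1), date(2021, 1, 3), date(2021, 1, 5), None])
```

The histogram and kernel density curve would be drawn from the same non-null values with whatever plotting library the project uses.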
Step S203: and calculating pairwise correlation coefficients of the sample data with the characteristic type being continuous multivariate characteristics, and determining characteristic correlation relation information in the data exploration result report.
For continuous multivariate features, pairwise correlation coefficients are calculated, and the following are output:
① the number of variable pairs whose correlation coefficient is greater than 0.7, and the number of variables involved;
② a correlation diagram of the pairs whose correlation coefficient is greater than 0.7;
③ the correlation diagram and table.
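The pairwise calculation can be sketched as below. The source does not name the correlation coefficient, so the use of Pearson's r here is an assumption, as are the function names and the decision to compare the absolute value against 0.7 so that strong negative correlations are also reported.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def strong_pairs(columns, threshold=0.7):
    """Return ({(a, b): r} for |r| > threshold, sorted list of variables involved)."""
    pairs = {}
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:  # assumption: absolute value, not signed value
            pairs[(name_a, name_b)] = r
    variables = sorted({name for pair in pairs for name in pair})
    return pairs, variables

# Hypothetical continuous columns without missing values.
columns = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8.5],
    "z": [1, -1, 1, -1],
}
pairs, variables = strong_pairs(columns)
n_pairs, n_variables = len(pairs), len(variables)
```

The pair count and variable count correspond to item ①; the `pairs` mapping holds the data behind the diagram and table in items ② and ③.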
In order to perform anomaly detection on the sample data accurately, in an embodiment of the data preprocessing method of the present application, referring to fig. 3, the step S103 may further include the following steps:
Step S301: determining the overall missing-value statistics in the anomaly detection result report according to all features whose missing ratio exceeds a threshold in the overall feature missing rate of the sample data, together with the corresponding missing-ratio values.
Specifically, for example, the names of the features whose missing ratio is greater than 70% are shown together with their missing-ratio values.
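A minimal sketch of this threshold check follows; the function name `high_missing_features` and the sample rows are hypothetical, and the 70% default mirrors the example value in the text.

```python
def high_missing_features(rows, threshold=0.70):
    """Return {feature name: missing ratio} for features above the threshold."""
    result = {}
    for col in rows[0]:
        rate = sum(r[col] is None for r in rows) / len(rows)
        if rate > threshold:
            result[col] = rate
    return result

# Hypothetical sample data: "fax" is missing in 3 of 4 rows, "email" in 1 of 4.
rows = [
    {"email": "a@example.com", "fax": None},
    {"email": None, "fax": None},
    {"email": "c@example.com", "fax": "555-0100"},
    {"email": "d@example.com", "fax": None},
]
flagged = high_missing_features(rows)
```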
Step S302: performing feature exploration and determining abnormal features, and determining an abnormal feature list in the abnormal detection result report according to the abnormal features;
step S303: and determining low-data-volume abnormal sample data in the sample data according to a preset clustering algorithm, and determining an abnormal sample list in an abnormal detection result report according to the low-data-volume abnormal sample data.
Specifically:
Clustering algorithm: select an algorithm, input the parameters, output the evaluation value, compare the data volume of each cluster, and provisionally take the cluster with the lowest data volume as the abnormal sample list.
Isolation forest algorithm: input the parameters, output the evaluation value, and take the TOP 10% of samples as the abnormal sample list.
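The two selection rules can be sketched independently of any particular library. The sketch assumes that cluster labels and per-sample anomaly scores have already been produced by some clustering or isolation forest implementation, and that a higher score means "more anomalous"; some libraries use the opposite sign convention, in which case the scores would need to be negated first. The function names are hypothetical.

```python
from collections import Counter

def smallest_cluster_samples(labels):
    """Indices of the samples in the cluster with the fewest members."""
    counts = Counter(labels)
    smallest = min(counts, key=counts.get)
    return [i for i, lab in enumerate(labels) if lab == smallest]

def top_fraction_samples(scores, fraction=0.10):
    """Indices of the top `fraction` of samples by anomaly score (at least one)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical outputs of a clustering run and an isolation forest run.
labels = [0, 0, 0, 1, 1, 2, 0, 1, 0, 0]
scores = [0.1, 0.2, 0.9, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2, 0.1]
cluster_anomalies = smallest_cluster_samples(labels)
forest_anomalies = top_fraction_samples(scores)
```

The text describes the smallest-cluster list as provisional, so in practice it would be reviewed before any samples are removed.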
In order to select the clustering algorithm appropriate to the sample data volume, in an embodiment of the data preprocessing method of the present application, the step S103 may further include the following step:
determining the corresponding clustering algorithm according to whether the data volume of the sample data exceeds a preset data volume threshold.
Optionally, referring to fig. 9, the clustering algorithm is selected according to the number of samples. When the number of samples is less than 10,000, the KMeans/MeanShift/VBGMM algorithms are used for clustering, the proportion of each cluster is output, and the samples of the cluster with the lowest data volume are provisionally listed as abnormal samples. When the number of samples is greater than 10,000, MiniBatch KMeans is used, and the logic for judging abnormal samples is the same as above.
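The selection logic can be written as a small dispatch function. The function name is hypothetical, and the handling of exactly 10,000 samples is an assumption, since the text specifies only "less than" and "greater than". The function returns algorithm names rather than fitted models so that the sketch stays independent of any library.

```python
def select_clustering_algorithms(n_samples, threshold=10_000):
    """Return the candidate clustering algorithms for a given sample count."""
    if n_samples < threshold:
        return ["KMeans", "MeanShift", "VBGMM"]
    # Assumption: exactly `threshold` samples also take the mini-batch branch.
    return ["MiniBatchKMeans"]

small = select_clustering_algorithms(500)
large = select_clustering_algorithms(50_000)
```

In a scikit-learn based pipeline these names would plausibly map to `sklearn.cluster.KMeans`, `MeanShift` and `MiniBatchKMeans`, with `sklearn.mixture.BayesianGaussianMixture` standing in for VBGMM; that mapping is an assumption and is not stated in the source.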
In some possible embodiments of the present application, the anomaly detection detail report may cover seven aspects:
1. The primary key contains duplicate values.
2. The missing proportion is greater than the threshold.
3. A continuous feature has a strongly skewed distribution, or a continuous variable takes only a single value.
4. Continuous features are collinear.
5. A discrete feature takes only a single value.
6. A date-type feature takes only a single value.
7. Outliers are defined as anomalies.
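Two of the checks listed above, duplicate primary keys and features that take only a single value, can be sketched in Python. The row format, the function names, and the decision to ignore `None` when counting distinct values are assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def duplicate_keys(rows: List[Dict[str, Any]], key: str) -> List[Any]:
    """Primary key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


def single_valued_features(rows: List[Dict[str, Any]], features: List[str]) -> List[str]:
    """Features with at most one distinct non-null value across all rows."""
    flagged = []
    for name in features:
        distinct = {row.get(name) for row in rows if row.get(name) is not None}
        if len(distinct) <= 1:
            flagged.append(name)
    return flagged


# Hypothetical sample data: key 2 is duplicated and "channel" never varies.
rows = [
    {"cust_id": 1, "channel": "app", "open_date": "2020-01-01"},
    {"cust_id": 2, "channel": "app", "open_date": "2020-03-15"},
    {"cust_id": 2, "channel": "app", "open_date": "2021-07-13"},
]
print(duplicate_keys(rows, "cust_id"))
print(single_valued_features(rows, ["channel", "open_date"]))
```

The same single-value check applies unchanged to continuous, discrete, and date-type features, which is why items 3, 5, and 6 can share one helper in this sketch.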
In order to effectively improve the efficiency and accuracy of data preprocessing, the present application provides an embodiment of a data preprocessing apparatus for implementing all or part of the content of the data preprocessing method, and referring to fig. 4, the data preprocessing apparatus specifically includes the following contents:
The feature pre-classification module 10 is configured to perform feature classification on the collected sample data, and determine the number of feature types and the overall feature missing rate of the sample data.
The feature exploration and anomaly detection module 20 is configured to perform feature exploration and anomaly detection on the sample data of each feature type in the sample data, and determine the feature distribution, the feature null value rate, the feature correlation, and the anomalous samples of the sample data of each feature type.
The report generating module 30 is configured to generate a data exploration summary, a data exploration result report, and an anomaly detection result report of the sample data according to the number of feature types of the sample data, the overall feature missing rate, and at least one of the feature distribution, the feature null value rate, the feature correlation, and the sample clustering result of the sample data of each feature type.
As can be seen from the above description, the data preprocessing device provided in the embodiment of the present application standardizes the data preprocessing flow and reduces the data preprocessing workload by automatically outputting data exploration and anomaly detection reports. The reports can also serve as a basis for third-party acceptance testing, and the device is applicable to different structured data projects, thereby effectively improving the efficiency and accuracy of data preprocessing.
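The cooperation of modules 10, 20, and 30 can be sketched as three Python functions that feed one another. Everything in this sketch is a simplifying assumption: the feature types are supplied by the caller, the overall missing rate is computed as the share of missing cells, exploration is reduced to per-feature null rates, and the names `Report`, `pre_classify`, `explore`, and `generate_report` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Report:
    summary: Dict[str, Any] = field(default_factory=dict)
    exploration: Dict[str, Any] = field(default_factory=dict)
    anomalies: Dict[str, Any] = field(default_factory=dict)


def pre_classify(rows: List[Dict[str, Any]], feature_types: Dict[str, str]) -> Dict[str, Any]:
    """Module 10 (sketch): count features per type and compute an overall missing rate."""
    type_counts: Dict[str, int] = {}
    for feature_type in feature_types.values():
        type_counts[feature_type] = type_counts.get(feature_type, 0) + 1
    total_cells = len(rows) * len(feature_types)
    missing = sum(1 for row in rows for name in feature_types if row.get(name) is None)
    rate = missing / total_cells if total_cells else 0.0
    return {"n_samples": len(rows), "type_counts": type_counts, "overall_missing_rate": rate}


def explore(rows: List[Dict[str, Any]], feature_types: Dict[str, str]) -> Dict[str, Any]:
    """Module 20 (sketch): per-feature null value rate only."""
    n = len(rows) or 1  # avoid division by zero on empty input
    null_rate = {name: sum(row.get(name) is None for row in rows) / n for name in feature_types}
    return {"null_rate": null_rate}


def generate_report(summary: Dict[str, Any], exploration: Dict[str, Any],
                    threshold: float = 0.7) -> Report:
    """Module 30 (sketch): assemble the three report parts."""
    flagged = {name: rate for name, rate in exploration["null_rate"].items() if rate > threshold}
    return Report(summary=summary, exploration=exploration, anomalies={"high_missing": flagged})


rows = [{"id": 1, "age": 30, "note": None}, {"id": 2, "age": None, "note": None}]
types = {"id": "primary_key", "age": "continuous", "note": "discrete"}
report = generate_report(pre_classify(rows, types), explore(rows, types))
print(report)
```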
In order to accurately explore the sample data, in an embodiment of the data preprocessing device of the present application, referring to fig. 5, the report generating module 30 includes:
The data summary processing unit 31 is configured to determine sample data summary information in the data exploration summary according to the number of feature types of the sample data.
The data detail exploration unit 32 is configured to traverse the variables and output the exploration results for sample data whose feature type is a continuous univariate feature, a discrete feature, or a date-type feature, and determine the feature distribution information and feature null value rate information in the data exploration result report.
The correlation determining unit 33 is configured to calculate pairwise correlation coefficients for sample data whose feature type is a continuous multivariate feature, and determine the feature correlation information in the data exploration result report.
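The pairwise calculation performed by the correlation determining unit 33 can be sketched in Python. The description does not name a specific coefficient, so the use of the Pearson coefficient is an assumption, as are the column format, the function names, and the choice to skip row pairs in which either value is null.

```python
import math
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def pearson(x: List[Optional[float]], y: List[Optional[float]]) -> Optional[float]:
    """Pearson correlation over rows where both values are present; None if undefined."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(
    columns: Dict[str, List[Optional[float]]]
) -> Dict[Tuple[str, str], Optional[float]]:
    """Correlation coefficient for every unordered pair of continuous features."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Hypothetical continuous features.
columns = {
    "balance": [1.0, 2.0, 3.0, 4.0],
    "income": [2.0, 4.0, 6.0, 8.0],
    "age": [4.0, 3.0, 2.0, 1.0],
}
result = pairwise_correlations(columns)
for pair, value in result.items():
    print(pair, value)
```

Pairs whose coefficient is close to 1 or -1 are natural candidates for the collinearity check on continuous features.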
In order to accurately detect anomalies in the sample data in advance, in an embodiment of the data preprocessing device of the present application, referring to fig. 6, the report generating module 30 includes:
The summary anomaly detection unit 34 is configured to determine overall missing statistical information in the anomaly detection result report according to each feature whose missing proportion, within the overall feature missing rate of the sample data, is greater than a threshold, together with the corresponding missing value.
The feature anomaly detection unit 35 is configured to perform feature exploration to determine anomalous features, and determine an anomalous feature list in the anomaly detection result report according to the anomalous features.
The sample anomaly detection unit 36 is configured to determine low-data-volume anomalous sample data in the sample data according to a preset clustering algorithm, and determine an anomalous sample list in the anomaly detection result report according to the low-data-volume anomalous sample data.
In order to accurately select a corresponding clustering algorithm according to the sample data volume, in an embodiment of the data preprocessing device of the present application, referring to fig. 7, the report generating module 30 further includes:
The clustering algorithm selecting unit 37 is configured to determine a corresponding clustering algorithm according to whether the data volume of the sample data exceeds a preset data volume threshold.
To further explain the present invention, the present application further provides a specific application example of implementing the data preprocessing method by using the data preprocessing apparatus, which is shown in fig. 8 and specifically includes the following contents:
First, the data features are pre-classified into primary key features, label features, continuous features, discrete features, and date features. Then, the summary analysis and the detailed analysis are performed in sequence.
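The pre-classification into the five feature types can be sketched in Python. The inference rules here are assumptions for illustration: the primary key and label columns are named by the caller, numeric columns (including integers) are treated as continuous, and everything else is treated as discrete. Placing Boolean features in the discrete list follows the note on Boolean features in this application example.

```python
from datetime import date, datetime
from typing import Any, Dict, List, Optional


def classify_feature(name: str, values: List[Any], primary_key: str,
                     label: Optional[str] = None) -> str:
    """Assign one of: primary_key, label, continuous, discrete, date."""
    if name == primary_key:
        return "primary_key"
    if label is not None and name == label:
        return "label"
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, (date, datetime)) for v in present):
        return "date"
    # bool is a subclass of int in Python, so Booleans must be checked before numbers.
    if present and all(isinstance(v, bool) for v in present):
        return "discrete"
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "continuous"
    return "discrete"


def pre_classify_columns(columns: Dict[str, List[Any]], primary_key: str,
                         label: Optional[str] = None) -> Dict[str, str]:
    """Classify every column of a column-oriented data set."""
    return {name: classify_feature(name, values, primary_key, label)
            for name, values in columns.items()}


# Hypothetical structured data set.
columns = {
    "cust_id": [1, 2, 3],
    "is_default": [0, 1, 0],
    "balance": [10.5, None, 3.0],
    "vip": [True, False, True],
    "channel": ["app", "web", "app"],
    "open_date": [date(2020, 1, 1), date(2021, 5, 3), None],
}
feature_types = pre_classify_columns(columns, primary_key="cust_id", label="is_default")
print(feature_types)
```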
The summary analysis mainly counts the whole data volume, the feature quantity of each pre-classification subclass and the whole feature missing condition.
The detailed analysis is carried out along two dimensions, features and samples:
(1) In the feature dimension, feature exploration and anomaly detection are performed separately for each pre-classified subcategory, focusing mainly on three aspects: feature distribution, feature null value rate, and feature correlation.
(2) In the sample dimension, anomaly detection is performed and anomalous samples are screened out by machine learning algorithms. The scheme finally outputs a data exploration and anomaly detection report, realizing an automated data preprocessing process.
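For the feature dimension, the distribution and null value rate of a single continuous feature can be profiled as in the following Python sketch. The statistics chosen mirror those recited for continuous univariate features in claim 3; the function name, the use of population moments, and the reporting of excess kurtosis (kurtosis minus 3) are assumptions.

```python
import statistics
from typing import Any, Dict, List, Optional


def profile_continuous(values: List[Optional[float]]) -> Dict[str, Any]:
    """Null statistics, sign counts and distribution moments for one continuous feature."""
    present = [v for v in values if v is not None]
    n = len(present)
    nulls = len(values) - n
    profile: Dict[str, Any] = {
        "null_count": nulls,
        "null_rate": nulls / len(values) if values else 0.0,
        "negative_count": sum(v < 0 for v in present),
        "zero_count": sum(v == 0 for v in present),
        "positive_count": sum(v > 0 for v in present),
    }
    if n >= 2:
        mean = statistics.fmean(present)
        std = statistics.pstdev(present)  # population standard deviation
        profile.update(
            mean=mean,
            std=std,
            min=min(present),
            max=max(present),
            quartiles=statistics.quantiles(present, n=4),
        )
        if std > 0:
            m3 = sum((v - mean) ** 3 for v in present) / n
            m4 = sum((v - mean) ** 4 for v in present) / n
            profile["skewness"] = m3 / std ** 3
            profile["kurtosis"] = m4 / std ** 4 - 3.0  # excess kurtosis
    return profile


# Hypothetical feature values with one null.
p = profile_continuous([-2.0, 0.0, 2.0, None])
print(p)
```

A large absolute skewness in such a profile is one possible signal for the "strongly skewed distribution" anomaly; the cut-off value would have to be chosen per project.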
The following points are worth noting:
(1) The present application processes structured data only.
(2) When the features are pre-classified, Boolean features are placed in the discrete feature list.
(3) For labeled data, the distribution broken down by label is additionally provided when the feature distribution is plotted, and is listed as an attachment.
(4) Selection logic for the clustering algorithm: the algorithm is selected according to the number of samples. When the number of samples is less than 10,000, clustering is performed with the KMeans, MeanShift, or VBGMM algorithm, the proportion of each cluster is output, and the samples of the cluster with the lowest data volume are provisionally listed as anomalous samples. When the number of samples is greater than 10,000, MiniBatch KMeans is used, and the logic for judging whether a sample is anomalous is the same as above.
As can be seen from the above, automatically outputting data exploration and anomaly detection reports standardizes the data preprocessing flow and reduces the data preprocessing workload. The reports can also serve as a basis for third-party acceptance testing, and the scheme is applicable to different structured data projects.
In terms of hardware, in order to effectively improve the efficiency and accuracy of data preprocessing, the present application provides an embodiment of an electronic device for implementing all or part of the contents of the data preprocessing method, where the electronic device specifically includes the following contents:
A processor, a memory, a communications interface, and a bus. The processor, the memory, and the communications interface communicate with one another through the bus. The communications interface is used for information transmission between the data preprocessing device and related equipment such as a core service system, a user terminal, and a related database. The logic controller may be a desktop computer, a tablet computer, a mobile terminal, or the like, but the embodiment is not limited thereto. In this embodiment, the logic controller may be implemented with reference to the embodiments of the data preprocessing method and the data preprocessing device described herein, the contents of which are incorporated here; repeated descriptions are omitted.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, a smart wearable device, and the like. The smart wearable device may include smart glasses, a smart watch, a smart bracelet, and the like.
In practical applications, part of the data preprocessing method may be performed on the electronic device side as described above, or all operations may be performed in the client device. The selection may be specifically performed according to the processing capability of the client device, the limitation of the user usage scenario, and the like. This is not a limitation of the present application. The client device may further include a processor if all operations are performed in the client device.
The client device may have a communication module (i.e., a communication unit), and may be communicatively connected to a remote server to implement data transmission with the server. The server may include a server on the task scheduling center side, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
Fig. 12 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 12, the electronic device 9600 can include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, this fig. 12 is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the data preprocessing method functions may be integrated into the central processor 9100. The central processor 9100 may be configured to control as follows:
Step S101: performing feature classification on the collected sample data, and determining the number of feature types and the overall feature missing rate of the sample data.
Step S102: performing feature exploration and anomaly detection on the sample data of each feature type in the sample data, and determining the feature distribution, the feature null value rate, the feature correlation, and the anomalous samples of the sample data of each feature type.
Step S103: generating a data exploration summary, a data exploration result report, and an anomaly detection result report of the sample data according to the number of feature types of the sample data, the overall feature missing rate, and at least one of the feature distribution, the feature null value rate, the feature correlation, and the sample clustering result of the sample data of each feature type.
As can be seen from the above description, the electronic device provided in the embodiment of the present application standardizes the data preprocessing flow and reduces the data preprocessing workload by automatically outputting data exploration and anomaly detection reports. The reports can also serve as a basis for third-party acceptance testing, and the electronic device is applicable to different structured data projects, so that the efficiency and accuracy of data preprocessing can be effectively improved.
In another embodiment, the data preprocessing apparatus may be configured separately from the central processor 9100, for example, the data preprocessing apparatus may be configured as a chip connected to the central processor 9100, and the data preprocessing method function is realized by the control of the central processor.
As shown in fig. 12, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 also does not necessarily include all of the components shown in fig. 12; further, the electronic device 9600 may further include components not shown in fig. 12, which can be referred to in the related art.
As shown in fig. 12, a central processor 9100, sometimes referred to as a controller or operational control, can include a microprocessor or other processor device and/or logic device, which central processor 9100 receives input and controls the operation of the various components of the electronic device 9600.
The memory 9140 can be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It may store the above-mentioned failure-related information as well as a program for processing that information. The central processor 9100 can execute the program stored in the memory 9140 to implement information storage, processing, and the like.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. Power supply 9170 is used to provide power to electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 9140 can be a solid state memory, e.g., Read Only Memory (ROM), Random Access Memory (RAM), a SIM card, or the like. There may also be a memory that holds information even when power is off, can be selectively erased, and is provided with more data, an example of which is sometimes called an EPROM or the like. The memory 9140 could also be some other type of device. Memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application/function storage portion 9142, the application/function storage portion 9142 being used for storing application programs and function programs or for executing a flow of operations of the electronic device 9600 by the central processor 9100.
The memory 9140 can also include a data store 9143, the data store 9143 being used to store data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, contact book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. The communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby implementing ordinary telecommunications functions. The audio processor 9130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100, thereby enabling recording locally through the microphone 9132 and enabling locally stored sounds to be played through the speaker 9131.
An embodiment of the present application further provides a computer-readable storage medium capable of implementing all the steps in the data preprocessing method in which the execution subject is the server or the client in the foregoing embodiments, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements all the steps in the data preprocessing method in which the execution subject is the server or the client in the foregoing embodiments, for example, when the processor executes the computer program, the processor implements the following steps:
Step S101: performing feature classification on the collected sample data, and determining the number of feature types and the overall feature missing rate of the sample data.
Step S102: performing feature exploration and anomaly detection on the sample data of each feature type in the sample data, and determining the feature distribution, the feature null value rate, the feature correlation, and the anomalous samples of the sample data of each feature type.
Step S103: generating a data exploration summary, a data exploration result report, and an anomaly detection result report of the sample data according to the number of feature types of the sample data, the overall feature missing rate, and at least one of the feature distribution, the feature null value rate, the feature correlation, and the sample clustering result of the sample data of each feature type.
As can be seen from the above description, the computer-readable storage medium provided in the embodiment of the present application standardizes the data preprocessing flow and reduces the data preprocessing workload by automatically outputting data exploration and anomaly detection reports. The reports can also serve as a basis for third-party acceptance testing, and the scheme is applicable to different structured data projects, thereby effectively improving the efficiency and accuracy of data preprocessing.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of pre-processing data, the method comprising:
carrying out feature classification on the collected sample data, and determining the number of feature types and the overall feature missing rate of the sample data;
performing feature exploration and anomaly detection on the sample data of each feature type in the sample data, and determining feature distribution, a feature null value rate, a feature correlation relation and an anomaly sample of the sample data of each feature type;
and generating a data exploration overview, a data exploration result report and an abnormal detection result report of the sample data according to the number of the feature types of the sample data, the overall feature missing rate and at least one of the feature distribution, the feature null value rate, the feature correlation and the sample clustering result of the sample data of each feature type.
2. The data preprocessing method according to claim 1, wherein the generating a data exploration summary and a data exploration result report of the sample data according to at least one of the number of the feature types of the sample data, the overall feature missing rate, and the feature distribution, the feature null value rate, the feature correlation, and the sample clustering result of the sample data of each feature type comprises:
determining sample data summary information in a data exploration summary according to the characteristic type quantity of the sample data;
performing variable traversal output on the sample data search result with the characteristic types of continuous univariate characteristics, discrete characteristics and date characteristics, and determining characteristic distribution information and characteristic null rate information in a data search result report;
and calculating pairwise correlation coefficients of the sample data with the characteristic type being continuous multivariate characteristics, and determining characteristic correlation relation information in the data exploration result report.
3. The data preprocessing method according to claim 2, wherein the performing a variable traversal output on the sample data search result with the feature type being a continuous univariate feature to determine feature distribution information and feature null rate information in the data search result report includes:
and carrying out variable traversal output on the sample data search result with the characteristic category being continuous univariate characteristics, and determining the null value number, the null value rate, the negative value number, the 0 value number, the positive value number, the kurtosis, the skewness, the average value, the standard deviation, the quartile, the maximum value, the minimum value, the variable frequency distribution histogram and the kernel density curve in the data search result report.
4. The data preprocessing method according to claim 2, wherein the performing a variable traversal output on the sample data search result with the discrete type feature as the feature type to determine the feature distribution information and the feature null rate information in the data search result report includes:
and performing variable traversal output on the sample data exploration result whose feature type is a discrete feature, and determining, in the data exploration result report, the number of categories, the null value rate, the category with the highest frequency, the frequency proportion of that category, and a frequency distribution bar chart.
5. The data preprocessing method according to claim 2, wherein the performing a variable traversal output on the sample data search result with the characteristic category being date-type characteristic to determine the characteristic distribution information and the characteristic null-value rate information in the data search result report includes:
and performing variable traversal output on the sample data exploration result whose feature type is a date-type feature, and determining, in the data exploration result report, the null value count, the null value rate, the earliest time, the latest time, the quartile dates, and the frequency distribution by date.
6. The data preprocessing method according to claim 1, wherein the generating an anomaly detection result report of the sample data according to at least one of the number of the feature types of the sample data, the overall feature missing rate, and a feature distribution, a feature null value rate, a feature correlation relationship, and a sample clustering result of the sample data of each feature type comprises:
determining overall missing statistical information in an anomaly detection result report according to each feature whose missing proportion, within the overall feature missing rate of the sample data, is greater than a threshold, together with the corresponding missing value;
performing feature exploration and determining abnormal features, and determining an abnormal feature list in the abnormal detection result report according to the abnormal features;
and determining low-data-volume abnormal sample data in the sample data according to a preset clustering algorithm, and determining an abnormal sample list in an abnormal detection result report according to the low-data-volume abnormal sample data.
7. The data preprocessing method according to claim 6, wherein the determining low data volume abnormal sample data in the sample data according to a preset clustering algorithm further comprises:
and determining a corresponding clustering algorithm according to whether the data volume of the sample data exceeds a preset data volume threshold value.
8. A data preprocessing apparatus, comprising:
the feature pre-classification module is used for performing feature classification on the collected sample data and determining the number of feature types and the overall feature missing rate of the sample data;
the characteristic exploration and anomaly detection module is used for carrying out characteristic exploration and anomaly detection on the sample data of each characteristic category in the sample data and determining the characteristic distribution, the characteristic null value rate, the characteristic correlation relationship and the anomaly sample of the sample data of each characteristic category;
and the report generation module is used for generating a data exploration summary, a data exploration result report and an abnormal detection result report of the sample data according to the number of the feature types of the sample data, the overall feature missing rate and at least one of the feature distribution, the feature null value rate, the feature correlation and the sample clustering result of the sample data of each feature type.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the data preprocessing method according to any of claims 1 to 7 are implemented by the processor when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the data preprocessing method as claimed in any one of the claims 1 to 7.
CN202110789650.9A 2021-07-13 2021-07-13 Data preprocessing method and device Pending CN113515577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110789650.9A CN113515577A (en) 2021-07-13 2021-07-13 Data preprocessing method and device

Publications (1)

Publication Number Publication Date
CN113515577A true CN113515577A (en) 2021-10-19

Family

ID=78067603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110789650.9A Pending CN113515577A (en) 2021-07-13 2021-07-13 Data preprocessing method and device

Country Status (1)

Country Link
CN (1) CN113515577A (en)

Similar Documents

Publication Publication Date Title
US20200210899A1 (en) Machine learning model training method and device, and electronic device
CN111275546B (en) Financial customer fraud risk identification method and device
CN109242135B (en) Model operation method, device and business server
CN113051317B (en) Data mining model updating method, system, computer equipment and readable medium
CN111352971A (en) Bank system monitoring data anomaly detection method and system
CN111898675B (en) Credit risk control model generation method and device, scoring card generation method, machine readable medium and equipment
CN103248705B (en) Server, client and method for processing video frequency
CN111582341B (en) User abnormal operation prediction method and device
CN111738331A (en) User classification method and device, computer-readable storage medium and electronic device
CA3135466A1 (en) User loan willingness prediction method and device and computer system
CN115409518A (en) User transaction risk early warning method and device
CN111210022A (en) Backward model selection method, device and readable storage medium
CN113515577A (en) Data preprocessing method and device
CN113420165B (en) Training of classification model and classification method and device of multimedia data
CN112631850B (en) Fault scene simulation method and device
CN112734565B (en) Liquidity coverage prediction method and device
CN117994021A (en) Auxiliary configuration method, device, equipment and medium for asset verification mode
CN112598540B (en) Material reserve recommending method, equipment and storage medium
CN115146997A (en) Evaluation method and device based on power data, electronic equipment and storage medium
CN111026991B (en) Data display method and device and computer equipment
CN112329943A (en) Combined index selection method and device, computer equipment and medium
CN112927012A (en) Marketing data processing method and device and marketing model training method and device
CN111861488B (en) Machine learning model comparison method and device
CN111768306A (en) Risk identification method and system based on intelligent data analysis
CN113469374B (en) Data prediction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination