CN113626420A - Data preprocessing method and device and readable storage medium - Google Patents


Info

Publication number
CN113626420A
CN113626420A (application number CN202110839234.5A)
Authority: China (CN)
Prior art keywords: data, value, characteristic, dimension, processing
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202110839234.5A
Other languages: Chinese (zh)
Inventors: 赵振崇, 薛鹏
Current Assignee: Shenzhen ZNV Technology Co Ltd; Nanjing ZNV Software Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Shenzhen ZNV Technology Co Ltd; Nanjing ZNV Software Co Ltd
Application filed by Shenzhen ZNV Technology Co Ltd, Nanjing ZNV Software Co Ltd filed Critical Shenzhen ZNV Technology Co Ltd
Priority to CN202110839234.5A
Publication of CN113626420A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21: Design, administration or maintenance of databases
    • G06F 16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2282: Tablespace storage structures; Management thereof
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

A data preprocessing method, a data preprocessing device and a computer-readable storage medium are provided. The method first performs data cleaning on a data set to be processed, where the data cleaning covers missing value processing, abnormal value processing, duplicate data removal, invalid discrete value removal and unbalanced data processing. Feature engineering is then performed on the cleaned data set: discrete values and continuous values are normalized separately, the feature dimension of the data set is calculated, feature dimension reduction is applied if the dimension exceeds a preset upper limit, and feature construction is applied if it falls below a preset lower limit. A user only needs to supply the data set to be processed to obtain high-quality data, without having to understand the internal condition of the data, which reduces the difficulty and workload of data preprocessing.

Description

Data preprocessing method and device and readable storage medium
Technical Field
The invention relates to the technical field of information processing, in particular to a data preprocessing method and device and a readable storage medium.
Background
With the development of big data, machine learning applications are increasingly being deployed in industry. In the research and application of machine learning, data quality is a key factor that affects the accuracy of a machine learning model and determines its upper bound. How to preprocess data quickly and effectively to improve its quality has therefore become a key problem in machine learning data preprocessing. For example, in distributed machine learning, Spark is widely used in industry as a fast, general-purpose large-scale data processing technology, but there is no general, mature automatic preprocessing or automatic feature engineering technology in this field, so data analysts must process data manually to improve its quality. Data preprocessing, however, may involve many operations such as data cleaning and feature engineering, which imposes a heavy workload and considerable difficulty on data analysts.
Disclosure of Invention
The application provides a data preprocessing method and device and a readable storage medium, aiming to solve the problem that the prior art does not support automatic data preprocessing, so that data analysts must handle the data manually, which makes data preprocessing laborious and difficult.
According to a first aspect, there is provided in an embodiment a data pre-processing method comprising:
acquiring a data set to be processed;
performing data cleaning on the data set to be processed, wherein tasks of the data cleaning comprise missing value processing, abnormal value processing, repeated data elimination, invalid discrete value elimination and unbalanced data processing;
performing feature engineering processing on the data set subjected to data cleaning so as to finish preprocessing the data set to be processed, wherein the feature engineering processing comprises the following steps: respectively carrying out standardization processing on the discrete values and the continuous values, then calculating the characteristic dimension of the data set, carrying out characteristic dimension reduction if the characteristic dimension is greater than a preset dimension upper limit value, and carrying out characteristic construction if the characteristic dimension is less than a preset dimension lower limit value;
and outputting the preprocessed data set.
In one embodiment, the dataset to be processed includes a feature column and a tag column, and the missing value processing includes: calculating the proportion of missing values in the characteristic column, deleting the column when the proportion of the missing values in the characteristic column is larger than a preset missing value proportion threshold value, and otherwise, filling the missing values in the column;
the outlier processing includes: judging whether the data is an abnormal value or not, and if so, deleting the abnormal value and filling;
the repeated data elimination comprises the following steps: judging whether the proportion of the repeated data in the characteristic column exceeds a preset repeated data proportion threshold value, if so, deleting the column, and otherwise, keeping the column;
the invalid discrete value culling comprises: calculating the proportion of discrete values in the characteristic column, judging the discrete values of the column to be invalid discrete values when the proportion of the discrete values is greater than a preset discrete value proportion threshold value and deleting the column, and otherwise, keeping the column;
the unbalanced data processing comprises: calculating whether the proportions of different category values in the label column are the same; if they are not, judging that the sample sizes of the categories in the data set to be processed are unbalanced, and carrying out data balance treatment so that the proportions of the samples of each category in the data set to be processed are the same.
In one embodiment, the missing values and outliers are filled using linear filling, fixed value filling, mode filling, median filling, or KNN filling.
In one embodiment, the determination of whether the data is an outlier is made by:
when the data deviates from the mean value of the data in the column where it is located by more than 3δ, judging the data to be an abnormal value, wherein δ is the standard deviation of the data in the column where the data is located;
or when the data lies above the upper quartile or below the lower quartile of the box plot of the column where the data is located, judging the data to be an abnormal value;
or when the data accords with a preset regular expression rule, judging the data to be an abnormal value.
In one embodiment, random sampling or SMOTE algorithm is used to perform data balance treatment, so that the proportion of the sample size of each category in the data set to be processed is the same.
In one embodiment, the normalizing the discrete values and the continuous values respectively includes:
for the discrete value, converting the character string into an index, and then carrying out One-Hot coding on the index; for continuous values, performing data normalization and/or standardization.
In one embodiment, the calculating a feature dimension of the data set, performing feature dimension reduction if the feature dimension is greater than a preset dimension upper limit, and performing feature construction if the feature dimension is less than a preset dimension lower limit includes:
and if the characteristic dimension is larger than the preset dimension upper limit value, performing characteristic dimension reduction through a principal component analysis algorithm, an independent component analysis algorithm or a TSNE algorithm, and if the characteristic dimension is smaller than the preset dimension lower limit value, using polynomial feature construction to expand the features into a higher-order space so as to increase the characteristic dimension.
In one embodiment, the data cleaning tasks are scheduled in a first-in-first-out (FIFO) mode or in a shared resource (FAIR) mode.
According to a second aspect, an embodiment provides a data preprocessing apparatus, comprising:
the data set acquisition module is used for acquiring a data set to be processed;
the data cleaning module is used for cleaning the data of the data set to be processed, and tasks of the data cleaning include missing value processing, abnormal value processing, repeated data elimination, invalid discrete value elimination and unbalanced data processing;
the characteristic engineering module is used for carrying out characteristic engineering processing on the data set subjected to data cleaning so as to finish preprocessing the data set to be processed, and the characteristic engineering processing comprises the following steps: respectively carrying out standardization processing on the discrete values and the continuous values, then calculating the characteristic dimension of the data set, carrying out characteristic dimension reduction if the characteristic dimension is greater than a preset dimension upper limit value, and carrying out characteristic construction if the characteristic dimension is less than a preset dimension lower limit value;
and the data output module is used for outputting the preprocessed data set.
According to a third aspect, an embodiment provides a computer-readable storage medium having a program stored thereon, the program being executable by a processor to implement the data preprocessing method of the first aspect.
According to the data preprocessing method, device and computer-readable storage medium of the embodiments, data cleaning is first performed on the data set to be processed; the cleaning tasks include missing value processing, abnormal value processing, repeated data removal, invalid discrete value removal and unbalanced data processing. Feature engineering is then performed on the cleaned data set: discrete values and continuous values are normalized separately, the feature dimension of the data set is calculated, feature dimension reduction is applied if the dimension exceeds a preset upper limit, and feature construction is applied if it falls below a preset lower limit. Because many kinds of preprocessing operations are included, most low-quality data can be handled; data cleaning and feature engineering are connected and effectively linked, generating high-quality features for machine learning training and helping to improve the performance of a machine learning model. In use, the user only needs to input the data set to be processed to obtain high-quality data, without having to understand the internal condition of the data, which reduces the difficulty and workload of data preprocessing and makes the machine learning platform more intelligent and accessible.
Drawings
FIG. 1 is a flow diagram of a data pre-processing method in one embodiment;
FIG. 2 is a flow diagram of a feature engineering process of an embodiment;
FIG. 3 is a schematic structural diagram of a data preprocessing apparatus in an embodiment.
Detailed Description
The present invention will be described in further detail below with reference to the detailed description and the accompanying drawings, wherein like elements in different embodiments are given like reference numerals. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted, or replaced with other elements, materials or methods, in different instances. In some instances, certain operations related to the present application are not shown or described in detail in order to avoid obscuring the core of the present application with excessive description; a detailed description of these operations is not necessary for those skilled in the art, who can fully understand them from the description in the specification and from general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the method descriptions may be swapped or reordered, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such a sequence must be followed.
The ordinal numbering of components herein, e.g. "first", "second", etc., is used only to distinguish the objects described and does not carry any sequential or technical meaning. The terms "connected" and "coupled", when used in this application and unless otherwise indicated, include both direct and indirect connections (couplings).
Referring to fig. 1, a data preprocessing method in an embodiment of the present application includes steps 110 to 140, which are described in detail below.
Step 110: a dataset to be processed is obtained. The data set to be processed is generally sample data used for machine learning training, and is structured data, which is usually stored in the form of a table, each row in the table represents a sample, the columns include a feature column and a label column, each feature column represents a feature, and the label column represents a category of the sample.
Step 120: and performing data cleaning on the data set to be processed. The data cleansing may include various tasks and policies according to specific needs of data processing, and in one embodiment, the tasks of data cleansing include missing value processing, abnormal value processing, repeated data elimination, invalid discrete value elimination, and unbalanced data processing, which are described below.
Missing value processing: the proportion of missing values in each feature column is calculated; when the proportion of missing values in a column is larger than a preset missing-value proportion threshold, the column is deleted, otherwise the missing values in the column are filled. The filling strategy can be linear filling, fixed value filling, mode filling, median filling, KNN (K-Nearest Neighbor) filling, or the like.
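As a concrete illustration of this rule, the following is a minimal PySpark sketch (Spark is only mentioned as an example engine in the background, so using PySpark here is an assumption); the input file name, the 0.5 threshold and the choice of median filling are illustrative values, not values fixed by this disclosure.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import NumericType
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("samples.csv", header=True, inferSchema=True)  # hypothetical input file

MISSING_RATIO_THRESHOLD = 0.5  # assumed "missing value proportion threshold"
total = df.count()

# Drop any feature column whose missing-value proportion exceeds the threshold.
for col in df.columns:
    missing = df.filter(F.col(col).isNull()).count()
    if missing / total > MISSING_RATIO_THRESHOLD:
        df = df.drop(col)

# Fill the remaining gaps in numeric columns; median filling is one of the strategies listed above.
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
imputer = Imputer(strategy="median",
                  inputCols=numeric_cols,
                  outputCols=[c + "_filled" for c in numeric_cols])
df = imputer.fit(df).transform(df)
```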
Abnormal value processing: first determine whether a data item is an abnormal value; if so, delete it and fill in a replacement. Whether a data item is abnormal can be judged by any of the following three strategies:
Normal distribution strategy: according to the 3δ rule, when a value deviates from the mean of the data in its column by more than 3δ, it is determined to be an abnormal value, where δ is the standard deviation of the data in that column.
Box plot strategy: when a value lies above the upper quartile or below the lower quartile of the box plot of its column, it is judged to be an abnormal value.
Fixed rule strategy: when a value matches a preset regular expression rule, it is judged to be an abnormal value. The regular expression may be preset by the user.
Data determined to be abnormal are deleted and refilled; the filling strategy may be linear filling, fixed-value filling, mode filling, median filling, KNN filling, or the like.
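For the normal-distribution strategy, a 3δ check in PySpark could look like the sketch below; the column name "value" is an assumed example, df is the DataFrame from the earlier missing-value sketch, and the nulled entries would then be refilled with one of the strategies listed above (for instance the median Imputer shown earlier).

```python
from pyspark.sql import functions as F

# Column statistics for the assumed numeric column "value".
stats = df.select(F.mean("value").alias("mu"), F.stddev("value").alias("sigma")).first()
mu, sigma = stats["mu"], stats["sigma"]

# Values farther than 3*sigma from the column mean are treated as outliers and nulled out,
# so that the missing-value filling step can replace them.
df = df.withColumn(
    "value",
    F.when(F.abs(F.col("value") - mu) > 3 * sigma, F.lit(None)).otherwise(F.col("value")),
)
```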
Removing repeated data: determine whether the proportion of repeated data in a feature column exceeds a preset repeated-data proportion threshold; if so, delete the column, otherwise keep it.
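One way to realize this check is sketched below; the 0.9 threshold is an assumed value for illustration, and df is the DataFrame being cleaned as in the earlier sketches.

```python
DUPLICATE_RATIO_THRESHOLD = 0.9  # assumed "repeated data proportion threshold"
total = df.count()

for col in df.columns:
    distinct = df.select(col).distinct().count()
    duplicate_ratio = 1 - distinct / total   # share of rows that repeat an earlier value
    if duplicate_ratio > DUPLICATE_RATIO_THRESHOLD:
        df = df.drop(col)
```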
Invalid discrete value elimination: according to its value type, each column in the data set is either discrete or continuous, and the data in a column should be consistent with the column's type. It can happen, however, that a column originally set as continuous is later filled with many discrete values through misoperation or some abnormality, so that the column no longer contains purely continuous (or purely discrete) values but a mixture of types; such discrete values are invalid and the column should be removed. In this embodiment this case is identified from the proportion of discrete values in the column: the proportion of discrete values in the feature column is calculated, and when it exceeds a preset discrete-value proportion threshold the discrete values of the column are judged invalid and the column is deleted, otherwise the column is kept. Whether a value is discrete is generally determined by its data type; for example, String or Boolean data are discrete values, while numeric data default to continuous values.
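A possible sketch of this check follows, under the assumption that a non-null value which cannot be cast to a number counts as discrete; the column name and threshold are illustrative choices rather than values from this disclosure.

```python
from pyspark.sql import functions as F

DISCRETE_RATIO_THRESHOLD = 0.1   # assumed "discrete value proportion threshold"
col = "pressure"                 # an assumed column that was declared continuous
total = df.count()

# Non-null values that cannot be cast to double are counted as discrete values.
discrete = df.filter(F.col(col).cast("double").isNull() & F.col(col).isNotNull()).count()
if discrete / total > DISCRETE_RATIO_THRESHOLD:
    df = df.drop(col)            # the discrete values are invalid, so the column is removed
```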
Processing unbalanced data: for machine learning training to work well, the training data should be as balanced as possible, that is, the proportions of samples of different classes should be as close to equal as possible; if the samples are unbalanced, data balancing should be performed. In this embodiment, whether the proportions of samples of different classes in the data set to be processed are the same is determined by checking whether the proportions of the different class values in the label column are the same; if they differ, the sample sizes of the classes are judged to be unbalanced and data balancing is performed so that the classes in the data set to be processed have equal sample proportions. Data balancing may use either of two strategies: random sampling or the SMOTE (Synthetic Minority Oversampling Technique) algorithm.
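A sketch of the random-sampling strategy is shown below, assuming a label column named "label"; stratified sampling brings every class down to roughly the size of the smallest class. SMOTE is not part of Spark MLlib, so only the random-sampling path is illustrated here.

```python
# df is assumed to carry a "label" column; counts maps each class value to its row count.
counts = {row["label"]: row["count"] for row in df.groupBy("label").count().collect()}
smallest = min(counts.values())

# Sample each class with a fraction that equalizes the class sizes.
fractions = {label: smallest / n for label, n in counts.items()}
balanced = df.sampleBy("label", fractions=fractions, seed=42)
```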
The data cleaning tasks have no required order and can be executed in parallel; they can be scheduled under two scheduling strategies: FIFO and FAIR. FIFO (First In First Out) is a first-in-first-out mode in which tasks are submitted in sequence and executed in the order of submission; this mode can be set as the default option. FAIR is a shared-resource mode in which different tasks share computing resources with the same priority and can run in parallel; this mode can be chosen when computing resources are abundant.
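In Spark these two strategies correspond to the scheduler mode setting; a sketch of switching a job to FAIR scheduling is shown below, where the application name is purely illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("data-preprocessing")            # illustrative application name
         .config("spark.scheduler.mode", "FAIR")   # default scheduler mode is FIFO
         .getOrCreate())
```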
Because the data cleaning covers a variety of preprocessing operations, most low-quality data can be handled. After cleaning, missing values, abnormal values, repeated values and invalid values have been removed or filled and the samples in the data set are better balanced, which improves data quality and helps improve the performance of a machine learning model.
Step 130: and performing characteristic engineering processing on the data set subjected to data cleaning. The characteristic engineering treatment mainly comprises the following steps: respectively carrying out standardization processing on the discrete values and the continuous values, then calculating the characteristic dimension of the data set, carrying out characteristic dimension reduction if the characteristic dimension is larger than a preset dimension upper limit value, and carrying out characteristic construction if the characteristic dimension is smaller than a preset dimension lower limit value. Specifically, referring to fig. 2, the feature engineering process includes steps 131 to 137, which are described in detail below.
Step 131: it is determined whether the data is a discrete value or a continuous value, and step 132 is performed if the data is a discrete value, and step 133 is performed if the data is a continuous value.
Step 132: convert the character string into an index; for example, when Spark is used, the string can be converted with its StringIndexer, and the index is then One-Hot encoded, so that a discrete value representing a category is converted into a binary-vector representation, which facilitates machine learning training.
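A sketch of step 132 with Spark MLlib follows; the column names are assumptions for illustration, and df is the cleaned DataFrame from the earlier sketches.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# "city" is an assumed string-valued discrete column.
indexer = StringIndexer(inputCol="city", outputCol="city_idx", handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["city_idx"], outputCols=["city_vec"])
df = Pipeline(stages=[indexer, encoder]).fit(df).transform(df)
```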
Step 133: normalize and/or standardize the continuous values; normalized or standardized data can improve the accuracy of machine learning.
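A sketch of step 133: the continuous columns (assumed names) are assembled into a vector and standardized; MinMaxScaler could be used instead when normalization to the range [0, 1] is preferred.

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

# "age" and "income" are assumed continuous columns on the cleaned DataFrame df.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="num_features")
assembled = assembler.transform(df)

scaler = StandardScaler(inputCol="num_features", outputCol="scaled_features",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)
```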
Step 134: the processed discrete and continuous features are combined.
Step 135: and calculating the characteristic dimension of the data set, and comparing the characteristic dimension with a preset dimension upper limit value and a preset dimension lower limit value. Step 136 is performed if the characteristic dimension of the data set is greater than the preset upper dimension limit value, and step 137 is performed if the characteristic dimension of the data set is less than the preset lower dimension limit value. The dimension upper limit value can be set to 100, the dimension lower limit value can be set to 10, and a user can modify the dimension upper limit value and the dimension lower limit value.
Step 136: the characteristic dimension of the data set is larger than the preset dimension upper limit value, indicating that the feature dimension is too large, so the dimension is reduced to a suitable value by a PCA (Principal Component Analysis), ICA (Independent Component Analysis) or TSNE (t-Distributed Stochastic Neighbor Embedding) algorithm.
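A sketch of the PCA branch of step 136 is shown below; the target dimension k is a tunable choice, not a value fixed by this disclosure, and "scaled_features" refers to the vector column from the step 133 sketch.

```python
from pyspark.ml.feature import PCA

# Reduce the feature vector to an assumed target of 20 principal components.
pca = PCA(k=20, inputCol="scaled_features", outputCol="pca_features")
reduced = pca.fit(scaled).transform(scaled)
```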
Step 137: the characteristic dimension of the data set is smaller than the preset dimension lower limit value, indicating that the feature dimension is too small, so the features need to be expanded; polynomial feature construction can be used to expand the features into a higher-order space and thereby increase the feature dimension.
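A sketch of step 137 using Spark's polynomial expansion follows; degree 2 is an illustrative choice, and "scaled_features" again refers to the vector column from the step 133 sketch.

```python
from pyspark.ml.feature import PolynomialExpansion

# Expand the feature vector with all degree-2 polynomial combinations to raise its dimension.
poly = PolynomialExpansion(degree=2, inputCol="scaled_features", outputCol="poly_features")
expanded = poly.transform(scaled)
```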
After the feature engineering processing, high-quality features are generated, the operation of manually performing the feature engineering by a user is omitted, the performance of a machine learning model is improved, and the workload and the use difficulty of the user are reduced.
Step 140: and outputting the preprocessed data set. The preprocessed data set can be used for training a machine learning model.
On the basis of the foregoing data preprocessing method, the present application further provides a data preprocessing device, please refer to fig. 3, and in an embodiment, the data preprocessing device includes a data set obtaining module 1, a data cleaning module 2, a feature engineering module 3, and a data output module 4, which are respectively described below.
Before the device is operated, the user inputs a data set to be processed into the device. The device can run with its default settings, and the settings can also be modified, for example the strategies for filling missing values and abnormal values, the strategy for judging abnormal values, and the various thresholds.
The data set obtaining module 1 is used for obtaining a data set to be processed. After the data set to be processed is obtained, the device scans all data in the data set to be processed, and records the type, missing value proportion, abnormal value proportion, repeated data proportion, discrete value proportion and proportion occupied by different types of samples of each line of data.
The data cleaning module 2 is used for cleaning data of the data set to be processed. The data cleansing may include various tasks and strategies according to specific needs of data processing, in an embodiment, the tasks of data cleansing include missing value processing, abnormal value processing, repeated data elimination, invalid discrete value elimination, and unbalanced data processing, and referring to fig. 3, correspondingly, the data cleansing module includes a missing value processing unit 21, an abnormal value processing unit 22, a repeated data elimination unit 23, an invalid discrete value elimination unit 24, and an unbalanced data processing unit 25, which are respectively described below.
The missing value processing unit 21 is configured to compare the ratio of missing values in the feature column with a preset missing value ratio threshold, delete the column when the ratio of missing values in the feature column is greater than the preset missing value ratio threshold, otherwise perform missing value filling on the column, where the filling policy may select linear filling, fixed value filling, mode filling, median filling, or KNN (K-Nearest Neighbor) filling, and the like.
The abnormal value processing unit 22 is used for judging whether the data is an abnormal value, and if so, deleting the abnormal value and filling. Whether the data is an abnormal value can be judged by the following three strategies:
Normal distribution strategy: according to the 3δ rule, when a value deviates from the mean of the data in its column by more than 3δ, it is determined to be an abnormal value, where δ is the standard deviation of the data in that column.
Box plot strategy: when a value lies above the upper quartile or below the lower quartile of the box plot of its column, it is judged to be an abnormal value.
Fixed rule strategy: when a value matches a preset regular expression rule, it is judged to be an abnormal value. The regular expression may be preset by the user.
For the data determined as the abnormal value, the abnormal value processing unit 22 deletes and fills the data, and the filling strategy may select linear filling, fixed value filling, mode filling, median filling, KNN filling, or the like.
The repeated data eliminating unit 23 is configured to determine whether a ratio of repeated data in the feature column exceeds a preset repeated data ratio threshold, delete the column if the ratio of repeated data in the feature column exceeds the preset repeated data ratio threshold, and otherwise, retain the column.
The invalid discrete value eliminating unit 24 is configured to compare the proportion of discrete values in a feature column with a preset discrete-value proportion threshold: when the proportion of discrete values is greater than the threshold, the discrete values of the column are judged to be invalid and the column is deleted, otherwise the column is kept. According to its value type, each column in the data set is either discrete or continuous, and the data in a column should be consistent with the column's type; it can happen, however, that a column the user originally set as continuous is later filled with many discrete values through misoperation or some abnormality, so that the column no longer contains purely continuous (or purely discrete) values but a mixture of types. Such discrete values are invalid and the column should be removed. In this embodiment this case is identified from the proportion of discrete values in the column. Whether a value is discrete can generally be judged from its data type; for example, String or Boolean data are discrete values, while numeric data default to continuous values.
For machine learning training to work well, the training data should be as balanced as possible, that is, the proportions of samples of different classes should be as close to equal as possible; if the samples are unbalanced, data balancing should be performed. In this embodiment, the unbalanced data processing unit 25 is configured to compare whether the proportions of the different class values in the label column of the data set to be processed are the same, so as to determine whether the proportions of samples of the different classes are the same; if the proportions differ, the sample sizes of the classes are judged to be unbalanced and data balancing is performed so that the classes in the data set to be processed have equal sample proportions. Data balancing may use either of two strategies: random sampling or the SMOTE (Synthetic Minority Oversampling Technique) algorithm.
The data cleaning tasks have no required order and can be executed in parallel; they can be scheduled under two scheduling strategies: FIFO and FAIR. FIFO (First In First Out) is a first-in-first-out mode in which tasks are submitted in sequence and executed in the order of submission; this mode can be set as the default option. FAIR is a shared-resource mode in which different tasks share computing resources with the same priority and can run in parallel; this mode can be chosen when computing resources are abundant.
The data cleaning module 2 performs a variety of data preprocessing operations, so most low-quality data can be handled. After cleaning, missing values, abnormal values, repeated values and invalid values have been removed or filled and the samples in the data set are better balanced, which improves data quality and helps improve the performance of a machine learning model.
The characteristic engineering module 3 is used for performing characteristic engineering processing on the data set subjected to data cleaning, so as to complete preprocessing of the data set to be processed. The characteristic engineering treatment mainly comprises the following steps: respectively carrying out standardization processing on the discrete values and the continuous values, then calculating the characteristic dimension of the data set, carrying out characteristic dimension reduction if the characteristic dimension is larger than a preset dimension upper limit value, and carrying out characteristic construction if the characteristic dimension is smaller than a preset dimension lower limit value. Referring to fig. 3, correspondingly, the feature engineering module 3 includes a discrete value processing unit 31, a continuous value processing unit 32, a feature dimension reducing unit 33, and a feature constructing unit 34, which are respectively described below.
The discrete value processing unit 31 is configured to convert character strings into indexes; for example, when Spark is used, a string can be converted with its StringIndexer, and the index is then One-Hot encoded, so that a discrete value representing a category is converted into a binary-vector representation, which benefits machine learning training.
The continuous value processing unit 32 is configured to normalize and/or standardize the continuous values; normalizing and/or standardizing the continuous values may improve the accuracy of machine learning.
The feature dimension reduction unit 33 is configured to perform feature dimension reduction through a PCA (Principal Component Analysis), ICA (Independent Component Analysis) or TSNE (t-Distributed Stochastic Neighbor Embedding) algorithm when the feature dimension of the data set is greater than the preset dimension upper limit value, so as to reduce the feature dimension to a suitable value.
The feature construction unit 34 is configured to use polynomial feature construction to expand the features into a higher-order space, and thereby increase the feature dimension, when the feature dimension of the data set is smaller than the preset dimension lower limit value.
The data set is processed by the feature engineering module 3 to generate high-quality features, so that the operation of manually performing feature engineering by a user is omitted, the performance of a machine learning model is improved, and the workload and the use difficulty of the user are reduced.
The data output module 4 is used for outputting the preprocessed data set. The preprocessed data set can be used for training a machine learning model.
According to the data preprocessing method and device of the embodiments, data cleaning is first performed on the data set to be processed; the cleaning tasks include missing value processing, abnormal value processing, repeated data removal, invalid discrete value removal and unbalanced data processing. Feature engineering is then performed on the cleaned data set: discrete values and continuous values are normalized separately, the feature dimension of the data set is calculated, feature dimension reduction is applied if the dimension exceeds a preset upper limit, and feature construction is applied if it falls below a preset lower limit. Because many kinds of preprocessing operations are included, most low-quality data can be handled; data cleaning and feature engineering are connected into one pipeline, effectively linking the two stages and generating high-quality features for machine learning training, which helps improve the performance of the machine learning model. The whole preprocessing process can be executed by a computer: in use, the user only needs to input the data set to be processed to obtain high-quality data, without having to understand the internal condition of the data. Automatic data preprocessing is thus realized, the difficulty and workload of preprocessing are reduced, the machine learning platform becomes more intelligent and accessible, and the gap that distributed computing engines such as Spark do not support automatic data preprocessing is filled.
Reference is made herein to various exemplary embodiments. However, those skilled in the art will recognize that changes and modifications may be made to the exemplary embodiments without departing from the scope hereof. For example, the various operational steps, as well as the components used to perform the operational steps, may be implemented in differing ways depending upon the particular application or consideration of any number of cost functions associated with operation of the system (e.g., one or more steps may be deleted, modified or incorporated into other steps).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. Additionally, as will be appreciated by one skilled in the art, the principles herein may be reflected in a computer program product on a computer-readable storage medium pre-loaded with computer-readable program code. Any tangible, non-transitory computer-readable storage medium may be used, including magnetic storage devices (hard disks, floppy disks, etc.), optical storage devices (CD-ROM, DVD, Blu-Ray discs, etc.), flash memory, and/or the like. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including means for implementing the function specified. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified.
While the principles herein have been illustrated in various embodiments, many modifications of structure, arrangement, proportions, elements, materials, and components particularly adapted to specific environments and operative requirements may be employed without departing from the principles and scope of the present disclosure. The above modifications and other changes or modifications are intended to be included within the scope of this document.
The foregoing detailed description has been presented with reference to various embodiments. However, one skilled in the art will recognize that various modifications and changes may be made without departing from the scope of the present disclosure. Accordingly, the disclosure is to be considered in an illustrative and not a restrictive sense, and all such modifications are intended to be included within its scope. Also, benefits, advantages, and solutions to problems have been described above with regard to various embodiments. However, the benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. As used herein, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, system, article, or apparatus. Furthermore, the term "coupled," and any other variation thereof, as used herein refers to a physical connection, an electrical connection, a magnetic connection, an optical connection, a communicative connection, a functional connection, and/or any other connection.
Those skilled in the art will recognize that many changes may be made to the details of the above-described embodiments without departing from the underlying principles of the invention. Accordingly, the scope of the invention should be determined only by the claims.

Claims (10)

1. A method of pre-processing data, comprising:
acquiring a data set to be processed;
performing data cleaning on the data set to be processed, wherein tasks of the data cleaning comprise missing value processing, abnormal value processing, repeated data elimination, invalid discrete value elimination and unbalanced data processing;
performing feature engineering processing on the data set subjected to data cleaning so as to finish preprocessing the data set to be processed, wherein the feature engineering processing comprises the following steps: respectively carrying out standardization processing on the discrete values and the continuous values, then calculating the characteristic dimension of the data set, carrying out characteristic dimension reduction if the characteristic dimension is greater than a preset dimension upper limit value, and carrying out characteristic construction if the characteristic dimension is less than a preset dimension lower limit value;
and outputting the preprocessed data set.
2. The data preprocessing method of claim 1, wherein the dataset to be processed includes a feature column and a tag column, and the missing value processing includes: calculating the proportion of missing values in the characteristic column, deleting the column when the proportion of the missing values in the characteristic column is larger than a preset missing value proportion threshold value, and otherwise, filling the missing values in the column;
the outlier processing includes: judging whether the data is an abnormal value or not, and if so, deleting the abnormal value and filling;
the repeated data elimination comprises the following steps: judging whether the proportion of the repeated data in the characteristic column exceeds a preset repeated data proportion threshold value, if so, deleting the column, and otherwise, keeping the column;
the invalid discrete value culling comprises: calculating the proportion of discrete values in the characteristic column, judging the discrete values of the column to be invalid discrete values when the proportion of the discrete values is greater than a preset discrete value proportion threshold value and deleting the column, and otherwise, keeping the column;
the unbalanced data processing comprises: calculating whether the proportions of different category values in the label column are the same; if they are not, judging that the sample sizes of the categories in the data set to be processed are unbalanced, and carrying out data balance treatment so that the proportions of the samples of each category in the data set to be processed are the same.
3. The data preprocessing method of claim 2, wherein missing values and outliers are filled using linear filling, fixed value filling, mode filling, median filling or KNN filling.
4. The data preprocessing method of claim 2, wherein whether the data is an abnormal value is judged by:
when the data deviates from the mean value of the data in the column where it is located by more than 3δ, judging the data to be an abnormal value, wherein δ is the standard deviation of the data in the column where the data is located;
or when the data lies above the upper quartile or below the lower quartile of the box plot of the column where the data is located, judging the data to be an abnormal value;
or when the data accords with a preset regular expression rule, judging the data to be an abnormal value.
5. The data preprocessing method of claim 2 wherein the random sampling or SMOTE algorithm is used for data balance governance to make the proportion of the sample size of each category in the data set to be processed the same.
6. The data preprocessing method of claim 1, wherein the separately normalizing discrete values and continuous values comprises:
for the discrete value, converting the character string into an index, and then carrying out One-Hot coding on the index; for continuous values, performing data normalization and/or standardization.
7. The data preprocessing method of claim 1, wherein the calculating a feature dimension of the data set, performing feature dimension reduction if the feature dimension is greater than a preset dimension upper limit value, and performing feature construction if the feature dimension is less than a preset dimension lower limit value comprises:
and if the characteristic dimension is larger than the preset dimension upper limit value, performing characteristic dimension reduction through a principal component analysis algorithm, an independent component analysis algorithm or a TSNE algorithm, and if the characteristic dimension is smaller than the preset dimension lower limit value, using polynomial feature construction to expand the features into a higher-order space so as to increase the characteristic dimension.
8. The data preprocessing method of claim 1, wherein the data cleaning tasks are scheduled in a first-in-first-out (FIFO) mode or in a shared resource (FAIR) mode.
9. A data preprocessing apparatus, comprising:
the data set acquisition module is used for acquiring a data set to be processed;
the data cleaning module is used for cleaning the data of the data set to be processed, and tasks of the data cleaning include missing value processing, abnormal value processing, repeated data elimination, invalid discrete value elimination and unbalanced data processing;
the characteristic engineering module is used for carrying out characteristic engineering processing on the data set subjected to data cleaning so as to finish preprocessing the data set to be processed, and the characteristic engineering processing comprises the following steps: respectively carrying out standardization processing on the discrete values and the continuous values, then calculating the characteristic dimension of the data set, carrying out characteristic dimension reduction if the characteristic dimension is greater than a preset dimension upper limit value, and carrying out characteristic construction if the characteristic dimension is less than a preset dimension lower limit value;
and the data output module is used for outputting the preprocessed data set.
10. A computer-readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the data preprocessing method as claimed in any one of claims 1 to 8.
Application CN202110839234.5A (priority date 2021-07-23, filing date 2021-07-23), published as CN113626420A (pending): Data preprocessing method and device and readable storage medium

Priority Applications (1)

CN202110839234.5A, priority date 2021-07-23, filing date 2021-07-23: Data preprocessing method and device and readable storage medium

Applications Claiming Priority (1)

CN202110839234.5A, priority date 2021-07-23, filing date 2021-07-23: Data preprocessing method and device and readable storage medium

Publications (1)

CN113626420A, published 2021-11-09

Family

ID=78380862

Family Applications (1)

CN202110839234.5A (pending, published as CN113626420A), priority date 2021-07-23, filing date 2021-07-23: Data preprocessing method and device and readable storage medium

Country Status (1)

CN: CN113626420A

Cited By (1)

CN116738157A (cited by examiner), priority date 2023-08-09, published 2023-09-12, assignee 柏森智慧空间科技集团有限公司: Method for preprocessing data in property management platform



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination