CN113051317A - Data exploration method and system and data mining model updating method and system - Google Patents

Data exploration method and system and data mining model updating method and system

Info

Publication number
CN113051317A
Authority
CN
China
Prior art keywords
data
sample data
index
exploration
continuous numerical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110383259.9A
Other languages
Chinese (zh)
Inventor
蒋博劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yuncong Enterprise Development Co ltd
Original Assignee
Shanghai Yuncong Enterprise Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yuncong Enterprise Development Co ltd filed Critical Shanghai Yuncong Enterprise Development Co ltd
Priority to CN202110383259.9A
Publication of CN113051317A
Legal status: Pending

Classifications

    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor (G Physics; G06 Computing; G06F Electric digital data processing)
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F16/2228 Indexing structures
    • G06F16/24556 Aggregation; Duplicate elimination
    • G06F16/2462 Approximate or statistical queries
    • G06F16/285 Clustering or classification

Abstract

The invention provides a data exploration method and system and a data mining model updating method and system. Meta-information derivation is performed on a target data set to obtain the feature types of all sample data in the target data set; sample data corresponding to each feature type is then probed to obtain a corresponding probing result, the probing including at least one of index exploration and data distribution exploration. Stability index values between data sets are calculated based on the index exploration result, and the data mining model is updated when the stability index value is larger than a preset threshold. By performing data exploration on sample data on the basis of meta-information, the invention can judge whether the data mining model needs to be updated, so that the data mining model remains suitable for all sample data, including newly added samples. The invention requires little manual intervention and can automatically compute statistical information of data features, generate feature distribution images, and accurately derive data meta-information.

Description

Data exploration method and system and data mining model updating method and system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data exploration method and system, a data mining model updating method and system, a computer device, and a machine-readable medium.
Background
Today, big-data tabular data is the primary input form for machine learning data mining tasks; examples include personal basic information, demographic information, behavior logs and transaction records held in the databases and data warehouses of internet companies, banks and government agencies. A machine learning data mining model usually takes such information as input training samples to complete classification, regression or ranking tasks, ultimately serving business purposes such as recommendation, marketing and risk control. However, a trained data mining model has limited timeliness: as time goes on, the newly added samples and the samples previously used for modeling inevitably undergo a certain degree of distribution shift, so that a data mining model fitted to the original training samples is no longer suitable for the newly added samples. Therefore, for the training samples and the new samples, data exploration needs to be performed, and the distribution of each important index feature judged from the exploration result is used as the basis for deciding whether to update the data mining model.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a data exploration method and system, and a data mining model updating method and system, which are used for solving the technical problems existing in the prior art.
In order to achieve the above objects and other related objects, the present invention provides a data exploration method applied to a computer model training process, comprising the steps of:
performing meta-information derivation on a target data set to acquire the feature types of all sample data in the target data set;
sample data corresponding to each characteristic type is probed, and a corresponding probing result is obtained; the probing includes at least one of: index exploration and data distribution exploration.
Optionally, if the feature type of the sample data includes a continuous numerical feature and a discrete feature, the process of obtaining the index probing result includes:
determining a statistical index of continuous numerical characteristic sample data;
calculating an index value of the continuous numerical characteristic sample data according to the determined statistical index;
binning the continuous numerical characteristic sample data according to the determined statistical indexes and the calculated index values, and counting the proportion of the sample data in each bin interval relative to all the sample data;
distinguishing positive samples and negative samples of the continuous numerical characteristic sample data, and obtaining the proportion of positive and negative samples in each bin interval, so as to obtain an index exploration result of the continuous numerical characteristic sample data;
and/or determining a statistical index of discrete type characteristic sample data;
and calculating the index value of the discrete type characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete type characteristic sample data.
Optionally, the method further includes performing distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image in a target scenario according to the distribution exploration result; the target scenario of continuous numerical features includes a binary classification scenario, and the target scenario of discrete features includes a regression scenario.
Optionally, if the distribution of the continuous numerical characteristic sample data is probed, there are:
carrying out pairwise combination on sample data corresponding to the continuous numerical type features, taking one sample data in each combination as a horizontal axis value of the distribution image, and taking the other sample data as a vertical axis value of the distribution image; forming a sample data point based on the horizontal axis value and the vertical axis value, and filling the sample data point into the distribution image for display;
or, calculating the distance between any two continuous numerical characteristic sample data, clustering all the continuous numerical characteristic sample data according to the distance calculation result, and filling the clustered sample data into the distribution image for display.
Optionally, the method further includes using the label column information in a supervised scenario to distinguish sample label values on the distribution image with different colors.
Optionally, if a time column exists in the target data set, the method further includes:
constructing an index according to the time column, and performing mean value aggregation on sample data in a target time range to obtain a time series data set;
and generating a time series curve under the continuous numerical characteristic and the discrete characteristic based on the time series data set, and performing mean value smoothing on missing values in the time series curve.
The invention also provides a data mining model updating method, which comprises the following steps:
acquiring a training data set and a data set to be tested;
index probing is carried out on the training data set and the data set to be tested by using any one of the data probing methods described above, so as to obtain the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested;
calculating a stability index value between the training data set and the data set to be tested according to the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested, and updating a data mining model when the stability index value is larger than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank sample data.
The invention also provides a data exploration system, which is applied to the training process of the computer model and comprises the following steps:
the meta-information deducing module is used for deducing meta-information of a target data set to acquire the characteristic types of all sample data in the target data set;
the index probing module is used for probing the sample data corresponding to each characteristic type to obtain a corresponding probing result; the probing includes at least one of: index exploration and data distribution exploration.
Optionally, if the feature type of the sample data includes a continuous numerical feature and a discrete feature, the process of obtaining the index exploration result by the index probing module includes:
determining a statistical index of continuous numerical characteristic sample data;
calculating an index value of the continuous numerical characteristic sample data according to the determined statistical index;
binning the continuous numerical characteristic sample data according to the determined statistical indexes and the calculated index values, and counting the proportion of the sample data in each bin interval relative to all the sample data;
distinguishing positive samples and negative samples of the continuous numerical characteristic sample data, and obtaining the proportion of positive and negative samples in each bin interval, so as to obtain an index exploration result of the continuous numerical characteristic sample data;
and/or determining a statistical index of discrete type characteristic sample data;
and calculating the index value of the discrete type characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete type characteristic sample data.
Optionally, the system further comprises a distribution exploration module, configured to perform distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and to form and display a distribution image in a target scenario according to the distribution exploration result; the target scenario of continuous numerical features includes a binary classification scenario, and the target scenario of discrete features includes a regression scenario.
Optionally, if the distribution of the continuous numerical characteristic sample data is probed, there are:
carrying out pairwise combination on sample data corresponding to the continuous numerical type features, taking one sample data in each combination as a horizontal axis value of the distribution image, and taking the other sample data as a vertical axis value of the distribution image; forming a sample data point based on the horizontal axis value and the vertical axis value, and filling the sample data point into the distribution image for display;
or, calculating the distance between any two continuous numerical characteristic sample data, clustering all the continuous numerical characteristic sample data according to the distance calculation result, and filling the clustered sample data into the distribution image for display.
Optionally, the time sequence aggregation module is further included, and is configured to, when a time column exists in the target data set, perform mean value aggregation on sample data in a target time range based on an index constructed according to the time column, and obtain a time sequence data set;
and generating a time series curve under the continuous numerical characteristic and the discrete characteristic based on the time series dataset, and performing mean value smoothing processing on missing values in the time series curve.
The invention also provides a data mining model updating system, which comprises:
the acquisition module is used for acquiring a training data set and a data set to be tested;
the bin value dividing module is used for performing index exploration on the training data set and the data set to be tested by using any one of the data exploration methods to obtain a bin value ratio of continuous numerical characteristic sample data in the training data set and a bin value ratio of the continuous numerical characteristic sample data in the data set to be tested;
the model updating module is used for calculating a stability index value between the training data set and the data set to be tested according to the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested, and updating the data mining model when the stability index value is larger than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank sample data.
The present invention also provides a computer apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as in any one of the above.
The invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method as described in any one of the above.
As described above, the present invention provides a data exploration method and system and a data mining model updating method and system, with the following beneficial effects: the feature types of all sample data in the target data set are acquired by performing meta-information derivation on the target data set; sample data corresponding to each feature type is then probed to obtain a corresponding probing result, the probing including at least one of index exploration and data distribution exploration. The target data set may be a single data set or may be composed of a plurality of data sets. The invention can perform data exploration on one or more data sets, determine the feature distribution in the data sets from the exploration result, and then judge whether the data mining model needs to be updated based on that feature distribution, so that the data mining model remains suitable for all sample data, including newly added samples. The invention requires little manual intervention and can automatically compute statistical information of data features, generate feature distribution images, and accurately derive data meta-information. In addition, the invention can use MiniBatchKMeans to cluster the scatter points in the distribution image, reducing the complexity of the distribution image while keeping sample-point generation efficient. Moreover, the invention automatically performs time-series conversion on the data set and applies smoothing and filling to the converted time-series images, thereby automatically generating clear and effective time-series images. Meanwhile, for the generated distribution image, the invention can use the label column information in a supervised scenario to distinguish sample label values on the image with different colors.
Drawings
FIG. 1 is a flowchart illustrating a data probing method according to an embodiment;
fig. 2 is a schematic flowchart of a data probing method according to another embodiment;
FIG. 3 is a schematic diagram illustrating an exemplary exploration of the distribution of satisfaction levels in a continuous numerical feature;
FIG. 4 is a diagram illustrating a distribution exploration of device usage counts in a continuous numerical feature according to another embodiment;
FIG. 5 is a diagram illustrating salary distribution exploration in a discrete feature according to an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating industry distribution exploration in discrete features provided in accordance with another embodiment;
FIG. 7 is a diagram illustrating a combined distribution exploration of the grade feature and the satisfaction-level feature according to an exemplary embodiment;
FIG. 8 is a schematic diagram of a combined distribution exploration of the satisfaction-level feature and the time-in-network feature provided by another embodiment;
FIG. 9 is a schematic diagram of a combined distribution exploration of the grade feature and the time-in-network feature according to yet another embodiment;
FIG. 10 is a flowchart illustrating a method for updating a data mining model according to an embodiment;
FIG. 11 is a diagram illustrating a hardware configuration of a data probing system according to an embodiment;
FIG. 12 is a diagram illustrating a hardware configuration of a data mining model update system, according to an embodiment;
fig. 13 is a schematic hardware structure diagram of a terminal device according to an embodiment;
fig. 14 is a schematic hardware structure diagram of a terminal device according to another embodiment.
Description of the element reference numerals
M10 meta information derivation module
M20 probing module
M100 acquisition module
M200 bin value dividing module
M300 model updating module
1100 input device
1101 first processor
1102 output device
1103 first memory
1104 communication bus
1200 processing assembly
1201 second processor
1202 second memory
1203 communication assembly
1204 Power supply Assembly
1205 multimedia assembly
1206 Audio component
1207 input/output interface
1208 sensor assembly
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In addition to its large scale, big-data tabular data is rich in content and complex in form. Because it is so widely used, and the actual business behind each data source differs, the content and meaning of big-data tabular data differ across data sources: in a data table of demographic information, a row represents the information of one user; in a data table of behavior-log information, a row may represent one click/purchase action, or may represent a summary of a user's click/purchase behavior over one day or one month. Even within the same data table from the same data source, several data types are often mixed, such as numerical data, discrete data and timestamp data.
Machine learning data mining models typically serve classification, regression or ranking tasks at some particular granularity. For example, a credit risk-control model needs to judge the default probability for a certain user or a certain loan application; an advertisement recommendation model needs to generate a recommendation list for a certain user on a certain day. Even across different modeling scenarios, for tabular structured data the information contained in the original table is often insufficient to train a model that completes the task with high quality. New features (new columns in the data table) then need to be constructed through feature combination, feature transformation and other means (collectively called feature engineering), and appropriate feature screening performed, before a high-quality and efficient model can be trained. Effective features usually come from the modeling engineer's understanding of the data: apart from some fixed items distilled from long-term business experience in specific scenarios, other effective ways of generating features are constructed manually after the modeling engineer observes how the data presents itself. The presentation of the data mainly includes: the basic type of the data (continuous numerical, discrete, etc.), distribution curves, statistical indicators of the data, and so on. This information carries the business meaning hidden behind the data and is also called meta-information; meta-information is prior information, related to the business meaning behind the data, that cannot be read directly from the data content itself. For example, the same column of integers in the interval [0, 100] may represent an age, or may represent a category code such as a province/region code. When it represents age, it is essentially a numerical column and the magnitude relation of the values is meaningful: 30 years > 20 years > 10 years. If it represents a province/region code, the values 30, 20 and 10 have no size relationship, and the codes may be re-encoded in any order without changing the information in the data.
Furthermore, the model has a certain degree of timeliness: over time, the newly added samples and the samples previously used for modeling inevitably undergo a certain degree of distribution shift, so that the model fitted to the original training samples is no longer suitable for the newly added samples. Therefore, for the training samples and the newly added samples, the distribution of each important index feature needs to be monitored so that it can be judged in time whether the model needs to be replaced.
Therefore, as shown in fig. 1, the present invention provides a data exploration method applied to a computer model training process, comprising the following steps:
s10, performing meta-information derivation on the target data set to acquire the characteristic types of all sample data in the target data set; the target data set may be composed of one data set or may be composed of a plurality of data sets.
S20, probing the sample data corresponding to each characteristic type to obtain a corresponding probing result; probing the sample data includes at least one of: index exploration and data distribution exploration.
The method can perform data exploration on one or more data sets, determine the characteristic distribution condition in the data sets according to the exploration result, and then judge whether the data mining model needs to be updated or not based on the characteristic distribution condition in the data sets, so that the data mining model can adapt to all sample data including newly added samples. Specifically, according to the embodiment of the application, stability index values among data sets can be calculated based on the exploration result, and when the stability index values are larger than the preset threshold, the data mining model is updated, so that the data mining model can adapt to all sample data including the newly added samples.
According to the above description, in an exemplary embodiment, if the feature type of the sample data is a continuous numerical feature, the process of obtaining the index exploration result includes:
determining statistical indexes of the continuous numerical characteristic sample data; calculating index values of the continuous numerical characteristic sample data according to the determined statistical indexes; binning the continuous numerical characteristic sample data according to the determined statistical indexes and the calculated index values, and counting the proportion of sample data in each bin interval relative to all the sample data; and distinguishing positive samples from negative samples of the continuous numerical characteristic sample data, and acquiring the proportion of positive and negative samples in each bin interval, so as to obtain the index exploration result of the continuous numerical characteristic sample data. The statistical indexes of continuous numerical features include, but are not limited to: mean, standard deviation, median, interquartile range, skewness, kurtosis, feature value range (interval), total number of values, missing rate, zero-value rate, and so on.
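The following is a minimal sketch of this index-probing step for a continuous feature, assuming the sample data sits in a pandas DataFrame with a binary label column; the equal-frequency binning and the number of bins are illustrative assumptions, since the text does not fix a particular binning scheme.

```python
import pandas as pd

def probe_continuous_feature(df: pd.DataFrame, col: str, label: str, n_bins: int = 10) -> dict:
    """Index probing of one continuous numerical feature (illustrative sketch)."""
    s = df[col]
    stats = {
        "mean": s.mean(), "std": s.std(), "median": s.median(),
        "iqr": s.quantile(0.75) - s.quantile(0.25),
        "skew": s.skew(), "kurtosis": s.kurtosis(),
        "value_range": (s.min(), s.max()),
        "n_values": s.nunique(),
        "missing_rate": s.isna().mean(),
        "zero_rate": (s == 0).mean(),
    }
    # Equal-frequency binning; the text does not prescribe a particular binning scheme.
    bins = pd.qcut(s, q=n_bins, duplicates="drop")
    bin_ratio = bins.value_counts(normalize=True).sort_index()
    # Proportion of positive and negative samples inside each bin.
    pos_neg_per_bin = (
        df.groupby(bins, observed=True)[label]
          .value_counts(normalize=True)
          .unstack(fill_value=0)
    )
    return {"stats": stats, "bin_ratio": bin_ratio, "pos_neg_per_bin": pos_neg_per_bin}
```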
If the characteristic type of the sample data is a discrete characteristic, the process of obtaining the index exploration result comprises the following steps:
determining statistical indexes of discrete characteristic sample data; and calculating the index value of the discrete type characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete type characteristic sample data.
The statistical indexes of discrete features include, but are not limited to: mode, proportion of each value, total number of values, feature value domain (the set of values), missing rate, and so on. In the embodiment of the application, the feature columns in the data set can be automatically divided into discrete features and continuous numerical features according to empirical rules.
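A corresponding sketch for a discrete feature, under the same pandas assumption; the returned keys simply mirror the indicators listed above.

```python
import pandas as pd

def probe_discrete_feature(df: pd.DataFrame, col: str) -> dict:
    """Index probing of one discrete feature (illustrative sketch)."""
    s = df[col]
    modes = s.mode()
    return {
        "mode": modes.iloc[0] if not modes.empty else None,
        "value_ratio": s.value_counts(normalize=True).to_dict(),  # proportion of each value
        "n_values": s.nunique(),
        "value_domain": list(s.dropna().unique()),  # the set of values the feature takes
        "missing_rate": s.isna().mean(),
    }
```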
According to the above description, in an exemplary embodiment, the method further includes performing distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image in the target scenario according to the distribution exploration result; the target scenario of continuous numerical features includes a binary classification scenario, and the target scenario of discrete features includes a regression scenario. Specifically, as an example, in a supervised scenario the display should reflect, as far as possible, the correlation between each feature value and the final label column. Taking the most common binary classification scenario as an example, the data is split into positive and negative samples. For continuous numerical features, a histogram is drawn from the previous binning result, and in addition to the histogram and kernel density estimate of the overall data, the histogram and kernel density estimate under positive and negative samples are displayed at the same time, as shown in figs. 3 and 4. As can be seen from fig. 3, the proportion of negative samples in the leftmost bin is significantly higher than in the other bins, so the group with the lowest satisfaction level can be singled out and set as a label feature, and the model re-established. In scenarios such as risk control, a rejection rule can be set directly for such a bin with a high proportion of negative samples, improving overall efficiency. As can be seen from fig. 4, the rightmost bins are almost entirely negative samples; setting a feature or rule for values above a certain threshold separates out the rightmost samples and can improve the accuracy of the overall model to a great extent. As another example, for discrete features only the distribution histograms under positive and negative samples are shown, as in figs. 5 and 6, since a kernel density estimate no longer has practical significance for such data. According to fig. 5, the feature takes three values, low, medium and high, and the negative-sample probability of the high-salary group is obviously low, which can be used to construct features and rules. The proportion of the three groups is approximately 7:6:1, which should be respected when sampling the data set or constructing a contrast data set. As can be seen from fig. 6, the positive and negative sample ratios do not differ much at any value, so the feature only needs simple encoding. As another example, in a regression scenario, the samples are displayed in regions colored with a gradient according to the value of the discrete or continuous numerical label column.
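As a hedged illustration of how such distribution images could be produced, the sketch below uses matplotlib/seaborn (libraries not named in the text) to draw the overall and per-label histogram with kernel density estimate for a continuous feature, and the per-label histogram only for a discrete feature; the column names are placeholders.

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_continuous_feature_by_label(df, col, label):
    """Histogram plus kernel density estimate, overall and split by positive/negative
    label, for a binary classification scenario (sketch)."""
    fig, (ax_all, ax_split) = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(df[col], kde=True, ax=ax_all)  # overall histogram + KDE
    ax_all.set_title(f"{col}: overall")
    # Histogram + KDE per label value, so bins dominated by negative samples stand out.
    sns.histplot(data=df, x=col, hue=label, stat="density",
                 common_norm=False, kde=True, ax=ax_split)
    ax_split.set_title(f"{col}: positive vs negative samples")
    fig.tight_layout()
    return fig

def plot_discrete_feature_by_label(df, col, label):
    """Distribution histogram under positive and negative samples only (no KDE)."""
    ax = sns.countplot(data=df, x=col, hue=label)
    ax.set_title(f"{col}: counts by label")
    return ax.figure
```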
Since meta-information cannot be obtained from the data content itself, feature engineering performed without it can easily produce low-quality, redundant features and degrade the final model quality. For example, in the feature construction stage, crossing features is a common and effective approach, but blindly combining features also generates a large number of useless features. Taking second-order feature crosses as an example, suppose the data set contains n original features: if all features are combined pairwise, a single combination method already generates (n × (n-1))/2 new features, which not only brings a large amount of computation but also dilutes the original features to 2/(n+1) of the whole, making it harder for the machine learning model to learn useful information. In practice, it is usually necessary to observe the distribution of feature combinations in order to judge whether a combination is effective. Therefore, according to the above description, in an exemplary embodiment, if the feature type of the sample data includes continuous numerical features, the continuous numerical feature sample data are combined pairwise, one sample datum in each combination is taken as the horizontal-axis value of the distribution image and the other as the vertical-axis value; a sample data point is formed from the horizontal-axis and vertical-axis values, and the continuous numerical feature sample data are filled directly into the distribution image; or, the distance between any two continuous numerical feature sample data is calculated, all continuous numerical feature sample data are clustered according to the distance calculation result, and the clustered sample data are filled into the distribution image. Specifically, the continuous features are combined pairwise, one sample datum in each combination is used as the horizontal-axis value of the distribution image and the other as the vertical-axis value, a sample data point is formed from these values, and the point is then filled into the distribution image. When the feature-combination sub-graphs are generated, a two-dimensional scatter plot is mostly used: the horizontal and vertical axes of the image represent the two selected features, and each point represents the values of one sample on the two feature axes. When the data set is large and the number of samples is high, generating a large number of scatter points consumes substantial computing resources and time, and the resulting images are too dense for an observer to extract information from. To address the problem of large data sets with too many sample points, the MiniBatchKMeans clustering algorithm can be adopted, and the distances between data points computed in a batch-processing manner. The advantage of Mini Batch is that not all data samples are used in the calculation: a part of the samples is drawn from each class to represent it, and because fewer samples are computed, the running time is correspondingly reduced.
After the distances between data points are calculated, points that are too close together are merged appropriately so that the number of finally output data points is controlled. For a merged sample point, the points before merging may carry both positive and negative labels; in that case a voting method is adopted to determine the final label of the merged point. As an example, two non-label-column features are selected at a time and drawn as a scatter plot: data points are placed at the positions given by their values under the two features, and positive and negative samples are distinguished. When the data set is too large and the number of samples too high, generating the graph becomes too slow for performance reasons, so the sample points are clustered in advance with the MiniBatchKMeans algorithm and the number of clusters is manually controlled between 250 and 300; the number of points displayed on a single graph is thus kept below 300, improving efficiency while preserving a degree of readability. The positive/negative label of each cluster is decided by voting: if a cluster point is formed from 5 original data points, of which 3 are positive samples and 2 are negative samples, the cluster point is finally displayed as positive. (A dynamic-threshold method can also be used here; for example, if negative samples make up 10% of the original data set and the negative samples within a cluster point exceed 10%, the cluster point is judged negative.) The scatter plot of combined features helps discover the combined relationship between features, and hence construct cross features or multi-level rules, as shown in figs. 7 to 9. Fig. 7 shows the combined distribution of the grade feature and the satisfaction_level feature, where two regions of clearly higher negative-sample density can be found: first, the region where satisfaction_level is low and grade is high; second, the region where satisfaction_level is around 0.4 and grade is around 0.5. Samples in these two regions can be extracted to construct new features or rules and improve the final model performance. Fig. 8 shows the combined distribution of the satisfaction_level feature and the time-in-network feature net_time_length, and fig. 9 shows the combined distribution of the grade feature and net_time_length; figs. 8 and 9 are analyzed in the same way as fig. 7. Accordingly, when the images of pairwise feature combinations exhibit a clear tendency, constructing features from them can be considered effective; if the images of a feature combination remain scattered and irregular, that combination can simply be abandoned. By clustering the scatter plot with MiniBatchKMeans, the embodiment of the application reduces the complexity of the image while keeping sample-point generation efficient.
In the embodiment of the application, MiniBatchKMeans is an optimized variant of the K-Means algorithm that mainly speeds up computation when the data volume is large. Compared with the standard K-Means algorithm, Mini Batch K-Means computes faster and performs better on large data sets. The k-means algorithm, also called the k-means clustering algorithm, works roughly as follows: k samples are first randomly selected from the sample set as cluster centers, the distances between all samples and the k cluster centers are calculated, each sample is assigned to the cluster whose center is closest, and a new cluster center is computed for each new cluster. The main steps are: step 1, select K points as the initial cluster centers (non-sample points may also be chosen); step 2, calculate the distance from each sample point to the K cluster centers (typically the Euclidean or cosine distance), find the closest center and assign the point to the corresponding cluster; step 3, once all points have been assigned, the M points are divided into K clusters, after which the centroid (the mean) of each cluster is recomputed and taken as the new cluster center. Steps 2 and 3 are iterated until a termination condition is reached.
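The following sketch illustrates the described compression of a two-feature scatter plot with scikit-learn's MiniBatchKMeans followed by majority voting; the cluster count, batch size and column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

def compress_scatter(df: pd.DataFrame, feat_x: str, feat_y: str, label: str,
                     n_clusters: int = 280) -> pd.DataFrame:
    """Compress a two-feature scatter plot to a few hundred points with MiniBatchKMeans,
    then give each cluster point a positive/negative label by majority vote (sketch)."""
    sub = df[[feat_x, feat_y, label]].dropna()
    X = sub[[feat_x, feat_y]].to_numpy()
    y = sub[label].to_numpy()
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=1024, random_state=0)
    cluster_ids = km.fit_predict(X)
    rows = []
    for cid in range(n_clusters):
        members = y[cluster_ids == cid]
        if members.size == 0:
            continue
        # Majority vote; a dynamic threshold based on the global negative-sample
        # ratio could be used here instead, as described above.
        vote = 1 if members.mean() >= 0.5 else 0
        rows.append({feat_x: km.cluster_centers_[cid, 0],
                     feat_y: km.cluster_centers_[cid, 1],
                     label: vote})
    return pd.DataFrame(rows)
```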
For a data set that records the time at which samples occur, it is often necessary to inspect how particular features change in distribution over time. A simple scheme is to align the feature columns to be observed with the time column one by one, form the data into a time series, and plot it for observation. However, a time-series data set places fairly strict requirements on the time format: first, the time series used as the index must be normalized and standardized so that the time span is uniform and the time unit reasonable; second, the observed feature must have a unique value at each time point. In addition, after the data set is converted according to a time index, missing content under some time indexes is inevitable; directly filling with zeros would distort the curve and make it jagged. Therefore, in an exemplary embodiment, if a time column exists in the target data set, an index is constructed from the time column, and the sample data within the target time range are aggregated by mean to obtain a time-series data set; time-series curves under the continuous numerical features and the discrete features are generated from the time-series data set, and missing values in the curves are filled by mean smoothing. Specifically, if a time column exists in the data set, an index is built on it in units of year, month or day according to the total time span of the data, all feature data within the specified time range are aggregated by mean to obtain the time-series data set, the time-series curve under each feature is drawn from that data set, missing values are filled by mean smoothing, and each change is observed. Median smoothing can be used instead of mean smoothing to reduce the influence of outliers. The embodiment of the application can automatically convert a data set into time-series form and apply smoothing and filling to the converted time-series images, automatically generating clear and effective time-series images.
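A minimal pandas sketch of this time-series conversion, assuming a parseable time column; the daily resampling frequency and the three-period rolling window used for mean smoothing are illustrative assumptions.

```python
import pandas as pd

def to_time_series(df: pd.DataFrame, time_col: str, feature_cols: list,
                   freq: str = "D") -> pd.DataFrame:
    """Build a time-series data set: datetime index, per-period mean aggregation,
    mean smoothing of missing values (sketch; the frequency is illustrative)."""
    ts = (
        df.assign(**{time_col: pd.to_datetime(df[time_col])})
          .set_index(time_col)
          .sort_index()[feature_cols]
          .resample(freq)
          .mean()
    )
    # Fill gaps with a centered rolling mean of neighboring periods (mean smoothing);
    # a median-based smoother could be substituted to dampen outliers.
    filler = ts.rolling(window=3, min_periods=1, center=True).mean()
    return ts.fillna(filler)
```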
According to the above description, in an exemplary embodiment, the method further includes, for the generated distribution image, using the label column information in the supervised scene, and performing sample label value differentiation on the distribution image by using different colors. By way of example, the univariate profile and the combined profile of the data may be plotted using different colors, for example, according to the final data label.
In a specific embodiment, as shown in fig. 2 to fig. 9, the method proposes a data exploration method in a supervised scene, including:
and (4) meta-information configuration derivation, namely performing meta-information derivation on the target data set to acquire the characteristic types of all sample data in the target data set. Specifically, the method calculates according to the ratio of the total value of the features to the total number of samples, and if the ratio or the total value of the values is smaller than an experience threshold set by a user, the data is temporarily determined to be discrete type data; when the ratio is 1 and the character lengths of all samples in the feature are the same, the data are determined to be discrete ID class data. And when the numerical value is near the threshold value and is difficult to distinguish, calculating the sample number difference under adjacent values, and if the fluctuation is too large, judging the data to be discrete data. The meta information deduces information such as data types for exploring the data table, and generates a preset meta information configuration file. For the meta-information derivation stage, by default, the data type condition of each column in the data table is heuristically guessed according to the column name and data distribution of the data table, and the numerical column and the discrete column (category attribute column or ID column) are distinguished. And for the discrete columns, guessing whether the discrete columns are category attribute columns (with fewer values) or ID columns (with more values) according to the number and distribution of values appearing, and generating a default meta-information configuration file to allow a user to modify the configuration file.
Index statistics stage: indexes are explored with different statistical methods according to the feature types obtained in the meta-information derivation stage. Specifically, the feature columns in the data set are automatically divided into discrete features and continuous numerical features according to empirical rules, and different statistical indexes are calculated for the two types of data. Numerical features are binned, and the proportion of samples in each bin relative to the total, as well as the proportion of positive and negative samples within each bin, are counted. For discrete features, indexes such as the mode, the proportion of each value, the total number of values, the feature value domain (the set of values) and the missing rate are calculated.
Data distribution exploration stage: univariate distribution plots and combined distribution plots of the data are drawn in different colors according to the final data label. For numerical features, the data distribution histogram and the kernel density estimate are explored; for category-type features only the data distribution histogram is explored. The numerical features are combined pairwise, and a scatter plot is drawn with the different features as coordinates using the sample points clustered by MiniBatchKMeans. In a supervised scenario, the display should reflect, as far as possible, the correlation between each feature value and the final label column. Taking the most common binary classification scenario as an example, the data is split into positive and negative samples. For continuous numerical features, a histogram is drawn from the previous binning result, and in addition to the histogram and kernel density estimate of the overall data, the histogram and kernel density estimate under positive and negative samples are displayed at the same time, as shown in figs. 3 and 4. As can be seen from fig. 3, the proportion of negative samples in the leftmost bin is significantly higher than in the other bins, so the group with the lowest satisfaction level can be singled out and set as a label feature, and the model re-established. In scenarios such as risk control, a rejection rule can be set directly for such a bin with a high proportion of negative samples, improving overall efficiency. As can be seen from fig. 4, the rightmost bins are almost entirely negative samples; setting a feature or rule for values above a certain threshold separates out the rightmost samples and can improve the accuracy of the overall model to a great extent. As another example, for discrete features only the distribution histograms under positive and negative samples are shown, as in figs. 5 and 6, since a kernel density estimate no longer has practical significance. According to fig. 5, the feature takes three values, low, medium and high, and the negative-sample probability of the high-salary group is obviously low, which can be used to construct features and rules. The proportion of the three groups is approximately 7:6:1, which should be respected when sampling the data set or constructing a contrast data set. As can be seen from fig. 6, the positive and negative sample ratios do not differ much at any value, so the feature only needs simple encoding. As another example, in a regression scenario, the samples are displayed in regions colored with a gradient according to the value of the discrete or continuous numerical label column, as shown in figs. 7 to 9. Fig. 7 shows the combined distribution of the grade feature and the satisfaction_level feature, where two regions of clearly higher negative-sample density can be found: first, the region where satisfaction_level is low and grade is high; second, the region where satisfaction_level is around 0.4 and grade is around 0.5. Samples in these two regions can be extracted to construct new features or rules and improve the final model performance. Fig. 8 shows the combined distribution of the satisfaction_level feature and the time-in-network feature net_time_length, and fig. 9 shows the combined distribution of the grade feature and net_time_length; figs. 8 and 9 are analyzed in the same way as fig. 7. Accordingly, when the images of pairwise feature combinations exhibit a clear tendency, constructing features from them can be considered effective; if the images of a feature combination remain scattered and irregular, that combination can simply be abandoned. By clustering the scatter plot with MiniBatchKMeans, the embodiment of the application reduces the complexity of the image while keeping sample-point generation efficient.
Time-series exploration stage: the time column to be converted is selected, a time index with an appropriate span is generated automatically, and the original features are aggregated by mean according to the new time index. Missing values in the aggregated result are filled with a smoothing method, and finally a curve is generated. Specifically, if a time column exists in the data set, an index is built on it in units of year, month or day according to the total time span of the data, all feature data within the specified time range are aggregated by mean to obtain the time-series data set, the time-series curve under each feature is drawn from that data set, missing values are filled by mean smoothing, and each change is observed. Median smoothing can be used instead of mean smoothing to reduce the influence of outliers. The embodiment of the application can automatically convert the data set into time-series form and apply smoothing and filling to the converted time-series images, automatically generating clear and effective time-series images.
Data stability exploration stage: according to the training-set binning result, the bin value ratio within each feature bin is recorded; when a test set with consistent data format and content is available, the corresponding training set is selected and its feature binning result is read.
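A small sketch of recording the training-set binning result and re-applying it to a test set, assuming pandas and equal-frequency bins (an illustrative choice):

```python
import pandas as pd

def record_train_bins(train: pd.DataFrame, col: str, n_bins: int = 10):
    """Record the training-set bin edges and bin value ratios for one continuous feature."""
    binned, edges = pd.qcut(train[col], q=n_bins, retbins=True, duplicates="drop")
    train_ratio = binned.value_counts(normalize=True).sort_index()
    return edges, train_ratio

def test_bin_ratio(test: pd.DataFrame, col: str, edges) -> pd.Series:
    """Apply the stored training-set edges to the test set so the two ratios are comparable."""
    binned = pd.cut(test[col], bins=edges, include_lowest=True)
    return binned.value_counts(normalize=True).sort_index()
```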
In summary, to address the problems in the prior art, the method obtains the feature types of all sample data in the target data set by performing meta-information derivation on the target data set; sample data corresponding to each feature type is then probed to obtain a corresponding probing result, the probing including at least one of index exploration and data distribution exploration; stability index values between data sets are calculated based on the index exploration result, and the data mining model is updated when the stability index value is larger than a preset threshold. By performing data exploration on sample data on the basis of meta-information, the method can judge whether the data mining model needs to be updated, so that the model remains suitable for all sample data, including newly added samples. The method requires little manual intervention and can automatically compute statistical information of data features, generate feature distribution images, and accurately derive data meta-information. In addition, the method can use MiniBatchKMeans to cluster the scatter points in the distribution image, reducing the complexity of the distribution image while keeping sample-point generation efficient. Moreover, the method automatically performs time-series conversion on the data set and applies smoothing and filling to the converted time-series images, automatically generating clear and effective time-series images. Meanwhile, for the generated distribution image, the method can use the label column information in a supervised scenario to distinguish sample label values with different colors. The method can complete the data exploration process automatically, facilitating subsequent data mining and modeling; it can optimize the calculation result with a clustering algorithm so that the image is simpler and more intuitive, and it can also achieve dynamic monitoring of the model data.
As shown in FIG. 10, the present invention further provides a method for updating a data mining model, comprising the following steps:
s100, acquiring a training data set and a data set to be tested;
s200, index probing is carried out on the training data set and the data set to be tested by using the data probing method, so as to obtain the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested;
s300, calculating a stability index value between the training data set and the data set to be tested according to the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested, and updating the data mining model when the stability index value is larger than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank the sample data.
Specifically, a training data set and a data set to be tested are obtained, then index exploration is carried out on the two data sets, the ratio of the bin dividing values of continuous numerical value type sample data in each data set is obtained from an exploration result, and then the stability index value PSI between the two data sets is calculated according to the ratio of the bin dividing values of the two data sets. The stability index value PSI is calculated in the following way:
PSI = SUM over all bins of (A − E) × ln(A / E), where A is the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested for that bin, and E is the corresponding bin value ratio of the continuous numerical characteristic sample data in the training data set.
When the PSI value is larger than the preset threshold value, a prompt that the data mining model needs to be updated is given, and the data mining model is updated. The preset threshold may be set according to the actual situation; in the embodiment of the present application it is set to 0.25.
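As an illustration only, the PSI computation described above can be sketched in Python as follows; the function name, the example bin proportions and the handling of the 0.25 threshold are assumptions for demonstration, and the per-bin proportions are those produced by the index exploration step.

import numpy as np

def population_stability_index(train_bin_ratio, test_bin_ratio, eps=1e-6):
    # PSI = sum over bins of (actual - expected) * ln(actual / expected),
    # where expected = training-set bin proportion, actual = to-be-tested-set bin proportion.
    expected = np.clip(np.asarray(train_bin_ratio, dtype=float), eps, None)
    actual = np.clip(np.asarray(test_bin_ratio, dtype=float), eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Illustrative bin proportions of one continuous numerical feature (each list sums to 1).
train_bins = [0.25, 0.35, 0.25, 0.15]
test_bins = [0.10, 0.30, 0.30, 0.30]
psi = population_stability_index(train_bins, test_bins)
if psi > 0.25:  # preset threshold used in this embodiment
    print(f"PSI = {psi:.3f} > 0.25: prompt that the data mining model needs to be updated")
else:
    print(f"PSI = {psi:.3f}: the feature distribution is considered stable")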
This embodiment applies the data exploration method described above, so for its specific functions and technical effects reference may be made to the foregoing embodiments, and details are not repeated here.
As shown in fig. 11, the present invention further provides a data exploration system, which is applied to a computer model training process, and includes:
the meta information derivation module M10 is configured to perform meta information derivation on the target data set, and acquire feature types of all sample data in the target data set; the target data set may be composed of one data set or may be composed of a plurality of data sets.
A probing module M20, configured to probe sample data corresponding to each feature type to obtain a corresponding probing result; probing the sample data includes at least one of: index exploration and data distribution exploration.
The system can perform data exploration on one or more data sets, determine the feature distribution in the data sets from the exploration results, and then judge, based on that feature distribution, whether the data mining model needs to be updated, so that the model can adapt to all sample data including newly added samples. Specifically, in the embodiment of the application, stability index values between data sets are calculated based on the index exploration results, and the data mining model is updated when a stability index value is larger than the preset threshold.
According to the above description, in an exemplary embodiment, if the feature type of the sample data is a continuous numerical feature, the process of obtaining the index exploration result includes:
determining statistical indexes of the continuous numerical characteristic sample data; calculating index values of the continuous numerical characteristic sample data according to the determined statistical indexes; performing binning on the continuous numerical characteristic sample data according to the determined statistical indexes and the calculated index values, and counting the proportion of the sample data in each bin to all the sample data; and distinguishing the positive samples from the negative samples of the continuous numerical characteristic sample data, and obtaining the proportion of positive and negative samples in each bin, so as to obtain the index exploration result of the continuous numerical characteristic sample data. The statistical indexes of the continuous numerical features include but are not limited to: mean, standard deviation, median, interquartile range, skewness, kurtosis, feature value range (interval), total value count, missing rate, zero-value rate and the like.
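As a non-authoritative sketch, the index exploration of a single continuous numerical feature might be written with pandas roughly as follows; the column names, the number of bins, the equal-frequency binning and the 0/1 label convention are assumptions for illustration.

import numpy as np
import pandas as pd

def explore_continuous_feature(df, col, label_col, n_bins=10):
    s = df[col]
    # Statistical indexes: mean, standard deviation, median, interquartile range,
    # skewness, kurtosis, value range, missing rate, zero-value rate.
    stats = {
        "mean": s.mean(), "std": s.std(), "median": s.median(),
        "iqr": s.quantile(0.75) - s.quantile(0.25),
        "skewness": s.skew(), "kurtosis": s.kurt(),
        "value_range": (s.min(), s.max()),
        "missing_rate": s.isna().mean(),
        "zero_rate": (s == 0).mean(),
    }
    # Binning, and the proportion of samples falling into each bin (later consumed by PSI).
    bins = pd.qcut(s, q=n_bins, duplicates="drop")
    bin_ratio = bins.value_counts(normalize=True).sort_index()
    # Positive/negative composition of each bin, assuming a 0/1 label column.
    positive_rate_per_bin = df.groupby(bins, observed=True)[label_col].mean()
    return stats, bin_ratio, positive_rate_per_bin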
If the characteristic type of the sample data is a discrete characteristic, the process of obtaining the index exploration result comprises the following steps:
determining statistical indexes of discrete characteristic sample data; and calculating the index value of the discrete type characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete type characteristic sample data.
The statistical indexes of the discrete features include but are not limited to: mode, proportion of each value, total value count, feature value domain (array), missing rate and the like. In the embodiment of the application, the feature columns in the data set can be automatically divided into discrete features and continuous numerical features according to empirical rules.
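A comparable sketch for discrete features, together with one possible empirical rule for splitting columns into discrete and continuous numerical features, is given below; the cardinality threshold of 20 and the function names are purely assumptions.

import pandas as pd

def explore_discrete_feature(df, col):
    s = df[col]
    counts = s.value_counts(dropna=False)
    return {
        "mode": s.mode().iloc[0] if not s.mode().empty else None,
        "value_ratio": (counts / len(s)).to_dict(),  # proportion of each value
        "n_values": s.nunique(dropna=True),
        "value_domain": s.dropna().unique().tolist(),
        "missing_rate": s.isna().mean(),
    }

def infer_feature_type(s, max_discrete_cardinality=20):
    # Empirical rule: non-numeric columns, or numeric columns with few distinct values,
    # are treated as discrete features; everything else as continuous numerical features.
    if not pd.api.types.is_numeric_dtype(s):
        return "discrete"
    return "discrete" if s.nunique(dropna=True) <= max_discrete_cardinality else "continuous"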
According to the above description, in an exemplary embodiment, the method further includes performing distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image in the target scene according to the distribution exploration result; the target scene of the continuous numerical features includes a binary classification scene, and the target scene of the discrete features includes a regression scene. Specifically, as an example, in a supervised scene the correlation between each feature value and the label column should be reflected as far as possible. Taking the most common binary classification scene as an example, the data is split into positive and negative samples. For continuous numerical features, a histogram is drawn using the previous binning result, and in addition to the histogram and kernel density estimate of the overall data, the histograms and kernel density estimates under positive and negative samples are displayed at the same time, as shown in fig. 3 and 4. As can be seen from fig. 3, the proportion of negative samples in the leftmost bin is significantly higher than in the other bins, so the group with the lowest satisfaction level can be separated out and set as a label feature, and the model re-established. In scenes such as risk control, a bin with such a high proportion of negative samples can directly be given a rejection rule, improving overall efficiency. As can be seen from fig. 4, the rightmost bins contain almost exclusively negative samples; setting a feature or rule that flags values above a certain threshold separates these rightmost samples and can improve the accuracy of the overall model to a large extent. As another example, for discrete features only the distribution histograms under positive and negative samples are shown, as in fig. 5 and 6, since a kernel density estimate of such data no longer has practical significance. According to fig. 5, the feature takes three values, low, medium and high; the probability of a negative sample in the "high" group is obviously low, which can be used to construct features and rules. The proportion of the three groups is approximately 7:6:1, which should be respected when sampling the data set or constructing a contrast data set. As can be seen from fig. 6, the positive and negative sample ratios do not differ much for any value, so only a simple encoding of the feature is needed. As another example, in a regression scene, the sample colors in the distribution image are displayed as a gradient according to the value of the discrete or continuous numerical label column.
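For the binary classification plots described above, a rough matplotlib/seaborn sketch might look as follows; the figure layout, the styling choices and the bin_edges argument (assumed to come from the earlier binning step) are illustrative assumptions.

import matplotlib.pyplot as plt
import seaborn as sns

def plot_binary_distribution(df, col, label_col, bin_edges):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    # Histograms of the feature under positive and negative samples, reusing the bin edges.
    sns.histplot(data=df, x=col, hue=label_col, bins=bin_edges,
                 multiple="layer", ax=axes[0])
    # Kernel density estimate of the overall data (dashed) and of each class.
    sns.kdeplot(data=df, x=col, color="gray", linestyle="--", ax=axes[1])
    sns.kdeplot(data=df, x=col, hue=label_col, ax=axes[1])
    axes[0].set_title(f"{col}: binned counts by label")
    axes[1].set_title(f"{col}: kernel density by label")
    fig.tight_layout()
    return fig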
Since meta-information cannot be obtained from the data content itself, feature engineering without meta-information can easily produce low-quality, redundant features and affect the final model quality. For example, in the feature construction stage, crossing features is a common and effective approach, but blindly combining features also generates a large number of useless ones. Taking second-order feature crosses as an example, if the data set contains n original features and all of them are crossed pairwise, a single combination method already generates (n × (n-1))/2 new features; this not only brings a large amount of computation but also reduces the proportion of the original features to 2/(n+1) of the whole, making it harder for the machine learning model to learn useful information. In practice, it is often necessary to observe the distribution of feature combinations to judge whether a combination is effective. Therefore, according to the above description, in an exemplary embodiment, if the feature types of the sample data include continuous numerical features, the continuous numerical feature sample data are combined pairwise, one sample data in each combination is taken as the horizontal-axis value of the distribution image and the other as the vertical-axis value; sample data points are formed from the horizontal-axis and vertical-axis values and the continuous numerical feature sample data are filled directly into the distribution image; or, the distance between any two continuous numerical feature sample data is calculated, all the continuous numerical feature sample data are clustered according to the distance calculation results, and the clustered sample data are filled into the distribution image. Specifically, the continuous features are combined pairwise, one sample data in each combination is used as the horizontal-axis value of the distribution image and the other as the vertical-axis value, sample data points are then formed from these values and filled into the distribution image. When the feature-combination sub-plots are generated, a two-dimensional scatter diagram is mostly used: the horizontal and vertical axes represent the two selected features, and each point represents the values of one sample on the two feature axes. When the data set is large and the number of samples is high, generating a huge number of scatter points consumes considerable computing resources, takes a long time, and produces images so dense that an observer cannot easily read information from them. For this problem of large data sets with too many sample points, the MiniBatchKMeans clustering algorithm can be adopted, which computes the distances between data points in mini-batches. The advantage of Mini Batch is that not all data samples are used in each computation; instead, a subset of samples is drawn from the different classes to represent each class, so the amount of computation, and hence the running time, is correspondingly reduced.
After the distances between data points are calculated, points that are too close to each other are merged, and the number of data points finally output is controlled. A merged sample point may cover original points with both positive and negative labels; in that case voting is used to determine the final label of the merged point. As an example, two non-label-column features are selected at a time, and in the form of a scatter diagram each data point is drawn at the position given by its values under the two features, with positive and negative samples distinguished. When the data set is too large and there are too many samples, plotting becomes too slow for performance reasons, so the sample points are first clustered with the MiniBatchKMeans algorithm and the number of clusters is manually controlled between 250 and 300, so that the number of points displayed on a single plot finally does not exceed 300, which improves efficiency and preserves a certain degree of readability. The positive/negative label of each cluster is decided by voting: if a cluster point is formed by clustering 5 original data points, of which 3 are positive and 2 are negative, the cluster point is finally displayed as positive. (A dynamic threshold method may also be used: for example, if negative samples account for 10% of the original data set and the negative samples within a cluster point exceed 10%, the cluster point is judged negative.) The scatter diagram of combined features helps to discover combined relations between features and thus to construct cross features or multi-level rules, as shown in figs. 7 to 9. Fig. 7 shows the combined distribution of the level feature grade and the satisfaction feature satisfaction_level; two regions with significantly higher negative-sample density can be found, namely the region where satisfaction_level is low and grade is high, and the region where satisfaction_level is about 0.4 and grade is about 0.5. The samples in these two regions can be extracted to construct new features or rules and improve the final model performance. Fig. 8 shows the combined distribution of the satisfaction feature satisfaction_level and the internet-usage-duration feature net_time_length, and fig. 9 shows the combined distribution of the level feature grade and net_time_length; the analysis of figs. 8 and 9 is the same as that of fig. 7. According to the above description, when the pairwise-combined feature images exhibit a clear tendency, constructing features accordingly can be considered effective; if the images of a feature combination remain scattered and irregular, that combination of features can be abandoned. The embodiment of the application clusters the scatter diagram of the image with MiniBatchKMeans, reducing the complexity of the image while keeping sample-point generation efficient.
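A possible sketch of the scatter-point compression with MiniBatchKMeans and label voting is given below; the target of roughly 300 points, the 0/1 label convention and the optional dynamic-threshold argument are assumptions, not a definitive implementation.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def compress_scatter(x, y, labels, n_points=300, neg_ratio_threshold=None, random_state=0):
    X = np.column_stack([x, y])
    labels = np.asarray(labels)
    km = MiniBatchKMeans(n_clusters=min(n_points, len(X)), random_state=random_state)
    assignment = km.fit_predict(X)
    centers, center_labels = [], []
    for c in range(km.n_clusters):
        mask = assignment == c
        if not mask.any():
            continue
        centers.append(X[mask].mean(axis=0))   # position of the merged sample point
        neg_ratio = 1.0 - labels[mask].mean()  # labels assumed to be 0 = negative, 1 = positive
        if neg_ratio_threshold is None:
            center_labels.append(int(labels[mask].mean() >= 0.5))        # majority voting
        else:
            center_labels.append(int(neg_ratio <= neg_ratio_threshold))  # dynamic-threshold variant
    return np.array(centers), np.array(center_labels)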
In the embodiment of the application, MiniBatchKMeans is an optimization of the K-Means algorithm that mainly speeds up computation when the data volume is large. Compared with the standard K-Means algorithm, the Mini Batch K-Means algorithm increases the computation speed and performs better on large data sets. The KMeans algorithm is also known as the k-means clustering algorithm. Its idea is roughly as follows: k samples are first randomly selected from the sample set as cluster centers, the distances between all samples and the k cluster centers are calculated, each sample is assigned to the cluster whose center is closest, and a new cluster center is computed for each resulting cluster. The main steps are: step 1, select K points as the initial cluster centers (non-sample points may also be chosen); step 2, compute the distance from each sample point to the K cluster centers (usually the Euclidean or cosine distance), find the closest center and assign the point to the corresponding cluster; step 3, after all points have been assigned, the M points are divided into K clusters, and the centroid (mean) of each cluster is recomputed and taken as the new cluster center; steps 2 and 3 are then iterated repeatedly until a termination condition is reached.
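The plain k-means iteration summarized in steps 1 to 3 above can be condensed into a minimal NumPy sketch for illustration; the Euclidean distance, random initial centers and fixed iteration cap are assumptions.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: pick k initial cluster centers
    assignment = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # step 2: assign every sample to its nearest cluster center (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # step 3: recompute each cluster center as the centroid (mean) of its cluster
        new_centers = np.array([X[assignment == j].mean(axis=0) if (assignment == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # termination condition: centers stop moving
            break
        centers = new_centers
    return centers, assignment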
For a data set in which samples carry occurrence times, it is often necessary to examine how certain features are distributed over time. A simple scheme is to pair the feature columns to be observed with the time column one by one, form time-series data, and plot it for observation. However, a time-series data set has fairly strict requirements on the time format: first, the time series used as the index must be normalized and standardized so that the time span is uniform and the time unit is reasonable; second, at each time point the observed feature must have a unique value. In addition, after the data set is converted according to the time index, missing content under some time indexes is inevitable, and filling these gaps directly with zeros distorts the curve and makes it jagged. Therefore, in an exemplary embodiment, if a time column exists in the target data set, an index is constructed according to the time column, and the sample data within a target time range are mean-aggregated to obtain a time-series data set; time-series curves under the continuous numerical features and the discrete features are generated from the time-series data set, and missing values in the time-series curves are filled by mean smoothing. Specifically, if a time column exists in the data set, an index is built on it in units of year, month or day according to the total time span of the data, all feature data within a specified time range are mean-aggregated to obtain the time-series data set, a time-series curve for each feature is drawn from this data set, missing values are filled by mean smoothing, and the changes are observed. The mean-smoothing fill can also be replaced by median smoothing to reduce the influence of outliers. The embodiment of the application can automatically convert the data set into time-series form and apply smoothing fill to the converted time-series curves, automatically generating clear and effective time-series plots.
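The time-series conversion and smoothing fill described above could be sketched with pandas roughly as follows; the daily frequency, the window size and the rolling-mean fill are assumptions, and median smoothing can be substituted as noted in the text.

import pandas as pd

def to_time_series(df, time_col, feature_cols, freq="D", window=3, use_median=False):
    ts = (df.assign(**{time_col: pd.to_datetime(df[time_col])})
            .set_index(time_col)[feature_cols]
            .resample(freq)
            .mean())                      # mean aggregation of all features per period
    # Fill missing periods with a local rolling mean (or median) instead of zeros,
    # so the resulting curves are not distorted or jagged.
    rolling = ts.rolling(window, min_periods=1, center=True)
    smooth = rolling.median() if use_median else rolling.mean()
    return ts.fillna(smooth)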
According to the above description, in an exemplary embodiment, the method further includes, for the generated distribution image, using the label column information in a supervised scene and distinguishing the sample label values on the distribution image with different colors. As an example, the univariate distribution plots and the combined distribution plots of the data may be drawn in different colors according to the final data label.
In a specific embodiment, the system provides a data exploration method in a supervised scene, as shown in figs. 2 to 9; for specific functions and technical effects, reference may be made to the above embodiments, and details are not repeated here.
In summary, to address the problems in the prior art, the system derives meta-information from the target data set to obtain the feature types of all sample data in the target data set; it then explores the sample data corresponding to each feature type and obtains the corresponding exploration results, the exploration including at least one of index exploration and data distribution exploration; and it calculates stability index values between data sets based on the index exploration results and updates the data mining model when a stability index value is larger than a preset threshold. By exploring the sample data on the basis of the meta-information, the system can judge whether the data mining model needs to be updated, so that the model can adapt to all sample data including newly added samples. The system requires little manual intervention: it automatically computes the statistics of the data features, generates feature distribution images, and accurately infers the data meta-information. In addition, the system can use MiniBatchKMeans to cluster the scatter points in the distribution image, reducing the complexity of the image while keeping sample-point generation efficient. Moreover, the system automatically converts the data set into time-series form, applies smoothing fill to the converted time-series curves, and automatically generates clear and effective time-series plots. Meanwhile, in a supervised scene the system uses the label column information to distinguish sample label values on the generated distribution image with different colors. The system completes the data exploration process automatically, facilitates subsequent data mining and modeling, can use a clustering algorithm to simplify the computed images and make them more intuitive, and also enables dynamic monitoring of model data.
As shown in fig. 12, the present invention further provides a data mining model updating system, which includes:
the acquisition module M100 is used for acquiring a training data set and a data set to be tested;
the bin value dividing module M200 is used for performing index exploration on the training data set and the data set to be tested by using a data exploration method, and acquiring the bin value ratio of continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested;
the model updating module M300 is used for calculating a stability index value between the training data set and the data set to be tested according to the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested, and updating the data mining model when the stability index value is greater than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank the sample data.
Specifically, a training data set and a data set to be tested are obtained, index exploration is performed on the two data sets, the bin value ratios of the continuous numerical characteristic sample data in each data set are obtained from the exploration results, and the stability index value PSI between the two data sets is then calculated from the two sets of bin value ratios. The stability index value PSI is calculated as follows:
PSI = SUM over all bins of (A − E) × ln(A / E), where A is the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested for that bin, and E is the corresponding bin value ratio of the continuous numerical characteristic sample data in the training data set.
When the PSI value is larger than the preset threshold value, a prompt that the data mining model needs to be updated is given, and the data mining model is updated. The preset threshold may be set according to the actual situation; in the embodiment of the present application it is set to 0.25.
This embodiment applies the data exploration method described above, so for its specific functions and technical effects reference may be made to the foregoing embodiments, and details are not repeated here.
An embodiment of the present application further provides a computer device, where the computer device may include: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may be used as a terminal device, and may also be used as a server, where examples of the terminal device may include: the mobile terminal includes a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The present embodiment also provides a non-volatile readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may execute instructions (instructions) included in the data processing method in fig. 1 according to the present embodiment.
Fig. 13 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes functions for executing the modules of the apparatus described in the foregoing embodiments; for specific functions and technical effects, reference may be made to the above embodiments, and details are not repeated here.
Fig. 14 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. FIG. 14 is a specific embodiment of FIG. 13 in an implementation. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication components 1203, power components 1204, multimedia components 1205, audio components 1206, input/output interfaces 1207, and/or sensor components 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 1206 is configured to output and/or input speech signals. For example, the audio component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, audio component 1206 also includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the audio component 1206, the input/output interface 1207 and the sensor component 1208 referred to in the embodiment of fig. 14 may be implemented as the input device in the embodiment of fig. 13.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical idea disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (15)

1. A data exploration method, applied to a computer model training process, comprising the following steps:
performing meta-information derivation on a target data set to acquire the feature types of all sample data in the target data set;
sample data corresponding to each characteristic type is probed, and a corresponding probing result is obtained; the probing includes at least one of: index exploration and data distribution exploration.
2. The data exploration method according to claim 1, wherein if the feature types of the sample data include a continuous numerical feature and a discrete feature, the process of obtaining the index exploration result includes:
determining a statistical index of continuous numerical characteristic sample data;
calculating an index value of the continuous numerical characteristic sample data according to the determined statistical index;
performing binning on the continuous numerical characteristic sample data according to the determined statistical indexes and the calculated index values, and counting the proportion of the sample data in each bin to all the sample data;
distinguishing positive samples and negative samples of the continuous numerical characteristic sample data, and obtaining the proportion of the positive samples and the negative samples in each bin to obtain an index exploration result of the continuous numerical characteristic sample data;
and/or determining a statistical index of discrete type characteristic sample data;
and calculating the index value of the discrete type characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete type characteristic sample data.
3. The data exploration method according to claim 2, further comprising the steps of performing distribution exploration on sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image in a target scene according to the distribution exploration result; the target scene of the continuous numerical features comprises a binary classification scene, and the target scene of the discrete features comprises a regression scene.
4. The data exploration method according to claim 3, wherein if distribution exploration is performed on the continuous numerical characteristic sample data:
carrying out pairwise combination on sample data corresponding to the continuous numerical type features, taking one sample data in each combination as a horizontal axis value of the distribution image, and taking the other sample data as a vertical axis value of the distribution image; forming a sample data point based on the horizontal axis value and the vertical axis value, and filling the sample data point into the distribution image for display;
or, calculating the distance between any two continuous numerical characteristic sample data, clustering all the continuous numerical characteristic sample data according to the distance calculation result, and filling the clustered sample data into the distribution image for display.
5. The data exploration method according to claim 3 or 4, further comprising the step of using label column information in a supervised scene to distinguish sample label values on the distribution image with different colors.
6. The method of claim 1, wherein if there is a time column in the target data set, further comprising:
constructing an index according to the time column, and performing mean value aggregation on sample data in a target time range to obtain a time series data set;
and generating a time series curve under the continuous numerical characteristic and the discrete characteristic based on the time series data set, and performing mean value smoothing on missing values in the time series curve.
7. A data mining model updating method is characterized by comprising the following steps:
acquiring a training data set and a data set to be tested;
performing index exploration on the training data set and the data set to be tested by using the data exploration method as claimed in any one of claims 1 to 6, and acquiring the ratio of the bin values of the continuous numerical characteristic sample data in the training data set and the ratio of the bin values of the continuous numerical characteristic sample data in the data set to be tested;
calculating a stability index value between the training data set and the data set to be tested according to the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested, and updating a data mining model when the stability index value is larger than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank sample data.
8. A data exploration system, applied to a computer model training process, comprising:
the meta-information deducing module is used for deducing meta-information of a target data set to acquire the characteristic types of all sample data in the target data set;
the probing module is used for probing the sample data corresponding to each characteristic type to obtain a corresponding probing result; the probing includes at least one of: index exploration and data distribution exploration.
9. The data exploration system according to claim 8, wherein if the feature types of the sample data include a continuous numerical feature and a discrete feature, the process by which the probing module obtains the exploration result comprises:
determining a statistical index of continuous numerical characteristic sample data;
calculating an index value of the continuous numerical characteristic sample data according to the determined statistical index;
performing binning on the continuous numerical characteristic sample data according to the determined statistical indexes and the calculated index values, and counting the proportion of the sample data in each bin to all the sample data;
distinguishing positive samples and negative samples of the continuous numerical characteristic sample data, and obtaining the proportion of the positive samples and the negative samples in each bin to obtain an index exploration result of the continuous numerical characteristic sample data;
and/or determining a statistical index of discrete type characteristic sample data;
and calculating the index value of the discrete type characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete type characteristic sample data.
10. The data exploration system according to claim 9, further comprising a distribution exploration module, configured to perform distribution exploration on sample data corresponding to each feature type according to the index exploration result, and to form and display a distribution image in a target scene according to the distribution exploration result; the target scene of the continuous numerical features comprises a binary classification scene, and the target scene of the discrete features comprises a regression scene.
11. The data exploration system according to claim 10, wherein if distribution exploration is performed on the continuous numerical characteristic sample data:
carrying out pairwise combination on sample data corresponding to the continuous numerical type features, taking one sample data in each combination as a horizontal axis value of the distribution image, and taking the other sample data as a vertical axis value of the distribution image; forming a sample data point based on the horizontal axis value and the vertical axis value, and filling the sample data point into the distribution image for display;
or, calculating the distance between any two continuous numerical characteristic sample data, clustering all the continuous numerical characteristic sample data according to the distance calculation result, and filling the clustered sample data into the distribution image for display.
12. The data exploration system according to claim 8, further comprising a time series module, configured to, when a time column exists in the target data set, construct an index according to the time column and perform mean aggregation on sample data within a target time range to obtain a time series data set;
and generating a time series curve under the continuous numerical characteristic and the discrete characteristic based on the time series dataset, and performing mean value smoothing processing on missing values in the time series curve.
13. A data mining model updating system is characterized by comprising:
the acquisition module is used for acquiring a training data set and a data set to be tested;
a bin value dividing module, configured to perform index exploration on the training data set and the data set to be tested by using the data exploration method according to any one of claims 1 to 6, to obtain a bin value ratio of continuous numerical characteristic sample data in the training data set and a bin value ratio of continuous numerical characteristic sample data in the data set to be tested;
the model updating module is used for calculating a stability index value between the training data set and the data set to be tested according to the bin value ratio of the continuous numerical characteristic sample data in the training data set and the bin value ratio of the continuous numerical characteristic sample data in the data set to be tested, and updating the data mining model when the stability index value is larger than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank sample data.
14. A computer device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of any of claims 1-7.
15. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of any of claims 1-7.