CN113051317B

CN113051317B - Data mining model updating method, system, computer equipment and readable medium

Info

Publication number: CN113051317B
Application number: CN202110383259.9A
Authority: CN
Inventors: 蒋博劼
Original assignee: Shanghai Yuncong Enterprise Development Co ltd
Current assignee: Shanghai Yuncong Enterprise Development Co ltd
Priority date: 2021-04-09
Filing date: 2021-04-09
Publication date: 2024-05-28
Anticipated expiration: 2041-04-09
Also published as: CN113051317A

Abstract

The invention provides a data mining model updating method, a system, computer equipment and a readable medium, which acquire the characteristic types of all sample data in a target data set by carrying out meta information deduction on the target data set; then, the sample data corresponding to each feature type is probed, and a corresponding probing result is obtained; the probing includes at least one of: index exploration and data distribution exploration; and calculating a stability index value between the data sets based on the index exploration result, and updating the data mining model when the stability index value is greater than a preset threshold value. The invention can judge whether the data mining model needs to be updated or not by carrying out data exploration on the sample data based on the meta information, thereby enabling the data mining model to be suitable for all sample data comprising newly added samples. The invention can automatically calculate the statistical information of the data characteristics without large amount of manual intervention, generate the characteristic distribution image and accurately deduce the data meta-information.

Description

Data mining model updating method, system, computer equipment and readable medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data exploration method and system, a data mining model updating method and system, a computer device, and a machine readable medium.

Background

Today, large-scale tabular data is the primary input form for machine learning and data mining tasks. Examples include the basic personal information, demographic information, behavior logs, and transaction records held in the data warehouses of internet companies, banks, and government databases. A machine learning data mining model usually takes this information as training sample input to complete classification, regression, or ranking tasks, and ultimately serves business purposes such as recommendation, marketing, and risk control. However, a trained data mining model is time-sensitive to some extent: as time passes, the distribution of newly added samples inevitably drifts away from that of the samples originally used for modeling, so a data mining model fitted on the original training samples is no longer applicable to the new samples. Therefore, data exploration needs to be performed on both the training samples and the newly added samples, and the distribution of each important index feature is judged from the exploration results, which then serves as the basis for deciding whether to update the data mining model.

Disclosure of Invention

In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a data exploration method and system, and a data mining model updating method and system, for solving the technical problems existing in the prior art.

To achieve the above and other related objects, the present invention provides a data exploration method applied to a computer model training process, comprising the steps of:

Performing meta-information deduction on a target data set to acquire characteristic types of all sample data in the target data set;

Probing sample data corresponding to each feature type to obtain a corresponding probing result; the probing includes at least one of: index exploration and data distribution exploration.

Optionally, if the feature type of the sample data includes a continuous numerical feature and a discrete feature, the process of obtaining the index probing result includes:

Determining a statistical index of the continuous numerical characteristic sample data;

calculating index values of continuous numerical characteristic sample data according to the determined statistical indexes;

Carrying out box division processing on the continuous numerical characteristic sample data according to the determined statistical index and the calculated index value, and counting the proportion of the sample data in each box division interval to all the sample data;

distinguishing positive samples from negative samples of the continuous numerical characteristic sample data, and obtaining the proportion of the positive samples and the negative samples in each box division interval to obtain index exploration results of the continuous numerical characteristic sample data;

And/or determining a statistical indicator of the discrete feature sample data;

and calculating an index value of the discrete characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete characteristic sample data.

Optionally, the method further comprises performing distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image under a target scene according to the distribution exploration result; the target scenes of the continuous numerical features comprise binary classification scenes, and the target scenes of the discrete features comprise regression scenes.

Optionally, if distribution exploration is performed on the continuous numerical feature sample data, the method comprises:

Carrying out pairwise combination on sample data corresponding to continuous numerical value type characteristics, and taking one sample data in each combination as a horizontal axis value of the distribution image and the other sample data as a vertical axis value of the distribution image; forming sample data points based on the horizontal axis value and the vertical axis value, and filling the sample data points into the distribution image for display;

Or calculating the distance between any two continuous numerical characteristic sample data, clustering all the continuous numerical characteristic sample data according to the distance calculation result, and filling the clustered sample data into the distribution image for display.

Optionally, the method further comprises distinguishing sample label values on the distribution image by using different colors by using label column information in a supervised scene.

Optionally, if there is a time column in the target dataset, further comprising:

Constructing an index according to the time sequence, and carrying out mean value aggregation on sample data in a target time range to obtain a time sequence data set;

And generating a continuous numerical characteristic and a time sequence curve under a discrete characteristic based on the time sequence data set, and carrying out mean value smoothing on missing values in the time sequence curve.
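
The time-column handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the daily granularity, the hypothetical function names `daily_mean_series` and `fill_missing_with_neighbor_mean`, and the choice to fill a missing day with the mean of its nearest observed neighbors are all assumptions, since the text only states that mean aggregation and mean smoothing are used.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_mean_series(records):
    """Build a daily index and mean-aggregate the values observed on each day.

    records: iterable of (date, value) pairs.
    Returns a list of (date, mean_or_None) covering every day in the range.
    """
    buckets = defaultdict(list)
    for day, value in records:
        buckets[day].append(value)
    start, end = min(buckets), max(buckets)
    series = []
    day = start
    while day <= end:
        vals = buckets.get(day)
        series.append((day, sum(vals) / len(vals) if vals else None))
        day += timedelta(days=1)
    return series


def fill_missing_with_neighbor_mean(series):
    """Replace each missing point with the mean of its nearest observed neighbors.

    Assumption: this is one possible reading of "mean value smoothing".
    The first and last days always hold data, so both neighbors exist.
    """
    values = [v for _, v in series]
    filled = []
    for i, (day, v) in enumerate(series):
        if v is None:
            prev = next(x for x in reversed(values[:i]) if x is not None)
            nxt = next(x for x in values[i + 1:] if x is not None)
            v = (prev + nxt) / 2
        filled.append((day, v))
    return filled


# Toy records, invented for illustration only.
records = [
    (date(2021, 4, 1), 1.0),
    (date(2021, 4, 1), 3.0),
    (date(2021, 4, 2), 4.0),
    (date(2021, 4, 4), 8.0),  # 2021-04-03 has no record
]
curve = fill_missing_with_neighbor_mean(daily_mean_series(records))
```

A coarser target time range (week, month) would only change how the bucket key is derived from the timestamp.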

The invention also provides a data mining model updating method, which comprises the following steps:

Acquiring a training data set and a data set to be tested in a target scene; wherein, the target scene includes: a user behavior log scene;

Performing index exploration on the training data set and the data set to be tested by using any one of the data exploration methods described above, to obtain the bin proportions of the continuous numerical feature sample data in the training data set and the bin proportions of the continuous numerical feature sample data in the data set to be tested;

Calculating a stability index value between the training data set and the data set to be tested according to the bin proportions of the continuous numerical feature sample data in the training data set and the bin proportions of the continuous numerical feature sample data in the data set to be tested, and updating a data mining model when the stability index value is greater than a preset threshold value; wherein the data mining model is used for classifying, regressing and/or ranking the sample data.

The invention also provides a data exploration system, which is applied to the training process of the computer model and comprises:

The meta information deriving module is used for performing meta information derivation on a target data set and obtaining the characteristic types of all sample data in the target data set;

The index exploration module is used for exploration of sample data corresponding to each characteristic type and obtaining corresponding exploration results; the probing includes at least one of: index exploration and data distribution exploration.

Optionally, if the feature type of the sample data includes a continuous numerical feature and a discrete feature, the process of obtaining the index probing result by the index probing module includes:

And/or determining a statistical indicator of the discrete feature sample data;

Optionally, the system further comprises a distribution exploration module, which is used for carrying out distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image under the target scene according to the distribution exploration result; the target scenes of the continuous numerical features comprise binary classification scenes, and the target scenes of the discrete features comprise regression scenes.

Optionally, the system further comprises a time sequence module, which is used for, when the target data set has a time column, constructing an index according to the time sequence and carrying out mean value aggregation on sample data in a target time range, so as to obtain a time sequence data set;

The invention also provides a data mining model updating system, which comprises:

The acquisition module is used for acquiring a training data set and a data set to be tested in a target scene; wherein, the target scene includes: a user behavior log scene;

The binning value module is used for performing index exploration on the training data set and the data set to be tested by utilizing the data exploration method, to acquire the bin proportions of the continuous numerical feature sample data in the training data set and the bin proportions of the continuous numerical feature sample data in the data set to be tested;

The model updating module is used for calculating a stability index value between the training data set and the data set to be tested according to the bin proportions of the continuous numerical feature sample data in the training data set and the bin proportions of the continuous numerical feature sample data in the data set to be tested, and updating a data mining model when the stability index value is greater than a preset threshold value; wherein the data mining model is used for classifying, regressing and/or ranking the sample data.

The present invention also provides a computer device comprising:

One or more processors; and

One or more machine readable media storing instructions that, when executed by the one or more processors, cause the apparatus to perform the method of any of the preceding claims.

The invention also provides one or more machine readable media having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform a method as claimed in any of the preceding claims.

As described above, the invention provides a data exploration method and system, and a data mining model updating method and system, which have the following beneficial effects: obtaining the characteristic types of all sample data in the target data set by carrying out meta information deduction on the target data set; then, the sample data corresponding to each feature type is probed, and a corresponding probing result is obtained; the probing includes at least one of: index exploration and data distribution exploration. The target data set may be one data set or may be composed of a plurality of data sets. The invention can perform data exploration on one or more data sets, determine the characteristic distribution condition in the data sets according to the exploration result, and then judge whether the data mining model needs to be updated or not based on the characteristic distribution condition in the data sets, so that the data mining model can adapt to all sample data comprising newly added samples. The invention can automatically calculate the statistical information of the data characteristics without large amount of manual intervention, generate the characteristic distribution image and accurately deduce the data meta-information. In addition, the invention can also utilize MiniBatchKmeans to perform clustering processing on the scatter diagram in the distribution image, thereby reducing the complexity of the distribution image and simultaneously ensuring the efficiency of generating the sample points. In addition, the invention automatically carries out time sequence conversion on the data set, and carries out smoothing filling processing on the converted time sequence image, thereby automatically generating beautiful and effective time sequence image. Meanwhile, the invention can distinguish sample label values on the generated distributed images by using label column information in a supervised scene and using different colors.

Drawings

FIG. 1 is a flow chart of a data exploration method according to an embodiment;

FIG. 2 is a flow chart of a data exploration method according to another embodiment;

FIG. 3 is a schematic diagram of distribution exploration of the satisfaction level, a continuous numerical feature, according to one embodiment;

FIG. 4 is a schematic diagram of distribution exploration of the number of devices used, a continuous numerical feature, provided by another embodiment;

FIG. 5 is a schematic diagram of distribution exploration of salary, a discrete feature, according to one embodiment;

FIG. 6 is a schematic diagram of distribution exploration of industry, a discrete feature, provided by another embodiment;

FIG. 7 is a schematic diagram of combined distribution exploration of the grade feature and the satisfaction level feature provided by an embodiment;

FIG. 8 is a schematic diagram of combined distribution exploration of the satisfaction level feature and the net duration feature provided by another embodiment;

FIG. 9 is a schematic diagram of combined distribution exploration of the grade feature and the net duration feature provided by yet another embodiment;

FIG. 10 is a flowchart illustrating a method for updating a data mining model according to an embodiment;

FIG. 11 is a schematic diagram of a hardware configuration of a data exploration system according to an embodiment;

FIG. 12 is a schematic diagram of a hardware architecture of a data mining model update system according to an embodiment;

FIG. 13 is a schematic hardware structure of a terminal device according to an embodiment;

FIG. 14 is a schematic hardware structure of a terminal device according to another embodiment.

Description of element reference numerals

M10 meta information deriving module

M20 exploration module

M100 acquisition module

M200 binning value module

M300 model updating module

1100. Input device

1101. First processor

1102. Output device

1103. First memory

1104. Communication bus

1200. Processing assembly

1201. Second processor

1202. Second memory

1203. Communication assembly

1204. Power supply assembly

1205. Multimedia assembly

1206. Audio assembly

1207. Input/output interface

1208. Sensor assembly

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may be practiced or carried out in other embodiments that depart from the specific details, and the details of the present description may be modified or varied from the spirit and scope of the present invention. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.

It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.

Big data tabular data is characterized not only by its huge scale but also by its rich content and complex form. Because such data is widely used and the actual business behind each data source differs, the content and meaning of tabular data from different sources also differ: in a demographic information table, one row represents the information of one user; in a behavior log table, one row may represent a single click/purchase action, or a summary of a user's click/purchase actions over one day or one month. Even within the same table from the same data source, data of several types, such as numerical data, discrete categorical data, and timestamp data, are often present.

Machine learning data mining models, in turn, typically serve classification, regression, or ranking tasks at some particular granularity. For example, a credit risk model needs to determine the probability of default for a certain user or for a certain loan application; an advertisement recommendation model needs to generate a recommendation list for a certain user on a certain day. Even though modeling scenarios differ, for structured tabular data the information contained in the raw table is often insufficient to train a model that completes the task with high quality. New features (new columns in the data table) then need to be constructed by feature combination, feature transformation, and the like (collectively referred to as feature engineering), followed by appropriate feature screening, before a high-quality and efficient model can be trained. Effective features often come from the modeling engineer's understanding of the data: apart from some fixed patterns distilled from long-term business experience in particular scenarios, other effective ways of generating features are constructed manually after the modeling engineer has observed how the data presents itself. The presentation of the data mainly includes: the basic type of the data (continuous numerical, discrete, etc.), distribution curves, statistical indexes of the data, and so on. This information carries the business meaning hidden behind the data and is also called meta-information; meta-information refers to prior information related to the business meaning behind the data, which cannot be read directly from the data content itself. For example, the same column of integers in the interval [0, 100] may represent age, or some kind of code, such as a province/region code. When it represents age, it is essentially a numerical column whose magnitude is meaningful: 30 years > 20 years > 10 years. If it represents province/region codes, the values 30, 20, and 10 have no ordering relationship, and the codes could be reassigned in a shuffled order without changing the information in the data.

Furthermore, models have a degree of timeliness, since over time, the newly added samples and the samples previously used for modeling are inevitably subject to some degree of distribution drift, resulting in the model fitted from the original training samples not being suitable for the newly added samples. Therefore, for the training samples and the newly added samples, the distribution condition of each important index feature needs to be controlled so as to judge whether the model needs to be replaced in time.

Therefore, as shown in fig. 1, the present invention provides a data exploration method applied to a computer model training process, comprising the following steps:

S10, performing meta-information deduction on a target data set to obtain feature types of all sample data in the target data set; the target data set may be one data set or a plurality of data sets.

S20, probing sample data corresponding to each feature type to obtain a corresponding probing result; probing the sample data includes at least one of: index exploration and data distribution exploration.

The method can conduct data exploration on one or more data sets, determine the characteristic distribution situation in the data sets according to exploration results, and then judge whether the data mining model needs to be updated or not based on the characteristic distribution situation in the data sets, so that the data mining model can adapt to all sample data including newly added samples. Specifically, the embodiment of the application can calculate the stability index value between the data sets based on the exploration result, and update the data mining model when the stability index value is greater than the preset threshold value, so that the data mining model can adapt to all sample data including newly added samples.
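
The text above does not give a formula for the stability index value. A common way to compare two binned distributions is the population stability index (PSI), which is assumed here purely for illustration. The function names, the `eps` guard against empty bins, and the default threshold of 0.2 are likewise assumptions rather than values taken from the source.

```python
import math


def bin_proportions(values, edges):
    """Share of values in each bin defined by the sorted bin edges.

    Bin i covers [edges[i], edges[i + 1]); values outside the edges are
    clamped into the first or last bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = len(counts) - 1  # default: last bin (also catches v >= last edge)
        for i in range(len(counts)):
            if v < edges[i + 1]:
                idx = i
                break
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, eps=1e-6):
    """Population stability index between two lists of bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


def needs_update(train_values, new_values, edges, threshold=0.2):
    """True when the stability index exceeds the preset threshold."""
    train_p = bin_proportions(train_values, edges)
    new_p = bin_proportions(new_values, edges)
    return psi(train_p, new_p) > threshold


# Toy data, invented for illustration only.
train = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
new = [0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.1, 0.85]
print(needs_update(train, new, edges=[0.0, 0.5, 1.0]))  # True
```

Both data sets must be binned with the same edges (here those derived from the training set), otherwise the per-bin proportions are not comparable.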

According to the above description, in an exemplary embodiment, if the feature type of the sample data is a continuous numerical feature, the process of obtaining the index probe result includes:

Determining a statistical index of the continuous numerical feature sample data; calculating index values of the continuous numerical feature sample data according to the determined statistical indexes; performing binning on the continuous numerical feature sample data according to the determined statistical index and the calculated index value, and counting the proportion of the sample data in each bin interval relative to all the sample data; and distinguishing positive samples from negative samples of the continuous numerical feature sample data, and acquiring the proportion of the positive samples and the negative samples in each bin interval to obtain an index exploration result of the continuous numerical feature sample data. The statistical indicators of the continuous numerical features include, but are not limited to: mean, standard deviation, median, quartiles, skewness, kurtosis, feature value range (interval), total number of values, missing rate, zero-value rate, and the like.
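
The index exploration of a continuous numerical feature described above can be sketched roughly as follows. The source does not specify the binning strategy, so equal-width bins between the observed minimum and maximum are an assumption, as are the function name `explore_continuous`, the use of `None` for missing values, and the 0/1 label encoding. Skewness and kurtosis are omitted to keep the sketch short.

```python
import statistics


def explore_continuous(values, labels, n_bins=4):
    """Summary statistics plus per-bin sample share and positive/negative rates.

    values: numbers, with None marking a missing value.
    labels: 1 for a positive sample, 0 for a negative sample.
    """
    present = [(v, y) for v, y in zip(values, labels) if v is not None]
    xs = [v for v, _ in present]
    stats = {
        "mean": statistics.mean(xs),
        "std": statistics.stdev(xs),
        "median": statistics.median(xs),
        "quartiles": tuple(statistics.quantiles(xs, n=4)),
        "range": (min(xs), max(xs)),
        "count": len(values),
        "missing_rate": 1 - len(xs) / len(values),
        "zero_rate": sum(1 for v in xs if v == 0) / len(values),
    }
    lo, hi = stats["range"]
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant column
    bins = [{"total": 0, "pos": 0} for _ in range(n_bins)]
    for v, y in present:
        i = min(int((v - lo) / width), n_bins - 1)  # maximum goes in last bin
        bins[i]["total"] += 1
        bins[i]["pos"] += y
    report = []
    for i, b in enumerate(bins):
        pos_rate = b["pos"] / b["total"] if b["total"] else None
        report.append({
            "interval": (lo + i * width, lo + (i + 1) * width),
            "share": b["total"] / len(xs),  # share among non-missing samples
            "pos_rate": pos_rate,
            "neg_rate": None if pos_rate is None else 1 - pos_rate,
        })
    return stats, report


# Toy data, invented for illustration only.
values = [0.0, 1.0, 2.0, 3.0, 4.0, None, 8.0, 8.0]
labels = [0, 0, 1, 1, 1, 1, 0, 0]
stats, report = explore_continuous(values, labels)
```

The per-bin `share` values are the quantities that a later stability comparison between two data sets would consume.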

If the feature type of the sample data is discrete feature, the process of obtaining the index exploration result comprises the following steps:

determining a statistical index of the discrete characteristic sample data; and calculating an index value of the discrete characteristic sample data according to the determined statistical index to obtain an index exploration result of the discrete characteristic sample data.

The statistical indicators of the discrete features include, but are not limited to: mode, proportion of each value, total number of values, feature value range (set of values), missing rate, and the like. In the embodiment of the application, the feature columns in the data set can be automatically divided into discrete features and continuous numerical features according to empirical rules.
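
A minimal sketch of the two ideas above follows. The empirical rule itself is not disclosed in the text, so the rule used here (non-numeric columns, or numeric columns with at most `max_unique` distinct values, are treated as discrete) is an assumption, as are the function names and the use of `None` for missing values.

```python
from collections import Counter


def infer_feature_type(values, max_unique=10):
    """Assumed empirical rule for splitting columns into discrete/continuous."""
    present = [v for v in values if v is not None]
    for v in present:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return "discrete"
    return "discrete" if len(set(present)) <= max_unique else "continuous"


def explore_discrete(values):
    """Index exploration for a discrete feature (values share one type)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "mode": counts.most_common(1)[0][0],
        "value_shares": {k: c / len(present) for k, c in counts.items()},
        "count": len(values),
        "n_unique": len(counts),
        "domain": sorted(counts),
        "missing_rate": 1 - len(present) / len(values),
    }


# Toy column, invented for illustration only.
salary = ["low", "medium", "low", "high", None, "low", "medium", "low"]
result = explore_discrete(salary)
```

A rule of this kind cannot tell an age column from a province code column when both hold many distinct integers, which is why the deduced types may still need checking against the business meaning of the data.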

According to the above description, in an exemplary embodiment, the method further includes performing distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image under the target scene according to the distribution exploration result; wherein the target scene of the continuous numerical feature includes a binary classification scene, and the target scene of the discrete feature includes a regression scene. Specifically, as an example, in a supervised scene the distribution image should reflect, as far as possible, how each feature value relates to the final label column. Taking the most common binary classification scene as an example, the data are separated into positive and negative samples. For continuous numerical features, a histogram is drawn from the earlier binning result; in addition to the histogram and kernel density estimate of the data as a whole, the histograms and kernel density estimates of the positive and negative samples are displayed as well, as shown in FIG. 3 and FIG. 4. As can be seen from FIG. 3, the negative sample proportion of the leftmost bin is significantly higher than that of the other bins, so the population with the lowest satisfaction level can be split off and flagged as a separate feature, and the model can be rebuilt. In scenes such as risk control, a bin with such a high proportion of negative samples can be turned directly into a rejection rule, improving overall efficiency. As can be seen from FIG. 4, the rightmost bin consists almost entirely of negative samples; constructing a feature or rule for values above a certain threshold, so that the rightmost samples are handled separately, can greatly improve overall model accuracy.
As another example, for discrete features, the kernel density estimate of the data is no longer meaningful, so only the distribution histograms of the positive and negative samples are shown, as in FIG. 5 and FIG. 6. As can be seen from FIG. 5, the feature takes three values, low, medium, and high; the probability of a negative sample is clearly lower for the high-salary population, which can be used to construct features and rules. The ratio of the three populations is about 7:6:1, and this ratio should be preserved when sampling the data set or constructing a comparison data set. As can be seen from FIG. 6, the positive and negative sample proportions differ little across values, so this feature only needs simple encoding. As another example, in a regression scene, the samples are displayed with a color gradient according to the value of the label column, for both discrete and continuous numerical features.

Because meta-information cannot be obtained from the data content itself, feature engineering carried out without it easily produces low-quality and redundant features, which degrades the final model. For example, feature crossing is a common and effective technique in the feature construction stage, but blind feature combination also generates a large number of useless features. Taking second-order feature crossing as an example, if the data set contains n original features and all of them are combined pairwise, a single combination method generates n(n-1)/2 new features. This not only brings a very large amount of computation but also reduces the share of original features to 2/(n+1), making it harder for the machine learning model to learn the useful information. In practice, the joint distribution of feature combinations usually has to be observed in order to judge whether the combined features are effective. Therefore, according to the above description, in an exemplary embodiment, if the feature type of the sample data includes a continuous numerical feature, the continuous numerical feature sample data are combined in pairs, with one sample data in each combination taken as the horizontal axis value of the distribution image and the other as the vertical axis value; sample data points are formed from the horizontal and vertical axis values and filled directly into the distribution image; or the distance between any two continuous numerical feature sample data is calculated, all the continuous numerical feature sample data are clustered according to the distance calculation result, and the clustered sample data are filled into the distribution image.
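
As a small check of the counting argument and of the pairwise combination step, the following sketch enumerates every unordered pair of feature columns and builds the (horizontal, vertical) points for each. The function names are hypothetical and the column values are toy data invented for illustration; only the feature names are borrowed from the figures.

```python
from itertools import combinations


def pairwise_scatter_data(columns):
    """Return {(x_name, y_name): [(x, y), ...]} for every pair of columns."""
    return {
        (x_name, y_name): list(zip(columns[x_name], columns[y_name]))
        for x_name, y_name in combinations(columns, 2)
    }


def original_feature_share(n):
    """Share of original features once all second-order crosses are added."""
    crossed = n * (n - 1) // 2
    return n / (n + crossed)  # simplifies to 2 / (n + 1)


# Toy values, invented for illustration only.
columns = {
    "grade": [0.5, 0.9, 0.4],
    "satisfaction_level": [0.4, 0.1, 0.8],
    "net_time_length": [3, 5, 2],
}
pairs = pairwise_scatter_data(columns)
print(len(pairs))                 # 3 pairs for n = 3 features
print(original_feature_share(9))  # 0.2, i.e. 2 / (9 + 1)
```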
Specifically, the continuous features are combined in pairs, one sample data in each combination is used as the horizontal axis value of the distribution image and the other as the vertical axis value, then sample data points are formed based on the horizontal axis value and the vertical axis value and filled into the distribution image. Feature combination charts are mostly drawn as two-dimensional scatter plots, in which the horizontal and vertical axes represent the two selected features and each point represents the values of one sample on those two feature axes. When the data set is large and contains many samples, a large number of scattered points are generated, which consumes substantial computing resources and time and makes the image so dense that an observer cannot easily extract information from it. To address this, the MiniBatchKmeans clustering algorithm can be adopted, which computes the distances between data points in batches. The benefit of Mini Batch is that not all data samples need to be used in the calculation; instead, a portion of samples is drawn from the different classes of samples to represent each class, and the run time falls accordingly because fewer samples are computed. After the distances between data points are calculated, data points that are too close to one another are merged, and the number of data points finally output is controlled. The sample points merged into one point may include both positive and negative samples, in which case voting is used to determine the label of the merged point.
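
The clustering and voting step can be sketched with scikit-learn's `MiniBatchKMeans`. This is an illustration under assumptions, not the implementation from the source: the function name `condense_scatter`, the `batch_size` and `n_init` values, the default cluster count, and the rule that ties are shown as negative are all choices made for this example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def condense_scatter(points, labels, n_clusters=300, seed=0):
    """Merge nearby scatter points into cluster centers with voted labels.

    points: array-like of shape (n_samples, 2), one row per sample.
    labels: array-like of 0/1 sample labels.
    Returns (centers, votes); empty clusters are dropped.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    km = MiniBatchKMeans(
        n_clusters=n_clusters, batch_size=1024, n_init=3, random_state=seed
    )
    assign = km.fit_predict(points)
    centers, votes = [], []
    for c in range(n_clusters):
        members = labels[assign == c]
        if members.size == 0:
            continue
        centers.append(km.cluster_centers_[c])
        # Majority vote: positive only if more than half the members are
        # positive (ties fall to negative, an assumption of this sketch).
        votes.append(int(members.sum() * 2 > members.size))
    return np.array(centers), np.array(votes)


# Toy data: two well-separated blobs, invented for illustration only.
rng = np.random.default_rng(0)
pos = rng.normal(0.0, 0.1, size=(200, 2))
neg = rng.normal(5.0, 0.1, size=(200, 2))
points = np.vstack([pos, neg])
labels = np.array([1] * 200 + [0] * 200)
centers, votes = condense_scatter(points, labels, n_clusters=10)
```

Only the returned centers need to be drawn, so the number of plotted points is bounded by `n_clusters` regardless of how many samples the data set holds.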
As an example, two non-label feature columns are selected at a time, and each data point in the table is drawn as a scatter point at the position given by its values on the two features, with positive and negative samples distinguished. When the data set is too large and contains too many samples, chart generation becomes too slow; the MiniBatchKmeans algorithm is then used to cluster the sample points in advance, with the number of clusters manually set between 250 and 300, so that no more than 300 points are displayed on a single chart, which improves efficiency while keeping the chart reasonably readable. Whether each cluster is positive or negative is decided by voting: if a cluster point is formed from 5 original data points, of which 3 are positive samples and 2 are negative samples, the cluster point is displayed as positive. (Alternatively, a dynamic threshold may be used: for example, if the negative sample proportion of the original data set is 10%, a cluster point containing more than 10% negative samples is judged to be negative.) Scatter plots of combined features help reveal the combination relationships between features, from which cross features or multi-level rules can be constructed, as shown in FIGS. 7 to 9. FIG. 7 shows the combined distribution of the grade feature grade and the satisfaction level feature satisfaction_level, where two regions with significantly higher negative sample density can be found: one where satisfaction_level is low and grade is high, and one where satisfaction_level is about 0.4 and grade is about 0.5. Samples from these two regions can be extracted to construct new features or rules, which improves the final model performance. FIG. 8 shows the combined distribution of the satisfaction level feature satisfaction_level and the net duration feature net_time_length, FIG. 9 shows the combined distribution of the grade feature grade and the net duration feature net_time_length, and FIGS. 8 and 9 are analyzed in the same way as FIG. 7. According to the above description, when the image of a pairwise feature combination shows a certain tendency, constructing features from it is considered effective; if the image is still scattered and irregular, the feature combination may be discarded. In the embodiment of the application, MiniBatchKmeans is used to cluster the scatter plot, which reduces the complexity of the image while keeping sample point generation efficient. MiniBatchKmeans is an optimized variant of the K-Means algorithm that mainly improves computation speed on large data volumes. Compared with the standard K-Means algorithm, Mini Batch K-Means computes faster, and its advantage grows as the data volume increases. The kmeans algorithm is also known as the k-means algorithm.
The idea of the algorithm is roughly as follows: first, k samples are randomly selected from the sample set as cluster centers; the distances between every sample and the k cluster centers are calculated; each sample is assigned to the cluster whose center is closest to it; and a new cluster center is then calculated for each resulting cluster. The logic is mainly as follows: step 1, select K points as the initial cluster centers (points that are not samples may also be selected); step 2, calculate the distance between each sample point and the K cluster centers (usually the Euclidean distance or the cosine distance), find the cluster center closest to the point, and assign the point to the corresponding cluster; step 3, once all points have been assigned, the M points are divided into K clusters, and the center of gravity (mean position) of each cluster is recalculated and set as the new cluster center; steps 2 and 3 are iterated repeatedly until a stopping condition is reached.
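
The three steps above can be sketched as follows. This is a minimal illustration of plain K-Means on two-dimensional points using only the Python standard library; it is not the MiniBatchKmeans variant used in the embodiment, and the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice of "centers no longer move" as the stopping condition are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of (x, y) tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k samples as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 3: recompute each center as the mean position of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if not members:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
                continue
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        # Assumed stopping condition: the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment
```

For a scatter diagram, the returned centers would be the points actually drawn, so choosing `k` between 250 and 300 caps the number of plotted points as described above.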

For a data set that records when each sample occurred, it is often necessary to check how the distribution of certain features changes over time. A simple scheme is to pair the feature columns to be observed with time and sort them, forming data in time-series form, and then to observe the data in charts. However, a time-series data set places strict requirements on the time format. First, the time column used as the index must be normalized and standardized so that the time span is uniform and the time unit is reasonable. Second, the observed feature must have a unique value at each point in time. In addition, after the data set is converted according to the time index, some time indexes inevitably have no content; filling these directly with 0 distorts the curve and produces a jagged shape. Therefore, in an exemplary embodiment, if a time sequence exists in the target data set, an index is built according to the time sequence, and the sample data within a target time range are mean-aggregated to obtain a time-series data set; time-series curves for the continuous numerical features and the discrete features are generated from the time-series data set, and mean smoothing is applied to the missing values in the curves. Specifically, if a time sequence exists in the data set, an index is established for it according to the total time span of the data; the feature data for every specified time range are constructed in units of year, month, or day; the time-series data set is obtained by mean aggregation; a time-series curve is drawn for each feature from the time-series data set; mean smoothing is applied to the missing values; and the changes are observed. 
The embodiment of the application fills the missing values by mean smoothing; median smoothing may be used instead to reduce the influence of outliers. The embodiment of the application can automatically convert the data set into time-series form and apply smoothing and filling to the converted time-series image, thereby automatically generating an attractive and effective time-series image.
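
The aggregation and filling described above can be sketched as follows, using a daily index. The helper names `daily_mean_series` and `fill_missing_with_neighbor_mean`, the `window` parameter, and the interpretation of "mean smoothing" as the mean of the observed neighbors within a small window are assumptions for illustration; the text does not specify the exact smoothing rule.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def daily_mean_series(records):
    """Aggregate (date, value) records into one mean value per day.

    Days inside the overall time span that have no record stay in the
    index with the value None.
    """
    buckets = defaultdict(list)
    for day, value in records:
        buckets[day].append(value)
    start, end = min(buckets), max(buckets)
    n_days = (end - start).days + 1
    index = [start + timedelta(days=i) for i in range(n_days)]
    series = [mean(buckets[d]) if d in buckets else None for d in index]
    return index, series


def fill_missing_with_neighbor_mean(values, window=1):
    """Replace each None with the mean of observed values within `window` steps."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            lo, hi = max(0, i - window), min(len(values), i + window + 1)
            neighbors = [x for x in values[lo:hi] if x is not None]
            # Leave the gap in place if no neighbor is observed.
            filled[i] = mean(neighbors) if neighbors else None
    return filled
```

Swapping `mean` for `statistics.median` inside the filling function gives the median-smoothing alternative mentioned above.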

According to the above description, in an exemplary embodiment, the method further includes using label column information in a supervised scene to distinguish sample label values on the generated distribution image with different colors. As an example, univariate and combined distribution maps of the data may be drawn in different colors according to the final data label.

In a specific embodiment, as shown in fig. 2 to 9, a data exploration method in a supervised scenario is provided, including:

Meta-information derivation stage: meta-information derivation is performed on the target data set to obtain the feature types of all sample data in the target data set. Specifically, the ratio of the number of distinct values of a feature to the total number of samples is calculated; if the ratio or the number of distinct values is smaller than a preset empirical threshold, the feature is provisionally judged to be discrete data; when the ratio is 1 and all samples of the feature have the same character length, the feature is judged to be discrete ID-class data. If the value lies near the threshold, the difference in the number of samples between adjacent values is calculated, and if the fluctuation is too large, the feature is judged to be discrete data. Meta-information derivation explores information such as the data types in the data table and generates a preset meta-information configuration file. By default, the derivation stage heuristically guesses the data type of each column from the column names and the data distribution of the data table, distinguishing numerical columns from discrete columns (category attribute columns or ID columns). For discrete columns, whether a column is a category attribute column (fewer values) or an ID column (more values) is guessed from the number and distribution of its values; a default meta-information configuration file is then generated, which the user is allowed to modify.
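
A minimal sketch of the type-guessing rule described above is given below. The function name `infer_feature_type` and the threshold values `ratio_threshold` and `count_threshold` are hypothetical; the text only states that empirical thresholds are used. The near-threshold fluctuation check and the column-name heuristics are omitted from this sketch.

```python
def infer_feature_type(values, ratio_threshold=0.05, count_threshold=20):
    """Guess 'id', 'discrete', or 'continuous' for one column.

    ratio_threshold and count_threshold are hypothetical empirical
    thresholds chosen only for illustration.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return "discrete"
    n_unique = len(set(observed))
    ratio = n_unique / len(observed)
    # Every value distinct and all of the same character length: ID-like column.
    if ratio == 1 and len({len(str(v)) for v in observed}) == 1:
        return "id"
    # Few distinct values, in relative or absolute terms: discrete column.
    if ratio < ratio_threshold or n_unique < count_threshold:
        return "discrete"
    # Non-numeric columns cannot be continuous numerical features.
    if not all(isinstance(v, (int, float)) for v in observed):
        return "discrete"
    return "continuous"
```

The result for each column could then be written to the default meta-information configuration file so that a user can review and correct it.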

Statistical index stage: indexes are explored using different statistical methods according to the feature types obtained in the meta-information derivation stage. Specifically, the feature columns in the data set are automatically divided into discrete features and continuous numerical features according to empirical rules, and different statistical indexes are calculated for the two types of data. For numerical features, indexes such as the mean, standard deviation, median, interquartile range, skewness, kurtosis, feature value range (interval), number of distinct values, missing rate, and proportion of zero values are calculated; the feature is then binned, and the proportion of samples in each bin relative to the total, as well as the proportion of positive and negative samples within each bin, is counted. For discrete features, indexes such as the mode, the proportion of each value, the number of distinct values, the feature value range (array), and the missing rate are calculated.
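
The numerical-feature indexes and the per-bin proportions can be sketched as follows. The function names, the use of `None` to mark a missing value, and the caller-supplied bin `edges` are assumptions for illustration; skewness and kurtosis are left out to keep the example short, and the binning strategy itself is not specified in the text.

```python
from statistics import mean, median, pstdev, quantiles


def numeric_summary(values):
    """Summary indexes for one numerical column; None marks a missing value."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    return {
        "mean": mean(observed),
        "std": pstdev(observed),
        "median": median(observed),
        "iqr": q3 - q1,
        "range": (min(observed), max(observed)),
        "n_unique": len(set(observed)),
        "missing_rate": (len(values) - len(observed)) / len(values),
        "zero_rate": sum(1 for v in observed if v == 0) / len(values),
    }


def bin_proportions(values, labels, edges):
    """Share of samples per bin and share of positive labels inside each bin.

    edges are ascending cut points; bin i covers [edges[i], edges[i + 1]),
    and the last bin also includes its right edge.
    """
    n_bins = len(edges) - 1
    counts, positives = [0] * n_bins, [0] * n_bins
    for v, y in zip(values, labels):
        if v is None or v < edges[0] or v > edges[-1]:
            continue
        i = next((j for j in range(n_bins) if v < edges[j + 1]), n_bins - 1)
        counts[i] += 1
        positives[i] += 1 if y == 1 else 0
    total = sum(counts)
    share = [c / total for c in counts]
    pos_rate = [p / c if c else 0.0 for p, c in zip(positives, counts)]
    return share, pos_rate
```

The per-bin shares returned by `bin_proportions` are the quantities that the later stability comparison between a training set and a test set relies on.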

Data distribution exploration stage: univariate distribution maps and combined distribution maps of the data are drawn in different colors according to the final data label. For numerical features, the data distribution histogram and the kernel density estimation plot are explored; for category features, only the data distribution histogram is explored. The numerical features are combined in pairs, and a scatter diagram is drawn with the two features as coordinates, using sample points clustered by MiniBatchKmeans. In the supervised scene, the plots should reflect as far as possible the relationship between each feature value and the final label column. Taking the most common binary classification scene as an example, the data are separated into positive and negative samples; for continuous numerical features, a histogram is drawn from the earlier binning result, and in addition to the histogram and kernel density estimation plot of the whole data, those of the positive and negative samples are displayed as well, as shown in fig. 3 and 4. As can be seen from fig. 3, the proportion of negative samples in the leftmost bin is significantly higher than in the other bins, so the group with the lowest satisfaction level can be separated out and marked with a flag feature, and the model can be rebuilt. In scenes such as risk control, a rejection rule can be set directly for a bin with such a high proportion of negative samples, which improves overall efficiency. As can be seen from fig. 4, the rightmost bin consists almost entirely of negative samples; if a feature or rule is set for values above a certain threshold so that the rightmost samples are handled separately, the overall model accuracy can be greatly improved. 
As another example, for discrete features, kernel density estimation of the data is no longer meaningful, so only the distribution histograms of the positive and negative samples are shown, as shown in fig. 5 and 6. As can be seen from fig. 5, the feature takes three values, low, medium, and high, and the proportion of negative samples in the high group is noticeably lower, which can be used to construct features and rules. The ratio of the three groups is about 7:6:1, and this ratio should be followed when sampling the data set or constructing a comparison data set. As can be seen from fig. 6, the ratio of positive to negative samples differs little across the values, so this feature only needs simple encoding. As another example, in a regression scene, the sample colors are displayed as a gradient according to the value of the discrete or continuous numerical label column. The combined distribution maps are shown in fig. 7 to 9. FIG. 7 shows the combined distribution of the grade feature grade and the satisfaction feature satisfaction_level, in which two areas with significantly higher negative-sample density can be found: the first has a low satisfaction_level and a high grade; the second has a satisfaction_level of about 0.4 and a grade of about 0.5. Samples in these two areas can be extracted to construct new features or rules, which can improve the performance of the final model. Fig. 8 shows the combined distribution of the satisfaction feature satisfaction_level and the net duration feature net_time_length, fig. 9 shows the combined distribution of the grade feature grade and the net duration feature net_time_length, and fig. 8 and 9 are analyzed in the same way as fig. 7. 
According to the above description, when the image of a pairwise feature combination shows a clear tendency, constructing features from it is considered effective; if the image remains scattered and irregular, the feature combination may be discarded. By clustering the scatter diagram with MiniBatchKmeans, the embodiment of the application reduces the complexity of the image while keeping sample-point generation efficient.
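
When clustered sample points are drawn in the scatter diagram, each cluster point needs a single label. The two labelling rules described earlier (majority voting and the dynamic threshold) can be sketched as follows. The function names and the 1/0 encoding of positive and negative samples are assumptions, as is the choice to resolve a tie in the vote as negative, which the text does not address.

```python
def cluster_label_by_vote(member_labels):
    """Majority vote: 1 (positive) if positives outnumber negatives, else 0."""
    positives = sum(1 for y in member_labels if y == 1)
    negatives = len(member_labels) - positives
    # Assumption: a tie is resolved as negative.
    return 1 if positives > negatives else 0


def cluster_label_by_dynamic_threshold(member_labels, base_negative_rate):
    """Negative (0) if the cluster's negative share exceeds the overall rate."""
    negatives = sum(1 for y in member_labels if y == 0)
    return 0 if negatives / len(member_labels) > base_negative_rate else 1
```

With a cluster of 3 positive and 2 negative samples, the vote yields a positive point, while the dynamic threshold with a 10% base rate marks the same cluster as negative, which makes rare negative samples more visible in the chart.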

Time-series exploration stage: the time sequence to be converted is selected, a time index with a suitable span is generated automatically, and the original features are mean-aggregated according to the new time index. Missing values in the aggregated result are filled using a smoothing method, and a curve is finally generated. Specifically, if a time sequence exists in the data set, an index is established for it according to the total time span of the data; the feature data for every specified time range are constructed in units of year, month, or day; the time-series data set is obtained by mean aggregation; a time-series curve is drawn for each feature from the time-series data set; mean smoothing is applied to the missing values; and the changes are observed. The embodiment of the application fills the missing values by mean smoothing; median smoothing may be used instead to reduce the influence of outliers. The embodiment of the application can automatically convert the data set into time-series form and apply smoothing and filling to the converted time-series image, thereby automatically generating an attractive and effective time-series image.

Data stability exploration stage: according to the binning result of the training set, the proportion of values falling in each bin of each feature is recorded; when a test set with the same data format and content is available, the corresponding training set is selected and its feature binning result is read.

In summary, to address the problems in the prior art, the method performs meta-information derivation on the target data set to obtain the feature types of all sample data in the target data set; the sample data corresponding to each feature type are then explored to obtain the corresponding exploration results, where the exploration includes at least one of index exploration and data distribution exploration; a stability index value between the data sets is calculated based on the index exploration result, and the data mining model is updated when the stability index value is greater than a preset threshold. By exploring the sample data based on the meta-information, the method can judge whether the data mining model needs to be updated, so that the data mining model can adapt to all sample data, including newly added samples. The method can automatically calculate the statistical information of the data features without a large amount of manual intervention, generate feature distribution images, and accurately derive the data meta-information. In addition, the method can use MiniBatchKmeans to cluster the scatter diagram in the distribution image, reducing the complexity of the distribution image while keeping sample-point generation efficient. The method also automatically converts the data set into time-series form, applies smoothing and filling to the converted time-series image, and automatically generates an attractive and effective time-series image. Meanwhile, the method can use label column information in a supervised scene to distinguish sample label values on the generated distribution image with different colors. 
The method can automatically complete the data exploration flow, is convenient for subsequent data mining modeling, can optimize the calculation result by using a clustering algorithm, ensures that the image is more concise and visual, and can dynamically monitor the model data.

As shown in fig. 10, the present invention further provides a method for updating a data mining model, including the following steps:

S100, acquiring a training data set and a data set to be tested in a target scene; wherein, the target scene includes: a user behavior log scene;

S200, performing index exploration on the training data set and the data set to be tested by using the data exploration method, to obtain the bin value proportion of continuous numerical feature sample data in the training data set and the bin value proportion of continuous numerical feature sample data in the data set to be tested;

S300, calculating a stability index value between the training data set and the data set to be tested according to the bin value proportion of the continuous numerical feature sample data in the training data set and the bin value proportion of the continuous numerical feature sample data in the data set to be tested, and updating the data mining model when the stability index value is greater than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank the sample data.

Specifically, a training data set and a data set to be tested in a target scene are obtained, where the target scene includes a user behavior log scene. Index exploration is performed on the two data sets, the bin value proportions of the continuous numerical sample data in each data set are obtained from the exploration results, and the stability index value PSI between the two data sets is calculated from these bin value proportions. The stability index value PSI is calculated as follows:

$$\mathrm{PSI} = \sum_{i} \left(p_i^{\mathrm{test}} - p_i^{\mathrm{train}}\right) \times \ln\left(\frac{p_i^{\mathrm{test}}}{p_i^{\mathrm{train}}}\right)$$

where the sum is taken over the bins $i$, $p_i^{\mathrm{test}}$ is the bin value proportion of the continuous numerical feature sample data in the data set to be tested, and $p_i^{\mathrm{train}}$ is the bin value proportion of the continuous numerical feature sample data in the training data set.

When the PSI value is greater than the preset threshold, a prompt is given that the data mining model needs to be updated, and the data mining model is updated. The preset threshold may be set according to the actual situation; in the embodiment of the present application, it is set to 0.25.
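
The PSI calculation and the threshold check can be sketched as follows. The function names `psi` and `needs_update` are hypothetical, and the small constant `eps` used to avoid taking the logarithm of zero for an empty bin is an assumption that the text does not mention.

```python
import math


def psi(train_props, test_props, eps=1e-6):
    """Stability index between two lists of per-bin proportions."""
    total = 0.0
    for p_train, p_test in zip(train_props, test_props):
        # Assumption: clamp to eps so empty bins do not cause log(0)
        # or a division by zero.
        p_train, p_test = max(p_train, eps), max(p_test, eps)
        total += (p_test - p_train) * math.log(p_test / p_train)
    return total


def needs_update(train_props, test_props, threshold=0.25):
    """True when the PSI exceeds the preset threshold (0.25 in the embodiment)."""
    return psi(train_props, test_props) > threshold
```

Both lists must describe the same bins in the same order, which is why the test set is binned with the binning result read from the training set.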

The embodiment of the present application may execute the data exploration method described above; for its specific functions and technical effects, reference may be made to the above embodiment, and they are not repeated here.

As shown in fig. 11, the present invention further provides a data exploration system, which is applied to a training process of a computer model, and includes:

the meta-information deriving module M10 is configured to derive meta-information from the target data set, and obtain feature types of all sample data in the target data set; the target data set may be one data set or a plurality of data sets.

The exploration module M20 is used for exploration of sample data corresponding to each feature type and obtaining a corresponding exploration result; probing the sample data includes at least one of: index exploration and data distribution exploration.

The system can perform data exploration on one or more data sets, determine the characteristic distribution situation in the data sets according to exploration results, and then judge whether the data mining model needs to be updated or not based on the characteristic distribution situation in the data sets, so that the data mining model can adapt to all sample data including newly added samples. Specifically, the embodiment of the application can calculate the stability index value between the data sets based on the exploration result, and update the data mining model when the stability index value is greater than the preset threshold value, so that the data mining model can adapt to all sample data including newly added samples.

In a specific embodiment, the system proposes a data exploration method in a supervised scenario, as shown in fig. 2 to 9, and specific functions and technical effects may be referred to the above embodiments, which are not described herein again.

In summary, to address the problems in the prior art, the system performs meta-information derivation on the target data set to obtain the feature types of all sample data in the target data set; the sample data corresponding to each feature type are then explored to obtain the corresponding exploration results, where the exploration includes at least one of index exploration and data distribution exploration; a stability index value between the data sets is calculated based on the index exploration result, and the data mining model is updated when the stability index value is greater than a preset threshold. By exploring the sample data based on the meta-information, the system can judge whether the data mining model needs to be updated, so that the data mining model can adapt to all sample data, including newly added samples. The system can automatically calculate the statistical information of the data features without a large amount of manual intervention, generate feature distribution images, and accurately derive the data meta-information. Moreover, the system can use MiniBatchKmeans to cluster the scatter diagram in the distribution image, reducing the complexity of the distribution image while keeping sample-point generation efficient. The system also automatically converts the data set into time-series form, applies smoothing and filling to the converted time-series image, and automatically generates an attractive and effective time-series image. Meanwhile, the system can use label column information in a supervised scene to distinguish sample label values on the generated distribution image with different colors. 
The system can automatically complete the data exploration flow, facilitates the subsequent data mining modeling, can optimize the calculation result by using a clustering algorithm, ensures that the image is more concise and visual, and can simultaneously realize the dynamic monitoring effect on model data.

As shown in fig. 12, the present invention further provides a data mining model updating system, including:

the acquisition module M100 is used for acquiring a training data set and a data set to be tested in a target scene; wherein, the target scene includes: a user behavior log scene;

the bin value module M200 is used for performing index exploration on the training data set and the data set to be tested by using the data exploration method, to obtain the bin value proportion of continuous numerical feature sample data in the training data set and the bin value proportion of continuous numerical feature sample data in the data set to be tested;

The model updating module M300 is used for calculating a stability index value between the training data set and the data set to be tested according to the bin value proportion of the continuous numerical characteristic sample data in the training data set and the bin value proportion of the continuous numerical characteristic sample data in the data set to be tested, and updating the data mining model when the stability index value is greater than a preset threshold value; wherein the data mining model is used to classify, regress and/or rank the sample data.

The embodiment of the application also provides a computer device, which can comprise: one or more processors; and one or more machine readable media having instructions stored thereon which, when executed by the one or more processors, cause the device to perform the method described in fig. 1. In practical applications, the device may be used as a terminal device or as a server. Examples of the terminal device include: smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car computers, desktop computers, set-top boxes, smart televisions, wearable devices, etc.; embodiments of the present application are not limited to specific devices.

The embodiment of the application also provides a non-volatile readable storage medium storing one or more modules (programs); when the one or more modules are applied to a device, the device can execute the instructions of the steps included in the data processing method of fig. 1 according to the embodiment of the application.

Fig. 13 is a schematic hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103 and at least one communication bus 1104. The communication bus 1104 is used to enable communication connections between the elements. The first memory 1103 may comprise a high-speed RAM memory or may further comprise a non-volatile memory NVM, such as at least one magnetic disk memory, and various programs may be stored in the first memory 1103 for performing various processing functions and implementing the method steps of the present embodiment.

Alternatively, the first processor 1101 may be implemented as, for example, a central processing unit (Central Processing Unit, abbreviated as CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.

Alternatively, the input device 1100 may include a variety of input devices, for example, may include at least one of a user-oriented user interface, a device-oriented device interface, a programmable interface of software, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware insertion interface (such as a USB interface, a serial port, etc.) for data transmission between devices; alternatively, the user-oriented user interface may be, for example, a user-oriented control key, a voice input device for receiving voice input, and a touch-sensitive device (e.g., a touch screen, a touch pad, etc. having touch-sensitive functionality) for receiving user touch input by a user; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, for example, an input pin interface or an input interface of a chip, etc.; the output device 1102 may include a display, sound, or the like.

In this embodiment, the processor of the terminal device may include functions for executing each module of the systems described above; for specific functions and technical effects, reference may be made to the above embodiments, and they are not repeated here.

Fig. 14 is a schematic hardware structure of a terminal device according to another embodiment of the present application. Fig. 14 is a diagram of one particular embodiment of the implementation of fig. 13. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.

The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.

The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, video, etc. The second memory 1202 may include a random access memory (random access memory, abbreviated as RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: a communication component 1203, a power component 1204, a multimedia component 1205, an audio component 1206, an input/output interface 1207, and/or a sensor component 1208. The components and the like specifically included in the terminal device are set according to actual requirements, which are not limited in this embodiment.

The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method shown in fig. 1 described above. Further, the processing component 1200 may include one or more modules that facilitate interactions between the processing component 1200 and other components. For example, the processing component 1200 may include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.

The power supply component 1204 provides power to the various components of the terminal device. Power supply components 1204 can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for terminal devices.

The multimedia component 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.

The audio component 1206 is configured to output and/or input speech signals. For example, the audio component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received voice signals may be further stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the audio component 1206 further includes a speaker for outputting voice signals.

The input/output interface 1207 provides an interface between the processing assembly 1200 and peripheral interface modules, which may be click wheels, buttons, and the like. These buttons may include, but are not limited to: volume button, start button and lock button.

The sensor component 1208 includes one or more sensors that provide status assessments of various aspects of the terminal device. For example, the sensor component 1208 may detect an on/off state of the terminal device, the relative positioning of its components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.

The communication component 1203 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device may access a wireless network based on a communication standard such as WiFi, 2G, or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device may log into a GPRS network and establish communication with a server via the internet.

As noted above, the communication component 1203, the audio component 1206, the input/output interface 1207, and the sensor component 1208 in the embodiment of fig. 14 may be implemented as input devices in the embodiment of fig. 13.

The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the invention. Those skilled in the art may modify or vary the above-described embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims

1. A method for updating a data mining model, comprising the steps of:

performing index exploration on a training data set and a data set to be tested by using a data exploration method, to obtain the bin value proportion of continuous numerical feature sample data in the training data set and the bin value proportion of continuous numerical feature sample data in the data set to be tested;

calculating a stability index value between the training data set and the data set to be tested according to the bin value proportion of the continuous numerical feature sample data in the training data set and the bin value proportion of the continuous numerical feature sample data in the data set to be tested, and updating a data mining model when the stability index value is greater than a preset threshold value; the data mining model is used for classifying, regressing and/or sorting sample data; wherein the stability index value PSI = sum((bin value proportion of the continuous numerical feature sample data in the data set to be tested - bin value proportion of the continuous numerical feature sample data in the training data set));

wherein the data exploration method comprises the following steps: performing meta-information deduction on a target data set to acquire the feature types of all sample data in the target data set; exploring the sample data corresponding to each feature type to obtain a corresponding exploration result; the exploration includes at least one of: index exploration and data distribution exploration.
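The stability check in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim text as reproduced here shows only the difference of the two bin value proportions, and that bare sum is always zero when both sets of proportions sum to one. The sketch therefore assumes the conventional population stability index, which weights each difference by the logarithm of the ratio of the two proportions. The equal-frequency binning, the function names, the smoothing constant `eps`, and the threshold of 0.2 are likewise assumptions chosen for illustration.

```python
import math
from bisect import bisect_right


def quantile_edges(values, n_bins):
    """Interior bin edges at equal-frequency cut points (assumed binning rule)."""
    ordered = sorted(values)
    edges = [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]
    # Remove duplicate edges so heavily tied data does not create empty bins.
    return sorted(set(edges))


def bin_proportions(values, edges):
    """Share of samples falling into each bin defined by the interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(train, test, n_bins=10, eps=1e-6):
    """Conventional PSI: sum((actual - expected) * ln(actual / expected))."""
    edges = quantile_edges(train, n_bins)      # bins are fixed on the training set
    expected = bin_proportions(train, edges)   # training data set proportions
    actual = bin_proportions(test, edges)      # data set to be tested proportions
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)        # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


def needs_update(train, test, threshold=0.2):
    """True when the PSI exceeds the preset threshold (0.2 is an assumed value)."""
    return psi(train, test) > threshold
```

Deriving the bin edges from the training data set and reusing them for the data set to be tested keeps the two proportion vectors comparable bin by bin.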

2. The method according to claim 1, wherein if the feature type of the sample data includes a continuous numerical feature and a discrete feature, the process of acquiring the index exploration result includes:

and/or determining a statistical indicator of the discrete feature sample data;
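As a rough illustration of deducing feature types and then computing statistical indicators for discrete features, the following Python sketch may help. The claims do not specify the deduction rule or which indicators are computed, so the distinct-value cutoff `max_discrete_unique`, the function names, and the chosen indicators (count, missing, unique, mode, mode ratio) are all hypothetical.

```python
from collections import Counter


def infer_feature_type(values, max_discrete_unique=10):
    """Assumed rule: numeric columns with many distinct values are continuous."""
    present = [v for v in values if v is not None]
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if all_numeric and len(set(present)) > max_discrete_unique:
        return "continuous"
    return "discrete"


def discrete_indicators(values):
    """Simple statistical indicators for a discrete feature column (illustrative)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    mode, mode_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(counts),
        "mode": mode,
        "mode_ratio": mode_count / len(present) if present else 0.0,
    }
```

For example, `infer_feature_type(["a", "b", "a"])` returns `"discrete"`, and that column could then be passed to `discrete_indicators`.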

3. The method for updating a data mining model according to claim 2, further comprising performing distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and forming and displaying a distribution image under a target scene according to the distribution exploration result; the target scenes of the continuous numerical features comprise binary classification scenes, and the target scenes of the discrete features comprise regression scenes.

4. The method for updating a data mining model according to claim 3, wherein if distribution exploration is performed on the continuous numerical feature sample data, there are:

5. The method of claim 3 or 4, further comprising, in a supervised scene, using label column information to distinguish sample label values on the distribution image by different colors.

6. The method of claim 1, further comprising, if there is a time column in the target dataset:

7. A data mining model updating system, comprising:

a bin value module, configured to perform index exploration on a training data set and a data set to be tested by using a data exploration method, to obtain the bin value proportion of continuous numerical feature sample data in the training data set and the bin value proportion of continuous numerical feature sample data in the data set to be tested;

a model updating module, configured to calculate a stability index value between the training data set and the data set to be tested according to the bin value proportion of the continuous numerical feature sample data in the training data set and the bin value proportion of the continuous numerical feature sample data in the data set to be tested, and to update a data mining model when the stability index value is greater than a preset threshold value; the data mining model is used for classifying, regressing and/or sorting sample data; wherein the stability index value PSI = sum((bin value proportion of the continuous numerical feature sample data in the data set to be tested - bin value proportion of the continuous numerical feature sample data in the training data set));

8. The data mining model updating system of claim 7, wherein if the feature type of the sample data includes a continuous numerical feature and a discrete feature, the process by which the index exploration module obtains the index exploration result includes:

and/or determining a statistical indicator of the discrete feature sample data;

9. The data mining model updating system according to claim 8, further comprising a distribution exploration module, configured to perform distribution exploration on the sample data corresponding to each feature type according to the index exploration result, and to form and display a distribution image under a target scene according to the distribution exploration result; the target scenes of the continuous numerical features comprise binary classification scenes, and the target scenes of the discrete features comprise regression scenes.

10. The data mining model updating system of claim 9, wherein if distribution exploration is performed on the continuous numerical feature sample data, there are:

11. The system of claim 7, further comprising a time series module configured to, when the target data set has a time column, construct an index from the time column and aggregate the sample data within a target time range by averaging, to obtain a time series data set;
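The time-indexed mean aggregation described in claim 11 might look like the following Python sketch. The claim does not fix the aggregation granularity or the data layout, so the daily buckets, the row dictionaries, the key names `ts` and `value`, and the half-open time range are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def mean_aggregate_by_day(rows, start, end, time_key="ts", value_key="value"):
    """Average the values per calendar day for rows with start <= timestamp < end."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    buckets = defaultdict(list)
    for row in rows:
        ts = datetime.fromisoformat(row[time_key])
        if lo <= ts < hi:                       # keep only the target time range
            buckets[ts.date().isoformat()].append(row[value_key])
    # The day string acts as the index built from the time column.
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}
```

Keying the result by ISO date strings keeps the output ordered and easy to compare between the training data set and the data set to be tested.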

12. A computer device, comprising:

one or more processors; and

one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the method of any of claims 1-6.

13. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method of any of claims 1-6.