WO2021103401A1 - Data object classification method, apparatus, computer device and storage medium - Google Patents


Info

Publication number
WO2021103401A1
WO2021103401A1 PCT/CN2020/085805 CN2020085805W WO2021103401A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
classification
preset
variable
standard
Prior art date
Application number
PCT/CN2020/085805
Other languages
English (en)
French (fr)
Inventor
高源
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021103401A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification

Definitions

  • This application relates to the field of big data technology, in particular to a data object classification method, device, computer equipment and storage medium.
  • the existing solution extracts and classifies the features of the data object through a machine learning model (for example, a Gradient Boosting Decision Tree (GBDT) model), and compares the classified data with the screening requirements to obtain the target data that meets those requirements.
  • the inventor realizes that these methods achieve high classification accuracy for data with a relatively simple structure. For data with a relatively complex structure, however (for example, when the same object has two or more data sources, the dimensions of data from different sources are not necessarily the same, and the data are only partially associated), existing machine learning models cannot effectively synthesize the features of the associated dimensions, so the extracted features are not accurate enough and the screening and classification of the object data is imprecise.
  • This application provides a data object classification method, apparatus, computer device and storage medium to solve the technical problem in the prior art that data object classification is inaccurate because features cannot be accurately extracted.
  • a data object classification method includes:
  • the target classification of the object to be classified in the comprehensive classification result is determined according to the basic data of each object to be classified, the target score of the object to be classified relative to the evaluation score is determined according to the associated data, and the target score is weighted to obtain the estimated probability that the object to be classified belongs to the target classification;
  • the object to be classified is classified into the target classification as the target object.
  • a data object classification device includes:
  • the data division module is used to obtain the basic data of each object to be classified as the data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirements;
  • the data classification module is used to obtain standard nominal variables according to the standard data, perform feature classification on the standard nominal variables, and perform fusion processing on the classified results to obtain a comprehensive classification result;
  • a data evaluation module configured to obtain an associated nominal variable according to the associated data, and perform data evaluation on the associated nominal variable to obtain an evaluation score of the associated nominal variable;
  • the object screening module is configured to determine the target classification of the object to be classified in the comprehensive classification result according to the basic data of each object to be classified, determine the target score of the object to be classified relative to the evaluation score according to the associated data, and weight the target score to obtain the estimated probability that the object to be classified belongs to the target classification;
  • the object classification module is configured to classify the object to be classified into the target classification as the target object if the estimated probability is greater than a preset threshold.
  • a computer device includes: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to execute a data object classification method, wherein the data object classification method includes:
  • the target classification of the object to be classified in the comprehensive classification result is determined according to the basic data of each object to be classified, the target score of the object to be classified relative to the evaluation score is determined according to the associated data, and the target score is weighted to obtain the estimated probability that the object to be classified belongs to the target classification;
  • the object to be classified is classified into the target classification as the target object.
  • in the above data object classification method, apparatus, computer device and storage medium, the acquired basic data of the data objects is classified and then input into preset classifiers for processing to obtain classification results and evaluation results; the obtained classification results are then summarized to determine the target classification of the screened data objects. This makes the final determination of the target classification to which the screened data objects belong more targeted and more in line with the screening requirements, improves the accuracy of data object classification, and solves the technical problem of inaccurate data object classification in the prior art.
  • FIG. 1 is a schematic diagram of the application environment of the data object classification method;
  • FIG. 2 is a schematic flowchart of a data object classification method;
  • FIG. 3 is a schematic flowchart of step 202 in FIG. 2;
  • FIG. 4 is a schematic flowchart of step 204 in FIG. 2;
  • FIG. 5 is a schematic flowchart of step 402 in FIG. 4;
  • FIG. 6 is another schematic flowchart of step 204 in FIG. 2;
  • FIG. 7 is a schematic flowchart of step 604 in FIG. 6;
  • FIG. 8 is a schematic flowchart of step 206 in FIG. 2;
  • FIG. 9 is a schematic diagram of a data object classification apparatus;
  • FIG. 10 is a schematic diagram of a computer device in an embodiment.
  • the data object classification method provided by the embodiment of the present application can be applied to the application environment as shown in FIG. 1.
  • the application environment may include a terminal 102, a network 106, and a server 104.
  • the network 106 is used to provide a communication link medium between the terminal 102 and the server 104.
  • the network 106 may include various connection types, such as wired or wireless communication links, fiber-optic cables, and so on.
  • the user can use the terminal 102 to interact with the server 104 through the network 106 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc., may be installed on the terminal 102.
  • the terminal 102 may be any of various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 104 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal 102.
  • the data object classification method provided in the embodiments of the present application is generally executed by the server/terminal, and correspondingly, the data object classification apparatus is generally set in the server/terminal device.
  • terminals, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • the terminal 102 communicates with the server 104 through the network.
  • the server 104 uses the terminal 102 as a data object and pulls basic data from it, and classifies the pulled basic data according to the preset screening requirements. After the classified data is processed by different processing methods, a comprehensive classification result and evaluation scores are obtained. Finally, the data object is classified and evaluated according to the processed data, and the estimated probability that the data object belongs to the obtained classification category is determined. If the estimated probability is greater than the preset threshold, the classification is considered correct, and the data object is regarded as the target object.
  • the terminal 102 and the server 104 are connected through a network.
  • the network can be a wired network or a wireless network.
  • the terminal 102 can be, but is not limited to, various personal computers, laptops, smart phones, tablets, and portable wearable devices.
  • the server 104 can be implemented by an independent server or a cluster of multiple servers.
  • a data object classification method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • Step 202 Obtain the basic data of each object to be classified as the data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirement data.
  • the data to be processed comes from multiple data sources.
  • the server collects the required data from each data source as the basic data, then uses all the basic data obtained as the data to be processed, and divides the data to be processed into standard data and associated data according to the preset screening requirement data.
  • the data source can be each server in a server cluster, and the basic data corresponding to the data object can include the running time of the server, hardware parameters, log files, historical maintenance records, and the number of solutions completed by the server.
  • a server is regarded as a data object, as the object to be classified.
  • each data object corresponds to a wide range of basic data dimensions.
  • a distributed storage method is generally used for data storage, but with general distributed storage the bandwidth-limited transmission rate is too slow, which makes data collection inefficient, and when the data volume is too large the data collection channel becomes congested or even paralyzed. Therefore, in this embodiment, a big data platform is used: from the distributed storage system, according to the identification of the data object, the data containing that identification is obtained from each data source as the basic data of the data object.
  • the standard data is data that has a strong correlation with the preset screening requirement data.
  • the associated data is data whose correlation with the preset screening requirement data is weaker than that of the standard data, or that has no correlation with the preset screening requirement data.
  • Step 204 Obtain standard nominal variables according to the standard data, perform feature classification on the standard nominal variables, and perform fusion processing on the classified results to obtain a comprehensive classification result.
  • Discrete variables refer to variables whose values can be listed one by one in a certain order, usually with integer values. For example, the number of times the server solves the problem per unit time, the number of repairs per month in a year, etc., the value of discrete variables can be obtained by counting.
  • General data preprocessing methods include: continuous variable discretization, data binning, one-hot encoding, etc.
  • the required data can be obtained through continuous variable discretization as a standard nominal variable.
  • the standard nominal variables in this embodiment are discrete variables.
  • feature classification processing is performed on the standard nominal variables to obtain the distinguishing features and feature combinations in the standard nominal variables, yielding a feature classification result.
  • the classification results are then fused to obtain a prediction result as the comprehensive classification result.
  • Step 206 Obtain the associated nominal variable according to the associated data, and perform data evaluation on the associated nominal variable to obtain the evaluation score of the associated nominal variable.
  • the data of each variable or parameter is selected from the associated data, the information value (IV, Information Value) of the variable or parameter is calculated for the filtering operation, and then a certain number of variables or parameters are selected based on the calculated IV values.
  • for example, a logistic regression model selects 8–14 variables or parameters whose IV values meet the requirements, while a boosting tree model selects 20–30 variables or parameters.
  • for each variable, i groups of data are randomly selected from the variable's data and input into the logistic regression model for evaluation. Because a variable can correspond to multiple data objects (generally several), the data objects selected according to each variable are different. Each data object carries a user tag, such as potential customer, hard-core customer, or unintentional customer. Specifically, the IV value is calculated based on the weight of evidence (WOE, Weight of Evidence) of the data.
  • Step 208 Determine the target classification of the object to be classified in the comprehensive classification result according to the basic data of each object to be classified, determine the target score of the object to be classified relative to the evaluation score according to the associated data, and weight the target score to obtain the estimated probability that the object to be classified belongs to the target classification.
  • Step 210 If the estimated probability is greater than the preset threshold, classify the object to be classified into the target classification as the target object.
  • the object to be classified belongs to the target classification.
  • the target classification of the object to be classified is obtained, and the target classification of all the other objects to be classified is obtained in the same way. In this way, the data objects corresponding to each category in the comprehensive classification result can be obtained, and the screening and classification of the objects to be classified is realized.
  • the comprehensive classification result can contain multiple categories; in the final classification result, not every category necessarily contains data objects that meet the requirements, and the specific results after data processing shall prevail.
  • the basic data of the acquired data objects is classified and then input into preset classifiers for processing to obtain the classification results and the evaluation results, and the obtained classification results are then summarized to determine the target classification of the screened data objects.
  • step 202 includes:
  • Step 302 Classify the to-be-processed data according to the object attribute to obtain the object attribute data.
  • Object attributes are the various attributes of the object to be classified, such as the user location data stored on the server, the number of times a user participates in questions under a certain topic, the number of purchases of a certain type of goods, the frequency of searching for a place name, and so on. The locations that users frequently visit, for example, are one of the object attributes of the data to be processed.
  • Step 304 Calculate the correlation coefficient between the object attribute data and the preset screening requirement data by Spearman's rank correlation coefficient method, as the data correlation level.
  • Spearman's rank correlation coefficient evaluates how well the relationship between two variables can be described using a monotonic function.
  • the preset screening requirements can be set according to the application scenario. For example, to push a certain technology seminar to users, the target group needs to be obtained. Specifically, the Spearman rank correlation coefficient is used to calculate the correlation between the number of times someone purchases a certain type of item and the probability of that person attending a certain technology seminar, or between the number of times someone participates in a certain topic and the probability of that person attending a technology seminar.
  • for example, the object attribute data may be the correlation between "the number of times a user participates in questions under electrical engineering topics" and "someone participates in an intellectual property and enterprise R&D conference initiated by a certain technology company", or between "the number of times someone goes to a certain place" and "someone participates in the intellectual property and enterprise R&D conference initiated by a technology company".
  • the correlation between them and the preset screening requirement data is calculated through the Spearman rank correlation coefficient, as the data correlation level.
  • the result obtained can be monotonically correlated or uncorrelated, and this can be seen intuitively from the resulting data chart.
  • Step 306 If the data correlation level meets the preset correlation level, use the object attribute data as standard data.
  • the preset correlation level may be that the object attribute data and the preset filtering requirement data are positively correlated. If the data correlation level is also that the object attribute data is positively correlated with the preset filtering requirement data, then the object attribute data is used as the standard data.
  • Step 308 If the data correlation level does not meet the preset correlation level, the object attribute data is used as the associated data.
  • the object attribute data that does not meet the preset relevant level is regarded as the associated data.
  • note that the preset correlation level may require not only a positive correlation between the object attribute data and the preset screening requirement data, but possibly a negative correlation between them; this needs to be determined according to needs, and there is no limitation here.
  • the data to be processed is classified and then processed by calculating the degree of correlation between the object attribute data and the preset screening requirement data, and the data set that has a strong correlation with the preset screening requirement data is processed to obtain the processing result, improving the accuracy of the classification of the objects to be classified.
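The split in steps 302–308 can be sketched as follows. This is a minimal illustration, not the application's implementation: the 0.5 threshold and all function names are hypothetical, and the hand-rolled Spearman coefficient assumes no tied values.

```python
def _ranks(values):
    # Rank positions starting at 1 (tie handling omitted for simplicity).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman(x, y):
    # Classic formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties.
    n = len(x)
    rx, ry = _ranks(x), _ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def split_by_correlation(attribute_data, requirement, threshold=0.5):
    # Attributes whose rank correlation with the screening-requirement data
    # reaches the threshold become standard data; the rest, associated data.
    standard, associated = {}, {}
    for name, values in attribute_data.items():
        if spearman(values, requirement) >= threshold:
            standard[name] = values
        else:
            associated[name] = values
    return standard, associated
```

In practice a library implementation such as `scipy.stats.spearmanr` (which also handles ties) would replace the hand-rolled coefficient.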
  • step 204 includes:
  • Step 402 Extract the continuous variable data in the standard data, and perform pseudo-splitting of the continuous variable data according to the preset pseudo-splitting point to obtain the information gain of the continuous variable data before and after the pseudo-splitting.
  • the pseudo-splitting points can be a number of points marked on the continuous variable data at equal intervals. These points divide the continuous variable data into several sub-segments; the entropy of each sub-segment is then calculated to obtain the pseudo-splitting entropy, which is compared with the entropy of the continuous variable data. The difference between the entropy of the sub-segments after the pseudo-split and the entropy of the continuous variable data before the pseudo-split is taken as the information gain.
  • Step 404 If the information gain is greater than the preset gain difference, use the pseudo-split point as a split point to split the continuous variable data, obtain the discretized data after splitting, and use the split point as the preset pseudo-split point for the next round of splitting.
  • the preset gain difference can be adjusted according to business needs.
  • Step 406 If the number of splits reaches the preset number of splits, the splitting is stopped, and the discretized data obtained after the last splitting is used as a discrete variable.
  • when the number of splits reaches the preset number of splits, the discretized data obtained after the last split can meet the needs; the splitting is stopped, and that discretized data is used as the discrete variable.
  • the discretized data obtained before the preset number of splits is reached is not yet truly discrete data (such as 12/3/5/13/34), but each split's information gain meets the preset gain difference; the data on both sides of a split point can be the data within a certain period of time.
  • Step 408 Perform dimensionality reduction processing on the discrete variables through data binning, and sort the discrete data obtained after the dimensionality reduction processing according to the characteristic values of the continuous variable data to obtain standard nominal variables.
  • the methods of data binning include but are not limited to: equal frequency binning and equal width binning.
  • Each box of discrete variables after binning is regarded as a nominal variable.
  • the eigenvalues of the nominal variables are sorted from small to large.
  • a standard nominal variable is a categorical variable whose value is qualitative, that is, a value determined under the existing premises or conditions, manifested as mutually exclusive categories or attributes.
  • the continuous variable data is discretized, which can increase the speed of data processing and facilitate storage and use.
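The two binning methods named above can be sketched like this (an illustrative stdlib-only sketch; the function names are hypothetical, and real pipelines would typically use `pandas.cut` / `pandas.qcut`):

```python
def equal_width_bins(values, n_bins):
    # Each bin covers an equal-width slice of the value range.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_freq_bins(values, n_bins):
    # Each bin receives (roughly) the same number of values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) / n_bins
    bins = [0] * len(values)
    for pos, i in enumerate(order):
        bins[i] = min(int(pos / per_bin), n_bins - 1)
    return bins
```

Each bin index then stands for one nominal category, which can be sorted by its representative (e.g. smallest) feature value as described above.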
  • step 402 further includes:
  • Step 502 Perform feature data cutting from the continuous variable data according to the preset pseudo-splitting point, and calculate the pseudo-splitting data entropy of the feature data obtained after the cutting.
  • the continuous variable data is cut into the characteristic data.
  • for example, a monitoring terminal monitors the temperature of a server during the hour from 8:00 to 9:00 in the morning, performing a data cut every minute.
  • each minute the terminal performs a temperature measurement, obtaining 60 measurement values: t_1, t_2, t_3, ..., t_59, t_60. It is easy to understand that these measurement values form a continuous variable.
  • the pseudo-split points are n_1, n_2, n_3, ..., n_58, n_59, where n_1 is the pseudo-split point between t_1 and t_2, and n_2 is the pseudo-split point between t_2 and t_3. The data entropy between every two pseudo-split points is then calculated as the pseudo-splitting data entropy.
  • Step 504 Obtain the continuous data entropy of the continuous variable data, and calculate the difference between the continuous data entropy and the pseudo-splitting data entropy as an information gain.
  • the entropy of the continuous variable data is calculated as the continuous data entropy, the difference between the entropy of each segment to be split and the continuous data entropy is calculated, and the calculated difference is used as the information gain.
  • the continuous variable data is discretized to extract representative values in the continuously changing data, simplify analysis, increase the speed of data processing, and facilitate storage and use.
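One common reading of steps 502–504, sketched below, computes Shannon entropy over class labels: the information gain of a candidate pseudo-split point is the parent entropy minus the size-weighted entropy of the two sub-segments. This is an illustrative interpretation (the application does not specify the entropy formula), and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a label sequence; an empty segment contributes 0.
    if not labels:
        return 0.0
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, split_point):
    # Gain = parent entropy - weighted entropy of the two sub-segments.
    left = [l for v, l in zip(values, labels) if v <= split_point]
    right = [l for v, l in zip(values, labels) if v > split_point]
    n = len(labels)
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(labels) - weighted
```

A split point that cleanly separates the classes yields the maximum gain, which is the condition checked against the preset gain difference in step 404.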
  • step 204 includes:
  • Step 602 Input standard nominal variables into at least two preset machine learning models for feature classification, and obtain feature classification results corresponding to each preset machine learning model.
  • the preset machine learning models include, but are not limited to: Gradient Boosting Decision Tree (GBDT), Boosting Tree, Random Forest, the ID3 algorithm model, etc.
  • feature classification includes feature extraction of standard nominal variables, and then classification processing according to the extracted features to obtain feature classification results.
  • step 604 a K-fold cross-validation method is used to perform fusion processing on the feature classification results to obtain a comprehensive classification result.
  • K-fold cross-validation first divides all the data, that is, the feature classification results obtained through each preset machine learning model, into K subsamples; one subsample is selected as the test set without repetition, and the other K-1 subsamples are used for training. This is repeated K times in total, and the K results are averaged (or combined with other indicators) to finally obtain a single estimate.
  • K-fold cross-validation is used to ensure that each sub-sample participates in training, which reduces the generalization error.
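The fold construction described above can be sketched as follows (a minimal stdlib sketch with a hypothetical helper name; `sklearn.model_selection.KFold` provides the same behavior in practice). Each sample lands in the held-out fold exactly once, which is what guarantees that every subsample participates in both training and validation.

```python
def k_fold_indices(n, k):
    # Split indices 0..n-1 into k non-overlapping test folds; for each fold
    # the remaining indices form the training set.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds
```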
  • step 604 includes:
  • Step 702 Segment the feature classification result into a feature classification training set and a feature classification test set.
  • the feature classification training set is used to train the model, and the feature classification test set is used to test the feature classification model trained by the feature classification training set.
  • for example, the feature classification training set has 10,000 rows and the feature classification test set has 2,500 rows.
  • Step 704 Segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set, and verify the feature classification model obtained through the feature classification feeding set training according to the feature classification verification set, to obtain the feature classification verification data .
  • the preset cutting conditions are: each time, a certain amount of data is taken from the feature classification training set as the feature classification verification set for model verification, and the remaining data is used as the feature classification feeding set for model training. When obtaining the feature classification verification set, it is necessary to ensure that the data taken from the feature classification training set each time has not previously participated in model verification. This ensures that every row of data participates in model verification exactly once, and that every feature classification feeding set used for model training contains data that is new compared with the previous round of training. In this way, the generalization error can be reduced.
  • if each verification pass yields 2,000 rows of data, the 10,000-row feature classification training set can be fully verified in 5 passes, and 10,000 rows of verification data are obtained as the feature classification verification data.
  • Step 706 Input the feature classification test set into the feature classification model for testing to obtain feature classification test data.
  • Step 708 Re-segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set for the next training and verification.
  • the feature classification training set can be re-segmented according to the preset cutting conditions, or the feature classification training set can be divided according to the preset cutting conditions in advance by the preset number of divisions, and then each model training Both use the new feature classification feeding set for training and so on.
  • Step 710 When the number of re-segmentation reaches the preset number of divisions, the segmentation is stopped, and all the obtained feature classification verification data and all the obtained feature classification test data are processed according to the preset fusion conditions to obtain the feature classification prediction data, and The feature classification prediction data is used as a comprehensive classification result.
  • the preset number of divisions in this embodiment may be 5 times.
  • the preset fusion condition is a method of processing all the obtained feature classification test data and feature classification verification data.
  • in this embodiment, model training, verification and testing are performed on the feature classification results obtained by each preset machine learning model, and the resulting feature classification test data and feature classification verification data are integrated.
  • for example, if 3 preset machine learning models are used, 6 data matrices can be obtained: after model training, verification and testing are performed on the feature classification results of each preset machine learning model, each set of feature classification verification data is used as one data matrix, and each set of feature classification test data is also used as one data matrix.
  • the feature classification verification data corresponding to the three preset machine learning models are labeled A1, A2, and A3 and arranged together into a matrix of 10,000 rows and 3 columns as training data, and the resulting feature classification test data, labeled B1, B2, and B3, are merged into a matrix of 2,500 rows and 3 columns as testing data, so that a lower-level learner is retrained on such data according to the preset fusion conditions.
  • the preset fusion conditions treat the feature classification verification data and feature classification test data of each preset machine learning model as three features; the data corresponding to each preset machine learning model is used as one predicted classification result, and the lower-level learner learns, through training, to assign a weight w to each predicted classification result so that the final classification is most accurate.
  • the lower-level learner can be a regression predictor.
  • In this embodiment, multiple preset machine learning models perform feature classification on the feature classification results, all feature data obtained through K-fold cross-validation are used as predicted classification results, and weights are assigned to these predicted classification results according to the preset fusion method to obtain the comprehensive classification result, ensuring that the classification in the comprehensive classification result is accurate.
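As a hedged sketch of the fusion step just described, the following fragment fits the weight w that the lower-level learner assigns to each base model's predicted classification result. The A1/A2/A3-style columns are assumed to be out-of-fold prediction columns, and plain least-squares gradient descent stands in for whatever lower-level learner (e.g. regression prediction) is actually chosen; function names are illustrative, not from the source.

```python
def fit_fusion_weights(columns, labels, lr=0.1, steps=3000):
    """Learn one weight per base model's prediction column so that the
    weighted sum best matches the true labels (least squares fitted by
    plain gradient descent).  columns: list of equal-length lists."""
    m, n = len(columns), len(labels)
    w = [1.0 / m] * m                      # start from a simple average
    for _ in range(steps):
        grads = [0.0] * m
        for i in range(n):
            err = sum(w[j] * columns[j][i] for j in range(m)) - labels[i]
            for j in range(m):
                grads[j] += 2.0 * err * columns[j][i] / n
        w = [w[j] - lr * grads[j] for j in range(m)]
    return w

def fuse(columns, w):
    """Apply the learned weights to test-set prediction columns (B1..B3)."""
    return [sum(w[j] * col[i] for j, col in enumerate(columns))
            for i in range(len(columns[0]))]
```

In a toy check, a base model whose predictions match the labels exactly ends up with essentially all of the weight, while a noisy base model's weight shrinks toward zero.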
  • In one embodiment, as shown in FIG. 8, step 206 includes:
  • Step 802 Calculate the information value of the associated nominal variable, and filter the associated nominal variable according to the information value to obtain a preset number of associated nominal variables as data evaluation variables.
  • This embodiment calculates the Weight of Evidence (WOE) from which each variable's IV value is derived, obtained by substituting into formulas (1), (2), and (3) in turn:
  • For example, if the user labels of the data objects are "good customer" and "bad customer", then Bad_i and Good_i respectively denote the number of bad customers and the number of good customers in the i-th group of the variable, while Bad_total and Good_total respectively denote the total number of bad customers and the total number of good customers across all groups. Bad and Good are labels set when classifying the training data; note that when Bad_i = 0, the IV value of that group is set directly to 1.
  • The obtained IV values are then used to screen the associated nominal variables, yielding a preset number of associated nominal variables as data evaluation variables.
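A minimal sketch of the IV screening, assuming the standard WOE/IV formulation described by formulas (1)–(3), including the source's convention that a group with Bad_i = 0 contributes an IV of 1 directly; the example counts and the top-n cut are made up for illustration.

```python
import math

def woe_iv(groups):
    """groups: list of (bad_count, good_count) per group of one variable.
    Returns (WOE per group, total IV).  Assumes every group contains at
    least one good customer; a group with zero bad customers contributes
    an IV of 1 directly, per the source's stated convention."""
    bad_total = sum(b for b, _ in groups)
    good_total = sum(g for _, g in groups)
    woes, iv = [], 0.0
    for b, g in groups:
        if b == 0:
            woes.append(0.0)   # WOE undefined here; IV set to 1 per the text
            iv += 1.0
            continue
        pb, pg = b / bad_total, g / good_total
        w = math.log(pb / pg)          # formula (1)
        woes.append(w)
        iv += (pb - pg) * w            # formulas (2) and (3)
    return woes, iv

def screen_variables(iv_by_name, top_n):
    """Keep the top_n variables by IV as data evaluation variables."""
    return sorted(iv_by_name, key=iv_by_name.get, reverse=True)[:top_n]
```

A variable whose groups separate bad and good customers well (e.g. (30, 5) vs. (5, 30)) yields a large IV; identical bad/good distributions in every group yield an IV of 0.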
  • Step 804: Input the data evaluation variables into the preset logistic regression model for data evaluation to obtain the evaluation scores of the associated nominal variables.
  • The data evaluation variables screened according to the IV value are input into the preset logistic regression model to perform the classification operation.
  • In this embodiment, the score of each component of the associated nominal variables is calculated by the preset logistic regression model as the evaluation score;
  • a classifier based on a machine learning model calculates the score of a certain type of capability data as the evaluation score of that capability data.
  • The preset logistic regression model may be a classifier based on a logistic regression model.
  • Specifically, in the logistic regression sub-model, the 10–20 selected associated nominal variables with higher IV values are input into the model for training, the classification probability PL is calculated through logistic regression, and from PL the probability that the object to be classified corresponding to the user belongs to a certain category is derived, thereby achieving accurate classification of the object to be classified and precise data pushing.
  • In this embodiment, the IV value of each variable is calculated, the associated nominal variables are screened according to the IV value, and data with strong predictive power is selected, so that the obtained evaluation scores are more accurate.
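The logistic-regression scoring above can be sketched as computing the classification probability PL via a sigmoid over the weighted data evaluation variables; the coefficients below are invented for illustration (a real model would be trained on the 10–20 screened variables).

```python
import math

def classification_probability(weights, bias, evaluation_vars):
    """PL = sigmoid(b + w.x): probability that the object to be
    classified belongs to the target category."""
    z = bias + sum(w * x for w, x in zip(weights, evaluation_vars))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative call with made-up coefficients:
pl = classification_probability([0.8, -0.3], 0.1, [1.0, 0.5])  # PL ≈ 0.68
```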
  • a data object classification device is provided, and the data object classification device corresponds to the data object classification method in the foregoing embodiment one-to-one.
  • the data object classification device includes a data division module 902, a data classification module 904, a data evaluation module 906, an object screening module 908, and an object classification module 910, wherein:
  • the data division module 902 is used to obtain the basic data of each object to be classified as the data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirements.
  • the data classification module 904 is used to obtain standard nominal variables according to the standard data, perform feature classification on the standard nominal variables, and perform fusion processing on the classified results to obtain a comprehensive classification result.
  • the data evaluation module 906 is used to obtain the associated nominal variable according to the associated data, and perform data evaluation on the associated nominal variable to obtain the evaluation score of the associated nominal variable.
  • the object screening module 908 is used to determine the target classification of the object to be classified in the comprehensive classification result according to the basic data of each object to be classified, determine the target score of the object to be classified relative to the evaluation scores according to the associated data, and weight the target score to obtain the evaluation probability that the object to be classified belongs to the target classification.
  • the object classification module 910 is configured to classify the object to be classified into the target classification as the target object if the evaluation probability is greater than the preset threshold.
  • the data division module 902 includes:
  • the data classification sub-module 9022 is used to classify the to-be-processed data according to the object attributes to obtain the object attribute data;
  • the correlation calculation sub-module 9024 is used to calculate the correlation coefficient between the object attribute data and the preset screening demand data through the Spearman rank correlation coefficient method, as the data correlation level;
  • the correlation determination sub-module 9026 is configured to use the object attribute data as the standard data if the data correlation level meets the preset correlation level, and to use the object attribute data as the associated data if the data correlation level does not meet the preset correlation level.
  • the data classification module 904 includes:
  • the pseudo-splitting sub-module 9042 is used to extract the continuous variable data in the standard data, perform pseudo-splitting of the continuous variable data according to the preset pseudo-splitting point, and obtain the information gain of the continuous variable data before and after the pseudo-splitting;
  • the splitting sub-module 9044 is used to split the continuous variable data with the pseudo-split point as the split point if the information gain is greater than the preset gain difference, obtaining the discretized data after splitting, and to use the split point as the preset pseudo-split point for the next pseudo-split;
  • the split pre-judgment sub-module 9046 is used to stop the split if the number of splits reaches the preset number of splits, and use the discretized data obtained after the last split as a discrete variable;
  • the dimensionality reduction processing sub-module 9048 is used to reduce the dimensionality of discrete variables through data binning, and sort the discrete data obtained after the dimensionality reduction processing according to the characteristic values of the continuous variable data to obtain standard nominal variables.
  • pseudo-split sub-module 9042 includes:
  • the entropy calculation unit 9042a is configured to perform feature data cutting from the continuous variable data according to a preset pseudo-splitting point, and calculate the pseudo-splitting data entropy of the feature data obtained after the cutting;
  • the information gain unit 9042b is used to obtain the continuous data entropy of the continuous variable data, and calculate the difference between the continuous data entropy and the pseudo-splitting data entropy, as the information gain.
  • the data classification module 904 further includes:
  • the feature classification sub-module 9050 is used to input standard nominal variables into at least two preset machine learning models for feature classification, and obtain the feature classification results corresponding to each preset machine learning model;
  • the feature fusion sub-module 9052 is used to perform fusion processing on the feature classification results using the K-fold cross-validation method to obtain a comprehensive classification result.
  • the feature fusion sub-module 9052 includes:
  • the feature cutting unit 9052a is used to segment the feature classification result into a feature classification training set and a feature classification test set;
  • the model verification unit 9052b is used to segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set, and verify the feature classification model obtained through the feature classification feeding set training according to the feature classification verification set, and obtain Feature classification verification data;
  • the model testing unit 9052c is used to input the feature classification test set into the feature classification model for testing to obtain feature classification test data;
  • the re-cutting unit 9052d is used to re-segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set for the next training and verification;
  • the feature fusion unit 9052e is used to stop the segmentation when the number of re-segmentation reaches the preset number of divisions, and process all the obtained feature classification verification data and all the obtained feature classification test data according to the preset fusion conditions to obtain feature classification prediction Data, and use feature classification prediction data as a comprehensive classification result.
  • the data evaluation module 906 includes:
  • variable screening sub-module 9062 is used to calculate the information value of the associated nominal variable, and filter the associated nominal variable according to the information value to obtain a preset number of associated nominal variables as data evaluation variables;
  • the score evaluation sub-module 9064 is used to input the data evaluation variables into the preset logistic regression model for data evaluation to obtain the evaluation scores of the associated nominal variables.
  • The above-mentioned data object classification device classifies the acquired basic data of the data objects, inputs it into the preset classifiers for processing to obtain classification results and evaluation results, and then aggregates the obtained results to determine the target classification of the data objects to be screened, improving the accuracy of data object classification.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store user order data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a data object classification method.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, etc.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a data object classification method.
  • the display screen of the computer equipment can be a liquid crystal display or an electronic ink display;
  • the input device of the computer equipment can be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer equipment, or an external keyboard, touchpad, or mouse.
  • FIG. 10 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • The specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor.
  • When the processor executes the computer program, the steps of the data object classification method in the foregoing embodiments are implemented, such as steps 202 to 210 shown in FIG. 2; alternatively, when the processor executes the computer program, the functions of the modules/units of the data object classification apparatus in the foregoing embodiments are realized, such as the functions of modules 902 to 910 shown in FIG. 9. To avoid repetition, details are not repeated here.
  • a computer-readable storage medium is provided.
  • the storage medium is a volatile storage medium or a non-volatile storage medium, on which a computer program is stored.
  • When the computer program is executed by a processor, the steps of the data object classification method in the foregoing embodiments are implemented, such as steps 202 to 208 shown in FIG. 2; alternatively, when the processor executes the computer program, the functions of the modules/units of the data object classification device in the foregoing embodiments are realized, such as the functions of modules 902 to 910 shown in FIG. 9. To avoid repetition, details are not repeated here.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data object classification method, apparatus, computer device, and readable storage medium, belonging to the field of big data. The method includes: acquiring the basic data of each object to be classified as data to be processed, and dividing the data to be processed into standard data and associated data according to preset screening requirement data; obtaining standard nominal variables from the standard data, performing feature classification on the standard nominal variables, and fusing the classified results to obtain a comprehensive classification result; obtaining associated nominal variables from the associated data and performing data evaluation on them to obtain evaluation scores of the associated nominal variables; and determining, according to the evaluation scores, the target objects in each category of the comprehensive classification result. The method improves the accuracy of data object classification and solves the technical problem of inaccurate data object classification in the prior art.

Description

数据对象分类方法、装置、计算机设备和存储介质
本申请要求于2019年11月25日提交中国专利局、申请号为201911165269.4,发明名称为“数据对象分类方法、装置、计算机设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
【技术领域】
本申请涉及大数据技术领域,特别是涉及一种数据对象分类方法、装置、计算机设备和存储介质。
【背景技术】
伴随着云计算和大数据的发展,在大数据计算领域涌现出了各种各样的计算模型,用于在各种各样的数据计算场景中进行处理和计算;其中,从海量数据中筛选出部分有用数据这一数据计算场景的应用范围变得越来越广(例如信息推送场景、数据分类场景等),尤其表现为根据一次性输入的大批量筛选需求,从海量用户数据中筛选出满足大批量筛选需求中,各个筛选需求的目标用户群。
现有的解决方案是通过机器学习模型(例如:决策树模型(Gradient Boosting Decision Tree,GBDT))来对数据对象的特征进行提取并分类,并将分类后的数据与筛选需要进行对比,得到不同筛选需要的目标数据。发明人意识到这些方式对于组成结构较为单一的数据,具有较高的分类准确率,但是对于组成结构比较复杂的数据(例如同一对象,其数据来源为两个或多个,不同来源的数据其维度不一定相同,且存在部分关联),现有的机器学习模型无法对关联维度的特征进行有效综合,使得提取到的特征准确度不够,导致对象数据的筛选分类不够准确。
【发明内容】
基于此,有必要针对上述技术问题,本申请提供一种数据对象分类方法、装置、计算机设备及存储介质,以解决现有技术中无法对特征进行准确提取,导致的数据对象分类不准确的技术问题。
一种数据对象分类方法,所述方法包括:
获取每个待分类对象的基础数据,作为待处理数据,并根据预设筛选需求数据将所述待处理数据分为标准数据与关联数据;
根据所述标准数据得到标准名义变量,对所述标准名义变量进行特征分类,并对分类后的结果进行融合处理,得到综合分类结果;
根据所述关联数据得到关联名义变量,并对所述关联名义变量进行数据评估,得到所述关联名义变量的评估分值;
根据每个所述待分类对象的基础数据确定所述待分类对象在所述综合分类结果中的目标分类,并根据所述关联数据确定所述待分类对象相对于所述评估分值的目标分值,将所述目标分值进行加权处理,得到所述待分类对象属于所述目标分类的评估概率;
若所述评估概率大于预设阈值,则将所述待分类对象归类到所述目标分类中,作为目标对象。
一种数据对象分类装置,所述装置包括:
数据划分模块,用于获取每个待分类对象的基础数据,作为待处理数据,并根据预设筛选需求将所述待处理数据分为标准数据与关联数据;
数据分类模块,用于根据所述标准数据得到标准名义变量,对所述标准名义变量进行特征分类,并对分类后的结果进行融合处理,得到综合分类结果;
数据评估模块,用于根据所述关联数据得到关联名义变量,并对所述关联名义变量进行数据评估,得到所述关联名义变量的评估分值;
对象筛选模块,用于根据每个所述待分类对象的基础数据确定所述待分类对象在所述综合分类结果中的目标分类,并根据所述关联数据确定所述待分类对象相对于所述评估分值的目标分值,将所述目标分值进行加权处理,得到所述待分类对象属于所述目标分类的评估概率;
对象分类模块,用于若所述评估概率大于预设阈值,则将所述待分类对象归类到所述目标分类中,作为目标对象。
一种计算机设备,包括:一个或多个处理器;存储器;一个或多个计算机程序,其中所述一个或多个计算机程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个计算机程序配置用于执行一种数据对象分类方法,其中,所述数据对象分类方法包括:
获取每个待分类对象的基础数据,作为待处理数据,并根据预设筛选需求数据将所述待处理数据分为标准数据与关联数据;
根据所述标准数据得到标准名义变量,对所述标准名义变量进行特征分类,并对分类后的结果进行融合处理,得到综合分类结果;
根据所述关联数据得到关联名义变量,并对所述关联名义变量进行数据评估,得到所述关联名义变量的评估分值;
根据每个所述待分类对象的基础数据确定所述待分类对象在所述综合分类结果中的目标分类,并根据所述关联数据确定所述待分类对象相对于所述评估分值的目标分值,将所述目标分值进行加权处理,得到所述待分类对象属于所述目标分类的评估概率;
若所述评估概率大于预设阈值,则将所述待分类对象归类到所述目标分类中,作为目标对象。
一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现数据对象分类方法,其中,所述数据对象分类方法包括以下步骤:
获取每个待分类对象的基础数据,作为待处理数据,并根据预设筛选需求数据将所述待处理数据分为标准数据与关联数据;
根据所述标准数据得到标准名义变量,对所述标准名义变量进行特征分类,并对分类后的结果进行融合处理,得到综合分类结果;
根据所述关联数据得到关联名义变量,并对所述关联名义变量进行数据评估,得到所述关联名义变量的评估分值;
根据每个所述待分类对象的基础数据确定所述待分类对象在所述综合分类结果中的目标分类,并根据所述关联数据确定所述待分类对象相对于所述评估分值的目标分值,将所述目标分值进行加权处理,得到所述待分类对象属于所述目标分类的评估概率;
若所述评估概率大于预设阈值,则将所述待分类对象归类到所述目标分类中,作为目标对象。
上述数据对象分类方法、装置、计算机设备和存储介质,通过对获取到的数据对象的基础数据进行分类后分别输入到预设的分类器中进行处理得到分类结果、评估结果,然后汇总得到的分类结果,确定带筛选数据对象的目标分类。使得最终确定目标分类中包含的待筛选数据对象更具有针对性,更符合筛选需求,提高了数据对象分类的精准度,解决了现有技术中对数据对象分类不准确的技术问题。
【附图说明】
图1为数据对象分类方法的应用环境示意图;
图2为数据对象分类方法的流程示意图;
图3为图2中步骤202的流程示意图;
图4为图2中步骤204的流程示意图;
图5为图4中步骤402的流程示意图;
图6为图2中步骤204的另一流程示意图;
图7为图6中步骤604的一流程示意图;
图8为图2中步骤206的流程示意图;
图9为数据对象分类装置的示意图;
图10为一个实施例中计算机设备的示意图。
【具体实施方式】
本申请实施例提供的数据对象分类方法,可以应用于如图1所示的应用环境中。其中,该应用环境可以包括终端102、网络106以及服务端104,网络106用于在终端102和服务端104之间提供通信链路介质,网络106可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。
用户可以使用终端102通过网络106与服务端104交互,以接收或发送消息等。终端102上可以安装有各种通讯客户端应用,例如网页浏览器应用、购物类应用、搜索类应用、即时通信工具、邮箱客户端、社交平台软件等。
终端102可以是具有显示屏并且支持网页浏览的各种电子设备,包括但不限于智能手机、平板电脑、电子书阅读器、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机等等。
服务端104可以是提供各种服务的服务器,例如对终端102上显示的页面提供支持的后台服务器。
需要说明的是，本申请实施例所提供的数据对象分类方法一般由服务端/终端执行，相应地，数据对象分类装置一般设置于服务端/终端设备中。
应该理解,图1中的终端、网络和服务端的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。
其中,终端102通过网络与服务端104进行通信。服务端104将终端102作为数据对象从中拉取基础数据,并根据预设筛选对拉取到的基础数据进行分类。分类后的数据经过不同的处理方式处理后分别得到综合分类结果和评估分值,最后根据分类后的数据对数据对象进行分类和分类评估,确定该数据对象属于得到的分类类别的评估概率,若评估概率大于预设阈值则认为分类正确,将该数据对象作为目标对象。其中,终端102和服务端104之间通过网络进行连接,该网络可以是有线网络或者无线网络,终端102可以但不限于是各种个人计算机、笔记本电脑、智能手机、平板电脑和便携式可穿戴设备,服务端104可以用独立的服务器或者是多个组成的服务器集群来实现。
在一个实施例中,如图2所示,提供了一种数据对象分类方法,以该方法应用于图1中的服务端为例进行说明,包括以下步骤:
步骤202,获取每个待分类对象的基础数据,作为待处理数据,并根据预设筛选需求数据将待处理数据分为标准数据与关联数据。
待处理数据来自于多个数据源,服务器从每一个数据源那里采集需要的数据作为基础数据,然后将得到的所有基础数据作为待处理数据,并根据预设筛选需求数据将待处理数据分 成标准数据和关联数据。其中,数据源可以是一个服务器集群的每台服务器,该数据对象对应的基础数据可以包括服务器的运行时间、硬件参数、日志文件、历史维修记录以及服务器完成解决方案的数量等等,然后将每一台服务器视为一个数据对象,作为待分类对象。
在本实施例中,由于数据对象众多,每个数据对象对应的基础数据维度较广,为了提高存储和获取效率,一般采用分布式的存储方式来进行数据存储,但一般的分布式存储由于宽带传输的速率过慢,使得数据采集效率低,或者数据量太大时,使得数据采集通道拥堵,甚至瘫痪,因而,在本实施例中,采用大数据平台,从分布式存储系统中,根据数据对象的标识,从每个数据源中,获取包含该数据对象的标识的数据,作为该数据对象的基础数据。
其中,标准数据是与预设筛选需求数据相关性强的数据,而关联数据是与预设筛选需求数据相关性相对标准数据较弱或者与预设筛选需求数据没有相关性的数据。
步骤204,根据标准数据得到标准名义变量,对标准名义变量进行特征分类,并对分类后的结果进行融合处理,得到综合分类结果。
因为采集到的基础数据中数据类型比较多样，既可以包含连续变量也可以包括离散变量。其中连续变量比如服务器的温度变化，这些连续变量无法直接应用到后续的数据分析中，因而需要对其进行数据预处理。离散变量指变量值可以按一定顺序一一列举，通常以整数位取值的变量。如服务器单位时间内解决问题的次数，一年中每个月被维修的次数等等，离散变量的数值可以用计数的方法来获取。
一般数据预处理的方式包括:连续变量离散化、数据分箱和热度编码等,本实施例可以通过连续变量离散化获取需要的数据,作为标准名义变量。本实施例中的标准名义变量便是离散变量。然后再对标准名义变量进行特征分类处理以获取标准名义变量中具有区分性的特征以及特征组合,得到一个特征分类结果,最后对分类后的结果进行融合处理,得到预测结果,作为综合分类结果。
步骤206,根据关联数据得到关联名义变量,并对关联名义变量进行数据评估,得到关联名义变量的评估分值。
在对关联数据进行处理时,可以先对关联数据进行筛选操作,得到符合业务场景的数据,然后再对筛选得到的数据进行处理,得到关联名义变量。
更具体地,从关联数据中选出各个变量或者各个参数的数据,并计算变量或参数的信息值(IV值,Information Value)进行筛选操作,然后根据计算得到的IV值选出一定数量的变量或参数。例如,逻辑回归模型选择8-14个IV值符合要求的变量或参数的数据,提升树模型选择20-30个变量或参数的数据。
对于每个变量的IV值,随机从变量的数据中选取i个组别的数据输入到逻辑回归模型中进行评估,因为一个变量所对应的数据对象可以为多个(一般为多个),所以根据每个变量所选取到的数据对象是不相同的。其中,对每一个数据对象都设有用户标签,比如潜在客户、铁杆客户或者无意客户等等。具体地,IV值是基于数据的证据权重(WOE,Weight of Evidence)计算得到的。
步骤208,根据每个待分类对象的基础数据确定待分类对象在综合分类结果中的目标分类,并根据关联数据确定待分类对象相对于评估分值的目标分值,将目标分值进行加权处理,得到待分类对象属于目标分类的评估概率。
对于任意一个数据对象,即待分类对象,都需要将该待分类对象的标准数据放入待处理数据与综合分类结果中进行比对,确定该待分类对象所属在综合分类结果中的类别,再根据该待处理数据的关联数据和上述评估分值确定该待分类对象的目标分值;然后再对目标分值进行加权处理,得到一个评估概率,该评估概率可以用于作为展示待分类对象数据目标分类中的概率。
步骤210,若评估概率大于预设阈值,则将待分类对象归类到目标分类中,作为目标对象。
当评估概率大于预设阈值时，则可以进一步确认该待分类对象是属于该目标分类的，换而言之，也就是得到了该待分类对象的目标分类，然后对于每一个待分类对象都做如是处理，由此，可得到综合分类结果中的每个类别中所对应的数据对象，实现对待分类对象的筛选与分类。需要说明的是，综合分类结果中可以包含多个类别，在最终的分类结果中，不一定每个类别中均存在符合要求的数据对象，具体以数据处理后的结果为准。
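Steps 208–210 can be sketched as follows: the object's target scores are weighted into an evaluation probability and compared with the preset threshold. The normalized weighted average and the 0.6 threshold are illustrative assumptions; the patent does not fix a particular weighting scheme.

```python
def evaluation_probability(target_scores, score_weights):
    """Weight the object's target scores into one evaluation probability
    (assumed here to be a normalized weighted average)."""
    return (sum(s * w for s, w in zip(target_scores, score_weights))
            / sum(score_weights))

def classify_object(target_scores, threshold=0.6, weights=(0.5, 0.5)):
    """Step 210: place the object into the target classification only
    when its evaluation probability exceeds the preset threshold."""
    return evaluation_probability(target_scores, weights) > threshold
```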
上述数据对象分类方法中,通过对获取到的数据对象的基础数据进行分类后分别输入到预设的分类器中进行处理得到分类结果、评估结果,然后汇总得到的分类结果,确定带筛选数据对象的目标分类。使得最终确定目标分类中包含的待筛选数据对象更具有针对性,更符合筛选需求,提高了数据对象分类的精准度,解决了现有技术中对数据对象分类不准确的技术问题。
在一个实施例中,如图3所示,在步骤202,包括:
步骤302,根据对象属性对待处理数据进行分类,得到对象属性数据。
对象属性是待分类对象的各种属性,比如服务器上保存的有用户的位置数据、某话题下问题的参与数、购买某类物品次数、搜索某处地名的频次等等。那么用户常去的位置便是待处理数据的其中一个对象属性。
步骤304,通过斯皮尔曼等级相关系数方式计算所述对象属性数据与预设筛选需求数据之间的相关系数,作为数据相关等级。
斯皮尔曼等级相关系数用于评估了使用单调函数描述两个变量之间关系的程度的算法。预设筛选需求是可以根据应用场景进行设定的,例如:给用户推送参加某科技研讨会,需要获取目标人群。具体地,通过斯皮尔曼等级相关系数计算某人购买某类物品的次数与某人参加某科技研讨会之间的关联性、某人在某话题下问题的参与数与某人参加某科技研讨会的概率。比如,该对象属性数据为“用户参与电气工程类话题下的问题次数”与“某人参加某科技公司发起的知识产权与企业研发大会”之间的相关性,或者“某人去某地的次数”与“某人参加某科技公司发起的知识产权与企业研发大会”之间的相关性。
针对每一个对象属性的对象属性数据都通过斯皮尔曼等级相关系数计算它们与预设筛选需求数据之间的相关性,作为数据相关等级。其中,得出的结果可以是单调相关,也可以是无任何相关,这些可以通过得出的数据图表中直观得出。
步骤306,若数据相关等级符合预设相关等级,则将对象属性数据作为标准数据。
预设相关等级可以是对象属性数据与预设筛选需求数据是正相关。若数据相关等级也为对象属性数据与预设筛选需求数据是正相关,则将对象属性数据作为标准数据。
步骤308,若数据相关等级不符合预设相关等级,则将对象属性数据作为关联数据。
将不符合预设相关等级的对象属性数据作为关联数据,其中,预设相关等级数据不仅仅只是对象属性数据与预设筛选需求数据是正相关,也可以是对象属性数据与预设筛选需求数据是负相关,这个需要根据需要而定,此处不做限定。
本实施例通过计算对象属性数据与预设筛选需求数据之间的相关程度将待处理数据进行分类后处理,将与预设筛选需求数据具有强相关性的数据集合进行处理,得到处理结果,提高了待分类对象分类的准确度。
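Steps 302–308 can be sketched as follows: a from-scratch Spearman rank correlation coefficient (average ranks for ties, then Pearson correlation on the ranks) and a preset correlation grade deciding between standard data and associated data. The 0.5 positive-correlation cut-off is an illustrative assumption.

```python
def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def is_standard_data(attr_values, demand_values, grade=0.5):
    """Treat the attribute as standard data when its Spearman correlation
    with the screening-demand data meets the preset correlation grade."""
    return spearman(attr_values, demand_values) >= grade
```

Note that Spearman only requires a monotonic relationship: a nonlinear but increasing pairing still yields a coefficient of 1.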
在一个实施例中,如图4所示,步骤204,包括:
步骤402,提取标准数据中的连续变量数据,按照预设拟分裂点对连续变量数据进行拟分裂,得到拟分裂前和拟分裂后连续变量数据的信息增益。
拟分裂点可以是按照等差比标记在连续变量数据上的若干点,这些点将连续变量数据分割成若干份的子数据,然后计算每一份子数据的熵,得到一个拟分裂熵,再与连续变量数据的熵进行对比,得到拟分裂后各个子数据的熵与拟分裂前连续变量数据的熵的差值,作为增益变量。
步骤404,若信息增益大于预设增益差值,则将拟分裂点作为分裂点对连续变量数据进行分裂,得到分裂后的离散化数据,并将分裂点作为下一次拟分裂的预设拟分裂点进行分裂。
将信息增益大于预设增益差值的拟分裂点作为分裂点对连续变量数据进行分裂,得到离 散化数据,并获取本次分裂的分裂点作为下一次拟分裂的预设拟分裂点,再进行下一次的分裂。其中,预设的增益差值可以根据业务需要进行调整。
步骤406,若分裂的次数达到预设分裂次数,则停止分裂,并将最后一次分裂后得到的离散化数据作为离散变量。
若分裂的次数达到预设分裂次数,则表示最后一次分裂后得到的离散化数据已经可以满足需要了,则停止分裂,并将最后一次分裂得到的离散化数据作为离散变量。
进一步地,在达到预设分裂次数之前得到的离散化数据并不是真正离散的数据,如:12/3/5/13/34,而是每次获取的都是信息增量符合预设增益差值的分裂点的两侧的数据,该数据可以是某一段时间内的数据。
步骤408,通过数据分箱对离散变量进行降低维度处理,并根据连续变量数据的特征值对降低维度处理后得到的离散数据进行排序,得到标准名义变量。
使用数据分箱的方式对离散变量进行降低维度,数据分箱的方式包括但不限于:等频分箱和等宽分箱,将分箱后的每一箱离散变量作为一个名义变量,并根据名义变量的特征值由小到大对名义变量进行排序。其中,标准名义变量属于分类变量,其变量值是定性的,即在现有的前提或条件下确定的数值,表现为互不相容的类别或属性。
本实施例中将连续变量数据进行离散化处理,可以提高数据处理的速度,方便存储和运用。
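Step 408's dimensionality reduction can be sketched with equal-frequency binning — one of the binning schemes the text names — where each bin becomes a nominal variable and the bins are naturally ordered by feature value; the bin count is an illustrative choice.

```python
def equal_frequency_bins(values, n_bins):
    """Equal-frequency binning: sort the discrete values and split them
    into n_bins groups of (nearly) equal size.  Each bin becomes one
    nominal variable, ordered from small to large feature values."""
    ordered = sorted(values)
    size, rem = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        bins.append(ordered[start:end])
        start = end
    return bins
```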
在一个实施例中,如图5所示,步骤402,还包括:
步骤502，根据预设拟分裂点对连续变量数据进行特征数据切割，并计算切割后得到的特征数据的拟分裂数据熵。
根据预设拟分裂点对连续变量数据进行特征数据切割,在一个具体实施例中,某监控端对一台服务器进行温度监控,在上午8:00-9:00这一个小时内每一分钟对该服务器进行一次温度测量,得到60个测量值分别记为:t_1,t_2,t_3,...,t_59,t_60,容易理解地,该测量值为连续变量,如果要获取该小时内的温度变化大致情况,则需要获取上述测量值,并在每两个测量值之间设置拟分裂点,分别记为:n_1,n_2,n_3,...,n_58,n_59,其中,n_1为t_1和t_2之间的拟分裂点,n_2为t_2和t_3之间的拟分裂点。然后计算每两个拟分裂点之间的数据熵作为拟分裂数据熵。
步骤504,获取连续变量数据的连续数据熵,并计算连续数据熵与拟分裂数据熵的差值,作为信息增益。
计算连续变量数据的熵作为连续数据熵,并计算每一个拟分裂数据熵与该连续数据熵之间的差值,将计算得到的差值作为信息增量。其中,信息增量不只是一个,每一个拟分裂熵都对应一个信息增量。直到拟分裂的次数达到预设分裂次数,则停止分裂,最终得到5个收敛区间,分别为:[t_1,t_11]、[t_12,t_23]、[t_24,t_26]、[t_27,t_45]和[t_46,t_60],这5个收敛区间对应的收敛值分别为13℃、16℃、14℃、17℃和18℃,将这5个收敛值作为该小时内温度变化的参考值,即离散变量,由这5个离散变量可以明显看出这一个小时内温度的变化情况,而无需去查看具体每分钟的温度的测量值。
本实施例中,通过对连续变量数据进行离散化处理,提取连续变化数据中具备代表性的数值,简化分析,提高数据处理的速度,也方便存储和运用。
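A minimal sketch of the entropy bookkeeping in steps 502–504: Shannon entropy before the cut, the size-weighted entropy of the two segments after the cut, and their difference as the information gain. The patent computes entropy over the cut feature data itself, so treating the values as outcomes of a discrete distribution (and using a single cut point) is a simplifying assumption here.

```python
import math

def entropy(values):
    """Shannon entropy (base 2) of the value distribution in a sequence."""
    n = len(values)
    h = 0.0
    for v in set(values):
        p = values.count(v) / n
        h -= p * math.log2(p)
    return h

def information_gain(values, cut):
    """Gain of splitting the sequence at index `cut`: continuous-data
    entropy minus the size-weighted entropy of the two sub-segments
    (the 'pseudo-split data entropy')."""
    left, right = values[:cut], values[cut:]
    n = len(values)
    split_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(values) - split_entropy
```

A cut that cleanly separates the two value regimes yields the maximal gain, which is why such pseudo-split points are promoted to real split points.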
在一个实施例中,如图6所示,步骤204,包括:
步骤602,将标准名义变量输入到至少2个预设机器学习模型中进行特征分类,得到每个预设机器学习模型对应的特征分类结果。
预设机器学习模型包括但不限于:梯度提升树(Gradient Boosting Decison Tree,GBDT)、提升树(Boosting Tree)、随机森林(Random forest)和ID3算法模型等。其中,特征分类包括对标准名义变量进行特征提取,然后根据提取到的特征进行分类处理,得到特征分类结果。
步骤604,采用K折交叉验证方式对特征分类结果进行融合处理,得到综合分类结果。
K折交叉验证(k-fold cross-validation)首先将所有数据即通过每一个预设机器学习模型得到特征分类结果分割成K个子样本,不重复的选取其中一个子样本作为测试集,其他K-1个样本用来训练。共重复K次,平均K次的得到结果或者使用其它指标,最终得到一个单一估测。
通过本实施例,通过K折交叉验证保证每个子样本都参与训练,降低泛化误差。
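The K-fold procedure can be sketched as producing out-of-fold predictions: each fold is held out once for validation while the other k−1 folds train the model, so every sample is validated exactly once by a model that never saw it. The contiguous fold layout and the mean-value placeholder "model" (which ignores the features) are illustrative assumptions.

```python
def kfold_splits(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold = n // k
    for i in range(k):
        lo = i * fold
        hi = (i + 1) * fold if i < k - 1 else n  # last fold takes the remainder
        val = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, n))
        yield train, val

def out_of_fold_predictions(xs, ys, k):
    """Each sample's prediction comes from the model trained on the other
    k-1 folds (the 'feature classification verification data' of the text)."""
    preds = [0.0] * len(xs)
    for train, val in kfold_splits(len(xs), k):
        mean_y = sum(ys[i] for i in train) / len(train)  # placeholder model
        for i in val:
            preds[i] = mean_y
    return preds
```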
在一个实施例中,如图7所示,步骤604,包括:
步骤702,将特征分类结果分割为特征分类训练集以及特征分类测试集。
特征分类训练集用于训练模型，特征分类测试集用于测试通过特征分类训练集训练出的特征分类模型，本实施例中，特征分类训练集为10000行，特征分类测试集为2500行。
步骤704,根据预设切割条件分割特征分类训练集,得到特征分类喂养集以及特征分类验证集,并根据特征分类验证集对通过特征分类喂养集训练得到特征分类模型进行验证,得到特征分类验证数据。
预设切割条件为:每次从特征分类训练集中取出一定数量的数据作为特征分类验证集用于模型验证,剩下的数据作为特征分类喂养集用于模型的训练,在获取特征分类验证集时需要保证每次从特征分类训练集中取出的数据都是未曾参与模型验证的,以保证特征分类验证集中的每一行数据都参与到模型的验证,以及每一次进行模型训练的特征分类喂养集中都有与上次一模型训练的数据相比的新的数据。通过该方式能够降低泛化误差。
具体地,每次从特征分类训练集中取出2000行数据作为特征分类验证集,剩余的8000行数据用于模型训练,相当于每次都使用了新的2000条数据验证新训练出的特征分类模型,使用模型对验证集进行验证得到2000条数据,每一次的验证得到2000条数据,10000行特征分类训练集的数据可以分5次验证,得到10000条验证数据,作为特征分类验证数据。
步骤706,将特征分类测试集输入到特征分类模型中测试,得到特征分类测试数据。
将2500行特征分类测试集输入到每一次训练出的特征分类模型中进行预测,每一次都可以得到2500条测试数据,则将该测试数据作为特征分类测试数据。
步骤708,根据预设切割条件重新分割特征分类训练集,得到特征分类喂养集以及特征分类验证集以进行下一次的训练和验证。
每一次分割完成后或者每一次模型训练结束后都可以根据预设切割条件对特征分类训练集重新分割,或者提前根据预设切割条件将特征分类训练集分割预设分割次数,然后每一次模型训练都使用新的特征分类喂养集进行训练等等。
步骤710,当重新分割的次数达到预设分割次数,停止分割,并根据预设融合条件对得到的所有特征分类验证数据以及得到的所有特征分类测试数据进行处理,得到特征分类预测数据,并将特征分类预测数据作为综合分类结果。其中,本实施例的预设分割次数可以是5次。
预设融合条件是对得到的所有的特征分类测试数据、特征分类验证数据进行处理的方式，本实施例是：对通过每一个预设机器学习模型得到的特征分类结果进行模型训练、验证以及测试得到的特征分类测试数据以及特征分类验证数据进行集成。具体地，本实施例可以是只采用3种预设机器学习模型，可以得到6个数据矩阵，即，对于每一个预设机器学习模型的特征分类结果进行模型训练、验证测试后得到的特征分类验证数据都作为一个数据矩阵，每一个特征分类测试数据也都作为一个数据矩阵。
将3个预设机器学习模型对应的特征分类验证数据分别标记为A1、A2、A3并列在一起成10000行3列的矩阵作为训练数据（training data），得到的特征分类测试数据标记为B1、B2、B3合并在一起成2500行3列的矩阵作为测试数据（testing data），让下层学习器基于这样的数据根据预设融合条件进行再训练。其中，预设融合条件是基于每个预设机器学习模型的特征分类验证数据以及特征分类测试数据作为三个特征，其中，将每一个预设机器学习模型对应的特征分类验证数据以及特征分类测试数据作为一个预测分类结果，下层学习器会学习训练在预测分类结果上赋予权重w，来使得最后的分类最为准确。其中，下层学习器可以是回归预测。
本实施例通过多种预设机器学习模型对特征分类结果进行特征分类,并通过K折交叉验证方式得到的所有特征数据作为预测分类结果,根据预设融合方法对预测分类结果进行赋予权重,得到综合分类结果,以保证得到综合分类结果中的分类准确。
在一个实施例中,如图8所示,步骤206,包括:
步骤802,计算关联名义变量的信息值,并根据信息值对关联名义变量进行筛选,得到预设数量的关联名义变量,作为数据评估变量。
本实施例计算每个变量的IV值的证据权重WOE,可以通过依次输入到公式(1)(2)(3)中得到:
WOE_i = ln( (Bad_i / Bad_total) / (Good_i / Good_total) )        (1)
IV_i = ( Bad_i / Bad_total − Good_i / Good_total ) × WOE_i        (2)
IV = ∑ IV_i        (3)
其中,比如对数据对象的用户标签有:好客户、坏客户,那么Bad i,Good i分别表示该变量中第i个组别中的坏客户个数和好客户个数;Bad total,Good total分别表示所有组别中坏客户总数和好客户总数。其中,Bad、Good是对训练数据分类设置的标签。注意当Badi=0时,该组用户的IV值直接设为1。
然后得到的IV值对关联名义数据进行筛选,得到预设数量的关联名义变量,作为数据评估变量。
步骤804，将数据评估变量输入到预设逻辑回归模型中进行数据评估，得到关联名义变量的评估分值。
将根据IV值筛选得到的数据评估变量输入到预设逻辑回归模型中,执行分类操作,本实施例通过预设逻辑回归模型计算关联名义变量的各个分项的分数,作为评估分值,通过基于机器学习模型的分类器计算某类能力数据的分数,作为某某能力数据的评估分值。其中,预设逻辑回归模型可以是基于逻辑回归模型的分类器。
具体地,在逻辑回归子模型中,将筛选的10-20个IV值较高的变量的关联名义变量输入模型进行训练,通过逻辑回归计算分类概率PL,基于分类概率PL计算出该用户对应的待分类对象属于某类别的概率,从而实现对待分类对象进行准确归类,实现精准的数据推送。
本实施例通过计算每个变量的IV值,并根据IV值对关联名义变量进行筛选,从中筛选出变量预测能力较强的数据,使得得到的评估分值更加精确。
应该理解的是,虽然图2-图8的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2-图8中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
在一个实施例中，如图9所示，提供了一种数据对象分类装置，该数据对象分类装置与上述实施例中数据对象分类方法一一对应。该数据对象分类装置包括数据划分模块902、数据分类模块904、数据评估模块906、对象筛选模块908以及对象分类模块910，其中：
数据划分模块902,用于获取每个待分类对象的基础数据,作为待处理数据,并根据预设筛选需求将待处理数据分为标准数据与关联数据。
数据分类模块904,用于根据标准数据得到标准名义变量,对标准名义变量进行特征分类,并对分类后的结果进行融合处理,得到综合分类结果。
数据评估模块906,用于根据关联数据得到关联名义变量,并对关联名义变量进行数据 评估,得到关联名义变量的评估分值。
对象筛选模块908,用于根据每个待分类对象的基础数据确定待分类对象在综合分类结果中的目标分类,并根据关联数据确定待分类对象相对于评估分值的目标分值,将目标分值进行加权处理,得到待分类对象属于目标分类的评估概率。
对象分类模块910,用于若评估概率大于预设阈值,则将待分类对象归类到目标分类中,作为目标对象。
进一步地,数据划分模块902,包括:
数据分类子模块9022,用于根据对象属性对待处理数据进行分类,得到对象属性数据;
相关计算子模块9024,用于通过斯皮尔曼等级相关系数方式计算对象属性数据与预设筛选需求数据之间的相关系数,作为数据相关等级;
相关判定子模块9026,用于若数据相关等级符合预设相关等级,则将对象属性数据作为标准数据;还用于若数据相关等级不符合预设相关等级,则将对象属性数据作为关联数据。
进一步地,数据分类模块904,包括:
拟分裂子模块9042,用于提取标准数据中的连续变量数据,按照预设拟分裂点对连续变量数据进行拟分裂,得到拟分裂前和拟分裂后连续变量数据的信息增益;
分裂子模块9044,用于若信息增益大于预设增益差值,则将拟分裂点作为分裂点对连续变量数据进行分裂,得到分裂后的离散化数据,并将分裂点作为下一次拟分裂的预设拟分裂点进行分裂;
分裂预判子模块9046,用于若分裂的次数达到预设分裂次数,则停止分裂,并将最后一次分裂后得到的离散化数据作为离散变量;
降维处理子模块9048,用于通过数据分箱对离散变量进行降低维度处理,并根据连续变量数据的特征值对降低维度处理后得到的离散数据进行排序,得到标准名义变量。
进一步地,拟分裂子模块9042,包括:
熵计算单元9042a，用于根据预设拟分裂点对连续变量数据进行特征数据切割，并计算切割后得到的特征数据的拟分裂数据熵；
信息增益单元9042b,用于获取连续变量数据的连续数据熵,并计算连续数据熵与拟分裂数据熵的差值,作为信息增益。
进一步地,数据分类模块904,还包括:
特征分类子模块9050,用于将标准名义变量输入到至少2个预设机器学习模型中进行特征分类,得到每个预设机器学习模型对应的特征分类结果;
特征融合子模块9052,用于采用K折交叉验证方式对特征分类结果进行融合处理,得到综合分类结果。
进一步地,特征融合子模块9052,包括:
特征切割单元9052a,用于将特征分类结果分割为特征分类训练集以及特征分类测试集;
模型验证单元9052b,用于根据预设切割条件分割特征分类训练集,得到特征分类喂养集以及特征分类验证集,并根据特征分类验证集对通过特征分类喂养集训练得到特征分类模型进行验证,得到特征分类验证数据;
模型测试单元9052c,用于将特征分类测试集输入到特征分类模型中测试,得到特征分类测试数据;
重新切割单元9052d,用于根据预设切割条件重新分割特征分类训练集,得到特征分类喂养集以及特征分类验证集以进行下一次的训练和验证;
特征融合单元9052e,用于当重新分割的次数达到预设分割次数,停止分割,并根据预设融合条件对得到的所有特征分类验证数据以及得到的所有特征分类测试数据进行处理,得到特征分类预测数据,并将特征分类预测数据作为综合分类结果。
进一步地,数据评估模块906,包括:
变量筛选子模块9062,用于计算关联名义变量的信息值,并根据信息值对关联名义变量 进行筛选,得到预设数量的关联名义变量,作为数据评估变量;
分值评估子模块9064,用于将数据评估变量输入到预设逻辑回归模型中进行数据评估,得到关联名义变量的评估分值。
上述数据对象分类装置,通过对获取到的数据对象的基础数据进行分类后分别输入到预设的分类器中进行处理得到分类结果、评估结果,然后汇总得到的分类结果,确定带筛选数据对象的目标分类。使得最终确定目标分类中包含的待筛选数据对象更具有针对性,更符合筛选需求,提高了数据对象分类的精准度,解决了现有技术中对数据对象分类不准确的技术问题。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图10所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储用户订单数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种数据对象分类方法。
其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图10所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种数据对象分类方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图10中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行计算机程序时实现上述实施例中数据对象分类方法的步骤,例如图2所示的步骤202至步骤210,或者,处理器执行计算机程序时实现上述实施例中数据对象分类装置的各模块/单元的功能,例如图9所示模块902至模块910的功能。为避免重复,此处不再赘述。
在一个实施例中,提供了一种计算机可读存储介质,所述存储介质为易失性存储介质或非易失性存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现上述实施例中数据对象分类方法的步骤,例如图2所示的步骤202至步骤208,或者,处理器执行计算机程序时实现上述实施例中数据对象分类装置的各模块/单元的功能,例如图9所示模块902至模块910的功能。为避免重复,此处不再赘述。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易 失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对发明专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形、改进或者对部分技术特征进行等同替换，而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范畴，均属于本申请的保护范围。因此，本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. A data object classification method, the method comprising:
    acquiring basic data of each object to be classified as data to be processed, and dividing the data to be processed into standard data and associated data according to preset screening requirement data;
    obtaining standard nominal variables from the standard data, performing feature classification on the standard nominal variables, and fusing the classification results to obtain a comprehensive classification result;
    obtaining associated nominal variables from the associated data, and performing data evaluation on the associated nominal variables to obtain evaluation scores of the associated nominal variables;
    determining, according to the basic data of each object to be classified, a target class of the object to be classified in the comprehensive classification result, determining, according to the associated data, a target score of the object to be classified relative to the evaluation scores, and weighting the target score to obtain an estimated probability that the object to be classified belongs to the target class;
    if the estimated probability is greater than a preset threshold, classifying the object to be classified into the target class as a target object.
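The final step of the method above (weighting a target score into an estimated probability and comparing it with a preset threshold) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the 0 to 100 score scale, the weight values, and the function names are all assumptions introduced here.

```python
def weighted_probability(target_scores, weights):
    """Weight each target score and normalize the sum into a probability.

    Scores are assumed to lie on a 0 to 100 scale, so the weighted sum is
    divided by the maximum attainable weighted sum.
    """
    total = sum(s * w for s, w in zip(target_scores, weights))
    max_total = sum(100 * w for w in weights)
    return total / max_total


def classify(target_scores, weights, threshold=0.5):
    """Return True when the estimated probability exceeds the preset threshold."""
    return weighted_probability(target_scores, weights) > threshold
```

For example, with scores [90, 80] and weights [0.6, 0.4], the estimated probability is 0.86, so the object clears a 0.5 threshold and would be classified into the target class.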
  2. The method according to claim 1, wherein dividing the data to be processed into standard data and associated data according to the preset screening requirement data comprises:
    classifying the data to be processed according to object attributes to obtain object attribute data;
    calculating a correlation coefficient between the object attribute data and the preset screening requirement data by means of the Spearman rank correlation coefficient, as a data correlation level;
    if the data correlation level meets a preset correlation level, taking the object attribute data as the standard data;
    if the data correlation level does not meet the preset correlation level, taking the object attribute data as the associated data.
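The Spearman-based split in the claim above can be sketched in pure Python. The rank correlation below follows the standard definition (Pearson correlation of average ranks, with ties sharing their average rank); the 0.8 cutoff standing in for the preset correlation level and the function names are assumptions for illustration.

```python
def average_ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def assign_bucket(attribute_values, requirement_values, preset_level=0.8):
    """Standard data if the correlation level meets the preset level,
    associated data otherwise."""
    level = abs(spearman(attribute_values, requirement_values))
    return "standard" if level >= preset_level else "associated"
```

A perfectly monotonic attribute would be routed to the standard-data branch, while a weakly correlated one falls through to the associated-data branch.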
  3. The method according to claim 1, wherein the standard data comprises continuous variable data, and obtaining standard nominal variables from the standard data comprises:
    extracting the continuous variable data from the standard data, performing a trial split on the continuous variable data at a preset trial split point, and obtaining the information gain of the continuous variable data before and after the trial split;
    if the information gain is greater than a preset gain difference, splitting the continuous variable data with the trial split point as a split point to obtain discretized data, and using the split point as the preset trial split point for the next trial split to continue the splitting;
    if the number of splits reaches a preset number of splits, stopping the splitting, and taking the discretized data obtained from the last split as discrete variables;
    performing dimension reduction on the discrete variables through data binning, and sorting the dimension-reduced discrete data according to the feature values of the continuous variable data, to obtain the standard nominal variables.
  4. The method according to claim 3, wherein performing a trial split on the continuous variable data at the preset trial split point and obtaining the information gain of the continuous variable data before and after the trial split comprises:
    cutting feature data from the continuous variable data according to the preset trial split point, and calculating the trial-split data entropy of the feature data obtained after the cutting;
    acquiring the continuous data entropy of the continuous variable data, and calculating the difference between the continuous data entropy and the trial-split data entropy as the information gain.
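The information gain of the claim above is the label entropy of the continuous variable before the trial split minus the weighted label entropy after it. A minimal sketch, assuming class labels accompany the continuous values; the names and the binary-split form are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, split_point):
    """Continuous data entropy minus the trial-split data entropy."""
    left = [l for v, l in zip(values, labels) if v <= split_point]
    right = [l for v, l in zip(values, labels) if v > split_point]
    n = len(labels)
    trial_split_entropy = (len(left) / n) * entropy(left) \
                        + (len(right) / n) * entropy(right)
    return entropy(labels) - trial_split_entropy
```

If the gain at a trial split point exceeds the preset gain difference, the point is accepted as a split point and the procedure recurses, as in claim 3.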
  5. The method according to claim 1, wherein performing feature classification on the standard nominal variables and fusing the classification results to obtain a comprehensive classification result comprises:
    inputting the standard nominal variables into at least two preset machine learning models for feature classification, to obtain a feature classification result corresponding to each preset machine learning model;
    fusing the feature classification results by means of K-fold cross-validation, to obtain the comprehensive classification result.
  6. The method according to claim 5, wherein fusing the feature classification results by means of K-fold cross-validation to obtain the comprehensive classification result comprises:
    splitting the feature classification results into a feature classification training set and a feature classification test set;
    splitting the feature classification training set according to a preset cutting condition to obtain a feature classification feed set and a feature classification validation set, and validating, with the feature classification validation set, the feature classification model trained on the feature classification feed set, to obtain feature classification validation data;
    inputting the feature classification test set into the feature classification model for testing, to obtain feature classification test data;
    re-splitting the feature classification training set according to the preset cutting condition to obtain the feature classification feed set and the feature classification validation set for the next round of training and validation;
    when the number of re-splits reaches a preset number of splits, stopping the splitting, processing all the obtained feature classification validation data and all the obtained feature classification test data according to a preset fusion condition to obtain feature classification prediction data, and taking the feature classification prediction data as the comprehensive classification result.
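The K-fold fusion described above follows the usual stacking pattern: repeatedly re-split the training set into a feed set and a held-out validation fold, collect the out-of-fold validation predictions, and fuse the per-fold test predictions. The sketch below averages the test predictions as one plausible preset fusion condition; the `fit`/`predict` callables and the contiguous fold layout are assumptions made for illustration.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds (a simple preset cutting condition)."""
    size = n // k
    return [list(range(i * size, n if i == k - 1 else (i + 1) * size))
            for i in range(k)]

def kfold_fuse(train_x, train_y, test_x, fit, predict, k=5):
    """Per fold: train on the feed set, predict the held-out validation fold
    (validation data) and the test set (test data). Fuse by keeping the
    out-of-fold validation predictions and averaging the k test predictions."""
    oof = [None] * len(train_x)   # feature classification validation data
    test_preds = []
    for fold in kfold_indices(len(train_x), k):
        held = set(fold)
        feed_x = [x for i, x in enumerate(train_x) if i not in held]
        feed_y = [y for i, y in enumerate(train_y) if i not in held]
        model = fit(feed_x, feed_y)
        for i in fold:
            oof[i] = predict(model, train_x[i])
        test_preds.append([predict(model, x) for x in test_x])
    # feature classification prediction data (the comprehensive result)
    fused_test = [sum(col) / k for col in zip(*test_preds)]
    return oof, fused_test
```

Any pair of `fit`/`predict` callables will do, so the same loop can fuse the outputs of each of the preset machine learning models of claim 5.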
  7. The method according to claim 1, wherein performing data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables comprises:
    calculating information values of the associated nominal variables, and screening the associated nominal variables according to the information values to obtain a preset number of associated nominal variables as data evaluation variables;
    inputting the data evaluation variables into a preset logistic regression model for data evaluation, to obtain the evaluation scores of the associated nominal variables.
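The screening step above uses the information value (IV) common in scorecard modeling. A sketch of the IV computation and the top-N selection; the good/bad bin counts and the selection rule are illustrative assumptions, and the subsequent scoring is left to a preset logistic regression model, as in the claim.

```python
import math

def information_value(bins):
    """IV of one nominal variable.

    bins: list of (good_count, bad_count) pairs, one per category.
    IV = sum over categories of (pct_good - pct_bad) * WOE,
    where WOE = ln(pct_good / pct_bad).
    """
    total_good = sum(g for g, b in bins)
    total_bad = sum(b for g, b in bins)
    iv = 0.0
    for g, b in bins:
        pg, pb = g / total_good, b / total_bad
        if pg > 0 and pb > 0:  # skip degenerate categories
            iv += (pg - pb) * math.log(pg / pb)
    return iv

def screen_variables(iv_by_var, preset_count):
    """Keep the preset number of variables with the highest IV."""
    ranked = sorted(iv_by_var.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:preset_count]]
```

A variable whose categories separate good and bad cases strongly gets a high IV and survives the screening; a variable with identical good/bad distributions gets an IV of zero.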
  8. A data object classification apparatus, comprising:
    a data division module, configured to acquire basic data of each object to be classified as data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirements;
    a data classification module, configured to obtain standard nominal variables from the standard data, perform feature classification on the standard nominal variables, and fuse the classification results to obtain a comprehensive classification result;
    a data evaluation module, configured to obtain associated nominal variables from the associated data, and perform data evaluation on the associated nominal variables to obtain evaluation scores of the associated nominal variables;
    an object screening module, configured to determine, according to the basic data of each object to be classified, a target class of the object to be classified in the comprehensive classification result, determine, according to the associated data, a target score of the object to be classified relative to the evaluation scores, and weight the target score to obtain an estimated probability that the object to be classified belongs to the target class;
    an object classification module, configured to classify the object to be classified into the target class as a target object if the estimated probability is greater than a preset threshold.
  9. A computer device, comprising:
    one or more processors;
    a memory; and
    one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to perform a data object classification method, wherein the data object classification method comprises:
    acquiring basic data of each object to be classified as data to be processed, and dividing the data to be processed into standard data and associated data according to preset screening requirement data;
    obtaining standard nominal variables from the standard data, performing feature classification on the standard nominal variables, and fusing the classification results to obtain a comprehensive classification result;
    obtaining associated nominal variables from the associated data, and performing data evaluation on the associated nominal variables to obtain evaluation scores of the associated nominal variables;
    determining, according to the basic data of each object to be classified, a target class of the object to be classified in the comprehensive classification result, determining, according to the associated data, a target score of the object to be classified relative to the evaluation scores, and weighting the target score to obtain an estimated probability that the object to be classified belongs to the target class;
    if the estimated probability is greater than a preset threshold, classifying the object to be classified into the target class as a target object.
  10. The computer device according to claim 9, wherein dividing the data to be processed into standard data and associated data according to the preset screening requirement data comprises:
    classifying the data to be processed according to object attributes to obtain object attribute data;
    calculating a correlation coefficient between the object attribute data and the preset screening requirement data by means of the Spearman rank correlation coefficient, as a data correlation level;
    if the data correlation level meets a preset correlation level, taking the object attribute data as the standard data;
    if the data correlation level does not meet the preset correlation level, taking the object attribute data as the associated data.
  11. The computer device according to claim 9, wherein the standard data comprises continuous variable data, and obtaining standard nominal variables from the standard data comprises:
    extracting the continuous variable data from the standard data, performing a trial split on the continuous variable data at a preset trial split point, and obtaining the information gain of the continuous variable data before and after the trial split;
    if the information gain is greater than a preset gain difference, splitting the continuous variable data with the trial split point as a split point to obtain discretized data, and using the split point as the preset trial split point for the next trial split to continue the splitting;
    if the number of splits reaches a preset number of splits, stopping the splitting, and taking the discretized data obtained from the last split as discrete variables;
    performing dimension reduction on the discrete variables through data binning, and sorting the dimension-reduced discrete data according to the feature values of the continuous variable data, to obtain the standard nominal variables.
  12. The computer device according to claim 11, wherein performing a trial split on the continuous variable data at the preset trial split point and obtaining the information gain of the continuous variable data before and after the trial split comprises:
    cutting feature data from the continuous variable data according to the preset trial split point, and calculating the trial-split data entropy of the feature data obtained after the cutting;
    acquiring the continuous data entropy of the continuous variable data, and calculating the difference between the continuous data entropy and the trial-split data entropy as the information gain.
  13. The computer device according to claim 9, wherein performing feature classification on the standard nominal variables and fusing the classification results to obtain a comprehensive classification result comprises:
    inputting the standard nominal variables into at least two preset machine learning models for feature classification, to obtain a feature classification result corresponding to each preset machine learning model;
    fusing the feature classification results by means of K-fold cross-validation, to obtain the comprehensive classification result.
  14. The computer device according to claim 13, wherein fusing the feature classification results by means of K-fold cross-validation to obtain the comprehensive classification result comprises:
    splitting the feature classification results into a feature classification training set and a feature classification test set;
    splitting the feature classification training set according to a preset cutting condition to obtain a feature classification feed set and a feature classification validation set, and validating, with the feature classification validation set, the feature classification model trained on the feature classification feed set, to obtain feature classification validation data;
    inputting the feature classification test set into the feature classification model for testing, to obtain feature classification test data;
    re-splitting the feature classification training set according to the preset cutting condition to obtain the feature classification feed set and the feature classification validation set for the next round of training and validation;
    when the number of re-splits reaches a preset number of splits, stopping the splitting, processing all the obtained feature classification validation data and all the obtained feature classification test data according to a preset fusion condition to obtain feature classification prediction data, and taking the feature classification prediction data as the comprehensive classification result.
  15. The computer device according to claim 9, wherein performing data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables comprises:
    calculating information values of the associated nominal variables, and screening the associated nominal variables according to the information values to obtain a preset number of associated nominal variables as data evaluation variables;
    inputting the data evaluation variables into a preset logistic regression model for data evaluation, to obtain the evaluation scores of the associated nominal variables.
  16. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, a data object classification method is implemented, wherein the data object classification method comprises the following steps:
    acquiring basic data of each object to be classified as data to be processed, and dividing the data to be processed into standard data and associated data according to preset screening requirement data;
    obtaining standard nominal variables from the standard data, performing feature classification on the standard nominal variables, and fusing the classification results to obtain a comprehensive classification result;
    obtaining associated nominal variables from the associated data, and performing data evaluation on the associated nominal variables to obtain evaluation scores of the associated nominal variables;
    determining, according to the basic data of each object to be classified, a target class of the object to be classified in the comprehensive classification result, determining, according to the associated data, a target score of the object to be classified relative to the evaluation scores, and weighting the target score to obtain an estimated probability that the object to be classified belongs to the target class;
    if the estimated probability is greater than a preset threshold, classifying the object to be classified into the target class as a target object.
  17. The computer-readable storage medium according to claim 16, wherein dividing the data to be processed into standard data and associated data according to the preset screening requirement data comprises:
    classifying the data to be processed according to object attributes to obtain object attribute data;
    calculating a correlation coefficient between the object attribute data and the preset screening requirement data by means of the Spearman rank correlation coefficient, as a data correlation level;
    if the data correlation level meets a preset correlation level, taking the object attribute data as the standard data;
    if the data correlation level does not meet the preset correlation level, taking the object attribute data as the associated data.
  18. The computer-readable storage medium according to claim 16, wherein the standard data comprises continuous variable data, and obtaining standard nominal variables from the standard data comprises:
    extracting the continuous variable data from the standard data, performing a trial split on the continuous variable data at a preset trial split point, and obtaining the information gain of the continuous variable data before and after the trial split;
    if the information gain is greater than a preset gain difference, splitting the continuous variable data with the trial split point as a split point to obtain discretized data, and using the split point as the preset trial split point for the next trial split to continue the splitting;
    if the number of splits reaches a preset number of splits, stopping the splitting, and taking the discretized data obtained from the last split as discrete variables;
    performing dimension reduction on the discrete variables through data binning, and sorting the dimension-reduced discrete data according to the feature values of the continuous variable data, to obtain the standard nominal variables.
  19. The computer-readable storage medium according to claim 18, wherein performing a trial split on the continuous variable data at the preset trial split point and obtaining the information gain of the continuous variable data before and after the trial split comprises:
    cutting feature data from the continuous variable data according to the preset trial split point, and calculating the trial-split data entropy of the feature data obtained after the cutting;
    acquiring the continuous data entropy of the continuous variable data, and calculating the difference between the continuous data entropy and the trial-split data entropy as the information gain.
  20. The computer-readable storage medium according to claim 16, wherein performing feature classification on the standard nominal variables and fusing the classification results to obtain a comprehensive classification result comprises:
    inputting the standard nominal variables into at least two preset machine learning models for feature classification, to obtain a feature classification result corresponding to each preset machine learning model;
    fusing the feature classification results by means of K-fold cross-validation, to obtain the comprehensive classification result.
PCT/CN2020/085805 2019-11-25 2020-04-21 Data object classification method and apparatus, computer device, and storage medium WO2021103401A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911165269.4A CN111177500A (zh) 2019-11-25 2019-11-25 Data object classification method and apparatus, computer device, and storage medium
CN201911165269.4 2019-11-25

Publications (1)

Publication Number Publication Date
WO2021103401A1 true WO2021103401A1 (zh) 2021-06-03

Family

ID=70655374

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085805 WO2021103401A1 (zh) 2019-11-25 2020-04-21 Data object classification method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111177500A (zh)
WO (1) WO2021103401A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732536B * 2020-12-30 2023-01-13 平安科技(深圳)有限公司 Data monitoring and alarm method and apparatus, computer device, and storage medium
CN115293282B * 2022-08-18 2023-08-29 昆山润石智能科技有限公司 Process problem analysis method, device, and storage medium
CN115828147B * 2023-02-15 2023-04-18 博纯材料股份有限公司 Xenon production control method and system based on data processing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197282A * 2018-01-10 2018-06-22 腾讯科技(深圳)有限公司 File data classification method and apparatus, terminal, server, and storage medium
CN109544150A * 2018-10-09 2019-03-29 阿里巴巴集团控股有限公司 Classification model generation method and apparatus, computing device, and storage medium
CN110019488A * 2018-09-12 2019-07-16 国网浙江省电力有限公司嘉兴供电公司 Multi-kernel classification method with multi-source heterogeneous data fusion
WO2019169700A1 * 2018-03-08 2019-09-12 平安科技(深圳)有限公司 Data classification method, apparatus, and device, and computer-readable storage medium
CN110413775A * 2019-06-25 2019-11-05 北京清博大数据科技有限公司 Data labeling and classification method, apparatus, terminal, and storage medium


Also Published As

Publication number Publication date
CN111177500A (zh) 2020-05-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20893572

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20893572

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 230922)
