WO2021103401A1 - Data object classification method, device, computer equipment and storage medium - Google Patents
Data object classification method, device, computer equipment and storage medium
- Publication number
- WO2021103401A1 (PCT/CN2020/085805)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- classification
- preset
- variable
- standard
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
Description
- This application relates to the field of big data technology, in particular to a data object classification method, device, computer equipment and storage medium.
- In the existing solution, the features of the data object are extracted and classified by a machine learning model (for example, a gradient boosting decision tree (GBDT) model), and the classified data is compared with the screening requirements to obtain the target data that meets those requirements.
- The inventor realizes that these methods have high classification accuracy for data with a relatively simple composition structure. However, for data with a relatively complex composition structure (for example, the same object has two or more data sources, the dimensions of data from different sources are not necessarily the same, and there are partial associations between them), the existing machine learning models cannot effectively synthesize the features of the associated dimensions, so the extracted features are not accurate enough, and the screening and classification of the object data is not accurate enough.
- This application provides a data object classification method, device, computer equipment and storage medium to solve the technical problem in the prior art that data object classification is inaccurate because features cannot be accurately extracted.
- a data object classification method includes:
- the target classification of the object to be classified in the comprehensive classification result is determined according to the basic data of each object to be classified, the target score of the object to be classified relative to the evaluation score is determined according to the associated data, and the target score is weighted to obtain the evaluation probability that the object to be classified belongs to the target classification;
- if the evaluation probability is greater than a preset threshold, the object to be classified is classified into the target classification as the target object.
- a data object classification device includes:
- the data division module is used to obtain the basic data of each object to be classified as the data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirements;
- the data classification module is used to obtain standard nominal variables according to the standard data, perform feature classification on the standard nominal variables, and perform fusion processing on the classified results to obtain a comprehensive classification result;
- a data evaluation module configured to obtain an associated nominal variable according to the associated data, and perform data evaluation on the associated nominal variable to obtain an evaluation score of the associated nominal variable;
- the object screening module is configured to determine the target classification of the object to be classified in the comprehensive classification result according to the basic data of each object to be classified, determine the target score of the object to be classified relative to the evaluation score according to the associated data, and weight the target score to obtain the evaluation probability that the object to be classified belongs to the target classification;
- the object classification module is configured to classify the object to be classified into the target classification as the target object if the evaluation probability is greater than a preset threshold.
- a computer device includes: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to execute a data object classification method, wherein the data object classification method includes:
- the target classification of the object to be classified in the comprehensive classification result is determined according to the basic data of each object to be classified, the target score of the object to be classified relative to the evaluation score is determined according to the associated data, and the target score is weighted to obtain the evaluation probability that the object to be classified belongs to the target classification;
- if the evaluation probability is greater than a preset threshold, the object to be classified is classified into the target classification as the target object.
- a computer-readable storage medium stores one or more computer programs which, when executed by one or more processors, cause the one or more processors to execute a data object classification method, wherein the data object classification method includes:
- the target classification of the object to be classified in the comprehensive classification result is determined according to the basic data of each object to be classified, the target score of the object to be classified relative to the evaluation score is determined according to the associated data, and the target score is weighted to obtain the evaluation probability that the object to be classified belongs to the target classification;
- if the evaluation probability is greater than a preset threshold, the object to be classified is classified into the target classification as the target object.
- In the above data object classification method, device, computer equipment and storage medium, the acquired basic data of the data objects is divided and then input into preset classifiers for processing to obtain the classification results and the evaluation results, and the obtained classification results are then summarized to determine the target classification of the screened data objects. This makes the final determination of the target classification of the data objects to be screened more targeted and more in line with the screening requirements, improves the accuracy of data object classification, and solves the technical problem of inaccurate data object classification in the prior art.
- Figure 1 is a schematic diagram of the application environment of the data object classification method
- Figure 2 is a schematic flow diagram of a data object classification method
- FIG. 3 is a schematic flowchart of step 202 in FIG. 2;
- FIG. 4 is a schematic diagram of the flow of step 204 in FIG. 2;
- FIG. 5 is a schematic flowchart of step 402 in FIG. 4;
- FIG. 6 is another schematic diagram of the process of step 204 in FIG. 2;
- FIG. 7 is a schematic flowchart of step 604 in FIG. 6;
- FIG. 8 is a schematic flowchart of step 206 in FIG. 2;
- Figure 9 is a schematic diagram of a data object classification device
- Figure 10 is a schematic diagram of a computer device in an embodiment.
- the data object classification method provided by the embodiment of the present application can be applied to the application environment as shown in FIG. 1.
- the application environment may include a terminal 102, a network 106, and a server 104.
- the network 106 is used to provide a communication link medium between the terminal 102 and the server 104.
- the network 106 may include various connection types, such as wired or wireless communication links, fiber optic cables, and so on.
- the user can use the terminal 102 to interact with the server 104 through the network 106 to receive or send messages and so on.
- Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc., may be installed on the terminal 102.
- the terminal 102 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
- the server 104 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal 102.
- the data object classification method provided in the embodiments of the present application is generally executed by the server/terminal, and correspondingly, the data object classification device is generally set in the server/terminal device.
- terminals, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
- the terminal 102 communicates with the server 104 through the network.
- the server 104 uses the terminal 102 as a data object and pulls basic data from it, and divides the pulled basic data according to the preset screening requirements. After the divided data is processed by different processing methods, the comprehensive classification result and the evaluation scores are obtained. Finally, the data object is classified and evaluated according to the divided data, and the evaluation probability that the data object belongs to the obtained classification category is determined. If the evaluation probability is greater than the preset threshold, the classification is considered correct, and the data object is regarded as the target object.
- the terminal 102 and the server 104 are connected through a network.
- the network can be a wired network or a wireless network.
- the terminal 102 can be, but is not limited to, various personal computers, laptops, smart phones, tablets, and portable wearable devices.
- the server 104 can be implemented by an independent server or a cluster of multiple servers.
- a data object classification method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
- Step 202 Obtain the basic data of each object to be classified as the data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirement data.
- the data to be processed comes from multiple data sources.
- Specifically, the server collects the required data from each data source as the basic data, then uses all the basic data obtained as the data to be processed, and divides the data to be processed into standard data and associated data according to the preset screening demand data.
- the data source can be each server in a server cluster, and the basic data corresponding to the data object can include the running time of the server, hardware parameters, log files, historical maintenance records, and the number of solutions completed by the server.
- a server is regarded as a data object, as the object to be classified.
- each data object corresponds to a wide range of basic data dimensions.
- For massive data, a distributed storage method is generally used, but general distributed storage suffers from a bandwidth transmission rate that is too slow, which makes data collection inefficient, or, when the data volume is too large, congests or even paralyzes the data collection channel. Therefore, in this embodiment, a big data platform is used: from the distributed storage system, according to the identification of the data object, the data containing the identification of the data object is obtained from each data source as the basic data of the data object.
- the standard data is data that has a strong correlation with the preset screening demand data
- the associated data is data that has a weaker correlation with the preset screening demand data than the standard data or has no correlation with the preset screening demand data.
- Step 204 Obtain standard nominal variables according to the standard data, perform feature classification on the standard nominal variables, and perform fusion processing on the classified results to obtain a comprehensive classification result.
- Discrete variables refer to variables whose values can be listed one by one in a certain order, usually with integer values. For example, the number of times the server solves the problem per unit time, the number of repairs per month in a year, etc., the value of discrete variables can be obtained by counting.
- General data preprocessing methods include: continuous variable discretization, data binning, one-hot encoding, and so on.
- the required data can be obtained through continuous variable discretization as a standard nominal variable.
- the standard nominal variables in this embodiment are discrete variables.
- Feature classification processing is performed on the standard nominal variables to obtain the distinguishing features and feature combinations in the standard nominal variables, and a feature classification result is obtained.
- classification results are fused to obtain the prediction result as a comprehensive classification result.
- Step 206 Obtain the associated nominal variable according to the associated data, and perform data evaluation on the associated nominal variable to obtain the evaluation score of the associated nominal variable.
- The data of each variable or parameter is selected from the associated data, the information value (IV, Information Value) of the variable or parameter is calculated for the screening operation, and a certain number of variables or parameters are then selected based on the calculated IV values.
- the logistic regression model selects 8-14 variables or parameters whose IV values meet the requirements
- the boost tree model selects 20-30 variables or parameters.
- For each variable, i groups of data are randomly selected from the variable data and input into the logistic regression model for evaluation. Because one variable can correspond to multiple data objects (generally multiple), the data objects selected according to each variable are different. Each data object carries a user tag, such as potential customer, hard-core customer, or unintentional customer. Specifically, the IV value is calculated based on the weight of evidence (WOE, Weight of Evidence) of the data.
- Step 208: Determine the target classification of the object to be classified in the comprehensive classification result according to the basic data of each object to be classified, determine the target score of the object to be classified relative to the evaluation score according to the associated data, and weight the target score to obtain the evaluation probability that the object to be classified belongs to the target classification.
- Step 210: If the evaluation probability is greater than the preset threshold, classify the object to be classified into the target classification as the target object.
- If the evaluation probability is greater than the preset threshold, it is considered that the object to be classified belongs to the target classification.
- The target classification of each object to be classified is obtained, and all objects to be classified are processed in this way, so that the data objects corresponding to each category in the comprehensive classification result can be obtained, and the screening and classification of the objects to be classified is realized.
- the comprehensive classification result can contain multiple categories. In the final classification result, there may not be data objects that meet the requirements in each category, and the specific results after data processing shall prevail.
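The weighting and thresholding of steps 208-210 can be sketched as follows. This is a minimal illustration; the function names, weights, scores, and the 0.5 threshold are assumptions, since the patent does not fix concrete values:

```python
# Hypothetical sketch of steps 208-210: combine per-variable target scores
# with assumed weights into an evaluation probability, then threshold it.

def evaluation_probability(target_scores, weights):
    """Weighted average of target scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(target_scores, weights)) / total_weight

def classify(target_scores, weights, threshold=0.5):
    """Return True if the object is kept in its target classification."""
    return evaluation_probability(target_scores, weights) > threshold

prob = evaluation_probability([0.9, 0.6, 0.8], [0.5, 0.2, 0.3])
# prob = (0.45 + 0.12 + 0.24) / 1.0 = 0.81, above the assumed threshold
print(prob)
```

Only objects whose evaluation probability clears the preset threshold are retained as target objects; the rest stay unclassified.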
- the basic data of the acquired data object is classified and then input into the preset classifier for processing to obtain the classification result and the evaluation result, and then summarize the obtained classification results to determine the data object with screening Target classification.
- step 202 includes:
- Step 302 Classify the to-be-processed data according to the object attribute to obtain the object attribute data.
- Object attributes are various attributes of the object to be classified, such as the location data of the user stored on the server, the number of participation in a question under a certain topic, the number of purchases of a certain type of goods, the frequency of searching for a place name, and so on. Then the location that users frequently go to is one of the object attributes of the data to be processed.
- Step 304 Calculate the correlation coefficient between the object attribute data and the preset screening requirement data by Spearman's rank correlation coefficient method, as the data correlation level.
- Spearman's rank correlation coefficient is a measure that evaluates how well the relationship between two variables can be described by a monotonic function.
- The preset screening requirements can be set according to the application scenario. For example, to push an invitation to a certain technology seminar to users, the target group needs to be obtained. Specifically, the Spearman rank correlation coefficient is used to calculate the correlation between the number of times someone purchases a certain type of item and the probability that this person participates in the technology seminar, or between the number of times someone participates in a certain topic and the probability that this person participates in the technology seminar.
- For example, the object attribute data is the correlation between "the number of times a user participates in questions under electrical engineering topics" and "someone participates in an intellectual property and enterprise R&D conference initiated by a certain technology company", or the correlation between "the number of times someone goes to a certain place" and "someone participates in the intellectual property and enterprise R&D conference initiated by a technology company".
- the correlation between them and the preset screening requirement data is calculated through the Spearman rank correlation coefficient, as the data correlation level.
- the result obtained can be monotonously correlated or uncorrelated, and these can be obtained intuitively from the resulting data chart.
- Step 306 If the data correlation level meets the preset correlation level, use the object attribute data as standard data.
- the preset correlation level may be that the object attribute data and the preset filtering requirement data are positively correlated. If the data correlation level is also that the object attribute data is positively correlated with the preset filtering requirement data, then the object attribute data is used as the standard data.
- Step 308 If the data correlation level does not meet the preset correlation level, the object attribute data is used as the associated data.
- the object attribute data that does not meet the preset relevant level is regarded as the associated data.
- The preset correlation level is not limited to a positive correlation between the object attribute data and the preset screening requirement data; it can also be a negative correlation between them. This needs to be determined according to needs, and there is no limitation here.
- The data to be processed is classified and post-processed by calculating the degree of correlation between the object attribute data and the preset screening requirement data, and the data sets that have a strong correlation with the preset screening requirement data are processed to obtain the processing result, which improves the accuracy of the classification of the objects to be classified.
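The correlation-based division of steps 302-308 can be sketched as follows. This is a minimal illustration assuming untied ranks and a simple positive-correlation rule; the function names and threshold are illustrative, not from the patent:

```python
# Illustrative sketch: Spearman's rank correlation between an object-attribute
# series and the screening-requirement series, then routing the attribute into
# standard data or associated data. Ties are not handled, for brevity.

def spearman(x, y):
    """Spearman rho via the classic formula 1 - 6*sum(d^2)/(n*(n^2-1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def route_attribute(attr, requirement, positive_threshold=0.0):
    """Positively correlated attributes become standard data, the rest associated."""
    rho = spearman(attr, requirement)
    return "standard" if rho > positive_threshold else "associated"

# A monotonically increasing pair has rho = 1.
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

A negatively correlated attribute would be routed to the associated data here; as the text notes, a deployment could equally accept negative correlation as the preset level.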
- step 204 includes:
- Step 402 Extract the continuous variable data in the standard data, and perform pseudo-splitting of the continuous variable data according to the preset pseudo-splitting point to obtain the information gain of the continuous variable data before and after the pseudo-splitting.
- The pseudo-splitting points can be a number of points marked on the continuous variable data at equal intervals. These points divide the continuous variable data into several sub-data sets, and the entropy of each sub-data set is calculated to obtain a pseudo-splitting entropy. This is compared with the entropy of the continuous variable data, and the difference between the entropy of each sub-data set after the pseudo-splitting and the entropy of the continuous variable data before the pseudo-splitting is obtained as the information gain.
- Step 404 If the information gain is greater than the preset gain difference, use the pseudo-split point as the split point to split the continuous variable data to obtain the discretized data after splitting, and use the split point as the preset pseudo-split for the next pseudo-split Point to split.
- the preset gain difference can be adjusted according to business needs.
- Step 406 If the number of splits reaches the preset number of splits, the splitting is stopped, and the discretized data obtained after the last splitting is used as a discrete variable.
- the number of splits reaches the preset number of splits, it means that the discretized data obtained after the last split can meet the needs, the splitting is stopped, and the discretized data obtained from the last split is used as a discrete variable.
- It should be noted that the discretized data obtained before the preset number of splits is reached is not yet truly discrete data such as 12/3/5/13/34; rather, at each split whose information gain meets the preset gain difference, the data on both sides of the split point can still be data within a certain period of time.
- Step 408 Perform dimensionality reduction processing on the discrete variables through data binning, and sort the discrete data obtained after the dimensionality reduction processing according to the characteristic values of the continuous variable data to obtain standard nominal variables.
- the methods of data binning include but are not limited to: equal frequency binning and equal width binning.
- Each box of discrete variables after binning is regarded as a nominal variable.
- the eigenvalues of the nominal variables are sorted from small to large.
- the standard nominal variable belongs to the categorical variable, and its variable value is qualitative, that is, the value determined under the existing premises or conditions is manifested as mutually incompatible categories or attributes.
- the continuous variable data is discretized, which can increase the speed of data processing and facilitate storage and use.
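Both binning methods named above can be sketched briefly; the bin count and sample data are illustrative assumptions:

```python
# Minimal sketch of step 408's two binning methods: equal-width and
# equal-frequency. Each returned list gives the bin index per input value.

def equal_width_bins(data, n_bins):
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    # Assign each value to a bin index; the maximum falls in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in data]

def equal_frequency_bins(data, n_bins):
    order = sorted(range(len(data)), key=lambda i: data[i])
    bins = [0] * len(data)
    per_bin = len(data) / n_bins
    for pos, i in enumerate(order):
        bins[i] = min(int(pos / per_bin), n_bins - 1)
    return bins

data = [1, 2, 3, 4, 100]
print(equal_width_bins(data, 2))      # [0, 0, 0, 0, 1] - outlier dominates widths
print(equal_frequency_bins(data, 2))  # [0, 0, 0, 1, 1] - bins hold ~equal counts
```

The contrast on the outlier illustrates why the choice of binning method matters: equal width is sensitive to extreme values, while equal frequency keeps the bin populations balanced.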
- step 402 further includes:
- Step 502 Perform feature data cutting from the continuous variable data according to the preset pseudo-splitting point, and calculate the pseudo-splitting data entropy of the feature data obtained after the cutting.
- the continuous variable data is cut into the characteristic data.
- For example, a monitoring terminal monitors the temperature of a server, performing a temperature measurement every minute during the hour from 8:00 to 9:00 in the morning, and obtains 60 measurement values: t_1, t_2, t_3, ..., t_59, t_60. It is easy to understand that these measurement values form a continuous variable.
- 59 pseudo-splitting points n_1, n_2, n_3, ..., n_58, n_59 are set, where n_1 is the pseudo-split point between t_1 and t_2, and n_2 is the pseudo-split point between t_2 and t_3. The data entropy between every two pseudo-splitting points is then calculated as the pseudo-splitting data entropy.
- Step 504 Obtain the continuous data entropy of the continuous variable data, and calculate the difference between the continuous data entropy and the pseudo-splitting data entropy as an information gain.
- Specifically, the entropy of the continuous variable data is calculated as the continuous data entropy, the difference between the entropy of each sub-data set to be split and the continuous data entropy is calculated, and the calculated difference is used as the information gain.
- the continuous variable data is discretized to extract representative values in the continuously changing data, simplify analysis, increase the speed of data processing, and facilitate storage and use.
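The entropy and information-gain computation of steps 502-504 can be sketched as follows. The label column against which segment purity is measured is an illustrative assumption (the patent does not specify the label source), and the gain here is the usual size-weighted form:

```python
# Hedged sketch of steps 502-504: Shannon entropy of a labeled series before
# and after a candidate (pseudo-)split point; the gain is the entropy drop.
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(labels, split_index):
    """Entropy of the whole series minus the size-weighted entropy of the halves."""
    left, right = labels[:split_index], labels[split_index:]
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after

labels = ["low", "low", "low", "high", "high", "high"]
print(information_gain(labels, 3))  # 1.0 - the split separates the classes exactly
```

A candidate point would be accepted as a real split only when this gain exceeds the preset gain difference, matching step 404 above.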
- step 204 includes:
- Step 602 Input standard nominal variables into at least two preset machine learning models for feature classification, and obtain feature classification results corresponding to each preset machine learning model.
- The preset machine learning models include but are not limited to: Gradient Boosting Decision Tree (GBDT), Boosting Tree, Random Forest, the ID3 algorithm model, and so on.
- feature classification includes feature extraction of standard nominal variables, and then classification processing according to the extracted features to obtain feature classification results.
- step 604 a K-fold cross-validation method is used to perform fusion processing on the feature classification results to obtain a comprehensive classification result.
- K-fold cross-validation first divides all the data, that is, the feature classification results obtained through each preset machine learning model, into K sub-samples; one sub-sample is selected as the test set without repetition, and the other K-1 sub-samples are used for training. This is repeated K times in total, the K results are averaged (or combined using other indicators), and a single estimate is finally obtained.
- K-fold cross-validation is used to ensure that each sub-sample participates in training, which reduces the generalization error.
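The fold scheme described above can be sketched as follows; the helper name and fold sizing are assumptions:

```python
# Minimal sketch of K-fold partitioning: every sample is used once for
# testing and K-1 times for training; no external library needed.

def k_fold_indices(n_samples, k):
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples
        test = list(range(start, end))
        train = [j for j in range(n_samples) if j < start or j >= end]
        folds.append((train, test))
    return folds

folds = k_fold_indices(10, 5)
# Every index 0..9 appears in exactly one test fold.
all_test = sorted(i for _, test in folds for i in test)
print(all_test)
```

Because every sub-sample participates in training across the K rounds, this is exactly the generalization-error argument the text makes.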
- step 604 includes:
- Step 702 Segment the feature classification result into a feature classification training set and a feature classification test set.
- the feature classification training set is used to train the model, and the feature classification test set is used to test the feature classification model trained by the feature classification training set.
- For example, the feature classification training set is 10,000 rows and the feature classification test set is 2,500 rows.
- Step 704 Segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set, and verify the feature classification model obtained through the feature classification feeding set training according to the feature classification verification set, to obtain the feature classification verification data .
- The preset cutting conditions are: each time, a certain amount of data is taken from the feature classification training set as the feature classification verification set for model verification, and the remaining data is used as the feature classification feeding set for model training. When the feature classification verification set is obtained, it is necessary to ensure that the data taken from the feature classification training set each time has not previously participated in model verification, so that each row of data in the feature classification verification set participates in model verification exactly once, and each feature classification feeding set used for model training contains new data compared with the previous round of training. In this way, the generalization error can be reduced.
- The model is used to verify the verification set; if each verification yields 2,000 items of verification data, the 10,000 rows of feature classification training set data can be verified in 5 passes, and 10,000 items of verification data can be obtained as the feature classification verification data.
- Step 706 Input the feature classification test set into the feature classification model for testing to obtain feature classification test data.
- Step 708 Re-segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set for the next training and verification.
- the feature classification training set can be re-segmented according to the preset cutting conditions, or the feature classification training set can be divided according to the preset cutting conditions in advance by the preset number of divisions, and then each model training Both use the new feature classification feeding set for training and so on.
- Step 710 When the number of re-segmentation reaches the preset number of divisions, the segmentation is stopped, and all the obtained feature classification verification data and all the obtained feature classification test data are processed according to the preset fusion conditions to obtain the feature classification prediction data, and The feature classification prediction data is used as a comprehensive classification result.
- the preset number of divisions in this embodiment may be 5 times.
- the preset fusion condition is a method of processing all the obtained feature classification test data and feature classification verification data.
- This embodiment is: model training, verification and testing of the feature classification results obtained by each preset machine learning model
- the obtained feature classification test data and feature classification verification data are integrated.
- For example, if only 3 preset machine learning models are used, 6 data matrices can be obtained; that is, after model training, verification and testing are performed on the feature classification results of each preset machine learning model, each set of feature classification verification data is used as one data matrix, and each set of feature classification test data is also used as one data matrix.
- The feature classification verification data corresponding to the three preset machine learning models are labeled A1, A2 and A3 respectively and arranged together into a matrix of 10,000 rows and 3 columns as the training data, and the resulting feature classification test data, labeled B1, B2 and B3, are merged into a matrix of 2,500 rows and 3 columns as the testing data, so that the lower-level learner is retrained on such data according to the preset fusion conditions.
- The preset fusion conditions regard the feature classification verification data and feature classification test data of each preset machine learning model as three features, where the feature classification verification data and feature classification test data corresponding to each preset machine learning model are used as a predicted classification result, and the lower-level learner learns and trains to assign a weight w to each predicted classification result to make the final classification most accurate.
- The lower-level learner can be a regression predictor.
- In this embodiment, a variety of preset machine learning models are used to perform feature classification, all the feature data obtained through K-fold cross-validation are used as the predicted classification results, and the predicted classification results are weighted according to the preset fusion method to obtain the comprehensive classification result, ensuring that the classification in the comprehensive classification result is accurate.
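The out-of-fold fusion of steps 702-710 (a stacking scheme) can be sketched as follows. The stand-in base model and the tiny data set are purely illustrative assumptions; only the column-building logic reflects the scheme above, where each base model contributes one column of the lower-level learner's training matrix:

```python
# Hypothetical sketch of steps 702-710: out-of-fold predictions of each base
# model become one column of the meta (lower-level) learner's training matrix.

def out_of_fold_column(model_fit_predict, X, y, folds):
    """One stacked column: each row is predicted by a model that never saw it."""
    column = [None] * len(X)
    for train_idx, valid_idx in folds:
        predict = model_fit_predict([X[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        for i in valid_idx:
            column[i] = predict(X[i])
    return column

# Stand-in "model": predicts the mean label of its training data.
def mean_model(X_train, y_train):
    mean = sum(y_train) / len(y_train)
    return lambda x: mean

X = list(range(6))
y = [0, 0, 0, 1, 1, 1]
folds = [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]
print(out_of_fold_column(mean_model, X, y, folds))
# [0.75, 0.75, 0.5, 0.5, 0.25, 0.25]
```

Repeating this per base model and stacking the columns side by side yields the 10,000 x 3 training matrix described in the text; the lower-level learner then fits its weights w on those columns.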
- step 206 includes:
- Step 802 Calculate the information value of the associated nominal variable, and filter the associated nominal variable according to the information value to obtain a preset number of associated nominal variables as data evaluation variables.
- this embodiment calculates the weight of evidence (WOE) and the information value (IV) of each variable, which can be obtained by substituting into formulas (1), (2), and (3) in sequence (the standard forms are):
WOE_i = ln( (Bad_i / Bad_total) / (Good_i / Good_total) )  (1)
IV_i = (Bad_i / Bad_total − Good_i / Good_total) × WOE_i  (2)
IV = Σ_i IV_i  (3)
- if the user tags of the data object are "good customer" and "bad customer", then Bad_i and Good_i respectively represent the number of bad customers and the number of good customers in the i-th group of the variable, and Bad_total and Good_total respectively represent the total number of bad customers and the total number of good customers in all groups.
- the associated nominal variables are then screened according to the obtained IV values to obtain a preset number of associated nominal variables as data evaluation variables.
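The IV screening of step 802 can be sketched directly from formulas (1)–(3); the WOE/IV definitions follow the standard credit-scoring form, and the group counts and variable names in the selection example are hypothetical.

```python
import math

def woe_iv(bad_counts, good_counts):
    """Per-group weight of evidence (WOE) and the variable's total
    information value (IV), per the standard credit-scoring formulas."""
    bad_total, good_total = sum(bad_counts), sum(good_counts)
    woes, iv = [], 0.0
    for bad_i, good_i in zip(bad_counts, good_counts):
        p_bad = bad_i / bad_total     # share of all bad customers in group i
        p_good = good_i / good_total  # share of all good customers in group i
        woe = math.log(p_bad / p_good)
        woes.append(woe)
        iv += (p_bad - p_good) * woe
    return woes, iv

def select_by_iv(variable_ivs, top_k):
    """Keep the top_k variables with the highest IV as data evaluation variables."""
    return sorted(variable_ivs, key=variable_ivs.get, reverse=True)[:top_k]

# hypothetical two-group variable: 10 bad / 90 good in group 0, 40 bad / 60 good in group 1
woes, iv = woe_iv([10, 40], [90, 60])
```

A group where bad customers are over-represented gets a positive WOE, one where they are under-represented gets a negative WOE, and variables with larger IV are kept as data evaluation variables.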
- Step 804 Input the data evaluation variables into the preset logistic regression model for data evaluation, and obtain the evaluation scores of the associated nominal variables.
- the data evaluation variables filtered according to the IV value are input into the preset logistic regression model to perform the classification operation.
- the score of each item of the associated nominal variable is calculated by the preset logistic regression model as the evaluation score.
- for example, the classifier of the machine learning model calculates the score of a certain type of capability data as the evaluation score of that capability data.
- the preset logistic regression model may be a classifier based on the logistic regression model.
- the selected 10 to 20 associated nominal variables with higher IV values are input into the model for training, the classification probability PL is calculated through logistic regression, and based on the classification probability PL, the probability that the object to be classified belongs to a certain category is obtained, so as to achieve accurate classification of the object to be classified and accurate data pushing.
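A minimal sketch of step 804, assuming an ordinary single-feature logistic regression trained by gradient descent stands in for the preset logistic regression model; the sample values and user tags are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Tiny single-feature logistic regression fitted by gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            gw += (p - y) * x / n
            gb += (p - y) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

def classification_probability(w, b, x):
    """PL: probability that the object to be classified is in the positive class."""
    return sigmoid(w * x + b)

# hypothetical data evaluation variable and user tags (1 = good customer)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
```

The fitted probability plays the role of PL; a real implementation would use all 10–20 selected data evaluation variables rather than one feature.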
- the IV value of each variable is calculated and the associated nominal variables are screened according to the IV values, so that variables with strong predictive ability are selected and the resulting evaluation scores are more accurate.
- a data object classification device is provided, and the data object classification device corresponds to the data object classification method in the foregoing embodiment one-to-one.
- the data object classification device includes a data division module 902, a data classification module 904, a data evaluation module 906, an object screening module 908, and an object classification module 910, wherein:
- the data division module 902 is used to obtain the basic data of each object to be classified as the data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirements.
- the data classification module 904 is used to obtain standard nominal variables according to the standard data, perform feature classification on the standard nominal variables, and perform fusion processing on the classified results to obtain a comprehensive classification result.
- the data evaluation module 906 is used to obtain the associated nominal variable according to the associated data, and perform data evaluation on the associated nominal variable to obtain the evaluation score of the associated nominal variable.
- the object screening module 908 is used to determine the target classification of the object to be classified in the comprehensive classification result according to the basic data of each object to be classified, determine the target score of the object to be classified relative to the evaluation score according to the associated data, and weight the target score to obtain the evaluation probability that the object to be classified belongs to the target classification.
- the object classification module 910 is configured to classify the object to be classified into the target classification as the target object if the evaluation probability is greater than the preset threshold.
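A sketch of how the object screening module 908 and object classification module 910 might combine their inputs. The patent does not spell out the weighting scheme, so a simple convex combination with an assumed mixing weight `alpha` and an assumed score scale is used purely for illustration.

```python
# Assumed weighting: alpha mixes the confidence of the target classification
# (from the comprehensive classification result) with the normalized target
# score (from the evaluation scores). Both alpha and max_score are assumptions.

def evaluation_probability(target_confidence, target_score,
                           alpha=0.6, max_score=100.0):
    """Weighted combination of classification confidence and target score."""
    return alpha * target_confidence + (1.0 - alpha) * (target_score / max_score)

def classify_object(target_confidence, target_score, threshold=0.5):
    """Assign the object to the target classification only when the
    evaluation probability exceeds the preset threshold."""
    return evaluation_probability(target_confidence, target_score) > threshold
```

With these assumed parameters, an object with high classification confidence and a high target score is accepted into the target classification, while a weak candidate is rejected.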
- the data division module 902 includes:
- the data classification sub-module 9022 is used to classify the to-be-processed data according to the object attributes to obtain the object attribute data;
- the correlation calculation sub-module 9024 is used to calculate the correlation coefficient between the object attribute data and the preset screening demand data through the Spearman rank correlation coefficient method, as the data correlation level;
- the correlation determination sub-module 9026 is configured to use the object attribute data as the standard data if the data correlation level meets the preset correlation level; and also used to use the object attribute data as the correlation data if the data correlation level does not meet the preset correlation level.
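The Spearman-based division performed by sub-modules 9022–9026 can be sketched as follows; the preset correlation level of 0.7 and the sample series are assumed values for illustration.

```python
def ranks(values):
    """1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def divide(attr_series, demand_series, preset_level=0.7):
    """Object attribute data becomes standard data when its correlation with
    the preset screening demand data meets the preset correlation level."""
    rho = spearman(attr_series, demand_series)
    return "standard" if abs(rho) >= preset_level else "associated"
```

In production one would typically use a library routine such as `scipy.stats.spearmanr`; the hand-rolled version above only serves to make the division criterion concrete.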
- the data classification module 904 includes:
- the pseudo-splitting sub-module 9042 is used to extract the continuous variable data in the standard data, perform pseudo-splitting of the continuous variable data according to the preset pseudo-splitting point, and obtain the information gain of the continuous variable data before and after the pseudo-splitting;
- the splitting sub-module 9044 is used to split the continuous variable data with the pseudo-split point as the split point if the information gain is greater than the preset gain difference, to obtain the discretized data after splitting, and to use the split point as the preset pseudo-split point for the next pseudo-split;
- the split pre-judgment sub-module 9046 is used to stop the split if the number of splits reaches the preset number of splits, and use the discretized data obtained after the last split as a discrete variable;
- the dimensionality reduction processing sub-module 9048 is used to reduce the dimensionality of discrete variables through data binning, and sort the discrete data obtained after the dimensionality reduction processing according to the characteristic values of the continuous variable data to obtain standard nominal variables.
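The data-binning dimensionality reduction performed by sub-module 9048 can be illustrated with a simple equal-frequency scheme; the patent does not fix a binning strategy, so this is only one plausible choice, and the sample values are hypothetical.

```python
def bin_discrete(values, n_bins):
    """Equal-frequency binning: reduce many distinct levels to n_bins bins,
    ordered by the underlying feature values (an assumed binning strategy)."""
    ordered = sorted(values)
    size = max(1, len(ordered) // n_bins)
    # bin edges taken at equal-frequency cut positions of the sorted values
    edges = [ordered[min(i * size, len(ordered) - 1)] for i in range(1, n_bins)]

    def bin_of(v):
        b = 0
        for e in edges:
            if v > e:
                b += 1
        return b

    return [bin_of(v) for v in values]

# 12 discretized values reduced to 3 ordered bins
bins = bin_discrete(list(range(12)), 3)
```

Because the bin index grows with the feature value, the resulting bins are already sorted by the characteristic values of the continuous variable data, matching the ordering requirement for the standard nominal variables.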
- pseudo-split sub-module 9042 includes:
- the entropy calculation unit 9042a is configured to perform feature data cutting from the continuous variable data according to a preset pseudo-splitting point, and calculate the pseudo-splitting data entropy of the feature data obtained after the cutting;
- the information gain unit 9042b is used to obtain the continuous data entropy of the continuous variable data, and calculate the difference between the continuous data entropy and the pseudo-splitting data entropy, as the information gain.
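The entropy and information-gain computation of units 9042a and 9042b can be sketched directly from the definitions above: information gain = continuous data entropy minus the size-weighted pseudo-split data entropy. The sample values and labels are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy of a label sequence (in bits)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(values, labels, split_point):
    """Entropy of the whole set minus the size-weighted entropy of the two
    subsets produced by cutting at the candidate (pseudo) split point."""
    left = [y for v, y in zip(values, labels) if v <= split_point]
    right = [y for v, y in zip(values, labels) if v > split_point]
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after
```

A pseudo-split point that cleanly separates the classes yields the maximal gain, so comparing the gain against the preset gain difference decides whether the pseudo-split is accepted as an actual split.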
- the data classification module 904 further includes:
- the feature classification sub-module 9050 is used to input standard nominal variables into at least two preset machine learning models for feature classification, and obtain the feature classification results corresponding to each preset machine learning model;
- the feature fusion sub-module 9052 is used to perform fusion processing on the feature classification results using the K-fold cross-validation method to obtain a comprehensive classification result.
- the feature fusion sub-module 9052 includes:
- the feature cutting unit 9052a is used to segment the feature classification result into a feature classification training set and a feature classification test set;
- the model verification unit 9052b is used to segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set, and verify the feature classification model obtained through the feature classification feeding set training according to the feature classification verification set, and obtain Feature classification verification data;
- the model testing unit 9052c is used to input the feature classification test set into the feature classification model for testing to obtain feature classification test data;
- the re-cutting unit 9052d is used to re-segment the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set for the next training and verification;
- the feature fusion unit 9052e is used to stop the segmentation when the number of re-segmentations reaches the preset number of divisions, and to process all the obtained feature classification verification data and all the obtained feature classification test data according to the preset fusion conditions to obtain feature classification prediction data, which is used as the comprehensive classification result.
- the data evaluation module 906 includes:
- variable screening sub-module 9062 is used to calculate the information value of the associated nominal variable, and filter the associated nominal variable according to the information value to obtain a preset number of associated nominal variables as data evaluation variables;
- the score evaluation sub-module 9064 is used to input the data evaluation variables into the preset logistic regression model for data evaluation to obtain the evaluation scores of the associated nominal variables.
- the above-mentioned data object classification device classifies the acquired basic data of the data object, inputs it into the preset classifier for processing to obtain the classification result and the evaluation result, and then summarizes the obtained results to determine the target classification of the data object that meets the screening requirements.
- a computer device is provided.
- the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
- the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
- the processor of the computer device is used to provide calculation and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, a computer program, and a database.
- the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
- the database of the computer device is used to store user order data.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer program is executed by the processor to realize a data object classification method.
- the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
- Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, etc.
- a computer device is provided.
- the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 10.
- the computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus.
- the processor of the computer device is used to provide calculation and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system and a computer program.
- the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer program is executed by the processor to realize a data object classification method.
- the display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen.
- the input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad set on the housing of the computer device, or an external keyboard, touchpad, or mouse.
- FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied.
- a specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
- a computer device is provided, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
- when the processor executes the computer program, the steps of the data object classification method in the foregoing embodiments are implemented, such as step 202 to step 210 shown in FIG. 2; or, when the processor executes the computer program, the functions of each module/unit of the data object classification apparatus in the above-mentioned embodiments are realized, such as the functions of modules 902 to 910 shown in FIG. 9. To avoid repetition, details are not repeated here.
- a computer-readable storage medium is provided.
- the storage medium is a volatile storage medium or a non-volatile storage medium, and a computer program is stored thereon.
- when the computer program is executed by a processor, the steps of the data object classification method in the foregoing embodiments are implemented, such as step 202 to step 208 shown in FIG. 2; or, when the processor executes the computer program, the functions of each module/unit of the data object classification device in the above embodiments are realized, such as the functions of modules 902 to 910 shown in FIG. 9. To avoid repetition, details are not repeated here.
- Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
Abstract
Description
Claims (20)
- A data object classification method, the method comprising: acquiring the basic data of each object to be classified as data to be processed, and dividing the data to be processed into standard data and associated data according to preset screening demand data; obtaining standard nominal variables according to the standard data, performing feature classification on the standard nominal variables, and performing fusion processing on the classified results to obtain a comprehensive classification result; obtaining associated nominal variables according to the associated data, and performing data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables; determining, according to the basic data of each object to be classified, the target classification of the object to be classified in the comprehensive classification result, determining, according to the associated data, the target score of the object to be classified relative to the evaluation scores, and weighting the target score to obtain the evaluation probability that the object to be classified belongs to the target classification; and if the evaluation probability is greater than a preset threshold, classifying the object to be classified into the target classification as a target object.
- The method according to claim 1, wherein dividing the data to be processed into standard data and associated data according to the preset screening demand data comprises: classifying the data to be processed according to object attributes to obtain object attribute data; calculating the correlation coefficient between the object attribute data and the preset screening demand data by means of the Spearman rank correlation coefficient as a data correlation level; if the data correlation level meets a preset correlation level, using the object attribute data as the standard data; and if the data correlation level does not meet the preset correlation level, using the object attribute data as the associated data.
- The method according to claim 1, wherein the standard data comprises continuous variable data, and obtaining the standard nominal variables according to the standard data comprises: extracting the continuous variable data from the standard data, performing pseudo-splitting on the continuous variable data according to a preset pseudo-split point, and obtaining the information gain of the continuous variable data before and after the pseudo-splitting; if the information gain is greater than a preset gain difference, splitting the continuous variable data with the pseudo-split point as a split point to obtain discretized data after the splitting, and using the split point as the preset pseudo-split point of the next pseudo-splitting to perform the splitting; if the number of splits reaches a preset number of splits, stopping the splitting, and using the discretized data obtained after the last splitting as discrete variables; and performing dimensionality reduction on the discrete variables through data binning, and sorting the discrete data obtained after the dimensionality reduction according to the feature values of the continuous variable data to obtain the standard nominal variables.
- The method according to claim 3, wherein performing pseudo-splitting on the continuous variable data according to the preset pseudo-split point to obtain the information gain of the continuous variable data before and after the pseudo-splitting comprises: performing feature data cutting on the continuous variable data according to the preset pseudo-split point, and calculating the pseudo-split data entropy of the feature data obtained after the cutting; and obtaining the continuous data entropy of the continuous variable data, and calculating the difference between the continuous data entropy and the pseudo-split data entropy as the information gain.
- The method according to claim 1, wherein performing feature classification on the standard nominal variables and performing fusion processing on the classified results to obtain the comprehensive classification result comprises: inputting the standard nominal variables into at least two preset machine learning models for feature classification to obtain the feature classification result corresponding to each preset machine learning model; and performing fusion processing on the feature classification results by means of K-fold cross-validation to obtain the comprehensive classification result.
- The method according to claim 5, wherein performing fusion processing on the feature classification results by means of K-fold cross-validation to obtain the comprehensive classification result comprises: segmenting the feature classification results into a feature classification training set and a feature classification test set; segmenting the feature classification training set according to preset cutting conditions to obtain a feature classification feeding set and a feature classification verification set, and verifying, according to the feature classification verification set, the feature classification model obtained by training with the feature classification feeding set to obtain feature classification verification data; inputting the feature classification test set into the feature classification model for testing to obtain feature classification test data; re-segmenting the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set for the next training and verification; and when the number of re-segmentations reaches a preset number of segmentations, stopping the segmentation, processing all the obtained feature classification verification data and all the obtained feature classification test data according to preset fusion conditions to obtain feature classification prediction data, and using the feature classification prediction data as the comprehensive classification result.
- The method according to claim 1, wherein performing data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables comprises: calculating the information values of the associated nominal variables, and screening the associated nominal variables according to the information values to obtain a preset number of associated nominal variables as data evaluation variables; and inputting the data evaluation variables into a preset logistic regression model for data evaluation to obtain the evaluation scores of the associated nominal variables.
- A data object classification device, comprising: a data division module, configured to acquire the basic data of each object to be classified as data to be processed, and divide the data to be processed into standard data and associated data according to preset screening requirements; a data classification module, configured to obtain standard nominal variables according to the standard data, perform feature classification on the standard nominal variables, and perform fusion processing on the classified results to obtain a comprehensive classification result; a data evaluation module, configured to obtain associated nominal variables according to the associated data, and perform data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables; an object screening module, configured to determine, according to the basic data of each object to be classified, the target classification of the object to be classified in the comprehensive classification result, determine, according to the associated data, the target score of the object to be classified relative to the evaluation scores, and weight the target score to obtain the evaluation probability that the object to be classified belongs to the target classification; and an object classification module, configured to classify the object to be classified into the target classification as a target object if the evaluation probability is greater than a preset threshold.
- A computer device, comprising: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to execute a data object classification method, wherein the data object classification method comprises: acquiring the basic data of each object to be classified as data to be processed, and dividing the data to be processed into standard data and associated data according to preset screening demand data; obtaining standard nominal variables according to the standard data, performing feature classification on the standard nominal variables, and performing fusion processing on the classified results to obtain a comprehensive classification result; obtaining associated nominal variables according to the associated data, and performing data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables; determining, according to the basic data of each object to be classified, the target classification of the object to be classified in the comprehensive classification result, determining, according to the associated data, the target score of the object to be classified relative to the evaluation scores, and weighting the target score to obtain the evaluation probability that the object to be classified belongs to the target classification; and if the evaluation probability is greater than a preset threshold, classifying the object to be classified into the target classification as a target object.
- The computer device according to claim 9, wherein dividing the data to be processed into standard data and associated data according to the preset screening demand data comprises: classifying the data to be processed according to object attributes to obtain object attribute data; calculating the correlation coefficient between the object attribute data and the preset screening demand data by means of the Spearman rank correlation coefficient as a data correlation level; if the data correlation level meets a preset correlation level, using the object attribute data as the standard data; and if the data correlation level does not meet the preset correlation level, using the object attribute data as the associated data.
- The computer device according to claim 9, wherein the standard data comprises continuous variable data, and obtaining the standard nominal variables according to the standard data comprises: extracting the continuous variable data from the standard data, performing pseudo-splitting on the continuous variable data according to a preset pseudo-split point, and obtaining the information gain of the continuous variable data before and after the pseudo-splitting; if the information gain is greater than a preset gain difference, splitting the continuous variable data with the pseudo-split point as a split point to obtain discretized data after the splitting, and using the split point as the preset pseudo-split point of the next pseudo-splitting to perform the splitting; if the number of splits reaches a preset number of splits, stopping the splitting, and using the discretized data obtained after the last splitting as discrete variables; and performing dimensionality reduction on the discrete variables through data binning, and sorting the discrete data obtained after the dimensionality reduction according to the feature values of the continuous variable data to obtain the standard nominal variables.
- The computer device according to claim 11, wherein performing pseudo-splitting on the continuous variable data according to the preset pseudo-split point to obtain the information gain of the continuous variable data before and after the pseudo-splitting comprises: performing feature data cutting on the continuous variable data according to the preset pseudo-split point, and calculating the pseudo-split data entropy of the feature data obtained after the cutting; and obtaining the continuous data entropy of the continuous variable data, and calculating the difference between the continuous data entropy and the pseudo-split data entropy as the information gain.
- The computer device according to claim 9, wherein performing feature classification on the standard nominal variables and performing fusion processing on the classified results to obtain the comprehensive classification result comprises: inputting the standard nominal variables into at least two preset machine learning models for feature classification to obtain the feature classification result corresponding to each preset machine learning model; and performing fusion processing on the feature classification results by means of K-fold cross-validation to obtain the comprehensive classification result.
- The computer device according to claim 13, wherein performing fusion processing on the feature classification results by means of K-fold cross-validation to obtain the comprehensive classification result comprises: segmenting the feature classification results into a feature classification training set and a feature classification test set; segmenting the feature classification training set according to preset cutting conditions to obtain a feature classification feeding set and a feature classification verification set, and verifying, according to the feature classification verification set, the feature classification model obtained by training with the feature classification feeding set to obtain feature classification verification data; inputting the feature classification test set into the feature classification model for testing to obtain feature classification test data; re-segmenting the feature classification training set according to the preset cutting conditions to obtain the feature classification feeding set and the feature classification verification set for the next training and verification; and when the number of re-segmentations reaches a preset number of segmentations, stopping the segmentation, processing all the obtained feature classification verification data and all the obtained feature classification test data according to preset fusion conditions to obtain feature classification prediction data, and using the feature classification prediction data as the comprehensive classification result.
- The computer device according to claim 9, wherein performing data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables comprises: calculating the information values of the associated nominal variables, and screening the associated nominal variables according to the information values to obtain a preset number of associated nominal variables as data evaluation variables; and inputting the data evaluation variables into a preset logistic regression model for data evaluation to obtain the evaluation scores of the associated nominal variables.
- A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, a data object classification method is implemented, wherein the data object classification method comprises the following steps: acquiring the basic data of each object to be classified as data to be processed, and dividing the data to be processed into standard data and associated data according to preset screening demand data; obtaining standard nominal variables according to the standard data, performing feature classification on the standard nominal variables, and performing fusion processing on the classified results to obtain a comprehensive classification result; obtaining associated nominal variables according to the associated data, and performing data evaluation on the associated nominal variables to obtain the evaluation scores of the associated nominal variables; determining, according to the basic data of each object to be classified, the target classification of the object to be classified in the comprehensive classification result, determining, according to the associated data, the target score of the object to be classified relative to the evaluation scores, and weighting the target score to obtain the evaluation probability that the object to be classified belongs to the target classification; and if the evaluation probability is greater than a preset threshold, classifying the object to be classified into the target classification as a target object.
- The computer-readable storage medium according to claim 16, wherein dividing the data to be processed into standard data and associated data according to the preset screening demand data comprises: classifying the data to be processed according to object attributes to obtain object attribute data; calculating the correlation coefficient between the object attribute data and the preset screening demand data by means of the Spearman rank correlation coefficient as a data correlation level; if the data correlation level meets a preset correlation level, using the object attribute data as the standard data; and if the data correlation level does not meet the preset correlation level, using the object attribute data as the associated data.
- The computer-readable storage medium according to claim 16, wherein the standard data comprises continuous variable data, and obtaining the standard nominal variables according to the standard data comprises: extracting the continuous variable data from the standard data, performing pseudo-splitting on the continuous variable data according to a preset pseudo-split point, and obtaining the information gain of the continuous variable data before and after the pseudo-splitting; if the information gain is greater than a preset gain difference, splitting the continuous variable data with the pseudo-split point as a split point to obtain discretized data after the splitting, and using the split point as the preset pseudo-split point of the next pseudo-splitting to perform the splitting; if the number of splits reaches a preset number of splits, stopping the splitting, and using the discretized data obtained after the last splitting as discrete variables; and performing dimensionality reduction on the discrete variables through data binning, and sorting the discrete data obtained after the dimensionality reduction according to the feature values of the continuous variable data to obtain the standard nominal variables.
- The computer-readable storage medium according to claim 18, wherein performing pseudo-splitting on the continuous variable data according to the preset pseudo-split point to obtain the information gain of the continuous variable data before and after the pseudo-splitting comprises: performing feature data cutting on the continuous variable data according to the preset pseudo-split point, and calculating the pseudo-split data entropy of the feature data obtained after the cutting; and obtaining the continuous data entropy of the continuous variable data, and calculating the difference between the continuous data entropy and the pseudo-split data entropy as the information gain.
- The computer-readable storage medium according to claim 16, wherein performing feature classification on the standard nominal variables and performing fusion processing on the classified results to obtain the comprehensive classification result comprises: inputting the standard nominal variables into at least two preset machine learning models for feature classification to obtain the feature classification result corresponding to each preset machine learning model; and performing fusion processing on the feature classification results by means of K-fold cross-validation to obtain the comprehensive classification result.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911165269.4A CN111177500A (zh) | 2019-11-25 | 2019-11-25 | 数据对象分类方法、装置、计算机设备和存储介质 |
CN201911165269.4 | 2019-11-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021103401A1 true WO2021103401A1 (zh) | 2021-06-03 |
Family
ID=70655374
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/085805 WO2021103401A1 (zh) | 2019-11-25 | 2020-04-21 | 数据对象分类方法、装置、计算机设备和存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111177500A (zh) |
WO (1) | WO2021103401A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112732536B (zh) * | 2020-12-30 | 2023-01-13 | 平安科技(深圳)有限公司 | 数据监控告警方法、装置、计算机设备及存储介质 |
CN115293282B (zh) * | 2022-08-18 | 2023-08-29 | 昆山润石智能科技有限公司 | 制程问题分析方法、设备及存储介质 |
CN115828147B (zh) * | 2023-02-15 | 2023-04-18 | 博纯材料股份有限公司 | 一种基于数据处理的氙气生产控制方法及系统 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108197282A (zh) * | 2018-01-10 | 2018-06-22 | 腾讯科技(深圳)有限公司 | 文件数据的分类方法、装置及终端、服务器、存储介质 |
CN109544150A (zh) * | 2018-10-09 | 2019-03-29 | 阿里巴巴集团控股有限公司 | 一种分类模型生成方法及装置、计算设备及存储介质 |
CN110019488A (zh) * | 2018-09-12 | 2019-07-16 | 国网浙江省电力有限公司嘉兴供电公司 | 多源异构数据融合多核分类方法 |
WO2019169700A1 (zh) * | 2018-03-08 | 2019-09-12 | 平安科技(深圳)有限公司 | 一种数据分类方法、装置、设备及计算机可读存储介质 |
CN110413775A (zh) * | 2019-06-25 | 2019-11-05 | 北京清博大数据科技有限公司 | 一种数据打标签分类方法、装置、终端及存储介质 |
- 2019-11-25: CN application CN201911165269.4A filed (status: pending)
- 2020-04-21: PCT application PCT/CN2020/085805 filed (application filing)
Also Published As
Publication number | Publication date |
---|---|
CN111177500A (zh) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20893572 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20893572 Country of ref document: EP Kind code of ref document: A1 |
|
- 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 230922) |
|