CA3144411A1 - Data classification method, device and system

Data classification method, device and system

Info

Publication number
CA3144411A1
Authority
CA
Canada
Prior art keywords
classification
data
model
sample data
weights
Prior art date
Legal status
Pending
Application number
CA3144411A
Other languages
French (fr)
Inventor
Xia Li
Current Assignee
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date
Filing date
Publication date
Application filed by 10353744 Canada Ltd
Publication of CA3144411A1

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed in the present invention are a data classification method, device, and system, related to the field of data mining technology. The method includes: S1, acquiring sample data and initializing weights of the sample data; S2, classifying the sample data by a classification model, and acquiring incorrect classification results according to the correct classification information of the sample data, together with the incorrectly judged samples associated with the incorrect results; S3, calculating the incorrectness rate of the classification model based on the weights of the incorrectly judged samples, calculating the classification model weight from the incorrectness rate, and updating the sample data weights with the classification model weight; S4, repeating the iterative classification of S2-S3, and selecting target classification models from the iterations according to the incorrectness rates; and S5, classifying the classification-pending data by each target classification model. The present invention updates the classification model by updating the weights of the sample data, yielding more accurate classification models suitable for services related to the sample data.

Description

DATA CLASSIFICATION METHOD, DEVICE, AND SYSTEM
Technical Field
[0001] The present invention relates to the field of data mining technologies, and in particular to a method, a device, and a system for data classification.
Background
[0002] With the rapid development of big data technologies, feature-based data classification technologies have been developed to find data with one or more target features, for purposes such as precision marketing. However, current data classification technologies generally use traditional scorecard models based on a single supervised classification algorithm, in which the model parameters stop updating once a linear estimate has been generated. Meanwhile, the problem of poor performance by individual classifiers is present, and consequently the accuracy cannot be further improved.
Summary
[0003] To solve the current technical problems, a data classification method, device, and system are provided in the embodiments of the present invention. The technical proposal is as follows:
[0004] from the first perspective, a data classification method is provided, comprising:
[0005] S1, acquiring sample data and initializing weights of the sample data;
[0006] S2, classifying the sample data by a classification model, and acquiring incorrect classification results according to correct classification information of the sample data, together with the incorrectly judged samples associated with the incorrect results;
[0007] S3, calculating the incorrectness rate of the classification model based on the weights of the incorrectly judged samples, calculating the classification model weight from the incorrectness rate, and updating the sample data weights with the classification model weight;
[0008] S4, repeating the iterative classification of S2-S3, and selecting multiple target classification models from the multiple classification models obtained by the iterations according to the incorrectness rates; and
[0009] S5, classifying the data to be classified by each target classification model, and determining the classification results of the classification-pending data based on the weight of each classification model.
[0010] Furthermore, the classification model incorrectness rate is obtained from the weights of the incorrectly judged samples by:
[0011] assigning the mathematical product of the number of incorrectly judged samples and the weights of the incorrectly judged samples as the classification model incorrectness rate.
[0012] Furthermore, the calculation of the classification model weight from the classification model incorrectness rate includes:
[0013] calculating the classification model weight by the following equation:
[0014] $\alpha_i = \frac{1}{2} \log\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$
[0015] wherein $\alpha_i$ is the weight of the classification model obtained in the ith iteration, and $\epsilon_i$ is the incorrectness rate of the classification model obtained in the ith iteration, with $i = 1, 2, 3, \ldots$
[0016] Furthermore, the update of the sample data weights by the classification model weights includes:
[0017] updating the sample data weights according to the following equation:
[0018] $w_{i,j} = \frac{w_{i-1,j}}{Z_{i-1}}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$
[0019] wherein $w_{i-1,j}$ is the weight of sample data $j$ in the $(i-1)$th iteration, $Z_{i-1}$ is a normalization factor, $Z_{i-1} = \sum_{j=1}^{N} w_{i-1,j}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$, $\alpha_{i-1}$ is the classification model weight of the $(i-1)$th iteration, $j = 1, 2, \ldots, N$, and $N$ is the total number of sample data.
[0020] Furthermore, the selection of multiple target classification models from the multiple classification models obtained by the iterations according to the incorrectness rates includes:
[0021] comparing the incorrectness rates of the multiple classification models obtained by the iterations with target model selection conditions, to identify as the target classification models those classification models satisfying the target model selection conditions.
[0022] Furthermore, the acquisition of sample data includes:
[0023] collecting raw data and extracting feature information from the raw data;
[0024] counting the data volume of the raw data associated with each piece of feature information; and
[0025] filtering the feature information based on the data volume of the raw data associated with it, and identifying the remaining feature information as the sample data.
[0026] Furthermore, the classification of the data to be classified by each target classification model, and the determination of the classification results of the data to be classified based on the weight of each classification model, include:
[0027] classifying the data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and
[0028] performing a weighted calculation on the preliminary classification results according to the target classification model weights, to acquire the classification results of the classification-pending data.
[0029] Furthermore, the classification of the sample data by the classification model, and the acquisition of the incorrect classification results according to the correct classification information of the sample data together with the associated incorrectly judged samples, include:
[0030] classifying the sample data by the classification model to obtain classification results of the sample data; and
[0031] comparing the classification results of the sample data with the correct classification information to obtain the incorrect judgement results of the sample data, and determining from the sample data the incorrectly judged samples associated with the incorrect results.
[0032] From the second perspective, a data classification device is provided, comprising:
[0033] a sample acquisition module, configured to acquire sample data and initialize weights of the sample data;
[0034] a classification model iteration module, configured to classify the sample data by a classification model, and acquire incorrect classification results according to correct classification information of the sample data, together with the incorrectly judged samples associated with the incorrect results;
[0035] a calculation module, configured to calculate the incorrectness rate of the classification model based on the weights of the incorrectly judged samples, and calculate the classification model weight from the incorrectness rate to update the sample data weights;
[0036] a target model identification module, configured to select multiple target classification models from the multiple classification models obtained by the iterations according to the incorrectness rates; and
[0037] a classification module, configured to classify the data to be classified by each target classification model, and determine the classification results of the classification-pending data based on the weight of each classification model.
[0038] Furthermore, the calculation module is configured to identify the classification model incorrectness rate by calculating the mathematical product of the number of incorrectly judged samples and the weights of the incorrectly judged samples.
[0039] Furthermore, the calculation module is configured to calculate the classification model weight by the following equation:
[0040] $\alpha_i = \frac{1}{2} \log\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$
[0041] wherein $\alpha_i$ is the weight of the classification model obtained in the ith iteration, and $\epsilon_i$ is the incorrectness rate of the classification model obtained in the ith iteration, with $i = 1, 2, 3, \ldots$
[0042] Furthermore, the calculation module is configured to update the sample data weights according to the following equation:
[0043] $w_{i,j} = \frac{w_{i-1,j}}{Z_{i-1}}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$
[0044] wherein $w_{i-1,j}$ is the weight of sample data $j$ in the $(i-1)$th iteration, $Z_{i-1}$ is a normalization factor, $Z_{i-1} = \sum_{j=1}^{N} w_{i-1,j}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$, $\alpha_{i-1}$ is the classification model weight of the $(i-1)$th iteration, $j = 1, 2, \ldots, N$, and $N$ is the total number of sample data.
[0045] Furthermore, the target model identification module comprises:
[0046] a selection condition comparison module, configured to compare the incorrectness rates of the multiple classification models obtained by the iterations with target model selection conditions, to identify as the target classification models those classification models satisfying the target model selection conditions.
[0047] Furthermore, the sample acquisition module is configured to:
[0048] collect raw data and extract feature information from the raw data;
[0049] count the data volume of the raw data associated with each piece of feature information; and
[0050] filter the feature information based on the data volume of the raw data associated with it, and identify the remaining feature information as the sample data.
[0051] Furthermore, the classification module is configured to:
[0052] classify the data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and
[0053] perform a weighted calculation on the preliminary classification results according to the target classification model weights, to acquire the classification results of the classification-pending data.
[0054] Furthermore, the classification model iteration module comprises:
[0055] an iterative classification module, configured to classify the sample data by the classification model to obtain classification results of the sample data; and
[0056] a classification result comparison module, configured to compare the classification results of the sample data with the correct classification information to obtain the incorrect judgement results of the sample data, and determine from the sample data the incorrectly judged samples associated with the incorrect results.
[0057] From the third perspective, a computer system is provided, comprising:
[0058] one or more processors; and
[0059] a memory connected to the one or more processors, wherein the memory is used to store program commands that, when executed on the one or more processors, implement the aforementioned methods of the first perspective.
[0060] The benefits provided by the technical proposal in the embodiments of the present invention include:
[0061] 1. The present invention iteratively updates the classification model using the incorrectly judged samples, wherein the classification model incorrectness rate is calculated from the weights of the incorrectly judged samples after each iteration. In the process of iteratively updating the sample data weights, the weights of the incorrectly judged samples are increased, and the classification model is updated in the next iteration with the re-weighted sample data, yielding more accurate classification models for the services related to the sample data.
[0062] 2. The present invention finally adopts the multiple target classification models obtained in the iteration process to classify the classification-pending data and determines the classification results based on the target classification model weights, yielding more accurate classification.
Brief description of the drawings
[0063] For a better explanation of the technical proposal of the embodiments of the present invention, the accompanying drawings are briefly introduced in the following. Obviously, the following drawings represent only a portion of the embodiments of the present invention; those skilled in the art are able to create other drawings from the accompanying drawings without making creative efforts.
[0064] Fig. 1 is a process flow diagram of a data classification method provided in embodiments of the present invention.
[0065] Fig. 2 is a structure diagram of a data classification device provided in embodiments of the present invention.
[0066] Fig. 3 is a structure diagram of the computer system provided in embodiments of the present invention.
Detailed description
[0067] In order to make the objective, the technical scheme, and the advantages of the present invention clearer, the present invention is explained in further detail below with reference to the accompanying drawings. Obviously, the embodiments described below are only a portion of the embodiments of the present invention and cannot represent all possible embodiments. Based on the embodiments of the present invention, all other applications obtained by those skilled in the art without creative work fall within the scope of the present invention.
[0068] In data processing technologies targeting precision marketing, sample data from different service scenarios are combined to classify users based on user feature information. Current technologies generally adopt a single supervised classification algorithm for user classification, wherein the model parameters stop updating once a linear estimate has been generated, and consequently the accuracy cannot be further improved.

[0069] To solve the aforementioned technical problems, a data classification method is provided in the embodiments of the present invention. The detailed technical proposal is as follows:
[0070] a data classification method, as shown in Fig. 1, comprising:
[0071] S1, acquiring sample data and initializing weights of the sample data.
[0072] The sample data is acquired from raw data related to the service scenarios. The initialization of the sample data weights specifically includes assigning the same user pre-set initial weight to all sample data. For example:
[0073] $w_{0,j} = \frac{1}{N} \quad (j = 1, 2, \ldots, N)$
[0074] wherein $w_{0,j}$ is the initial weight of sample data $j$, and $N$ is the number of sample data.
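The uniform initialization in [0072]-[0074] translates directly into code. A minimal sketch in Python; numpy and the helper name are our own choices, not from the patent:

```python
import numpy as np

def init_weights(n_samples: int) -> np.ndarray:
    """S1: give every sample the same initial weight w_{0,j} = 1/N."""
    return np.full(n_samples, 1.0 / n_samples)

w = init_weights(5)  # array([0.2, 0.2, 0.2, 0.2, 0.2]); weights sum to 1
```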
[0075] In an embodiment, the acquisition of sample data includes:
[0076] collecting raw data and extracting feature information from the raw data;
[0077] counting the data volume of the raw data associated with each piece of feature information; and
[0078] filtering the feature information based on the data volume of the raw data associated with it, and identifying the remaining feature information as the sample data.
[0079] In the raw data acquisition and feature information extraction, the associated data can be collected from service system databases. Taking a consumer loan enforcement service as an example, the raw data acquisition and feature information extraction include:
[0080] identifying target users: selecting target users with higher natural application enforcement rates from multi-dimensional customer labels;
[0081] determining the observation window: acquiring the target users' service enforcement status in a pre-set time window, and selecting a time window with a stable enforcement rate as the observation window for the raw data acquisition;
[0082] classifying positive and negative samples: classifying the target customers into positive and negative samples, wherein the positive samples are customers without service enforcement within the observation window, and the negative samples are customers with service enforcement within the observation window (see the sketch after this list); and
[0083] collecting data: matching feature information for the positive and negative samples, including customer identity properties, activity properties, value properties, and other feature information.
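As an illustration of the positive/negative labeling step above, the following hedged sketch assumes a simple record layout; the field names (`customer_id`, `enforced_in_window`) and the +1/-1 label coding are our assumptions, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class CustomerRecord:
    customer_id: str
    enforced_in_window: bool  # True if service enforcement occurred in the observation window

def label_samples(records: list[CustomerRecord]) -> list[tuple[str, int]]:
    """Positive sample (+1): no enforcement in the window; negative (-1): enforcement occurred."""
    return [(r.customer_id, -1 if r.enforced_in_window else +1) for r in records]
```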
[0084] The data volume of the raw data corresponding to each piece of feature information is counted, so as to enable the subsequent filtering. The raw data is filtered based on this data volume, so that the sample data distribution under each piece of feature information is more uniform.
[0085] In detail, the quantiles of the raw data under each piece of feature information are determined during counting. During the feature information filtering, feature information can be deleted according to its data type based on the quantiles. For example, for numeric data, feature information clustering around the 5% quantile is deleted; for string data, feature information clustering around the 10% quantile is deleted.
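The patent does not spell out how "clustering around a quantile" is measured. One plausible reading, sketched below with pandas, drops a feature when its values concentrate on a single point instead of spreading out; the exact rule and the reuse of the 5%/10% figures as per-type limits are our assumptions:

```python
import pandas as pd

def filter_clustered_features(df: pd.DataFrame,
                              numeric_limit: float = 0.05,
                              string_limit: float = 0.10) -> pd.DataFrame:
    """Keep only features whose raw data is spread out rather than clustered."""
    keep = []
    for col in df.columns:
        s = df[col].dropna()
        if s.empty:
            continue
        top_share = s.value_counts(normalize=True).iloc[0]  # share of the modal value
        limit = numeric_limit if pd.api.types.is_numeric_dtype(s) else string_limit
        if top_share <= 1.0 - limit:  # sufficiently uniform distribution: keep it
            keep.append(col)
    return df[keep]
```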
[0086] Furthermore, the feature information filtering during the raw data acquisition can further include one or more of the following (see the sketch after this list):
[0087] pre-selecting feature information based on IV (information value) values;
[0088] removing feature information that cannot withstand the "penalty" of Lasso regression and Ridge regression;
[0089] calculating feature importance by random forest; and
[0090] verifying multicollinearity and removing feature information whose correlation exceeds a threshold.
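As a hedged illustration of the last two filters in this list, the sketch below ranks features by random-forest importance and prunes one feature from each highly correlated pair. scikit-learn is our choice of library, the 0.8 correlation threshold is an assumed value, and a numeric feature matrix is assumed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def filter_features(X: pd.DataFrame, y: pd.Series, corr_threshold: float = 0.8) -> list[str]:
    """Random-forest importance ranking plus pairwise-correlation pruning."""
    importances = pd.Series(
        RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y).feature_importances_,
        index=X.columns,
    ).sort_values(ascending=False)
    corr = X.corr().abs()
    kept: list[str] = []
    for feat in importances.index:  # visit features from most to least important
        if all(corr.loc[feat, k] < corr_threshold for k in kept):
            kept.append(feat)       # keep only if not collinear with an already kept feature
    return kept
```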
[0091] In an embodiment, the sample data acquisition further includes:
[0092] identifying missing values in the raw data, and deleting the feature information with a high missing rate; and
[0093] identifying outliers in the raw data, and replacing the outliers with quantile data.
[0094] Whether the missing rate of a piece of feature information is high can be determined against a pre-set missing value threshold.
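A minimal sketch of both cleaning steps with pandas; the 50% missing-rate threshold and the 1%/99% replacement quantiles are illustrative assumptions, not values given in the patent:

```python
import pandas as pd

def clean_raw_data(df: pd.DataFrame, missing_threshold: float = 0.5,
                   low_q: float = 0.01, high_q: float = 0.99) -> pd.DataFrame:
    # drop feature columns whose missing rate exceeds the pre-set threshold
    df = df.loc[:, df.isna().mean() <= missing_threshold].copy()
    # replace outliers with quantile data: clip numeric columns to the [1%, 99%] quantiles
    for col in df.select_dtypes("number").columns:
        lo, hi = df[col].quantile(low_q), df[col].quantile(high_q)
        df[col] = df[col].clip(lo, hi)
    return df
```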
[0095] S2, classifying the sample data by a classification model, and acquiring incorrect classification results according to the correct classification information of the sample data, together with the incorrectly judged samples associated with the incorrect results.
[0096] The classification model can adopt a logistic normalization classification model, known as a logistic classifier:
[0097] $h_0(x) = P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$
[0098] wherein $\beta_i$ $(i = 0, 1, 2, \ldots, k)$ are the initial values of the $k$ model parameters in the classification model.
[0099] The sample data is related to the service scenarios. Therefore, by classifying the sample data with the classification model, a classification model whose parameters $\beta_i$ $(i = 0, 1, 2, \ldots, k)$ suit the service scenarios can be obtained.
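The logistic hypothesis $h_0(x)$ above transcribes directly into code; a sketch with numpy, where the helper name is our own:

```python
import numpy as np

def logistic_classifier(x: np.ndarray, beta: np.ndarray) -> float:
    """h_0(x) = P(Y=1|X) = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + ... + beta_k*x_k)))."""
    z = beta[0] + np.dot(beta[1:], x)
    return 1.0 / (1.0 + np.exp(-z))
```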
[0100] In an embodiment, the step S2 includes:
[0101] classifying the sample data by the classification model to obtain classification results of the sample data; and
[0102] comparing the classification results of the sample data with the correct classification information to obtain the incorrect judgement results of the sample data, and determining from the sample data the incorrectly judged samples associated with the incorrect results.
[0103] The correct classification information is the true classification of the sample data. Comparing the sample classification results with the correct classification information makes it possible to determine the incorrect classification results of the classification model and the incorrectly judged samples corresponding to them.
[0104] S3, calculating the incorrectness rate of the classification model based on the weights of the incorrectly judged samples, and calculating the classification model weight from the incorrectness rate to update the sample data weights.
[0105] The classification model incorrectness rate is used to evaluate the reliability of the classification model.
[0106] In an embodiment, the classification model incorrectness rate is the mathematical product of the number of incorrectly judged samples and the weights of the incorrectly judged samples, calculated by the following equation:
[0107] $\epsilon_i = \sum_{j=1}^{N} w_{i,j}\, I(h_i(x^{(j)}) \neq y^{(j)}) = \sum_{j=1}^{N} \frac{1}{N}\, I(h_i(x^{(j)}) \neq y^{(j)})$
[0108] wherein $\epsilon_i$ is the incorrectness rate of the ith iteration model, $w_{i,j}$ is the weight of the jth sample data in the ith iteration, $I(h_i(x^{(j)}) \neq y^{(j)})$ indicates the samples incorrectly judged by the ith iteration model, and $N$ is the number of samples, with $i = 1, 2, 3, \ldots$ and $j = 1, 2, \ldots, N$.

[0109] The classification model weights are mainly used to reflect the influence of the classification results of different classification models when classifying classification-pending data.
[0110] In an embodiment, the classification model weights are calculated by the following equation:
[0111] $\alpha_i = \frac{1}{2} \log\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$
[0112] wherein $\alpha_i$ is the weight of the classification model obtained in the ith iteration, and $\epsilon_i$ is the incorrectness rate of the classification model obtained in the ith iteration, with $i = 1, 2, 3, \ldots$
[0113] In an embodiment, the sample data weights are updated by the following equation:
[0114] $w_{i,j} = \frac{w_{i-1,j}}{Z_{i-1}}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$
[0115] wherein $w_{i-1,j}$ is the weight of sample data $j$ in the $(i-1)$th iteration, $Z_{i-1}$ is a normalization factor, $Z_{i-1} = \sum_{j=1}^{N} w_{i-1,j}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$, $\alpha_{i-1}$ is the classification model weight of the $(i-1)$th iteration, $j = 1, 2, \ldots, N$, and $N$ is the total number of sample data.
[0116] The initial weights of the sample data are all equal. The adjustment above decreases the weights of correctly classified sample data and increases the weights of incorrectly judged sample data.
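Paragraphs [0106]-[0116] together describe one boosting-style round: weighted incorrectness rate, model weight, then sample re-weighting. A minimal numpy sketch of that round, under the assumption that true labels y and model outputs h are both coded as +1/-1 (the patent does not fix the coding):

```python
import numpy as np

def boosting_round(w: np.ndarray, y: np.ndarray, h: np.ndarray):
    """One round of S3: incorrectness rate, model weight alpha, re-weighted samples.

    w: current sample weights (sum to 1); y, h: true labels and model outputs,
    both coded as +1/-1 (our assumption).
    """
    eps = float(np.sum(w * (h != y)))          # weighted incorrectness rate epsilon_i
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # classification model weight alpha_i
    w_new = w * np.exp(-alpha * y * h)         # correct samples shrink, incorrect ones grow
    return eps, alpha, w_new / w_new.sum()     # divide by the normalization factor Z

# toy check: only sample 4 is misclassified, so its weight increases
w = np.full(4, 0.25)
y = np.array([+1, +1, -1, -1])
h = np.array([+1, +1, -1, +1])
print(boosting_round(w, y, h))  # eps = 0.25, alpha ~ 0.55, last weight grows to 0.5
```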
[0117] S4, repeating the iterative classification of S2-S3, and selecting multiple target classification models from the multiple classification models obtained by the iterations according to the incorrectness rates.
[0118] The iteration method adopts the gradient descent approach. In the iterating process, a classification model with new model parameters is generated in each iteration, to better match the services related to the current sample data. For example:
[0119] first, incorrectly judged sample No. 20 is selected at random. The initial estimate $h_0(x^{(20)})$, the true classification value $y^{(20)}$, and the feature variable values $x_i^{(20)}$ $(i = 1, 2, \ldots, k)$ are used for the stochastic gradient descent (SGD) estimation of the model parameters. After one iteration, the parameters are:
[0120] $\beta_0 \leftarrow \beta_0 - \alpha\,(h_0(x^{(20)}) - y^{(20)})$, $\beta_1 \leftarrow \beta_1 - \alpha\,(h_0(x^{(20)}) - y^{(20)})\,x_1^{(20)}$, $\ldots$, $\beta_k \leftarrow \beta_k - \alpha\,(h_0(x^{(20)}) - y^{(20)})\,x_k^{(20)}$
[0121] wherein $\alpha$ is a learning factor controlling the speed of gradient descent, selected between 0 and 1. The value of $\alpha$ determines the descent speed: larger values lead to faster descent. Too high a descent speed may harm the stability of the estimation; too low a speed may prevent reaching the optimal solution. The value of $\alpha$ adopts the constant parameter of the initial classifier $h_0(x)$ in the aforementioned classification model, rounded to one decimal digit.
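The update in [0120] is the standard single-sample SGD step for a logistic model; a sketch, where the helper name and the default learning factor 0.1 are our assumptions:

```python
import numpy as np

def sgd_step(beta: np.ndarray, x: np.ndarray, y: float, lr: float = 0.1) -> np.ndarray:
    """One SGD update on a single sample: beta_i <- beta_i - lr * (h0(x) - y) * x_i."""
    z = beta[0] + np.dot(beta[1:], x)
    h = 1.0 / (1.0 + np.exp(-z))    # logistic estimate h0(x)
    grad = h - y                    # shared error term (h0(x^(20)) - y^(20))
    new_beta = beta.copy()
    new_beta[0] -= lr * grad        # intercept term (x_0 = 1)
    new_beta[1:] -= lr * grad * x   # each beta_i scaled by its feature value x_i
    return new_beta
```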
[0122] In an embodiment, the step S4 includes:
[0123] comparing the incorrectness rates of the multiple classification models obtained by the iterations with the target model selection conditions, to identify as the target classification models those classification models satisfying the target model selection conditions. If a classification model does not satisfy the target model selection conditions, steps S2 and S3 are repeated.
[0124] The target model selection conditions are rules based on a pre-set incorrectness rate threshold. If the incorrectness rate satisfies the target model selection conditions, the stochastic gradient descent iteration is terminated, and the model parameters generated in the current iteration are taken as the estimated values of the optimized parameters for the classifier training. If the incorrectness rate does not satisfy the target model selection conditions, the process returns to S2 to iterate again and proceeds to S3 to update the model incorrectness rate. An incorrectness rate threshold that is too high or too low affects the effectiveness and efficiency of the model parameter estimation, so the threshold can be adjusted according to practical service conditions.
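A minimal sketch of the selection rule in [0123]-[0124], assuming the condition is simply "incorrectness rate below a pre-set threshold"; the threshold value 0.3 is illustrative only:

```python
def select_target_models(models: list, error_rates: list[float],
                         threshold: float = 0.3) -> list:
    """Keep the iterations' models whose incorrectness rate meets the pre-set condition."""
    return [m for m, eps in zip(models, error_rates) if eps < threshold]
```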
[0125] In an embodiment, the classification models iterate M times, and M target classification models are obtained. The multiple classification models are combined to create a strong classifier:
[0126] $H(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$
[0127] wherein $H(x)$ is the strong classifier, $M$ is the total number of target classification models, $\alpha_m$ is the weight of target classification model $m$, and $h_m(x)$ is target classification model $m$.
[0128] The total number of target classification models in the strong classifier can be set according to service conditions.
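The combination rule in [0126] transcribes directly; a sketch in which each target model is any callable returning a +1/-1 (or real-valued) prediction, which is our representation:

```python
import numpy as np
from typing import Callable, Sequence

def strong_classifier(models: Sequence[Callable[[np.ndarray], float]],
                      alphas: Sequence[float], x: np.ndarray) -> int:
    """H(x) = sign(sum_m alpha_m * h_m(x)); returns +1 or -1."""
    score = sum(a * h(x) for a, h in zip(alphas, models))
    return 1 if score >= 0 else -1
```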
[0129] S5, classifying the data to be classified by each target classification model, and determining the classification results of the classification-pending data based on the weight of each classification model.
[0130] In an embodiment, the step S5 includes:
[0131] classifying the data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and
[0132] performing a weighted calculation on the preliminary classification results according to the target classification model weights, to acquire the classification results of the classification-pending data.
[0133] As shown in Fig. 2, based on the aforementioned data classification method, a data classification device is disclosed in the embodiments of the present invention, including:
[0134] a sample acquisition module 201, configured to acquire sample data and initialize weights of the sample data.
[0135] The sample data is acquired from raw data related to the service scenarios. The initialization of the sample data weights specifically includes assigning the same user pre-set initial weight to all sample data. For example:
[0136] $w_{0,j} = \frac{1}{N} \quad (j = 1, 2, \ldots, N)$
[0137] wherein $w_{0,j}$ is the initial weight of sample data $j$, and $N$ is the number of sample data.
[0138] In an embodiment, the sample acquisition module 201 is further configured to:
[0139] collect raw data and extract feature information from the raw data;
[0140] count the data volume of the raw data associated with each piece of feature information; and
[0141] filter the feature information based on the data volume of the raw data associated with it, and identify the remaining feature information as the sample data.

[0142] The sample acquisition module 201 is further configured to filter the feature information using one or more of the following approaches:
[0143] pre-selecting feature information based on IV (information value) values;
[0144] removing feature information that cannot withstand the "penalty" of Lasso regression and Ridge regression;
[0145] calculating feature importance by random forest; and
[0146] verifying multicollinearity and removing feature information whose correlation exceeds a threshold.
[0147] In an embodiment, the sample acquisition module 201 is further configured to:
[0148] identify missing values in the raw data, and delete the feature information with a high missing rate; and
[0149] identify outliers in the raw data, and replace the outliers with quantile data.
[0150] A classification model iteration module 202 is configured to classify the sample data by a classification model, and acquire incorrect classification results according to the correct classification information of the sample data, together with the incorrectly judged samples associated with the incorrect results.
[0151] In an embodiment, the classification model iteration module 202 comprises:
[0152] an iterative classification module, configured to classify the sample data by the classification model to obtain classification results of the sample data; and
[0153] a classification result comparison module, configured to compare the classification results of the sample data with the correct classification information to obtain the incorrect judgement results of the sample data, and determine from the sample data the incorrectly judged samples associated with the incorrect results.
[0154] The classification model can adopt a logistic normalization classification model, known as a logistic classifier:
[0155] $h_0(x) = P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$
[0156] wherein $\beta_i$ $(i = 0, 1, 2, \ldots, k)$ are the initial values of the $k$ model parameters in the classification model.
[0157] A calculation module 203 is configured to calculate the incorrectness rate of the classification model based on the weights of the incorrectly judged samples, and calculate the classification model weight from the incorrectness rate to update the sample data weights.

[0158] In an embodiment, the classification model incorrectness rate is the mathematical product of the number of incorrectly judged samples and the weights of the incorrectly judged samples, calculated by the following equation:
[0159] $\epsilon_i = \sum_{j=1}^{N} w_{i,j}\, I(h_i(x^{(j)}) \neq y^{(j)}) = \sum_{j=1}^{N} \frac{1}{N}\, I(h_i(x^{(j)}) \neq y^{(j)})$
[0160] wherein $\epsilon_i$ is the incorrectness rate of the ith iteration model, $w_{i,j}$ is the weight of the jth sample data in the ith iteration, $I(h_i(x^{(j)}) \neq y^{(j)})$ indicates the samples incorrectly judged by the ith iteration model, and $N$ is the number of samples, with $i = 1, 2, 3, \ldots$ and $j = 1, 2, \ldots, N$.
[0161] In an embodiment, the classification model weights are calculated by the following equation:
[0162] $\alpha_i = \frac{1}{2} \log\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$
[0163] wherein $\alpha_i$ is the weight of the classification model obtained in the ith iteration, and $\epsilon_i$ is the incorrectness rate of the classification model obtained in the ith iteration, with $i = 1, 2, 3, \ldots$
[0164] In an embodiment, the sample data weights are updated by the following equation:
[0165] $w_{i,j} = \frac{w_{i-1,j}}{Z_{i-1}}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$
[0166] wherein $w_{i-1,j}$ is the weight of sample data $j$ in the $(i-1)$th iteration, $Z_{i-1}$ is a normalization factor, $Z_{i-1} = \sum_{j=1}^{N} w_{i-1,j}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$, $\alpha_{i-1}$ is the classification model weight of the $(i-1)$th iteration, $j = 1, 2, \ldots, N$, and $N$ is the total number of sample data.
[0167] A target model identification module 204 is configured to select multiple target classification models from the multiple classification models obtained by the iterations according to the incorrectness rates.
[0168] The iteration method adopts the gradient descent approach. In the iterating process, a classification model with new model parameters is generated in each iteration, to better match the services related to the current sample data.
[0169] In an embodiment, the target model identification module 204 comprises:
[0170] a selection condition comparison module, configured to compare the incorrectness rates of the multiple classification models obtained by the iterations with the target model selection conditions, to identify as the target classification models those classification models satisfying the target model selection conditions.

[0171] In an embodiment, the target model identification module 204 is particularly configured to:
[0172] create a strong classifier by combining the M target classification models from M iterations, by:
[0173] $H(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$
[0174] wherein $H(x)$ is the strong classifier, $M$ is the total number of target classification models, $\alpha_m$ is the weight of target classification model $m$, and $h_m(x)$ is target classification model $m$.
[0175] A classification module 205 is configured to classify the data to be classified by each target classification model, and determine the classification results of the classification-pending data based on the weight of each classification model.
[0176] In an embodiment, the classification module 205 is particularly configured to:
[0177] classify the data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and
[0178] perform a weighted calculation on the preliminary classification results according to the target classification model weights, to acquire the classification results of the classification-pending data.
[0179] In the data classification device disclosed in the present invention, the sample acquisition module 201, the classification model iteration module 202, the calculation module 203, and the target model identification module 204 can be deployed in a data safety assessment management system, while the classification module 205 can be deployed on an offline computation platform. The data safety assessment management system (known as a bastion host) can be used for offline data detection and model development. The online computation platform is used to allocate model execution, such as execution frequency, execution starting time, and other parameter allocation, and then writes the model execution results to the database. The offline computation platform can be used together with the marketing event allocation module in the CRM (customer relationship management) system, for customer group selection and for setting marketing volume, marketing periods, and marketing channels.
[0180] Based on the aforementioned data classification method, a computer system is further provided in the present invention, including:
[0181] one or more processors; and
[0182] a memory connected to the one or more processors, wherein the memory is used to store program commands that, when executed on the one or more processors, implement the aforementioned methods of the first perspective.
[0183] In particular, a schematic of the computer system structure, shown in Fig. 3, comprises a processor 310, a video display adapter 311, a disk driver 312, an input/output connection port 313, an internet connection port 314, and a memory 320. The processor 310, video display adapter 311, disk driver 312, input/output connection port 313, and internet connection port 314 are connected and communicate via the system bus 330.
[0184] In particular, the processor 310 can adopt a general-purpose CPU (central processing unit), a microprocessor, an ASIC (application-specific integrated circuit), or one or more integrated circuits. The processor is used for executing associated programs to achieve the technical strategies provided in the present invention.
[0185] The memory 320 can adopt a read-only memory (ROM), a random access memory (RAM), a static memory, a dynamic memory, etc. The memory 320 is used to store the operating system 321 for controlling the electronic apparatus 300, and the basic input/output system (BIOS) 322 for controlling the low-level operations of the electronic apparatus 300. Meanwhile, the memory can also store the internet browser 323, the data storage management system 324, the device label information processing system 325, etc. The device label information processing system 325 can be a program that implements the aforementioned methods and procedures of the present invention. In summary, when the technical strategies are performed via software or hardware, the codes of the associated programs are stored in the memory 320, then called and executed by the processor 310.
[0186] The input/output connection port 313 is used to connect input/output modules for information input and output. The input/output modules can be components installed in the device (not shown in the drawings), or can be externally connected to the device to provide the described functionalities. In particular, the input devices may include keyboards, mice, touch screens, microphones, various types of sensors, etc.; the output devices may include monitors, speakers, vibrators, signal lights, etc.
[0187] The internet connection port 314 is used to connect a communication module (not shown in the drawings), to achieve communication and interaction between the device and other equipment. In particular, the communication module may be connected by wire (such as USB or network cables) or wirelessly (such as mobile data, WiFi, Bluetooth, etc.).
[0188] The system bus 330 includes a path to transfer data across the components of the device (such as the processor 310, the video display adapter 311, the disk driver 312, the input/output connection port 313, the internet connection port 314, and the memory 320).
[0189] Besides, the electronic device 300 can access the collection condition information in the collection condition information database 341 via a virtual resource object, for conditional statements and other purposes.
[0190] To clarify, although the schematic of the aforementioned device only includes the processor 310, the video display adapter 311, the disk driver 312, the input/output connection port 313, the internet connection port 314, the memory 320, and the system bus 330, practical applications may include other components necessary for successful operation. It is also understandable to those skilled in the art that the device may comprise fewer components than shown in the drawings while still operating successfully.
[0191] From the descriptions of the applications and embodiments above, those skilled in the art can understand that the present invention can be achieved by a combination of software and a necessary hardware platform. Based on this concept, the technical benefits of the present invention can be provided in the form of software products. The mentioned computer software products are stored in storage media such as ROM/RAM, magnetic disks, compact disks, etc., and include a number of commands to have a computer device (such as a personal computer, a server, or a network device) perform the methods described in all or some of the embodiments of the present invention.
[0192] The embodiments in this description are explained progressively. Similar contents can be cross-referenced among the embodiments, while the differences among the embodiments are emphasized. In particular, the system embodiments are similar to the method embodiments, so they are described concisely; for related contents, refer to the method embodiments. The described system and system embodiments are for demonstration only: components described as separate may or may not be physically separated, and components shown as individual units may or may not be physical units. In other words, the components can be at a single location or distributed over multiple network units. All or some of the modules can be used to achieve the purposes of the embodiments of the present invention according to practical scenarios. Those skilled in the art can understand and apply the associated strategies without creative work.
[0193] The benefits provided by the technical proposal in the embodiments of the present invention include:
[0194] 1. The present invention iteratively updates the classification model using the incorrectly judged samples, wherein the classification model incorrectness rate is calculated from the weights of the incorrectly judged samples after each iteration. In the process of iteratively updating the sample data weights, the weights of the incorrectly judged samples are increased, and the classification model is updated in the next iteration with the re-weighted sample data, yielding more accurate classification models for the services related to the sample data.
[0195] 2. The present invention finally adopts the multiple target classification models obtained in the iteration process to classify the classification-pending data and determines the classification results based on the target classification model weights, yielding more accurate classification.
[0196] The available technical proposals above can be combined as far as possible into further embodiments of the present invention, which are not described again in detail.
[0197] The aforementioned contents are preferred embodiments of the present invention and shall not limit the present invention. Therefore, all alterations, modifications, equivalents, and improvements of the present invention fall within the scope of the present invention.



Claims:
1. A device comprising:
a sample acquisition module, configured to acquire sample data and initialize weights of the sample data;
a classification model iteration module, configured to:
classify sample data by a classification model; and acquire incorrect classification results by correct classification information of the sample data and associated incorrectly judged samples for incorrect results;
a calculation module, configured to:
calculate incorrectness rate of the classification model based on weights of the incorrectly judged samples; and calculate the classification model weights by the incorrectness rate of the classification model to update sample data weights by the classification model weights;
a target model identification module, configured to select multiple target classification models from the multiple classification models obtained by iterations according to the incorrectness rates; and a classification module, configured to:
classify data to be classified by each target classification model; and determine the classification results of the classification-pending data based on the weights of each classification model.
2. The device of claim 1, wherein the calculation module is configured to identify the classification model incorrectness rate by calculating a mathematical product of the number of the incorrectly judged samples and the weights of the incorrectly judged samples.
3. The device of any one of claims 1 to 2, wherein the calculation module is further configured to calculate the classification model weights by:
$\alpha_i = \frac{1}{2} \log\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$
wherein $\alpha_i$ is the weight of the classification model obtained in the ith iteration, and wherein $\epsilon_i$ is the incorrectness rate of the classification model obtained in the ith iteration, with $i = 1, 2, 3, \ldots$
4. The device of any one of claims 1 to 3, wherein the calculation module is further configured to update the sample data weights according to:
$w_{i,j} = \frac{w_{i-1,j}}{Z_{i-1}}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$
wherein $w_{i-1,j}$ is the weight of the sample data $j$ in the $(i-1)$th iteration;
wherein $Z_{i-1}$ is a normalization factor, $Z_{i-1} = \sum_{j=1}^{N} w_{i-1,j}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$; wherein $\alpha_{i-1}$ is the classification model weight of the $(i-1)$th iteration, with $j = 1, 2, 3, \ldots, N$; and wherein $N$ is the total number of the sample data.
5. The device of any one of claims 1 to 4, wherein the target model identification module further comprises:

a selection condition comparison module, configured to compare the incorrectness rate of the multiple classification models obtained by iterations with target model selection conditions, for identifying the target classification models from the classification models satisfying the target model selection conditions.
6. The device of any one of claims 1 to 5, wherein the classification module is further configured to:
collect raw data to extract feature information of the raw data;
count data volume of the raw data associated with each feature information;
and filter the feature data based on the data volume of the raw data associated with the feature data to identify the remaining feature information as the sample data.
7. The device of any one of claims 1 to 6, wherein the classification module is further configured to:
classify data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and perform weighted calculation of the preliminary classification results according to the target classification model weights, for acquiring classification results of the classification-pending data.
8. The device of any one of claims 1 to 7, wherein a classification iteration module comprises:
an iterative classification module, configured to classify the sample data by the classification models to obtain classification results of the sample data; and a classification result comparison module, configured to:
compare the classification results of the sample data with the correct classification information to obtain incorrect judgement results of the sample data; and determine the associated incorrectly judged samples for the incorrect results from the sample data.
9. The device of any one of claims 1 to 8, wherein the sample data is acquired from the raw data related to service scenarios, wherein the initialization of the sample data weights includes receiving the same initial weight for all sample data pre-set by the user, by:
$w_{0,j} = \frac{1}{N} \quad (j = 1, 2, \ldots, N)$
wherein $w_{0,j}$ is the initial weight of the sample data $j$, and $N$ is the number of sample data.
10. The device of any one of claims 1 to 9, wherein, in the raw data acquisition and feature information extraction, the associated data are collected from service system databases.
11. The device of any one of claims 1 to 10, wherein quantiles of the raw data under each feature information are determined during the counting process, and wherein, during the feature information filtration, feature information is deleted according to its data type based on the quantiles.
12. The device of any one of claims 1 to 11, wherein the sample acquisition module is further configured to filter the feature data using one or more of:
pre-selecting feature information based on IV (information value) values;
removing feature information that cannot withstand the "penalty" of Lasso regression and Ridge regression;
calculating feature importance by random forest; and verifying multicollinearity and removing feature information with correlation over a threshold.
13. The device of any one of claims 1 to 12, wherein the target model selection conditions are rules based on a pre-set incorrectness rate threshold, wherein, when the incorrectness rate satisfies the target model selection conditions, the random gradient descent iteration is terminated and the model parameters generated in the current iteration are assigned as estimated values of the optimized parameters for the classifier training, wherein, when the incorrectness rate does not satisfy the target model selection conditions, the process returns to redo the iterations and proceeds to update the model incorrectness rates, and wherein a too-high or too-low incorrectness rate threshold affects the effectiveness and efficiency of the model parameter estimation, the incorrectness rate threshold being adjusted according to practical service conditions.
14. The device of any one of claims 1 to 13, wherein the sample acquisition module is further configured to:
identify missing values in the raw data;
delete the feature information with a high missing rate in the raw data;
identify outliers in the raw data; and
replace the outliers with quantile data; and wherein the evaluation of feature information missing values is determined by a pre-set missing value threshold.
15. The device of any one of claims 1 to 14, wherein the classification model adopts a logistic normalization classification model, known as the logistic classifier:
$h_0(x) = P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$
wherein $\beta_i$ $(i = 0, 1, 2, \ldots, k)$ are the initial values of the $k$ model parameters in the classification model; and
wherein the sample data is related to the service scenarios, wherein a classification model is obtained by classifying the sample data using the classification model, the model parameters $\beta_i$ $(i = 0, 1, 2, \ldots, k)$ being suitable for the service scenarios.
16. The device of any one of claims 1 to 15, wherein the classification model incorrectness rate is the mathematical product of the number of the incorrectly judged samples and the weights of the incorrectly judged samples, calculated by:
$\epsilon_i = \sum_{j=1}^{N} w_{i,j}\, I(h_i(x^{(j)}) \neq y^{(j)}) = \sum_{j=1}^{N} \frac{1}{N}\, I(h_i(x^{(j)}) \neq y^{(j)})$
wherein $\epsilon_i$ is the incorrectness rate of the ith iteration model;
wherein $w_{i,j}$ is the weight of the jth sample data in the ith iteration;
wherein $I(h_i(x^{(j)}) \neq y^{(j)})$ indicates the incorrectly judged samples corresponding to the incorrect classification results in the ith iteration model; and $N$ is the sample number, with $i = 1, 2, 3, \ldots$, and $j = 1, 2, 3, \ldots, N$.
17. The device of any one of claims 1 to 16, wherein the target model identification module is further configured to create a strong classifier by combining the M target classification models from M iterations, by:
$H(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$
wherein $H(x)$ is the strong classifier, $M$ is the total number of target classification models, $\alpha_m$ is the weight of the target classification model $m$, and $h_m(x)$ is the target classification model $m$; and wherein the total number of target classification models in the strong classifier is set according to service conditions.
18. The device of any one of claims 1 to 17, wherein the sample acquisition module, the classification model iteration module, the calculation module, and the target model identification module are configured in a data safety assessment management system, wherein the classification module is configured in an offline computation platform, wherein the data safety assessment management system is used in offline data detection and model developments.
19. The device of any one of claims 1 to 18, wherein an online computation platform is used for allocating model execution, including one or more of execution frequency, execution starting time, and other parameter allocation, and then writes the model execution results to the database, wherein the offline computation platform is used with a marketing event allocation module in a CRM (customer relationship management) system, for customer group selection and settings for marketing volume, marketing periods, and marketing channels.
20. A computer system comprising:
one or more processors; and a memory connected to the one or more processors, wherein the memory is used to store program commands that, when executed by the one or more processors, cause the system to:
acquire sample data and initialize weights of the sample data;
classify the sample data by a classification model;
acquire incorrect classification results by correct classification information of the sample data and associated incorrectly judged samples for the incorrect results;
calculate incorrectness rate of a classification model based on weights of the incorrectly judged samples;

calculate the classification model weights by the incorrectness rate of the classification model to update sample data weights by the classification model weights;
repeat iterative classification models comprising:
classifying the sample data by the classification model;
acquiring the incorrect classification results by the correct classification information of the sample data and the associated incorrectly judged samples for the incorrect results;
calculating the incorrectness rate of the classification model based on the weights of the incorrectly judged samples;
calculating the classification model weights by the incorrectness rate of the classification model to update the sample data weights by the classification model weights;
select multiple target classification models from the multiple classification models obtained by iterations according to the incorrectness rates;
classify data to be classified by each target classification model; and determine the classification results of classification-pending data based on each classification model weights.
21. The system of claim 20, wherein the classification model incorrectness rate is obtained from the weights of the incorrectly judged samples by:
assigning a mathematical product of the number of the incorrectly judged samples and the weights of the incorrectly judged samples as the classification model incorrectness rate.
22. The system of claim 20, wherein the calculation of the classification model weights by the classification model incorrectness rate comprises:

calculating the classification model weights by:
$\alpha_i = \frac{1}{2} \log\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$
wherein $\alpha_i$ is the weight of the classification model obtained in the ith iteration; and
wherein $\epsilon_i$ is the incorrectness rate of the classification model obtained in the ith iteration, with $i = 1, 2, 3, \ldots$
23. The system of claim 20, wherein the update of the sample data weights by the classification model weights comprises:
updating the sample data weights according to:
$w_{i,j} = \frac{w_{i-1,j}}{Z_{i-1}}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$
wherein $w_{i,j}$ is the weight of sample data $j$ in the ith iteration;
wherein $Z_{i-1}$ is a normalization factor:
$Z_{i-1} = \sum_{j=1}^{N} w_{i-1,j}\, e^{-\alpha_{i-1}\, y^{(j)}\, h_{i-1}(x^{(j)})}$;
wherein $\alpha_{i-1}$ is the classification model weight of the $(i-1)$th iteration, with $j = 1, 2, 3, \ldots, N$; and wherein $N$ is the total number of the sample data.
24. The system of any one of claims 20 to 23, wherein the selection of multiple classification models from the multiple classification models obtained by iterations according to the incorrectness rates comprises:
comparing the incorrectness rate of the multiple classification models obtained by iterations with target model selection conditions, to identify the target classification models from the classification models satisfying the target model selection conditions.
25. The system of any one of claims 20 to 23, wherein the acquisition of sample data comprises:
collecting raw data to extract feature information of the raw data;
counting data volume of the raw data associated with each feature information;
and filtering feature data based on the data volume of the raw data associated with the feature data to identify remaining feature information as the sample data.
26. The system of any one of claims 20 to 23, wherein the classification of data to be classified by each target classification model, and the determination of the classification results of the data to be classified based on the weights of each target classification model, comprises:
classifying data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and performing weighted calculation of the preliminary classification results according to the target classification model weights, to acquire classification results of the classification-pending data.
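A companion sketch, under the same assumptions, of the weighted combination recited in claim 26; the sign-based vote matches the strong classifier of claim 33 and presumes labels in {−1, +1}.

```python
import numpy as np

def classify_pending(models, alphas, X_pending):
    """Weighted vote of the target classification models:
    H(x) = sign( sum_m alpha_m * h_m(x) )."""
    score = sum(a * m.predict(X_pending) for m, a in zip(models, alphas))
    return np.sign(score)
```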
27. The system of any one of claims 20 to 23, wherein the classification of the sample data by the classification model and the acquisition of incorrect classification results by correct classification information of the sample data and associated incorrectly judged samples for the incorrect results comprises:
classifying the sample data by the classification models to obtain classification results of the sample data;
comparing the classification results of the sample data with the correct classification information to obtain incorrect judgement results of the sample data; and determining the associated incorrectly judged samples for the incorrect results from the sample data.
28. The system of any one of claims 20 to 27, wherein the sample data is acquired from the raw data related to service scenarios, wherein the initialization of the sample data weights includes receiving the same initial weight, pre-set by the user, for all sample data, by:
w_{0,j} = 1/N,
wherein w_{0,j} is the initial weight of the sample data j, and N is the number of sample data.
29. The system of any one of claims 20 to 23, wherein, during the raw data acquisition and the feature information extraction of the raw data, the associated data is collected from service system databases.
30. The system of any one of claims 20 to 29, wherein quantiles of the raw data under each feature information are determined during the counting process, wherein during the feature information filtration, the feature information is deleted according to different data types based on quantiles.
31. The system of any one of claims 20 to 30, wherein the feature data filtration in the raw data acquisition further includes one or more of:
pre-selecting feature information based on IV values (information-value values);
removing feature information that cannot withstand the "penalty" of Lasso regression and Ridge regression;
calculating information significance by random forest; and verifying multicollinearity and removing feature information with correlation over a threshold.
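One plausible composition of the filters listed in claim 31, sketched with pandas and scikit-learn; the IV pre-selection step is omitted, Lasso alone stands in for the Lasso/Ridge penalty test, and every threshold here is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV

def filter_features(df: pd.DataFrame, y, corr_threshold: float = 0.8):
    """Drop penalized, insignificant, or collinear feature columns (numeric df assumed)."""
    # Lasso "penalty": drop features whose coefficients shrink to zero.
    lasso = LassoCV(cv=5).fit(df.values, y)
    df = df.loc[:, np.abs(lasso.coef_) > 1e-6]
    # Random forest significance for the surviving features.
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(df.values, y)
    importance = pd.Series(rf.feature_importances_, index=df.columns)
    # Multicollinearity: of each highly correlated pair, keep the more significant feature.
    corr = df.corr().abs()
    to_drop = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > corr_threshold:
                to_drop.add(a if importance[a] < importance[b] else b)
    return df.drop(columns=sorted(to_drop))
```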
32. The system of any one of claims 20 to 31, wherein the target model selection conditions are rules based on a pre-set incorrectness rate threshold; wherein, when the incorrectness rate satisfies the target model selection conditions, the stochastic gradient descent iteration is terminated, and the model parameters generated in the current iteration are assigned as the estimated values of the optimized parameters for the classifier training; wherein, when the incorrectness rate does not satisfy the target model selection conditions, the process returns to redo the iterations and proceeds to update the model incorrectness rates; wherein an excessively high or excessively low incorrectness rate threshold affects the effectiveness and efficiency of the model parameter estimation; and wherein the incorrectness rate threshold is adjusted according to practical service conditions.
33. The system of any one of claims 20 to 32, wherein the classification models iterate for M times, and M target classification models are obtained, wherein the multiple classification models are combined to create a strong classifier:
H(x) = sign( Σ_{m=1}^{M} α_m · h_m(x) ),
wherein H(x) is the strong classifier, M is the total number of target classification models, α_m is the weight of the target classification model m, and h_m(x) is the target classification model m; and wherein the total number of target classification models in the strong classifier is set according to service conditions.
34. The system of any one of claims 20 to 33, wherein the sample data acquisition further comprises:
identifying missing values in the raw data;
deleting the feature information with high missing rate in the raw data;
identifying outliers in the raw data;
replacing the outliers with quantile data; and wherein the evaluation of feature information missing values is determined by a pre-set missing value threshold.
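A short pandas sketch of the cleaning steps in claim 34; the 50% missing-rate threshold and the 1%/99% quantile bounds are arbitrary assumptions standing in for the pre-set thresholds the claim leaves open.

```python
import pandas as pd

def clean_raw_data(df: pd.DataFrame, missing_threshold: float = 0.5,
                   lower_q: float = 0.01, upper_q: float = 0.99) -> pd.DataFrame:
    """Delete high-missing-rate features, then replace outliers with quantile data."""
    # Delete feature columns whose missing rate exceeds the pre-set threshold.
    df = df.loc[:, df.isna().mean() <= missing_threshold].copy()
    # Replace outliers with quantile data (clip to the chosen quantiles).
    for col in df.select_dtypes(include="number").columns:
        lo, hi = df[col].quantile([lower_q, upper_q])
        df[col] = df[col].clip(lower=lo, upper=hi)
    return df
```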
35. The system of any one of claims 20 to 34, wherein the classification model adopts a logistic normalization classification model, known as the logistic classifier:
h(x) = 1 / ( 1 + e^{−(β_0 + β_1·x_1 + β_2·x_2 + ... + β_k·x_k)} ),
wherein β_i (i = 0, 1, 2, ..., k) are the initial values of the model parameters in the classification model;
wherein the sample data is related to the service scenarios, and wherein, by classifying the sample data using the classification model, the model parameters β_i (i = 0, 1, 2, ..., k) suitable for the service scenarios are obtained.
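A direct transcription of the logistic classifier of claim 35 into NumPy, assuming x is a length-k feature vector and beta holds the k + 1 parameters β_0 through β_k.

```python
import numpy as np

def logistic_classifier(x, beta):
    """h(x) = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + ... + beta_k*x_k)))."""
    z = beta[0] + np.dot(beta[1:], x)
    return 1.0 / (1.0 + np.exp(-z))
```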
36. The system of any one of claims 20 to 35, wherein the classification model incorrectness rate is the mathematical product of the number of the incorrectly judged samples and the weights of the incorrectly judged samples, calculated by the following equation:
ε_i = Σ_{j=1}^{N} w_{i,j} · I( h_i(x_j) ≠ y_j ),
wherein ε_i is the incorrectness rate of the ith iteration model;
wherein w_{i,j} is the weight of the jth sample data in the ith iteration;
wherein I( h_i(x_j) ≠ y_j ) indicates the incorrectly judged samples corresponding to the incorrect classification results in the ith iteration model; and N is the sample number, with i = 1, 2, 3, ..., and j = 1, 2, 3, ..., N.
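A one-line check of the incorrectness rate formula of claim 36, assuming the predictions and labels are NumPy arrays; with equal weights w = 1/N this reduces to the count-times-weight product of claim 21.

```python
import numpy as np

def incorrectness_rate(w, predictions, y):
    """eps_i = sum_j w_ij * I(h_i(x_j) != y_j)."""
    return float(np.sum(w * (predictions != y)))
```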
37. The system of any one of claims 20 to 36, wherein the sample initial weights of the current data are all equal, wherein the adjustment decreases the weights of correctly classified sample data and increases the weights of incorrectly judged sample data.
38. A method comprising:
acquiring sample data and initializing weights of the sample data;
classifying the sample data by a classification model;
acquiring incorrect classification results by correct classification information of the sample data and associated incorrectly judged samples for the incorrect results;
calculating the incorrectness rate of the classification model based on the weights of the incorrectly judged samples;
calculating the classification model weights by the incorrectness rate of the classification model to update sample data weights by the classification model weights;
repeating iterations of the classification model, the iterations comprising:
classifying the sample data by the classification model;
acquiring the incorrect classification results by the correct classification information of the sample data and the associated incorrectly judged samples for the incorrect results;
calculating the incorrectness rate of the classification model based on the weights of the incorrectly judged samples;
calculating the classification model weights by the incorrectness rate of the classification model to update the sample data weights by the classification model weights;
selecting multiple target classification models from the multiple classification models obtained by iterations according to the incorrectness rates;
classifying data to be classified by each target classification model; and determining the classification results of the classification-pending data based on the weights of each target classification model.
39. The method of claim 38, wherein the classification model incorrectness rate is obtained from the weights of the incorrectly judged samples by:
assigning a mathematical product of number of the incorrectly judged samples and the weights of the incorrectly judged samples as the classification model incorrectness rate.
40. The method of claim 38, wherein the calculation of the classification model weights by the classification model incorrectness rate comprises:
calculating the classification model weights by:
α_i = (1/2) · ln( (1 − ε_i) / ε_i ),
wherein α_i is the weight of the classification model obtained in the ith iteration;
and wherein ε_i is the incorrectness rate of the classification model obtained in the ith iteration, with i = 1, 2, 3, ....
41. The method of claim 38, wherein the update of the sample data weights by the classification model weights comprises:
updating the sample data weights according to:
w_{i,j} = ( w_{i−1,j} / Z_{i−1} ) · exp( −α_{i−1} · y_j · h_{i−1}(x_j) ),
wherein w_{i,j} is the weight of sample data j in the ith iteration;
wherein Z_{i−1} is a normalization factor of:
Z_{i−1} = Σ_{j=1}^{N} w_{i−1,j} · exp( −α_{i−1} · y_j · h_{i−1}(x_j) );
wherein α_{i−1} is the classification model weight of the (i−1)th iteration, with j = 1, 2, 3, ..., N; and wherein N is the total number of the sample data.
42. The method of any one of claims 38 to 41, wherein the selection of the multiple target classification models from the multiple classification models obtained by iterations according to the incorrectness rates comprises:
comparing the incorrectness rate of the multiple classification models obtained by iterations with target model selection conditions, to identify the target classification models from the classification models satisfying the target model selection conditions.
43. The method of any one of claims 38 to 41, wherein the acquisition of sample data comprises:
collecting raw data to extract feature information of the raw data;
counting data volume of the raw data associated with each feature information;
and filtering feature data based on the data volume of the raw data associated with the feature data to identify remaining feature information as the sample data.
44. The method of any one of claims 38 to 41, wherein the classification of data to be classified by each target classification model, and the determination of the classification results of the data to be classified based on the weights of each target classification model, comprises:
classifying data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and performing weighted calculation of the preliminary classification results according to the target classification model weights, to acquire classification results of the classification-pending data.
45. The method of any one of claims 38 to 41, wherein the classification of the sample data by the classification model and the acquisition of incorrect classification results by correct classification information of the sample data and associated incorrectly judged samples for the incorrect results comprises:
classifying the sample data by the classification models to obtain classification results of the sample data;
comparing the classification results of the sample data with the correct classification information to obtain incorrect judgement results of the sample data; and determining the associated incorrectly judged samples for the incorrect results from the sample data.
46. The method of any one of claims 38 to 45, wherein the sample data is acquired from the raw data related to service scenarios, wherein the initialization of the sample data weights includes receiving the same initial weight, pre-set by the user, for all sample data, by:
w_{0,j} = 1/N,
wherein w_{0,j} is the initial weight of the sample data j, and N is the number of sample data.
47. The method of any one of claims 38 to 46, wherein, during the raw data acquisition and the feature information extraction of the raw data, the associated data is collected from service system databases.
48. The method of any one of claims 38 to 47, wherein quantiles of the raw data under each feature information are determined during the counting process, wherein during the feature information filtration, the feature information is deleted according to different data types based on quantiles.
49. The method of any one of claims 38 to 48, wherein the feature data filtration in the raw data acquisition further includes one or more of:
pre-selecting feature information based on IV values (information-value values);
removing feature information that cannot withstand the "penalty" of Lasso regression and Ridge regression;
calculating information significance by random forest; and verifying multicollinearity and removing feature information with correlation over a threshold.
50. The method of any one of claims 38 to 49, wherein the target model selection conditions are rules based on a pre-set incorrectness rate threshold; wherein, when the incorrectness rate satisfies the target model selection conditions, the stochastic gradient descent iteration is terminated, and the model parameters generated in the current iteration are assigned as the estimated values of the optimized parameters for the classifier training; wherein, when the incorrectness rate does not satisfy the target model selection conditions, the process returns to redo the iterations and proceeds to update the model incorrectness rates; wherein an excessively high or excessively low incorrectness rate threshold affects the effectiveness and efficiency of the model parameter estimation; and wherein the incorrectness rate threshold is adjusted according to practical service conditions.
51. The method of any one of claims 38 to 50, wherein the classification models iterate for M times, and M target classification models are obtained, wherein the multiple classification models are combined to create a strong classifier:
H(x) = sign( Σ_{m=1}^{M} α_m · h_m(x) ),
wherein H(x) is the strong classifier, M is the total number of target classification models, α_m is the weight of the target classification model m, and h_m(x) is the target classification model m; and wherein the total number of target classification models in the strong classifier is set according to service conditions.
52. The method of any one of claims 38 to 51, wherein the sample data acquisition further comprises:
identifying missing values in the raw data;
deleting the feature information with high missing rate in the raw data;
identifying outliers in the raw data;
replacing the outliers with quantile data; and wherein the evaluation of feature information missing values is determined by a pre-set missing value threshold.
53. The method of any one of claims 38 to 52, wherein the classification model adopts a logistic normalization classification model, known as the logistic classifier:
h(x) = 1 / ( 1 + e^{−(β_0 + β_1·x_1 + β_2·x_2 + ... + β_k·x_k)} ),
wherein β_i (i = 0, 1, 2, ..., k) are the initial values of the model parameters in the classification model;
wherein the sample data is related to the service scenarios, and wherein, by classifying the sample data using the classification model, the model parameters β_i (i = 0, 1, 2, ..., k) suitable for the service scenarios are obtained.
54. The method of any one of claims 38 to 53, wherein the classification model incorrectness rate is the mathematical product of the number of the incorrectly judged samples and the weights of the incorrectly judged samples, calculated by the following equation:
ε_i = Σ_{j=1}^{N} w_{i,j} · I( h_i(x_j) ≠ y_j ),
wherein ε_i is the incorrectness rate of the ith iteration model;
wherein w_{i,j} is the weight of the jth sample data in the ith iteration;
wherein I( h_i(x_j) ≠ y_j ) indicates the incorrectly judged samples corresponding to the incorrect classification results in the ith iteration model; and N is the sample number, with i = 1, 2, 3, ..., and j = 1, 2, 3, ..., N.
55. The method of any one of claims 38 to 54, wherein the sample initial weights of the current data are all equal, wherein the adjustment decreases the weights of correctly classified sample data and increases the weights of incorrectly judged sample data.
56. A computer readable physical memory having stored thereon a computer program which, when executed by a computer, configures the computer to:
acquire sample data and initialize weights of the sample data;
classify the sample data by a classification model;
acquire incorrect classification results by correct classification information of the sample data and associated incorrectly judged samples for the incorrect results;
calculate the incorrectness rate of the classification model based on the weights of the incorrectly judged samples;
calculate the classification model weights by the incorrectness rate of the classification model to update sample data weights by the classification model weights;
repeat iterations of the classification model, the iterations comprising:
classifying the sample data by the classification model;
acquiring the incorrect classification results by the correct classification information of the sample data and the associated incorrectly judged samples for the incorrect results;
calculating the incorrectness rate of the classification model based on the weights of the incorrectly judged samples;
calculating the classification model weights by the incorrectness rate of the classification model to update the sample data weights by the classification model weights;
select multiple target classification models from the multiple classification models obtained by iterations according to the incorrectness rates;
classify data to be classified by each target classification model; and determine the classification results of the classification-pending data based on the weights of each target classification model.
57. The memory of claim 56, wherein the classification model incorrectness rate is obtained from the weights of the incorrectly judged samples by:
assigning a mathematical product of number of the incorrectly judged samples and the weights of the incorrectly judged samples as the classification model incorrectness rate.
58. The memory of claim 56, wherein the calculation of the classification model weights by the classification model incorrectness rate comprises:
calculating the classification model weights by:
α_i = (1/2) · ln( (1 − ε_i) / ε_i ),
wherein α_i is the weight of the classification model obtained in the ith iteration; and wherein ε_i is the incorrectness rate of the classification model obtained in the ith iteration, with i = 1, 2, 3, ....
59. The memory of claim 56, wherein the update of the sample data weights by the classification model weights comprises:
updating the sample data weights according to:
w_{i,j} = ( w_{i−1,j} / Z_{i−1} ) · exp( −α_{i−1} · y_j · h_{i−1}(x_j) ),
wherein w_{i,j} is the weight of sample data j in the ith iteration;
wherein Z_{i−1} is a normalization factor of:
Z_{i−1} = Σ_{j=1}^{N} w_{i−1,j} · exp( −α_{i−1} · y_j · h_{i−1}(x_j) );
wherein α_{i−1} is the classification model weight of the (i−1)th iteration, with j = 1, 2, 3, ..., N; and wherein N is the total number of the sample data.
60. The memory of any one of claims 56 to 59, wherein the selection of the multiple target classification models from the multiple classification models obtained by iterations according to the incorrectness rates comprises:
comparing the incorrectness rate of the multiple classification models obtained by iterations with target model selection conditions, to identify the target classification models from the classification models satisfying the target model selection conditions.
61. The memory of any one of claims 56 to 59, wherein the acquisition of sample data comprises:
collecting raw data to extract feature information of the raw data;
counting data volume of the raw data associated with each feature information;
and filtering feature data based on the data volume of the raw data associated with the feature data to identify remaining feature information as the sample data.
62. The memory of any one of claims 56 to 59, wherein the classification of data to be classified by each target classification model, and the determination of the classification results of the data to be classified based on the weights of each target classification model, comprises:
classifying data to be classified by each target classification model to obtain preliminary classification results of the classification-pending data; and performing weighted calculation of the preliminary classification results according to the target classification model weights, to acquire classification results of the classification-pending data.
63. The memory of any one of claims 56 to 59, wherein the classification of the sample data by the classification model and the acquisition of incorrect classification results by correct classification information of the sample data and associated incorrectly judged samples for the incorrect results comprises:
classifying the sample data by the classification models to obtain classification results of the sample data;
comparing the classification results of the sample data with the correct classification information to obtain incorrect judgement results of the sample data; and determining the associated incorrectly judged samples for the incorrect results from the sample data.
64. The memory of any one of claims 56 to 63, wherein the sample data is acquired from the raw data related to service scenarios, wherein the initialization of the sample data weights includes receiving the same initial weight, pre-set by the user, for all sample data, by:
w_{0,j} = 1/N,
wherein w_{0,j} is the initial weight of the sample data j, and N is the number of sample data.
65. The memory of any one of claims 56 to 64, wherein, during the raw data acquisition and the feature information extraction of the raw data, the associated data is collected from service system databases.
66. The memory of any one of claims 56 to 65, wherein quantiles of the raw data under each feature information are determined during the counting process, wherein during the feature information filtration, the feature information is deleted according to different data types based on quantiles.
67. The memory of any one of claims 56 to 66, wherein the feature data filtration in the raw data acquisition further includes one or more of:
pre-selecting feature information based on IV values (information-value values);
removing feature information that cannot withstand the "penalty" of Lasso regression and Ridge regression;
calculating information significance by random forest; and verifying multicollinearity and removing feature information with correlation over a threshold.
68. The memory of any one of claims 56 to 67, wherein the target model selection conditions are rules based on a pre-set incorrectness rate threshold; wherein, when the incorrectness rate satisfies the target model selection conditions, the stochastic gradient descent iteration is terminated, and the model parameters generated in the current iteration are assigned as the estimated values of the optimized parameters for the classifier training; wherein, when the incorrectness rate does not satisfy the target model selection conditions, the process returns to redo the iterations and proceeds to update the model incorrectness rates; wherein an excessively high or excessively low incorrectness rate threshold affects the effectiveness and efficiency of the model parameter estimation; and wherein the incorrectness rate threshold is adjusted according to practical service conditions.
69. The memory of any one of claims 56 to 68, wherein the classification models iterate for M times, and M target classification models are obtained, wherein the multiple classification models are combined to create a strong classifier:
H(x) = sign( Σ_{m=1}^{M} α_m · h_m(x) ),
wherein H(x) is the strong classifier, M is the total number of target classification models, α_m is the weight of the target classification model m, and h_m(x) is the target classification model m; and wherein the total number of target classification models in the strong classifier is set according to service conditions.
70. The memory of any one of claims 56 to 69, wherein the sample data acquisition further comprises:
identifying missing values in the raw data;
deleting the feature information with high missing rate in the raw data;
identifying outliers in the raw data;
replacing the outliers with quantile data; and wherein the evaluation of feature information missing values is determined by a pre-set missing value threshold.
71. The memory of any one of claims 56 to 70, wherein the classification model adopts a logistic normalization classification model, known as the logistic classifier:
h(x) = 1 / ( 1 + e^{−(β_0 + β_1·x_1 + β_2·x_2 + ... + β_k·x_k)} ),
wherein β_i (i = 0, 1, 2, ..., k) are the initial values of the model parameters in the classification model;
wherein the sample data is related to the service scenarios, and wherein, by classifying the sample data using the classification model, the model parameters β_i (i = 0, 1, 2, ..., k) suitable for the service scenarios are obtained.
72. The memory of any one of claims 56 to 71, wherein the classification model incorrectness rate is the mathematical product of the number of the incorrectly judged samples and the weights of the incorrectly judged samples, calculated by the following equation:
ε_i = Σ_{j=1}^{N} w_{i,j} · I( h_i(x_j) ≠ y_j ),
wherein ε_i is the incorrectness rate of the ith iteration model;
wherein w_{i,j} is the weight of the jth sample data in the ith iteration;
wherein I( h_i(x_j) ≠ y_j ) indicates the incorrectly judged samples corresponding to the incorrect classification results in the ith iteration model; and N is the sample number, with i = 1, 2, 3, ..., and j = 1, 2, 3, ..., N.
73. The memory of any one of claims 56 to 72, wherein the sample initial weights of the current data are all equal, wherein the adjustment decreases the weights of correctly classified sample data and increases the weights of incorrectly judged sample data.
CA3144411A 2020-12-31 2021-12-30 Data classification method, device and system Pending CA3144411A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011617742.0 2020-12-31
CN202011617742.0A CN112686312A (en) 2020-12-31 2020-12-31 Data classification method, device and system

Publications (1)

Publication Number Publication Date
CA3144411A1 true CA3144411A1 (en) 2022-06-30

Family

ID=75453718

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3144411A Pending CA3144411A1 (en) 2020-12-31 2021-12-30 Data classification method, device and system

Country Status (2)

Country Link
CN (1) CN112686312A (en)
CA (1) CA3144411A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843341A (en) * 2023-06-27 2023-10-03 湖南工程学院 Credit card abnormal data detection method, device, equipment and storage medium
CN117572105A (en) * 2023-03-02 2024-02-20 广东省源天工程有限公司 Hybrid detection device for hidden defects of power equipment

Also Published As

Publication number Publication date
CN112686312A (en) 2021-04-20
