CA3165582A1 - Data processing method and system based on similarity model - Google Patents

Data processing method and system based on similarity model

Info

Publication number
CA3165582A1
Authority
CA
Canada
Prior art keywords
discrete
data
factors
label
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3165582A
Other languages
French (fr)
Inventor
Xiang QIAN
Chengcheng XIA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 10353744 Canada Ltd filed Critical 10353744 Canada Ltd
Publication of CA3165582A1 publication Critical patent/CA3165582A1/en
Pending legal-status Critical Current

Classifications

    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06Q – INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 – Commerce
    • G06Q30/02 – Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 – Market modelling; Market analysis; Collecting market data

Landscapes

  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A similarity model-based data processing method and system, which may effectively improve the conversion rate of customers at reduced costs by using similarity model-based data processing technical means. The method comprises: collecting a plurality of customer data; extracting continuous label data from each piece of customer data, and obtaining multiple groups of discrete label data after binning conversion; calculating the similarity distance for discrete factors in each group of discrete label data, while screening out multiple groups of new discrete label data consisting of discrete factors which contribute significantly; calculating the weight for the discrete factors in the new discrete label data respectively by using the random forest algorithm and the gradient boosting decision tree algorithm, and obtaining weighted results of multiple groups of discrete factors after weighted summation; and calculating the final similarity distance between each piece of customer data and positive sample data respectively by using the Manhattan distance algorithm according to the weighted result of each group of discrete factors and the similarity distance of each discrete factor.

Description

DATA PROCESSING METHOD AND SYSTEM BASED ON SIMILARITY MODEL
BACKGROUND OF THE INVENTION
Technical Field
[0001] The present invention relates to the field of big data analysis technology, and more particularly to a data processing method and a data processing system based on a similarity model.
Description of Related Art
[0002] Precision marketing is one of the core concepts of network marketing. It is based on precise positioning and relies on modern information technology, big data technology in particular, to create a customized customer communication and service system, to enhance the efficiency of the enterprise's communication with and service to customers, and to reduce operational cost.
Winning over and converting are the two principal processes of internet operation. Winning over means promoting internet products, exposing brands, and developing new users of the products. Converting means converting low-consumption-value users of internet products into high-value users, namely promoting users' consumption behaviors within the internet products so as to enhance the operational achievements of the enterprise.
[0003] Means for winning over and converting in the state of the art are mostly based on blind advertising promotion. It has been found in practice, however, that since the target users are indefinite, the large investment in advertising brings about only a limited number of users won over and converted. There is thus an obvious contradiction between the advertising cost as input and the conversion rate as acquired, exposing the deficiencies of high cost and low efficiency in the state of the art, where users are won over and converted through the mode of blind advertising promotion.
SUMMARY OF THE INVENTION
[0004] An objective of the present invention is to provide a data processing method and a data processing system based on a similarity model, which employ data processing technical means based on a similarity model and can effectively enhance the conversion rate of customers at decreased cost.
[0005] In order to achieve the above objective, according to one aspect, the present invention provides a data processing method based on a similarity model, the method comprises:
[0006] collecting plural pieces of customer data, wherein the customer data are positive-sample data or negative-sample data;
[0007] extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data;
[0008] sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors;
[0009] employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation;
[0010] employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors; and
[0011] screening out any potential customer according to the final similarity distances.
[0012] Preferably, the step of extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data includes:
[0013] performing label feature extraction on each piece of customer data, and obtaining plural groups of continuous label initial data;
[0014] performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom; and
[0015] employing an optimum binning strategy to perform optimum binning processing on the various pieces of continuous label data respectively, and correspondingly obtaining plural groups of discrete label data, wherein each group of discrete label data includes plural label features discrete from one another.
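The description does not fix a particular optimum binning algorithm. One common realization is supervised binning, in which a shallow decision tree fitted on a single continuous label feature supplies the bin edges; the sketch below assumes scikit-learn and uses synthetic data with a hypothetical label feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def optimum_bins(x, y, max_bins=4):
    """Supervised ("optimum") binning: fit a shallow decision tree on one
    continuous label feature and reuse its split thresholds as bin edges,
    so each bin separates positive from negative samples as well as it can."""
    tree = DecisionTreeClassifier(max_leaf_nodes=max_bins, min_samples_leaf=20)
    tree.fit(x.reshape(-1, 1), y)
    # Internal tree nodes carry feature index 0; leaf nodes are marked -2.
    edges = sorted(t for t, f in zip(tree.tree_.threshold, tree.tree_.feature)
                   if f == 0)
    return np.digitize(x, edges)  # one discrete factor: a bin index per customer

rng = np.random.default_rng(0)
x = rng.normal(size=300)                                    # continuous label data
y = (x + rng.normal(scale=0.5, size=300) > 0).astype(int)   # conversion flag
codes = optimum_bins(x, y)
```

Each customer's continuous value is thereby replaced by a discrete bin index, which is the discrete factor the later steps operate on.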
[0016] Preferably, the step of performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom includes:
[0017] cleaning and filtering invalid label features in the various groups of continuous label initial data sequentially in accordance with a missing rate filter condition, a quantile filter condition, and a proportion of categories filter condition of the label data, and correspondingly obtaining plural groups of continuous label data.
[0018] Preferably, the step of sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors includes:
[0019] employing an evidence weight algorithm to perform similarity distance calculation on variables of various discrete factors in one group of discrete label data;
[0020] calculating an IV value to which each discrete factor corresponds through an information value formula, and screening out discrete factors with high value degrees on the basis of the magnitudes of the IV values;

Date Recue/Date Received 2022-06-21
[0021] employing a Lasso regression algorithm to screen discrete factors with high identification degrees out of the discrete factors with high value degrees;
[0022] employing a ridge regression algorithm to further screen discrete factors with prominent importance out of the discrete factors with high identification degrees, and constituting plural groups of new discrete label data consisting of prominently contributive discrete factors; and
[0023] respectively invoking other groups of discrete label data to repeat the above calculating steps, and correspondingly obtaining plural groups of new discrete label data.
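The evidence weight (WOE) and information value steps above can be sketched as follows. This is a minimal illustration assuming pandas: the per-bin WOE plays the role of the discrete factor's similarity distance, and the IV decides whether the factor is prominently contributive. The subsequent Lasso and ridge regression screening stages, which would further filter on regression coefficient magnitudes, are omitted here.

```python
import numpy as np
import pandas as pd

def woe_iv(factor, target, eps=1e-6):
    """Weight of evidence per bin of one discrete factor, plus the factor's
    information value IV = sum((pos_share - neg_share) * WOE)."""
    tab = pd.crosstab(factor, target)
    pos_share = tab[1] / tab[1].sum()   # share of positive samples per bin
    neg_share = tab[0] / tab[0].sum()   # share of negative samples per bin
    woe = np.log((pos_share + eps) / (neg_share + eps))
    return woe, ((pos_share - neg_share) * woe).sum()

rng = np.random.default_rng(1)
target = pd.Series(rng.integers(0, 2, 400))
informative = (target + pd.Series(rng.integers(0, 2, 400))) // 2  # tracks target
noise = pd.Series(rng.integers(0, 4, 400))                        # unrelated

_, iv_informative = woe_iv(informative, target)
_, iv_noise = woe_iv(noise, target)
# A higher IV marks the factor as more prominently contributive.
```

On this synthetic data the factor correlated with the conversion flag receives a far larger IV than the unrelated one, which is exactly the screening criterion described above.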
[0024] Preferably, the step of employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation includes:
[0025] selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as an independent variable, and employing the random forest algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data;
[0026] selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as an independent variable, and employing the gradient boosting decision tree algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data; and
[0027] performing weighted assignment on the importance indices of the various variables of the discrete factors obtained by employing the random forest algorithm and on the importance indices of the various variables of the discrete factors obtained by employing the gradient boosting decision tree algorithm in the same piece of discrete label data, and thereafter performing summation to obtain weight results of plural groups of discrete factors.
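A possible realization of the dual-model weight calculation with scikit-learn is sketched below on synthetic discrete factors. The equal 0.5/0.5 weighting of the two importance vectors is an assumption; the description fixes only that a weighted summation of the two results is taken.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(2)
# Five discrete factors (binned label features) for 300 customers; the
# positive/negative conversion flag is the target variable.
X = rng.integers(0, 4, size=(300, 5)).astype(float)
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 1.5).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)

# Weighted summation of the two importance vectors (0.5/0.5 assumed),
# normalised so the weight results of the factors sum to one.
raw = 0.5 * rf.feature_importances_ + 0.5 * gb.feature_importances_
weights = raw / raw.sum()
```

Since only the first factor drives the synthetic target, both models assign it the largest importance, and the weighted summation preserves that ranking.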
[0028] Preferably, the step of employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors includes:
[0029] multiplying the weight results of the various groups of discrete factors with the similarity distances of the various discrete factors, and calculating a similarity distance between each discrete factor in the customer data and the positive-sample data; and
[0030] employing the Manhattan distance algorithm to summate the similarity distances of all discrete factors in each piece of customer data, and obtaining a final similarity distance between each piece of customer data and the positive-sample data.
[0031] Exemplarily, the step of screening out any potential customer according to the final similarity distances includes:
[0032] arranging the final similarity distances in ascending order of magnitude, screening out the top-ranking N pieces of customer data, and marking the same as potential customers.
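Putting the last two steps together: each factor's similarity distance is scaled by its weight result, the weighted distances are summed in the L1 (Manhattan) sense, and the customers with the smallest final distances are marked as potential customers. The distances and weights below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n_customers, n_factors, N = 200, 4, 10

# Per-factor similarity distances to the positive-sample data (e.g. the
# WOE-based scores from the earlier step; smaller means more similar).
dist = rng.random((n_customers, n_factors))
weights = np.array([0.4, 0.3, 0.2, 0.1])   # weight result per discrete factor

# Manhattan combination: weight each factor's distance, then sum per customer.
final = np.abs(dist * weights).sum(axis=1)

# Rank ascending (smallest final distance = closest to converted customers)
# and mark the top N customers as potential customers.
potential = np.argsort(final)[:N]
```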
[0033] In comparison with prior-art technology, the data processing method based on a similarity model provided by the present invention achieves the following advantageous effects:
[0034] In the data processing method based on a similarity model provided by the present invention, plural pieces of customer data are collected to construct a dataset that contains positive-sample data of converted customers and negative-sample data of unconverted customers. The label data of each piece of customer data in the dataset is then correspondingly output to obtain plural groups of continuous label data. At this point, in order to verify each label feature in the continuous label data, namely the prominence of the contribution of each discrete factor to the model, it is further necessary to employ a binning transformation method to subject the various groups of continuous label data to discrete processing and correspondingly obtain plural groups of discrete label data, in which one discrete factor in the discrete label data represents one label feature. By performing similarity distance calculation on the discrete factors in each group of discrete label data, the various discrete factors are scored; for instance, the smaller the value of the calculation result of a discrete factor, the closer is that discrete factor's contribution degree to the positive-sample data, and conversely, the larger the value, the farther is its contribution degree from the positive-sample data. Once the similarity distance calculations on the discrete factors in the various groups of discrete label data have been completed, obviously invalid discrete factors are eliminated from the various groups of discrete label data to form plural groups of prominently contributive discrete label data. The random forest algorithm and the gradient boosting decision tree algorithm are thereafter respectively employed to calculate importance indices of the variables of the various discrete factors in each group of discrete label data, and weighted summation is performed on the calculation results of the two algorithms to obtain weight results of the discrete factors. The Manhattan distance algorithm is finally employed to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and the similarity distances of the various discrete factors, so as to realize a value estimation of each piece of customer data. As is easily understood, the smaller the final similarity distance, the closer the customer data is to the positive-sample data and the higher the value of the customer; in other words, such customers are more likely to be converted. Conversely, the larger the final similarity distance, the farther the customer data is from the positive-sample data and the lower the value of the customer; such customers are less likely to be converted. It is thus possible to screen out potential customers that meet the requirement according to the final similarity distance of each customer, and hence to carry out precision marketing on them.
[0035] Seen as such, the present invention brings about the following technical effects for the winning over and converting of platform businesses:
[0036] Through the design of a customer value degree appraising function, it is made possible to provide the marketing activities of platforms with customer data support; relative to the blind advertising promotion in the state of the art, the present invention markedly reduces the promotion cost of marketing activities while enhancing the conversion rate of customers, and guarantees the effects of the marketing activities.
[0037] Use of the similarity model can pertinently calculate the final similarity distance of each piece of customer data in accordance with label features in different customer data, hence appraise the value degree of each piece of customer data, and accurately screen out potential high-value customers.
[0038] According to another aspect, the present invention provides a data processing system based on a similarity model, the system is applied in the data processing method based on a similarity model as recited in the foregoing technical solution, and the system comprises:
[0039] an information collecting unit, for collecting plural pieces of customer data, wherein the customer data are positive-sample data or negative-sample data;
[0040] a binning transforming unit, for extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data;
[0041] a label screening unit, for sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors;
[0042] a weight calculating unit, for employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation;
[0043] a similarity distance calculating unit, for employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors; and
[0044] a marketing unit, for screening out any potential customer according to the final similarity distances.
[0045] Preferably, the binning transforming unit includes:
[0046] an initial data extracting module, for performing label feature extraction on each piece of customer data, and obtaining plural groups of continuous label initial data;
[0047] a data cleaning module, for performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom; and
[0048] a binning processing module, for employing an optimum binning strategy to perform optimum binning processing on the various pieces of continuous label data respectively, and correspondingly obtaining plural groups of discrete label data, wherein each group of discrete label data includes plural label features discrete from one another.
[0049] Preferably, the label screening unit includes:
[0050] an evidence weight algorithm module, for employing an evidence weight algorithm to perform similarity distance calculation on variables of various discrete factors in one group of discrete label data;
[0051] an information value calculating module, for calculating an IV value to which each discrete factor corresponds through an information value formula, and screening out discrete factors with high value degrees on the basis of the magnitudes of the IV values;
[0052] a Lasso regression algorithm module, for employing a Lasso regression algorithm to screen discrete factors with high identification degrees out of the discrete factors with high value degrees; and
[0053] a ridge regression algorithm module, for employing a ridge regression algorithm to further screen discrete factors with prominent importance out of the discrete factors with high identification degrees, and constituting plural groups of new discrete label data consisting of prominently contributive discrete factors.

[0054] Preferably, the weight calculating unit includes:
[0055] a random forest algorithm module, for selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as an independent variable, and employing a random forest algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data;
[0056] a gradient boosting decision tree algorithm module, for selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as an independent variable, and employing a gradient boosting decision tree algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data; and
[0057] a weighted assignment module, for performing weighted assignment on the importance indices of the various variables of the discrete factors obtained by employing the random forest algorithm and on the importance indices of the various variables of the discrete factors obtained by employing the gradient boosting decision tree algorithm in the same piece of discrete label data, and thereafter performing summation to obtain weight results of plural groups of discrete factors.
[0058] Preferably, the similarity distance calculating unit includes:
[0059] a label feature similarity distance module, for multiplying the weight results of the various groups of discrete factors with the similarity distances of the various discrete factors, and calculating a similarity distance between each discrete factor in the customer data and the positive-sample data; and
[0060] a customer data similarity distance module, for employing the Manhattan distance algorithm to summate the similarity distances of all discrete factors in each piece of customer data, and obtaining a final similarity distance between each piece of customer data and the positive-sample data.
[0061] In comparison with prior-art technology, the advantageous effects achieved by the data processing system based on a similarity model provided by the present invention are identical with the advantageous effects achievable by the data processing method based on a similarity model provided by the foregoing technical solution, so these are not redundantly described in this context.
BRIEF DESCRIPTION OF THE DRAWINGS
[0062] The drawings described here are meant to provide further understanding of the present invention, and constitute part of the present invention. The exemplary embodiments of the present invention and the descriptions thereof are meant to explain the present invention, rather than to restrict the present invention. In the drawings:
[0063] Fig. 1 is a flowchart schematically illustrating a data processing method based on a similarity model in Embodiment 1 of the present invention;
[0064] Fig. 2 is an exemplary view illustrating customer data in Fig. 1; and
[0065] Fig. 3 is a block diagram illustrating the structure of a data processing system based on a similarity model in Embodiment 2 of the present invention.
[0066] Reference numerals:
[0067] 1 – information collecting unit 2 – binning transforming unit
[0068] 3 – label screening unit 4 – weight calculating unit
[0069] 5 – similarity distance calculating unit 6 – marketing unit
[0070] 21 – initial data extracting module 22 – data cleaning module
[0071] 23 – binning processing module 31 – evidence weight algorithm module
[0072] 32 – information value calculating module 33 – Lasso regression algorithm module
[0073] 34 – ridge regression algorithm module 41 – random forest algorithm module
[0074] 42 – gradient boosting decision tree algorithm module 43 – weighted assignment module
[0075] 51 – label feature similarity distance module 52 – customer data similarity distance module

DETAILED DESCRIPTION OF THE INVENTION
[0076] To make more lucid and clear the objectives, features and advantages of the present invention, the technical solutions in the embodiments of the present invention are clearly and comprehensively described below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the embodiments as described are merely partial, rather than the entire, embodiments of the present invention.
All other embodiments obtainable by persons ordinarily skilled in the art on the basis of the embodiments in the present invention without spending creative effort shall all fall within the protection scope of the present invention.
[0077] Embodiment 1
[0078] Fig. 1 is a flowchart schematically illustrating a data processing method based on a similarity model in Embodiment 1 of the present invention. Referring to Fig. 1, the data processing method based on a similarity model provided by this embodiment comprises:
[0079] collecting plural pieces of customer data, wherein the customer data are positive-sample data or negative-sample data; extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data; sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors; employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation; employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors; and screening out any potential customer according to the final similarity distances.
[0080] In the data processing method based on a similarity model provided by this embodiment, plural pieces of customer data are collected to construct a dataset that contains positive-sample data of converted customers and negative-sample data of unconverted customers. The label data of each piece of customer data in the dataset is then correspondingly output to obtain plural groups of continuous label data. At this point, in order to verify each label feature in the continuous label data, namely the prominence of the contribution of each discrete factor to the model, it is further necessary to employ a binning transformation method to subject the various groups of continuous label data to discrete processing and correspondingly obtain plural groups of discrete label data, in which one discrete factor in the discrete label data represents one label feature. By performing similarity distance calculation on the discrete factors in each group of discrete label data, the various discrete factors are scored; for instance, the smaller the value of the calculation result of a discrete factor, the closer is that discrete factor's contribution degree to the positive-sample data, and conversely, the larger the value, the farther is its contribution degree from the positive-sample data. Once the similarity distance calculations on the discrete factors in the various groups of discrete label data have been completed, obviously invalid discrete factors are eliminated from the various groups of discrete label data to form plural groups of prominently contributive discrete label data. The random forest algorithm and the gradient boosting decision tree algorithm are thereafter respectively employed to calculate importance indices of the variables of the various discrete factors in each group of discrete label data, and weighted summation is performed on the calculation results of the two algorithms to obtain weight results of the discrete factors. The Manhattan distance algorithm is finally employed to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and the similarity distances of the various discrete factors, so as to realize a value estimation of each piece of customer data. As is easily understood, the smaller the final similarity distance, the closer the customer data is to the positive-sample data and the higher the value of the customer; in other words, such customers are more likely to be converted. Conversely, the larger the final similarity distance, the farther the customer data is from the positive-sample data and the lower the value of the customer; such customers are less likely to be converted. It is thus possible to screen out potential customers that meet the requirement according to the final similarity distance of each customer, and hence to carry out precision marketing on them.
[0081] Seen as such, this embodiment brings about the following technical effects for the winning over and converting of platform businesses:
[0082] 1. Through the design of a customer value degree appraising function, it is made possible to provide the marketing activities of platforms with customer data support; relative to the blind advertising promotion in the state of the art, this embodiment markedly reduces the promotion cost of marketing activities while enhancing the conversion rate of customers, and guarantees the effects of the marketing activities.
[0083] 2. Use of the similarity model can pertinently calculate the final similarity distance of each piece of customer data in accordance with label features in different customer data, hence appraise the value degree of each piece of customer data, and accurately screen out potential high-value customers.
[0084] To facilitate comprehension, please refer to Fig. 2; financial-platform financing is taken as an example for description. Customer data can be collected from a database of the financial platform, in which positive-sample data means the data of quality customers who have bought financing products, while negative-sample data means the data of common customers who have not bought any financing product. During the process of collecting positive-sample data and negative-sample data, a timeline point is first selected, and a period of time after the timeline point is then taken as a performance period; data of customers who have bought financing products within the performance period is defined as positive-sample data, and data of customers who have not bought any financing product within the performance period is defined as negative-sample data. More specifically, the positive-sample data and the negative-sample data both contain identification feature attribute discrete factors, such as Yihubao account numbers, member genders, and member birth dates; historical consumption behavior attribute discrete factors, such as latest shopping payment dates, latest water fee recharging dates, and latest electricity fee recharging dates; member assets status attribute discrete factors, such as recent subscription amounts at Change Treasure, recent subscription amounts for funds, and subscription amounts for periodical financing; and online behavior trajectory attribute discrete factors, such as numbers of in-depth financing pages accessed by members, numbers of in-depth crowd-funding pages accessed by members, and numbers of in-depth insurance pages accessed by members.
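The performance-period labelling described above can be sketched with pandas. The customer identifiers, purchase dates, and 90-day period length below are all hypothetical placeholders, not values from the description.

```python
import pandas as pd

# Hypothetical purchase log: one row per customer, with the date on which
# the customer bought a financing product (NaT = never bought one).
purchases = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "bought_on": pd.to_datetime(["2021-03-01", None, "2021-06-15", "2021-12-01"]),
})

timeline = pd.Timestamp("2021-04-01")          # the selected timeline point
period_end = timeline + pd.Timedelta(days=90)  # performance period (length assumed)

# Positive sample: bought within the performance period; everyone else
# (no purchase, or a purchase outside the period) is a negative sample.
in_period = purchases["bought_on"].between(timeline, period_end)
purchases["sample"] = in_period.map({True: "positive", False: "negative"})
```

Note that `between` treats a missing date (NaT) as outside the period, so customers who never bought a product fall into the negative sample automatically.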
[0085] The method of extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data in the foregoing embodiment includes:
[0086] performing label feature extraction on each piece of customer data, and obtaining plural groups of continuous label initial data; performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom; employing an optimum binning strategy to perform optimum binning processing on the various pieces of continuous label data respectively, and correspondingly obtaining plural groups of discrete label data, wherein each group of discrete label data includes plural label features discrete from one another.

Date Recue/Date Received 2022-06-21
[0087] Specifically, the method of performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom includes: cleaning and filtering invalid label features in the various groups of continuous label initial data sequentially in accordance with a missing rate filter condition, a quantile filter condition, and a proportion of categories filter condition of the label data, and correspondingly obtaining plural groups of continuous label data.
[0088] During the process of specific implementation, all the label features in the various groups of continuous label initial data are firstly counted, and label features that do not satisfy the missing rate filter condition are then cleaned away; for instance, the missing rate filter condition can be so set that label features with a missing rate exceeding 90% are cleaned away. Afterwards, label features that do not satisfy the quantile filter condition are cleaned away from the remaining label features; for instance, the quantile filter condition can be so set that label features with a quantile smaller than or equal to 0.1 are cleaned away. Thereafter, label features that do not satisfy the proportion of categories filter condition are cleaned away from the remaining label features, and continuous label data is finally output. The above steps are repeated to perform data cleaning with respect to the various groups of continuous label initial data respectively, and plural groups of continuous label data can be correspondingly obtained.
This embodiment makes it possible to remove invalid label features through the data cleaning steps, thereby preventing the resulting noise from reducing the precision of the model.
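As an illustration of the cleaning steps above, the missing rate filter can be sketched as follows. This is a minimal sketch and not part of the disclosure: the feature names, the use of None for missing values, and the 90% threshold are illustrative assumptions; the quantile and proportion-of-categories filters would follow the same pattern.

```python
def missing_rate(values):
    """Fraction of entries in a label feature that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def clean_features(features, max_missing=0.9):
    """Retain only label features whose missing rate is within the threshold."""
    return {name: vals for name, vals in features.items()
            if missing_rate(vals) <= max_missing}

# Hypothetical continuous label initial data for five customers.
raw = {
    "latest_shopping_payment_gap": [3, None, 12, 45, 7],  # 20% missing
    "sparse_feature": [None, None, None, None, 1],        # 80% missing
    "empty_feature": [None, None, None, None, None],      # 100% missing
}
kept = clean_features(raw)
```

With the 90% threshold, only `empty_feature` is cleaned away; the other two features are retained as continuous label data.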
[0089] Moreover, the method of employing an optimum binning strategy to perform optimum binning processing on the various pieces of continuous label data respectively, and correspondingly obtaining plural groups of discrete label data includes the following:
[0090] The optimum binning strategy is employed with respect to the continuous label data, i.e., the attribute of the positive-sample data or the negative-sample data is used as a dependent variable, each continuous variable (label feature) serves as an independent variable, and a conditional inference tree algorithm is employed to discretize the continuous variables. It is firstly supposed that all independent variables and dependent variables are independent, a chi-square independence test is subsequently carried out thereon, independent variables with a P value smaller than a set threshold are screened out, and split points are finally selected from each screened independent variable through permutation testing, whereby the objective of discretizing the continuous variables and finally forming discrete label data is achieved. As should be stressed, use of the optimum binning strategy to discretize continuous variables pertains to technical means frequently employed in this field of technology, so it is not redundantly described in this embodiment.
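Once split points have been selected, applying them to discretize a continuous label feature can be sketched as follows. This sketch assumes the split points have already been chosen (the conditional inference tree step is not shown), and the split values (10 and 30 days) are the illustrative classifications from the embodiment below.

```python
import bisect

def discretize(value, split_points):
    """Map a continuous value to a bin index.

    Bins are (-inf, s1], (s1, s2], ..., (sk, inf): a value equal to a
    split point falls into the lower bin.
    """
    return bisect.bisect_left(split_points, value)

# Hypothetical split points for "days since latest shopping payment":
# bin 0 = within 10 days, bin 1 = within 30 days, bin 2 = beyond 30 days.
splits = [10, 30]
bins = [discretize(v, splits) for v in (5, 10, 15, 45)]
```

Here `bins` comes out as `[0, 0, 1, 2]`: 5 and 10 days fall in the first classification, 15 in the second, and 45 in the third.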
[0091] Specifically, the method of sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors in the foregoing embodiment includes:
[0092] employing an evidence weight algorithm to perform similarity distance calculation on variables of various discrete factors in one group of discrete label data; calculating an IV value to which each discrete factor corresponds through an information value formula, and screening out discrete factors with high value degrees on the basis of sizes of the IV values; employing a Lasso regression algorithm to screen discrete factors with high identification degrees out of the discrete factors with high value degrees; employing a ridge regression algorithm to further screen discrete factors with prominent importance out of the discrete factors with high identification degrees, and constituting plural groups of new discrete label data consisting of prominently contributive discrete factors; and respectively invoking other groups of discrete label data to repeat the above calculating steps, and correspondingly obtaining plural groups of new discrete label data.

[0093] During specific implementation, the evidence weight algorithm in this embodiment indicates the WOE (weight of evidence) algorithm, the use of which can score the variables of the various discrete factors in the discrete label data; the smaller the score of a variable of a discrete factor is, the higher its contribution to the positive sample will be. After the variables of the discrete factors have been scored, it is further needed to normalize them to form the similarity distance WOE_ij, in which i expresses the ith discrete factor (label feature) and j expresses the jth variable (the variable here can also be understood as a classification) in the ith discrete factor. The variable is a further definitive description of the discrete factor; for instance, when the discrete factor is member gender, it can be further classified into two types, the first type being male and the second type being female;
alternatively, when the discrete factor is a date, such as the latest shopping payment date, its further definition can be classified according to time lengths from a timeline point: the first type is within 10 days therefrom, the second type is within 30 days therefrom, and the third type is beyond 30 days therefrom. When the discrete factor is numerical, such as the recent subscription amount at Change Treasure, its further definition can be classified according to a numerical gradient: for instance, the first type is within an amount of 5,000 yuan RMB, the second type is within an amount of 50,000 yuan RMB, and the third type is beyond an amount of 50,000 yuan RMB. The result of WOE_ij is in the range of [0, 1] on completion of the calculation. In practical operation, the number of classifications can be specifically set according to actual circumstances, and this embodiment makes no redundant description thereof. In addition, the evidence weight algorithm is an existing algorithm in this field of technology; however, in order to facilitate comprehension, the specific formula is given in this embodiment for explanation thereof:
[0094]    WOE_ij = ln(p0_ij / p1_ij) = ln( (#0_ij / #0_iT) / (#1_ij / #1_iT) )
[0095] where WOE_ij expresses the score of the jth variable in the ith discrete factor, p0_ij expresses the probability of the jth variable in the ith discrete factor being a negative sample, p1_ij expresses the probability of the jth variable in the ith discrete factor being a positive sample, #0_ij expresses the number of negative samples of the jth variable in the ith discrete factor, #0_iT expresses the total number of negative samples in the ith discrete factor, #1_ij expresses the number of positive samples of the jth variable in the ith discrete factor, and #1_iT expresses the total number of positive samples in the ith discrete factor.
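Under the definitions above, the WOE score can be sketched term by term as follows. The counts are hypothetical, and the subsequent normalization of the scores into [0, 1] is omitted.

```python
import math

def woe(n0_ij, n0_iT, n1_ij, n1_iT):
    """WOE_ij = ln(p0_ij / p1_ij) = ln((#0_ij / #0_iT) / (#1_ij / #1_iT))."""
    p0 = n0_ij / n0_iT  # probability of the jth variable being a negative sample
    p1 = n1_ij / n1_iT  # probability of the jth variable being a positive sample
    return math.log(p0 / p1)

# Hypothetical counts: variable j covers 30 of 100 negative samples
# and 10 of 100 positive samples in discrete factor i.
score = woe(30, 100, 10, 100)  # ln(0.3 / 0.1) = ln(3)
```

A positive score here means the variable is relatively more concentrated among negative samples; per the discussion above, smaller scores indicate a higher contribution to the positive sample.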
[0096] After the similarity distance of each discrete factor has been calculated, it is needed to further calculate the IV (information value) value of each discrete factor, and the IV value calculation formula is as follows:
[0097]    IV_i = Σ_(j=1)^(n) (p0_ij - p1_ij) * WOE_ij
[0098] where n expresses the total number of variables in discrete factor i, and j expresses the jth variable in discrete factor i.
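The IV formula above can be sketched directly; the proportions and WOE scores below are hypothetical values for a two-variable discrete factor.

```python
import math

def information_value(p0, p1, woes):
    """IV_i = sum over j of (p0_ij - p1_ij) * WOE_ij."""
    return sum((a - b) * w for a, b, w in zip(p0, p1, woes))

# Hypothetical discrete factor with two variables (classifications).
p0 = [0.3, 0.7]                        # negative-sample proportions per variable
p1 = [0.1, 0.9]                        # positive-sample proportions per variable
woes = [math.log(0.3 / 0.1), math.log(0.7 / 0.9)]  # WOE_ij per variable
iv = information_value(p0, p1, woes)
```

Note that each summand is non-negative, since (p0_ij - p1_ij) and WOE_ij = ln(p0_ij / p1_ij) always share the same sign; a larger IV therefore indicates a discrete factor with a higher value degree.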
[0099] After the IV value of each discrete factor has been calculated, the Lasso regression algorithm is then employed to calculate identification degrees of the various label features, and discrete factors with high identification degrees are screened out therefrom; optionally, the screening condition for identification degrees is to find the minimum λ that satisfies the condition, and to retain the discrete factors that satisfy the minimum λ so as to form a variable set. Subsequently, the ridge regression algorithm is employed to screen out discrete factors with prominent importance from the variable set, and the screening condition for importance is to screen out discrete factors with a P value < 0.1. Through the aforementioned three rounds of screening, prominently contributive discrete label data are finally retained, and the discrete factors remaining at this time can be generally classified into three large types, which are, respectively, customers' own attributes, customer accessing behaviors, and customer transaction behaviors. As is understandable, λ is a Lagrange multiplier representing the coefficient of the first-order (L1) norm penalty term in the Lasso regression algorithm.
[0100] As should be noted, both the Lasso regression algorithm and the ridge regression algorithm are regression algorithms frequently employed by persons skilled in this field of technology, and their specific formulae are not redundantly described in this context.
[0101] Preferably, the method of employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation in the foregoing embodiment includes:
[0102] selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing the random forest algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data; selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing the gradient boosting decision tree algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data; and performing weighted assignment on the importance indices of the various variables of the discrete factors obtained by employing the random forest algorithm and on the importance indices of the various variables of the discrete factors obtained by employing the gradient boosting decision tree algorithm in the same piece of discrete label data, and thereafter summating the same data to obtain weight results of plural groups of discrete factors.
[0103] During specific implementation, the random forest algorithm is employed to classify the discrete factors in each group of discrete label data and to obtain the importance indices (W_rf1, W_rf2, ..., W_rfn) to which the various variables of each discrete factor correspond; the gradient boosting decision tree (GBDT) algorithm is further employed at the same time to classify the discrete factors in each group of discrete label data and to obtain the importance indices (W_GBDT1, W_GBDT2, ..., W_GBDTn) to which the various variables of each discrete factor correspond, and weighted assignment is thereafter performed on the same discrete label data. Preferably, a weight of 0.3 is assigned to the importance indices obtained by employing the random forest algorithm and a weight of 0.7 is assigned to the importance indices obtained by employing the gradient boosting decision tree algorithm, and the weight results
(W_1, W_2, ..., W_n) = 0.3 * (W_rf1, W_rf2, ..., W_rfn) + 0.7 * (W_GBDT1, W_GBDT2, ..., W_GBDTn)
of the various variables of the discrete factor can be obtained after summation. The random forest algorithm and the gradient boosting decision tree algorithm are both algorithmic formulae frequently employed by persons skilled in this field of technology, and are hence not redundantly described in this embodiment.
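The weighted combination step itself (the training of the two models aside) can be sketched as follows; the importance indices are hypothetical, and the 0.3/0.7 weights are the preferred values stated above.

```python
def combine_importances(w_rf, w_gbdt, a_rf=0.3, a_gbdt=0.7):
    """Weight result W_k = 0.3 * W_rf_k + 0.7 * W_GBDT_k, element-wise."""
    return [a_rf * r + a_gbdt * g for r, g in zip(w_rf, w_gbdt)]

# Hypothetical importance indices for three variables of one discrete factor.
w_rf = [0.5, 0.3, 0.2]      # from the random forest algorithm
w_gbdt = [0.4, 0.4, 0.2]    # from the gradient boosting decision tree algorithm
weights = combine_importances(w_rf, w_gbdt)
```

Because both input index lists sum to 1 and the two assigned weights sum to 1, the combined weight results also sum to 1.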
[0104] Moreover, the method of employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors in the foregoing embodiment includes:
[0105] multiplying the weight results of the various groups of discrete factors with the similarity distances of the various discrete factors, and calculating a similarity distance between each discrete factor in the customer data and the positive-sample data; and employing the Manhattan distance algorithm to summate the similarity distances of all discrete factors in each piece of customer data, and obtaining a final similarity distance between each piece of customer data and the positive-sample data.
[0106] During specific implementation, the final weights (W_1, W_2, ..., W_n) of the various variables in each discrete factor are multiplied with the WOE scores WOE_ij of the various variables in the discrete factor (W_i * WOE_ij) to obtain the similarity distance between the customer and the positive sample on a single discrete factor. The Manhattan distance algorithm is thereafter employed to summate the similarity distances of all discrete factors in each piece of customer data, to obtain a final similarity distance between each piece of customer data and the positive-sample data. The Manhattan distance algorithm formula is as follows:
[0107]    distance = Σ_(i=1)^(n) Σ_j W_ij * WOE_ij * I_ij
[0108] Here, n expresses the number of discrete factors in the discrete label data, and I_ij expresses the value of the jth classification of the corresponding ith discrete factor in the positive-sample data, in which I_ij represents an indicator valuated as 0 or 1; for instance, when the ith discrete factor (such as gender) of a male member user takes the jth value (male), the corresponding I_ij (I_gender,male) is valuated as 1, and the other variables of the ith discrete factor (such as I_gender,female) are valuated as 0.
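Putting the weights, WOE scores, and indicators together, the final similarity distance can be sketched as below. The two-factor example data are hypothetical; each indicator row has exactly one 1 marking the classification the customer actually takes.

```python
def final_distance(weights, woe_scores, indicators):
    """distance = sum over factors i and variables j of W_ij * WOE_ij * I_ij."""
    total = 0.0
    for w_i, s_i, m_i in zip(weights, woe_scores, indicators):
        total += sum(w * s * m for w, s, m in zip(w_i, s_i, m_i))
    return total

# Hypothetical customer: two discrete factors, with per-variable weights,
# normalized WOE scores, and 0/1 indicators selecting one variable per factor.
weights = [[0.6, 0.4], [0.3, 0.5, 0.2]]
woe_scores = [[0.2, 0.8], [0.1, 0.5, 0.9]]
indicators = [[1, 0], [0, 1, 0]]
distance = final_distance(weights, woe_scores, indicators)
```

The indicator zeroes out every classification the customer does not take, so each factor contributes exactly one weighted WOE term to the Manhattan sum (here 0.6 * 0.2 plus 0.5 * 0.5).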
[0109] Specifically, the method of screening out any potential customer according to the final similarity distances in the foregoing embodiment includes: arranging the final similarity distances in ascending order of value, screening out the top-ranking N pieces of customer data, and marking the same as potential customers. Preferably, N is set to 5,000; the 5,000 customers with the smallest final similarity distances are then searched for and marked as "potential quality customers", and precision marketing is thereafter performed thereon, so as to entice them to purchase products of the platform.
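The final screening step amounts to an ascending sort and a top-N cut, which can be sketched as follows (customer identifiers and distances are hypothetical, and N is reduced from 5,000 to 2 for illustration):

```python
def screen_potential_customers(final_distances, n):
    """Sort customers by final similarity distance, ascending (a smaller
    distance means closer to the positive sample), and keep the top n."""
    ranked = sorted(final_distances.items(), key=lambda kv: kv[1])
    return [customer for customer, _ in ranked[:n]]

final_distances = {"cust_a": 0.91, "cust_b": 0.22, "cust_c": 0.48}
potential = screen_potential_customers(final_distances, 2)
```

Here `cust_b` and `cust_c`, having the smallest final similarity distances, are marked as potential quality customers for precision marketing.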
[0110] Embodiment 2
[0111] Please refer to Fig. 1 and Fig. 3, this embodiment provides a data processing system based on a similarity model, the system comprises:
[0112] an information collecting unit 1, for collecting plural pieces of customer data, wherein the customer data are positive-sample data or negative-sample data;
[0113] a binning transforming unit 2, for extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data;
[0114] a label screening unit 3, for sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors;
[0115] a weight calculating unit 4, for employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation;
[0116] a similarity distance calculating unit 5, for employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors; and
[0117] a marketing unit 6, for screening out any potential customer according to the final similarity distances.
[0118] Specifically, the binning transforming unit 2 includes:
[0119] an initial data extracting module 21, for performing label feature extraction on each piece of customer data, and obtaining plural groups of continuous label initial data;
[0120] a data cleaning module 22, for performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom; and
[0121] a binning processing module 23, for employing an optimum binning strategy to perform optimum binning processing on the various pieces of continuous label data respectively, and correspondingly obtaining plural groups of discrete label data, wherein each group of discrete label data includes plural label features discrete from one another.
[0122] Specifically, the label screening unit 3 includes:
[0123] an evidence weight algorithm module 31, for employing an evidence weight algorithm to perform similarity distance calculation on variables of various discrete factors in one group of discrete label data;

[0124] an information value calculating module 32, for calculating an IV value to which each discrete factor corresponds through an information value formula, and screening out discrete factors with high value degrees on the basis of sizes of the IV values;
[0125] a Lasso regression algorithm module 33, for employing a Lasso regression algorithm to screen discrete factors with high identification degrees out of the discrete factors with high value degrees; and
[0126] a ridge regression algorithm module 34, for employing a ridge regression algorithm to further screen discrete factors with prominent importance out of the discrete factors with high identification degrees, and constituting plural groups of new discrete label data consisting of prominently contributive discrete factors.
[0127] Specifically, the weight calculating unit 4 includes:
[0128] a random forest algorithm module 41, for selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing a random forest algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data;
[0129] a gradient boosting decision tree algorithm module 42, for selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing a gradient boosting decision tree algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data; and
[0130] a weighted assignment module 43, for performing weighted assignment on the importance indices of the various variables of the discrete factors obtained by employing the random forest algorithm and on the importance indices of the various variables of the discrete factors obtained by employing the gradient boosting decision tree algorithm in the same piece of discrete label data, and thereafter performing summation to obtain weight results of plural groups of discrete factors.
[0131] Specifically, the similarity distance calculating unit 5 includes:

[0132] a label feature similarity distance module 51, for multiplying the weight results of the various groups of discrete factors with the similarity distances of the various discrete factors, and calculating a similarity distance between each discrete factor in the customer data and the positive-sample data; and
[0133] a customer data similarity distance module 52, for employing the Manhattan distance algorithm to summate the similarity distances of all discrete factors in each piece of customer data, and obtaining a final similarity distance between each piece of customer data and the positive-sample data.
[0134] In comparison with prior-art technology, the advantageous effects achieved by the data processing system based on a similarity model provided by this embodiment of the present invention are identical with the advantageous effects achievable by the data processing method based on a similarity model provided by the foregoing Embodiment 1, so these are not redundantly described in this context.
[0135] As understandable to persons ordinarily skilled in the art, the entire or partial steps realizing the method of the present invention can be completed via a program that instructs relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, includes the various steps of the method in the foregoing embodiment, while the storage medium can be a ROM/RAM, a magnetic disk, an optical disk, or a memory card, etc.
[0136] What the above describes is merely directed to specific modes of execution of the present invention, but the protection scope of the present invention is not restricted thereby. Any change or replacement easily conceivable to persons skilled in the art within the technical range disclosed by the present invention shall be covered by the protection scope of the present invention. Accordingly, the protection scope of the present invention shall be based on the protection scope as claimed in the Claims.


Claims (12)

What is claimed is:
1. A data processing method based on a similarity model, characterized in comprising:
collecting plural pieces of customer data, wherein the customer data are positive-sample data or negative-sample data;
extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data;
sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors;
employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation;
employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors; and screening out any potential customer according to the final similarity distances.
2. The method according to Claim 1, characterized in that the step of extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data includes:
performing label feature extraction on each piece of customer data, and obtaining plural groups of continuous label initial data;
performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom; and employing an optimum binning strategy to perform optimum binning processing on the various pieces of continuous label data respectively, and correspondingly obtaining plural groups of discrete label data, wherein each group of discrete label data includes plural label features discrete from one another.
3. The method according to Claim 2, characterized in that the step of performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom includes:
cleaning and filtering invalid label features in the various groups of continuous label initial data sequentially in accordance with a missing rate filter condition, a quantile filter condition, and a proportion of categories filter condition of the label data, and correspondingly obtaining plural groups of continuous label data.
4. The method according to Claim 1, characterized in that the step of sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors includes:
employing an evidence weight algorithm to perform similarity distance calculation on variables of various discrete factors in one group of discrete label data;
calculating an IV value to which each discrete factor corresponds through an information value formula, and screening out discrete factors with high value degrees on the basis of sizes of the IV values;
employing a Lasso regression algorithm to screen discrete factors with high identification degrees out of the discrete factors with high value degrees;
employing a ridge regression algorithm to further screen discrete factors with prominent importance out of the discrete factors with high identification degrees, and constituting plural groups of new discrete label data consisting of prominently contributive discrete factors; and respectively invoking other groups of discrete label data to repeat the above calculating steps, and correspondingly obtaining plural groups of new discrete label data.
5. The method according to Claim 1, characterized in that the step of employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation includes:
selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing the random forest algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data;
selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing the gradient boosting decision tree algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data; and performing weighted assignment on the importance indices of the various variables of the discrete factors obtained by employing the random forest algorithm and on the importance indices of the various variables of the discrete factors obtained by employing the gradient boosting decision tree algorithm in the same piece of discrete label data, and thereafter performing summation to obtain weight results of plural groups of discrete factors.
6. The method according to Claim 1, characterized in that the step of employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors includes:
multiplying the weight results of the various groups of discrete factors with the similarity distances of the various discrete factors, and calculating a similarity distance between each discrete factor in the customer data and the positive-sample data; and employing the Manhattan distance algorithm to summate the similarity distances of all discrete factors in each piece of customer data, and obtaining a final similarity distance between each piece of customer data and the positive-sample data.
7. The method according to Claim 1, characterized in that the step of screening out any potential customer according to the final similarity distances includes:
reversely arranging the final similarity distances according to value sizes, screening out top-ranking N pieces of customer data and marking the same data as potential customers.
8. A data processing system based on a similarity model, characterized in comprising:
an information collecting unit, for collecting plural pieces of customer data, wherein the customer data are positive-sample data or negative-sample data;
a binning transforming unit, for extracting continuous label data from each piece of customer data, subjecting the same data to binning transformation to thereafter correspondingly obtain plural groups of discrete label data;
a label screening unit, for sequentially performing similarity distance calculation on a discrete factor in each group of discrete label data, and simultaneously screening out plural groups of new discrete label data consisting of prominently contributive discrete factors;
a weight calculating unit, for employing a random forest algorithm and a gradient boosting decision tree algorithm to respectively perform weight calculation on discrete factors in the new discrete label data, and obtaining weight results of plural groups of discrete factors after weighted summation;
a similarity distance calculating unit, for employing a Manhattan distance algorithm to calculate a final similarity distance between each piece of customer data and the positive-sample data on the basis of the weight results of the various groups of discrete factors and similarity distances of the various discrete factors; and a marketing unit, for screening out any potential customer according to the final similarity distances.
9. The system according to Claim 8, characterized in that the binning transforming unit includes:
an initial data extracting module, for performing label feature extraction on each piece of customer data, and obtaining plural groups of continuous label initial data;
a data cleaning module, for performing data cleaning with respect to the various groups of continuous label initial data, and retaining continuous label data after having removed any invalid label feature therefrom; and a binning processing module, for employing an optimum binning strategy to perform optimum binning processing on the various pieces of continuous label data respectively, and correspondingly obtaining plural groups of discrete label data, wherein each group of discrete label data includes plural label features discrete from one another.
10. The system according to Claim 8, characterized in that the label screening unit includes:
an evidence weight algorithm module, for employing an evidence weight algorithm to perform similarity distance calculation on variables of various discrete factors in one group of discrete label data;
an information value calculating module, for calculating an IV value to which each discrete factor corresponds through an information value formula, and screening out discrete factors with high value degrees on the basis of the magnitudes of the IV values;
a Lasso regression algorithm module, for employing a Lasso regression algorithm to screen discrete factors with high identification degrees out of the discrete factors with high value degrees;
and a ridge regression algorithm module, for employing a ridge regression algorithm to further screen discrete factors with prominent importance out of the discrete factors with high identification degrees, and constituting plural groups of new discrete label data consisting of prominently contributive discrete factors.
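The evidence-weight (WOE) and information-value (IV) screening in claim 10 can be sketched with the standard WOE/IV formulas, which are assumed here since the claim names the formulas without reproducing them. The bin counts are hypothetical; the subsequent Lasso and ridge regression screening (e.g. with scikit-learn's Lasso and Ridge estimators) would then run on the factors whose IV clears a chosen threshold.

```python
import math

def woe_iv(pos_counts, neg_counts):
    """WOE per bin and total IV for one discrete factor.

    pos_counts[i] / neg_counts[i]: positive / negative samples in bin i.
    WOE_i = ln((pos_i / pos_total) / (neg_i / neg_total));
    IV    = sum_i (pos_rate_i - neg_rate_i) * WOE_i.
    """
    pos_total, neg_total = sum(pos_counts), sum(neg_counts)
    woe, iv = [], 0.0
    for p, n in zip(pos_counts, neg_counts):
        p_rate, n_rate = p / pos_total, n / neg_total
        w = math.log(p_rate / n_rate)
        woe.append(w)
        iv += (p_rate - n_rate) * w
    return woe, iv

# One discrete factor with three bins: positives skew toward the first bin.
woe, iv = woe_iv([50, 30, 20], [20, 30, 50])
```

A factor whose IV is large discriminates well between the positive sample and the rest, which is the "high value degree" criterion of claim 10.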
11. The system according to Claim 8, characterized in that the weight calculating unit includes:
a random forest algorithm module, for selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing a random forest algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data;
a gradient boosting decision tree algorithm module, for selecting data in a positive sample as a target variable, taking the discrete factor in each piece of discrete label data as a dependent variable, and employing a gradient boosting decision tree algorithm to calculate importance indices of various variables of the discrete factors in the various pieces of discrete label data; and
a weighted assignment module, for performing weighted assignment on the importance indices of the various variables of the discrete factors obtained by employing the random forest algorithm and on the importance indices of the various variables of the discrete factors obtained by employing the gradient boosting decision tree algorithm in the same piece of discrete label data, and thereafter performing summation to obtain weight results of plural groups of discrete factors.
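The weighted-summation step of claim 11 can be sketched as below. In practice the two importance vectors would come from fitted models (e.g. scikit-learn's RandomForestClassifier and GradientBoostingClassifier, via their feature_importances_ attributes); the fixed vectors and the 50/50 weighting here are assumptions for illustration, as the patent does not fix the assignment weights.

```python
def combine_importances(rf_imp, gbdt_imp, rf_weight=0.5):
    """Weighted sum of two per-factor importance vectors, renormalized to 1."""
    combined = [rf_weight * r + (1.0 - rf_weight) * g
                for r, g in zip(rf_imp, gbdt_imp)]
    total = sum(combined)
    return [c / total for c in combined]

rf_imp   = [0.5, 0.3, 0.2]   # random forest importances per discrete factor
gbdt_imp = [0.4, 0.4, 0.2]   # GBDT importances for the same factors
factor_weights = combine_importances(rf_imp, gbdt_imp)
```

The resulting vector plays the role of the "weight results of plural groups of discrete factors" consumed by the similarity distance calculating unit.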
12. The system according to Claim 8, characterized in that the similarity distance calculating unit includes:
a label feature similarity distance module, for multiplying the weight results of the various groups of discrete factors with the similarity distances of the various discrete factors, and calculating a similarity distance between each discrete factor in the customer data and the positive-sample data;
and a customer data similarity distance module, for employing the Manhattan distance algorithm to summate the similarity distances of all discrete factors in each piece of customer data, and obtaining a final similarity distance between each piece of customer data and the positive-sample data.
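The weighted Manhattan distance of claim 12 (each factor's weight multiplied by its per-factor similarity distance, then summed over all factors of a customer) can be sketched as follows; the example vectors are assumptions for illustration.

```python
def weighted_manhattan(customer, positive_reference, weights):
    """Sum of weight * |difference| over all discrete factors of one customer."""
    return sum(w * abs(c - p)
               for c, p, w in zip(customer, positive_reference, weights))

customer  = [1, 0, 2]            # binned (discrete) factor values for one customer
reference = [1, 1, 0]            # representative positive-sample factor values
weights   = [0.45, 0.35, 0.20]   # combined factor weights from claim 11
dist = weighted_manhattan(customer, reference, weights)
```

Customers whose final distance falls below a chosen threshold would be the "potential customers" screened out by the marketing unit of claim 8.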
Date Recue/Date Received 2022-06-21
CA3165582A 2018-12-21 2019-09-20 Data processing method and system based on similarity model Pending CA3165582A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201811570074.3A CN109636482B (en) 2018-12-21 2018-12-21 Data processing method and system based on similarity model
CN201811570074.3 2018-12-21
PCT/CN2019/106858 WO2020125106A1 (en) 2018-12-21 2019-09-20 Similarity model-based data processing method and system

Publications (1)

Publication Number Publication Date
CA3165582A1 true CA3165582A1 (en) 2020-06-25

Family

ID=66076419

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3165582A Pending CA3165582A1 (en) 2018-12-21 2019-09-20 Data processing method and system based on similarity model

Country Status (3)

Country Link
CN (1) CN109636482B (en)
CA (1) CA3165582A1 (en)
WO (1) WO2020125106A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109636482B (en) * 2018-12-21 2021-07-27 南京星云数字技术有限公司 Data processing method and system based on similarity model
CN111754253A (en) * 2019-06-20 2020-10-09 北京沃东天骏信息技术有限公司 User authentication method, device, computer equipment and storage medium
CN110990857B (en) * 2019-12-11 2021-04-06 支付宝(杭州)信息技术有限公司 Multi-party combined feature evaluation method and device for protecting privacy and safety
CN111597179B (en) * 2020-05-18 2023-12-05 北京思特奇信息技术股份有限公司 Method and device for automatically cleaning data, electronic equipment and storage medium
CN111564223B (en) * 2020-07-20 2021-01-12 医渡云(北京)技术有限公司 Infectious disease survival probability prediction method, and prediction model training method and device
US20230206294A1 (en) * 2021-12-29 2023-06-29 Rakuten Group, Inc. Information processing apparatus, information processing method, and recording medium

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140324523A1 (en) * 2013-04-30 2014-10-30 Wal-Mart Stores, Inc. Missing String Compensation In Capped Customer Linkage Model
CN104699717B (en) * 2013-12-10 2019-01-18 中国银联股份有限公司 Data digging method
CN105354210A (en) * 2015-09-23 2016-02-24 深圳市爱贝信息技术有限公司 Mobile game payment account behavior data processing method and apparatus
CN107273909A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 The sorting algorithm of high dimensional data
CN106355449B (en) * 2016-08-31 2021-09-07 腾讯科技(深圳)有限公司 User selection method and device
CN106503873A (en) * 2016-11-30 2017-03-15 腾云天宇科技(北京)有限公司 A kind of prediction user follows treaty method, device and the computing device of probability
CN107103050A (en) * 2017-03-31 2017-08-29 海通安恒(大连)大数据科技有限公司 A kind of big data Modeling Platform and method
CN108876076A (en) * 2017-05-09 2018-11-23 中国移动通信集团广东有限公司 The personal credit methods of marking and device of data based on instruction
CN108460087A (en) * 2018-01-22 2018-08-28 北京邮电大学 Heuristic high dimensional data visualization device and method
CN108876436A (en) * 2018-05-25 2018-11-23 广东工业大学 A kind of electric business discount coupon based on integrated model uses probability forecasting method
CN108876444A (en) * 2018-05-25 2018-11-23 平安科技(深圳)有限公司 Client's classification analysis method, device, computer equipment and storage medium
CN108960505A (en) * 2018-05-31 2018-12-07 试金石信用服务有限公司 Quantitative estimation method, device, system and the storage medium of personal finance credit
CN108776922A (en) * 2018-06-04 2018-11-09 北京至信普林科技有限公司 Finance product based on big data recommends method and device
CN109034658A (en) * 2018-08-22 2018-12-18 重庆邮电大学 A kind of promise breaking consumer's risk prediction technique based on big data finance
CN109636482B (en) * 2018-12-21 2021-07-27 南京星云数字技术有限公司 Data processing method and system based on similarity model

Also Published As

Publication number Publication date
CN109636482B (en) 2021-07-27
CN109636482A (en) 2019-04-16
WO2020125106A1 (en) 2020-06-25

Similar Documents

Publication Publication Date Title
CA3165582A1 (en) Data processing method and system based on similarity model
US10482079B2 (en) Data de-duplication systems and methods
US9916584B2 (en) Method and system for automatic assignment of sales opportunities to human agents
WO2007106786A2 (en) Methods and systems for multi-credit reporting agency data modeling
CN111667307B (en) Method and device for predicting financial product sales volume
Rahmaty et al. Customer churn modeling via the grey wolf optimizer and ensemble neural networks
CN117391313B (en) Intelligent decision method, system, equipment and medium based on AI
CN114997916A (en) Prediction method, system, electronic device and storage medium of potential user
CN111861759A (en) Matching method and system of product and customer group
CN115545886A (en) Overdue risk identification method, overdue risk identification device, overdue risk identification equipment and storage medium
Hu Predicting and improving invoice-to-cash collection through machine learning
CN113420909A (en) User response information prediction model establishing method and information prediction method
CN117114812A (en) Financial product recommendation method and device for enterprises
US10664742B1 (en) Systems and methods for training and executing a recurrent neural network to determine resolutions
Shih et al. Developing target marketing models for personal loans
Branch A case study of applying som in market segmentation of automobile insurance customers
JP6287280B2 (en) Information processing method, program, and information processing apparatus
CN113516511A (en) Financial product purchase prediction method and device and electronic equipment
CN113763032A (en) Commodity purchase intention identification method and device
JP2012238073A (en) Credit purchase assessment support system and credit purchase assessment support method
Deshmukh et al. Risky business: Predicting cancellations in imbalanced multi-classification settings
EP4428770A1 (en) Identifying recurring events using automated semi-supervised classifiers
Huang et al. A hybrid model for portfolio selection based on grey relational analysis and RS theories
CN112801563B (en) Risk assessment method and device
Konda et al. An In-Depth Evaluation of Machine Learning Techniques for Anticipating Effective Human Health Outcomes

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
