CN113222632A - Object mining method and device - Google Patents

Object mining method and device

Info

Publication number
CN113222632A
Authority
CN
China
Prior art keywords
features
feature
prediction
characteristic
mined
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010079932.5A
Other languages
Chinese (zh)
Inventor
黄倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Zhenshi Information Technology Co Ltd
Original Assignee
Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Zhenshi Information Technology Co Ltd filed Critical Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority to CN202010079932.5A priority Critical patent/CN113222632A/en
Publication of CN113222632A publication Critical patent/CN113222632A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0202Market predictions or forecasting for commercial activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for object mining, and relates to the technical field of computers. One embodiment of the method comprises: acquiring feature data of an object to be mined, calculating the prediction capability of each feature, and then performing a first selection of features according to their prediction capabilities; performing correlation analysis on the first selected features and performing a second selection on them; compressing and reducing the dimensions of the features selected the second time; and performing model training using the compressed, dimension-reduced features to obtain an object prediction model, and predicting the object to be mined with the object prediction model to judge whether it is a potential object. The embodiment can mine potential objects with a high conversion probability in a more targeted manner, reduce the investment of sales resources, and improve the success rate of object mining.

Description

Object mining method and device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for object mining.
Background
Mining potential customers is one of the important tasks in the development of every company. The methods companies currently use to mine potential customers are generally as follows:
1) Telephone prospecting: sales staff contact customers directly, one-to-one, by telephone;
2) Referral by acquaintances: potential customers are found through introductions from existing customers or friends;
3) Seminars: customers are attracted by holding seminars.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
None of the three methods is backed by a clear sales strategy, so they consume manpower, money, and time, and the mining success rate is low.
Disclosure of Invention
In view of this, the embodiments of the present invention provide an object mining method and apparatus, which can mine potential objects with a high conversion probability in a more targeted manner, reduce the investment of sales resources, and improve the success rate of object mining.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of object mining.
A method of object mining, comprising: acquiring feature data of an object to be mined, calculating the prediction capability of each feature, and then performing a first selection of features according to their prediction capabilities; performing correlation analysis on the first selected features and performing a second selection on them; compressing and reducing the dimensions of the features selected the second time; and performing model training using the compressed, dimension-reduced features to obtain an object prediction model, and predicting the object to be mined with the object prediction model to judge whether the object to be mined is a potential object.
Optionally, calculating the prediction capability of each feature comprises: judging the feature type of each feature, wherein the feature types comprise classified features and numerical features; if the feature is a numerical feature, discretizing it to obtain a corresponding classified feature and then calculating the prediction capability of that classified feature; and if the feature is a classified feature, calculating its prediction capability directly.
Optionally, performing correlation analysis on the first selected features comprises: calculating the degree of correlation between the first selected features by computing the chi-square statistic between the features.
Optionally, compressing and reducing the dimensions of the features selected the second time comprises: compressing and reducing the dimensions of those features by principal component analysis.
Optionally, performing model training using the compressed, dimension-reduced features comprises: combining those features through a binary tree algorithm, and inputting the feature combinations of the leaf nodes of the binary tree into a logistic regression model for model training.
Optionally, after judging whether the object to be mined is a potential object, the method further comprises: determining data segmentation points for all potential objects using a data fitting algorithm, and classifying the potential objects according to the data segmentation points; and clustering each class of potential objects separately and determining the common characteristics of each class according to the clustering results, wherein the number of clusters is determined by the ratio of the inter-class distance to the intra-class distance.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for object mining.
An apparatus for object mining, comprising: a first selection module, configured to acquire feature data of an object to be mined, calculate the prediction capability of each feature, and then perform a first selection of features according to their prediction capabilities; a second selection module, configured to perform correlation analysis on the first selected features and perform a second selection on them; a feature dimension reduction module, configured to compress and reduce the dimensions of the features selected the second time; and a training prediction module, configured to perform model training using the compressed, dimension-reduced features to obtain an object prediction model, and to predict the object to be mined with the object prediction model to judge whether the object to be mined is a potential object.
Optionally, the first selection module is further configured to: judge the feature type of each feature, wherein the feature types comprise classified features and numerical features; if the feature is a numerical feature, discretize it to obtain a corresponding classified feature and then calculate the prediction capability of that classified feature; and if the feature is a classified feature, calculate its prediction capability directly.
Optionally, the second selection module is further configured to: calculate the degree of correlation between the first selected features by computing the chi-square statistic between the features.
Optionally, the feature dimension reduction module is further configured to: compress and reduce the dimensions of the features selected the second time by principal component analysis.
Optionally, the training prediction module is further configured to: combine the compressed, dimension-reduced features through a binary tree algorithm, and input the feature combinations of the leaf nodes of the binary tree into a logistic regression model for model training.
Optionally, the apparatus further comprises a cluster analysis module, configured to: after it is judged whether the object to be mined is a potential object, determine data segmentation points for all potential objects using a data fitting algorithm, and classify the potential objects according to the data segmentation points; and cluster each class of potential objects separately and determine the common characteristics of each class according to the clustering results, wherein the number of clusters is determined by the ratio of the inter-class distance to the intra-class distance.
According to yet another aspect of the embodiments of the present invention, there is provided an electronic device for object mining.
An electronic device for object mining, comprising: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the method for object mining provided by the embodiment of the invention.
According to yet another aspect of embodiments of the present invention, a computer-readable medium is provided.
A computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of object mining provided by an embodiment of the invention.
One embodiment of the above invention has the following advantages or benefits: the method comprises the steps of obtaining feature data of an object to be mined, calculating the prediction capability of each feature, and then selecting the features for the first time according to the prediction capabilities of the features; performing correlation analysis on the first selected features and performing second selection on the first selected features; then, compressing and reducing dimensions of the features selected for the second time; finally, model training is carried out by using the features after compression and dimension reduction to obtain an object prediction model, the object to be mined is predicted by using the object prediction model to judge whether the object to be mined is a potential object, analysis on behavior features and the like of the object to be mined is realized by a big data mining method, so that the probability that the object to be mined becomes the potential object is predicted, the potential object with high conversion probability can be mined more specifically, the investment of sales resources is reduced, and the success rate of object mining is improved. In addition, the method uses a mode of fusing the binary tree xgboost and the logistic regression model during model training, processes the features by using xgboost, and then performs final model training by using logistic regression, thereby making up the insensitivity of logistic regression to the nonlinear relation and enhancing the accuracy of overall prediction.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method of object mining according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implementation of feature combination using a binary tree according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of an apparatus for object mining according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of main steps of a method of object mining according to an embodiment of the present invention. As shown in fig. 1, the method for object mining according to the embodiment of the present invention mainly includes the following steps S101 to S104.
Step S101: acquiring feature data of an object to be mined, calculating the prediction capability of each feature, and then selecting the features for the first time according to the prediction capabilities of the features.
In the embodiment of the invention, the object to be mined is, for example, a customer to be mined. The acquisition of feature data of an object to be mined is introduced with the following scenario: suppose a logistics company needs to mine potential objects (i.e., potential customers) from the merchants who run stores on a certain e-commerce platform. Some merchants already use the company's logistics service and some do not. The historical data of the merchants is used to establish a functional relationship between the merchants' basic conditions and behavior characteristics (predictor variables) and whether they use the logistics service (response variable), so as to predict whether a merchant will become a potential customer of the logistics company in the future.
For the logistics company, the basic data of the objects to be mined (namely, merchants who run stores on a certain e-commerce platform) is acquired first. Useful feature data is then extracted from this basic data, including the basic attributes of the merchant, the operating capacity of the merchant, the merchant's sensitivity to timeliness, the merchant's sensitivity to price, and so on. Features are then selected from the acquired feature data, that is, the features with stronger prediction capability are chosen for subsequent modeling, which requires calculating the prediction capability of each feature. After the basic data is acquired, it can also be cleaned and converted, including preprocessing such as missing-value filling and data quality inspection.
According to an embodiment of the present invention, when calculating the prediction capability of each feature, the following steps may be specifically performed:
judging the feature type of each feature, wherein the feature types comprise classified features and numerical features;
if the feature is a numerical feature, discretizing it to obtain a corresponding classified feature, and then calculating the prediction capability of that classified feature;
if the feature is a classified feature, calculating its prediction capability directly.
In the embodiment of the invention, the feature fields that require prediction capability analysis are divided into classified features and numerical features. Numerical features are also called continuous features, and a feature with few distinct values can be processed either as a classified feature or as a continuous feature. Which fields are analyzed is determined according to the business and the existing data. A field is classified mainly according to its category attribute. For example, the field "business grade" takes the values A, B, C, and D, so it is a classified field. The field "merchant sales amount" takes specific sales figures such as 10.1, 1000.5, and 20.3, so it is a numerical field.
The advantage of treating a feature as a classified feature is that the distribution of each value can be seen; a feature treated as a continuous feature is segmented in the next step, and the target data of some values are merged. The prediction capability must be calculated for both classified and numerical features. For a numerical (continuous) feature, segmentation is needed before the prediction capability is calculated: the continuous data is discretized into several classes through a transformation, and the principle of discretization is to make the relationship between the feature and the response variable linear. After discretization, the prediction capability is calculated in the same way as for a classified feature. For example, "merchant sales amount" is continuous, so a new variable, amount class, is defined: when the sales amount is less than 10, the amount class is 1; when it is 10-100, the amount class is 2; when it is 100-1000, the amount class is 3; and when it is more than 1000, the amount class is 4. The amount class is then the discretized sales amount, and the two variables have a certain linear relationship.
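The discretization example above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `amount_class` and the constant `SALES_EDGES` are hypothetical, and the handling of values that fall exactly on a boundary is an assumption, since the text leaves it open.

```python
from bisect import bisect_right

# Hypothetical bin edges taken from the example in the text:
# <10, 10-100, 100-1000, >1000.
SALES_EDGES = [10, 100, 1000]


def amount_class(sales_amount, edges=SALES_EDGES):
    """Map a continuous sales amount to an ordinal class 1..len(edges)+1."""
    # bisect_right counts how many edges are <= the amount, so a value equal
    # to an edge falls into the upper class (an assumption of this sketch).
    return bisect_right(edges, sales_amount) + 1


if __name__ == "__main__":
    for amount in (10.1, 1000.5, 20.3, 5.0):
        print(amount, "->", amount_class(amount))
```

Because the class number grows with the sales amount, the discretized variable keeps the ordering of the original one.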
When the prediction capability of the classification type feature is calculated, the classification type feature X is assumed to have n classes, and the prediction capability calculation formula of X is as follows:
Figure BDA0002379930870000071
where $a_i/a_T$ is the proportion of merchants in class $i$ that use the logistics service among all merchants in the sample that use it, and $n_i/n_T$ is the proportion of merchants in class $i$ that do not use the logistics service among all merchants in the sample that do not use it.
By calculating the prediction capability of each feature, setting a prediction capability threshold value and taking the feature with the prediction capability higher than the prediction capability threshold value as the feature selected for the first time, the feature with strong prediction capability can be selected for further processing and model training.
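The formula figure for the prediction capability is not reproduced in this text. The two proportions it is defined on, $a_i/a_T$ and $n_i/n_T$, are the ones used by the widely known information value (IV) statistic, so the sketch below assumes, without confirmation from the source, that the prediction capability is computed as $\sum_i (a_i/a_T - n_i/n_T)\,\ln\frac{a_i/a_T}{n_i/n_T}$. The function names, the example counts, and the threshold are hypothetical.

```python
import math


def information_value(used_counts, unused_counts):
    """Information value of one classified feature (assumed measure).

    used_counts[i]   = merchants in class i that use the logistics service (a_i)
    unused_counts[i] = merchants in class i that do not use it (n_i)
    Assumes every class contains at least one merchant of each kind, so that
    no logarithm of zero occurs.
    """
    a_total = sum(used_counts)
    n_total = sum(unused_counts)
    iv = 0.0
    for a_i, n_i in zip(used_counts, unused_counts):
        p_used = a_i / a_total
        p_unused = n_i / n_total
        iv += (p_used - p_unused) * math.log(p_used / p_unused)
    return iv


def first_selection(feature_counts, threshold):
    """Keep the features whose prediction capability exceeds the threshold."""
    return [name for name, (used, unused) in feature_counts.items()
            if information_value(used, unused) > threshold]


if __name__ == "__main__":
    counts = {
        "business_grade": ([40, 30, 20, 10], [10, 20, 30, 40]),
        "uninformative": ([25, 25, 25, 25], [25, 25, 25, 25]),
    }
    print(first_selection(counts, threshold=0.1))
```

A feature whose classes have the same share of users and non-users contributes nothing to the sum, which is why the second example feature is filtered out.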
Step S102: performing correlation analysis on the first selected features and performing a second selection on them.
So that the influence of the different features on the model is better captured during model training, the correlation between the selected features should be as weak as possible. The first selected features therefore need to undergo correlation analysis and then a second selection.
When correlation analysis is performed on the first selected features, the degree of correlation between them may be calculated by computing the chi-square statistic between the features. The chi-square statistic is a measure of the difference between the observed distribution of the data and a selected expected or hypothesized distribution. Suppose that H0 states that the row classification variable is not associated with the column classification variable, and H1 states that the row classification variable is associated with the column classification variable; then:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
where $f_e$ is the expected frequency, $f_0$ is the observed frequency, and $\chi^2$ has $(\gamma-1)(c-1)$ degrees of freedom, $\gamma$ being the number of rows and $c$ the number of columns.
After the chi-square statistic between two features is obtained, the features can be selected a second time. The chi-square statistic $\chi^2$ describes how closely the observed values agree with the expected values: the smaller $\chi^2$ is, the closer the agreement between the two. When two features are strongly correlated, screening is required, and either one of them may be retained at random. In general, whether $\chi^2$ is large or small can be judged against the critical value of the chi-square distribution with a significance level of 0.05 and $(\gamma-1)(c-1)$ degrees of freedom. If the correlation between the two features is not strong, both may be retained.
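As a sketch of the chi-square computation, the code below builds the expected frequencies under H0 from the row and column totals of a contingency table of two features. It follows the conventional reading of the test, which is an assumption here: a statistic above the critical value rejects H0, the two features are treated as associated, and one of them is dropped. The function names are hypothetical; only the critical value for one degree of freedom is hard-coded, and other degrees of freedom would need a statistical table or library.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    table[i][j] is the observed frequency of row category i and column category j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # f_e under H0
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Critical value of the chi-square distribution with 1 degree of freedom
# at a significance level of 0.05.
CRITICAL_1DOF_005 = 3.841


def are_associated(table, critical_value):
    """True when the statistic exceeds the critical value (H0 rejected)."""
    stat, _ = chi_square(table)
    return stat > critical_value


if __name__ == "__main__":
    print(chi_square([[30, 10], [10, 30]]))
    print(are_associated([[20, 20], [20, 20]], CRITICAL_1DOF_005))
```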
Step S103: compressing and reducing the dimensions of the features selected the second time. In the embodiment of the invention, the features selected the second time are compressed and dimension-reduced by principal component analysis, which further eliminates the linear correlation among the features. Principal component analysis uses the idea of dimension reduction to convert multiple indexes into a few comprehensive indexes (i.e., principal components), where each principal component reflects most of the information of the original variables and the information contained in different components is not repeated.
The principal components $Y_1, Y_2, \dots, Y_p$ are expressed as linear combinations of the original feature parameters $X_1, X_2, \dots, X_p$, written in algebraic form as:
$$
\begin{aligned}
Y_1 &= u_{11}X_1 + u_{12}X_2 + \dots + u_{1p}X_p \\
Y_2 &= u_{21}X_1 + u_{22}X_2 + \dots + u_{2p}X_p \\
&\;\;\vdots \\
Y_p &= u_{p1}X_1 + u_{p2}X_2 + \dots + u_{pp}X_p
\end{aligned}
$$
where $Y_i = u_i'X$ is the $i$-th principal component of the original feature parameters and $u_i = (u_{i1}, u_{i2}, \dots, u_{ip})'$ is its coefficient vector. The linear combinations are subject to the following constraints:
1. $u_i'u_i = 1$;
2. when $i \neq j$, $Y_i$ and $Y_j$ are mutually orthogonal;
3. $Y_1$ has the largest variance among all linear combinations of $X_1, X_2, \dots, X_p$; $Y_2$ has the largest variance among all such linear combinations that are orthogonal to $Y_1$; and so on, until $Y_p$, which has the largest variance among all such linear combinations that are orthogonal to $Y_1, \dots, Y_{p-1}$.
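The constraints above can be illustrated with a small standard-library sketch. It finds the coefficient vectors $u_i$ as eigenvectors of the sample covariance matrix by power iteration with deflation, which is one common way to compute principal components and is not stated in the source. All function names are hypothetical, and the fixed start vector is an assumption that fails in the rare case where it is exactly orthogonal to the sought eigenvector.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix; one row per object, one column per feature."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(matrix)
    vec = [float(j + 1) for j in range(p)]  # asymmetric start vector
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]  # keeps u'u = 1 (constraint 1)
    value = sum(vec[a] * matrix[a][b] * vec[b]
                for a in range(p) for b in range(p))
    return value, vec


def principal_components(rows, k):
    """Return the first k pairs (variance of Y_i, coefficient vector u_i)."""
    cov = covariance_matrix(rows)
    p = len(cov)
    result = []
    for _ in range(k):
        value, vec = top_eigenpair(cov)
        result.append((value, vec))
        # Deflation removes the component just found, so the next one is
        # orthogonal to it and has the next-largest variance.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(p)]
               for a in range(p)]
    return result
```

Keeping only the first few pairs returned by `principal_components` corresponds to the compression step: the retained components carry the largest share of the variance.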
According to the above steps S101 to S103, the model input data can be prepared by performing the feature processing (feature selection, feature compression dimension reduction, etc.).
Step S104: performing model training using the compressed, dimension-reduced features to obtain an object prediction model, and predicting the object to be mined with the object prediction model to judge whether the object to be mined is a potential object.
When the features after the compression and dimension reduction are used for model training, the features after the compression and dimension reduction can be specifically combined through a binary tree algorithm, and the feature combinations of leaf nodes of the binary tree are input into a logistic regression model for model training.
Binary classification problems generally use a logistic regression model. Logistic regression is a generalized linear model; adding a sigmoid function keeps its output value within [0, 1], so that the output can be regarded as the probability of an event. However, logistic regression does not handle non-linear relationships well. To solve this problem, the present invention uses feature combination. For example, the features "merchant price sensitivity" and "merchant aging sensitivity" may each be non-linear with respect to the final prediction, while "merchant price sensitivity + merchant aging sensitivity" is linear with respect to the prediction result. For instance, with merchant price sensitivity (1) and merchant aging sensitivity (1), one union operation yields a combination of feature values. In the embodiment of the present invention, one-hot encoding (intuitively, an encoding that has as many bits as there are states, in which only one bit is 1 and all others are 0) is used to represent the feature values, so that "price sensitivity 1" → "price sensitivity + aging sensitivity 1".
Feature-value combination is also part of feature engineering, and its difficulty lies in deciding which feature values to combine and how to combine them effectively. When features are combined, the features to be combined must first be selected, and a tree model is most often used to select the most important feature values. In embodiments of the present invention, the tree model performs the feature combination. Take the CART tree as an example: because it is a binary tree, each node has two branches, and each leaf node serves as an output.
Fig. 2 is a schematic diagram illustrating the principle of feature combination using a binary tree according to an embodiment of the present invention. As shown in fig. 2, if sample x finally falls on the leaf node corresponding to "girl", then "age < 15 and not male" can be considered to have a linear relationship with the final result. Thus, each leaf node can be regarded as a combined feature value that has a linear relationship with the result.
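The idea that each leaf is one combined feature value can be sketched with a small hand-written tree. The tree below is hypothetical and only loosely modeled on the example in the text (a split on age < 15 followed by a split on sex); the actual structure of fig. 2 is not reproduced here, and the function names are illustrative.

```python
def leaf_index(sample):
    """Route a sample through a hypothetical binary tree and return its leaf.

    Root: age < 15? If yes, the left child splits on is_male.
    Leaf 0: age < 15 and male; leaf 1: age < 15 and not male; leaf 2: age >= 15.
    """
    if sample["age"] < 15:
        return 0 if sample["is_male"] else 1
    return 2


def one_hot(index, size):
    """One bit per state; only the bit of the active state is 1."""
    return [1 if i == index else 0 for i in range(size)]


def combined_features(sample, n_leaves=3):
    """Encode the leaf a sample falls into as a one-hot combined feature."""
    return one_hot(leaf_index(sample), n_leaves)


if __name__ == "__main__":
    print(combined_features({"age": 10, "is_male": False}))
```

Each position of the resulting vector stands for one path condition, so a linear model can assign a separate weight to every combination.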
In the embodiment of the invention, an eXtreme Gradient Boosting model (hereinafter xgboost) is used to discover this linear relationship: the feature combinations corresponding to the leaf nodes of xgboost are taken as the selected feature combinations and input into a logistic regression model as new features for model training and prediction. The selected feature combinations may also be encoded before being input to the logistic regression model, for example using one-hot encoding. Model training thus fuses the binary-tree-based xgboost with a logistic regression model: the features are processed with xgboost, and the final model is then trained with logistic regression.
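The second stage of this fusion can be sketched without external libraries. The code below covers only the logistic regression step and assumes the one-hot leaf features have already been produced by the tree stage (xgboost in the embodiment; here they are simply supplied as input). Plain batch gradient descent stands in for whatever optimizer a real implementation would use, and all names and hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    """Squash a real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_probability(x, weights, bias):
    """Predicted probability that the object is a potential object."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


if __name__ == "__main__":
    # One-hot leaf features for three leaves, four samples per leaf.
    rows = [[1, 0, 0]] * 4 + [[0, 1, 0]] * 4 + [[0, 0, 1]] * 4
    labels = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
    w, b = train_logistic_regression(rows, labels)
    print([round(predict_probability(r, w, b), 2)
           for r in ([1, 0, 0], [0, 1, 0], [0, 0, 1])])
```

With one-hot leaf inputs, each weight ends up reflecting the share of positive samples in its leaf, which is how the non-linear structure found by the trees reaches the linear model.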
During model training, the trained model can be evaluated and the model parameters adjusted according to the evaluation result. Specifically, the model can be evaluated by means of the ROC curve, the KS curve, the lift chart, the Gini coefficient, and the like. After the evaluation is passed, the object prediction model is obtained. It is then used to predict all current objects to be mined, giving the probability that each will next be converted into a potential object. The objects to be mined are sorted in descending order of this probability, and those whose probability is greater than a given threshold are taken as potential objects.
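Two of the steps above lend themselves to a short sketch: one evaluation measure and the final thresholding. The KS value is computed here as the largest gap between the cumulative rates of positive and negative samples when they are ranked by predicted probability; tied probabilities are not given special treatment, which is a simplification. The function names, the example scores, and the threshold are hypothetical.

```python
def ks_statistic(probabilities, labels):
    """Largest gap between cumulative positive and negative rates.

    Assumes labels contain at least one 1 and at least one 0.
    """
    ranked = sorted(zip(probabilities, labels),
                    key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    seen_pos = seen_neg = 0
    best = 0.0
    for _, label in ranked:
        if label == 1:
            seen_pos += 1
        else:
            seen_neg += 1
        best = max(best, abs(seen_pos / n_pos - seen_neg / n_neg))
    return best


def select_potential_objects(scored_objects, threshold):
    """Sort by probability (descending) and keep those above the threshold."""
    ranked = sorted(scored_objects, key=lambda item: item[1], reverse=True)
    return [name for name, probability in ranked if probability > threshold]


if __name__ == "__main__":
    print(ks_statistic([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
    print(select_potential_objects(
        [("m1", 0.3), ("m2", 0.9), ("m3", 0.7)], threshold=0.5))
```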
According to another embodiment of the present invention, after determining whether the object to be mined is a potential object, the method may further include:
determining data segmentation points of all potential objects by using a data fitting algorithm, and classifying the potential objects according to the data segmentation points;
and clustering each class of potential objects separately and determining the common characteristics of each class according to the clustering results, wherein the number of clusters is determined by the ratio of the inter-class distance to the intra-class distance.
According to historical data on the sales and product unit prices of a large number of merchants, the data segmentation points can be determined by using a data fitting algorithm, and the potential objects are then divided into four categories: high-value merchants (high sales, high unit price), core merchants (low sales, high unit price), key merchants (high sales, low unit price), and long-term tracking merchants (low sales, low unit price).
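The four-way split can be sketched as follows, assuming the two segmentation points have already been produced by some data fitting step that is not shown here. The function name `classify_merchant`, the cut-point arguments, and the numeric values are hypothetical.

```python
def classify_merchant(sales: float, unit_price: float,
                      sales_cut: float, price_cut: float) -> str:
    """Assign a potential object to one of the four merchant categories.

    sales_cut and price_cut are the data segmentation points; how they
    are fitted is outside the scope of this sketch.
    """
    high_sales = sales >= sales_cut
    high_price = unit_price >= price_cut
    if high_sales and high_price:
        return "high-value merchant"
    if high_price:
        return "core merchant"          # low sales, high unit price
    if high_sales:
        return "key merchant"           # high sales, low unit price
    return "long-term tracking merchant"  # low sales, low unit price


# Made-up cut points and merchants for illustration only.
category = classify_merchant(sales=500, unit_price=80,
                             sales_cut=100, price_cut=50)
```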
For each class of potential objects, clustering is performed over the existing dimensions of the potential objects, such as logistics timeliness, logistics price, and logistics service. The clustering mainly uses the Euclidean distance, namely:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where x_i and y_i are the values of the i-th feature of the two potential objects, respectively, and n is the number of features used for clustering.
When clustering, candidate numbers of clusters can be screened by the ratio of the inter-class distance to the intra-class distance: the larger the ratio, the more reasonable the number of clusters. Other distance measures, such as the Mahalanobis distance, can also be used in clustering.
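A minimal sketch of this screening follows. The text does not define the two distances precisely, so the sketch assumes the inter-class distance is the mean distance between cluster centroids and the intra-class distance is the mean distance from each point to its own centroid. The candidate partitions are made-up data that would normally come from a clustering algorithm run with different cluster counts.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclidean(x: Point, y: Point) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def separation_ratio(clusters: Sequence[Sequence[Point]]) -> float:
    """Mean centroid-to-centroid distance over mean point-to-centroid distance."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(c) for c in clusters]
    between = [euclidean(centers[i], centers[j])
               for i in range(len(centers))
               for j in range(i + 1, len(centers))]
    within = [euclidean(p, centers[k])
              for k, cluster in enumerate(clusters) for p in cluster]
    inter = sum(between) / len(between)
    intra = sum(within) / len(within)
    return math.inf if intra == 0 else inter / intra


# Candidate partitions of the same six points for 2 and 3 clusters.
candidates = {
    2: [[(0, 0), (0, 2), (10, 0), (10, 2)], [(20, 0), (20, 2)]],
    3: [[(0, 0), (0, 2)], [(10, 0), (10, 2)], [(20, 0), (20, 2)]],
}
best_k = max(candidates, key=lambda k: separation_ratio(candidates[k]))
```

Under these assumptions the three-cluster partition yields the larger ratio and would be preferred.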
Finally, according to the clustering results, the concerns of each class of potential objects when using logistics, that is, the common features of each class of potential objects, are summarized, so that a corresponding marketing strategy is available at the time of sale.
Fig. 3 is a schematic diagram of the main modules of an apparatus for object mining according to an embodiment of the present invention. As shown in fig. 3, the apparatus 300 for object mining according to the embodiment of the present invention mainly includes a first selection module 301, a second selection module 302, a feature dimension reduction module 303, and a training prediction module 304.
The first selection module 301 is configured to obtain feature data of an object to be mined, calculate the prediction capability of each feature, and perform a first selection of the features according to their prediction capabilities;
the second selection module 302 is configured to perform correlation analysis on the first selected features and perform a second selection on the first selected features;
the feature dimension reduction module 303 is configured to perform compression and dimension reduction on the features selected for the second time;
and the training prediction module 304 is configured to perform model training using the compressed and dimension-reduced features to obtain an object prediction model, and to predict the object to be mined using the object prediction model to determine whether the object to be mined is a potential object.
According to an embodiment of the present invention, the first selection module 301 may further be configured to:
determine the feature type of each feature, wherein the feature types comprise categorical features and numerical features;
if the feature is a numerical feature, discretize the feature to obtain a corresponding categorical feature, and then calculate the prediction capability of the corresponding categorical feature;
and if the feature is a categorical feature, directly calculate the prediction capability of the feature.
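The branching on feature type can be sketched as below. This passage does not name the measure of prediction capability, so the sketch assumes information value (IV) with additive smoothing, and equal-width binning for discretization; both choices, along with all function names and data, are illustrative assumptions.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Discretize a numerical feature into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def information_value(categories: Sequence[Hashable],
                      labels: Sequence[int],
                      smoothing: float = 0.5) -> float:
    """Information value of a categorical feature against 0/1 labels."""
    pos = defaultdict(float)
    neg = defaultdict(float)
    for c, y in zip(categories, labels):
        if y == 1:
            pos[c] += 1
        else:
            neg[c] += 1
    cats = set(pos) | set(neg)
    # Smoothing keeps the logarithm finite when a category has no
    # positives or no negatives.
    total_pos = sum(pos.values()) + smoothing * len(cats)
    total_neg = sum(neg.values()) + smoothing * len(cats)
    iv = 0.0
    for c in cats:
        p = (pos[c] + smoothing) / total_pos
        q = (neg[c] + smoothing) / total_neg
        iv += (p - q) * math.log(p / q)
    return iv


def prediction_capability(values: Sequence, labels: Sequence[int],
                          numerical: bool, n_bins: int = 4) -> float:
    """Discretize numerical features first; score categorical ones directly."""
    cats = equal_width_bins(values, n_bins) if numerical else list(values)
    return information_value(cats, labels)


labels = [1, 1, 1, 1, 0, 0, 0, 0]
strong = prediction_capability([9, 8, 9.5, 8.5, 1, 2, 1.5, 2.5], labels,
                               numerical=True)
weak = prediction_capability(["a", "b", "a", "b", "a", "b", "a", "b"], labels,
                             numerical=False)
```

A first selection could then keep only the features whose score exceeds a chosen cutoff; the cutoff itself is not given in this passage.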
According to another embodiment of the present invention, the second selection module 302 may further be configured to:
calculate the correlation between the first selected features by computing the chi-square statistic between the features, so as to perform correlation analysis on the first selected features.
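For illustration, the chi-square statistic between two categorical features can be computed from their contingency table as sketched below. The feature values are made-up data, and the rule for turning the statistic into a second selection (for example, dropping one feature of a strongly associated pair) is an assumption that this passage does not spell out.

```python
from collections import Counter
from typing import Hashable, Sequence


def chi_square(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Pearson chi-square statistic of the contingency table of a and b."""
    if len(a) != len(b) or not a:
        raise ValueError("features must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    stat = 0.0
    for x, na in count_a.items():
        for y, nb in count_b.items():
            expected = na * nb / n          # count expected under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


f1 = ["x", "x", "y", "y"]
f2 = ["p", "p", "q", "q"]   # fully determined by f1
f3 = ["p", "q", "p", "q"]   # independent of f1 in this sample
dependent = chi_square(f1, f2)
independent = chi_square(f1, f3)
```

A larger statistic indicates a stronger association between the two features.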
According to yet another embodiment of the invention, the feature dimension reduction module 303 may be further configured to:
perform compression and dimension reduction on the features selected for the second time by principal component analysis.
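To make the dimension reduction step concrete, the sketch below extracts only the first principal component by power iteration on the sample covariance matrix. This is a deliberately small stand-in for a full principal component analysis; the function name, the iteration count, and the data are assumptions, and a practical implementation would typically rely on a numerical library.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(
        rows: Sequence[Sequence[float]],
        iterations: int = 200) -> Tuple[List[float], List[float]]:
    """Return the leading principal direction and the 1-D projections."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; the uniform start vector is assumed not to be
    # orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


# Two perfectly correlated features collapse onto a single component.
direction, reduced = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
```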
According to yet another embodiment of the invention, the training prediction module 304 may be further configured to:
perform feature combination on the compressed and dimension-reduced features through a binary tree algorithm, and input the feature combinations of the leaf nodes of the binary tree into a logistic regression model for model training.
According to another embodiment of the present invention, the apparatus 300 for object mining may further include a cluster analysis module (not shown in the figure) configured to:
after judging whether the object to be mined is a potential object, determine data segmentation points for all potential objects by using a data fitting algorithm, and classify the potential objects according to the data segmentation points;
and cluster each class of potential objects separately, and determine the common features of each class of potential objects according to the clustering results, wherein the number of clusters is determined by the ratio of the inter-class distance to the intra-class distance.
According to the technical solution of the embodiment of the invention, feature data of the object to be mined is obtained, the prediction capability of each feature is calculated, and the features are selected for the first time according to their prediction capabilities; correlation analysis is performed on the first selected features and a second selection is performed on them; the features selected for the second time are then compressed and reduced in dimension; finally, model training is performed using the compressed and dimension-reduced features to obtain an object prediction model, and the object to be mined is predicted using the object prediction model to judge whether it is a potential object. In this way, the behavior features and the like of the object to be mined are analyzed by big data mining, and the probability that the object to be mined becomes a potential object is predicted, so that potential objects with a high conversion probability can be mined in a more targeted manner, the investment of sales resources is reduced, and the success rate of object mining is improved. In addition, model training fuses the binary tree model xgboost with the logistic regression model: the features are processed by xgboost and the final model is then trained by logistic regression, which compensates for the insensitivity of logistic regression to nonlinear relations and improves the accuracy of the overall prediction.
Fig. 4 illustrates an exemplary system architecture 400 to which the method of object mining or the apparatus of object mining of an embodiment of the present invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various types of connections, such as wired or wireless communication links or fiber optic cables.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a backend management server (for example only) that supports shopping websites browsed by users using the terminal devices 401, 402, 403. The backend management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, target push information or product information, by way of example only) to the terminal device.
It should be noted that the method for object mining provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the apparatus for object mining is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use with a terminal device or server implementing an embodiment of the invention is shown. The terminal device or the server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read from it can be installed into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware. The described units or modules may also be provided in a processor, and may be described as: a processor comprises a first selection module, a second selection module, a feature dimension reduction module and a training prediction module. The names of the units or modules do not limit the units or modules, for example, the first selection module may be further described as a module that obtains feature data of an object to be mined, calculates the prediction capability of each feature, and then performs the first selection of the features according to the prediction capabilities of the features.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: acquire feature data of an object to be mined, calculate the prediction capability of each feature, and then select the features for the first time according to the prediction capabilities of the features; perform correlation analysis on the first selected features and perform a second selection on the first selected features; compress and reduce the dimensions of the features selected for the second time; and perform model training using the compressed and dimension-reduced features to obtain an object prediction model, and predict the object to be mined using the object prediction model to judge whether the object to be mined is a potential object.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of object mining, comprising:
acquiring feature data of an object to be mined, calculating the prediction capability of each feature, and then selecting the features for the first time according to the prediction capabilities of the features;
performing correlation analysis on the first selected features and performing second selection on the first selected features;
compressing and reducing dimensions of the features selected for the second time;
and performing model training by using the features after the compression and dimension reduction to obtain an object prediction model, and predicting the object to be mined by using the object prediction model to judge whether the object to be mined is a potential object.
2. The method of claim 1, wherein computing the predictive power of each feature comprises:
determining the feature type of each feature, wherein the feature types comprise categorical features and numerical features;
if the feature is a numerical feature, discretizing the feature to obtain a corresponding categorical feature, and then calculating the prediction capability of the corresponding categorical feature;
and if the feature is a categorical feature, directly calculating the prediction capability of the feature.
3. The method of claim 1, wherein performing a correlation analysis on the first selected feature comprises:
and calculating the correlation between the first selected features by calculating chi-square statistic between the features so as to perform correlation analysis on the first selected features.
4. The method of claim 1, wherein performing a compressed dimensionality reduction on the second selected feature comprises:
and carrying out compression and dimension reduction on the features selected for the second time by a principal component analysis method.
5. The method of claim 1, wherein model training using the compressed dimensionality reduced features comprises:
and performing feature combination on the features subjected to the compression and dimension reduction through a binary tree algorithm, and inputting the feature combinations of leaf nodes of the binary tree into a logistic regression model for model training.
6. The method of claim 1, wherein after determining whether the object to be mined is a potential object, further comprising:
determining data segmentation points of all potential objects by using a data fitting algorithm, and classifying the potential objects according to the data segmentation points;
and clustering each class of potential objects separately, and determining the common features of each class of potential objects according to the clustering results, wherein the number of clusters is determined by the ratio of the inter-class distance to the intra-class distance.
7. An apparatus for object mining, comprising:
the first selection module is used for acquiring feature data of an object to be mined, calculating the prediction capability of each feature, and then performing first selection of the features according to the prediction capability of the features;
the second selection module is used for carrying out correlation analysis on the first selected characteristics and carrying out second selection on the first selected characteristics;
the feature dimension reduction module is used for compressing and reducing dimensions of the features selected for the second time;
and the training prediction module is used for carrying out model training by using the features after the compression and dimension reduction to obtain an object prediction model, and predicting the object to be mined by using the object prediction model to judge whether the object to be mined is a potential object.
8. The apparatus of claim 7, wherein the first selection module is further configured to:
determine the feature type of each feature, wherein the feature types comprise categorical features and numerical features;
if the feature is a numerical feature, discretize the feature to obtain a corresponding categorical feature, and then calculate the prediction capability of the corresponding categorical feature;
and if the feature is a categorical feature, directly calculate the prediction capability of the feature.
9. An electronic device for object mining, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010079932.5A CN113222632A (en) 2020-02-04 2020-02-04 Object mining method and device

Publications (1)

Publication Number Publication Date
CN113222632A true CN113222632A (en) 2021-08-06

Family

ID=77085667

Country Status (1)

Country Link
CN (1) CN113222632A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988374A (en) * 2021-09-27 2022-01-28 上海东普信息科技有限公司 Method, device, equipment and storage medium for identifying high-quality user through targeted recommendation

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103489108A (en) * 2013-08-22 2014-01-01 浙江工商大学 Large-scale order form matching method in community commerce cloud
CN104699717A (en) * 2013-12-10 2015-06-10 中国银联股份有限公司 Data mining method
CN105678570A (en) * 2015-12-31 2016-06-15 北京京东尚科信息技术有限公司 Method and apparatus for identifying potential users of E-commerce
CN106203679A (en) * 2016-06-27 2016-12-07 武汉斗鱼网络科技有限公司 A kind of customer loss Forecasting Methodology and system
CN106204106A (en) * 2016-06-28 2016-12-07 武汉斗鱼网络科技有限公司 A kind of specific user's recognition methods and system
CN107240033A (en) * 2017-06-07 2017-10-10 国家电网公司客户服务中心 The construction method and system of a kind of electric power identification model
CN107818471A (en) * 2016-09-12 2018-03-20 湖南移商动力网络技术有限公司 One kind is based on O2O electric business user data method for digging under big data environment
CN108108861A (en) * 2018-03-06 2018-06-01 中国银行股份有限公司 The Forecasting Methodology and device of a kind of potential customers
CN108256052A (en) * 2018-01-15 2018-07-06 成都初联创智软件有限公司 Automobile industry potential customers' recognition methods based on tri-training
CN108629375A (en) * 2018-05-08 2018-10-09 广东工业大学 Power customer sorting technique, system, terminal and computer readable storage medium
CN109087196A (en) * 2018-08-20 2018-12-25 北京玖富普惠信息技术有限公司 Credit-graded approach, system, computer equipment and readable medium
CN109300039A (en) * 2018-12-05 2019-02-01 山东省城市商业银行合作联盟有限公司 The method and system of intellectual product recommendation are carried out based on artificial intelligence and big data
CN110009479A (en) * 2019-03-01 2019-07-12 百融金融信息服务股份有限公司 Credit assessment method and device, storage medium, computer equipment
CN110059112A (en) * 2018-09-12 2019-07-26 中国平安人寿保险股份有限公司 Usage mining method and device based on machine learning, electronic equipment, medium
CN110084627A (en) * 2018-01-23 2019-08-02 北京京东金融科技控股有限公司 The method and apparatus for predicting target variable
CN110148023A (en) * 2019-05-15 2019-08-20 山大地纬软件股份有限公司 The electric power integral Method of Commodity Recommendation and system that logic-based returns
CN110428270A (en) * 2019-08-07 2019-11-08 佰聆数据股份有限公司 The potential preference client recognition methods of the channel of logic-based regression algorithm
CN110688429A (en) * 2019-08-14 2020-01-14 中国平安人寿保险股份有限公司 Target employee screening method and device, computer equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
TRIET H.M. LE 等: "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model", 《CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS》, vol. 172, pages 10 - 16 *
星环科技人工智能平台团队: "《机器学习实战基于Sophon平台的机器学习理论与实践》", 31 January 2020, 北京:机械工业出版社, pages: 192 - 193 *
李煜 等: "数据挖掘在物流客户细分中的应用", 《现代商贸工业》, vol. 25, no. 7, pages 44 - 47 *
杨静: "信用评分卡的建立与应用", 《中国优秀硕士学位论文全文数据库 经济与管理科学辑》, vol. 2018, no. 11, pages 159 - 141 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination