CN118277905A - Customer identification method and device, equipment, storage medium and program product - Google Patents

Info

Publication number
CN118277905A
Authority
CN
China
Prior art keywords
data
clients
client
characteristic data
customer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410378556.8A
Other languages
Chinese (zh)
Inventor
王璐瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202410378556.8A priority Critical patent/CN118277905A/en
Publication of CN118277905A publication Critical patent/CN118277905A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a client identification method, which can be applied to the technical fields of big data, artificial intelligence, and financial technology. The client identification method comprises the following steps: reading n groups of client characteristic data of n clients to be identified from a database; based on initial model parameters set for an isolation forest model, constructing a plurality of binary tree classification models using the n groups of client characteristic data, wherein the n groups of client characteristic data are distributed at node positions of different depths in the binary tree classification models; and determining target clients from the n clients to be identified according to the node positions corresponding to the n groups of client characteristic data, wherein the target clients are preliminarily determined to be second-class clients. The present disclosure also provides a customer identification apparatus, device, storage medium, and program product.

Description

Customer identification method and device, equipment, storage medium and program product
Technical Field
The present disclosure relates to the technical fields of big data, artificial intelligence, and financial technology, and in particular to a method, an apparatus, a device, a medium, and a program product for identifying a client.
Background
Unbalanced data refers to data in which the numbers of positive and negative samples used for classification in machine learning differ too greatly, so that one class dominates in number. Generally, the class with more samples is called the majority class and the class with fewer samples is called the minority class. Unbalanced data is common in practical applications. Traditional classification algorithms assume that the prior probability distribution of the data is balanced, treat all samples in the data set equally, and assume that every class has the same significance and the same misclassification cost. When facing unbalanced data, a traditional classifier is therefore biased toward the majority class and neglects the minority class, so that classification accuracy on the minority class and generalization performance are both low, even though the minority samples are usually the ones of greater interest.
In the early promotion stage of a new product, a large number of clients often need to be identified based on client data so that customers can be served precisely according to the identification result. However, because the product is in the early promotion stage, the amount of positive sample data (that is, data of customers who have actually purchased) is very small, so the sample data is highly unbalanced, and customer classification prediction performed by a traditional classifier on such sample data has low accuracy.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a client identification method, apparatus, device, medium, and program product.
In a first aspect of the present disclosure, a method for identifying a customer is provided, including:
Reading n groups of client characteristic data of n clients to be identified from a database, wherein the n groups of client characteristic data comprise first type characteristic data and second type characteristic data, the first type characteristic data are used for representing the characteristics of first type clients, among the n clients to be identified, that the user is not concerned with, the second type characteristic data are used for representing the characteristics of second type clients, among the n clients to be identified, that the user is concerned with, the number of the first type clients is greater than that of the second type clients, and the second type clients are the target clients with whom the user develops a target business;
Based on initial model parameters set for an isolation forest model, constructing a plurality of binary tree classification models using the n groups of client characteristic data, wherein the n groups of client characteristic data are distributed at node positions of different depths in the binary tree classification models;
And determining target clients from the n clients to be identified according to the node positions corresponding to the n groups of client characteristic data, wherein the target clients are preliminarily determined to be second type clients.
According to the embodiment of the disclosure, the initial model parameters at least comprise an initial segmentation coefficient and an initial discrimination threshold parameter, wherein the initial segmentation coefficient is used for determining initial data segmentation points in the process of constructing a plurality of binary tree classification models, and the initial discrimination threshold parameter is used for determining the deepest positions of corresponding initial leaf nodes of the client characteristic data of the target client in the binary tree classification models.
According to an embodiment of the present disclosure, the initial segmentation coefficient takes a value in the range from a to 1, wherein a takes a value in the range from 0.5 to 1 and is determined according to the proportion of first type clients among all historical clients involved in developing the target business in a preset historical time period.
According to an embodiment of the present disclosure, the k-th segmentation process in constructing a plurality of binary tree classification models using the n groups of client characteristic data, based on the initial model parameters set for the isolation forest model, includes:
placing the k-th characteristic data to be segmented in an m-th layer child node to be segmented, wherein the k-th characteristic data to be segmented is obtained by performing the (k-1)-th segmentation process on the n groups of client characteristic data;
calculating, according to the characteristic values of the target characteristic dimension in the k-th characteristic data to be segmented and the initial segmentation coefficient, a k-th segmentation point for segmenting the k-th characteristic data to be segmented in the m-th layer child node;
and placing first process data in an (m+1)-th layer child node while terminating the segmentation of second process data, wherein the first process data is the client characteristic data in the k-th characteristic data to be segmented whose value is smaller than the k-th segmentation point, and the second process data is the client characteristic data in the k-th characteristic data to be segmented whose value is greater than or equal to the k-th segmentation point.
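The single segmentation step described above can be illustrated with a minimal Python sketch. The source states only that the segmentation point is calculated from the characteristic values and the segmentation coefficient; the formula `min + coef * (max - min)` and the names `split_once` and `coef` are assumptions made for illustration.

```python
from typing import List, Tuple


def split_once(values: List[float], coef: float) -> Tuple[float, List[float], List[float]]:
    """Perform one segmentation step on a single feature dimension.

    ASSUMPTION: the segmentation point is min + coef * (max - min).
    The source does not give the exact formula.
    """
    lo, hi = min(values), max(values)
    point = lo + coef * (hi - lo)
    # First process data: smaller than the point, moves on to layer m+1.
    first = [v for v in values if v < point]
    # Second process data: at or above the point, segmentation terminates.
    second = [v for v in values if v >= point]
    return point, first, second


# Hypothetical feature values for 10 clients and a coefficient in [0.5, 1].
point, first, second = split_once(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], coef=0.7
)
print(point, first, second)
```

With these hypothetical inputs the values 1 to 7 continue to the next layer and the values 8 to 10 stop being segmented.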
According to an embodiment of the present disclosure, determining a target client from n clients to be identified according to node positions corresponding to each of n sets of client feature data includes:
According to the node positions corresponding to the n groups of client characteristic data, calculating the distinguishing degree values of the n groups of client characteristic data;
and determining the target client from the n clients to be identified according to the distinguishing degree value of each of the n sets of client characteristic data.
According to an embodiment of the present disclosure, determining a target client from n clients to be identified according to respective discrimination values of n sets of client feature data includes:
Determining at least one set of target customer characteristic data with a distinguishing value greater than an initial distinguishing threshold parameter from n sets of customer characteristic data;
and determining the customer to be identified corresponding to the target customer characteristic data as a target customer.
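As a rough illustration of turning node positions into distinguishing degree values and applying the threshold, the sketch below uses the standard isolation forest anomaly score, s = 2^(-E[h]/c(n)). This is a swapped-in assumption: the source does not state how the distinguishing degree value is computed. The client identifiers, depths, and threshold are hypothetical.

```python
import math
from typing import Dict, List

EULER_GAMMA = 0.5772156649


def c(n: int) -> float:
    """Average path length normaliser used by the standard isolation forest."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def score(depths: List[int], n: int) -> float:
    """Standard isolation forest score: shallow average depth gives a score near 1."""
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / c(n))


def select_targets(depths_by_client: Dict[str, List[int]], n: int, threshold: float) -> List[str]:
    """Keep clients whose score is greater than the discrimination threshold."""
    return [cid for cid, d in depths_by_client.items() if score(d, n) > threshold]


# Hypothetical leaf depths of two clients across three trees built on n = 10 clients.
depths_by_client = {"c1": [1, 2, 1], "c2": [6, 7, 8]}
targets = select_targets(depths_by_client, n=10, threshold=0.6)
print(targets)
```

Here the client isolated at shallow depths exceeds the threshold and is selected, while the client that sits deep in every tree is not.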
According to an embodiment of the present disclosure, the client identification method further includes:
Reading a plurality of groups of customer characteristic test data of a plurality of test customers from a database, wherein data in the customer characteristic test data are marked with data labels, and the data labels are used for marking the customer types of the test customers corresponding to the customer characteristic test data;
based on the initial model parameters, constructing a plurality of binary tree test models by utilizing a plurality of groups of customer characteristic test data;
According to the node positions of the multiple groups of client characteristic test data in the binary tree test model, outputting the predicted client types of multiple test clients;
and adjusting initial model parameters according to the predicted client types and the data labels of the plurality of test clients.
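One possible way to adjust a parameter from labeled test data is a small grid search over the discrimination threshold, scored by F1 on the second type (minority) clients. The source says only that the initial model parameters are adjusted according to the predicted types and the data labels; the grid search, the F1 criterion, and all names and values below are assumptions.

```python
from typing import List, Sequence, Tuple


def f1(pred: Sequence[bool], label: Sequence[bool]) -> float:
    """F1 score for the positive (second type) class."""
    tp = sum(p and l for p, l in zip(pred, label))
    fp = sum(p and not l for p, l in zip(pred, label))
    fn = sum((not p) and l for p, l in zip(pred, label))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(
    scores: List[float], labels: List[bool], candidates: List[float]
) -> Tuple[float, float]:
    """Return the candidate threshold with the best F1 and that F1 value."""
    def quality(t: float) -> float:
        return f1([s > t for s in scores], labels)

    best = max(candidates, key=quality)
    return best, quality(best)


# Hypothetical distinguishing degree values and data labels of six test clients.
scores = [0.80, 0.75, 0.40, 0.45, 0.30, 0.62]
labels = [True, True, False, False, False, False]
best_t, best_f1 = tune_threshold(scores, labels, candidates=[0.5, 0.6, 0.7])
print(best_t, best_f1)
```

The same loop could be extended to the segmentation coefficient by rebuilding the test trees for each candidate value.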
According to an embodiment of the present disclosure, the client identification method further includes:
extracting target customer characteristic data corresponding to the second type of customers from the n groups of customer characteristic data;
Constructing data based on the target customer characteristic data to generate construction characteristic data;
Generating balanced sample data by using the construction feature data and the n groups of customer feature data;
The customer identification neural network is trained using the balanced sample data.
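The data construction step is not detailed in the source. One common choice is to interpolate between pairs of minority samples, in the spirit of SMOTE; the sketch below assumes that approach, and the function names and sample values are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def construct_samples(minority: List[Vector], count: int, seed: int = 0) -> List[Vector]:
    """Create synthetic vectors by interpolating between two minority samples.

    ASSUMPTION: interpolation is used; the source does not name a method.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    out: List[Vector] = []
    for _ in range(count):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out


def balance(majority: List[Vector], minority: List[Vector], seed: int = 0) -> List[Vector]:
    """Return the original data plus enough constructed samples to even out the classes."""
    need = max(0, len(majority) - len(minority))
    return majority + minority + construct_samples(minority, need, seed)


# Hypothetical two-dimensional feature vectors.
majority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [0.9, 0.8], [1.0, 1.2]]
minority = [[5.0, 4.0], [6.0, 5.0]]
balanced = balance(majority, minority)
print(len(balanced))
```

The balanced list contains the original samples followed by the constructed ones, so the constructed portion can be labeled as second type when training a downstream network.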
According to an embodiment of the present disclosure, the customer characteristic data includes at least: client attribute data, client portrait data, client transaction behavior data, and reference behavior data of clients with respect to reference products, wherein the product characteristics of the reference products are the same as or similar to the product characteristics of the target product for which the user develops the target business.
A second aspect of the present disclosure provides a customer identification device, comprising: a first reading module configured to read n groups of client characteristic data of n clients to be identified from the database, wherein the n groups of client characteristic data comprise first type characteristic data and second type characteristic data, the first type characteristic data are used for representing the characteristics of first type clients, among the n clients to be identified, that the user is not concerned with, the second type characteristic data are used for representing the characteristics of second type clients, among the n clients to be identified, that the user is concerned with, the number of the first type clients is greater than that of the second type clients, and the second type clients are the target clients with whom the user develops a target business;
a first construction module configured to construct a plurality of binary tree classification models using the n groups of client characteristic data based on initial model parameters set for an isolation forest model, wherein the n groups of client characteristic data are distributed at node positions of different depths in the binary tree classification models; and
a determining module configured to determine target clients from the n clients to be identified according to the node positions corresponding to the n groups of client characteristic data, wherein the target clients are preliminarily determined to be second type clients.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the method.
A fourth aspect of the present disclosure also provides a computer readable storage medium having stored thereon a computer program or instructions which, when executed by a processor, implement the steps of the above method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program or instructions which, when executed by a processor, performs the steps of the method described above.
According to the embodiments of the present disclosure, the data classification problem is converted into a data identification problem: data whose characteristics differ from those of the majority-class data is found among all sample data and is regarded as minority-class data. Specifically, based on the isolation forest model, in the process of constructing a plurality of binary tree classification models from the customer characteristic data and the initial model parameters, all sample data is input into the binary tree classification models so that it is partitioned according to the split conditions. Majority-class and minority-class sample data can thus be obtained from the customer characteristic data, the minority-class sample data is screened out, and converting passive classification of minority-class sample data into active identification improves the precision and accuracy with which it is identified. For unbalanced samples, a traditional classifier has low prediction accuracy. The present disclosure therefore abandons the traditional classifier and performs classification prediction with an isolation forest, which is suited to unbalanced data and identifies minority-class samples well, so that identification accuracy in this scenario is better than that of a traditional classifier.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a client identification method, apparatus, device, medium and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a customer identification method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a schematic diagram of a binary tree classification model according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a leaf node location schematic in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of customer identification according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a method of determining a target client according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a flow chart of a method of determining a target client according to another embodiment of the disclosure;
FIG. 8 schematically illustrates a block diagram of a client identification device according to an embodiment of the present disclosure; and
Fig. 9 schematically illustrates a block diagram of an electronic device adapted to implement a customer identification method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a convention should be interpreted in the sense in which one of skill in the art would generally understand it (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.).
In the technical solution of the present disclosure, the user information involved (including, but not limited to, user personal information, user image information, and user equipment information such as location information) and the data involved (including, but not limited to, data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure, and application of the related data comply with the relevant laws, regulations, and standards, necessary security measures are taken, public order and good customs are not violated, and corresponding operation entries are provided for the user to grant or refuse authorization.
In the scenario of using personal information to make an automated decision, the method, the device and the system provided by the embodiment of the disclosure provide corresponding operation inlets for users, so that the users can choose to agree or reject the automated decision result; if the user selects refusal, the expert decision flow is entered. The expression "automated decision" here refers to an activity of automatically analyzing, assessing the behavioral habits, hobbies or economic, health, credit status of an individual, etc. by means of a computer program, and making a decision. The expression "expert decision" here refers to an activity of making a decision by a person who is specializing in a certain field of work, has specialized experience, knowledge and skills and reaches a certain level of expertise.
It should be noted that the method, apparatus, device, storage medium and program product for identifying a client according to the embodiments of the present disclosure may be applied to the technical field of big data, the technical field of artificial intelligence, and the technical field of financial science, and may also be applied to any field other than the technical field of big data, the technical field of artificial intelligence, and the technical field of financial science, and the application fields of the method, apparatus, device, storage medium and program product for identifying a client according to the embodiments of the present disclosure are not limited.
Unbalanced data is common in practical applications. For example, in spam identification, normal mail far outnumbers spam; in credit card fraud detection, the proportion of normal users is much greater than the proportion of users who commit fraud. For a new product, most users take a wait-and-see attitude in the early promotion stage and choose not to purchase the product for the time being, so users who have not purchased the product usually far outnumber users who have. Therefore, in the early promotion stage of the product, the data on whether users have purchased the product forms a set of unbalanced data. In practical application scenarios, the characteristics of the users who have purchased the product (a relatively small proportion of users) need to be analyzed in order to better mine the users who have not purchased the product (a relatively large proportion of users), to optimize the sales strategy, and so on. However, the users who have purchased the product can provide only a limited amount of positive sample data, which hampers the learning of a traditional classifier and leads to low accuracy of the training result.
In view of this, an embodiment of the present disclosure provides a client identifying method, including:
Reading n groups of client characteristic data of n clients to be identified from a database, wherein the n groups of client characteristic data comprise first type characteristic data and second type characteristic data, the first type characteristic data are used for representing the characteristics of first type clients, among the n clients to be identified, that the user is not concerned with, the second type characteristic data are used for representing the characteristics of second type clients, among the n clients to be identified, that the user is concerned with, the number of the first type clients is greater than that of the second type clients, and the second type clients are the target clients with whom the user develops a target business;
Based on initial model parameters set for an isolation forest model, constructing a plurality of binary tree classification models using the n groups of client characteristic data, wherein the n groups of client characteristic data are distributed at node positions of different depths in the binary tree classification models;
And determining target clients from the n clients to be identified according to the node positions corresponding to the n groups of client characteristic data, wherein the target clients are preliminarily determined to be second type clients.
Fig. 1 schematically illustrates an application scenario diagram of a client identification method according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the client identifying method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the client identifying means provided by the embodiments of the present disclosure may be generally provided in the server 105. The client identification method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the client identifying means provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The client identification method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 7 based on the scenario described in fig. 1.
According to the embodiment of the disclosure, analysis of the consumption characteristics and purchasing capacity of historical clients in a historical time period, based on experience in promoting historical products, shows that in the early promotion stage of a product a small proportion of users choose to purchase the product directly, while a large proportion of users choose to wait and see and do not purchase the product for the time being. The unbalanced data in the application scenario of the embodiment of the disclosure likewise follows this data distribution characteristic and natural law.
In an application scenario of an embodiment of the present disclosure, there is a set of unbalanced data. Taking a pension product as an example, some users have purchased the pension product and some users have not. In embodiments of the present disclosure, the positive sample data, that is, the data of users who have purchased the pension product, needs to be identified: how many users have purchased the pension product and, in particular, which users have purchased it. A customer who has purchased the pension product is determined to be a target customer.
Fig. 2 schematically illustrates a flow chart of a client identification method according to an embodiment of the present disclosure. Fig. 3 schematically illustrates a schematic diagram of a binary tree classification model according to an embodiment of the disclosure. This is explained in detail below in connection with fig. 2 and 3.
As shown in fig. 2, the customer identification method of this embodiment includes operations S210 to S230.
In operation S210, n sets of client feature data of n clients to be identified are read from the database, where the n sets of client feature data include first class feature data and second class feature data, the first class feature data is used to characterize the features of first class clients, among the n clients to be identified, that the user is not concerned with, the second class feature data is used to characterize the features of second class clients, among the n clients to be identified, that the user is concerned with, the number of the first class clients is greater than the number of the second class clients, and the second class clients are the target clients with whom the user develops a target business.
According to the embodiment of the disclosure, the client to be identified is any user in the application scene; each group of customer characteristic data is a characteristic value corresponding to the customer to be identified, wherein the customer characteristic data can comprise a characteristic value of a customer attribute, a characteristic value of a customer behavior, a characteristic value of a customer purchasing capability and the like. Each customer to be identified corresponds to a group of customer characteristic data, and then n customers to be identified correspond to n groups of customer characteristic data, wherein n is a positive integer greater than 1.
According to the embodiment of the disclosure, n groups of customer characteristic data are directly read from a database to form unbalanced data in the application scene. The unbalanced data composed of n sets of customer characteristic data includes first-class characteristic data and second-class characteristic data. The first type of characteristic data is used for representing the characteristics of a first type of customers not concerned by the user, and under the application scene of the disclosure, the first type of characteristic data can be understood as customer characteristic data of the users who do not purchase the product temporarily; the second category of characteristic data is used to characterize the characteristics of the second category of customers of interest to the user, i.e. the customer characteristic data of the user who has purchased the product. Because the unbalanced data distribution in the application scene accords with the distribution characteristics and the natural rule of the historical data, the first type of characteristic data is far greater than the second type of characteristic data. For example, there are 900 users (users who have not purchased the product) corresponding to the first type of feature data, and 10 users (users who have purchased the product) corresponding to the second type of feature data.
In operation S220, a plurality of binary tree classification models are constructed using the n sets of customer characteristic data based on initial model parameters set for the isolation forest model, wherein the n sets of customer characteristic data are distributed at node positions of different depths in the binary tree classification models.
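To make "node positions of different depths" concrete, the following sketch repeatedly splits one feature dimension and records the layer at which each client stops being segmented. It assumes, as elsewhere in this disclosure, that data at or above the segmentation point terminates while smaller data continues. The segmentation-point formula `min + coef * (max - min)`, the `max_depth` guard, and the client identifiers are assumptions for illustration only.

```python
from typing import Dict


def leaf_depths(values: Dict[str, float], coef: float, max_depth: int = 8) -> Dict[str, int]:
    """Return the layer at which each client's data ends up in one tree.

    ASSUMPTION: segmentation point = min + coef * (max - min) on a single
    feature dimension; the source does not give the exact formula.
    """
    depths: Dict[str, int] = {}
    current = dict(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current.values()), max(current.values())
        if lo == hi:
            break  # identical values cannot be separated further
        point = lo + coef * (hi - lo)
        depth += 1
        for cid, v in list(current.items()):
            if v >= point:
                depths[cid] = depth  # segmentation terminates for this client
                del current[cid]
    for cid in current:
        depths[cid] = depth  # whatever remains sits in the last child node
    return depths


# Hypothetical single-dimension feature values for clients c1..c10.
values = {f"c{i}": float(i) for i in range(1, 11)}
depths = leaf_depths(values, coef=0.7)
print(depths)
```

In this toy run the clients with the largest values are separated at layer 1, while those with the smallest values end up deepest in the tree.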
According to an embodiment of the present disclosure, the root node and the child nodes in each layer of the binary tree contain characteristic data of a plurality of customers, and a leaf node contains characteristic data of at least one customer. Taking the number n of customers in the root node equal to 10 as an example, as shown in fig. 3, the root node includes 10 sets of customer characteristic data, numbered 1-10, corresponding to 10 customers. Each set of customer characteristic data may include characteristic values of customer attributes, such as gender and age, and characteristic values of customer behavior, such as whether the customer has purchased a similar product or the number of purchases. In the following examples, each number directly denotes the customer characteristic data with that number.
According to the embodiment of the disclosure, while the isolated forest algorithm runs, each segmentation of the data in the current data node of the binary tree proceeds as follows: initial model parameters are first determined, for example parameters for determining the segmentation point of the binary tree (including the segmentation feature and the feature segmentation value); the data of the current data node is then segmented using the initial model parameters, which determines the left subtree and the right subtree of the binary tree; repeating this completes the construction of the tree. As shown in fig. 3, construction of the binary tree classification model starts from the initial model parameters set for the isolated forest model and the 10 sets of customer characteristic data in the root node. The construction of one binary tree is described as an example; the preset number of other binary trees are constructed in the same way, which is not repeated here. Specifically:
(1) Segmenting the customer characteristic data in the root node. For example, based on the segmentation feature and/or the feature segmentation value determined by the model parameters, the 10 sets of customer characteristic data of the root node are segmented. The segmentation point is generated between the maximum value and the minimum value of the specified dimension in the 10 sets of customer characteristic data, and a hyperplane formed at the segmentation point divides the 10 sets into 2 subspaces: customer characteristic data smaller than the segmentation point in the specified dimension is placed on the left, giving first-layer child node 1 formed by 1-8, and customer characteristic data greater than or equal to the segmentation point in the specified dimension is placed on the right, giving first-layer child node 2 formed by 9-10.
(2) Segmenting the customer characteristic data in first-layer child node 1, based on the segmentation feature and/or the feature segmentation value determined by the model parameters. For example, the customer characteristic data 1-8 in first-layer child node 1 is segmented to obtain second-layer child node 1 formed by 1-5 and second-layer child node 2 formed by 6-8; the segmentation of first-layer child node 2 is terminated.
(3) Segmenting the customer characteristic data in second-layer child node 1, based on the segmentation feature and/or the feature segmentation value determined by the model parameters. The customer characteristic data 1-5 in second-layer child node 1 is segmented to obtain third-layer child node 1 formed by 1-4 and third-layer child node 2 formed by 5; the segmentation of second-layer child node 2 is terminated.
And so on, until the preset termination condition of the segmentation process is met. The preset termination condition may be a limit on the height or number of layers of the tree, for example terminating after 20 layers are obtained; it may also be that the current node cannot be segmented any further, as with the 5th set of customer characteristic data in third-layer child node 2 in fig. 3.
In this way, one binary tree classification model is constructed based on the initial model parameters set for the isolated forest model. An isolated forest composed of a plurality of binary trees can be constructed in the same manner, which is not repeated here.
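The construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: each customer has a single numeric feature, each split point is drawn uniformly between the minimum and maximum of the current node (as in a conventional isolation tree), and the names `build_tree` and `max_depth` are hypothetical. At each layer the data below the split point stays in a child node that is split again, and the data at or above it becomes a leaf whose splitting is terminated.

```python
import random


def build_tree(values, max_depth, rng):
    """Return the leaves of one tree as a list of (depth, customer_ids).

    `values` maps a customer id to one numeric feature value (an assumption
    made for this sketch; real customer data has many feature dimensions).
    """
    leaves = []
    current = dict(values)  # data held by the node that is still being split
    depth = 0
    while True:
        lo, hi = min(current.values()), max(current.values())
        # Termination: layer limit reached, or the node cannot be split further.
        if depth >= max_depth or len(current) == 1 or lo == hi:
            leaves.append((depth, sorted(current)))
            return leaves
        split = rng.uniform(lo, hi)
        left = {c: v for c, v in current.items() if v < split}
        right = {c: v for c, v in current.items() if v >= split}
        if not left:  # degenerate split: nothing fell below the split point
            leaves.append((depth, sorted(current)))
            return leaves
        depth += 1
        leaves.append((depth, sorted(right)))  # right side: splitting terminated
        current = left                         # left side: split again


values = {i: float(i) for i in range(1, 11)}  # ten customers, ids 1..10
leaves = build_tree(values, max_depth=20, rng=random.Random(0))
for depth, members in leaves:
    print(depth, members)
```

Calling `build_tree` repeatedly with different random generators would give the several trees of a forest; the number of trees is left to the caller in this sketch.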
In operation S230, a target client is determined from the n clients to be identified according to the node positions corresponding to the n sets of client characteristic data, wherein the target client is primarily determined as a second type client.
In the above embodiment, after a plurality of binary tree classification models are constructed from the n sets of customer characteristic data (the plurality of binary trees is not shown in the figure), the target customer is determined from the n customers to be identified according to the node position of each set of customer characteristic data in the binary tree classification models. Specifically, the discrimination value of the nodes in each layer is calculated; the deepest layer m at which the customer characteristic data of a target customer sits in a leaf node of the binary tree classification model is determined according to the discrimination values; and all customers contained in the leaf nodes at or above the m-th layer whose segmentation has been terminated are counted as target customers.
A traditional classification algorithm assumes that the prior probability distribution of the data is balanced and treats all samples in the data set equally. When it faces unbalanced data, it still treats the samples as balanced; a model trained on such data tends to follow the characteristics of the majority-class samples, so the result is biased toward the majority class and the characteristics of the minority-class samples are weakened, which lowers the classification accuracy for the minority class.
For example, in the application scenario of the embodiments of the present disclosure, the early promotion stage of a new product, a large number of customers often need to be identified based on customer data so that targeted customer service can be provided according to the identification result. However, because the product is in the early promotion stage, the amount of positive sample data (namely data of customers who have actually purchased) is very small, so the sample data is heavily unbalanced, and a traditional classifier trained on it predicts customer classes with low accuracy.
The technical improvement of the method of the embodiment of the disclosure is as follows. In the application scenario of the embodiment, the data features of the minority class in the unbalanced data need to be analyzed. To avoid the above problem of existing classification algorithms, the data classification problem is converted into a data identification problem: data whose characteristics differ from those of the majority-class data is found among all the sample data and regarded as minority-class data. Specifically, based on the isolated forest model, while a plurality of binary tree classification models are constructed from the customer characteristic data and the initial model parameters, all the sample data is input into the binary tree classification models and divided according to the segmentation conditions, so that majority-class and minority-class sample data can be separated according to the customer characteristic data and the minority-class sample data is screened out. Converting the passive classification of the minority-class sample data into active identification improves the precision and accuracy with which it is identified.
For unbalanced samples, a traditional classifier has low prediction accuracy. In this improvement, the traditional classifier is abandoned and an isolated forest, which is suited to unbalanced data and identifies minority-class samples well, is used for classification prediction, so the method achieves better identification precision than a traditional classifier in this scenario.
According to an embodiment of the present disclosure, the customer characteristic data includes at least: customer attribute data, customer portrait data, customer transaction behavior data, and reference behavior data of customers for a reference product, wherein the product characteristics of the reference product are the same as or similar to those of the target product with which the user develops the target business.
According to embodiments of the present disclosure, the customer attribute data may contain basic customer information such as age, gender and occupation; the customer portrait data may include, for example, customer authentication information; the customer transaction behavior data may include, for example, information on products historically purchased by the customer and the number of times the customer has purchased a given type of product. For the reference behavior data of the customer for the reference product, the product characteristics of the reference product are the same as or similar to those of the target product, so the customer's likely behavior toward the target product can be predicted from the customer's reference behavior toward the reference product. For example, the target product for developing the target business is "B pension product" and the reference product is "A pension product", which has already been promoted; a customer who has purchased "A pension product" may also wish to purchase "B pension product".
According to embodiments of the present disclosure, the customer attribute data in each dimension includes the attribute values of a plurality of customers. When the binary tree classification model is constructed from the customer characteristic data, one characteristic dimension can be selected from the customer attribute data in each iteration, and the customer attribute data is segmented in the selected dimension according to the initial segmentation coefficient. Illustratively, as shown in fig. 3, the characteristic dimension "customer age" is selected and the customer attribute data of the root node is segmented according to the initial segmentation coefficient; the characteristic dimension "customer interest value for the product" is then selected and the customer attribute data of first-layer child node 1 is segmented according to the initial segmentation coefficient.
Fig. 4 schematically illustrates a leaf node position schematic in accordance with an embodiment of the present disclosure.
According to the embodiment of the disclosure, after the root node is segmented based on the segmentation feature and/or the feature segmentation value determined by the model parameters, first-layer child node 1 and first-layer child node 2 are obtained, as shown in fig. 3. In the second segmentation, only first-layer child node 1 is segmented; the segmentation of first-layer child node 2 is terminated, and first-layer child node 2 is regarded as a leaf node, like leaf nodes 1, 2, 3 and 4 in fig. 4. The model parameters of the isolated forest model are explained below in connection with fig. 4.
According to an embodiment of the present disclosure, the model parameters of the isolated forest model include at least a segmentation coefficient and a discrimination threshold parameter. The discrimination threshold parameter is used to determine the position of the deepest leaf node in the binary tree classification model that corresponds to the customer characteristic data of second-class customers. The segmentation coefficient is used to determine the data segmentation points (segmentation features and/or feature segmentation values) during construction of the plurality of binary tree classification models. As an alternative embodiment, the segmentation coefficient may be set based on historical data distribution characteristics and historical user characteristics.
According to an embodiment of the disclosure, the initial model parameters set by the isolated forest model at least comprise an initial segmentation coefficient and an initial discrimination threshold parameter. The initial segmentation coefficient and the initial discrimination threshold parameter are respectively a first preset segmentation coefficient and a first preset discrimination threshold parameter. At this stage, the initial model parameters are not adjusted or optimized.
According to an embodiment of the present disclosure, the initial model parameters may further include an initial number of a plurality of binary tree classification models, wherein the initial number of binary tree classification models may be defined or obtained by any manner in the prior art, and will not be described herein. In embodiments of the present disclosure, the initial number of binary tree classification models may also be defined based on the number of sets of customer characteristic data.
According to an embodiment of the present disclosure, the initial segmentation coefficient takes a value in the range a to 1, where a itself lies in the range 0.5 to 1 and is determined according to the proportion of first-class customers among the total historical customers involved in developing the target business in the preset historical time period.
According to an embodiment of the present disclosure, the initial segmentation coefficient is determined based on the total number of historical customers involved in developing the target business in a preset historical time period and the number of first-class customers among them. Because the current time period in which the target business is developed is similar to the preset historical time period, the data distribution characteristics and user characteristics of the target business are similar and the historical value can serve as a reference: the proportion of first-class customers among all current customers in the current time period can be predicted from the proportion of first-class customers among all historical customers in the preset historical time period.
Illustratively, in a preset historical time period, for example January 15, "A pension product" goes online; the number of first-class customers is 900 and the number of second-class customers is 100, so a = 900/(900+100) = 0.9. In the current time period, February 15 of the same year, "B pension product" goes online, and the numerical range of the initial segmentation coefficient P_0 is directly defined as 0.9-1. As a further illustration, in a preset historical time period, for example January 15, "A financial product" goes online; the number of first-class customers is 8000 and the number of second-class customers is 2000, so a = 8000/(8000+2000) = 0.8. In the current time period, January 15 of the following year, "B financial product" goes online, and the numerical range of the initial segmentation coefficient P_0 is directly defined as 0.8-1.
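The arithmetic in these two examples can be sketched as below. The function name `initial_coefficient_range` is hypothetical, and rejecting a lower bound outside 0.5 to 1 is an assumption of this sketch rather than something the disclosure prescribes.

```python
def initial_coefficient_range(first_class_count, second_class_count):
    """Return (a, 1.0), the range from which the initial coefficient is taken.

    `a` is the share of first-class customers among all historical customers.
    """
    total = first_class_count + second_class_count
    if total <= 0:
        raise ValueError("no historical customers")
    a = first_class_count / total
    # Assumption: treat a lower bound outside the stated 0.5-1 range as an error.
    if not 0.5 <= a <= 1.0:
        raise ValueError(f"lower bound {a} is outside the range 0.5 to 1")
    return a, 1.0


pension_range = initial_coefficient_range(900, 100)      # "A pension product"
financial_range = initial_coefficient_range(8000, 2000)  # "A financial product"
print(pension_range, financial_range)
```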
The method of the embodiment of the disclosure improves the traditional isolated forest algorithm: the segmentation point of each data segmentation is constrained, and the size of the constraint is determined from the historical proportion of a given class of users in the historical statistics. The segmentation point used in each data segmentation is therefore more accurate and the segmentation is closer to the actual data distribution, which speeds up the separation and identification of the minority-class data, increases the processing speed of the algorithm and improves the performance of the computer processor.
Fig. 5 schematically illustrates a flow chart of a method of customer identification according to another embodiment of the present disclosure. This is explained in detail below in connection with fig. 4 and 5.
As shown in fig. 5, after the initial segmentation coefficient is determined according to the above embodiment, the kth segmentation performed while constructing the plurality of binary tree classification models from the n sets of customer characteristic data, based on the initial model parameters set for the isolated forest model, includes operations S510 to S530.
In operation S510, the kth feature data to be segmented is placed in the mth-layer child node to be segmented, where the kth feature data to be segmented is obtained by performing the (k-1)th segmentation on the n sets of customer characteristic data.
According to the embodiment of the disclosure, taking an arbitrarily selected kth iteration as an example, as shown in fig. 4, after the (k-1)th segmentation is performed on the (m-1)th-layer child node 1, the mth-layer child node 1 formed by customer characteristic data 1-5 and leaf node 2 formed by customer characteristic data 6-8 are obtained.
In operation S520, a kth segmentation point for segmenting the kth feature data to be segmented in the mth layer child node 1 is calculated according to the feature value of the target feature dimension in the kth feature data to be segmented and the initial segmentation coefficient.
According to an embodiment of the present disclosure, the target feature dimension is the feature dimension used in the kth segmentation, and may be, for example, one of "the number of times the customer purchased the product", "customer gender", "the customer's interest value for pension products", and the like. The feature value of the target feature dimension is a specific value in that dimension. For example, when the target feature dimension is "the number of times the customer purchased the product", the feature values may include 10 times, 20 times, 3 times, 8 times, 0 times, and so on; when the target feature dimension is "customer gender", the feature values may include male and female; when the target feature dimension is "the customer's interest value for pension products", the feature values may include 2.5, 3.9, 7, 9, and so on.
According to the embodiment of the disclosure, the kth segmentation point for segmenting the kth feature data to be segmented in the mth-layer child node 1 is calculated from the feature values of the target feature dimension in that data and the initial segmentation coefficient. Specifically, the average of the feature values, in the target feature dimension, of the customers contained in the node holding the kth feature data to be segmented is calculated, and the product of this average and the initial segmentation coefficient gives the kth segmentation point.
Illustratively, the target feature dimension in the initial model parameters is set to "the number of times the customer purchased the product in a preset time period", and the total number of customers involved in currently developing the target business is 1000. From the 1000 feature values of these customers in the target feature dimension, the average number of purchases in the preset time period is calculated to be 5. The way the initial segmentation coefficient is determined has been disclosed in the above embodiments and is not repeated here; suppose the initial segmentation coefficient is set to 0.6. The average of the feature values is then multiplied by the initial segmentation coefficient to obtain the kth segmentation point, i.e., 5×0.6=3 purchases.
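As a small sketch of this calculation, the helper below multiplies the mean feature value of a node by the segmentation coefficient. The name `kth_split_point` and the four sample purchase counts are made up for illustration; they were chosen only so that their mean is 5, as in the example.

```python
def kth_split_point(feature_values, coefficient):
    """Split point = mean of the node's feature values x segmentation coefficient."""
    if not feature_values:
        raise ValueError("the node holds no data to segment")
    mean_value = sum(feature_values) / len(feature_values)
    return mean_value * coefficient


# Hypothetical purchase counts whose mean is 5, with the coefficient 0.6.
purchase_counts = [2, 4, 6, 8]
split_point = kth_split_point(purchase_counts, 0.6)
print(split_point)  # 5 x 0.6, i.e. 3 purchases as in the example
```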
Further, the mth-layer child node 1 is segmented according to the target feature dimension and the kth segmentation point, so that a hyperplane divides the kth feature data to be segmented in the mth-layer child node 1 into two subspaces, namely first process data ((m+1)th-layer child node 1 shown in fig. 4) and second process data (leaf node 3 shown in fig. 4).
In operation S530, the first process data is placed in the (m+1)th-layer child node, and the segmentation of the second process data is terminated, where the first process data is the customer characteristic data in the kth feature data to be segmented whose value is smaller than the kth segmentation point, and the second process data is the customer characteristic data in the kth feature data to be segmented whose value is greater than or equal to the kth segmentation point.
According to an embodiment of the present disclosure, as shown in fig. 4, the first process data is placed in the (m+1)th-layer child node 1, and the (k+1)th segmentation is prepared for customer characteristic data 1-4. Meanwhile, the segmentation of the second process data (leaf node 3) is terminated, and the customers corresponding to the customer characteristic data in the second process data (leaf node 3) are provisionally treated as second-class customers.
According to embodiments of the present disclosure, the isolated forest model is reasonably optimized by predicting the initial segmentation coefficient from historical data characteristics and historical experience. When a binary tree classification model is constructed with the initial segmentation coefficient, only the child node holding the data smaller than the segmentation point needs to be segmented further; the leaf node (customer characteristic data) greater than or equal to the segmentation point is not segmented again. During segmentation, the data greater than or equal to the segmentation point (provisionally identified as minority-class data) stops being segmented, and the way the segmentation point is set preserves processing precision to a certain extent. The method therefore removes a large number of unnecessary segmentation iterations while maintaining processing precision, which increases the computation rate and improves the processing performance of the processor.
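Operations S510 to S530 can be sketched as one function that takes the data of the node to be segmented and returns the two parts. This is an illustrative sketch only: the function name `split_node`, the dictionary layout and the sample purchase counts are assumptions, and only a single numeric feature dimension is handled.

```python
def split_node(customers, dimension, coefficient):
    """Perform one segmentation of a node.

    `customers` maps a customer id to a dict of feature values.
    Returns (split_point, first_process_data, second_process_data).
    """
    values = [features[dimension] for features in customers.values()]
    split_point = coefficient * (sum(values) / len(values))
    # First process data: below the split point, goes to the next-layer child node.
    first = {c: f for c, f in customers.items() if f[dimension] < split_point}
    # Second process data: at or above the split point, becomes a terminated leaf.
    second = {c: f for c, f in customers.items() if f[dimension] >= split_point}
    return split_point, first, second


# Hypothetical node with five customers; the mean purchase count is 5.
node = {
    1: {"purchases": 0},
    2: {"purchases": 1},
    3: {"purchases": 2},
    4: {"purchases": 4},
    5: {"purchases": 18},
}
split_point, first, second = split_node(node, "purchases", 0.6)
print(sorted(first), sorted(second))
```

Feeding `first` back into `split_node` would perform the next segmentation, while `second` is left untouched as a leaf.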
Fig. 6 schematically illustrates a flow chart of a method of determining a target client according to an embodiment of the disclosure.
As shown in fig. 6, determining a target client from n clients to be identified according to the node positions corresponding to the n sets of client characteristic data includes operations S610 to S620.
In operation S610, respective discrimination values of the n sets of client feature data are calculated according to respective corresponding node positions of the n sets of client feature data;
According to an embodiment of the present disclosure, the discrimination value of each set of customer characteristic data is determined from the position of that set in the binary tree. The discrimination value of each set of customer characteristic data in the data set is related to its depth in the isolation tree: the shallower the depth, the higher the discrimination value, and the deeper the depth, the lower the discrimination value. For the specific calculation of the discrimination value, refer to formula (1):
Wherein x_i denotes each set of customer characteristic data; h(x_i) is the height of x_i in each tree; n is the number of sets of customer characteristic data in a single binary tree; and c(n) is the average path depth of a binary tree constructed from n sets of customer characteristic data, which is used to normalize the path depth h(x_i) of sample x_i. For the specific calculation, refer to formula (2):
Where H(n-1) is a harmonic number, which can be estimated as ln(n-1) + Euler's constant.
According to the embodiment of the disclosure, the higher the leaf node in which the customer characteristic data x_i is located, i.e. the smaller the path depth, the larger the discrimination value, and vice versa. Illustratively, as shown in fig. 4, leaf node 2 is higher than leaf node 3 and has a smaller path depth, so the discrimination value of leaf node 2 is larger than that of leaf node 3.
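Formulas (1) and (2) are not reproduced in this text. The definitions given for h(x_i), c(n) and H(n-1) match the score used by the standard isolation forest algorithm, so the sketch below assumes those standard forms: s(x_i, n) = 2^(-E[h(x_i)] / c(n)) and c(n) = 2H(n-1) - 2(n-1)/n, where E[h(x_i)] is the mean depth of x_i over the trees. Whether the disclosure uses exactly these forms is an assumption, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant


def average_path_depth(n):
    """c(n) under the assumed standard form 2*H(n-1) - 2*(n-1)/n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    harmonic = math.log(n - 1) + EULER_GAMMA  # estimate of H(n-1) given in the text
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def discrimination_value(depths, n):
    """Assumed score 2 ** (-mean depth / c(n)).

    `depths` holds the depth of one set of customer data in each tree.
    """
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / average_path_depth(n))


shallow = discrimination_value([1, 2], n=10)  # data isolated near the root
deep = discrimination_value([5, 6], n=10)     # data isolated far from the root
print(shallow, deep)
```

Under this form, data isolated at a shallower depth receives the larger value, which agrees with the relationship between depth and discrimination value described above.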
In operation S620, a target client is determined from the n clients to be identified according to the respective discrimination values of the n sets of client characteristic data.
According to an embodiment of the present disclosure, one embodiment of determining the target client according to the discrimination value of each set of client feature data may be to set a discrimination threshold, compare the discrimination value of each set of client feature data to the discrimination threshold, and determine the target client.
As an alternative embodiment, fig. 7 schematically illustrates a flow chart of a method of determining a target customer according to another embodiment of the present disclosure. As shown in fig. 7, in this embodiment, operations S710 to S720 are included in addition to operations S610 to S620 described above with reference to fig. 6. For brevity, the description of operations S610 to S620 is omitted here. Determining the target customer from the n customers to be identified according to the discrimination value of each of the n sets of customer characteristic data includes operation S710 and operation S720.
At least one set of target customer characteristic data having a discrimination value greater than an initial discrimination threshold parameter is determined from the n sets of customer characteristic data in operation S710.
According to embodiments of the present disclosure, an initial discrimination threshold parameter is predefined, and it may be derived from historical experience. After the discrimination value of each leaf node is calculated according to the above embodiment, it is compared with the initial discrimination threshold parameter. The depth in the binary tree of the deepest leaf node whose discrimination value is larger than the initial discrimination threshold parameter is determined as the discrimination depth, the node at the discrimination depth is determined as the first target leaf node, and the customer characteristic data in the first target leaf node is determined as the first group of target customer characteristic data. Because a higher leaf node position gives a larger discrimination value, once the first target leaf node, which has the deepest path, is determined, the discrimination values of all leaf nodes at the discrimination depth and above are larger than the initial discrimination threshold parameter. As shown in fig. 4, if leaf node 4 is determined to be the first target leaf node, the discrimination values of leaf node 4 in the (m+2)th layer and of leaf nodes 1 to 3 in the layers above are all larger than the initial discrimination threshold parameter. The customer characteristic data in all leaf nodes whose discrimination value is larger than the initial discrimination threshold parameter can then be determined as target customer characteristic data.
(Leaf nodes are the nodes that are separated out first in each layer; they are the nodes of interest to the user and correspond to minority-class sample data. The non-leaf nodes, namely the child nodes, in each layer still need to be segmented, and the types of data features they contain are not yet clear, so they are currently not of interest to the user and are provisionally regarded as majority-class sample data.)
In operation S720, the customer to be identified corresponding to the target customer characteristic data is determined as the target customer.
According to the embodiment of the disclosure, the customer corresponding to the target customer characteristic data can be determined as the target customer. The target customer is thus preliminarily determined to be a second-class customer, namely a customer belonging to the class with the smaller amount of sample data.
According to the embodiment of the disclosure, after the target customer (the customer who has purchased the product) is determined, the customer can be analyzed further so that the product can be promoted and marketed more precisely.
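Operations S710 and S720 amount to a threshold filter, which can be sketched as follows. The function name `select_target_customers`, the customer ids, the discrimination values and the threshold 0.6 are all made up for illustration.

```python
def select_target_customers(discrimination_values, threshold):
    """Return the ids whose discrimination value exceeds the threshold (S710-S720)."""
    return sorted(
        customer
        for customer, value in discrimination_values.items()
        if value > threshold
    )


# Hypothetical discrimination values for four customers.
scores = {"c1": 0.72, "c2": 0.41, "c3": 0.65, "c4": 0.38}
targets = select_target_customers(scores, threshold=0.6)
print(targets)
```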
According to an embodiment of the present disclosure, the above-mentioned client identification method is performed by an electronic device, and specifically includes:
In response to a read data instruction generated by the controller, n sets of customer characteristic data of n customers to be identified are read from the database, and the n sets of customer characteristic data are stored in the register. The n groups of client feature data comprise first type feature data and second type feature data, the first type feature data is used for representing the features of first type clients which are not concerned by users in the n clients to be identified, the second type feature data is used for representing the features of second type clients which are concerned by users in the n clients to be identified, the number of the first type clients is larger than that of the second type clients, and the second type clients are target clients for users to develop target services.
In response to the model building instructions generated by the controller, generating, by the operator, a plurality of binary tree classification models based on the initial model parameters set by the isolated forest model and the n sets of customer characteristic data, and storing the initial model parameters in the register. Wherein, n groups of customer characteristic data are distributed at node positions with different depths in the binary tree classification model.
In response to the target customer identification instruction generated by the controller, the following operations are performed by the operator: determining target customers from the n customers to be identified according to the node positions corresponding to the n sets of customer characteristic data, wherein the target customers are preliminarily determined as second-class customers, and marking the second-class customers in the register.
According to an embodiment of the present disclosure, in the process of executing the above-described customer identification method by the electronic device, in response to the calculation instruction generated by the controller, the operator calculates the initial segmentation coefficient according to the following rule. Specifically, the initial segmentation coefficient takes a value in the range a to 1, where a itself lies in the range 0.5 to 1 and is determined according to the proportion of first-class customers among the total historical customers involved in developing the target business in the preset historical time period.
According to an embodiment of the present disclosure, in a process of executing the above-described client identification method by an electronic device, in response to a model construction instruction generated by a controller, generating, by an operator, a plurality of binary tree classification models based on an initial model parameter set by an isolated forest model and n sets of client feature data, the method of executing a kth segmentation process by the operator includes:
The method is executed by an arithmetic unit: the kth characteristic data to be segmented is arranged in an mth layer of sub-nodes to be segmented, wherein the kth characteristic data to be segmented is obtained by carrying out k-1 th segmentation processing on n groups of client characteristic data; according to the characteristic value of the target characteristic dimension in the kth characteristic data to be segmented and the initial segmentation coefficient, calculating a kth segmentation point for segmenting the kth characteristic data to be segmented in the mth layer of child nodes; and placing the first process data in the m+1th layer of child nodes, and simultaneously terminating the segmentation of the second process data, wherein the first process data is the customer characteristic data with the value smaller than the kth segmentation point in the kth time to-be-segmented characteristic data, and the second process data is the customer characteristic data with the value greater than or equal to the kth segmentation point in the kth time to-be-segmented characteristic data.
Compared with the conventional technology, the method of the embodiments of the present disclosure provides a notable improvement: the internal performance of the computer is improved by reasonably presetting the initial segmentation coefficient used in the conventional algorithm. Specifically, in a conventional isolated forest model the segmentation points are selected randomly, and all data to be processed are segmented according to the randomly selected segmentation points until the data can no longer be segmented. Because the amount of data to be segmented is large, many iterations are required, the constructed binary trees are tall, many operations are needed, and the demand on register storage is high. In contrast, the present disclosure optimizes the isolated forest model by reasonably estimating the initial segmentation coefficient from historical data characteristics and historical experience. When the arithmetic unit executes the segmentation processing, only the leaf nodes (client feature data) smaller than the segmentation point need to be segmented further; the leaf nodes (client feature data) greater than or equal to the segmentation point need not be segmented. The amount of computation performed by the arithmetic unit is therefore greatly reduced, and its operating efficiency and speed are improved. At the same time, the number of iterative segmentations is greatly reduced and the volume of intermediate parameters is small, which substantially reduces the storage space required, lowers the hardware requirements on the register, and improves storage efficiency.
According to the embodiment of the disclosure, based on the client identification method in the above disclosed embodiment, model parameters can be adjusted in the isolated forest model according to the test result based on the test set, and the method specifically comprises steps 11 to 14.
In step 11, a plurality of sets of customer feature test data of a plurality of test customers are read from a database, wherein data in the customer feature test data are marked with data tags, and the data tags are used for marking customer types of the test customers corresponding to the customer feature test data.
According to an embodiment of the present disclosure, after determining the target clients (clients of the second class) according to the above-described embodiment of the present disclosure, the client type of each client may be marked, and the specific marking manner is not limited herein. Thus, model parameters can be optimized by identifying the labels of each customer. And reading a plurality of groups of customer characteristic test data of a plurality of test customers from a database, wherein each group of customer characteristic test data carries a label of a first type customer or a second type customer.
In step 12, a plurality of binary tree test models are constructed using the plurality of sets of customer characteristic test data based on the initial model parameters.
According to the embodiment of the disclosure, each group of labeled customer characteristic test data is placed at a leaf node position in a binary tree test model, and the plurality of groups of customer characteristic test data are segmented using the initial model parameters to obtain a plurality of binary tree test models. The method for constructing the binary tree test model is consistent with the method for constructing the binary tree classification model and is not described again here.
In step 13, the predicted client types of the plurality of test clients are output according to the leaf node positions of the plurality of sets of client characteristic test data in the binary tree test model.
According to the embodiment of the disclosure, after a plurality of binary trees are constructed, the distinguishing degree value of each group of customer characteristic test data is calculated, and at least one group of target customer characteristic test data is determined according to the comparison of the distinguishing degree value of each group of customer characteristic test data and the initial distinguishing degree threshold value parameter.
In step 14, initial model parameters are adjusted based on the predicted client types and data tags for the plurality of test clients.
According to the embodiment of the disclosure, the initial model parameters and the labeled customer feature test data are taken as inputs, and the resulting target customer feature test data is compared with the labels carried by the customer feature test data. If the resulting target customer feature test data is inconsistent with the labels, the initial model parameters are adjusted until the results are consistent, or nearly consistent, with the labels. Training then stops, and the model parameters obtained at that point are determined to be the optimal model parameters.
According to an embodiment of the present disclosure, grouped trial calculations are performed with different model parameters. The model parameters may differ between groups in the order in which the target feature dimensions are set for each iterative segmentation, in the segmentation points of each iteration, in the discrimination threshold parameter, and so on. Further, the same sample data and different identifications are adopted to obtain the target customer characteristic test data result corresponding to each group of model parameters. The group of model parameters whose target customer characteristic test data result is closest to the labeled customer characteristic test data is selected as the optimal model parameters.
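The grouped trial calculation can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the procedure of the disclosure: "closest to the labels" is interpreted here as the highest fraction of matching predictions, and `predict`, `threshold_predict`, and the parameter dictionary layout are hypothetical stand-ins for the tree-based prediction described in steps 12 and 13.

```python
def agreement(predicted, labels):
    """Fraction of test clients whose predicted type matches the data tag."""
    return sum(p == t for p, t in zip(predicted, labels)) / len(labels)


def select_parameters(param_groups, predict, test_rows, labels):
    """Try every group of model parameters on the same test data and
    return the group that agrees best with the labels, plus its score."""
    best_params, best_score = None, -1.0
    for params in param_groups:
        score = agreement(predict(test_rows, params), labels)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def threshold_predict(scores, params):
    """Toy predictor (assumption): each test row is already a
    discrimination value, compared against params['threshold']."""
    return ["second" if s > params["threshold"] else "first" for s in scores]
```

For instance, with discrimination values `[0.2, 0.4, 0.7, 0.9]`, labels `["first", "first", "second", "second"]`, and candidate thresholds 0.3, 0.5 and 0.8, the group with threshold 0.5 reproduces every label and is selected.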
According to an embodiment of the present disclosure, based on the client identification method in the above disclosed embodiment, after determining that the target client is obtained, the class data may be further expanded based on the client feature data of the target client, which specifically includes steps 21 to 24.
In step 21, target customer characteristic data corresponding to the second type of customers is extracted from the n sets of customer characteristic data.
In step 22, data construction is performed based on the target customer characteristic data, and construction characteristic data is generated.
According to the embodiment of the disclosure, data construction is performed according to the target customer characteristic data. Specifically, for each customer characteristic dimension, characteristic values are constructed according to the customer characteristic data corresponding to the second class of customers. Preferably, data is constructed for the characteristics that the second class of customers have in common. For example, if the "client age" in the second class of customer characteristic data ranges from 60 to 80 years, age data within the range of 60 to 80 years can be randomly generated for each constructed customer. For another example, if the interest value in the second class of customer characteristic data ranges from 7 to 10, interest value data within the range of 7 to 10 can be randomly generated for each constructed customer.
In step 23, balanced sample data is generated using the construction feature data and the n sets of customer feature data.
According to embodiments of the present disclosure, balanced sample data refers to data in which the numbers of positive and negative samples used for classification in machine learning differ only slightly; the data is considered balanced when the two types of data are equal or close in number. Illustratively, suppose the unbalanced data sample contains 1000 users, of which 200 are identified as second-type users through the above client identification method, so that 800 are first-type users. To balance the data, data is constructed so that the amount of customer characteristic data of the second-type users is the same as or close to that of the first-type users. Specifically, the customer characteristic data of the second-type users is expanded in the above manner to approximately the amount for 800 users. For example, based on the customer characteristic data of the 200 users, customer characteristic data for 600 or 650 additional customers is constructed, so that the first-type users (800 users) and the second-type users (800 or 850 users) reach data balance.
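The data construction of step 22 and the balancing of step 23 can be sketched as follows. The sketch assumes that each feature is numeric and that new values are drawn uniformly between the minimum and maximum observed among the second-type customers; the disclosure only says the values are randomly generated within the observed range. All function names are hypothetical.

```python
import random


def feature_ranges(target_rows):
    """Observed (min, max) of every feature among second-type customers."""
    dims = target_rows[0].keys()
    return {d: (min(r[d] for r in target_rows),
                max(r[d] for r in target_rows)) for d in dims}


def construct_rows(target_rows, count, rng):
    """Generate `count` synthetic rows inside the observed ranges."""
    ranges = feature_ranges(target_rows)
    return [{d: rng.uniform(lo, hi) for d, (lo, hi) in ranges.items()}
            for _ in range(count)]


def balance(first_rows, second_rows, rng):
    """Pad the second type with constructed rows until both types match."""
    missing = max(0, len(first_rows) - len(second_rows))
    return first_rows, second_rows + construct_rows(second_rows, missing, rng)
```

With 800 first-type rows and 200 second-type rows, `balance` constructs 600 additional second-type rows, matching the 800-versus-800 case in the example above. Uniform sampling yields fractional ages; a real pipeline would presumably round or draw from the empirical distribution instead.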
At step 24, the customer identification neural network is trained using the balanced sample data.
In accordance with embodiments of the present disclosure, the customer identification neural network may employ any neural network model in the prior art, without limitation. After the target client is identified according to the embodiment of the disclosure, data construction is performed based on the characteristic data of the target client to obtain a relatively balanced data sample, so that other neural network models can be trained on the balanced sample, thereby enabling a more accurate analysis of the target client and further improving the accuracy of target client identification.
Provided that enough high-quality training samples exist, the prediction accuracy of the trained customer identification neural network is generally higher than that of an isolated forest model. However, in the early promotion phase of a new product the amount of positive sample data (i.e., data of customers who actually purchase) is very small, so the sample data is severely unbalanced and a neural network model with high prediction accuracy cannot be obtained through training. By constructing a large amount of balanced sample data, the method of the present disclosure makes it possible to train a neural network model with high recognition accuracy; using that model for customer identification can improve the accuracy of customer classification prediction.
Based on the client identification method, the disclosure also provides a client identification device. The device will be described in detail below in connection with fig. 8.
Fig. 8 schematically illustrates a block diagram of a client identification device according to an embodiment of the present disclosure.
As shown in fig. 8, the customer identification device 800 of this embodiment includes a first reading module 810, a first constructing module 820, and a determining module 830.
The first reading module 810 is configured to read n sets of client feature data of n clients to be identified from the database, where the n sets of client feature data include first class feature data and second class feature data, the first class feature data is used to characterize features of a first class of clients that are not focused by a user among the n clients to be identified, the second class feature data is used to characterize features of a second class of clients focused by the user among the n clients to be identified, the number of the first class of clients is greater than the number of the second class of clients, and the second class of clients is target clients for developing target services for the user. In an embodiment, the first reading module 810 may be used to perform the operation S210 described above, which is not described herein.
The first construction module 820 is configured to construct a plurality of binary tree classification models using n sets of customer feature data based on initial model parameters set for the isolated forest model, wherein the n sets of customer feature data are distributed at node positions of different depths in the binary tree classification models. In an embodiment, the first construction module 820 may be used to perform the operation S220 described above, which is not described herein.
The determining module 830 is configured to determine a target client from n clients to be identified according to node positions corresponding to the n sets of client feature data, where the target client is primarily determined as a second type of client. In an embodiment, the determining module 830 may be configured to perform the operation S230 described above, which is not described herein.
According to the embodiment of the disclosure, in constructing the plurality of binary tree classification models from the customer characteristic data and the initial model parameters based on the isolated forest model, the first construction module inputs all sample data into the binary tree classification models so that all sample data is classified according to conditions. Majority-class sample data and minority-class sample data can thus be obtained from the customer characteristic data, and the minority-class sample data can be screened out. By converting the passive classification of minority-class sample data into active identification, the identification precision and accuracy for minority-class sample data are improved.
Any of the plurality of modules of the first reading module 810, the first constructing module 820, and the determining module 830 may be combined in one module to be implemented, or any of the plurality of modules may be split into a plurality of modules according to an embodiment of the present disclosure. Or at least some of the functionality of one or more of the modules may be combined with, and implemented in, at least some of the functionality of other modules. According to embodiments of the present disclosure, at least one of the first reading module 810, the first building module 820, and the determining module 830 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging the circuits, or in any one of or a suitable combination of three of software, hardware, and firmware. Or at least one of the first reading module 810, the first construction module 820 and the determination module 830 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.
According to the embodiment of the disclosure, the initial model parameters at least comprise an initial segmentation coefficient and an initial discrimination threshold parameter, wherein the initial segmentation coefficient is used for determining initial data segmentation points in the process of constructing a plurality of binary tree classification models, and the initial discrimination threshold parameter is used for determining the deepest positions of corresponding initial leaf nodes of the client characteristic data of the target client in the binary tree classification models.
According to an embodiment of the present disclosure, the numerical range of the initial segmentation coefficient is a to 1, where a ranges from 0.5 to 1, and the value of a is determined according to the proportion of first-class clients among the total historical clients involved in developing the target business in a preset historical time period.
According to an embodiment of the present disclosure, the first build module 820 includes a first placement sub-module, a first calculation sub-module, and a second placement sub-module.
The first placement sub-module is used for placing the kth time to-be-segmented characteristic data into the mth layer sub-node to be segmented, wherein the kth time to-be-segmented characteristic data is obtained by carrying out k-1 time segmentation processing on n groups of customer characteristic data;
The first calculation sub-module is used for calculating a kth segmentation point for segmenting the kth feature data in the m-th layer sub-node according to the feature value of the target feature dimension in the kth feature data to be segmented and the initial segmentation coefficient;
And the second placement sub-module is used for placing the first process data in the m+1th layer sub-node and simultaneously stopping the segmentation of the second process data, wherein the first process data is the customer characteristic data with the value smaller than the kth segmentation point in the kth characteristic data to be segmented, and the second process data is the customer characteristic data with the value larger than or equal to the kth segmentation point in the kth characteristic data to be segmented.
According to an embodiment of the present disclosure, the determination module 830 includes a second calculation sub-module and a determination sub-module.
The second computing sub-module is used for computing the distinguishing degree value of each of the n groups of client characteristic data according to the node positions corresponding to each of the n groups of client characteristic data;
And the determining submodule is used for determining target clients from n clients to be identified according to the distinguishing degree values of the n groups of client characteristic data.
According to an embodiment of the present disclosure, the determination submodule includes a first determination unit and a second determination unit.
A first determining unit configured to determine at least one set of target customer characteristic data having a discrimination value greater than an initial discrimination threshold parameter from among n sets of customer characteristic data;
And the second determining unit is used for determining the client to be identified corresponding to the target client characteristic data as the target client.
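The thresholding performed by the first and second determining units can be illustrated as follows. This passage does not give the formula for the discrimination value, so the sketch substitutes the anomaly score of the standard isolation forest algorithm, s = 2^(-E[h] / c(n)), where E[h] is a client's mean node depth across trees and c(n) is the usual normalising term. That substitution, together with every function name here, is an assumption for illustration only.

```python
import math

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Normalising term of the standard isolation forest score for
    n samples (requires n >= 2 to be non-zero)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def discrimination_value(depths, n):
    """Stand-in discrimination value: a shallower mean depth gives a
    value closer to 1."""
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / c_factor(n))


def select_targets(depths_by_client, n, threshold):
    """Return the clients whose value exceeds the threshold parameter."""
    return [cid for cid, depths in depths_by_client.items()
            if discrimination_value(depths, n) > threshold]
```

Under this stand-in, a client whose data settles at depths 2, 3 and 2 in three trees built from 256 samples scores well above 0.6, while a client settling at depths 10, 11 and 12 scores below 0.5, so only the first would be selected with a threshold of 0.6.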
According to an embodiment of the present disclosure, the customer identification device 800 of this embodiment further includes a second reading module, a second building module, an output module, and an adjustment module.
The second reading module is used for reading a plurality of groups of customer characteristic test data of a plurality of test customers from the database, wherein the data in the customer characteristic test data are marked with data labels, and the data labels are used for marking the customer types of the test customers corresponding to the customer characteristic test data;
The second construction module is used for constructing a plurality of binary tree test models by utilizing a plurality of groups of customer characteristic test data based on the initial model parameters;
The output module is used for outputting the predicted client types of a plurality of test clients according to the positions of the leaf nodes of the plurality of groups of client characteristic test data in the binary tree test model;
And the adjusting module is used for adjusting the initial model parameters according to the predicted client types and the data labels of the plurality of test clients.
According to an embodiment of the present disclosure, the client identifying device 800 of this embodiment further includes an extraction module, a first generation module, a second generation module, and a training module.
The extraction module is used for extracting target customer characteristic data corresponding to the second type of customers from the n groups of customer characteristic data;
The first generation module is used for carrying out data construction based on the target customer characteristic data and generating construction characteristic data;
the second generation module is used for generating balance sample data by utilizing the construction characteristic data and n groups of client characteristic data;
and the training module is used for training the customer identification neural network by using the balance sample data.
According to an embodiment of the present disclosure, the customer characteristic data includes at least: client attribute data, client portrait data, client transaction behavior data, and reference behavior data of clients for a reference product, wherein the product characteristics of the reference product are the same as or similar to those of the target product with which the user develops the target business.
Fig. 9 schematically illustrates a block diagram of an electronic device adapted to implement a customer identification method according to an embodiment of the disclosure.
As shown in fig. 9, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 901 may also include on-board memory for caching purposes. Processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, the input/output (I/O) interface 905 also being connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to an input/output (I/O) interface 905: an input section 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to an input/output (I/O) interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 910 so that a computer program read out therefrom is installed into the storage section 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the customer identification method provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, via communication portion 909, and/or installed from removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or integrated without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (13)

1. A method of customer identification, the method comprising:
Reading n groups of client characteristic data of n clients to be identified from a database, wherein the n groups of client characteristic data comprise first type characteristic data and second type characteristic data, the first type characteristic data are used for characterizing the characteristics of first type clients which are not focused by users in the n clients to be identified, the second type characteristic data are used for characterizing the characteristics of second type clients focused by the users in the n clients to be identified, the number of the first type clients is larger than that of the second type clients, and the second type clients are target clients of target service development of the users;
Constructing a plurality of binary tree classification models by utilizing the n groups of client characteristic data based on initial model parameters set for the isolated forest model, wherein the n groups of client characteristic data are distributed at node positions with different depths in the binary tree classification models;
And determining target clients from the n clients to be identified according to the node positions corresponding to the n groups of client characteristic data, wherein the target clients are preliminarily determined to be the second-class clients.
2. The method according to claim 1, characterized in that:
The initial model parameters at least comprise an initial segmentation coefficient and an initial distinguishing degree threshold parameter, wherein the initial segmentation coefficient is used for determining initial data segmentation points in the process of constructing a plurality of binary tree classification models, and the initial distinguishing degree threshold parameter is used for determining the deepest positions of corresponding initial leaf nodes of the client characteristic data of the target client in the binary tree classification models.
3. The method according to claim 2, characterized in that:
the initial segmentation coefficient takes a value in the range from a to 1, wherein a takes a value in the range from 0.5 to 1, and the value of a is determined according to the proportion of first-type clients among all historical clients involved in carrying out the target business during a preset historical time period.
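As a rough illustration of claim 3, the lower bound a could be taken as the share of first-type clients in the historical population and clamped to the stated 0.5 to 1 range. The function names, the clamping, the inclusive endpoints, and the direct use of the ratio are all assumptions made for this sketch; the claim only states that a is determined according to that proportion.

```python
def lower_bound_from_history(first_type_count: int, total_count: int) -> float:
    """Hypothetical helper: derive the lower bound `a` from historical client counts."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= first_type_count <= total_count:
        raise ValueError("first_type_count must lie between 0 and total_count")
    ratio = first_type_count / total_count
    # Assumption: use the ratio itself as `a`, clamped to the claimed 0.5..1 range.
    return min(1.0, max(0.5, ratio))


def coefficient_in_range(coefficient: float, a: float) -> bool:
    """Check that a segmentation coefficient lies between `a` and 1.

    Inclusive endpoints are an assumption; the claim does not say.
    """
    return a <= coefficient <= 1.0
```

For example, 900 first-type clients out of 1,000 historical clients would give a lower bound of 0.9 under this assumption, so a coefficient of 0.95 would be admissible and 0.8 would not.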
4. The method of claim 2, wherein a kth segmentation process, in constructing the plurality of binary tree classification models from the n groups of client characteristic data based on the initial model parameters set for the isolation forest model, comprises:
placing kth characteristic data to be segmented in an mth-layer child node to be segmented, wherein the kth characteristic data to be segmented are obtained by performing a (k-1)th segmentation process on the n groups of client characteristic data;
calculating, according to the characteristic values of a target characteristic dimension in the kth characteristic data to be segmented and the initial segmentation coefficient, a kth segmentation point for segmenting the kth characteristic data to be segmented in the mth-layer child node; and
placing first process data in an (m+1)th-layer child node while terminating the segmentation of second process data, wherein the first process data are the client characteristic data, in the kth characteristic data to be segmented, whose values are smaller than the kth segmentation point, and the second process data are the client characteristic data, in the kth characteristic data to be segmented, whose values are greater than or equal to the kth segmentation point.
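A single segmentation step of the kind described in claim 4 might be sketched as follows. The claim does not give the formula that combines the characteristic values with the segmentation coefficient, so the linear interpolation between the minimum and maximum value used here is an assumption, as are the function and variable names.

```python
from typing import List, Tuple


def split_step(values: List[float], coefficient: float) -> Tuple[float, List[float], List[float]]:
    """One hypothetical segmentation step on a single characteristic dimension.

    Returns the segmentation point, the data that moves on to the next layer,
    and the data whose segmentation terminates at this step.
    """
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    # Assumed formula (not from the claim): interpolate between min and max.
    split_point = low + coefficient * (high - low)
    # First process data: values below the point go to the (m+1)th-layer child node.
    continue_down = [v for v in values if v < split_point]
    # Second process data: values at or above the point are not segmented further.
    terminated = [v for v in values if v >= split_point]
    return split_point, continue_down, terminated


split_point, first, second = split_step([1.0, 2.0, 3.0, 10.0], coefficient=0.5)
```

Repeating this step on the data that continues downward gives each group of client characteristic data a node depth, which the later claims use for identification.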
5. The method of claim 2, wherein determining a target client from the n clients to be identified based on the node locations for each of the n sets of client characteristic data comprises:
calculating a distinguishing degree value for each of the n groups of client characteristic data according to the node positions corresponding to each of the n groups of client characteristic data; and
determining target clients from the n clients to be identified according to the distinguishing degree values of the n groups of client characteristic data.
6. The method of claim 5, wherein determining a target client from the n clients to be identified based on the respective discrimination values of the n sets of client characteristic data comprises:
determining, from the n groups of client characteristic data, at least one group of target client characteristic data whose distinguishing degree value is greater than the initial distinguishing degree threshold parameter; and
determining the client to be identified corresponding to the target client characteristic data as the target client.
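Claims 5 and 6 do not state how node positions are turned into a distinguishing degree value. One plausible reading, shown here purely as an assumption, borrows the standard isolation forest anomaly score, which maps the mean depth across trees to a value between 0 and 1; the patent's own mapping may differ. All names below are hypothetical.

```python
import math
from typing import Dict, List

EULER_GAMMA = 0.5772156649


def average_path_length(n: int) -> float:
    """Normalising constant c(n) from the standard isolation forest formulation."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def distinguishing_value(depths: List[float], n: int) -> float:
    """Assumed score: 2 ** (-mean depth / c(n)); shallower nodes score higher."""
    if n < 2 or not depths:
        raise ValueError("need at least two samples and one depth")
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / average_path_length(n))


def select_targets(depths_by_client: Dict[str, List[float]], n: int, threshold: float) -> List[str]:
    """Keep clients whose value is strictly greater than the threshold (claim 6)."""
    return [
        client_id
        for client_id, depths in depths_by_client.items()
        if distinguishing_value(depths, n) > threshold
    ]
```

Under this assumed score, a client whose mean depth equals c(n) scores exactly 0.5, so a threshold above 0.5 selects only clients that sit at shallower positions than average.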
7. The method as recited in claim 1, further comprising:
reading a plurality of groups of customer characteristic test data of a plurality of test customers from a database, wherein the customer characteristic test data carry data labels, and the data labels mark the customer type of the test customer corresponding to each group of customer characteristic test data;
constructing a plurality of binary tree test models from the plurality of groups of customer characteristic test data based on the initial model parameters;
outputting predicted customer types of the plurality of test customers according to the node positions of the plurality of groups of customer characteristic test data in the binary tree test models; and
adjusting the initial model parameters according to the predicted customer types of the plurality of test customers and the data labels.
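The adjustment step in claim 7 is not specified in detail. A minimal sketch, assuming the parameter being adjusted is the distinguishing degree threshold and that the adjustment is a simple search over candidate values scored by F1 against the data labels, could look like this. The choice of F1 and the candidate-list search are assumptions, and the names are hypothetical.

```python
from typing import List, Sequence


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 of the positive (second-type) class; returns 0.0 when there are no true positives."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def adjust_threshold(values: List[float], labels: List[bool], candidates: List[float]) -> float:
    """Pick the candidate threshold whose predictions best match the data labels."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    best_threshold, best_f1 = candidates[0], -1.0
    for threshold in candidates:
        # A test customer is predicted as second-type when its value exceeds the threshold.
        predicted = [v > threshold for v in values]
        score = f1_score(predicted, labels)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold
```

The same loop could be extended to search over the segmentation coefficient as well, at the cost of rebuilding the test trees for each candidate.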
8. The method as recited in claim 1, further comprising:
extracting target customer characteristic data corresponding to the second-type clients from the n groups of client characteristic data;
performing data construction based on the target customer characteristic data to generate constructed characteristic data;
generating balanced sample data from the constructed characteristic data and the n groups of client characteristic data; and
training a customer identification neural network using the balanced sample data.
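Claim 8 does not say how the constructed characteristic data are produced. One common option, used here only as an assumed stand-in, is to interpolate between randomly chosen pairs of minority-class samples (a simplified, SMOTE-like scheme without nearest-neighbour search). The names and the 0/1 labels are hypothetical, and training the neural network itself is left out.

```python
import random
from typing import List, Tuple

Vector = List[float]


def construct_features(minority: List[Vector], count: int, rng: random.Random) -> List[Vector]:
    """Build `count` synthetic samples by interpolating between random minority pairs."""
    if count > 0 and not minority:
        raise ValueError("cannot construct data without minority samples")
    synthetic = []
    for _ in range(count):
        a = rng.choice(minority)
        b = rng.choice(minority)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


def balance(majority: List[Vector], minority: List[Vector], seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """Combine the original data with constructed data so both classes have equal size."""
    rng = random.Random(seed)
    needed = max(0, len(majority) - len(minority))
    constructed = construct_features(minority, needed, rng)
    features = majority + minority + constructed
    # Label 0: first-type clients; label 1: second-type clients (original and constructed).
    labels = [0] * len(majority) + [1] * (len(minority) + len(constructed))
    return features, labels
```

Because every synthetic sample lies on a segment between two real minority samples, the constructed data stay inside the value range already observed for second-type clients.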
9. The method according to claim 1, characterized in that:
the customer characteristic data include at least: client attribute data, client portrait data, client transaction behavior data, and reference behavior data of clients with respect to a reference product, wherein the product characteristics of the reference product are the same as or similar to those of the target product through which the user carries out the target business.
10. A customer identification device, the device comprising:
a first reading module configured to read n groups of client characteristic data of n clients to be identified from a database, wherein the n groups of client characteristic data comprise first-type characteristic data and second-type characteristic data, the first-type characteristic data characterize first-type clients, among the n clients to be identified, that are not of interest to the user, the second-type characteristic data characterize second-type clients, among the n clients to be identified, that are of interest to the user, the number of first-type clients is greater than the number of second-type clients, and the second-type clients are the target clients for whom the user carries out a target business;
a first construction module configured to construct a plurality of binary tree classification models from the n groups of client characteristic data based on initial model parameters set for an isolation forest model, wherein the n groups of client characteristic data are distributed at node positions of different depths in the binary tree classification models; and
a determining module configured to determine target clients from the n clients to be identified according to the node positions corresponding to each of the n groups of client characteristic data, wherein the target clients are preliminarily determined to be second-type clients.
11. An electronic device, comprising:
one or more processors;
a memory for storing one or more computer programs,
characterized in that the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 9.
12. A computer-readable storage medium, on which a computer program or instructions is stored, characterized in that the computer program or instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-9.
13. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410378556.8A CN118277905A (en) 2024-03-29 2024-03-29 Customer identification method and device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN118277905A true CN118277905A (en) 2024-07-02

Family

ID=91635463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410378556.8A Pending CN118277905A (en) 2024-03-29 2024-03-29 Customer identification method and device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN118277905A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination