CN115099366A - Classification prediction method and device and electronic equipment - Google Patents

Info

Publication number
CN115099366A
Authority
CN
China
Prior art keywords
community
sample
initial
samples
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210873058.1A
Other languages
Chinese (zh)
Inventor
陈德蕾
陈龙
陈树华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dingxiang Technology Co ltd
Original Assignee
Beijing Dingxiang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dingxiang Technology Co ltd filed Critical Beijing Dingxiang Technology Co ltd
Priority to CN202210873058.1A priority Critical patent/CN115099366A/en
Publication of CN115099366A publication Critical patent/CN115099366A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a classification prediction method, a classification prediction device, and electronic equipment. After a first prediction result of a binary classification model is obtained, an initial association relationship network comprising a plurality of nodes is constructed based on the attribute association relationships among the samples in an initial sample set; an initial class label is set for the sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network, and community division is performed on the first association relationship network; and, according to the community division result, the initial class labels of the samples corresponding to the nodes in the same community are updated community by community to obtain a second prediction result. The invention can improve the reliability of the prediction result of the binary classification model, thereby reducing the risk of loss to the relevant business departments.

Description

Classification prediction method and device and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a classification prediction method, an apparatus, and an electronic device.
Background
With the development of computer technology, some work in society has shifted from manual processing to machine processing. Machine learning binary classification models are representative techniques that have been widely applied in finance, insurance, and similar scenarios. Taking credit card application review as an example, a binary classification model can be used to analyze the credit condition of an applicant and judge whether the applicant meets the approval requirements of the credit card application; however, different credit card applicants may have similar addresses or belong to the same work unit, so their credit situations may affect one another, and if one applicant has a high likelihood of fraud or default, the likelihood of fraud or default for the applicants closely related to that applicant also increases. As another example, when an insurance agent purchases insurance, a binary classification model may be used to analyze the agent's information to determine whether the agent is trustworthy; but if multiple agents game the model's per-agent determination through member recruitment, mutual insurance, or order hanging, the likelihood that each of those agents is trustworthy is reduced.
However, the features used in a conventional binary classification model cannot represent the correlation between samples, so the model cannot exploit the additional information implied by that correlation during classification prediction. As a result, the reliability of the prediction result is low, the conclusions that the relevant business department draws from the prediction result are likely to be wrong, and the department may suffer irreparable losses in the future.
Disclosure of Invention
In view of this, the present invention provides a classification prediction method, a classification prediction apparatus, and an electronic device, so as to improve reliability of a prediction result of a binary classification model, thereby reducing a risk of loss generated by a relevant business department.
In a first aspect, an embodiment of the present invention provides a classification prediction method, where the method includes: predicting an initial sample set of a target scene by using a binary model corresponding to the target scene to obtain a first prediction result; the target scene is a predetermined scene to be classified and predicted, and the first prediction result comprises the classification probability and/or the classification score of each sample in the initial sample set; constructing an initial incidence relation network containing a plurality of nodes based on the attribute incidence relation among the samples in the initial sample set; each node in the initial incidence relation network corresponds to each sample in the initial sample set one by one; representing the incidence relation among the samples in the initial sample set by using connecting lines among the nodes in the initial incidence relation network, wherein each connecting line is provided with a weight for representing the incidence size among the samples; setting an initial category label for a sample corresponding to each node in the initial incidence relation network based on the first prediction result to obtain a first incidence relation network; based on the first prediction result and the weight of each connecting line in the first incidence relation network, carrying out community division on the first incidence relation network to obtain a first community set; wherein each community in the first community set comprises at least two nodes; and for each community in the first community set, updating the initial class labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes contained in the community to obtain a second prediction result.
As a possible implementation, the step of setting an initial category label for a sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network includes: determining a classification score for each sample in the initial set of samples based on the first prediction result; determining a seed node in the initial incidence relation network based on the classification score of each sample in the initial sample set; the classification score of the sample corresponding to the seed node is smaller than a first score threshold value; setting the initial class label of the sample corresponding to the seed node as a first class label representing a positive sample, and setting the initial class labels of the samples corresponding to other nodes except the seed node in the initial incidence relation network as second class labels representing a negative sample to obtain the first incidence relation network.
As a possible implementation, the step of determining a classification score for each sample in the initial sample set based on the first prediction result includes: directly obtaining the classification probability of each sample in the initial sample set from the first prediction result; determining a classification score for each sample in the initial sample set based on the classification probability for each sample in the initial sample set; or, directly obtaining the classification score of each sample in the initial sample set from the first prediction result.
As a possible implementation, the step of determining a classification score of each sample in the initial sample set based on the classification probability of each sample in the initial sample set includes: calculating a classification score for each sample in the initial sample set based on the classification probability for each sample in the initial sample set in the first prediction result according to the following formula:
[Formula image BDA0003757417480000031: classification score expressed in terms of basepoint, Pdo, and prob]
wherein basepoint is the benchmark score, Pdo is the step length, and prob is the classification probability of the sample.
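The formula itself is an image that is not reproduced in this text, so its exact form cannot be confirmed here. As a minimal sketch only, the following assumes the log-odds conversion commonly used for scorecards, score = basepoint + Pdo / ln(2) × ln((1 − prob) / prob). This form is an assumption; it was chosen because it uses exactly the three named quantities and gives a lower score to a sample with a higher positive-class probability, which matches the description's later examples. The default values of basepoint and pdo are hypothetical.

```python
import math


def classification_score(prob: float, basepoint: float = 50.0, pdo: float = 10.0) -> float:
    """Convert a positive-class probability into a classification score.

    ASSUMED form (not taken from the source formula image):
        score = basepoint + pdo / ln(2) * ln((1 - prob) / prob)
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    # Each halving of the odds prob / (1 - prob) raises the score by `pdo` points.
    return basepoint + pdo / math.log(2) * math.log((1.0 - prob) / prob)
```

Under this assumed form, a probability of 0.5 maps to the benchmark score, and the step length Pdo is the number of points gained each time the odds of being a positive sample halve.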
As a possible implementation, the step of constructing an initial association relationship network including a plurality of nodes based on the attribute association relationship among the samples in the initial sample set includes: if the attribute incidence relation exists between the two samples in the initial sample set, establishing a connection line between nodes corresponding to the two samples; and setting corresponding weight for each connecting line according to the association times and/or the relationship compactness among the samples.
As a possible implementation, the step of performing community division on the first association network based on the first prediction result and the weight of each connection line in the first association network to obtain a first community set includes: based on the classification score of each sample in the initial sample set, deleting the first incidence relation network to obtain a second incidence relation network; the classification score of the sample corresponding to each node in the second incidence relation network is smaller than or equal to a second score threshold value; the second score threshold is greater than the first score threshold; according to the weight of each connecting line in the second incidence relation network, carrying out community division on the second incidence relation network by adopting a community discovery algorithm to obtain an initial community set; and eliminating communities which only contain one node in the initial community set to obtain the first community set.
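The pruning and singleton-removal steps of this implementation can be sketched as follows. The sketch assumes the network is stored as a nested dictionary {node: {neighbor: weight}}. The community discovery step is passed in as a function; the default used here, connected components, is only a stand-in that ignores edge weights and is not one of the algorithms the embodiment names. All function names and threshold values are hypothetical.

```python
def prune_network(graph, scores, second_score_threshold):
    """Keep only nodes whose classification score is <= the second score threshold."""
    keep = {n for n in graph if scores[n] <= second_score_threshold}
    return {n: {m: w for m, w in graph[n].items() if m in keep} for n in keep}


def connected_components(graph):
    """Stand-in community discovery: groups of mutually reachable nodes."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(m for m in graph[node] if m not in members)
        seen |= members
        communities.append(members)
    return communities


def first_community_set(graph, scores, second_score_threshold, detect=connected_components):
    """Prune the network, divide it into communities, and drop one-node communities."""
    pruned = prune_network(graph, scores, second_score_threshold)
    return [c for c in detect(pruned) if len(c) >= 2]


# Illustrative data: four samples with association counts as edge weights.
graph = {
    "A": {"B": 10, "C": 5},
    "B": {"A": 10, "C": 15},
    "C": {"A": 5, "B": 15, "D": 1},
    "D": {"C": 1},
}
scores = {"A": 30, "B": 44, "C": 50, "D": 56}
communities = first_community_set(graph, scores, second_score_threshold=52)
```

With a second score threshold of 52, node D is removed before community discovery, so the single remaining community contains A, B, and C.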
As a possible implementation, for each community in the first community set, the step of updating the initial category labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes included in the community to obtain the second prediction result includes: and updating the initial class labels of the samples corresponding to the nodes in the community according to the number of the nodes corresponding to the samples with the first class labels in the community and the classification scores of the samples corresponding to all the nodes contained in the community, so as to obtain the second prediction result.
As a possible implementation, for each community in the first community set, the step of updating the initial class labels of the samples corresponding to the nodes in the community according to the number of nodes corresponding to samples with the first class label in the community and the classification scores of the samples corresponding to all the nodes contained in the community to obtain the second prediction result includes: for each community in the first community set, calculating the percentage of nodes in the community that correspond to samples with the first class label, according to the number of such nodes in the community; for each community in the first community set, calculating the average classification score of the samples corresponding to the nodes in the community according to the classification scores of the samples corresponding to all the nodes contained in the community; for each community in the first community set, if the percentage of nodes corresponding to samples with the first class label in the community is greater than or equal to a preset percentage threshold, and the average classification score of the samples corresponding to the nodes in the community is smaller than a third score threshold, replacing the initial class labels of the samples corresponding to all the nodes in the community with the first class label to obtain the final class labels of the samples corresponding to all the nodes in the community, wherein the third score threshold is greater than or equal to the first score threshold; for each community in the first community set, if the percentage of nodes corresponding to samples with the first class label in the community is smaller than the preset percentage threshold, or the average classification score of the samples corresponding to the nodes in the community is greater than or equal to the third score threshold, keeping the initial class labels of the samples corresponding to all the nodes in the community unchanged to obtain the final class labels of the samples corresponding to all the nodes in the community; and taking each sample with its final class label as the second prediction result.
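The update rule described above can be sketched as follows. The function name, the dictionary-based data layout, and the threshold values in the example are hypothetical; label 1 stands for the first class label (positive sample) and label 0 for the second class label.

```python
def update_labels(communities, labels, scores, percentage_threshold, third_score_threshold):
    """Return final class labels after the community-level update."""
    final = dict(labels)  # samples outside every community keep their initial label
    for community in communities:
        members = list(community)
        if not members:
            continue
        # Share of nodes in the community that carry the first class label.
        seed_share = sum(labels[n] == 1 for n in members) / len(members)
        # Average classification score over all nodes in the community.
        mean_score = sum(scores[n] for n in members) / len(members)
        if seed_share >= percentage_threshold and mean_score < third_score_threshold:
            for n in members:
                final[n] = 1  # the whole community takes the first class label
    return final


# Illustrative values; both thresholds are assumptions chosen for the example.
scores = {"A": 30, "B": 44, "C": 50, "D": 56}
labels = {"A": 1, "B": 0, "C": 0, "D": 0}
final = update_labels([{"A", "B"}, {"C", "D"}], labels, scores,
                      percentage_threshold=0.5, third_score_threshold=40)
```

In this example the community {A, B} has a seed share of 0.5 and an average score of 37, so both samples receive the first class label, while the community {C, D} contains no seed node and is left unchanged.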
As a possible implementation, the target scenario is a credit evaluation scenario or an insurance agent reputation evaluation scenario.
In a second aspect, an embodiment of the present invention further provides a classification prediction apparatus, where the apparatus includes: a prediction module, configured to predict the initial sample set of a target scene by using the binary classification model corresponding to the target scene to obtain a first prediction result; the target scene is a predetermined scene to be classified and predicted, and the first prediction result comprises the classification probability and/or the classification score of each sample in the initial sample set; an association relationship network construction module, configured to construct an initial association relationship network containing a plurality of nodes based on the attribute association relationships among the samples in the initial sample set; the nodes in the initial association relationship network correspond one to one to the samples in the initial sample set; the association relationships among the samples in the initial sample set are represented by connecting lines among the nodes in the initial association relationship network, and each connecting line is provided with a weight representing the degree of association between the samples; a label setting module, configured to set an initial class label for the sample corresponding to each node in the initial association relationship network based on the first prediction result, so as to obtain a first association relationship network; a community division module, configured to perform community division on the first association relationship network based on the first prediction result and the weight of each connecting line in the first association relationship network to obtain a first community set, wherein each community in the first community set comprises at least two nodes; and a label updating module, configured to, for each community in the first community set, update the initial class labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes contained in the community to obtain a second prediction result.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores computer-executable instructions that can be executed by the processor, and the processor executes the computer-executable instructions to implement the classification prediction method.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above classification prediction method.
According to the classification prediction method, the classification prediction device and the electronic equipment, after a first prediction result of a classification model is obtained, an initial incidence relation network comprising a plurality of nodes can be constructed based on attribute incidence relations among samples in an initial sample set; setting an initial class label for a sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network, and performing community division on the first association relationship network; and updating the initial class labels of the samples corresponding to the nodes in the same community by taking the community as a unit according to the community division result to obtain a second prediction result. By adopting the technology, the class labels of the samples are updated by utilizing the attribute incidence relation among the samples, so that the reliability of the prediction result of the two-class model can be improved, and the risk of loss of related business departments is reduced.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart illustrating a classification prediction method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an association network according to an embodiment of the present invention;
FIG. 3 is a diagram of another associative relationship network in an embodiment of the present invention;
FIG. 4 is a diagram of another associative relationship network in an embodiment of the present invention;
FIG. 5 is a flowchart illustrating another classification prediction method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a calculation process of an associated network tag correction algorithm according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a classification prediction apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, the correlation between samples cannot be represented by features used by a traditional binary classification model, so that the model cannot utilize additional information implied by the correlation between samples during prediction, so that the reliability of a prediction result is low, a conclusion obtained after a relevant business department analyzes the prediction result is likely to be wrong, and irreparable loss may be caused to the relevant business department in the future.
Based on this, the classification prediction method, the classification prediction device, and the electronic equipment provided by the embodiments of the invention can improve the reliability of the prediction result of a binary classification model, thereby reducing the risk of loss to the relevant business departments.
To facilitate understanding of the present embodiment, first, a classification prediction method disclosed in the present embodiment is described in detail, referring to a flowchart of the classification prediction method shown in fig. 1, where the method may include the following steps:
Step S102, predicting an initial sample set of a target scene by using a binary classification model corresponding to the target scene to obtain a first prediction result; the target scene is a predetermined scene to be classified and predicted, and the first prediction result comprises the classification probability and/or the classification score of each sample in the initial sample set.
The target scene may be a credit evaluation scene, an insurance agent reputation evaluation scene, or another scene that requires fraud prevention, which is not limited here. The binary classification model may be a decision tree model, a support vector machine model, a logistic regression model, or the like, which is not limited here. The input features and the predicted object class of the binary classification model should be consistent with the samples currently used. The classification score is usually obtained by converting the classification probability, and the specific conversion can be chosen according to actual needs: for example, the score may be calculated directly with a formula, or the classification probabilities of the different samples belonging to a given class may first be sorted and then mapped to corresponding score intervals according to the sorting result, which is not limited here.
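The second conversion option, sorting the probabilities and mapping them to score intervals, can be sketched as follows. The text does not specify the intervals, so this sketch assumes an evenly spaced mapping of ranks onto a score range in which the sample most likely to be positive receives the lowest score. The function name and the default range of 0 to 100 are hypothetical.

```python
def rank_scores(probs, low=0.0, high=100.0):
    """Map each sample's probability rank onto [low, high].

    ASSUMPTION: ranks are spread evenly over the range, and the sample with the
    highest positive-class probability gets the lowest score.
    """
    # Most probable positive first; ties are broken by sample name for determinism.
    ordered = sorted(probs, key=lambda s: (-probs[s], s))
    if len(ordered) <= 1:
        return {s: low for s in ordered}
    step = (high - low) / (len(ordered) - 1)
    return {sample: low + i * step for i, sample in enumerate(ordered)}


ranked = rank_scores({"A": 0.8, "B": 0.7, "C": 0.5, "D": 0.3})
```

Unlike a formula-based conversion, this mapping depends only on the ordering of the probabilities within the sample set, not on their absolute values.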
Taking a credit evaluation scene as an example, a sample in the initial sample set may be a loan application submitted by one of several loan applicants for the same loan product. The loan application may be represented by the ID number of the loan applicant, and each loan application is bound to relevant information about the applicant (such as name, gender, address, name of work unit, credit card repayment records, and loan repayment records). The ID numbers of all the loan applicants in the initial sample set may be input into a pre-trained binary classification model for prediction, and the model outputs the probability (i.e., the classification probability) that each ID number corresponds to a positive sample.
Taking an insurance agent reputation evaluation scene as an example, the samples in the initial sample set may be the job numbers of insurance agents. The job number of each insurance agent is bound to relevant information about the agent (such as name, gender, address, name of work unit, policy transaction records, and policy service processing records). The job numbers of all the insurance agents in the initial sample set may be input into a pre-trained binary classification model for prediction; the model calculates the probability (i.e., the classification probability) that each job number corresponds to a positive sample, and the classification probability is converted into a classification score for output.
Step S104, constructing an initial incidence relation network containing a plurality of nodes based on the attribute incidence relation among the samples in the initial sample set; each node in the initial incidence relation network corresponds to each sample in the initial sample set one by one; in the initial incidence relation network, the incidence relation among the samples in the initial sample set is represented by connecting lines among the nodes, and each connecting line is provided with a weight for representing the relevance among the samples.
Specifically, each sample in the initial sample set is defined to have one or more attributes, which may be gender, age, address, work unit, graduating institution, native place, and so on, and may be customized according to actual needs, which is not limited here. Correspondingly, the attribute association relationship may be a same-gender relationship, a same-age relationship, a same-residential-community relationship, an alumni relationship, a fellow-townsman relationship, a kinship relationship, a friendship relationship, and so on, and may be defined according to actual needs without limitation.
The initial association relationship network may be homogeneous or heterogeneous, and directed or undirected, and may be customized according to the actual situation, without limitation. The initial association relationship network may be constructed as follows: each node represents one sample in the initial sample set, and different nodes represent different samples in the initial sample set; if an attribute association relationship exists between two samples in the initial sample set, a connecting line (which may also be referred to as an "edge") is established between the nodes corresponding to the two samples; and a corresponding weight is set for each connecting line according to the number of associations and/or the closeness of the relationship between the samples.
As an example, if one or more association relationships exist between two different samples, an edge is established between the corresponding nodes and is given a weight; the weight represents the degree of association between the samples, with a larger weight indicating a stronger association and a smaller weight indicating a weaker one. The setting of the weight can be adjusted according to the business. For example, the weight may be set directly to the number of associations between the samples. As another example, in a heterogeneous association relationship network, the closeness of the relationship between the samples corresponding to different nodes differs: among a parent-child relationship, a cousin relationship, and a friendship, the parent-child relationship is the closest, so the weight of an edge corresponding to a parent-child relationship can be set to be the largest, and the weights of edges corresponding to a cousin relationship or a friendship can be set to be smaller. As a further example, in a directed association relationship network, the relationship between the samples corresponding to different nodes has a direction, and when the weight of each edge is set, the average of the numbers of associations in the two directions can be taken as the weight of the edge.
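The directed case can be sketched as follows: the numbers of associations in the two directions between a pair of nodes are averaged into a single edge weight. The function name and the input layout {(source, target): count} are hypothetical, and treating a missing direction as zero associations is an assumption of this sketch.

```python
def undirected_weights(directed_counts):
    """Average the association counts of the two directions of each node pair."""
    weights = {}
    for (u, v), count in directed_counts.items():
        pair = tuple(sorted((u, v)))
        # ASSUMPTION: a direction with no recorded associations counts as 0.
        reverse = directed_counts.get((v, u), 0)
        weights[pair] = (count + reverse) / 2
    return weights


weights = undirected_weights({("A", "B"): 4, ("B", "A"): 2, ("A", "C"): 6})
```

Here the pair A and B receives the weight 3.0 (the mean of 4 and 2), and the pair A and C, which has associations in one direction only, also receives 3.0 under the stated assumption.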
Referring to the schematic diagram of the association relationship network shown in fig. 2, the samples A, B, C, and D in the initial sample set each have a corresponding node in the association relationship network (represented in fig. 2 by circular icons bearing different characters). If there are 10 associations between samples A and B, 5 associations between samples A and C, 15 associations between samples B and C, and 1 association between samples C and D, and the weight of each edge is set to the number of associations between the samples, then the edges A-B, A-C, B-C, and C-D have weights of 10, 5, 15, and 1, respectively.
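The network of fig. 2 can be represented with a plain adjacency map, as in the following minimal sketch. The function name and the nested-dictionary layout {node: {neighbor: weight}} are assumptions made for illustration; the network is treated as undirected, and the edge weight is the number of associations.

```python
def build_network(samples, associations):
    """Return an undirected weighted adjacency map {node: {neighbor: weight}}."""
    graph = {s: {} for s in samples}  # one node per sample
    for u, v, count in associations:
        if u == v or count <= 0:
            continue  # skip self-links and empty associations
        # Weight = number of associations; repeated pairs accumulate.
        graph[u][v] = graph[u].get(v, 0) + count
        graph[v][u] = graph[v].get(u, 0) + count
    return graph


fig2 = build_network(
    ["A", "B", "C", "D"],
    [("A", "B", 10), ("A", "C", 5), ("B", "C", 15), ("C", "D", 1)],
)
```

Sample D is linked only to C, with the smallest weight in the network, which reflects its weak association with the other samples.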
By adopting the method for constructing the incidence relation network, the attribute relation among the samples can be visually represented.
Step S106, setting an initial class label for the sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network.
As an example, the first prediction result includes the classification probability of each sample in the initial sample set, and an initial class label may be set for the sample corresponding to each node in the initial association relationship network based on the classification probability of each sample. For example, in fig. 2, the probability that sample A is a positive sample is 0.8, the probability that sample B is a positive sample is 0.7, the probability that sample C is a positive sample is 0.5, and the probability that sample D is a positive sample is 0.3; the initial class labels of sample A and sample B may each be set to 1, and the initial class labels of sample C and sample D may each be set to 0.
As another example, the first prediction result includes the classification score of each sample in the initial sample set, and an initial class label may be set for the sample corresponding to each node in the initial association relationship network based on the classification score of each sample. For example, in fig. 2, the classification score of sample A is 30, the classification score of sample B is 44, the classification score of sample C is 50, and the classification score of sample D is 56; since a higher probability that a sample is a positive sample corresponds to a lower classification score, the initial class label of sample A may be set to 1, and the initial class labels of sample B, sample C, and sample D may all be set to 0.
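The score-based labeling in the second example can be sketched as follows. The function name is hypothetical, and the first score threshold of 40 is an assumed value chosen only because it separates sample A (score 30) from the other samples.

```python
def initial_labels(scores, first_score_threshold):
    """Label seed nodes (score below the threshold) 1 and all other nodes 0."""
    return {
        node: 1 if score < first_score_threshold else 0
        for node, score in scores.items()
    }


labels = initial_labels({"A": 30, "B": 44, "C": 50, "D": 56}, first_score_threshold=40)
```

With this threshold only sample A becomes a seed node carrying label 1, matching the labels 1, 0, 0, and 0 given in the example.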
Step S108, carrying out community division on the first incidence relation network based on the first prediction result and the weight of each connecting line in the first incidence relation network to obtain a first community set; each community in the first community set comprises at least two nodes.
A community is a local structure of the network: connections within the same community are tight, while connections between different communities are sparse; that is, the weights of the edges inside a community are greater than the weights of the edges between different communities. The community division may be implemented by a community discovery algorithm, which may be, but is not limited to, the Louvain algorithm, a label propagation algorithm, the Girvan-Newman (G-N) algorithm, and the like.
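As an illustration of one of the named algorithm families, the following is a minimal sketch of a weighted label propagation procedure. It is a simplified, deterministic variant written for this example (fixed node order, ties broken by the smallest label), not a reference implementation and not necessarily the variant used in the embodiment. The graph layout {node: {neighbor: weight}} and the example network are assumptions.

```python
def label_propagation(graph, max_rounds=100):
    """Group nodes by repeatedly adopting the neighbor label with the largest total weight."""
    labels = {node: node for node in graph}  # every node starts in its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed order keeps the result reproducible
            totals = {}
            for neighbor, weight in graph[node].items():
                lab = labels[neighbor]
                totals[lab] = totals.get(lab, 0) + weight
            if not totals:
                continue  # an isolated node keeps its own label
            # Heaviest label wins; ties go to the smallest label.
            best = min(totals, key=lambda lab: (-totals[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    communities = {}
    for node, lab in labels.items():
        communities.setdefault(lab, set()).add(node)
    return sorted(communities.values(), key=sorted)


# Two tightly linked triangles joined by one weak edge (weight 1).
graph = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "C": 10},
    "C": {"A": 10, "B": 10, "D": 1},
    "D": {"C": 1, "E": 10, "F": 10},
    "E": {"D": 10, "F": 10},
    "F": {"D": 10, "E": 10},
}
communities = label_propagation(graph)
```

The weak edge between C and D is outweighed by the heavy edges inside each triangle, so the procedure separates the network into the communities {A, B, C} and {D, E, F}.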
Step S210, for each community in the first community set, updating the initial category labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes contained in the community, and obtaining a second prediction result.
As an example, the first prediction result includes a classification probability of each sample in the initial sample set, and for each community, the initial category labels of the samples corresponding to the nodes in the community may be updated according to the classification probabilities of the samples corresponding to all nodes included in the community. Continuing the previous example, in FIG. 2, the probabilities that sample A, sample B, sample C, and sample D are positive samples are 0.8, 0.7, 0.5, and 0.3, respectively, their initial class labels are 1, 1, 0, and 0, respectively, and the nodes corresponding to sample A, sample B, and sample C are divided into the same community. The class labels of sample A, sample B, and sample C may all be updated to 1 while the class label of sample D is kept at 0, so that after the update the class labels of sample A, sample B, sample C, and sample D are 1, 1, 1, and 0, respectively.
As another example, the first prediction result includes a classification score of each sample in the initial sample set, and for each community, the initial category labels of the samples corresponding to the nodes in the community may be updated according to the classification scores of the samples corresponding to all nodes included in the community. Continuing the previous example, in FIG. 2, the scores of sample A, sample B, sample C, and sample D are 30, 44, 50, and 56, respectively, their initial category labels are 1, 0, 0, and 0, respectively, the nodes corresponding to sample A and sample B are divided into one community, and the nodes corresponding to sample C and sample D are divided into another community. The class labels of sample A and sample B may both be updated to 1 and the class labels of sample C and sample D kept at 0, so that after the update the class labels of sample A, sample B, sample C, and sample D are 1, 1, 0, and 0, respectively.
According to the classification prediction method provided by the embodiment of the invention, after the first prediction result of the binary classification model is obtained, an initial association relationship network comprising a plurality of nodes can be constructed based on the attribute association relationships among the samples in the initial sample set; an initial class label is set for the sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network, and community division is performed on the first association relationship network; and according to the community division result, the initial class labels of the samples corresponding to the nodes in the same community are updated community by community to obtain a second prediction result. Because the class labels of the samples are updated using the attribute association relationships among the samples, the reliability of the prediction result of the binary classification model can be improved, and the risk of loss to the related business departments is reduced.
As a possible implementation manner, the step S106 (that is, setting an initial category label for the sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain the first association relationship network) may include the following operation manners:
(11) Based on the first prediction result, a classification score for each sample in the initial sample set is determined.
This step (11) may be performed in either of the following operation modes 1 and 2:
Operation mode 1: if the first prediction result comprises the classification score of each sample in the initial sample set, the classification score of each sample in the initial sample set is directly obtained from the first prediction result.
Operation mode 2: if the first prediction result comprises the classification probability of each sample in the initial sample set but does not comprise the classification score of each sample in the initial sample set, directly obtaining the classification probability of each sample in the initial sample set from the first prediction result; a classification score for each sample in the initial sample set is then determined based on the classification probability for each sample in the initial sample set.
To help the relevant business personnel better understand the predicted probability that a sample belongs to a given category, the differences between probabilities can be represented more intuitively through discretization: following the scorecard probability conversion commonly used in the credit industry, the classification probability obtained in step S102 is converted into a classification score. For example, in operation mode 2, the classification score of each sample in the initial sample set can be calculated according to the following formula:
Figure BDA0003757417480000121
wherein basepoint is the benchmark score, Pdo is the step length, and prob is the classification probability of the sample.
The benchmark score basepoint and the step length Pdo can be adjusted according to the service requirements and the sample data. For example, to obtain a more compact prediction result, the classification probability can be discretized into a classification score of 0-10 by setting basepoint to 5 and Pdo to 1; without loss of generality, the classification probability can also be discretized into a classification score of 0-100 by setting basepoint to 50 and Pdo to 10. The lower the classification score of a sample, the greater the probability that the sample is a positive sample.
Continuing with the previous example, for example, in FIG. 2, if the base score is set to 50 and the step length Pdo is set to 10, the classification score of the sample A is determined
Figure BDA0003757417480000131
Similarly, the classification scores of sample B, sample C, and sample D are obtained as Score_B = 44, Score_C = 50, and Score_D = 56.
In addition, in the operation mode 2, other formulas than the above formula may also be used to convert the classification probability into a classification score, which may be specifically adjusted according to the actual service, and is not limited thereto.
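The formula itself appears only as an image in the source and is not reproduced in this text. As a hedged illustration of one such conversion, the sketch below uses a common scorecard form, score = basepoint - Pdo * log2(prob / (1 - prob)). This form is an assumption, not the formula of the embodiment: with basepoint 50 and Pdo 10 it gives 30 for a probability of 0.8 and 50 for a probability of 0.5, but about 37.8 and 62.2 for probabilities 0.7 and 0.3, which differ from the scores 44 and 56 quoted for sample B and sample D. The function name `prob_to_score` is hypothetical.

```python
import math

def prob_to_score(prob, basepoint=50.0, pdo=10.0):
    """Assumed conversion: basepoint - pdo * log2(odds), with odds = prob / (1 - prob).

    A higher positive-class probability yields a lower score.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    return basepoint - pdo * math.log2(prob / (1.0 - prob))

for p in (0.8, 0.7, 0.5, 0.3):
    print(p, round(prob_to_score(p), 2))
```

Whatever formula is used, the property the later steps rely on is only that the score decreases as the positive-class probability increases.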
(12) Determining seed nodes in the initial incidence relation network based on the classification score of each sample in the initial sample set; and the classification score of the sample corresponding to the seed node is smaller than a first score threshold value.
As one example, a node in the initial incidence relation network corresponding to a sample with a classification score smaller than a first score threshold may be determined as a seed node. Referring to the association relationship network shown in fig. 2, the scores of the samples a, B, C and D belonging to the positive samples are 30, 44, 50 and 56, respectively, a first score threshold of 40 may be defined, and the node corresponding to the sample with the classification score smaller than 40 (i.e., the sample a) is determined as the seed node.
(13) Setting the initial class labels of the samples corresponding to the seed nodes as first class labels for representing the positive samples, and setting the initial class labels of the samples corresponding to other nodes except the seed nodes in the initial incidence relation network as second class labels for representing the negative samples to obtain the first incidence relation network.
Continuing with the previous example, referring to the association relationship network shown in fig. 2, it may be defined that the first class label characterizing the positive sample is 1, the second class label characterizing the negative sample is 0, after the node corresponding to the sample a is determined as the seed node, the initial class label of the sample a may be set to 1, and the initial class labels of the sample B, the sample C, and the sample D are all set to 0.
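Steps (12) and (13) can be sketched in a few lines: every node whose score is below the first score threshold becomes a seed node with label 1, and every other node receives label 0. The function name `initial_labels` is hypothetical; the scores and the threshold of 40 are those of the example above.

```python
def initial_labels(scores, first_threshold):
    """Label seed nodes (score < first_threshold) as 1 and all other nodes as 0."""
    return {node: 1 if score < first_threshold else 0
            for node, score in scores.items()}

scores = {"A": 30, "B": 44, "C": 50, "D": 56}
labels = initial_labels(scores, first_threshold=40)
print(labels)  # {'A': 1, 'B': 0, 'C': 0, 'D': 0}
```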
As a possible implementation manner, the step S108 (i.e. performing community division on the first association network based on the first prediction result and the weight of each connection line in the first association network to obtain the first community set) may include the following operation manners:
(21) Based on the classification score of each sample in the initial sample set, pruning the first association relationship network to obtain a second association relationship network; the classification score of the sample corresponding to each node in the second association relationship network is smaller than or equal to a second score threshold; the second score threshold is greater than the first score threshold.
As an example, the nodes in the first association relationship network whose classification scores are less than or equal to the second score threshold, together with their corresponding edges, may be retained, and the nodes whose classification scores are greater than the second score threshold, together with their corresponding edges, may be removed to obtain the second association relationship network.
(22) And according to the weight of each connecting line in the second incidence relation network, carrying out community division on the second incidence relation network by adopting a community discovery algorithm to obtain an initial community set.
(23) And eliminating communities only containing one node in the initial community set to obtain a first community set.
Continuing the previous example with reference to FIG. 2 and FIG. 3, the scores of sample A, sample B, sample C, and sample D are 30, 44, 50, and 56, respectively. The association relationship network shown in FIG. 2 may be defined as the first association relationship network, and the one shown in FIG. 3 as the second association relationship network. With the second score threshold defined as 45, the nodes corresponding to the samples whose classification scores are less than or equal to 45 (i.e., sample A and sample B) are selected from the first association relationship network to construct the second association relationship network shown in FIG. 3. The second association relationship network is then divided into communities using the Louvain algorithm, and communities containing only one node are eliminated, yielding community 1, which contains the nodes corresponding to sample A and sample B.
Continuing the previous example with reference to FIG. 2 and FIG. 4, the scores of sample A, sample B, sample C, and sample D are 30, 44, 50, and 56, respectively. The association relationship network shown in FIG. 2 may be defined as the first association relationship network, and the one shown in FIG. 4 as the second association relationship network. With the second score threshold defined as 50, the nodes corresponding to the samples whose classification scores are less than or equal to 50 (i.e., sample A, sample B, and sample C) are selected from the first association relationship network to construct the second association relationship network shown in FIG. 4. The second association relationship network is then divided into communities using the Louvain algorithm, and communities containing only one node are eliminated, yielding community 2, which contains the nodes corresponding to sample A, sample B, and sample C.
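Steps (21) to (23) can be sketched as follows. Two assumptions are made and should be read as such: the edges of FIG. 2 are not given in the text, so a simple chain A-B, B-C, C-D with unit weights is used, and connected components stand in for the community discovery algorithm (this stand-in ignores edge weights, whereas Louvain or label propagation would use them). All function names are hypothetical.

```python
from collections import defaultdict

def prune(edges, scores, second_threshold):
    """Keep only edges whose two endpoints both score <= second_threshold."""
    return [(u, v, w) for u, v, w in edges
            if scores[u] <= second_threshold and scores[v] <= second_threshold]

def components(nodes, edges):
    """Stand-in for a community discovery algorithm: connected components."""
    adj = defaultdict(set)
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result

def first_community_set(edges, scores, second_threshold):
    """Prune by score, divide into communities, drop single-node communities."""
    kept_nodes = [n for n, s in scores.items() if s <= second_threshold]
    kept_edges = prune(edges, scores, second_threshold)
    return [c for c in components(kept_nodes, kept_edges) if len(c) >= 2]

scores = {"A": 30, "B": 44, "C": 50, "D": 56}
edges = [("A", "B", 1.0), ("B", "C", 1.0), ("C", "D", 1.0)]  # assumed topology

print(first_community_set(edges, scores, 45))  # one community holding A and B
print(first_community_set(edges, scores, 50))  # one community holding A, B and C
```

Under these assumptions the two thresholds reproduce community 1 and community 2 of the examples above.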
As a possible implementation manner, the step S210 (that is, for each community in the first community set, updating the initial category labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes included in the community to obtain the second prediction result) may include the following operation manners: and for each community in the first community set, updating the initial class labels of the samples corresponding to the nodes in the community according to the number of the nodes corresponding to the samples with the first class labels in the community and the classification scores of the samples corresponding to all the nodes contained in the community, and obtaining a second prediction result.
As an example, the above-mentioned operation mode may be performed according to the following steps:
(31) for each community in the first community set, the percentage of the number of the nodes corresponding to the samples with the first type labels in the community can be calculated according to the number of the nodes corresponding to the samples with the first type labels in the community.
(32) For each community in the first community set, the average classification score of the samples corresponding to the nodes in the community can be calculated according to the classification scores of the samples corresponding to all the nodes contained in the community.
(33) For each community in the first community set, if the number percentage of nodes corresponding to the samples with the first class labels in the community is greater than or equal to a preset percentage threshold, and the average classification score of the samples corresponding to the nodes in the community is smaller than a third score threshold, replacing the initial class labels of the samples corresponding to all the nodes in the community with the first class labels to obtain final class labels of the samples corresponding to all the nodes in the community; wherein the third score threshold is greater than or equal to the first score threshold.
(34) For each community in the first community set, if the number percentage of the nodes corresponding to the samples with the first class labels in the community is smaller than the preset percentage threshold, or the average classification score of the samples corresponding to the nodes in the community is greater than or equal to the third score threshold, maintaining the initial class labels of the samples corresponding to all the nodes in the community unchanged, and obtaining the final class labels of the samples corresponding to all the nodes in the community.
(35) And taking each sample with the final class label as a second prediction result.
Continuing with the previous example, with reference to FIG. 2 and FIG. 3, the scores of sample A, sample B, sample C, and sample D are 30, 44, 50, and 56, respectively, their initial category labels are 1, 0, 0, and 0, respectively, and the first community set includes only community 1, which contains the nodes corresponding to sample A and sample B. The preset percentage threshold may be defined as 50% and the third score threshold as 40. The percentage of nodes in community 1 corresponding to samples with label 1 is 1/2 ≥ 50%, and the average classification score of the samples corresponding to the nodes in community 1 is (30 + 44)/2 = 37 < 40, so the initial class labels of the samples corresponding to all nodes in community 1 are replaced with label 1. After the replacement, the final class labels of sample A, sample B, sample C, and sample D are 1, 1, 0, and 0, respectively; sample A with label 1, sample B with label 1, sample C with label 0, and sample D with label 0 are taken as the final prediction result (i.e., the second prediction result).
Continuing with the previous example, with reference to FIG. 3 and FIG. 4, the scores of sample A, sample B, sample C, and sample D are 30, 44, 50, and 56, respectively, their initial category labels are 1, 1, 0, and 0, respectively, and the first community set includes only community 2, which contains the nodes corresponding to sample A, sample B, and sample C. The preset percentage threshold may be defined as 50% and the third score threshold as 40. The percentage of nodes in community 2 corresponding to samples with label 1 is 2/3 ≥ 50%, but the average classification score of the samples corresponding to the nodes in community 2 is (30 + 44 + 50)/3 ≈ 41.33 > 40, so the initial class labels of the samples corresponding to all nodes in community 2 are maintained unchanged, and the final class labels of sample A, sample B, sample C, and sample D are 1, 1, 0, and 0, respectively; sample A with label 1, sample B with label 1, sample C with label 0, and sample D with label 0 are taken as the final prediction result (i.e., the second prediction result).
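The update rule of steps (31) to (35) can be sketched as a single function that checks both conditions for one community. The function name `update_community` and its parameter names are hypothetical; the thresholds of 50% and 40 and the sample scores are those of the two examples above.

```python
def update_community(labels, scores, community, pct_threshold=0.5, third_threshold=40):
    """Return a copy of labels in which the whole community is set to 1
    if its share of label-1 nodes and its mean score both pass the thresholds."""
    members = list(community)
    positive_share = sum(labels[n] for n in members) / len(members)
    mean_score = sum(scores[n] for n in members) / len(members)
    updated = dict(labels)
    if positive_share >= pct_threshold and mean_score < third_threshold:
        for n in members:
            updated[n] = 1
    return updated

scores = {"A": 30, "B": 44, "C": 50, "D": 56}

# Community 1: share 1/2, mean 37 -> both conditions hold, B is promoted to 1.
after_1 = update_community({"A": 1, "B": 0, "C": 0, "D": 0}, scores, {"A", "B"})
print(after_1)  # {'A': 1, 'B': 1, 'C': 0, 'D': 0}

# Community 2: share 2/3, mean about 41.33 -> mean condition fails, labels unchanged.
after_2 = update_community({"A": 1, "B": 1, "C": 0, "D": 0}, scores, {"A", "B", "C"})
print(after_2)  # {'A': 1, 'B': 1, 'C': 0, 'D': 0}
```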
Based on the above classification prediction method, another classification prediction method is further provided in the embodiments of the present invention, as shown in fig. 5, the method includes the following steps:
step S501, obtaining the trained logistic regression model.
Step S502, predicting a current sample set by using a logistic regression model to obtain a current prediction result; wherein the current prediction result comprises a classification probability or a classification score of each sample in the current sample set.
If the current prediction result includes the classification probability of each sample in the current sample set, step S503 is executed; if the current prediction result includes the classification score of each sample in the current sample set, step S503 is skipped.
Step S503, converting the classification probability of each sample in the current sample set into a corresponding classification score.
The classification score for each sample in the current sample set may be calculated according to the following formula:
Figure BDA0003757417480000171
wherein basepoint is the benchmark score, Pdo is the step length, and prob is the classification probability of the sample.
Step S504, based on the attribute association relationships among the samples in the current sample set, constructing an association relationship network comprising a plurality of nodes; the nodes in the association relationship network correspond one to one to the samples in the current sample set; in the association relationship network, the association relationships among the samples in the current sample set are represented by the connecting lines between the nodes, and each connecting line carries a weight representing the degree of association between the samples.
For brevity, the "association relationship network" is hereinafter referred to as the "association network", and a sub-network of the association relationship network is hereinafter referred to as the "sub-network".
The association network is constructed in the same way as the initial association relationship network described above, and the related content is not repeated here.
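One way to build such a weighted network is sketched below: two samples are connected when they share an attribute value, and the edge weight counts how many attribute values they share, in line with the description that weights may reflect the number of associations between samples. The attribute names (`device`, `phone`), their values, and the function name `build_network` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_network(sample_attrs):
    """Return {(u, v): weight}, where weight is the number of attribute
    values shared by samples u and v (u < v in sort order)."""
    by_value = defaultdict(set)
    for sample, attrs in sample_attrs.items():
        for key, value in attrs.items():
            by_value[(key, value)].add(sample)
    weights = defaultdict(int)
    for members in by_value.values():
        for u, v in combinations(sorted(members), 2):
            weights[(u, v)] += 1
    return dict(weights)

# Hypothetical attribute data for four samples.
sample_attrs = {
    "A": {"device": "d1", "phone": "p1"},
    "B": {"device": "d1", "phone": "p1"},
    "C": {"device": "d2", "phone": "p1"},
    "D": {"device": "d3", "phone": "p9"},
}
network = build_network(sample_attrs)
print(network)  # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1}
```

Sample D shares nothing with the others, so it remains an isolated node and would later be dropped as a single-node community.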
Step S505, selecting a seed node.
After the association network is constructed, the node corresponding to the sample with the classification score smaller than N obtained in step S503 may be selected as a seed node, so that the selected seed node may be subsequently used as a start node of the association network correction algorithm, and the current prediction result of the logistic regression model is optimized through the association network correction algorithm. And N needs to be adjusted according to the accuracy of the logistic regression model, so that the prediction accuracy of the logistic regression model on the samples with the classification scores smaller than N is high enough.
Step S506, setting a class label of a sample corresponding to a node in the association network.
Specifically, the class label of the sample corresponding to the seed node in the association network is set to 1 (positive sample), and the class labels of the samples corresponding to the remaining nodes other than the seed node in the association network are set to 0 (negative sample).
And step S507, optimizing the current prediction result of the logistic regression model by using the associated network label correction algorithm.
Referring to fig. 6, the calculation process of the associated network tag correction algorithm is as follows:
Step 0: a start score S is set.
Typically, the initial value of S is greater than N. A loop termination threshold M is also set, with M > S. M needs to be adjusted according to the performance of the logistic regression model; the prediction results for samples whose classification scores lie above N and below M have high uncertainty.
Step 1: determining whether S is greater than or equal to M; if yes, ending the flow; if not, going to Step 2.
For the sake of brevity, the "classification score" is hereinafter referred to as "score".
Step 2: selecting the nodes corresponding to the samples with scores less than or equal to S to construct a sub-network of the association network; that is, the nodes of the samples scoring at or below S are retained together with the edges between them, and all other nodes and their corresponding edges are removed from the association network.
Step 3: and carrying out community division on the sub-network, and eliminating communities with the community size smaller than 2.
The community size specifically refers to the number of nodes included in each community.
Step 4: and for each divided community, updating the class label of the sample corresponding to the node in the community according to a preset rule.
The preset rule is as follows: if the proportion of the nodes corresponding to the samples with the labels of 1 in the community is greater than or equal to 50% and the average score of the nodes in the community is less than T, setting the class labels of the samples corresponding to all the nodes in the community to be 1; otherwise, keeping the class labels of the samples corresponding to all the nodes in the community unchanged. The setting of T needs to be adjusted according to the performance of the logistic regression model so as to ensure that the whole community is in a lower fractional segment. Generally, T is equal to or slightly greater than N.
Step 5: updating S by setting S = S + K, then returning to Step 1. The step length K can be adjusted as required.
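Steps 0 to 5 can be put together as the loop sketched below. Several choices are assumptions rather than statements of the embodiment: connected components stand in for the community discovery algorithm, nodes with scores not exceeding S are retained, the loop stops once S reaches M, and the edges are an assumed chain A-B, B-C, C-D, since the topology of FIG. 2 is not given in the text. The function names are hypothetical; the scores 30, 44, 50, and 56 are those of the four samples used throughout this description.

```python
from collections import defaultdict

def components(nodes, edges):
    """Stand-in for a community discovery algorithm: connected components."""
    adj = defaultdict(set)
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result

def correct_labels(scores, edges, N, S, M, K, T, share=0.5, detect=components):
    """Iteratively widen the score window and promote qualifying communities."""
    labels = {n: 1 if s < N else 0 for n, s in scores.items()}  # seed nodes
    while S < M:  # assumption: stop once S reaches M
        kept = {n for n, s in scores.items() if s <= S}          # Step 2
        sub_edges = [(u, v, w) for u, v, w in edges if u in kept and v in kept]
        for community in detect(kept, sub_edges):                # Step 3
            if len(community) < 2:
                continue
            members = list(community)                            # Step 4
            positive = sum(labels[n] for n in members) / len(members)
            mean = sum(scores[n] for n in members) / len(members)
            if positive >= share and mean < T:
                for n in members:
                    labels[n] = 1
        S += K                                                   # Step 5
    return labels

scores = {"A": 30, "B": 44, "C": 50, "D": 56}
edges = [("A", "B", 1.0), ("B", "C", 1.0), ("C", "D", 1.0)]  # assumed topology
final = correct_labels(scores, edges, N=40, S=45, M=55, K=5, T=40)
print(final)  # {'A': 1, 'B': 1, 'C': 0, 'D': 0}
```

The `detect` parameter is left pluggable so that a weight-aware algorithm such as Louvain can be substituted for the stand-in.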
And step S508, outputting the final prediction result.
After step S507 is completed, the updated class label of the sample corresponding to each node is obtained, and each sample with the updated class label is output as the final prediction result.
For ease of understanding, steps S507 to S508 are described below with reference to FIG. 2 to FIG. 4. As shown in FIG. 2, the scores of sample A, sample B, sample C, and sample D are 30, 44, 50, and 56, respectively. If N is set to 40, sample A is the only sample with a score smaller than N, so the seed nodes include only the node corresponding to sample A; the class label of sample A is set to 1 and the class labels of sample B, sample C, and sample D are all set to 0. On this basis, the associated network label correction algorithm proceeds in the following three rounds:
Round 1:
Step 10: the start score S is set to 45 and the loop termination threshold M is set to 55.
Step 11: since S = 45 < 55, the flow goes to Step 12.
Step 12: the nodes corresponding to the samples with scores less than or equal to 45 (i.e., sample A and sample B) are selected to construct the sub-network shown in FIG. 3.
Step 13: the sub-networks shown in the figure 3 are subjected to community division by adopting a Louvain algorithm, communities with the community scale smaller than 2 are eliminated, and a community 1 is obtained, wherein the community 1 comprises nodes corresponding to the sample A and the sample B.
Step 14: and updating the class labels of the samples corresponding to the nodes in the community according to a preset rule for the divided community 1.
Setting T = N = 40, the preset rule is: if the proportion of nodes in the community corresponding to samples with label 1 is greater than or equal to 50% and the average score of the nodes in the community is less than 40, the class labels of the samples corresponding to all nodes in the community are set to 1; otherwise, the class labels of the samples corresponding to all nodes in the community are kept unchanged. The average score of the nodes in community 1 is (30 + 44)/2 = 37 < 40, and the proportion of nodes in community 1 corresponding to samples with label 1 is 1/2 ≥ 50%, so both update conditions are satisfied and the class labels of sample A and sample B are both set to 1. At this point the class label of sample A is still 1, the class labels of sample C and sample D are still 0, and the class label of sample B has changed from 0 to 1.
Step 15: S is updated; with the step length K set to 5, S = 45 + 5 = 50, and the flow goes to Step 21 of round 2.
Round 2:
Step 21: since S = 50 < 55, the flow goes to Step 22.
Step 22: the nodes corresponding to the samples with scores less than or equal to 50 (i.e., sample A, sample B, and sample C) are selected to construct the sub-network shown in FIG. 4.
Step 23: the sub-networks shown in the figure 4 are subjected to community division by adopting a Louvain algorithm, communities with the community scale smaller than 2 are eliminated, and a community 2 is obtained, wherein the community 2 comprises nodes corresponding to the sample A, the sample B and the sample C.
Step 24: and updating the class labels of the samples corresponding to the nodes in the community according to a preset rule for the divided community 2.
Although the proportion of nodes in community 2 corresponding to samples with label 1 is 2/3 ≈ 66% ≥ 50%, the average score of the nodes in community 2 is (30 + 44 + 50)/3 ≈ 41.33 ≥ 40, so the update condition is not satisfied and the category labels of sample A, sample B, and sample C are not changed. At this point the category labels of sample A, sample B, sample C, and sample D are 1, 1, 0, and 0, respectively.
Step 25: S is updated; with the step length K set to 5, S = 50 + 5 = 55, and the flow goes to Step 31 of round 3.
Round 3:
Step 31: since S = 55 is greater than or equal to M = 55, the loop ends and the associated network label correction algorithm is complete.
The final prediction results are sample a with label 1, sample B with label 1, sample C with label 0, and sample D with label 0.
The classification prediction method starts from an established binary classification model, builds an association network from the association relationships between samples, and then corrects the prediction result of the binary classification model using the information in the association network. With this method, communities are expanded iteratively, starting from the positive samples that the binary classification model predicts most reliably, and only nodes whose samples are more likely to be positive are added to a community, which prevents the accuracy of the prediction result from dropping substantially. In addition, the method makes full use of the association information between samples without increasing the feature dimension: because this information is merged into the result of the binary classification model, positive samples can be found more accurately among the samples whose binary classification results are uncertain, so the recall rate is improved without reducing the accuracy rate.
For example, in some businesses fraud is committed by organized groups; some individuals in such a group cannot be accurately identified from their own features, but can be discovered through their associations. The associations are noisy, however, so it is not appropriate to simply judge every individual in a divided community to be fraudulent. The classification prediction method effectively combines a machine learning model with association network techniques, makes fuller use of the individual-level and group-level information contained in the sample data, and corrects the prediction result of the binary classification model through the association relationships between samples, thereby obtaining a better prediction result. The classification prediction method is therefore broadly applicable to scenarios in which clustered abnormal behavior is likely to occur.
Based on the above classification prediction method, an embodiment of the present invention further provides a classification prediction apparatus, as shown in fig. 7, the apparatus includes the following modules:
a prediction module 702, configured to predict an initial sample set of a target scene by using a binary model corresponding to the target scene to obtain a first prediction result; the target scene is a predetermined scene to be classified and predicted, and the first prediction result comprises the classification probability and/or the classification score of each sample in the initial sample set.
An association relationship network building module 704, configured to build an initial association relationship network including a plurality of nodes based on the attribute association relationship among the samples in the initial sample set; each node in the initial incidence relation network corresponds to each sample in the initial sample set one by one; and in the initial incidence relation network, the incidence relation among the samples in the initial sample set is represented by connecting lines among the nodes, and each connecting line is provided with a weight for representing the relevance among the samples.
A label setting module 706, configured to set an initial category label for a sample corresponding to each node in the initial association network based on the first prediction result, so as to obtain a first association network.
A community division module 708, configured to perform community division on the first incidence relation network based on the first prediction result and the weight of each connection line in the first incidence relation network to obtain a first community set; wherein each community in the first set of communities comprises at least two nodes.
The label updating module 710 is configured to, for each community in the first community set, update the initial category labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes included in the community, so as to obtain a second prediction result.
According to the classification prediction device provided by the embodiment of the invention, after the first prediction result of the binary classification model is obtained, an initial association relationship network comprising a plurality of nodes can be constructed based on the attribute association relationships among the samples in the initial sample set; an initial class label is set for the sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network, and community division is performed on the first association relationship network; and according to the community division result, the initial class labels of the samples corresponding to the nodes in the same community are updated community by community to obtain a second prediction result. Because the class labels of the samples are updated using the attribute association relationships among the samples, the reliability of the prediction result of the binary classification model can be improved, and the risk of loss to the related business departments is reduced.
The tag setting module 706 may further be configured to: determining a classification score for each sample in the initial set of samples based on the first prediction result; determining a seed node in the initial incidence relation network based on the classification score of each sample in the initial sample set; the classification score of the sample corresponding to the seed node is smaller than a first score threshold value; setting the initial class label of the sample corresponding to the seed node as a first class label representing a positive sample, and setting the initial class labels of the samples corresponding to other nodes except the seed node in the initial incidence relation network as second class labels representing a negative sample to obtain the first incidence relation network.
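The seed-node rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `set_initial_labels`, the constants `POSITIVE` and `NEGATIVE`, and the example scores and threshold are all hypothetical.

```python
from typing import Dict

POSITIVE = 1  # hypothetical encoding of the first class label (positive sample)
NEGATIVE = 0  # hypothetical encoding of the second class label (negative sample)


def set_initial_labels(scores: Dict[str, float], first_threshold: float) -> Dict[str, int]:
    """Label seed nodes (score strictly below the first threshold) as positive.

    Every other node in the network receives the negative label.
    """
    return {
        node: POSITIVE if score < first_threshold else NEGATIVE
        for node, score in scores.items()
    }


# Hypothetical classification scores keyed by sample (node) id.
scores = {"a": 420.0, "b": 610.0, "c": 500.0}
labels = set_initial_labels(scores, first_threshold=500.0)
```

Node "c" sits exactly on the threshold and is therefore not a seed node, since the text requires the score to be smaller than the first score threshold.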
The tag setting module 706 may further be configured to: directly obtaining the classification probability of each sample in the initial sample set from the first prediction result; determining a classification score for each sample in the initial set of samples based on the classification probability for each sample in the initial set of samples.
The tag setting module 706 may further be configured to: and directly obtaining the classification score of each sample in the initial sample set from the first prediction result.
The tag setting module 706 may further be configured to: calculating a classification score for each sample in the initial sample set based on the classification probability for each sample in the initial sample set in the first prediction result according to the following formula:
[Formula image not reproduced in this text: Figure BDA0003757417480000231]
wherein basepoint is the benchmark score, Pdo is the step length, and prob is the classification probability of the sample.
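The formula itself is an image that this text does not reproduce, so the sketch below does not claim to match it. It assumes the log-odds scaling commonly used in credit scorecards, in which doubling the odds of the positive class lowers the score by Pdo points. The function name, the default values 600 and 50, and the sign convention (lower score means higher positive-class probability, consistent with seed nodes having low scores) are assumptions.

```python
import math


def probability_to_score(prob: float, basepoint: float = 600.0, pdo: float = 50.0) -> float:
    """Map a positive-class probability to a score (assumed log-odds scaling).

    Assumption, not taken from the patent figure:
        score = basepoint - pdo / ln(2) * ln(prob / (1 - prob))
    so doubling the odds of the positive class lowers the score by pdo points.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    odds = prob / (1.0 - prob)
    return basepoint - pdo / math.log(2.0) * math.log(odds)


even_odds_score = probability_to_score(0.5)         # odds 1:1 gives the benchmark score
double_odds_score = probability_to_score(2.0 / 3.0)  # odds 2:1 gives roughly basepoint - pdo
```

Under this assumed convention, a probability of 0.5 maps to the benchmark score, and each doubling of the odds moves the score down by one step length.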
The association relationship network building module 704 may further be configured to: if the attribute incidence relation exists between the two samples in the initial sample set, establishing a connection line between nodes corresponding to the two samples; and setting corresponding weight for each connecting line according to the association times and/or the relationship compactness among the samples.
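As a rough illustration of this construction step, the sketch below connects two samples whenever they share an attribute value and uses the number of shared values as the connecting-line weight (the "association times" option). The record layout, the function name `build_network`, and the example attribute values are hypothetical. Weighting by relationship compactness is not shown.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Tuple


def build_network(records: Iterable[Tuple[str, str]]) -> Dict[FrozenSet[str], int]:
    """Build weighted undirected edges from (sample_id, attribute_value) pairs.

    Two samples are linked if they share at least one attribute value;
    the edge weight counts how many values they share.
    """
    holders = defaultdict(set)  # attribute value -> samples that carry it
    for sample, value in records:
        holders[value].add(sample)

    weights: Dict[FrozenSet[str], int] = defaultdict(int)
    for samples in holders.values():
        for u, v in combinations(sorted(samples), 2):
            weights[frozenset((u, v))] += 1
    return dict(weights)


# Hypothetical attribute records: a and b share a phone and an address, c shares only the address.
records = [
    ("a", "phone:1"), ("b", "phone:1"),
    ("a", "addr:9"), ("b", "addr:9"), ("c", "addr:9"),
]
edges = build_network(records)
```

Samples with no shared attributes produce no edges, so the full node set would need to be tracked separately to preserve the one-to-one correspondence between nodes and samples.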
The community partitioning module 708 may be further configured to: based on the classification score of each sample in the initial sample set, prune nodes from the first incidence relation network to obtain a second incidence relation network, wherein the classification score of the sample corresponding to each node in the second incidence relation network is smaller than or equal to a second score threshold, and the second score threshold is greater than the first score threshold; according to the weight of each connecting line in the second incidence relation network, perform community division on the second incidence relation network by adopting a community discovery algorithm to obtain an initial community set; and eliminate communities which contain only one node from the initial community set to obtain the first community set.
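The three steps above (prune by score, divide into communities, drop singletons) can be sketched as follows. The text does not name a specific community discovery algorithm, so this sketch substitutes a deliberately simple stand-in: connected components over edges whose weight reaches a hypothetical `min_weight`. A real system would more likely use a modularity-based or label-propagation method. All names and example values are assumptions.

```python
from typing import Dict, FrozenSet, List, Set


def divide_communities(
    scores: Dict[str, float],
    weights: Dict[FrozenSet[str], float],
    second_threshold: float,
    min_weight: float = 1.0,
) -> List[Set[str]]:
    """Prune high-score nodes, group the rest, and drop single-node communities."""
    # Step 1: keep only nodes whose score is <= the second score threshold.
    kept = {node for node, score in scores.items() if score <= second_threshold}

    # Step 2: adjacency over surviving nodes, using edges that are heavy enough.
    adjacency: Dict[str, Set[str]] = {node: set() for node in kept}
    for edge, weight in weights.items():
        u, v = sorted(edge)  # assumes no self-loops
        if u in kept and v in kept and weight >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)

    # Stand-in for a community discovery algorithm: connected components via DFS.
    seen: Set[str] = set()
    communities: List[Set[str]] = []
    for start in sorted(kept):
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        communities.append(component)

    # Step 3: eliminate communities that contain only one node.
    return [community for community in communities if len(community) >= 2]


scores = {"a": 420.0, "b": 480.0, "c": 650.0, "d": 510.0, "e": 530.0}
weights = {
    frozenset(("a", "b")): 2.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("d", "e")): 0.5,
}
first_community_set = divide_communities(scores, weights, second_threshold=550.0)
```

In this example node "c" is pruned because its score exceeds the second threshold, and "d" and "e" end up as singletons because their only connecting line is too light, so just one community remains.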
The tag update module 710 may further be configured to: and updating the initial class labels of the samples corresponding to the nodes in the community according to the number of the nodes corresponding to the samples with the first class labels in the community and the classification scores of the samples corresponding to all the nodes contained in the community, so as to obtain the second prediction result.
The tag update module 710 may further be configured to: for each community in the first community set, calculate the percentage of nodes in the community that correspond to samples with the first class label, according to the number of such nodes in the community; for each community in the first community set, calculate the average classification score of the samples corresponding to the nodes in the community according to the classification scores of the samples corresponding to all the nodes contained in the community; for each community in the first community set, if the percentage of nodes corresponding to samples with the first class label is greater than or equal to a preset percentage threshold, and the average classification score of the samples corresponding to the nodes in the community is smaller than a third score threshold, replace the initial class labels of the samples corresponding to all the nodes in the community with the first class label to obtain the final class labels of the samples corresponding to all the nodes in the community, wherein the third score threshold is greater than or equal to the first score threshold; for each community in the first community set, if the percentage of nodes corresponding to samples with the first class label is smaller than the preset percentage threshold, or the average classification score of the samples corresponding to the nodes in the community is greater than or equal to the third score threshold, keep the initial class labels of the samples corresponding to all the nodes in the community unchanged to obtain the final class labels of the samples corresponding to all the nodes in the community; and take each sample with its final class label as the second prediction result.
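The update rule above reduces to two per-community statistics and one comparison. The sketch below is a minimal illustration under assumed names: `update_labels`, the label encoding, and all example thresholds and scores are hypothetical.

```python
from typing import Dict, Iterable, Set

POSITIVE = 1  # hypothetical encoding of the first class label
NEGATIVE = 0  # hypothetical encoding of the second class label


def update_labels(
    communities: Iterable[Set[str]],
    labels: Dict[str, int],
    scores: Dict[str, float],
    percent_threshold: float,
    third_threshold: float,
) -> Dict[str, int]:
    """Relabel a whole community as positive when both conditions hold.

    Condition 1: share of positive-labelled nodes >= percent_threshold.
    Condition 2: mean classification score < third_threshold.
    Otherwise the community's initial labels are left unchanged.
    """
    final = dict(labels)
    for community in communities:
        size = len(community)
        positive_share = sum(labels[node] == POSITIVE for node in community) / size
        mean_score = sum(scores[node] for node in community) / size
        if positive_share >= percent_threshold and mean_score < third_threshold:
            for node in community:
                final[node] = POSITIVE
    return final


communities = [{"a", "b", "c"}, {"d", "e"}]
labels = {"a": POSITIVE, "b": POSITIVE, "c": NEGATIVE, "d": POSITIVE, "e": NEGATIVE}
scores = {"a": 420.0, "b": 450.0, "c": 540.0, "d": 480.0, "e": 620.0}
second_prediction = update_labels(
    communities, labels, scores, percent_threshold=0.5, third_threshold=520.0
)
```

The first community has two of three nodes labelled positive and a mean score of 470, so node "c" is relabelled positive. The second community meets the percentage condition but has a mean score of 550, so its labels stay as they were.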
The target scene can be a credit judgment scene or an insurance agent reputation judgment scene.
The classification prediction apparatus provided in the embodiment of the present invention has the same implementation principle and technical effect as the foregoing classification prediction method embodiment; for brevity, where the apparatus embodiment omits a detail, reference may be made to the corresponding content in the method embodiment.
An embodiment of the present invention further provides an electronic device, as shown in fig. 8, which is a schematic structural diagram of the electronic device, where the electronic device 100 includes a processor 81 and a memory 80, the memory 80 stores computer-executable instructions capable of being executed by the processor 81, and the processor 81 executes the computer-executable instructions to implement the classification prediction method.
In the embodiment shown in fig. 8, the electronic device further comprises a bus 82 and a communication interface 83, wherein the processor 81, the communication interface 83 and the memory 80 are connected by the bus 82.
The memory 80 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 83 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, etc. may be used. The bus 82 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 82 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but this does not indicate that there is only one bus or one type of bus.
The processor 81 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 81 or by instructions in the form of software. The processor 81 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, or registers. The storage medium is located in a memory, and the processor 81 reads information in the memory and completes the steps of the classification prediction method of the foregoing embodiment in combination with its hardware.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to implement the classification prediction method, and specific implementation may refer to the foregoing method embodiment, which is not described herein again.
The computer program product of the classification prediction method, the classification prediction device, and the electronic device provided in the embodiments of the present invention includes a computer-readable storage medium storing program code, where instructions included in the program code may be used to execute the classification prediction method described in the foregoing method embodiments; for specific implementations, refer to the method embodiments, which are not described herein again.
Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A classification prediction method, the method comprising:
predicting an initial sample set of a target scene by using a binary model corresponding to the target scene to obtain a first prediction result; the target scene is a predetermined scene to be classified and predicted, and the first prediction result comprises the classification probability and/or the classification score of each sample in the initial sample set;
constructing an initial incidence relation network containing a plurality of nodes based on the attribute incidence relation among the samples in the initial sample set; each node in the initial incidence relation network corresponds to each sample in the initial sample set in a one-to-one mode; representing the incidence relation among the samples in the initial sample set by using connecting lines among the nodes in the initial incidence relation network, wherein each connecting line is provided with a weight for representing the incidence size among the samples;
setting an initial category label for a sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network;
based on the first prediction result and the weight of each connecting line in the first incidence relation network, carrying out community division on the first incidence relation network to obtain a first community set; wherein each community in the first community set comprises at least two nodes;
and updating the initial category labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes contained in the community for each community in the first community set to obtain a second prediction result.
2. The method according to claim 1, wherein the step of setting an initial category label for the sample corresponding to each node in the initial association relationship network based on the first prediction result to obtain a first association relationship network comprises:
determining a classification score for each sample in the initial set of samples based on the first prediction result;
determining a seed node in the initial incidence relation network based on the classification score of each sample in the initial sample set; wherein the classification score of the sample corresponding to the seed node is smaller than a first score threshold value;
setting the initial class label of the sample corresponding to the seed node as a first class label representing a positive sample, and setting the initial class labels of the samples corresponding to other nodes except the seed node in the initial incidence relation network as second class labels representing a negative sample to obtain the first incidence relation network.
3. The method of claim 2, wherein the step of determining a classification score for each sample in the initial set of samples based on the first prediction result comprises:
directly obtaining the classification probability of each sample in the initial sample set from the first prediction result; determining a classification score for each sample in the initial sample set based on the classification probability for each sample in the initial sample set;
or, directly obtaining the classification score of each sample in the initial sample set from the first prediction result.
4. The method of claim 3, wherein the step of determining the classification score for each sample in the initial set of samples based on the classification probability for each sample in the initial set of samples comprises:
calculating a classification score for each sample in the initial sample set based on the classification probability for each sample in the initial sample set in the first prediction result according to the following formula:
[Formula image not reproduced in this text: Figure FDA0003757417470000021]
wherein basepoint is the benchmark score, Pdo is the step length, and prob is the classification probability of the sample.
5. The method according to claim 1, wherein the step of constructing an initial incidence relation network including a plurality of nodes based on the attribute incidence relation among the samples in the initial sample set comprises:
if the attribute incidence relation exists between the two samples in the initial sample set, establishing a connection line between nodes corresponding to the two samples;
and setting corresponding weight for each connecting line according to the association times and/or the relationship compactness among the samples.
6. The method according to claim 2, wherein the step of performing community division on the first incidence relation network based on the first prediction result and the weight of each connection line in the first incidence relation network to obtain a first community set comprises:
based on the classification score of each sample in the initial sample set, pruning nodes from the first incidence relation network to obtain a second incidence relation network; the classification score of the sample corresponding to each node in the second incidence relation network is smaller than or equal to a second score threshold value; the second score threshold is greater than the first score threshold;
according to the weight of each connecting line in the second incidence relation network, carrying out community division on the second incidence relation network by adopting a community discovery algorithm to obtain an initial community set;
and eliminating communities which only contain one node in the initial community set to obtain the first community set.
7. The method according to claim 2, wherein for each community in the first community set, the step of updating the initial category labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes included in the community to obtain the second prediction result includes:
and for each community in the first community set, updating the initial category labels of the samples corresponding to the nodes in the community according to the number of the nodes corresponding to the samples with the first category labels in the community and the classification scores of the samples corresponding to all the nodes contained in the community, and obtaining a second prediction result.
8. The method according to claim 7, wherein for each community in the first community set, the step of updating the initial category labels of the samples corresponding to the nodes in the community according to the number of the nodes corresponding to the samples with the first category labels in the community and the classification scores of the samples corresponding to all the nodes included in the community to obtain the second prediction result includes:
for each community in the first community set, calculating the percentage of nodes in the community that correspond to samples with the first class label, according to the number of such nodes in the community;
for each community in the first community set, calculating the average classification score of the samples corresponding to the nodes in the community according to the classification scores of the samples corresponding to all the nodes contained in the community;
for each community in the first community set, if the percentage of nodes corresponding to samples with the first class label in the community is greater than or equal to a preset percentage threshold, and the average classification score of the samples corresponding to the nodes in the community is smaller than a third score threshold, replacing the initial class labels of the samples corresponding to all the nodes in the community with the first class label to obtain final class labels of the samples corresponding to all the nodes in the community; wherein the third score threshold is greater than or equal to the first score threshold;
for each community in the first community set, if the percentage of nodes corresponding to samples with the first class label in the community is smaller than the preset percentage threshold, or the average classification score of the samples corresponding to the nodes in the community is greater than or equal to the third score threshold, maintaining the initial class labels of the samples corresponding to all the nodes in the community unchanged, and obtaining the final class labels of the samples corresponding to all the nodes in the community;
and taking each sample with the final class label as the second prediction result.
9. The method of claim 1, wherein the target scenario is a credit evaluation scenario or an insurance agent reputation evaluation scenario.
10. A classification prediction apparatus, the apparatus comprising:
the prediction module is used for predicting the initial sample set of the target scene by using the binary classification model corresponding to the target scene to obtain a first prediction result; the target scene is a predetermined scene to be classified and predicted, and the first prediction result comprises the classification probability and/or the classification score of each sample in the initial sample set;
the incidence relation network building module is used for building an initial incidence relation network containing a plurality of nodes based on the attribute incidence relation among the samples in the initial sample set; each node in the initial incidence relation network corresponds to each sample in the initial sample set one by one; representing the incidence relation among the samples in the initial sample set by using connecting lines among the nodes in the initial incidence relation network, wherein each connecting line is provided with a weight for representing the incidence size among the samples;
a label setting module, configured to set an initial category label for a sample corresponding to each node in the initial association relationship network based on the first prediction result, so as to obtain a first association relationship network;
the community division module is used for carrying out community division on the first incidence relation network based on the first prediction result and the weight of each connecting line in the first incidence relation network to obtain a first community set; wherein each community in the first community set comprises at least two nodes;
and the label updating module is used for updating the initial category labels of the samples corresponding to the nodes in the community according to the first prediction results of all the nodes contained in the community for each community in the first community set to obtain a second prediction result.
11. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 9.
12. A computer-readable storage medium having computer-executable instructions stored thereon which, when invoked and executed by a processor, cause the processor to implement the method of any of claims 1 to 9.
CN202210873058.1A 2022-07-21 2022-07-21 Classification prediction method and device and electronic equipment Pending CN115099366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210873058.1A CN115099366A (en) 2022-07-21 2022-07-21 Classification prediction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210873058.1A CN115099366A (en) 2022-07-21 2022-07-21 Classification prediction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115099366A true CN115099366A (en) 2022-09-23

Family

ID=83298821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210873058.1A Pending CN115099366A (en) 2022-07-21 2022-07-21 Classification prediction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115099366A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116781546A (en) * 2023-06-26 2023-09-19 中国信息通信研究院 Anomaly detection method and system based on depth synthesis data
CN116781546B (en) * 2023-06-26 2024-02-13 中国信息通信研究院 Anomaly detection method and system based on depth synthesis data

Similar Documents

Publication Publication Date Title
WO2021012783A1 (en) Insurance policy underwriting model training method employing big data, and underwriting risk assessment method
CN107025596B (en) Risk assessment method and system
WO2020037942A1 (en) Risk prediction processing method and apparatus, computer device and medium
WO2020143233A1 (en) Method and device for building scorecard model, computer apparatus and storage medium
CN114930318B (en) Classifying data using aggregated information from multiple classification modules
WO2017215370A1 (en) Method and apparatus for constructing decision model, computer device and storage device
CN108520041B (en) Industry classification method and system of text, computer equipment and storage medium
WO2020224106A1 (en) Text classification method and system based on neural network, and computer device
CN108550065B (en) Comment data processing method, device and equipment
CN112328909B (en) Information recommendation method and device, computer equipment and medium
CN112231592A (en) Network community discovery method, device, equipment and storage medium based on graph
WO2019223104A1 (en) Method and apparatus for determining event influencing factors, terminal device, and readable storage medium
US20220414523A1 (en) Information Matching Using Automatically Generated Matching Algorithms
CN113807940A (en) Information processing and fraud identification method, device, equipment and storage medium
CN113689285A (en) Method, device, equipment and storage medium for detecting user characteristics
CN115099366A (en) Classification prediction method and device and electronic equipment
US20200175029A1 (en) Method and system for optimizing validations carried out for input data at a data warehouse
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN113836244B (en) Sample acquisition method, model training method, relation prediction method and device
CN115758271A (en) Data processing method, data processing device, computer equipment and storage medium
US20220277315A1 (en) Method, apparatus and computer program product for predictive graph-based network analysis
CN114170000A (en) Credit card user risk category identification method, device, computer equipment and medium
US20240184996A1 (en) Method and system for generating contextual explanation for model predictions
CN115660722B (en) Prediction method and device for silver life customer conversion and electronic equipment
US20240152696A1 (en) Building and using target-based sentiment models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination