CN110688536A - Label prediction method, device, equipment and storage medium - Google Patents

Label prediction method, device, equipment and storage medium

Info

Publication number
CN110688536A
Authority
CN
China
Prior art keywords
label
sample
data
characteristic data
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910910439.0A
Other languages
Chinese (zh)
Inventor
陈桂花
袁进威
林乐凝
伏峰
陈东伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp and CCB Finetech Co Ltd
Priority to CN201910910439.0A
Publication of CN110688536A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a label prediction method, device, equipment and storage medium. The label prediction method comprises: determining basic characteristic data of an object to be predicted; and predicting the credibility of the object to be predicted from the basic characteristic data using a preset prediction model, wherein the preset prediction model is trained on sample data and neighbor characteristic data, and the neighbor characteristic data is determined from a relation graph among the objects in the sample data. The embodiment of the invention trains the prediction model based on the relation graph among the objects in the sample data, and uses it to predict the credibility of the object to be predicted from that object's basic characteristic data. Because the relations among the objects in the sample data are taken into account during training, the model can account for the influence of relations between different objects on credibility, which improves the accuracy of credibility prediction for the object to be predicted.

Description

Label prediction method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of information processing, in particular to a label prediction method, a label prediction device, label prediction equipment and a storage medium.
Background
As financial services for enterprises expand, improving anti-fraud systems and reducing the risk of credit fraud have become urgent tasks.
Most existing fraud identification methods are based on traditional blacklists, expert rules or supervised machine learning models. A traditional blacklist lists enterprises that have been confirmed as fraudulent through manual review; expert rules are rules drawn up by experts, and an enterprise whose information violates them is deemed fraudulent; a supervised machine learning model is a prediction model trained on the data characteristics of enterprises.
However, a traditional blacklist covers very few enterprises, and enterprises that have not been manually reviewed cannot be judged. Expert rules are too limited and can introduce large human error. A supervised machine learning model is trained on the fraud labels of existing enterprises, but unlike labeling in other fields, obtaining fraud labels is very costly, so fraud sample data is scarce and the imbalance between positive and negative samples makes the prediction results inaccurate.
Disclosure of Invention
The embodiment of the invention provides a label prediction method, a label prediction device, label prediction equipment and a storage medium, which are used for improving the accuracy of fraud label prediction.
In a first aspect, an embodiment of the present invention provides a label prediction method, including:
determining basic characteristic data of an object to be predicted;
predicting the credibility of the object to be predicted according to the basic characteristic data by adopting a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
In a second aspect, an embodiment of the present invention further provides a label prediction apparatus, including:
the basic characteristic data determining module is used for determining basic characteristic data of an object to be predicted;
the reliability prediction module is used for predicting the reliability of the object to be predicted according to the basic characteristic data by adopting a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
In a third aspect, an embodiment of the present invention further provides a computer device, including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the label prediction method according to any embodiment of the present invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the label prediction method according to any embodiment of the present invention.
The embodiment of the invention trains a prediction model based on the relation graph among the objects in the sample data, and uses the prediction model to predict the credibility of the object to be predicted from its basic characteristic data. Because the neighbor characteristic data of the objects in the sample data is taken into account during training, the model can account for the influence of relations between different objects on credibility, which improves the accuracy of credibility prediction for the object to be predicted. In addition, the predictions that the model trained with neighbor characteristic data makes for the objects to be predicted can be used to mitigate the imbalance between positive and negative samples.
Drawings
FIG. 1 is a flowchart of a label prediction method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of an association relationship triple in the relation graph of the present invention;
FIG. 3 is a flowchart of a label prediction method according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a label prediction apparatus according to a third embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
FIG. 1 is a flowchart of a label prediction method in the first embodiment of the present invention. This embodiment is applicable to predicting the credibility of an object to be predicted in a credibility determination scenario. The method may be performed by a label prediction apparatus, which may be implemented in software and/or hardware and may be configured in a computer device, such as a background server or another device with communication and computing capabilities. As shown in FIG. 1, the method specifically includes:
Step 101, determining basic characteristic data of an object to be predicted.
The object to be predicted is an object whose credibility needs to be predicted. Credibility refers to the degree to which the object can be trusted, or its integrity, and can be represented by a label. Optionally, the object to be predicted is an enterprise or an individual, and the credibility prediction for the enterprise or individual can serve as a reference for bank transactions. The basic characteristic data is feature data related to credibility that is extracted from the information of the object, and serves as a basis for judging the credibility of the object to be predicted. Optionally, it includes basic information of the enterprise or individual, such as enterprise industry, enterprise registered funds and enterprise registered address.
Illustratively, when the object to be predicted is a small or micro enterprise, features are extracted from its basic information, including enterprise industry, enterprise registered funds and enterprise registered address. These can be represented by preset characteristic values, such as 01 for the computer industry and 02 for the food industry, to obtain the basic characteristic data of the small or micro enterprise.
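Illustratively, the encoding of basic information into preset characteristic values can be sketched in Python as follows. This is a minimal sketch under assumptions: only the codes 01 and 02 come from the description, while the field names, the function name `encode_basic_features` and the fallback code `00` are hypothetical.

```python
# Code table: only "01" (computer) and "02" (food) appear in the description.
INDUSTRY_CODES = {"computer": "01", "food": "02"}
UNKNOWN_CODE = "00"  # ASSUMPTION: fallback for unlisted industries


def encode_basic_features(enterprise: dict) -> dict:
    """Map raw enterprise information to basic characteristic data."""
    return {
        # Preset characteristic value for the enterprise industry
        "industry_code": INDUSTRY_CODES.get(enterprise.get("industry"), UNKNOWN_CODE),
        # Registered funds kept as a numeric feature (hypothetical field name)
        "registered_funds": float(enterprise.get("registered_funds", 0.0)),
        # Registered address kept as raw text (hypothetical field name)
        "registered_address": enterprise.get("registered_address", ""),
    }


features = encode_basic_features(
    {"industry": "food", "registered_funds": 500000, "registered_address": "Guangzhou"}
)
```

In practice the address would also need a numeric or categorical encoding before model training; the description does not specify one, so it is left as text here.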
Step 102, predicting the credibility of an object to be predicted according to basic characteristic data by adopting a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
The preset prediction model is a model trained with a supervised machine learning algorithm and used to predict the credibility of the object to be predicted. The sample data comprises known label samples, unknown label samples and the basic characteristic data of the objects in the samples. The known label samples are the sample objects that have been marked with credibility labels, and the unknown label samples are the sample objects that have not. The credibility labels of the known label samples are obtained through official channels, such as banks and public security authorities, and are therefore genuine and reliable. The neighbor characteristic data of an object is calculated from the known-label sample data of the objects associated with it, and represents the influence of the surrounding known-label samples on that object.
The relation graph is constructed according to the relations between objects in the sample data. Optionally, the relation graph is constructed according to the association relationship between any two objects in the unknown label samples and the known label samples. Illustratively, when the objects used to construct the relation graph are enterprises, the data information of all enterprises in the graph is first acquired, and enterprise association relationship triples are constructed from this data. A triple has three parts (enterprise, enterprise relationship, enterprise), where the enterprise relationship may be a connection between two enterprises through a person, a telephone number, a website, a registered address, a device name, a device ID and the like.
For example, if a person is the actual controller of enterprise A and is also the legal representative of enterprise B, an association relationship between enterprise A and enterprise B is established through that person. Similarly, two enterprises may be connected through a shared telephone number. In addition, all association relationships are divided into types according to business requirements, and a weight value is defined for each type to express the importance of that type of association relationship in the connections between enterprises. A schematic diagram of an association relationship triple in the relation graph is shown in FIG. 2, where C1 and C2 represent two objects, such as enterprise A and enterprise B; P represents the element that connects the two objects, such as an individual; and r1 and r2 represent the specific relationships between the connecting element and each object, such as a person being the actual controller of enterprise A and at the same time the legal representative of enterprise B. For example, if C1 represents enterprise A, C2 represents enterprise B, P represents Zhang San, r1 represents shareholder and r2 represents senior manager, the relation graph can be read as follows: Zhang San is a shareholder of enterprise A and also a senior manager of enterprise B, so enterprise A and enterprise B are connected through Zhang San.
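Illustratively, the construction of a weighted relation graph from the C1, r1, P, r2, C2 structure of FIG. 2 can be sketched in Python as follows. This is a minimal sketch under assumptions: the weight values in `RELATION_WEIGHTS`, the record layout, the telephone number and the rule of keeping the largest weight when two enterprises are connected in several ways are all hypothetical, since the description only states that a weight is defined per relationship type.

```python
from collections import defaultdict
from itertools import combinations

# ASSUMPTION: illustrative weights per connecting-element type.
RELATION_WEIGHTS = {"person": 1.0, "phone": 0.5, "address": 0.25}


def build_relation_graph(records):
    """records: (enterprise, relation, element_type, element) tuples.

    Enterprises that share the same connecting element (P in FIG. 2)
    are linked by an edge weighted by the element type.
    """
    by_element = defaultdict(set)
    for enterprise, _relation, element_type, element in records:
        by_element[(element_type, element)].add(enterprise)

    graph = defaultdict(dict)
    for (element_type, _element), enterprises in by_element.items():
        weight = RELATION_WEIGHTS[element_type]
        for c1, c2 in combinations(sorted(enterprises), 2):
            # ASSUMPTION: keep the strongest link between two enterprises
            best = max(weight, graph[c1].get(c2, 0.0))
            graph[c1][c2] = best
            graph[c2][c1] = best
    return graph


records = [
    ("Enterprise A", "shareholder", "person", "Zhang San"),
    ("Enterprise B", "senior manager", "person", "Zhang San"),
    ("Enterprise A", "contact", "phone", "000-0000"),  # placeholder number
    ("Enterprise C", "contact", "phone", "000-0000"),
]
graph = build_relation_graph(records)
```

Here enterprise A and enterprise B are directly associated through Zhang San, while enterprise B and enterprise C share no element and therefore have no direct edge.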
Illustratively, an enterprise label prediction model is first trained on the basic characteristic data of enterprises with known labels. This model predicts labels for the enterprises with unknown labels, and the enterprises whose predictions are accurate are selected. Neighbor characteristic data of the selected enterprises is obtained from the enterprise relation graph, and a new enterprise label prediction model is trained on this neighbor characteristic data and the predicted labels of the selected enterprises. Based on the new model's predictions for the enterprises with unknown labels, the enterprises whose predictions are accurate are again selected and combined with the known-label enterprises to form a new known label sample. A model for predicting enterprise labels is then trained on the basic characteristic data of this new sample, and this prediction model is applied to the basic characteristic data of an enterprise with an unknown label to obtain its predicted label.
The embodiment of the invention trains a prediction model based on the relation graph among the objects in the sample data, and uses the prediction model to predict the credibility of the object to be predicted from its basic characteristic data. Because the neighbor characteristic data of the objects in the sample data is taken into account during training, the model can account for the influence of relations between different objects on credibility, which improves the accuracy of credibility prediction for the object to be predicted. In addition, the predictions that the model trained with neighbor characteristic data makes for the objects to be predicted can be used to mitigate the imbalance between positive and negative samples.
Example two
FIG. 3 is a flowchart of a label prediction method in the second embodiment of the present invention. The second embodiment further optimizes the first embodiment so that the model can be trained using the neighbor features of the sample data. As shown in FIG. 3, the method includes:
Step 301, training an initial model according to the basic feature data and credibility labels of known label samples.
A known label sample is a sample whose credibility label is definitely known. The credibility label is an identifier that represents credibility and is used to classify such samples. Optionally, the credibility label may be a text label such as "credible" or "not credible", or a numerical label such as 0 or 1.
Optionally, the training according to the basic feature data and the reliability label of the known label sample to obtain the initial model includes:
performing stability screening based on characteristic data extracted from known label samples;
performing characteristic derivation on the screened characteristic data to obtain basic characteristic data;
and carrying out supervised learning according to the basic characteristic data and the credibility label of the known label sample, and training to obtain an initial model.
The extracted feature data is the original feature data related to credibility obtained by performing feature extraction on all objects in the known label sample, such as enterprise industry, enterprise registered funds and enterprise registered address. Stability screening means screening the extracted original feature data and removing feature data with low stability, that is, feature data whose influence on credibility is unstable. For example, the PSI (population stability index) may be used to monitor the stability of the feature data, and feature data with a PSI value greater than 0.25 is removed. The PSI is calculated as $\mathrm{PSI} = \sum_{k} (A_k - E_k) \ln(A_k / E_k)$, where the sum runs over the groups, the expected ratio $E_k$ is the proportion of group $k$ in the test sample set results after those results are grouped, and the actual ratio $A_k$ is the proportion of group $k$ in the new sample set results after those results are grouped using the upper and lower boundary values of each group from the expected ratio. PSI indicates whether the proportion distribution across score bands changes between different samples, or between samples taken at different times.
Illustratively, after a prediction model is obtained, all enterprise objects in the test sample set are predicted to obtain a predicted label value for each enterprise object, the predicted label values are sorted in ascending order, and the enterprise objects are grouped according to the sorted result. Because the prediction result on the test sample set is relatively accurate, the proportion of each group after grouping can be used as the expected ratio. For example, all enterprises in the test sample set are divided into ten equal bands over the range of predicted label values, and the maximum and minimum predicted label values of each group are recorded; the first group, for instance, has a maximum of 0.1 and a minimum of 0. The prediction model is then applied to the enterprise objects in the new sample set to obtain a predicted label value for each, and these enterprise objects are grouped using the recorded maximum and minimum values of each group, so that, for example, the enterprise objects with predicted label values between 0 and 0.1 form the first group. The proportion of each group in the new sample set results is the actual ratio. If the model is stable, the distribution of predicted label values on the new sample set should be consistent with that on the test set; otherwise, the model has shifted.
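Illustratively, the PSI calculation over ten equal-width groups can be sketched in Python as follows. This is a minimal sketch under assumptions: the scores are taken to lie in the range 0 to 1, and the small constant `eps`, used to avoid taking the logarithm of zero for empty groups, is not part of the description.

```python
import math


def psi(expected_scores, actual_scores, n_bins=10, eps=1e-6):
    """Population stability index between two sets of scores in [0, 1].

    expected_scores: predicted label values on the test sample set.
    actual_scores:   predicted label values on the new sample set.
    """

    def proportions(scores):
        counts = [0] * n_bins
        for s in scores:
            # Equal-width groups on [0, 1]; a score of 1.0 goes to the last group
            idx = min(int(s * n_bins), n_bins - 1)
            counts[idx] += 1
        total = len(scores)
        # ASSUMPTION: floor empty groups at eps so the logarithm is defined
        return [max(c / total, eps) for c in counts]

    expected = proportions(expected_scores)
    actual = proportions(actual_scores)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


test_set_scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
new_set_scores = [0.05] * 9 + [0.95]

stable = psi(test_set_scores, test_set_scores)
shifted = psi(test_set_scores, new_set_scores)
```

With identical distributions the index is zero; when the new sample set concentrates in the first group, the index exceeds the 0.25 removal threshold mentioned above.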
The feature derivation means that the screened feature data is used for feature learning to obtain new features. The characteristic derivation mode comprises the modes of combining different characteristics or calculating median, average and the like of characteristic data. For example, the feature data may be subjected to feature aggregation derivation by using a feature derivation tool, so as to obtain derived feature data.
Optionally, further feature screening is performed on the derived feature data to obtain final screened basic feature data. Further feature screening may include deleting features with data missing significantly, features with data values close to the same, linearly related features, and the like. Through further feature screening, the features with small influence on reliability prediction can be removed, repeated feature data can be removed, the number of the feature data is reduced, and the efficiency and accuracy of subsequent model training are improved.
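Illustratively, the further feature screening can be sketched in Python as follows. This is a minimal sketch under assumptions: the three thresholds (`max_missing`, `max_same`, `max_corr`) and the use of the Pearson correlation coefficient to detect linearly related features are illustrative choices that the description does not specify.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def screen_features(columns, max_missing=0.5, max_same=0.95, max_corr=0.95):
    """columns: dict of feature name -> list of values (None means missing).

    Returns the names of the features that survive screening.
    All thresholds are ASSUMPTIONS for illustration.
    """
    kept = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present or 1 - len(present) / len(values) > max_missing:
            continue  # data largely missing
        if Counter(present).most_common(1)[0][1] / len(present) > max_same:
            continue  # values nearly all the same
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # linearly related to a feature already kept
        kept.append(name)
    return kept


columns = {
    "funds": [1.0, 2.0, 3.0, 4.0],
    "funds_doubled": [2.0, 4.0, 6.0, 8.0],
    "mostly_missing": [None, None, None, 1.0],
    "constant": [7.0, 7.0, 7.0, 7.0],
    "staff": [3.0, 1.0, 4.0, 1.0],
}
kept = screen_features(columns)
```

Of a pair of linearly related features, this sketch keeps whichever appears first; a different tie-breaking rule could equally be used.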
Illustratively, after the original feature data of the enterprise samples with known labels is obtained, stability screening, feature derivation and feature screening are performed on it in turn to obtain the final basic feature data for training. Supervised learning is then performed in combination with the credibility labels of the enterprise samples, for example using a machine learning algorithm such as XGBoost, logistic regression or LightGBM, to train an initial enterprise label prediction model.
Step 302, training an intermediate model based on the candidate labels predicted by the initial model for the unknown label samples, together with the basic feature data and neighbor feature data of the unknown label samples.
The candidate label is a credibility label predicted by an initial model on an unknown label sample. Optionally, the candidate tag may be a predicted tag value, and the value range is between 0 and 1, or the candidate tag may be two types of tags divided according to the predicted tag value, and the value is 0 or 1.
Optionally, training to obtain an intermediate model based on the candidate label predicted by the initial model on the unknown label sample, and the basic feature data and the neighbor feature data of the unknown label sample, includes:
performing label prediction on the unknown label sample by using the initial model to obtain a candidate label of the unknown label sample;
determining neighbor characteristic data of an unknown label sample according to a relation graph among objects in sample data;
and training to obtain an intermediate prediction model according to the basic characteristic data, the candidate label and the neighbor characteristic data of the unknown label sample.
The candidate label is the credibility prediction result obtained by applying the prediction model to an unknown label sample. Illustratively, the initial model performs label prediction on the enterprise samples with unknown labels to obtain their candidate labels, and some of these enterprise samples are selected according to the prediction result, for example those whose predicted label values are close to 0 or 1. Neighbor feature data of the selected enterprise samples is calculated from the enterprise relation graph, for instance from the known-label enterprises directly connected to each selected enterprise in the graph. The basic feature data, candidate labels and neighbor feature data of the selected enterprise samples are then added as new features to the training features of the initial model for a new round of training, yielding an intermediate prediction model.
Optionally, determining neighbor feature data of the unknown label sample according to the relationship graph between the unknown objects in the unknown label sample, including:
determining the confidence degree of the candidate label according to the candidate label of the unknown label sample;
sequencing unknown label samples according to the confidence degrees of the candidate labels, and selecting a target object with a confidence degree value reaching a preset threshold value in a sequencing result;
determining candidate objects having direct association relation with the target object according to the relation map from the known label sample;
and determining neighbor characteristic data of the target object according to the relation weight between the target object and the candidate object in the relation map.
The confidence refers to how reliable a predicted candidate label is, and can be calculated from the candidate label that the model predicts for an unknown label sample. Optionally, after the candidate label of an enterprise is predicted, the confidence expresses how far the prediction result can be trusted. For example, when the candidate label is the label value predicted by the model, its value ranges from 0 to 1, where 0 indicates low enterprise credibility and 1 indicates high enterprise credibility, and a predicted value closer to either endpoint is more reliable. Illustratively, the confidence is calculated as follows:
[Formula image BDA0002214535260000101; not reproduced in the extracted text]
where $y_i$ denotes the predicted candidate label value of the i-th enterprise.
Sorting means ordering the unknown label samples by confidence value. The preset threshold is a value, set in advance with reference to the confidence ranking and obtainable from experience, above which the confidence is considered high. The target object is an unknown label sample selected according to the preset threshold. A candidate object is an object directly connected to the target object in the relation graph, that is, the two objects are connected through only one association relationship; such objects can be read directly from the relation graph.
Illustratively, the enterprise samples with unknown labels are predicted to obtain candidate label values, and the corresponding confidence values are calculated. The samples are ranked by confidence, and those with confidence greater than 0.7 are selected. The candidate labels of the selected samples are treated as accurately predicted, that is, as known labels, and are added to the next stage of model training, which increases the training sample size and improves the accuracy of model prediction. After the enterprise samples are selected, the known-label enterprise samples directly connected to them in the relation graph are obtained, and the neighbor feature data of each selected enterprise sample is calculated from the weights of the association relationships between them. The basic feature data and candidate labels of the newly added enterprise samples are combined with the calculated neighbor feature data, and a supervised machine learning algorithm is used to train an intermediate prediction model that incorporates neighbor features. The machine learning algorithm used for the intermediate prediction model is the same as that used for the initial model.
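Illustratively, the ranking and threshold selection can be sketched in Python as follows. This is a minimal sketch under a loudly stated assumption: the confidence formula of the original filing is not available in this text, so the stand-in `abs(2 * y - 1)` is used only because it grows as the predicted value approaches 0 or 1, as the description requires. Only the threshold 0.7 comes from the description; the function names are hypothetical.

```python
def confidence(y):
    # ASSUMPTION: stand-in formula, NOT the formula of the original filing.
    # It is 0 at y = 0.5 and 1 at y = 0 or y = 1.
    return abs(2.0 * y - 1.0)


def select_confident(candidates, threshold=0.7):
    """candidates: dict of enterprise -> predicted label value in [0, 1].

    Returns (enterprise, label) pairs ranked by confidence, highest first,
    keeping only those whose confidence exceeds the threshold.
    """
    ranked = sorted(candidates.items(), key=lambda kv: confidence(kv[1]), reverse=True)
    return [
        (name, 1 if y >= 0.5 else 0)  # candidate label treated as a known label
        for name, y in ranked
        if confidence(y) > threshold
    ]


candidates = {"E1": 0.95, "E2": 0.55, "E3": 0.02, "E4": 0.80}
selected = select_confident(candidates)
```

Enterprises E2 and E4 fall below the threshold under this stand-in formula and remain unknown label samples.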
The neighbor feature data represents the influence of the labels of the known label samples on the sample whose neighbor feature data is being computed. Illustratively, the neighbor feature data may be expressed as $F_i = \{f_{1i}, f_{2i}\}$, where $f_{1i}$ is the sum of the relationship weights of the known-label positive samples (samples whose known label indicates high credibility) that are directly associated with sample object $i$, and $f_{2i}$ is the corresponding sum over the known-label negative samples (samples whose known label indicates low credibility). The relationship weights of the direct associations are obtained from the relation graph. For example, suppose that in the relation graph enterprises 1 to 5 are directly associated with enterprise A with relationship weights $w_1$, $w_2$, $w_3$, $w_4$ and $w_5$ respectively, that enterprises 1 and 2 are known positive samples, that enterprises 3 and 4 are known negative samples, and that the label of enterprise 5 is unknown. Then $f_{1i} = w_1 + w_2$ and $f_{2i} = w_3 + w_4$.
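Illustratively, the calculation of the two neighbor features for the enterprise A example can be sketched in Python as follows. This is a minimal sketch under assumptions: the graph is represented as a nested dictionary, positive and negative known labels are encoded as 1 and 0, and the numeric weights are made up for illustration.

```python
def neighbor_features(target, graph, known_labels):
    """Return (f1, f2) for the target object.

    graph:        dict of object -> dict of directly associated object -> weight
    known_labels: dict of object -> 1 (positive, high credibility)
                  or 0 (negative, low credibility)
    """
    f1 = 0.0  # sum of weights to known positive neighbours
    f2 = 0.0  # sum of weights to known negative neighbours
    for neighbor, weight in graph.get(target, {}).items():
        label = known_labels.get(neighbor)
        if label == 1:
            f1 += weight
        elif label == 0:
            f2 += weight
        # neighbours with unknown labels contribute to neither sum
    return f1, f2


# ASSUMPTION: illustrative weights w1..w5 for enterprises 1..5
graph = {"A": {"E1": 0.5, "E2": 0.25, "E3": 1.0, "E4": 0.125, "E5": 2.0}}
known_labels = {"E1": 1, "E2": 1, "E3": 0, "E4": 0}  # E5 is unknown

f1, f2 = neighbor_features("A", graph, known_labels)
```

The result matches the worked example: the first feature is the sum of the weights to enterprises 1 and 2, and the second is the sum of the weights to enterprises 3 and 4.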
Optionally, training to obtain an intermediate model based on the candidate label predicted by the initial model on the unknown label sample, and the basic feature data and the neighbor feature data of the unknown label sample, includes:
predicting unknown label samples by adopting an intermediate model obtained in the previous training so as to update candidate labels;
updating neighbor characteristic data of an unknown label sample according to the updated candidate label;
and performing the training of the current round according to the basic characteristic data of the unknown label sample, the updated candidate label and the neighbor characteristic data to obtain an intermediate model of the iterative training of the current round.
Optionally, the intermediate model is obtained through n rounds of iteration. In the h-th round, the samples ranked in the top (h/n) × (total number of samples) by confidence are selected, and the candidate labels of the selected sample objects are treated as known labels. The neighbor feature data of the sample objects selected in that round is calculated, and the basic feature data of the selected samples, their candidate labels, and the calculated neighbor feature data are used as the input of that round of training, which yields the intermediate model of that round. The final intermediate model is obtained after the n rounds of iteration are complete. The number of iterations n may be preset according to the sample size.
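The per-round selection schedule can be sketched as below. The function name `top_fraction` and the confidence values are hypothetical, and rounding the count down to an integer is an assumption, since the text does not say how a fractional count is handled.

```python
from typing import Dict, List


def top_fraction(confidences: Dict[str, float], h: int, n: int) -> List[str]:
    """Return the ids of the top (h/n) * total samples by confidence for round h of n."""
    if not 1 <= h <= n:
        raise ValueError("round index h must satisfy 1 <= h <= n")
    k = (h * len(confidences)) // n  # floor of (h/n) * total; rounding is an assumption
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:k]


# Hypothetical confidences for six unknown-label samples, with n = 3 rounds.
confidences = {"s1": 0.95, "s2": 0.60, "s3": 0.85, "s4": 0.55, "s5": 0.75, "s6": 0.40}
for h in range(1, 4):
    print(h, top_fraction(confidences, h, 3))
# 1 ['s1', 's3']
# 2 ['s1', 's3', 's5', 's2']
# 3 ['s1', 's3', 's5', 's2', 's4', 's6']
```

Under this schedule the selected set grows each round until, in the last round, every unknown-label sample is included.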
Step 303, training to obtain a final prediction model based on the sample labels predicted by the intermediate model for the unknown label samples and the basic feature data of the unknown label samples.
The sample labels are the results of predicting all unknown-label samples with the final intermediate model. Some of the samples with predicted labels are selected as newly added known-label sample objects, combined with the original known-label samples to form a new feature set, and used to train the final prediction model with a machine learning algorithm. Optionally, these samples may be selected according to the distribution of positive and negative samples among the known samples, or according to the confidence ranking. For example, an official body needs a large amount of evidence before judging an enterprise to have low credibility, so few known-label negative samples are available from official channels and the positive and negative samples are unevenly distributed; that is, the known-label positive samples far outnumber the negative samples. Therefore, when known-label sample objects are added, sample objects whose predicted label value is low are selected as newly added known low-credibility samples, in order to balance the numbers of positive and negative samples.
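One way to realize the balancing described above is sketched here. The text only states that predicted low-credibility samples are added to offset the shortage of negatives; ordering them by confidence and adding exactly enough to close the gap are assumptions, as are the function name `pick_negatives_to_balance` and the 1/0 label encoding.

```python
from typing import Dict, List, Tuple


def pick_negatives_to_balance(
    known_labels: Dict[str, int],
    predictions: Dict[str, Tuple[int, float]],
) -> List[str]:
    """Pick predicted-negative samples to offset a surplus of known positives.

    known_labels maps sample id -> 1 (positive) or 0 (negative).
    predictions maps unknown-label sample id -> (predicted_label, confidence).
    """
    n_pos = sum(1 for y in known_labels.values() if y == 1)
    n_neg = sum(1 for y in known_labels.values() if y == 0)
    deficit = max(n_pos - n_neg, 0)
    negatives = [(sid, conf) for sid, (label, conf) in predictions.items() if label == 0]
    negatives.sort(key=lambda item: item[1], reverse=True)  # most confident first
    return [sid for sid, _ in negatives[:deficit]]


# Four known positives and one known negative: three more negatives are needed.
known_labels = {"k1": 1, "k2": 1, "k3": 1, "k4": 1, "k5": 0}
predictions = {
    "u1": (0, 0.90), "u2": (1, 0.95), "u3": (0, 0.60),
    "u4": (0, 0.80), "u5": (0, 0.70),
}
added = pick_negatives_to_balance(known_labels, predictions)
print(added)  # ['u1', 'u4', 'u5']
```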
The unknown-label samples are predicted by the intermediate model, and samples with high prediction accuracy are selected as needed to serve as known-label samples for training. This balances the distribution of positive and negative samples among the known-label samples, improves the accuracy of model training, and reduces the cost of obtaining known-label samples.
Optionally, the newly added known-label sample objects and the original known-label sample objects are combined into the final set of known-label sample objects, which is divided into a training set and a test set in a certain proportion. A model is trained on the training set and evaluated on the test set, and its parameters are adjusted according to the results to obtain the final prediction model. Illustratively, the training set and the test set are divided in an 8:2 ratio.
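The 8:2 division can be sketched with the standard library alone. Shuffling before the split and the fixed seed are assumptions made for reproducibility; the text specifies only the ratio. The function name `split_train_test` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    samples: Sequence[T], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and split it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local RNG; global random state is untouched
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(10))  # stand-ins for the final known-label sample objects
train, test = split_train_test(samples)
print(len(train), len(test))  # 8 2
```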
Step 304, determining basic characteristic data of the object to be predicted.
Step 305, predicting the credibility of the object to be predicted according to the basic characteristic data by adopting a preset prediction model.
According to the embodiment of the invention, the sample labels are automatically added through the intermediate model which is continuously iteratively trained, and the iterative training of the intermediate model is combined with the neighbor characteristic data, so that the accuracy of the added sample labels is improved, the accuracy of model prediction is improved by utilizing the added sample labels for training, and the problem of small sample label data quantity caused by high difficulty and high cost in obtaining the sample labels is solved. And the intermediate model is continuously updated through the neighbor characteristic data, so that the incidence relation between the sample data is fully utilized, the reliability prediction accuracy of the object to be predicted is improved, and the cost of sample labeling is reduced.
Example three
Fig. 4 is a schematic structural diagram of a tag prediction apparatus in a third embodiment of the present invention, which is applicable to a case of predicting the reliability of an object to be predicted. As shown in fig. 4, the apparatus includes:
a basic feature data determining module 410, configured to determine basic feature data of an object to be predicted;
the reliability prediction module 420 is configured to predict the reliability of the object to be predicted according to the basic feature data by using a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
The embodiment of the invention trains a prediction model based on the relationship graph among the objects in the sample data, and uses the prediction model to predict the credibility of the object to be predicted according to the basic characteristic data of that object. Because the relationships among the objects in the sample data are taken into account in the training stage, the model is able to account for the influence of the relationships between different objects on credibility, which improves the accuracy of the credibility prediction for the object to be predicted.
Optionally, the apparatus further includes a prediction model training module, including:
the initial model training unit is used for training to obtain an initial model according to basic feature data and credibility labels of known label samples;
the intermediate model training unit is used for training to obtain an intermediate model based on the candidate labels predicted by the initial model for the unknown label samples, and the basic characteristic data and neighbor characteristic data of the unknown label samples;
and the final prediction model training unit is used for training to obtain the final prediction model based on the sample labels predicted by the intermediate model for the unknown label samples and the basic characteristic data of the unknown label samples.
Optionally, the initial model training unit is specifically configured to:
performing stability screening based on characteristic data extracted from the known tag sample;
performing characteristic derivation on the screened characteristic data to obtain the basic characteristic data;
and carrying out supervised learning according to the basic feature data and the credibility label of the known label sample, and training to obtain the initial model.
Optionally, the intermediate model training unit includes:
the candidate label prediction subunit is used for performing label prediction on the unknown label sample by using the initial model to obtain a candidate label of the unknown label sample;
the neighbor characteristic data determining subunit is used for determining neighbor characteristic data of the unknown label sample according to the relation map between the objects in the sample data;
and the intermediate prediction model training subunit is used for training to obtain an intermediate prediction model according to the basic feature data of the unknown label sample, the candidate label and the neighbor feature data.
Optionally, the neighbor feature data determining subunit is specifically configured to:
determining confidence degrees of the candidate labels according to the candidate labels of the unknown label samples;
sequencing the unknown label samples according to the confidence degrees of the candidate labels, and selecting a target object of which the confidence degree value in the sequencing result reaches a preset threshold value;
determining candidate objects having direct association relation with the target object according to the relation graph from the known label sample;
and determining neighbor characteristic data of the target object according to the relation weight between the target object and the candidate object in the relation map.
Optionally, the intermediate model training unit includes:
the candidate label updating subunit is used for predicting the unknown label sample by adopting an intermediate model obtained in the previous training so as to update the candidate label;
the neighbor characteristic data updating subunit is used for updating the neighbor characteristic data of the unknown label sample according to the updated candidate label;
and the intermediate model iterative training subunit is used for carrying out the current round of training according to the basic characteristic data of the unknown label sample, the updated candidate label and the neighbor characteristic data to obtain an intermediate model of the current round of iterative training.
Optionally, before the neighbor feature data determining subunit, a relationship graph constructing subunit is further included, which is specifically configured to:
and constructing a relation map according to the incidence relation between any two objects in the unknown label sample and the known label sample.
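Constructing the relationship graph from pairwise associations can be sketched as follows. Treating associations as undirected weighted edges stored in nested dictionaries is an assumption, as are the function name `build_relation_graph` and the example weights; the text does not prescribe a data structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_relation_graph(
    edges: Iterable[Tuple[str, str, float]],
) -> Dict[str, Dict[str, float]]:
    """Build an undirected weighted adjacency map from (object_a, object_b, weight) triples."""
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight  # the association is assumed to be symmetric
    return dict(graph)


# Hypothetical associations among known-label and unknown-label objects.
edges = [("A", "E1", 0.5), ("A", "E3", 1.0), ("E1", "E3", 0.25)]
graph = build_relation_graph(edges)
print(graph["A"])  # {'E1': 0.5, 'E3': 1.0}
```

A map of this shape lets the direct neighbors of any object, and the weight of each association, be looked up in a single step when neighbor characteristic data is computed.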
The label prediction device provided by the embodiment of the invention can execute the label prediction method provided by any embodiment of the invention, and has the corresponding functional module and beneficial effect of executing the label prediction method.
Example four
Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 5 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 5, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory device 28, and a bus 18 that couples various system components including the system memory device 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory device bus or memory device controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system storage 28 may include computer system readable media in the form of volatile storage, such as Random Access Memory (RAM)30 and/or cache storage 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Storage 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in storage 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be appreciated that although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system storage device 28, for example, implementing a tag prediction method provided by an embodiment of the present invention, including:
determining basic characteristic data of an object to be predicted;
predicting the credibility of the object to be predicted according to the basic characteristic data by adopting a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
Example five
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements a label prediction method according to an embodiment of the present invention, the method including:
determining basic characteristic data of an object to be predicted;
predicting the credibility of the object to be predicted according to the basic characteristic data by adopting a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory device (RAM), a read-only memory device (ROM), an erasable programmable read-only memory device (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory device (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A label prediction method, comprising:
determining basic characteristic data of an object to be predicted;
predicting the credibility of the object to be predicted according to the basic characteristic data by adopting a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
2. The method of claim 1, wherein training the predictive model comprises:
training to obtain an initial model according to basic feature data and credibility labels of known label samples;
training to obtain an intermediate model based on the candidate labels predicted by the initial model on the unknown label sample, and the basic feature data and the neighbor feature data of the unknown label sample;
and training to obtain a final prediction model based on the sample label predicted by the intermediate model on the unknown label sample and the basic characteristic data of the unknown label sample.
3. The method of claim 2, wherein training the initial model according to the basic feature data and the confidence label of the known label sample comprises:
performing stability screening based on characteristic data extracted from the known tag sample;
performing characteristic derivation on the screened characteristic data to obtain the basic characteristic data;
and carrying out supervised learning according to the basic feature data and the credibility label of the known label sample, and training to obtain the initial model.
4. The method of claim 2, wherein training an intermediate model based on the candidate labels predicted by the initial model for the unknown label sample and the basic feature data and the neighbor feature data of the unknown label sample comprises:
performing label prediction on the unknown label sample by using the initial model to obtain a candidate label of the unknown label sample;
determining neighbor characteristic data of the unknown label sample according to a relation graph among objects in the sample data;
and training to obtain an intermediate prediction model according to the basic feature data of the unknown label sample, the candidate label and the neighbor feature data.
5. The method according to claim 4, wherein the determining the neighbor feature data of the unknown labeled sample according to the relationship graph between the unknown objects in the unknown labeled sample comprises:
determining confidence degrees of the candidate labels according to the candidate labels of the unknown label samples;
sequencing the unknown label samples according to the confidence degrees of the candidate labels, and selecting a target object with a confidence degree value reaching a preset threshold value in a sequencing result;
determining candidate objects having direct association relation with the target object according to the relation graph from the known label sample;
and determining neighbor characteristic data of the target object according to the relation weight between the target object and the candidate object in the relation map.
6. The method of claim 2, wherein training an intermediate model based on the candidate labels predicted by the initial model for the unknown label sample and the basic feature data and the neighbor feature data of the unknown label sample comprises:
predicting the unknown label sample by adopting an intermediate model obtained by the previous training so as to update the candidate label;
updating neighbor characteristic data of the unknown label sample according to the updated candidate label;
and performing the current round of training according to the basic characteristic data of the unknown label sample, the updated candidate label and the neighbor characteristic data to obtain an intermediate model of the current round of iterative training.
7. The method according to claim 4, further comprising, before said determining neighbor feature data of said unknown labeled sample according to a relationship graph between objects in said sample data:
and constructing a relation map according to the incidence relation between any two objects in the unknown label sample and the known label sample.
8. A label prediction apparatus, comprising:
the basic characteristic data determining module is used for determining basic characteristic data of an object to be predicted;
the reliability prediction module is used for predicting the reliability of the object to be predicted according to the basic characteristic data by adopting a preset prediction model; the preset prediction model is obtained by training according to sample data and neighbor characteristic data, and the neighbor characteristic data is determined according to a relation graph among objects in the sample data.
9. A computer device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the label prediction method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the label prediction method according to any one of claims 1 to 7.
CN201910910439.0A 2019-09-25 2019-09-25 Label prediction method, device, equipment and storage medium Pending CN110688536A (en)

Publications (1)

Publication Number Publication Date
CN110688536A true CN110688536A (en) 2020-01-14

Family

ID=69110600

Country Status (1)

Country Link
CN (1) CN110688536A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299811A (en) * 2018-08-20 2019-02-01 众安在线财产保险股份有限公司 A method of the identification of fraud clique and Risk of Communication prediction based on complex network
CN109636061A (en) * 2018-12-25 2019-04-16 深圳市南山区人民医院 Training method, device, equipment and the storage medium of medical insurance Fraud Prediction network
CN109685647A (en) * 2018-12-27 2019-04-26 阳光财产保险股份有限公司 The training method of credit fraud detection method and its model, device and server
CN109978538A (en) * 2017-12-28 2019-07-05 阿里巴巴集团控股有限公司 Determine fraudulent user, training pattern, the method and device for identifying risk of fraud

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222981A (en) * 2020-01-16 2020-06-02 中国建设银行股份有限公司 Credibility determination method, device, equipment and storage medium
CN112328657A (en) * 2020-11-03 2021-02-05 中国平安人寿保险股份有限公司 Feature derivation method, feature derivation device, computer equipment and medium
CN112232444A (en) * 2020-11-23 2021-01-15 中国移动通信集团江苏有限公司 Method, device and equipment for determining geographic position data of object and storage medium
CN112232444B (en) * 2020-11-23 2024-02-27 中国移动通信集团江苏有限公司 Method, device, equipment and storage medium for determining geographic position data of object
CN112333211A (en) * 2021-01-05 2021-02-05 博智安全科技股份有限公司 Industrial control behavior detection method and system based on machine learning
CN112333211B (en) * 2021-01-05 2021-04-23 博智安全科技股份有限公司 Industrial control behavior detection method and system based on machine learning
CN113051406A (en) * 2021-03-23 2021-06-29 龙马智芯(珠海横琴)科技有限公司 Character attribute prediction method, device, server and readable storage medium
CN113837394A (en) * 2021-09-03 2021-12-24 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Multi-feature view data label prediction method, system and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220914

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200114