CN115186749A - Data identification method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115186749A
Authority
CN
China
Prior art keywords
rule set
target
rule
rules
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210807295.8A
Other languages
Chinese (zh)
Inventor
王天祺
刘昊骋
徐世界
徐靖宇
田建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210807295.8A priority Critical patent/CN115186749A/en
Publication of CN115186749A publication Critical patent/CN115186749A/en


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data identification method, apparatus, device, and storage medium, relating to the field of computer technology and in particular to big data technology. The implementation scheme is as follows: a training sample set containing target samples is acquired from a target memory; a target processor performs rule extraction on the training sample set according to a strategy mining algorithm to obtain an alternative rule set; the rules in the alternative rule set are screened according to their target sample rates to generate an optimal rule set; and data to be identified is identified according to the optimal rule set to obtain an identification result. Because rule generation is an end-to-end automated process, the method avoids the heavy manual effort and strong dependence on expert experience of conventional rule generation, and screening the generated alternative rule set yields rules with better inference and judgment performance.

Description

Data identification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data identification method, apparatus, device, and storage medium in the field of big data technologies.
Background
Rules are generally generated from the features and feature data of data objects. Because such rules capture local regularities in the data, they are commonly used to classify data or to identify a particular type of data object. For example, in industrial product inspection, defective products can be identified using detection rules generated from product features and feature data; in financial risk control, risky users can be identified using risk-control rules generated from user features and feature data. In the big data era, accurate identification of data is crucial to many industries. In the prior art, data is generally identified using a previously built rule set or a rule set generated manually by business personnel from experience.
Disclosure of Invention
The present disclosure provides a more efficient data identification method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a data identification method including: acquiring a training sample set from a target memory, the training sample set including target samples; performing, by a target processor, rule extraction on the training sample set according to a strategy mining algorithm to obtain an alternative rule set; screening the rules in the alternative rule set according to their target sample rates to generate an optimal rule set; and identifying data to be identified according to the optimal rule set to obtain an identification result.
According to another aspect of the present disclosure, there is provided a data identification apparatus including: an acquisition module configured to acquire a training sample set from a target memory, the training sample set including target samples; an extraction module configured to perform, by a target processor, rule extraction on the training sample set according to a strategy mining algorithm to obtain an alternative rule set; a generation module configured to screen the rules in the alternative rule set according to their target sample rates to generate an optimal rule set; and an identification module configured to identify data to be identified according to the optimal rule set to obtain an identification result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart of a data identification method according to a first embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a data identification method according to a second embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a data identification method according to a third embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of a data identification method according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of a data identification method according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of a data identification method according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic flow chart of a data identification method according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a data identification apparatus according to a ninth embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a data identification method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic flow chart of a data identification method according to a first embodiment of the present disclosure, as shown in fig. 1, the method mainly includes:
step S101, a training sample set is obtained from a target memory, and the training sample set comprises target samples.
In this embodiment, a training sample set is first acquired from a target memory. The training sample set contains training samples corresponding to an actual business scenario. For example, in a financial risk-control scenario, one training sample may include a user account, a user amount, the user's loan balance (the amount the user has not yet repaid), and whether the user is overdue. In a commodity recommendation scenario, one training sample may include a customer account, browsing history, purchase records, return records, and so on. In an industrial product inspection scenario, one training sample may include a product number, shape parameters, color, weight, and so on. Training samples may of course come from other business scenarios; the disclosure does not limit the business scenario to which the training samples belong. The training sample set includes target samples, which are training samples of a specific type: in a financial risk-control scenario, the target samples may be those corresponding to higher-risk user accounts; in an industrial product inspection scenario, the target samples may be those corresponding to lower-quality products.
Step S102, performing rule extraction on the training sample set by the target processor according to a strategy mining algorithm to obtain an alternative rule set.
In this embodiment, after the training sample set is obtained, the target processor extracts rules from it according to a strategy mining algorithm to obtain the alternative rule set. Specifically, each training sample in the training sample set carries its corresponding features. For example, in a financial risk-control scenario, if a training sample includes a user account, a user amount, a loan balance, and whether the user is overdue, each of these can serve as a feature of the training sample. The strategy mining algorithm extracts, from all the features of the training sample set, rules satisfying a preset condition and adds them to the alternative rule set.
In one implementation, the strategy mining algorithm may be a decision tree algorithm, a binning algorithm, or the like. With a decision tree algorithm, a decision tree is trained on the training sample set; the path to each leaf node of the trained tree is a rule, and rules meeting a preset condition are added to the alternative rule set. With a binning algorithm, a binning operation is performed on each feature of the training sample set; each bin corresponds to a rule, and rules meeting the preset condition are added to the alternative rule set. If no bin of a given feature meets the preset condition, that feature's bins are crossed with the bins of other features to generate new bins, and the rules of any new bins that meet the preset condition are added to the alternative rule set. The preset condition can be set according to the actual situation.
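As a concrete illustration (not part of the patent; the feature names and data structures are hypothetical), a leaf-node path or a bin can be represented as a conjunction of (feature, operator, threshold) conditions, and a rule's target sample rate computed over the samples it hits:

```python
# A rule is a list of (feature, op, threshold) conditions, e.g. one path
# through a decision tree or one bin boundary. Feature names are illustrative.

def rule_hits(rule, sample):
    """Return True if the sample satisfies every condition in the rule."""
    for feature, op, threshold in rule:
        value = sample[feature]
        if op == "<=" and not value <= threshold:
            return False
        if op == ">" and not value > threshold:
            return False
    return True

def target_sample_rate(rule, samples):
    """Ratio of hit target samples to all hit samples (0 if nothing hits)."""
    hits = [s for s in samples if rule_hits(rule, s)]
    if not hits:
        return 0.0
    return sum(1 for s in hits if s["is_target"]) / len(hits)

# Example: one leaf path "loan_balance > 5000 AND amount <= 100" as a rule.
rule = [("loan_balance", ">", 5000), ("amount", "<=", 100)]
samples = [
    {"loan_balance": 6000, "amount": 50, "is_target": True},
    {"loan_balance": 6000, "amount": 80, "is_target": False},
    {"loan_balance": 1000, "amount": 50, "is_target": True},
]
print(target_sample_rate(rule, samples))  # 0.5 (1 target out of 2 hit samples)
```

The third sample is never hit, so only the first two count toward the rule's target sample rate.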
Step S103, screening the rules in the alternative rule set according to the target sample rates of the rules to generate an optimal rule set.
Step S104, identifying the data to be identified according to the optimal rule set to obtain an identification result.
In this embodiment, the training samples hit by different rules in the alternative rule set may overlap, and the rules differ in identification ability, so the rules in the alternative rule set are screened according to a policy generation algorithm to generate an optimal rule set. The data to be identified, i.e., the data that needs identification, is then identified using the optimal rule set to obtain an identification result. The identification process itself is similar to that in the prior art and is not repeated here.
In one implementation, the target sample rate of each rule in the alternative rule set may be calculated, where a rule's target sample rate is the ratio of the number of target samples it hits to the number of all training samples it hits. The rules are then sorted by target sample rate in descending order, and the top-ranked rules, i.e., those with the highest target sample rates, are selected to form the optimal rule set. The number of selected rules may be determined according to the actual situation.
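The descending sort and top-k selection described above can be sketched as follows (a minimal illustration; the rule names and rates are made up):

```python
def build_optimal_rule_set(candidate_rates, top_k):
    """Sort candidate rules by target sample rate, descending, keep the top_k.
    candidate_rates: dict mapping rule name -> target sample rate."""
    ranked = sorted(candidate_rates.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

rates = {"rule_a": 0.30, "rule_b": 0.75, "rule_c": 0.55}
print(build_optimal_rule_set(rates, top_k=2))  # ['rule_b', 'rule_c']
```

The value of `top_k` corresponds to the "number of selected rules" the embodiment leaves to be determined by the actual situation.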
In the first embodiment of the disclosure, the target processor automatically generates the alternative rule set according to the strategy mining algorithm and then screens its rules directly to generate the optimal rule set. Because the generation of the optimal rule set is an end-to-end automated process, the heavy manual effort and strong dependence on expert experience of conventional rule generation are avoided. Screening the generated alternative rule set ensures that an optimal rule set with a better identification effect is generated, and using it to identify the data to be identified further improves the accuracy of the identification result.
In a second embodiment of the present disclosure, the policy mining algorithm is a decision tree algorithm, fig. 2 is a schematic flow chart of a data identification method according to the second embodiment of the present disclosure, as shown in fig. 2, step S102 mainly includes:
step S201, training a decision tree according to preset parameters and a training sample set.
Step S202, calculating a target sample rate of leaf nodes in the decision tree, and adding rules corresponding to the leaf nodes with the target sample rate larger than a first preset threshold value into an alternative rule set.
In this embodiment, a decision tree is first trained on the training sample set according to preset parameters; every training sample falls into some leaf node, and the path to each leaf node is a rule. The target sample rates of all leaf nodes are then calculated, and the rule (i.e., path) of every leaf node whose target sample rate is greater than a first preset threshold is added to the alternative rule set. A leaf node's target sample rate is the ratio of the number of target samples in the leaf to the number of all training samples in the leaf. Specifically, the preset parameters may include a maximum depth and a minimum leaf sample count: the maximum depth is the number of layers of the trained decision tree, and the minimum leaf sample count is the minimum number of training samples a leaf node must hit, so a leaf node with fewer samples may be pruned. Both the preset parameters and the first preset threshold can be set according to the actual situation.
Step S203, deleting the features corresponding to the rules from the training sample set, and training the decision tree again according to the preset parameters and the feature-reduced training sample set until a first preset condition is satisfied.
In this embodiment, after the rules of leaf nodes whose target sample rate exceeds the first preset threshold are added to the alternative rule set, all features used by those rules may be deleted from the training sample set. The decision tree is then retrained on the reduced training sample set with the same preset parameters, and the qualifying rules from the newly trained tree are added to the alternative rule set. This repeats until the first preset condition is satisfied.
In an implementation manner, not only all features corresponding to the extracted rules can be deleted from the training sample set, but also all training samples corresponding to the extracted rules can be deleted, and then the decision tree is retrained.
In one implementation, the first preset condition may be that no features remain in the training sample set (or too few remain to train a decision tree), or that the number of rules in the alternative rule set reaches a certain value, which can be set according to the actual situation.
In the second embodiment of the disclosure, rules are extracted from the training sample set with a decision tree algorithm: the rule of every leaf node whose target sample rate exceeds the first preset threshold is added to the alternative rule set, and the used features and training samples are continuously filtered out. This further ensures the accuracy of the alternative rule set and reduces the repetition rate of the extracted rules.
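The train-extract-delete loop of this embodiment can be sketched as follows. `train_tree` and `extract_rules` are hypothetical stand-ins for the actual decision-tree training and threshold-filtering steps, and the stopping condition shown (too few features left, or enough rules collected) is one possible form of the first preset condition:

```python
def iterative_rule_mining(features, train_tree, extract_rules,
                          min_features=2, max_rules=10):
    """Repeatedly train a tree, keep rules above the first preset threshold,
    then drop the features those rules used, until the stop condition holds."""
    candidate_rules = []
    remaining = set(features)
    while len(remaining) >= min_features and len(candidate_rules) < max_rules:
        tree = train_tree(remaining)
        rules = extract_rules(tree)  # rules whose target sample rate qualifies
        if not rules:
            break
        candidate_rules.extend(rules)
        for rule in rules:           # delete features used by accepted rules
            remaining -= {feature for feature, _, _ in rule}
    return candidate_rules

# Stub "training": each call yields one rule on the first remaining feature.
def demo_train(features):
    first = sorted(features)[0]
    return [[(first, ">", 0.5)]]

rules = iterative_rule_mining(["a", "b", "c"], demo_train, lambda tree: tree)
print(rules)  # [[('a', '>', 0.5)], [('b', '>', 0.5)]]
```

With three features and `min_features=2`, the loop runs twice: after mining a rule on `a` and then on `b`, only one feature remains, so the first preset condition stops the iteration.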
In a third embodiment of the present disclosure, the policy mining algorithm is a gradient boosting algorithm, fig. 3 is a schematic flow chart of a data identification method according to the third embodiment of the present disclosure, as shown in fig. 3, step S102 mainly includes:
step S301, adding sample weight to training samples in the training sample set.
In this embodiment, a sample weight is first added to each training sample in the training sample set. Specifically, the sample weight may be set according to the actual business scenario. Taking financial risk control as an example, if a training sample includes a user account, a user amount, a loan balance, and whether the user is overdue, the sample weight of each training sample may be calculated by the following formula:
sample weight = user amount / sum(user amount)
where sum(user amount) is the sum of the user amounts of all training samples in the training sample set. Of course, the sample weight of each training sample may also be determined in other ways for different business scenarios.
In one possible implementation, in the field of financial risk control, the sample weight of each training sample may also be calculated by the following formula:
sample weight = whether the user is overdue / sum(whether the user is overdue)
where whether the user is overdue is 1 if the user is overdue and 0 otherwise.
In one implementation, in the field of financial risk control, the sample weight of each training sample may further be calculated by the following formula:
sample weight = (whether the user is overdue × loan balance) / sum(whether the user is overdue × loan balance)
where whether the user is overdue is 1 if the user is overdue and 0 otherwise, and sum(whether the user is overdue × loan balance) is the sum, over all training samples, of the product of the overdue indicator and the loan balance.
Step S302, inputting the training sample set with added sample weights into a gradient boosting model for training to obtain a plurality of rule trees.
Step S303, calculating target sample rates of leaf nodes in the plurality of rule trees, and adding the rule corresponding to the leaf node whose target sample rate is greater than the first preset threshold value into the alternative rule set.
In this embodiment, after adding a sample weight to each training sample in a training sample set, the training sample set to which the sample weight is added is input into an eXtreme Gradient Boosting (XGBoost) model for training to obtain a plurality of rule trees, then target sample rates of all leaf nodes in the plurality of rule trees are calculated, and a rule corresponding to a leaf node whose target sample rate is greater than a first preset threshold is added to an alternative rule set.
In one implementation, because the XGBoost model trains a plurality of rule trees on the same training sample set, rules extracted from different trees may repeat, so the rules in the alternative rule set need correlation filtering. For example, the training samples hit by each rule in the alternative rule set are determined from the trained rule trees; if the coincidence degree of the training samples hit by two rules exceeds a certain value (which can be set according to the actual situation), the rule with the lower target sample rate is deleted from the alternative rule set.
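The correlation filtering above can be sketched as follows. The patent only speaks of a "coincidence degree" of hit samples; a Jaccard-style overlap is used here as one plausible choice, and the rule tuples are an assumed representation:

```python
def overlap(hits_a, hits_b):
    """Coincidence of the training samples two rules hit (Jaccard measure,
    assumed; the patent does not fix the exact formula)."""
    a, b = set(hits_a), set(hits_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def correlation_filter(rules, max_overlap):
    """rules: list of (name, target_sample_rate, hit_sample_ids).
    Visit rules by descending target sample rate; keep a rule only if it
    does not overlap too much with any already-kept rule, so of any
    too-similar pair the lower-rate rule is dropped."""
    kept = []
    for name, rate, hits in sorted(rules, key=lambda r: r[1], reverse=True):
        if all(overlap(hits, h) <= max_overlap for _, _, h in kept):
            kept.append((name, rate, hits))
    return [name for name, _, _ in kept]

rules = [
    ("r1", 0.8, [1, 2, 3, 4]),
    ("r2", 0.6, [1, 2, 3]),   # hits overlap r1 heavily -> dropped
    ("r3", 0.5, [7, 8, 9]),   # disjoint from r1 -> kept
]
print(correlation_filter(rules, max_overlap=0.5))  # ['r1', 'r3']
```

Here r2 shares 3 of 4 samples with r1 (overlap 0.75 > 0.5), so the lower-rate rule r2 is deleted, matching the filtering described in the embodiment.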
In the third embodiment of the disclosure, a sample weight is added to each training sample, and the weighted training sample set is input into the XGBoost model for rule extraction, which helps ensure higher accuracy of the extracted rules; correlation screening of the rules in the alternative rule set then ensures that a better alternative rule set is generated.
In a fourth embodiment of the present disclosure, the policy mining algorithm is a joint ranking algorithm, fig. 4 is a schematic flow chart of a data identification method according to the fourth embodiment of the present disclosure, as shown in fig. 4, step S102 mainly includes:
step S401, adding sample weight for training samples in the training sample set.
Step S401 is similar to step S301, and is not described herein again.
And S402, performing box separation operation on the features corresponding to the training sample set, and sequencing the features according to the information values to obtain a sequencing result.
In this embodiment, a binning algorithm is invoked to perform a binning operation on all features of the training sample set, and the features are sorted by Information Value (IV) to obtain a sorting result. The Information Value measures a feature's predictive power: the larger the IV, the stronger the feature's predictive or discriminative ability. IV can be computed by methods common in the art and is not detailed here. Specifically, the binning algorithm may be chi-square binning, equal-width binning, equal-frequency binning, or the like; the disclosure does not limit the binning algorithm.
Step S403, according to the sorting result, calculating the target sample rate and sample weight mean of each feature's bins, and finding a first continuous binning whose target sample rate is greater than the first preset threshold.
Step S404, sorting the first continuous binning by sample weight mean to obtain a second continuous binning, and adding the rules corresponding to the second continuous binning to the alternative rule set.
In this embodiment, according to the sorting result, the target sample rate and the sample weight mean value of all the bins of each feature are sequentially calculated from the feature with the largest information value, and the first continuous bin with the target sample rate larger than the first preset threshold is found, then the first continuous bin is sorted according to the sample weight mean values of all the bins, so as to obtain the second continuous bin, and the rule corresponding to the second continuous bin is added to the candidate rule set. Specifically, the target sample rate is a ratio of the number of target samples corresponding to the binning to the number of training samples corresponding to the binning, and the sample weight average value is a sample weight average value of all training samples corresponding to the binning.
In one implementation, a first continuous binning is a run of adjacent bins of a feature whose target sample rates are all greater than the first preset threshold. For example, suppose a feature is binned into bins 1 through 5, and the target sample rates of bins 1, 3, 4, and 5 exceed the first preset threshold; then bins 3, 4, and 5 together form a first continuous binning. If sorting that first continuous binning by the sample weight mean of each bin yields the order bin 4, bin 5, bin 3, then bins 4 and 5 together form a second continuous binning. In that case the rule corresponding to bin 1 and the rule corresponding to the second continuous binning formed by bins 4 and 5 can both be added to the alternative rule set; that is, two rules are added.
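Finding the runs of adjacent qualifying bins can be sketched as follows (a simplified illustration using 0-based indices; the sample-weight-mean re-sorting step is omitted):

```python
def continuous_bins(rates, threshold):
    """Find runs of adjacent bins whose target sample rate exceeds threshold.
    rates: per-bin target sample rates in bin order."""
    runs, current = [], []
    for idx, rate in enumerate(rates):
        if rate > threshold:
            current.append(idx)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs

# Mirrors the example above: bins 1, 3, 4, 5 exceed the threshold, so the
# runs are [bin 1] and [bins 3, 4, 5] (0-indexed here as [0] and [2, 3, 4]).
rates = [0.6, 0.2, 0.7, 0.8, 0.9]
print(continuous_bins(rates, threshold=0.5))  # [[0], [2, 3, 4]]
```

Each returned run is a candidate first continuous binning; the embodiment then re-sorts a run by sample weight mean to carve out the second continuous binning.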
Step S405, if no bin of the current feature has a target sample rate greater than the first preset threshold, crossing the bins of the current feature with the bins of the next feature to obtain new bins, and continuing to calculate the target sample rates and sample weight means of the new bins until all features have been traversed.
In this embodiment, if no bin of the current feature has a target sample rate greater than the first preset threshold, all bins of the current feature are crossed with all bins of the next feature to obtain new bins: each bin of the current feature is intersected with every bin of the next feature, so if each of the two features has 5 bins, crossing yields 25 new bins. The target sample rates and sample weight means of the new bins are then calculated, i.e., steps S403 and S404 are repeated, until all features have been traversed.
In an implementation, after all the features are traversed, the features and/or samples corresponding to all the rules in the alternative rule set may be further deleted from the training sample set, and steps S402 to S405 may be repeated using the training sample set after the features and/or samples are deleted.
In the fourth embodiment of the disclosure, a joint sorting algorithm extracts rules from the training sample set: all features are sorted by Information Value and binned, the first continuous binning whose target sample rate exceeds the first preset threshold is found on each feature in turn, the first continuous binning is sorted by sample weight mean to obtain the second continuous binning, and the rules corresponding to the second continuous binning are added to the alternative rule set.
In a fifth embodiment of the present disclosure, the policy mining algorithm is a pole search algorithm, fig. 5 is a schematic flow chart of a data identification method according to the fifth embodiment of the present disclosure, as shown in fig. 5, step S102 mainly includes:
and S501, sorting the characteristics corresponding to the training sample set according to the information values to obtain a sorting result.
The specific process of step S501 has already been discussed in step S402, and is not described herein again.
Step S502, according to the sorting result, calculating the target sample rate of the first pole interval at each feature's monotonic-direction pole, and adding the rule corresponding to any first pole interval whose target sample rate is greater than the first preset threshold to the alternative rule set.
In this embodiment, after all features of the training sample set are sorted, the target sample rate of the first pole interval at each feature's monotonic-direction pole is calculated in turn according to the sorting result, and the rule corresponding to any first pole interval whose target sample rate exceeds the first preset threshold is added to the alternative rule set. The monotonic direction is the direction in which a feature increases or decreases; the monotonic-direction pole is the feature's maximum or minimum along that direction; and a pole interval is an interval near the pole. For a monotonically increasing feature, the monotonic direction runs from the feature's minimum toward its maximum, the pole is the maximum, and the first pole interval may be an interval containing the maximum; for a monotonically decreasing feature, the direction runs from maximum to minimum, the pole is the minimum, and the first pole interval may be an interval containing the minimum.
In one possible embodiment, if a feature is a monotone increasing feature, the minimum value is 0, and the maximum value is 100, the monotone direction of the feature is a direction from 0 to 100, the monotone direction pole of the feature is 100, a quantile, for example, 0.95, may be set near the monotone direction pole, and the corresponding first pole interval is [95, 100], and the target sample rate of the interval is the ratio of the number of target samples falling in the interval to the number of all training samples falling in the interval. Wherein, the quantile can be set according to the actual situation.
In one implementation, according to the sorting result, the lift (Lift) of the first pole interval at each feature's monotonic-direction pole may be calculated in turn, and the rule corresponding to any first pole interval whose Lift is greater than a first threshold is added to the alternative rule set. Lift may be calculated by the following formula:
Lift = target sample rate of the first pole interval / target sample rate of the whole training sample set
the first threshold value can be determined according to actual conditions.
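As a hedged sketch (names are assumptions), Lift as defined in this disclosure, the interval's target sample rate over the whole set's target sample rate, might be computed as:

```python
# Hedged sketch of the Lift calculation: the interval's target sample rate
# divided by the target sample rate of the whole training sample set.
def interval_lift(samples, lo, hi):
    """samples: list of (feature_value, is_target) pairs."""
    inside = [t for v, t in samples if lo <= v <= hi]
    overall_rate = sum(t for _, t in samples) / len(samples)
    if not inside or overall_rate == 0:
        return 0.0
    return (sum(inside) / len(inside)) / overall_rate

samples = [(98, 1), (96, 1), (97, 0), (50, 0), (10, 0)]
lift = interval_lift(samples, lo=95, hi=100)  # (2/3) / (2/5)
```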
In an implementation manner, a sample weight may further be added to each training sample in the training sample set, the sample weight average of the first pole interval at each feature's monotonic-direction pole is calculated, and the rule corresponding to each first pole interval whose Lift is greater than the first threshold and whose sample weight average is greater than the second threshold is added to the alternative rule set.
In step S503, if the target sample rate of the first pole interval of the current feature is not greater than the first preset threshold, the first pole interval is adjusted to obtain a second pole interval, and the target sample rate of the second pole interval is calculated until a second preset condition is satisfied.
In this embodiment, if the target sample rate of the first pole interval of the current feature is not greater than the first preset threshold, the first pole interval is adjusted to obtain a second pole interval, and the target sample rate of the second pole interval is calculated until the second preset condition is satisfied. The second preset condition may be that the number of training samples in the alternative rule set reaches a certain value, or that the number of adjustments of the first pole interval reaches a certain value; these values may be set as required.
In an embodiment, the idea of binary search can be adopted and the first pole interval adjusted by adjusting the quantile. For example, if the monotonic direction of the current feature runs from 0 to 100, the monotonic-direction pole of the feature is 100 and the quantile is 0.95, the first pole interval is [95, 100]. If the target sample rate of the first pole interval is not greater than the first preset threshold, the interval may be adjusted by raising the quantile, for example to 0.975, giving a second pole interval of [97.5, 100], and the target sample rate of the second pole interval is then calculated. If that target sample rate is greater than the first preset threshold, the rule corresponding to the second pole interval is added to the alternative rule set, and the lower bound of the interval may then be adjusted in the opposite direction following binary search, i.e. to the midpoint of the previous two lower bounds, for example (95 + 97.5) / 2 = 96.25, giving an adjusted pole interval of [96.25, 100], whose target sample rate is then recalculated. Specifically, the adjustment manner of the first pole interval may be set according to the actual situation.
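The binary-search style adjustment of the interval's lower bound can be sketched as follows (a hedged illustration under assumed names; the patent does not prescribe this exact procedure):

```python
# Hedged sketch: bisect the lower bound of the pole interval [lo, hi] between
# the last failing and last passing bound, keeping the interval as wide as
# possible while its target sample rate exceeds the threshold.
def bisect_pole_interval(samples, lo, hi, threshold, steps=8):
    """samples: list of (feature_value, is_target). Returns the lower bound
    found for a qualifying interval, or None if no sub-interval qualifies."""
    def rate(bound):
        inside = [t for v, t in samples if bound <= v <= hi]
        return sum(inside) / len(inside) if inside else 0.0

    if rate(lo) > threshold:          # the first pole interval already qualifies
        return lo
    bad, good = lo, None              # bad: failing bound; good: passing bound
    for _ in range(steps):
        mid = (bad + (good if good is not None else hi)) / 2
        if rate(mid) > threshold:
            good = mid                # qualifies: try widening back toward `bad`
        else:
            bad = mid                 # fails: shrink further toward the pole
    return good

samples = [(v, 1 if v >= 97 else 0) for v in range(90, 101)]  # targets near the pole
bound = bisect_pole_interval(samples, lo=90, hi=100, threshold=0.5)
```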
Step S504, if the target sample rates of all pole intervals on the current feature are not greater than the first preset threshold, storing the first pole interval, combining the first pole interval and all pole intervals of the next feature to obtain a new pole interval, and recalculating the target sample rate of the new pole interval until all features are traversed.
In this embodiment, if the target sample rates of all the pole intervals on the current feature are not greater than the first preset threshold, the first pole interval of the current feature is stored, which is equivalent to storing the initial pole interval of the current feature, and when the rule is extracted for the next feature, for each pole interval of the next feature, the pole interval is merged with the first pole interval of the current feature to obtain a new pole interval, the target sample rate of the new pole interval is recalculated, and the rule corresponding to the new pole interval with the target sample rate greater than the first preset threshold is added to the alternative rule set until all the features are traversed.
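The merging step above, combining the stored first pole interval of the current feature with a pole interval of the next feature into a joint rule, can be sketched as follows (an illustrative sketch, not the patent's code; the rule and sample representations are assumptions):

```python
# Hedged sketch: target sample rate of a joint rule formed by combining an
# interval condition on feature "a" with an interval condition on feature "b".
def joint_target_rate(rows, rule_a, rule_b):
    """rows: list of dicts with feature values plus an 'is_target' flag;
    rule_a / rule_b: (feature_name, lo, hi) interval conditions."""
    def match(row, rule):
        name, lo, hi = rule
        return lo <= row[name] <= hi
    hit = [r["is_target"] for r in rows if match(r, rule_a) and match(r, rule_b)]
    return sum(hit) / len(hit) if hit else 0.0

rows = [
    {"a": 96, "b": 80, "is_target": 1},
    {"a": 96, "b": 10, "is_target": 0},
    {"a": 50, "b": 80, "is_target": 0},
]
rate = joint_target_rate(rows, ("a", 95, 100), ("b", 70, 100))  # only row 1 matches both
```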
In a fifth embodiment of the present disclosure, a pole search algorithm is adopted to perform rule extraction on a training sample set, determine whether a target sample rate of a pole interval at a pole point in a characteristic monotonic direction is greater than a first preset threshold, and if yes, add a corresponding rule of the pole interval to an alternative rule set; if the target sample rate is not greater than the first preset threshold, the range of the pole section is adjusted, and whether the target sample rate of the adjusted pole section is greater than the first preset threshold is judged again.
In the second to fifth embodiments of the present disclosure, each policy mining algorithm is composed of a series of operators, the operators have fixed input and output, and all the operators are connected in series, so that an end-to-end automatic process is realized, and the problems that a large amount of labor is required for rule generation and the dependency on expert experience is high are solved.
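As a purely illustrative assumption about the implementation, operators with fixed input/output contracts can be connected in series so that each consumes the previous operator's output, yielding an end-to-end automatic process:

```python
# Conceptual sketch (an assumption, not the patent's code): chain operators
# with fixed inputs/outputs into one end-to-end pipeline.
def run_pipeline(operators, data):
    """Each operator takes the previous operator's output as its input."""
    for op in operators:
        data = op(data)
    return data

pipeline = [
    lambda samples: [s for s in samples if s is not None],  # screening operator
    lambda samples: sorted(samples),                        # sorting operator
    lambda samples: samples[-2:],                           # rule-extraction stub
]
result = run_pipeline(pipeline, [3, None, 1, 2])
```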
Fig. 6 is a flowchart illustrating a data identification method according to a sixth embodiment of the present disclosure, as shown in fig. 6, before step S102, the method further includes:
and step S601, scoring the training sample set according to the credit scoring card to obtain a scoring result.
And step S602, according to the grading result, performing box separation on the training sample set to obtain a box separation result.
And step S603, screening the training sample set according to a second preset condition and a box separation result to obtain an appointed sample set.
In this embodiment, before rule extraction is performed on the training sample set, the training sample set needs to be screened: the training samples in the training sample set can be scored according to the credit scoring card; a binning operation is then performed on the training sample set according to the scoring result to obtain a binning result; and finally, according to a second preset condition, specified training samples are selected from the binning result to obtain a specified sample set.
In an implementation manner, a machine learning model may be used to establish a credit rating card, such as a logistic regression model, and the second preset condition may be that training samples corresponding to the last two bins in the binning results are used as a specified sample set, for example, if the training sample set is binned according to the scoring result of the credit rating card, bin 1, bin 2, bin 3, bin 4, and bin 5 are obtained, and the training samples corresponding to bin 4 and bin 5 may be used as the specified sample set.
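The screening just described, score, split into five bins by score, keep the samples in the last two bins, might look like the following (a hedged sketch; the function name and the equal-frequency split are assumptions):

```python
import numpy as np

# Hedged sketch: bin scorecard scores into five equal-frequency bins and keep
# the samples falling in the last two bins as the specified sample set.
def specified_sample_set(scores, n_bins=5, keep_last=2):
    """scores: 1-D array of credit scores. Returns indices of the samples
    in the `keep_last` last bins (bins 4 and 5 of 5 by default)."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.searchsorted(edges, scores, side="right") - 1
    bin_ids = np.clip(bin_ids, 0, n_bins - 1)      # fold the maximum into the last bin
    return np.where(bin_ids >= n_bins - keep_last)[0]

idx = specified_sample_set(np.arange(100.0))       # five bins of 20 samples each
```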
In one embodiment, step S102 includes: the target processor performs rule extraction on the specified sample set according to a strategy mining algorithm to obtain an alternative rule set. That is, rule extraction is performed only on the screened specified sample set, and after the alternative rule set is obtained, the second preset condition needs to be added to the alternative rule set, which ensures both the efficiency and the quality of rule extraction.
In the sixth embodiment of the present disclosure, before rule extraction is performed on a training sample set, the training sample set is screened to obtain an assigned sample set, and a target processor performs rule extraction on the assigned sample set according to a policy mining algorithm to obtain an alternative rule set, so that efficiency and quality of rule extraction can be ensured.
Fig. 7 is a schematic flowchart of a data identification method according to a seventh embodiment of the disclosure, and as shown in fig. 7, step S103 mainly includes:
step S701, calculating the promotion degree of the rule in the alternative rule set, wherein the promotion degree comprises the ratio of the target sample rate of the rule to the target sample rate of the training sample set.
Step S702, the rules are sorted according to the promotion degree, and a sorting result is obtained.
And step S703, selecting a preset number of rules to generate an optimal rule set according to the sorting result.
In this embodiment, the lift Lift of every rule in the alternative rule set is first calculated, where Lift may be the ratio of the target sample rate of a rule to the target sample rate of the training sample set; all rules are then sorted by Lift; and finally, a preset number of rules are selected according to the sorting result to generate the optimal rule set, for example, the 20 rules with the highest Lift. The preset number can be determined according to actual conditions.
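The ranking-based screening above can be sketched as follows (the (rule_id, rate) pair representation is an assumption):

```python
# Hedged sketch: rank candidate rules by Lift (rule rate / base rate) and
# keep the top-N as the optimal rule set.
def top_rules_by_lift(rules, base_rate, top_n=20):
    """rules: list of (rule_id, target_sample_rate) pairs;
    base_rate: target sample rate of the whole training sample set."""
    ranked = sorted(rules, key=lambda r: r[1] / base_rate, reverse=True)
    return [rule_id for rule_id, _ in ranked[:top_n]]

candidates = [("r1", 0.2), ("r2", 0.6), ("r3", 0.4)]
best = top_rules_by_lift(candidates, base_rate=0.1, top_n=2)
```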
In an implementation manner, a sample weight may also be added to each training sample in the training sample set, then a sample weight average value of the training sample corresponding to each rule in the alternative rule set is calculated, all rules in the alternative rule set are sorted according to the sample weight average value, and according to a sorting result, a preset number of rules are selected to generate an optimal rule set, for example, the first 20 rules with the highest sample weight average value are selected to generate the optimal rule set.
In the eighth embodiment of the present disclosure, step S103 mainly includes:
traversing the rules in the alternative rule set, and judging, for each rule, whether adding the rule to the optimal rule set increases the promotion degree of the optimal rule set, where the promotion degree comprises the ratio of the target sample rate of the optimal rule set to the target sample rate of the training sample set; and if the promotion degree of the optimal rule set is increased, adding the rule to the optimal rule set.
In this embodiment, a greedy algorithm is adopted to generate the optimal rule set from the alternative rule set. Specifically, all rules in the alternative rule set are traversed, and for each rule it is judged whether adding the rule to the optimal rule set increases the Lift of the optimal rule set; if so, the rule is added to the optimal rule set.
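A minimal sketch of this greedy selection, assuming each candidate rule is represented as a predicate over a sample (the rule representation is an assumption, not the patent's):

```python
# Hedged sketch: greedily add a candidate rule only if doing so raises the
# Lift of the selected set's combined coverage.
def greedy_rule_set(rules, samples, base_rate):
    """rules: dict rule_id -> predicate; samples: list of (sample, is_target);
    base_rate: target sample rate of the whole training sample set."""
    def set_lift(selected):
        # Lift of the union of the selected rules' coverage.
        hit = [t for s, t in samples if any(rules[r](s) for r in selected)]
        return (sum(hit) / len(hit)) / base_rate if hit else 0.0

    selected, best_lift = [], 0.0
    for rule_id in rules:                    # traverse the alternative rule set
        lift = set_lift(selected + [rule_id])
        if lift > best_lift:                 # keep the rule only if Lift increases
            selected.append(rule_id)
            best_lift = lift
    return selected

samples = [(i, 1 if i >= 8 else 0) for i in range(10)]  # targets: 8 and 9
rules = {"hi": lambda s: s >= 8, "mid": lambda s: s >= 5, "low": lambda s: s < 5}
chosen = greedy_rule_set(rules, samples, base_rate=0.2)
```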
In an implementation manner, it may instead be judged whether adding the rule to the optimal rule set increases the sample weight average of the optimal rule set; if the sample weight average of the optimal rule set is increased, the rule is added to the optimal rule set.
In the seventh and eighth embodiments of the present disclosure, the rules in the candidate rule set are screened by using a ranking method or a greedy algorithm to generate an optimal rule set, so that it can be ensured that the rule with the highest accuracy is screened from the optimal rule set to generate a global optimal rule set.
Fig. 8 is a schematic structural diagram of a data recognition apparatus according to a ninth embodiment of the present disclosure, and as shown in fig. 8, the apparatus mainly includes:
an obtaining module 80, configured to obtain a training sample set from a target storage, where the training sample set includes target samples; the extraction module 81 is used for extracting rules of the training sample set by the target processor according to a strategy mining algorithm to obtain an alternative rule set; a generating module 82, configured to screen the rules in the alternative rule set according to a target sample rate of the rules in the alternative rule set, so as to generate an optimal rule set; and the identification module 83 is configured to identify the data to be identified according to the optimal rule set to obtain an identification result.
In an implementation, the policy mining algorithm is a decision tree algorithm, and the extracting module 81 mainly includes: the first training submodule is used for training a decision tree according to preset parameters and a training sample set; the first calculation submodule is used for calculating a target sample rate of leaf nodes in the decision tree and adding rules corresponding to the leaf nodes of which the target sample rate is greater than a first preset threshold value into the alternative rule set; and the retraining submodule is used for deleting the characteristics corresponding to the rules from the training sample set, and training the decision tree again according to the preset parameters and the training sample set after the characteristics are deleted until a first preset condition is met.
In an implementation, the strategy mining algorithm is an extreme gradient boosting algorithm, and the extraction module 81 mainly includes: the first adding submodule is used for adding sample weights to the training samples in the training sample set; the second training submodule is used for inputting the training sample set added with the sample weights into the extreme gradient boosting model for training to obtain a plurality of rule trees; and the second calculation submodule is used for calculating a target sample rate of leaf nodes in the plurality of rule trees and adding rules corresponding to the leaf nodes of which the target sample rate is greater than the first preset threshold value into the alternative rule set.
In an implementation manner, the policy mining algorithm is a joint ranking algorithm, and the extraction module 81 mainly includes: the second adding submodule is used for adding sample weights to the training samples in the training sample set; the first box dividing module is used for carrying out box dividing operation on the characteristics corresponding to the training sample set and sequencing the characteristics according to the information values to obtain a sequencing result; the third calculation submodule is used for calculating the target sample rate and the sample weight average value of the sub-box of each feature according to the sorting result and finding out a first continuous sub-box of which the target sample rate is greater than a first preset threshold value; the first sequencing submodule is used for sequencing the first continuous sub-boxes according to the sample weight average value to obtain second continuous sub-boxes, and adding rules corresponding to the second continuous sub-boxes into the alternative rule set; and the first traversal submodule is used for crossing the bin of the current feature and the bin of the next feature to obtain a new bin if the target sample rates of all the bins of the current feature are not greater than the first preset threshold, and continuously calculating the target sample rate and the sample weight mean value of the new bin until all the features are traversed.
In one possible embodiment, the strategy mining algorithm is a pole search algorithm, and the extraction module 81 mainly includes: the second sequencing submodule is used for sequencing the characteristics corresponding to the training sample set according to the information values to obtain a sequencing result; the fourth calculation submodule is used for calculating the target sample rate of the first pole interval at each characteristic monotonic direction pole point according to the sorting result, and adding the rule corresponding to the first pole interval with the target sample rate being greater than the first preset threshold value into the alternative rule set; the adjusting submodule is used for adjusting the first pole interval to obtain a second pole interval if the target sample rate of the first pole interval of the current characteristics is not greater than a first preset threshold, and calculating the target sample rate of the second pole interval until a second preset condition is met; and the second traversal submodule is used for storing the first pole section if the target sample rates of all the pole sections on the current characteristic are not greater than the first preset threshold, combining the first pole section with all the pole sections of the next characteristic to obtain a new pole section, and recalculating the target sample rate of the new pole section until all the characteristics are traversed.
In one embodiment, the apparatus further comprises: the scoring module is used for scoring the training sample set according to the credit scoring card to obtain a scoring result; the box dividing module is used for dividing the training sample set into boxes according to the grading result to obtain a box dividing result; the screening module is used for screening the training sample set according to a second preset condition and the box separation result to obtain an appointed sample set; the extraction module 81 is further configured to perform rule extraction on the specified sample set by the target processor according to a policy mining algorithm to obtain an alternative rule set.
In one implementation, the generation module 82 mainly includes: the fifth calculation submodule is used for calculating the promotion degree of the rule in the alternative rule set, wherein the promotion degree comprises the ratio of the target sample rate of the rule to the target sample rate of the training sample set; the third sequencing submodule is used for sequencing the rules according to the promotion degree to obtain a sequencing result; and the selecting submodule is used for selecting a preset number of rules to generate an optimal rule set according to the sequencing result.
In one implementation, the generation module 82 mainly includes: the judgment submodule is used for traversing the rules in the alternative rule set, judging whether the rules are added into the optimal rule set or not, and judging whether the promotion degree of the optimal rule set is increased or not, wherein the promotion degree comprises the ratio of the target sample rate of the optimal rule set to the target sample rate of the training sample set; and the adding submodule is used for adding the rule into the optimal rule set if the promotion degree of the optimal rule set is increased.
In the technical scheme of the present disclosure, the acquisition, storage, application, and other processing of the personal information of relevant users all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 901 performs the respective methods and processes described above, such as a data recognition method. For example, in some embodiments, a data recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communications unit 909. When the computer program is loaded into RAM 903 and executed by computing unit 901, one or more steps of a data recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform a data recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. A data identification method, comprising:
obtaining a training sample set from a target memory, the training sample set comprising target samples;
carrying out rule extraction on the training sample set by a target processor according to a strategy mining algorithm to obtain an alternative rule set;
screening the rules in the alternative rule set according to the target sample rate of the rules in the alternative rule set to generate an optimal rule set;
and identifying the data to be identified according to the optimal rule set to obtain an identification result.
2. The method of claim 1, wherein the policy mining algorithm is a decision tree algorithm, and the extracting, by the target processor, the rules from the training sample set according to the policy mining algorithm to obtain an alternative rule set comprises:
training a decision tree according to preset parameters and the training sample set;
calculating a target sample rate of leaf nodes in the decision tree, and adding rules corresponding to the leaf nodes with the target sample rate larger than a first preset threshold value into the alternative rule set;
and deleting the characteristics corresponding to the rules from the training sample set, and training the decision tree again according to the preset parameters and the training sample set after characteristics are deleted until a first preset condition is met.
3. The method of claim 1, wherein the strategy mining algorithm is an extreme gradient boosting algorithm, and the performing, by the target processor, rule extraction on the training sample set according to the strategy mining algorithm to obtain an alternative rule set comprises:
adding sample weights to training samples in the training sample set;
inputting the training sample set added with the sample weight into an extreme gradient boosting model for training to obtain a plurality of rule trees;
and calculating the target sample rate of the leaf nodes in the plurality of rule trees, and adding the rule corresponding to the leaf node with the target sample rate larger than a first preset threshold value into the alternative rule set.
4. The method of claim 1, wherein the policy mining algorithm is a joint ranking algorithm, and the extracting, by the target processor, the rules from the training sample set according to the policy mining algorithm to obtain an alternative rule set comprises:
adding sample weights to training samples in the training sample set;
performing box separation operation on the features corresponding to the training sample set, and sorting the features according to the information values to obtain a sorting result;
according to the sorting result, calculating a target sample rate and a sample weight mean value of each box of the characteristics, and finding a first continuous box of which the target sample rate is greater than a first preset threshold value;
sorting the first continuous boxes according to the sample weight mean value to obtain second continuous boxes, and adding rules corresponding to the second continuous boxes into the alternative rule set;
if the target sample rates of all the sub-boxes of the current feature are not larger than the first preset threshold, the sub-boxes of the current feature are crossed with the sub-boxes of the next feature to obtain new sub-boxes, and the target sample rates and the sample weight mean values of the new sub-boxes are continuously calculated until all the features are traversed.
5. The method of claim 1, wherein the strategy mining algorithm is a pole search algorithm, and the performing, by the target processor, rule extraction on the training sample set according to the strategy mining algorithm to obtain an alternative rule set comprises:
sorting the characteristics corresponding to the training sample set according to the information values to obtain a sorting result;
according to the sorting result, calculating a target sample rate of a first pole interval at each characteristic monotonic direction pole point, and adding a rule corresponding to the first pole interval with the target sample rate larger than a first preset threshold value into the alternative rule set;
if the target sample rate of a first pole interval of the current characteristics is not greater than a first preset threshold, adjusting the first pole interval to obtain a second pole interval, and calculating the target sample rate of the second pole interval until a second preset condition is met;
and if the target sample rates of all pole intervals on the current characteristic are not greater than a first preset threshold, storing the first pole interval, combining the first pole interval with all pole intervals of the next characteristic to obtain a new pole interval, and recalculating the target sample rate of the new pole interval until all the characteristics are traversed.
6. The method of any one of claims 2 to 5, further comprising, before said extracting, by the target processor, the rules from the training sample set according to a policy mining algorithm to obtain an alternative rule set:
scoring the training sample set according to a credit scoring card to obtain a scoring result;
binning the training sample set according to the scoring result to obtain a binning result;
screening the training sample set according to a second preset condition and the binning result to obtain a specified sample set;
wherein the performing, by the target processor, rule extraction on the training sample set according to the strategy mining algorithm to obtain the alternative rule set comprises:
and performing, by the target processor, rule extraction on the specified sample set according to the strategy mining algorithm to obtain the alternative rule set.
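The pre-screening step of claim 6 can be sketched as follows. `prescreen_by_score` is a hypothetical helper: the credit score card is passed in as an opaque scoring function, the binning is simplified to equal-width bins over the score range, and the claim's second preset condition is represented as a set of bin ids to keep.

```python
def prescreen_by_score(samples, score_fn, n_bins, keep_bins):
    """Score each sample with a score-card function, bin the scores into
    n_bins equal-width bins, and keep only samples whose bin id is in
    keep_bins (a stand-in for the claimed 'second preset condition')."""
    scores = [score_fn(s) for s in samples]
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1      # guard against a degenerate range
    def bin_of(sc):
        return min(int((sc - lo) / width), n_bins - 1)
    return [s for s, sc in zip(samples, scores) if bin_of(sc) in keep_bins]
```

Rule extraction then runs only on the returned specified sample set rather than the full training sample set.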
7. The method of claim 6, wherein the screening the rules in the alternative rule set according to the target sample rates of the rules in the alternative rule set to generate an optimal rule set comprises:
calculating a promotion degree of each rule in the alternative rule set, wherein the promotion degree is the ratio of the target sample rate of the rule to the target sample rate of the training sample set;
sorting the rules according to the promotion degree to obtain a sorting result;
and selecting, according to the sorting result, a preset number of rules to generate the optimal rule set.
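Claim 7's ranking-based screening reduces to a short sketch. `top_rules_by_lift` is a hypothetical helper in which the promotion degree (lift) of a rule is its target sample rate divided by the overall rate of the training sample set, and the top-k rules form the optimal rule set.

```python
def top_rules_by_lift(rules, rule_rates, base_rate, k):
    """Rank rules by lift (rule target-sample rate / overall target-sample
    rate) and return the k highest-lift rules as the optimal rule set."""
    ranked = sorted(rules, key=lambda r: rule_rates[r] / base_rate,
                    reverse=True)
    return ranked[:k]
```

Because the base rate is a shared constant, the ranking is equivalent to sorting by the rules' target sample rates directly; dividing by it only matters when lifts from different sample sets are compared.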
8. The method of claim 6, wherein the screening the rules in the alternative rule set according to the target sample rates of the rules in the alternative rule set to generate an optimal rule set comprises:
traversing the rules in the alternative rule set, and judging, for each rule, whether adding the rule to the optimal rule set increases a promotion degree of the optimal rule set, wherein the promotion degree is the ratio of the target sample rate of the optimal rule set to the target sample rate of the training sample set;
and if the promotion degree of the optimal rule set is increased, adding the rule into the optimal rule set.
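Claim 8's greedy set-level screening can be sketched as below. `greedy_rule_set` is a hypothetical illustration: each rule is modeled by the set of sample indices it matches, and because the base rate is fixed, the set's lift increases exactly when the target sample rate over the union of matched samples increases.

```python
def greedy_rule_set(candidates, hits, labels):
    """Greedily build the optimal rule set: a candidate rule is added only
    if the union of matched samples raises the set's target-sample rate
    (equivalently, its lift over the fixed base rate).

    hits: dict mapping rule -> set of sample indices the rule matches
    labels: list of 0/1, where 1 marks a target sample
    """
    chosen, covered = [], set()
    best_rate = 0.0
    for r in candidates:
        trial = covered | hits[r]
        rate = sum(labels[i] for i in trial) / len(trial)
        if rate > best_rate:            # lift grows iff the rate grows
            chosen.append(r)
            covered, best_rate = trial, rate
    return chosen
```

Unlike the top-k ranking of claim 7, this variant accounts for overlap between rules: a high-lift rule that only re-covers samples already matched adds nothing and is skipped.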
9. A data recognition apparatus comprising:
the acquisition module is used for acquiring a training sample set from a target memory, wherein the training sample set comprises target samples;
the extraction module is used for carrying out rule extraction on the training sample set by the target processor according to a strategy mining algorithm to obtain an alternative rule set;
the generation module is used for screening the rules in the alternative rule set according to the target sample rate of the rules in the alternative rule set to generate an optimal rule set;
and the identification module is used for identifying the data to be identified according to the optimal rule set to obtain an identification result.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202210807295.8A 2022-07-07 2022-07-07 Data identification method, device, equipment and storage medium Pending CN115186749A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210807295.8A CN115186749A (en) 2022-07-07 2022-07-07 Data identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210807295.8A CN115186749A (en) 2022-07-07 2022-07-07 Data identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115186749A true CN115186749A (en) 2022-10-14

Family

ID=83517369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210807295.8A Pending CN115186749A (en) 2022-07-07 2022-07-07 Data identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115186749A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196823A (en) * 2023-09-08 2023-12-08 厦门国际银行股份有限公司 Wind control rule generation method, system and storage medium
CN117196823B (en) * 2023-09-08 2024-03-19 厦门国际银行股份有限公司 Wind control rule generation method, system and storage medium

Similar Documents

Publication Publication Date Title
CN113657465A (en) Pre-training model generation method and device, electronic equipment and storage medium
US20190220710A1 (en) Data processing method and data processing device
CN110310114A (en) Object classification method, device, server and storage medium
CN112561082A (en) Method, device, equipment and storage medium for generating model
CN113837308A (en) Knowledge distillation-based model training method and device and electronic equipment
CN112989023A (en) Label recommendation method, device, equipment, storage medium and computer program product
CN116503158A (en) Enterprise bankruptcy risk early warning method, system and device based on data driving
CN104573741A (en) Feature selection method and device
CN112765452A (en) Search recommendation method and device and electronic equipment
CN115186749A (en) Data identification method, device, equipment and storage medium
CN114881223A (en) Conversion method and device of deep learning model, electronic equipment and storage medium
CN114781650A (en) Data processing method, device, equipment and storage medium
CN114153815A (en) Data processing method and device, electronic equipment and storage medium
CN113962799A (en) Training method of wind control model, risk determination method, device and equipment
CN114511022B (en) Feature screening, behavior recognition model training and abnormal behavior recognition method and device
CN114548307A (en) Classification model training method and device, and classification method and device
CN115579069A (en) Construction method and device of scRNA-Seq cell type annotation database and electronic equipment
CN115641470A (en) Method, device and equipment for training classification model and vehicle image classification model
CN114612725A (en) Image processing method, device, equipment and storage medium
CN114708117A (en) Electricity safety inspection rating method, device and equipment integrating priori knowledge
CN113987260A (en) Video pushing method and device, electronic equipment and storage medium
CN114444606A (en) Model training and data classification method and device
CN113807391A (en) Task model training method and device, electronic equipment and storage medium
CN113051490A (en) Newly added interest point prediction model training method and device and newly added interest point prediction method and device
CN113326885A (en) Method and device for training classification model and data classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination