CN109858633B - Characteristic information identification method and system - Google Patents

Characteristic information identification method and system

Info

Publication number
CN109858633B
Authority
CN
China
Prior art keywords
unique identifier
discrete
data
continuous
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910132261.1A
Other languages
Chinese (zh)
Other versions
CN109858633A (en)
Inventor
郭振宇
黄炳
刘华杰
姜璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN201910132261.1A
Publication of CN109858633A
Application granted
Publication of CN109858633B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a method and a system for identifying characteristic information. The method comprises the following steps: acquiring a discrete data unit and a continuous data unit corresponding to a first unique identifier of a data group to be predicted; inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model, the first predicted value including the first unique identifier; combining the continuous data unit corresponding to the first unique identifier with the first predicted value, and inputting the combined result into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model, the second predicted value including the first unique identifier; and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value. The method and the system can improve the efficiency with which a machine learning algorithm processes data containing both discrete data and continuous data, thereby improving the efficiency of identifying characteristic information with a machine learning algorithm.

Description

Characteristic information identification method and system
Technical Field
The invention relates to the technical field of computer data processing, in particular to a characteristic information identification method and system.
Background
Currently, in the field of machine learning, there are two main types of machine learning algorithms: algorithms suited to discrete data and algorithms suited to continuous data. Each has the following defects:
1. Machine learning algorithms suited to discrete data (such as logistic regression) have the following defect: continuous data in the sample data (which often contains both discrete and continuous data) must be discretized in advance, but the choice of discretization algorithm (bucketing, segmentation, logarithmic transformation, etc.) affects the final evaluation result. The processing procedure is therefore complex: a suitable discretization algorithm can only be found by evaluating the results of repeated trials.
2. Machine learning algorithms suited to continuous data (such as the GBDT algorithm) have the following defect: during model training or prediction, the GBDT decision trees must make logical judgments on the discrete data. When a discrete feature has many categories (for example, occupation: teacher, doctor, engineer, farmer, worker, director, actor, and so on), the decision trees become very large, which greatly reduces the processing efficiency of the algorithm.
Therefore, for data containing both discrete and continuous data, conventional machine learning algorithms have complex and inefficient processing procedures, resulting in low efficiency when machine learning algorithms are applied to identify feature information.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a method and a system for identifying characteristic information, which can effectively improve the efficiency of identifying the characteristic information by applying a machine learning algorithm.
In order to achieve the above object, the present invention provides a feature information identification method, including:
acquiring a discrete data unit and a continuous data unit corresponding to a first unique identifier of a data set to be predicted;
inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model; the first predicted value includes: the first unique identifier;
combining the continuous data units corresponding to the first unique identifier and the first predicted value, and inputting the combined value into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model; the second predicted value includes: the first unique identifier;
and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
The present invention also provides a feature information recognition system, which includes:
the acquiring unit is used for acquiring a discrete data unit and a continuous data unit corresponding to the first unique identifier of the data group to be predicted;
the first generating unit is used for inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model; the first predicted value includes: the first unique identifier;
the second generating unit is used for combining the continuous data unit corresponding to the first unique identifier and the first predicted value and inputting the combined value into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model; the second predicted value includes: the first unique identifier;
and the third generating unit is used for generating the characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
The present invention also provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the characteristic information identification method.
The present invention provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the characteristic information identification method.
The invention provides a method and a system for identifying characteristic information, comprising: acquiring a discrete data unit and a continuous data unit corresponding to a first unique identifier of a data group to be predicted; inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model, the first predicted value including the first unique identifier; combining the continuous data unit corresponding to the first unique identifier with the first predicted value, and inputting the combined result into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model, the second predicted value including the first unique identifier; and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value. The method and the system can improve the efficiency with which a machine learning algorithm processes data containing both discrete data and continuous data, thereby improving the efficiency of identifying characteristic information with a machine learning algorithm.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for identifying characteristic information of the present application;
FIG. 2 is a flow chart of a method for identifying feature information in an embodiment of the present application;
fig. 3 is a flowchart of step S201 in an embodiment of the present application;
FIG. 4 is a flowchart of step S205 in an embodiment of the present application;
FIG. 5 is a flowchart of step S207 in an embodiment of the present application;
FIG. 6 is a flowchart of step S209 in an embodiment of the present application;
FIG. 7 is a flowchart of step S211 in an embodiment of the present application;
FIG. 8 is a flowchart of step S212 in one embodiment of the present application;
FIG. 9 is a flow chart of a fraud feature information identification method in another embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a process of generating discrete training models M-S1 and first training predictors corresponding to the logistic regression algorithm S1 according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating a generation process of merged training data units T13-S1-i corresponding to each unique identifier K1-i in the logistic regression algorithm S1 according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a generation process of the continuous training model M-L1 in an embodiment of the present application;
fig. 13 is a schematic structural diagram of a feature information identification model Zj in an embodiment of the present application;
FIG. 14 is a schematic diagram illustrating a generation process of each of the first verification predicted values Y1-M1-i corresponding to the discrete training models M-S1 in an embodiment of the present application;
FIG. 15 is a schematic diagram illustrating the generation process of merged verification data cells T23-S1-i in the discrete training model M-S1 according to an embodiment of the present application;
FIG. 16 is a schematic diagram illustrating a generation process of each second verification prediction value Y2-M1-i in the continuous training model M-L1 according to an embodiment of the present application;
FIG. 17 is a schematic diagram illustrating the generation process of the difference values V1-i in the continuous training model M-L1 according to an embodiment of the present application;
FIG. 18 is a schematic diagram illustrating a generation process of a first predicted value C1-i corresponding to the discrete model M-S2 in an embodiment of the present application;
FIG. 19 is a schematic diagram illustrating a generation process of the second predicted value C2-i corresponding to the continuous model M-L2 in an embodiment of the present application;
fig. 20 is a schematic structural diagram of a feature information recognition system according to the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As used herein, the terms "first," "second," etc. do not denote any order or sequence, nor are they used to limit the invention; they are used only to distinguish elements or operations described with the same technical terms.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
To address the defects in the prior art, the invention provides a feature information identification method whose flow chart is shown in fig. 1, comprising the following steps:
s101: and acquiring a discrete data unit and a continuous data unit corresponding to the first unique identifier of the data group to be predicted.
There may be multiple data groups to be predicted; this should not be construed as a limitation of the present application.
In specific implementation, the step S101 is specifically executed as follows:
Firstly, a first unique identifier is set for each acquired data group to be predicted, wherein each data group to be predicted comprises a number of first characteristic data.
And secondly, each data group to be predicted is split according to the data type of each first characteristic data to generate a discrete data unit and a continuous data unit corresponding to each data group to be predicted. Each discrete data unit comprises: the first unique identifier and the discrete-type first feature data in the data group to be predicted corresponding to that first unique identifier. The continuous data unit corresponding to each discrete data unit comprises: the first unique identifier and the continuous-type first feature data in the data group to be predicted corresponding to that first unique identifier.
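The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the helper name `split_data_group`, the `uid` key, and the rule that string-valued features are discrete while numeric ones are continuous are all assumptions made for the example.

```python
# Minimal sketch of step S101 (assumed names). Each data group to be
# predicted is modeled as a dict of feature name -> value; string values
# are treated as discrete-type first feature data, numeric values as
# continuous-type first feature data.

def split_data_group(uid, data_group):
    """Split one data group into a discrete data unit and a continuous
    data unit, each carrying the first unique identifier `uid`."""
    discrete = {"uid": uid}
    continuous = {"uid": uid}
    for name, value in data_group.items():
        if isinstance(value, str):   # discrete-type feature
            discrete[name] = value
        else:                        # continuous-type feature
            continuous[name] = float(value)
    return discrete, continuous

# One data group to be predicted, keyed by first unique identifier "K1-1"
d, c = split_data_group("K1-1", {"occupation": "teacher", "income": 5200, "age": 34})
```

With multiple groups, running this once per group yields discrete and continuous units that stay in one-to-one correspondence through the shared identifier.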
S102: and inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model. Wherein the first predicted value includes: a first unique identification.
S103: and combining the continuous data unit corresponding to the first unique identifier and the first predicted value, and inputting the combined value into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model. Wherein the second predicted value includes: a first unique identification.
S104: and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
As can be seen from the process shown in fig. 1, the present application acquires a discrete data unit and a continuous data unit corresponding to the first unique identifier of a data group to be predicted; inputs the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model; combines the continuous data unit corresponding to the first unique identifier with the first predicted value and inputs the combined result into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model; and generates the characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value. This reduces the probability of overfitting caused by discrete data, keeps the machine learning algorithm simple and efficient, and thereby improves the efficiency of identifying characteristic information with a machine learning algorithm.
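The overall two-stage flow of S101-S104 can be sketched end to end as follows. This is a hedged, pure-Python stand-in: the per-category positive rate plays the role of the preset discrete model (roughly what logistic regression over one-hot discrete features learns), a fixed linear rule plays the role of the preset continuous model, and all names and data are illustrative, not from the patent.

```python
# Pure-Python sketch of the two-stage flow (S101-S104); all names and the
# toy models are illustrative assumptions, not the patented implementation.

def fit_discrete_model(discrete_units, labels):
    """Stand-in for the preset discrete model: the positive-label rate per
    category, roughly what logistic regression over one-hot discrete
    features would learn."""
    totals, hits = {}, {}
    for unit, y in zip(discrete_units, labels):
        cat = unit["occupation"]
        totals[cat] = totals.get(cat, 0) + 1
        hits[cat] = hits.get(cat, 0) + y
    return {cat: hits[cat] / totals[cat] for cat in totals}

def predict_discrete(model, unit):
    """First predicted value; carries the unit's first unique identifier."""
    return {"uid": unit["uid"], "p1": model.get(unit["occupation"], 0.5)}

def predict_continuous(cont_unit, first_pred):
    """Merge the continuous unit with the first predicted value and apply a
    stand-in continuous model (a fixed linear rule) to get the second
    predicted value."""
    score = 0.7 * first_pred["p1"] + 0.3 * (cont_unit["income"] > 50)
    return {"uid": cont_unit["uid"], "p2": score}

train_disc = [{"uid": i, "occupation": o} for i, o in enumerate("AABB")]
m = fit_discrete_model(train_disc, [1, 1, 0, 0])
p1 = predict_discrete(m, {"uid": "K1-1", "occupation": "A"})   # stage 1
p2 = predict_continuous({"uid": "K1-1", "income": 60}, p1)     # stage 2
```

The key design point survives the simplification: the continuous model never sees the raw high-cardinality discrete data, only a single numeric summary of it, so its trees (in the real GBDT case) stay small.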
To help those skilled in the art better understand the present invention, a more detailed embodiment is given below. As shown in fig. 2, the embodiment provides a feature information identification method in which a training process and a verification process are performed before the prediction process, and the prediction process is then performed with the optimal feature information identification model output by the verification process. The method comprises the following steps:
the first step is as follows: prediction process
S201: and acquiring a discrete data unit and a continuous data unit corresponding to the first unique identifier of the data group to be predicted.
There may be multiple data groups to be predicted; this should not be construed as a limitation of the present application.
In specific implementation, as shown in fig. 3, the step S201 specifically executes the following process:
s301: and respectively setting a first unique identifier for each acquired data group to be predicted. Wherein, each data group to be predicted comprises: a number of first characteristic data.
S302: and splitting each data group to be predicted according to the data type of each first characteristic data of each data group to be predicted to generate a discrete data unit and a continuous data unit corresponding to each data group to be predicted.
Each discrete data unit comprises: the first unique identifier and the discrete-type first feature data in the data group to be predicted corresponding to that first unique identifier. The continuous data unit corresponding to each discrete data unit comprises: the first unique identifier and the continuous-type first feature data in the data group to be predicted corresponding to that first unique identifier.
S202: and inputting the discrete data unit corresponding to each first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to each first unique identifier in the preset discrete model. Wherein the first predicted value includes: a first unique identification.
In specific implementation, the preset discrete model adopts an existing discrete algorithm, such as the logistic regression algorithm; the application is not limited to this.
S203: and combining the continuous data unit corresponding to the first unique identifier and the first predicted value, and inputting the combined value into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model. Wherein the second predicted value includes: a first unique identification.
In specific implementation, the preset continuous model adopts any existing continuous algorithm, such as the GBDT algorithm; the application is not limited to this.
S204: and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
The second step: Training process
S205: and acquiring discrete training data units and continuous training data units corresponding to the second unique identifier of the training data set.
In specific implementation, the training data sets are multiple sets, and each training data set comprises: a discrete training data unit and a continuous training data unit. The discrete training data units and the continuous training data units realize one-to-one correspondence through the second unique identification of the corresponding training data set.
As shown in fig. 4, the specific implementation of step S205 is as follows:
s401: and respectively setting a second unique identifier and first characteristic marking information for each acquired training data set. Wherein each training data set comprises: a number of second characteristic data. The second unique identification and the first characteristic marking information have a one-to-one correspondence relationship.
S402: and splitting each training data group into discrete training data units and continuous training data units corresponding to each training data group according to the data type of each second characteristic data of each training data group.
Each discrete training data unit comprises: the second unique identifier, the discrete-type second feature data in the training data set corresponding to the second unique identifier, and the first feature labeling information of the training data set corresponding to the second unique identifier. The continuous training data unit corresponding to each discrete training data unit comprises: the second unique identifier, the continuous-type second feature data in the training data set corresponding to the second unique identifier, and the first feature labeling information of the training data set corresponding to the second unique identifier.
The number of discrete training data units in each training data set equals the number of continuous training data units, and the discrete training data units, continuous training data units, and training data sets are placed in one-to-one correspondence through the second unique identifier of the training data set.
S206: and inputting the discrete training data unit corresponding to the second unique identifier into each preset discrete algorithm to calculate and generate a discrete training model corresponding to each preset discrete algorithm and a first training predicted value corresponding to each preset discrete algorithm. Wherein the first training predictor comprises: a second unique identification.
The preset discrete algorithm is a plurality of existing discrete algorithms, such as a logistic regression algorithm, a naive bayes algorithm, a decision tree algorithm, and the like, which is not limited in the present application.
When the method is specifically implemented, the discrete training data units corresponding to the second unique identifiers are input into the logistic regression algorithm to calculate and generate the discrete training models corresponding to the logistic regression algorithm and the first training predicted values corresponding to the logistic regression algorithm.
And inputting the discrete training data unit corresponding to each second unique identifier into a naive Bayes discrete algorithm to calculate and generate a discrete training model corresponding to the naive Bayes discrete algorithm and a first training predicted value corresponding to the naive Bayes discrete algorithm, and by analogy, inputting the discrete training data unit corresponding to each second unique identifier into other preset discrete algorithms to calculate and generate discrete training models corresponding to the other preset discrete algorithms and first training predicted values corresponding to the other preset discrete algorithms respectively.
And each first training predicted value has a one-to-one correspondence relationship with each second unique identifier, and each first training predicted value also has a one-to-one correspondence relationship with each discrete algorithm.
S207: and combining the continuous training data units corresponding to the second unique identifiers with the first training predicted values corresponding to the second unique identifiers of each preset discrete algorithm, and inputting the combined values into the preset continuous algorithm to calculate and generate the continuous training models corresponding to the discrete training models of the preset discrete algorithms.
The preset continuous algorithm is a plurality of known continuous algorithms, such as GBDT algorithm, linear regression algorithm, K-means algorithm, etc., and the present application is not limited thereto.
As shown in fig. 5, the step S207 specifically executes the following steps:
s501: and respectively combining the continuous training data unit corresponding to each second unique identifier with the first training predicted value corresponding to each second unique identifier of each preset discrete algorithm to generate a combined training data unit corresponding to each second unique identifier of each preset discrete algorithm.
The merged training data unit has a one-to-one correspondence relationship with the preset discrete algorithm, and has a one-to-one correspondence relationship with the second unique identifier in each preset discrete algorithm.
In specific implementation, the continuous training data unit corresponding to each second unique identifier is combined with the first training prediction value corresponding to each second unique identifier of the logistic regression algorithm to generate a combined training data unit corresponding to each second unique identifier of the logistic regression algorithm.
And combining the continuous training data unit corresponding to each second unique identifier with the first training predicted value corresponding to each second unique identifier of the naive Bayes discrete algorithm to generate a combined training data unit corresponding to each second unique identifier of the naive Bayes discrete algorithm. And by analogy, combining the continuous training data unit corresponding to each second unique identifier with the first training predicted value corresponding to each second unique identifier of other preset discrete algorithms to generate a combined training data unit corresponding to each second unique identifier of other preset discrete algorithms.
S502: and inputting each combined training data unit into a preset continuous algorithm to calculate and generate a continuous training model corresponding to the discrete training model of each preset discrete algorithm. The preset continuous algorithm may be multiple or one, and the present application is not limited thereto.
Wherein the preset continuous algorithm comprises: the GBDT algorithm, the linear regression algorithm, the K-means algorithm, etc., but the present application is not limited thereto.
When the method is specifically implemented, the merged training data unit corresponding to each second unique identifier of the logistic regression algorithm is input into the GBDT algorithm to calculate and generate a continuous training model corresponding to the discrete training model of the logistic regression algorithm.
The merged training data units corresponding to the second unique identifiers of different preset discrete algorithms may input the same preset continuous algorithm (as shown in example 1), or may input different preset continuous algorithms (as shown in example 2).
Example 1: and inputting the merged training data unit corresponding to each second unique identifier of the logistic regression algorithm into the GBDT algorithm to calculate and generate a continuous training model corresponding to the discrete training model of the logistic regression discrete algorithm. And analogizing in sequence, inputting the merged training data unit corresponding to each second unique identifier of other preset discrete algorithms into the GBDT algorithm to calculate and generate continuous training models corresponding to the discrete training models of the other preset discrete algorithms.
Example 2: and inputting the merged training data unit corresponding to each second unique identifier of the logistic regression algorithm into the GBDT algorithm to calculate and generate a continuous training model corresponding to the discrete training model of the logistic regression algorithm.
And inputting the merged training data unit corresponding to each second unique identifier of the naive Bayes algorithm into a linear regression algorithm to calculate and generate a continuous training model corresponding to the discrete training model of the naive Bayes algorithm.
And inputting the merged training data unit corresponding to each second unique identifier of other preset discrete algorithms into any one of the GBDT algorithm, the linear regression algorithm or the K-means algorithm to calculate and generate a continuous training model corresponding to the discrete training model of the other preset discrete algorithms.
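The merging of S501 can be sketched as follows, assuming the continuous training data units and the first training predicted values are held in dictionaries keyed by the second unique identifier; the function and field names (`merge_units`, `p1`, the algorithm labels) are illustrative assumptions, not from the patent.

```python
# Sketch of S501 (assumed names): merge each continuous training data unit
# with each preset discrete algorithm's first training predicted value for
# the same second unique identifier.

def merge_units(continuous_units, first_preds):
    """Both arguments are dicts keyed by the second unique identifier;
    merge them one-to-one by key into combined training data units."""
    return {uid: {**continuous_units[uid], "p1": first_preds[uid]}
            for uid in continuous_units}

cont = {"K1-1": {"income": 60.0}, "K1-2": {"income": 40.0}}
# First training predicted values, one dict per preset discrete algorithm
preds_by_algo = {"logreg": {"K1-1": 0.9, "K1-2": 0.2},
                 "nbayes": {"K1-1": 0.7, "K1-2": 0.4}}
merged = {algo: merge_units(cont, preds) for algo, preds in preds_by_algo.items()}
```

Each entry of `merged` is what S502 would then feed to a preset continuous algorithm, yielding one continuous training model per preset discrete algorithm.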
S208: and combining each discrete training model with the continuous training model corresponding to each discrete training model to generate each characteristic information recognition model.
In specific implementation, the feature information identification models are multiple, and each feature information identification model comprises: a discrete training model and a continuous training model corresponding to the discrete training model.
And combining the discrete training model corresponding to the logistic regression algorithm with the continuous training model corresponding to the discrete training model of the logistic regression algorithm to generate a feature information recognition model.
And combining the discrete training model corresponding to the naive Bayes discrete algorithm with the continuous training model corresponding to the discrete training model of the naive Bayes discrete algorithm to generate another feature information recognition model.
And by analogy, combining the discrete training models corresponding to the other preset discrete algorithms with the continuous training models corresponding to those discrete training models to generate the other characteristic information recognition models.
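The pairing of S208 can be sketched as a small wrapper class that holds one discrete training model and its matching continuous training model. The class name, the callable stand-ins for the trained models, and the algorithm labels are all illustrative assumptions.

```python
# Sketch of S208 (assumed names): pair each discrete training model with
# the continuous training model trained on its first training predicted
# values. Trained models are represented here as plain callables.

class FeatureRecognitionModel:
    def __init__(self, name, discrete_model, continuous_model):
        self.name = name
        self.discrete_model = discrete_model      # e.g. logistic regression
        self.continuous_model = continuous_model  # e.g. GBDT

    def predict(self, discrete_unit, continuous_unit):
        p1 = self.discrete_model(discrete_unit)   # first predicted value
        return self.continuous_model(continuous_unit, p1)

# Toy stand-ins for two trained discrete models and their continuous models
discrete_models = {"logreg": lambda u: 0.8, "nbayes": lambda u: 0.6}
continuous_models = {"logreg": lambda u, p1: 0.5 * p1 + 0.5 * u["x"],
                     "nbayes": lambda u, p1: 0.5 * p1 + 0.5 * u["x"]}
models = [FeatureRecognitionModel(k, discrete_models[k], continuous_models[k])
          for k in discrete_models]
```

The verification process can then iterate over `models` and pick the best-scoring pair.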
The third step: verification process
S209: and acquiring a third unique identifier of the verification data set, the second feature marking information and discrete verification data units and continuous verification data units corresponding to the third unique identifier.
There are multiple verification data groups, and each verification data group includes: one discrete verification data unit and one continuous verification data unit. The discrete verification data unit and the continuous verification data unit are placed in one-to-one correspondence through the third unique identifier of the corresponding verification data group; the second feature labeling information and the third unique identifier have a one-to-one correspondence relationship.
In specific implementation, as shown in fig. 6, step S209 specifically executes the following process:
S601: a third unique identifier and second feature labeling information are respectively set for each acquired verification data set. Wherein each verification data set comprises: a number of third feature data.
S602: each verification data set is split according to the data type of each third feature data of that verification data set to generate the discrete verification data unit and the continuous verification data unit of that verification data set.
Wherein each discrete verification data unit comprises: the third unique identifier, the discrete-type third feature data in the verification data set corresponding to the third unique identifier, and the second feature labeling information of that verification data set. The continuous verification data unit corresponding to each discrete verification data unit comprises: the third unique identifier, the continuous-type third feature data in the verification data set corresponding to the third unique identifier, and the second feature labeling information of that verification data set.
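The split in steps S601 and S602 can be sketched as follows. The field names (`K`, `B`, `city`, `occupation`, `age`, `income`) and the dict-based layout are illustrative assumptions, not the patent's data format.

```python
# Assumed discrete vs. continuous field lists for this sketch.
DISCRETE_FIELDS = ("city", "occupation")
CONTINUOUS_FIELDS = ("age", "income")


def split_group(group: dict) -> tuple:
    """Split one verification data set into its discrete and continuous units.

    Both units carry the unique identifier K and the labeling information B,
    which is what keeps them in one-to-one correspondence."""
    base = {"K": group["K"], "B": group["B"]}
    discrete_unit = {**base, **{f: group[f] for f in DISCRETE_FIELDS}}
    continuous_unit = {**base, **{f: group[f] for f in CONTINUOUS_FIELDS}}
    return discrete_unit, continuous_unit


group = {"K": 1, "city": "021", "occupation": "0", "age": 20, "income": 100000, "B": 0}
d_unit, c_unit = split_group(group)
```

Because the identifier travels with both halves, the two units can later be re-joined (or matched against predictions) purely by `K`, with no reliance on row order.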
S210: the discrete verification data units corresponding to the third unique identifiers are respectively input into each discrete training model to calculate and generate the first verification predicted value corresponding to each discrete training model. Wherein the first verification predicted value comprises: the third unique identifier.
S211: the continuous verification data unit corresponding to the third unique identifier is combined with the first verification predicted value, and the combined data is input into the continuous training model corresponding to the discrete training model that generated each first verification predicted value, to calculate and generate the second verification predicted value corresponding to each continuous training model. Wherein the second verification predicted value comprises: the third unique identifier.
In specific implementation, as shown in fig. 7, step S211 specifically includes the following steps:
S701: the continuous verification data unit corresponding to each third unique identifier is respectively combined with the first verification predicted value corresponding to that third unique identifier of each discrete training model, to generate the merged verification data unit corresponding to each third unique identifier of each discrete training model.
S702: the merged verification data unit corresponding to each third unique identifier of each discrete training model is input into the continuous training model corresponding to that discrete training model to calculate and generate the second verification predicted value corresponding to each third unique identifier of each continuous training model.
S212: and generating an optimal feature information identification model according to the second feature labeling information corresponding to the third unique identifiers and the second verification predicted value corresponding to the third unique identifiers in each continuous training model.
The optimal characteristic information identification model comprises the following steps: a preset discrete model and a preset continuous model.
In specific implementation, as shown in fig. 8, the specific process of step S212 is executed as follows, which is not limited in this application:
S801: the second feature labeling information corresponding to each third unique identifier and the second verification predicted value corresponding to that third unique identifier in each continuous training model are subtracted to generate the difference value corresponding to each third unique identifier in each continuous training model.
S802: the difference values corresponding to the third unique identifiers in each continuous training model are summed to generate the verification value of the feature information identification model corresponding to that continuous training model.
S803: the verification values of the feature information identification models are sorted, and the feature information identification model corresponding to the minimum verification value is taken as the optimal feature information identification model.
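Steps S801 to S803 amount to scoring each candidate model by the total deviation of its second verification predicted values from the labels, then taking the minimum. A minimal sketch, assuming absolute differences and dictionaries keyed by the third unique identifier (both assumptions):

```python
def verification_value(labels: dict, predictions: dict) -> float:
    # Sum of per-identifier differences between the labeling information and
    # the second verification predicted values (absolute value assumed here).
    return sum(abs(labels[k] - predictions[k]) for k in labels)


labels = {1: 0, 2: 0, 3: 1}           # second feature labeling information, by identifier
candidate_predictions = {             # second verification predicted values per model
    "Z1": {1: 0.1, 2: 0.2, 3: 0.7},
    "Z2": {1: 0.4, 2: 0.5, 3: 0.2},
}
scores = {name: verification_value(labels, preds)
          for name, preds in candidate_predictions.items()}
optimal = min(scores, key=scores.get)  # model with the minimum verification value
```

Under these made-up numbers, Z1 deviates by 0.1 + 0.2 + 0.3 = 0.6 and Z2 by 1.7, so Z1 would be selected as the optimal feature information identification model.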
In one embodiment, the characteristic information includes: fraud characteristic information, potential large customer characteristic information, etc., which the present application is not limited to.
In order to make the present invention better understood by those skilled in the art, a more detailed scenario embodiment is given below.
As shown in fig. 9, an embodiment of the present invention provides a fraud feature information identification method, including the following steps:
The first step: training process
The unique identifier Ki is set to be a positive integer, and the feature labeling information Bi comprises: 1 and 0, where i is a positive integer greater than or equal to 1. When the feature labeling information Bi is 1, the training data set is a training data set with fraud features; when Bi is 0, it is a training data set without fraud features. Training data sets with fraud features are identified as fraudulent clients, and training data sets without fraud features are identified as non-fraudulent clients.
S901: a unique identifier K1 and first feature labeling information B1 are respectively set for each acquired training data set H. Each unique identifier K1 and each first feature annotation information B1 have a one-to-one correspondence relationship.
The unique identifier K1 is a positive integer, and the first feature labeling information B1 comprises: 1 and 0. When the first feature labeling information B1 is 1, it indicates that the training data set H is a training data set with fraud features; when B1 is 0, it indicates that the training data set H is a training data set without fraud features. Training data sets with fraud features are identified as fraudulent clients, and training data sets without fraud features are identified as non-fraudulent clients.
Each training data set H includes: a number of characteristic data T1, as shown in table 1.
TABLE 1
| Training data set H | Unique identification K1 | T1-age | T1-city | T1-occupation | T1-income | First feature labeling information B1 |
| H1 | 00000001 | 20 | 021 (Shanghai) | 0 (teacher) | 100000 | 0 |
| H2 | 00000002 | 60 | 010 (Beijing) | 1 (doctor) | 150000 | 0 |
| …… | …… | …… | …… | …… | …… | …… |
| H99999999 | 99999999 | 30 | 020 (Guangzhou) | 5 (property management) | 2000 | 1 |
Wherein the training data set H comprises feature data T1 such as age, city, occupation and income, which is not limited in this application. T1-city and T1-occupation are set as discrete-type feature data, and T1-age and T1-income are set as continuous-type feature data. There are multiple training data sets H, including: H1, H2, … …, H99999999, but the present application is not limited thereto.
S902: and splitting each training data set H according to the data type of each feature data T1 of each training data set H to generate a discrete training data unit T11 and a continuous training data unit T12 corresponding to each training data set H.
Wherein each discrete training data unit T11-i comprises: the unique identifier K1-i, the feature data T1 of the discrete type in the training data group Hi corresponding to the unique identifier K1-i and the first feature labeling information B1-i of the training data group Hi corresponding to the unique identifier K1-i, wherein i is a positive integer greater than or equal to 1.
Specifically, the discrete training data unit T11-i includes: the characteristic data T1-city, the characteristic data T1-occupation, the unique identifier K1-i and the first characteristic marking information B1-i. As shown in table 2, the discrete training data units T11-i corresponding to each training data group Hi are T11-1, T11-2, … …, and T11-99999999, respectively, where i is 1, 2, … …, 99999999.
TABLE 2
| Training data set H | Discrete training data unit T11 | Unique identification K1 | T1-city | T1-occupation | First feature labeling information B1 |
| H1 | T11-1 | 00000001 | 021 (Shanghai) | 0 (teacher) | 0 |
| H2 | T11-2 | 00000002 | 010 (Beijing) | 1 (doctor) | 0 |
| …… | …… | …… | …… | …… | …… |
| H99999999 | T11-99999999 | 99999999 | 020 (Guangzhou) | 5 (property management) | 1 |
The continuous training data unit T12-i corresponding to each discrete training data unit T11-i comprises: the unique identifier K1-i, the continuous-type feature data T1 in the training data set Hi corresponding to the unique identifier K1-i, and the first feature labeling information B1-i of the training data set Hi corresponding to the unique identifier K1-i, where i is a positive integer greater than or equal to 1, as shown in Table 3.
TABLE 3
| Training data set H | Continuous training data unit T12 | Unique identification K1 | T1-age | T1-income | First feature labeling information B1 |
| H1 | T12-1 | 00000001 | 20 | 100000 | 0 |
| H2 | T12-2 | 00000002 | 60 | 150000 | 0 |
| …… | …… | …… | …… | …… | …… |
| H99999999 | T12-99999999 | 99999999 | 30 | 2000 | 1 |
Specifically, the continuous training data unit T12-i comprises: the feature data T1-age, the feature data T1-income, the unique identifier K1-i and the first feature labeling information B1-i, as shown in Table 3. The continuous training data units T12-i corresponding to the training data sets Hi are T12-1, T12-2, … …, and T12-99999999, where i is 1, 2, … …, 99999999.
The number of discrete training data units T11-i equals the number of continuous training data units T12-i, and the discrete training data units T11-i, the continuous training data units T12-i and the training data sets Hi are kept in one-to-one correspondence through the unique identifier K1-i of each training data set Hi.
S903: and inputting each discrete training data unit T11-i into each preset discrete algorithm Sj to calculate and generate a discrete training model M-Sj corresponding to each preset discrete algorithm Sj and a first training prediction value X-Sj-i corresponding to each preset discrete algorithm Sj, wherein i and j are positive integers more than or equal to 1. Wherein the first training prediction value X-Sj-i comprises: the unique identification K1-i and the predicted value Xi. Wherein i is 1, 2, 3, … … 99999999.
In specific implementation, the preset discrete algorithm Sj includes: the present application is not limited to the logistic regression algorithm S1, the naive bayes algorithm S2, and the decision tree algorithm S3, wherein j is 1, 2, 3, … ….
As shown in FIG. 10, T11-1 corresponding to the unique identifier K1-1, T11-2 and … … corresponding to the unique identifier K1-2 and T11-99999999 corresponding to the unique identifier K1-99999999 are all input into a logistic regression algorithm S1 to calculate and generate a discrete training model M-S1 corresponding to the logistic regression algorithm S1 and a first training predicted value X-S1-1, a first training predicted value X-S1-2, a first training predicted value … … and a first training predicted value X-S1-99999999 corresponding to each unique identifier K1-i in the logistic regression algorithm S1.
Wherein the first training prediction value X-S1-1 comprises: the unique identifier K1-1 and the training result value X1, and the first training predicted value X-S1-2 comprises: the unique identifier K1-2 and the training result values X2, … …, and the first training predicted value X-S1-99999999 comprises: the unique identifier K1-99999999 and the training result value X99999999 are shown in Table 4.
TABLE 4
| First training predicted value X-S1-i | Unique identification K1-i | Training result value Xi |
| X-S1-1 | 00000001 | -2.45 |
| X-S1-2 | 00000002 | -4.56 |
| …… | …… | …… |
| X-S1-99999999 | 99999999 | 10.23 |
And sequentially calculating to generate the discrete training models M-Sj corresponding to each preset discrete algorithm Sj and the first training predicted values X-Sj-i corresponding to each preset discrete algorithm Sj by referring to the calculation processes of the discrete training models M-S1 corresponding to the logistic regression algorithm S1 and the first training predicted values X-S1-i corresponding to the logistic regression algorithm S1.
And each first training predicted value X-Sj-i has a one-to-one correspondence relationship with each unique identifier K1-i, and each first training predicted value X-Sj-i also has a one-to-one correspondence relationship with each preset discrete algorithm Sj.
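Step S903 can be sketched with stand-in scoring functions in place of the real algorithms; the dict layout and the scoring functions below are purely illustrative assumptions, chosen only to show how each algorithm Sj yields one prediction per unique identifier K1-i.

```python
# Stand-in sketch of step S903: each preset discrete algorithm Sj scores every
# discrete training data unit and emits a first training predicted value
# X-Sj-i keyed by the unique identifier K1-i. Real algorithms (logistic
# regression, naive Bayes, decision trees) are replaced by simple lambdas.
discrete_units = [
    {"K": 1, "city": 21, "occupation": 0},
    {"K": 2, "city": 10, "occupation": 1},
]

algorithms = {
    "S1": lambda u: 0.1 * u["city"] - u["occupation"],   # stands in for logistic regression
    "S2": lambda u: u["occupation"] - 0.05 * u["city"],  # stands in for naive Bayes
}

# first_predictions["S1"] maps each unique identifier K1-i to its score X-S1-i.
first_predictions = {
    name: {u["K"]: algo(u) for u in discrete_units}
    for name, algo in algorithms.items()
}
```

The nested-dict shape mirrors the one-to-one correspondences in the text: the outer key is the algorithm Sj, the inner key is the unique identifier K1-i.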
S904: and respectively combining the continuous training data unit T12-i corresponding to each unique identification K1-i with the first training predicted value X-Sj-i corresponding to each unique identification K1-i of each preset discrete algorithm Sj to generate a combined training data unit T13-Sj-i corresponding to each unique identification K1-i of each preset discrete algorithm Sj. Wherein, i and j are positive integers which are more than or equal to 1.
The combined training data units T13-Sj-i have a corresponding relation with the preset discrete algorithm Sj, and have a one-to-one corresponding relation with the unique identifier K1-i in each preset discrete algorithm Sj.
In specific implementation, as shown in fig. 11, the continuous training data unit T12-1 corresponding to the unique identifier K1-1 and the first training prediction value X-S1-1 corresponding to the unique identifier K1-1 in the logistic regression algorithm S1 are merged to generate a merged training data unit T13-S1-1 corresponding to the unique identifier K1-1 in the logistic regression algorithm S1; combining the continuous training data unit T12-2 corresponding to the unique identifier K1-2 with the first training predicted value X-S1-2 corresponding to the unique identifier K1-2 in the logistic regression algorithm S1 to generate a combined training data unit T13-S1-2 corresponding to the unique identifier K1-2 in the logistic regression algorithm S1; … …, combining the continuous training data units T12-99999999 corresponding to the unique identifications K1-99999999 with the first training predicted values X-S1-99999999 corresponding to the unique identifications K1-99999999 in the logistic regression algorithm S1 to generate combined training data units T13-S1-99999999 corresponding to the unique identifications K1-99999999 in the logistic regression algorithm S1.
By analogy, the continuous training data units T12-i corresponding to the unique identifications K1-i are respectively combined with the first training predicted values X-Sj-i corresponding to the unique identifications K1-i in any other preset discrete algorithm Sj to generate combined training data units T13-Sj-i corresponding to the unique identifications K1-i in the preset discrete algorithm Sj.
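The merge of step S904 is a join on the unique identifier; a sketch under assumed field names (`age`, `income`, `B`, `X` are illustrative, not the patent's schema):

```python
# Sketch of step S904: join each continuous training data unit with the first
# training predicted value of the same unique identifier, producing the merged
# unit T13 that will feed the continuous algorithm.
continuous_units = {
    1: {"age": 20, "income": 100000, "B": 0},
    2: {"age": 60, "income": 150000, "B": 0},
}
first_predictions = {1: -2.45, 2: -4.56}  # X-S1-i keyed by K1-i

merged_units = {
    k: {**unit, "X": first_predictions[k]}  # merge strictly by unique identifier
    for k, unit in continuous_units.items()
}
```

Joining by key rather than by position is what the patent's unique identifiers buy: the merge stays correct even if the discrete and continuous units are produced or shuffled independently.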
S905: and inputting each combined training data unit T13-Sj-i into at least one preset continuous algorithm Lj to calculate and generate a continuous training model M-Lj corresponding to the discrete training model M-Sj of each preset discrete algorithm Sj.
There may be one preset continuous algorithm Lj or multiple, which is not limited in this application. The preset continuous algorithms Lj include: the GBDT algorithm L1, the linear regression algorithm L2, the K-means algorithm L3, and the like, but the present application is not limited thereto, where j is 1, 2, 3, … ….
In specific implementation, as shown in fig. 12, the merged training data unit T13-S1-1 corresponding to the unique identifier K1-1 of the logistic regression algorithm S1, the merged training data unit T13-S1-2 corresponding to the unique identifier K1-2, … …, and the merged training data units T13-S1-99999999 corresponding to the unique identifiers K1-99999999 are all input into the GBDT algorithm L1 to calculate and generate the continuous training model M-L1 corresponding to the discrete training model M-S1 of the logistic regression algorithm S1.
In this embodiment, the merged training data units T13-Sj-i corresponding to the unique identifiers K1-i of different preset discrete algorithms Sj are set to input different preset continuous algorithms Lj, for example:
The merged training data unit T13-S2-1 corresponding to the unique identifier K1-1, the merged training data unit T13-S2-2 corresponding to the unique identifier K1-2, … …, and the merged training data units T13-S2-99999999 corresponding to the unique identifiers K1-99999999 of the naive Bayes algorithm S2 are all input into the linear regression algorithm L2 to calculate and generate the continuous training model M-L2 corresponding to the discrete training model M-S2 of the naive Bayes algorithm S2.
S906: and combining each discrete training model M-Sj with the continuous training model M-Lj corresponding to each discrete training model M-Sj to generate each characteristic information identification model Zj.
As shown in fig. 13, the feature information recognition models Zj are plural, and each feature information recognition model Zj includes: a discrete training model M-Sj and a continuous training model M-Lj corresponding to the discrete training model M-Sj.
In specific implementation, the discrete training model M-S1 corresponding to the logistic regression algorithm S1 and the continuous training model M-L1 corresponding to the discrete training model M-S1 of the logistic regression algorithm S1 are combined to generate the feature information recognition model Z1. The discrete training model M-S2 corresponding to the naive Bayes algorithm S2 and the continuous training model M-L2 corresponding to the discrete training model M-S2 of the naive Bayes algorithm S2 are combined to generate the feature information recognition model Z2. By analogy, the discrete training models M-Sj corresponding to the other preset discrete algorithms Sj are combined with the continuous training models M-Lj corresponding to those discrete training models M-Sj to generate the other feature information recognition models Zj.
The second step: verification process
S907: a unique identifier K2 and second feature marking information B2 are respectively set for each acquired verification data group G. Each unique identifier K2 and each second feature label information B2 have a one-to-one correspondence relationship.
The unique identifier K2 is a positive integer, and the second feature labeling information B2 comprises: 1 and 0. When the second feature labeling information B2 is 1, the verification data group G is a verification data group with fraud features; when B2 is 0, it is a verification data group without fraud features. The verification data group G with fraud features is identified as a fraudulent client, and the verification data group G without fraud features is identified as a non-fraudulent client.
Each verification data group G includes: a number of characteristic data T2, as shown in table 5.
Wherein the verification data group G comprises feature data T2 such as age, city, occupation and income, which is not limited in this application. T2-city and T2-occupation are set as discrete-type feature data, and T2-age and T2-income are set as continuous-type feature data. There are multiple verification data groups G, including: G1, G2, … …, G99999999, but the present application is not limited thereto.
TABLE 5
| Verification data group G | Unique identification K2 | T2-age | T2-city | T2-occupation | T2-income | Second feature labeling information B2 |
| G1 | 00000001 | 20 | 021 (Shanghai) | 0 (teacher) | 100000 | 0 |
| G2 | 00000002 | 60 | 010 (Beijing) | 1 (doctor) | 150000 | 0 |
| …… | …… | …… | …… | …… | …… | …… |
| G99999999 | 99999999 | 30 | 020 (Guangzhou) | 5 (property management) | 2000 | 1 |
S908: the discrete verification data units T21 and the continuous verification data units T21 of each verification data group G are generated by splitting each verification data group G according to the data type of the respective feature data T2 of each verification data group G.
Wherein each discrete verification data unit T21-i comprises: the unique identifier K2-i, the discrete type feature data T2 in the verification data group G corresponding to the unique identifier K2-i and the second feature marking information B2-i of the verification data group G corresponding to the unique identifier K2-i, wherein i is a positive integer greater than or equal to 1.
Specifically, the discrete verification data unit T21-i includes: the characteristic data T2-city, the characteristic data T2-occupation, the unique identifier K2-i and the second characteristic marking information B2-i. As shown in table 6, the discrete verification data units T21-i corresponding to each verification data set Gi are T21-1, T21-2, … …, and T21-99999999, respectively, where i is 1, 2, … …, 99999999.
TABLE 6
| Verification data group G | Discrete verification data unit T21 | Unique identification K2 | T2-city | T2-occupation | Second feature labeling information B2 |
| G1 | T21-1 | 00000001 | 021 (Shanghai) | 0 (teacher) | 0 |
| G2 | T21-2 | 00000002 | 010 (Beijing) | 1 (doctor) | 0 |
| …… | …… | …… | …… | …… | …… |
| G99999999 | T21-99999999 | 99999999 | 020 (Guangzhou) | 5 (property management) | 1 |
TABLE 7
| Verification data group G | Continuous verification data unit T22 | Unique identification K2 | T2-age | T2-income | Second feature labeling information B2 |
| G1 | T22-1 | 00000001 | 20 | 100000 | 0 |
| G2 | T22-2 | 00000002 | 60 | 150000 | 0 |
| …… | …… | …… | …… | …… | …… |
| G99999999 | T22-99999999 | 99999999 | 30 | 2000 | 1 |
The continuous verification data unit T22-i corresponding to each discrete verification data unit T21-i comprises: the unique identifier K2-i, the continuous-type feature data T2 in the verification data group Gi corresponding to the unique identifier K2-i, and the second feature labeling information B2-i of the verification data group Gi corresponding to the unique identifier K2-i, where i is a positive integer greater than or equal to 1, as shown in Table 7.
Specifically, the continuous verification data unit T22-i comprises: the feature data T2-age, the feature data T2-income, the unique identifier K2-i and the second feature labeling information B2-i, as shown in Table 7. The continuous verification data units T22-i corresponding to the verification data groups Gi are T22-1, T22-2, … …, and T22-99999999, where i is 1, 2, … …, 99999999.
The number of discrete verification data units T21-i equals the number of continuous verification data units T22-i, and the discrete verification data units T21-i, the continuous verification data units T22-i and the verification data groups Gi are kept in one-to-one correspondence through the unique identifier K2-i of each verification data group Gi.
S909: each discrete verification data unit T21-i is input into each discrete training model M-Sj to calculate and generate each first verification predicted value Y1-Mj-i corresponding to each discrete training model M-Sj, where i and j are positive integers greater than or equal to 1.
Wherein the first verification prediction value Y1-Mj-i comprises: the unique identification K2-i and the first verification result value Y1-i. Wherein i is 1, 2, 3, … … 99999999.
In specific implementation, as shown in fig. 14, the discrete verification data unit T21-1, the discrete verification data units T21-2, … … and the discrete verification data units T21-99999999 are input into the discrete training model M-S1 to calculate and generate the first verification predicted value Y1-M1-1, the first verification predicted value Y1-M1-2, … … and the first verification predicted value Y1-M1-99999999 corresponding to the discrete training model M-S1.
Wherein the first verification prediction value Y1-M1-1 comprises: the unique identifier K2-1 and the first verification result value Y1-1, and the first verification prediction value Y1-M1-2 comprises: the unique identifier K2-2 and the first verification result value Y1-2, … …, and the first verification prediction value Y1-M1-99999999 comprise: the unique identifiers K2-99999999 and the first verification result values Y1-99999999 are shown in Table 8.
TABLE 8
| First verification predicted value Y1-M1-i | Unique identification K2-i | First verification result value Y1-i |
| Y1-M1-1 | 00000001 | -2.45 |
| Y1-M1-2 | 00000002 | -4.56 |
| …… | …… | …… |
| Y1-M1-99999999 | 99999999 | 10.23 |
And sequentially calculating and generating the first verification predicted values Y1-Mj-i corresponding to other discrete training models M-Sj by referring to the calculation process of the first verification predicted values Y1-M1-i corresponding to the discrete training models M-S1. And each first verification predicted value Y1-Mj-i has a one-to-one correspondence relation with each unique identification K2-i.
S910: and respectively combining the continuous verification data units T22-i corresponding to each unique identification K2-i and the first verification predicted values Y1-Mj-i corresponding to the unique identifications K2-i in each discrete training model M-Sj to generate combined verification data units T23-Sj-i corresponding to the unique identifications K2-i of each discrete training model M-Sj. Wherein, i and j are positive integers which are more than or equal to 1.
The merged verification data unit T23-Sj-i has a corresponding relation with the discrete training models M-Sj, and has a one-to-one corresponding relation with each unique identifier K2-i in each discrete training model M-Sj.
In specific implementation, as shown in fig. 15, the continuous verification data unit T22-1 corresponding to the unique identifier K2-1 is combined with the first verification predicted value Y1-M1-1 corresponding to the unique identifier K2-1 in the discrete training model M-S1 to generate a combined verification data unit T23-S1-1 corresponding to the unique identifier K2-1 of the discrete training model M-S1; and combining the continuous verification data unit T22-2 corresponding to the unique identifier K2-2 with the first verification predicted value Y1-M1-2 corresponding to the unique identifier K2-2 in the discrete training model M-S1 to generate a combined verification data unit T23-S1-2, … … corresponding to the unique identifier K2-2 of the discrete training model M-S1, and combining the continuous verification data unit T22-99999999 corresponding to the unique identifier K2-99999999 with the first verification predicted value Y1-M1-99999999 corresponding to the unique identifier K2-99999999 in the discrete training model M-S1 to generate a combined verification data unit T23-S1-99999999 corresponding to the unique identifier K2-99999999 of the discrete training model M-S1.
By analogy, the continuous verification data units T22-i corresponding to the unique identifications K2-i and the first verification predicted values Y1-Mj-i corresponding to the unique identifications K2-i in any one of the other discrete training models M-Sj are combined to generate combined verification data units T23-Sj-i corresponding to the unique identifications K2-i of any one of the discrete training models M-Sj.
S911: and inputting the merged verification data unit T23-Sj-i corresponding to each unique identifier K2-i of each discrete training model M-Sj into the continuous training model M-Lj corresponding to each discrete training model M-Sj to calculate and generate a second verification predicted value Y2-Mj-i corresponding to each unique identifier K2-i of each continuous training model M-Lj, wherein i and j are positive integers which are more than or equal to 1.
Wherein the second verification prediction value Y2-Mj-i comprises: the unique identifier K2-i and a second verification result value Y2-i. Wherein i is 1, 2, 3, … … 99999999.
In specific implementation, as shown in fig. 16, a merged verification data unit T23-S1-1 corresponding to the unique identifier K2-1, a merged verification data unit T23-S1-2, … … corresponding to the unique identifier K2-2, and a merged verification data unit T23-S1-99999999 corresponding to the unique identifier K2-99999999 in the discrete training model M-S1 are input into a continuous training model M-L1 corresponding to the discrete training model M-S1 to calculate and generate a second verification prediction value Y2-M1-1 corresponding to the unique identifier K2-1, a second verification prediction value Y2-M1-2, a second verification prediction value … … corresponding to the unique identifier K2-2, and a second verification prediction value Y2-M1-99999999 corresponding to the unique identifier K2-99999999 of the continuous training model M-L1.
Wherein the second verification prediction value Y2-M1-1 comprises: the unique identifier K2-1 and the second verification result value Y2-1, and the second verification prediction value Y2-M1-2 comprise: the unique identifier K2-2 and the second verification result value Y2-2, … …, and the second verification prediction value Y2-M1-99999999 comprise: the unique identifiers K2-99999999 and the second verification result values Y2-99999999 are shown in Table 9.
TABLE 9
| Second verification predicted value Y2-M1-i | Unique identification K2-i | Second verification result value Y2-i |
| Y2-M1-1 | 00000001 | -2.45 |
| Y2-M1-2 | 00000002 | -4.56 |
| …… | …… | …… |
| Y2-M1-99999999 | 99999999 | 10.23 |
And inputting combined verification data units T23-S2-1 corresponding to the unique identifier K2-1, combined verification data units T23-S2-2 and … … corresponding to the unique identifier K2-2 and combined verification data units T23-S2-99999999 corresponding to the unique identifier K2-99999999 in the discrete training model M-S2 into a continuous training model M-L2 corresponding to the discrete training model M-S2 to calculate and generate a second verification prediction value Y2-M2-1 corresponding to the unique identifier K2-1, a second verification prediction value Y2-M2-2 and … … corresponding to the unique identifier K2-2 and a second verification prediction value Y2-M2-99999999 corresponding to the unique identifier K2-99999999 of the continuous training model M-L2.
And sequentially calculating and generating second verification predicted values Y2-Mj-i corresponding to other discrete training models M-Sj by referring to the calculation process of the second verification predicted values Y2-M1-i corresponding to the discrete training models M-S1 and the calculation process of the second verification predicted values Y2-M2-i corresponding to the discrete training models M-S2. And each second verification predicted value Y2-Mj-i has a one-to-one correspondence relation with each unique identifier K2-i.
S912: and respectively differencing the second feature labeling information B2-i corresponding to each unique identifier K2-i and the second verification predicted values Y2-Mj-i corresponding to each unique identifier K2-i in each continuous training model M-Lj to generate a difference value Vj-i corresponding to each unique identifier K2-i in each continuous training model M-Lj.
In specific implementation, as shown in fig. 17, the second feature labeling information B2-1 corresponding to the unique identifier K2-1 and the second verification predicted value Y2-M1-1 corresponding to the unique identifier K2-1 in the continuous training model M-L1 are subtracted to generate the difference value V1-1 corresponding to the unique identifier K2-1 in the continuous training model M-L1; the second feature labeling information B2-2 corresponding to the unique identifier K2-2 and the second verification predicted value Y2-M1-2 corresponding to the unique identifier K2-2 in the continuous training model M-L1 are subtracted to generate the difference value V1-2 corresponding to the unique identifier K2-2 in the continuous training model M-L1; … …; and the second feature labeling information B2-99999999 corresponding to the unique identifier K2-99999999 and the second verification predicted value Y2-M1-99999999 corresponding to the unique identifier K2-99999999 in the continuous training model M-L1 are subtracted to generate the difference value V1-99999999 corresponding to the unique identifier K2-99999999 in the continuous training model M-L1.
Likewise, the second feature labeling information B2-1 corresponding to the unique identifier K2-1 is differenced with the second verification predicted value Y2-M2-1 corresponding to the unique identifier K2-1 in the continuous training model M-L2 to generate the difference value V2-1 corresponding to the unique identifier K2-1 in the continuous training model M-L2; the second feature labeling information B2-2 corresponding to the unique identifier K2-2 is differenced with the second verification predicted value Y2-M2-2 corresponding to the unique identifier K2-2 in the continuous training model M-L2 to generate the difference value V2-2; … …; and the second feature labeling information B2-99999999 corresponding to the unique identifier K2-99999999 is differenced with the second verification predicted value Y2-M2-99999999 to generate the difference value V2-99999999 corresponding to the unique identifier K2-99999999 in the continuous training model M-L2.
And by analogy, respectively generating a difference value Vj-i corresponding to each unique identifier K2-i in each continuous training model M-Lj.
S913: summing all the difference values Vj-i in each continuous training model M-Lj to generate a verification value Qj of the characteristic information identification model Zj corresponding to each continuous training model M-Lj.
In specific implementation, the difference values V1-1, V1-2, … … and V1-99999999 in the continuous training model M-L1 are summed to generate the verification value Q1 of the characteristic information recognition model Z1 corresponding to the continuous training model M-L1; the difference values V2-1, V2-2, … … and V2-99999999 in the continuous training model M-L2 are summed to generate the verification value Q2 of the characteristic information recognition model Z2 corresponding to the continuous training model M-L2; and by analogy, the verification values Qj of the characteristic information identification models Zj are generated in turn.
S914: and sequencing the verification values Qj of the characteristic information identification models Zj, and taking the characteristic information identification model Zj corresponding to the minimum verification value Qj as the optimal characteristic information identification model.
The optimal characteristic information identification model comprises the following steps: a preset discrete model and a preset continuous model.
In specific implementation, the verification value Q1 of the feature information recognition model Z1, the verification value Q2 of the feature information recognition model Z2, … …, and the verification value Qj of the feature information recognition model Zj are sorted, and the feature information recognition model corresponding to the smallest verification value is taken as the optimal feature information recognition model.
In the present embodiment, Q2 is assumed to be the minimum value, so the optimal feature information recognition model is Z2, namely the preset discrete model M-S2 together with the preset continuous model M-L2.
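The selection in steps S913-S914 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the use of absolute differences is an assumption (the text only states that the difference values Vj-i are differenced and summed), and the toy labels and predictions are invented.

```python
# Candidate model selection: for each feature information recognition model Zj,
# sum the per-identifier differences between the second feature labeling
# information B2-i and the model's second verification prediction Y2-Mj-i,
# then keep the model with the smallest verification value Qj.
# Absolute difference is an assumption made here; all numbers are toy data.

B2 = {1: 1.0, 2: 0.0, 3: 1.0}          # second feature labeling info, keyed by K2-i

# second verification predictions Y2-Mj-i for two candidate models Z1, Z2
Y2 = {
    "Z1": {1: 0.4, 2: 0.3, 3: 0.9},
    "Z2": {1: 0.9, 2: 0.1, 3: 0.8},
}

def verification_value(labels, preds):
    """Qj: sum of difference values Vj-i over every unique identifier K2-i."""
    return sum(abs(labels[k] - preds[k]) for k in labels)

Q = {name: verification_value(B2, preds) for name, preds in Y2.items()}
best_model = min(Q, key=Q.get)          # model with the smallest Qj is optimal
```

With these toy numbers Z2 accumulates the smaller total error and is selected, mirroring the embodiment where Q2 is the minimum.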
The third step: test procedure
S915: a unique identifier K3 is set for each acquired data set D to be predicted.
Wherein, the unique identification K3 is a positive integer.
Each data set D to be predicted comprises several pieces of characteristic data T3, as shown in Table 10.
Table 10
(Table 10 is rendered as an image in the original publication.)
The data group D to be predicted comprises characteristic data T3 such as age, city, occupation and income, and the present application is not limited thereto. T3-city and T3-occupation are set as discrete type feature data, and T3-age and T3-income are set as continuous type feature data. The data sets D to be predicted are multiple sets, including: D1, D2, … …, D99999999, but the present application is not limited thereto.
S916: and splitting each data group D to be predicted according to the data type of each characteristic data T3 of each data group D to be predicted to generate a discrete data unit T31 and a continuous data unit T32 corresponding to each data group D to be predicted.
Wherein each discrete data unit T31-i includes: the unique identification K3-i and the unique identification K3-i correspond to the discrete type feature data T3 in the data group D to be predicted.
Specifically, the discrete data unit T31-i includes: characteristic data T3-city, characteristic data T3-occupation, unique identification K3-i. As shown in table 11, the discrete data units T31-i corresponding to each data set Di to be predicted are T31-1, T31-2, … …, and T31-99999999, where i is 1, 2, … …, 99999999.
TABLE 11
Data set D to be predicted | Discrete data unit T31 | Unique identification K3 | T3-city | T3-occupation
D1 | T31-1 | 00000001 | 021 (Shanghai) | 0 (teacher)
D2 | T31-2 | 00000002 | 010 (Beijing) | 1 (doctor)
… … | … … | … … | … … | … …
D99999999 | T31-99999999 | 99999999 | 020 (Guangzhou) | 5 (property management)
Each successive data cell T32-i corresponding to a discrete data cell T31-i includes: the unique identification K3-i and the unique identification K3-i correspond to the continuous type feature data T3 in the data group D to be predicted, wherein i is a positive integer greater than or equal to 1.
Specifically, the consecutive data cell T32-i includes: the characteristic data T3-age, the characteristic data T3-income and the unique identification K3-i. As shown in table 12, the continuous data units T32-i corresponding to each data group Di to be predicted are T32-1, T32-2, … …, and T32-99999999, where i is 1, 2, … …, 99999999.
TABLE 12
Data set D to be predicted | Consecutive data element T32 | Unique identification K3 | T3-age | T3-income
D1 | T32-1 | 00000001 | 20 | 100000
D2 | T32-2 | 00000002 | 60 | 150000
… … | … … | … … | … … | … …
D99999999 | T32-99999999 | 99999999 | 30 | 2000
The number of discrete data units T31-i in each data group Di to be predicted is equal to the number of continuous data units T32-i, and the discrete data units T31-i, the continuous data units T32-i and the data groups Di to be predicted realize one-to-one correspondence through the unique identification K3-i of the data groups Di to be predicted.
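The splitting in step S916 can be sketched as follows; a minimal Python illustration, assuming dict-shaped records. The feature names mirror Table 10, and the sample record values are invented.

```python
# Step S916 sketch: each record D-i is split, by feature type, into a discrete
# unit T31-i and a continuous unit T32-i that both carry the same unique
# identifier K3-i, preserving the one-to-one correspondence described above.

DISCRETE_FEATURES = ("city", "occupation")   # per Table 10
CONTINUOUS_FEATURES = ("age", "income")

def split_record(k3, record):
    """Return (discrete unit T31-i, continuous unit T32-i) for identifier K3-i."""
    t31 = {"K3": k3, **{f: record[f] for f in DISCRETE_FEATURES}}
    t32 = {"K3": k3, **{f: record[f] for f in CONTINUOUS_FEATURES}}
    return t31, t32

record = {"city": "021", "occupation": "0", "age": 20, "income": 100000}
t31, t32 = split_record(1, record)
```

Because both units embed K3-i, the discrete model's output can later be re-joined with the matching continuous unit without ambiguity.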
S917: and inputting the discrete data units T31-i corresponding to the unique identifications K3-i into a preset discrete model M-S2 to calculate and generate first predicted values C1-i corresponding to the unique identifications K3-i in the preset discrete model M-S2.
Wherein the first predicted value C1-i includes: the unique identifier K3-i and the first test result value C1_i.
In specific implementation, the preset discrete model M-S2 adopts an existing discrete algorithm such as logistic regression, and the present application is not limited thereto.
As shown in FIG. 18, T31-1 corresponding to the unique identifier K3-1, T31-2 and … … corresponding to the unique identifier K3-2 and T31-99999999 corresponding to the unique identifier K3-99999999 are all input into the discrete model M-S2 to calculate and generate a first predicted value C1-i corresponding to the discrete model M-S2.
Wherein the first predicted value C1-1 includes: the unique identification K3-1 and the first test result value C1_1; the first predicted value C1-2 includes: the unique identification K3-2 and the first test result value C1_2; … …; the first predicted value C1-99999999 includes: the unique identification K3-99999999 and the first test result value C1_99999999, as shown in Table 13.
Table 13
First predicted value C1-i | Unique identification K3-i | First test result value C1_i
C1-1 | 00000001 | -2.45
C1-2 | 00000002 | -4.56
… … | … … | … …
C1-99999999 | 99999999 | 10.23
And each first predicted value C1-i has a one-to-one correspondence relationship with each unique identifier K3-i.
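Step S917 can be sketched as follows. The linear scorer below is a hypothetical stand-in for the trained preset discrete model M-S2 (the text mentions logistic regression as one option); the weights, bias, and numeric feature encodings are all invented for illustration. What matters is that each first predicted value C1-i retains the unique identifier K3-i.

```python
# Step S917 sketch: score each discrete unit T31-i with a stand-in for the
# preset discrete model M-S2 and emit a first predicted value C1-i that keeps
# the unique identifier K3-i. WEIGHTS and BIAS are hypothetical, not trained.

WEIGHTS = {"city": 0.5, "occupation": -1.2}   # invented model weights
BIAS = 0.1

def first_predicted_value(t31):
    """C1-i = (K3-i, raw score), analogous to a logistic-regression decision value."""
    score = BIAS + sum(WEIGHTS[f] * float(t31[f]) for f in WEIGHTS)
    return {"K3": t31["K3"], "C1": score}

units = [
    {"K3": 1, "city": 2.0, "occupation": 0.0},   # toy encoded discrete units
    {"K3": 2, "city": 1.0, "occupation": 1.0},
]
c1 = {p["K3"]: p["C1"] for p in (first_predicted_value(u) for u in units)}
```

Keying the result dict by K3-i is what makes the later merge with the continuous units a simple lookup.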
S918: and respectively merging the continuous data units T32-i corresponding to the unique identification K3-i with the first prediction values C1-i, inputting the merged continuous data units T32-i into a preset continuous model M-L2, and calculating to generate second prediction values C2-i corresponding to the preset continuous model M-L2.
Wherein the second predicted value C2-i includes: the unique identifier K3-i and the second test result value C2_i.
In specific implementation, the predetermined continuous model M-L2 adopts a GBDT algorithm, which is not limited in this application.
As shown in fig. 19, the continuous data unit T32-1 corresponding to the unique identifier K3-1 is merged with the first prediction value C1-1 and then input to a preset continuous model M-L2 to calculate and generate a second prediction value C2-1, the continuous data unit T32-2 corresponding to the unique identifier K3-2 is merged with the first prediction value C1-2 and then input to a preset continuous model M-L2 to calculate and generate a second prediction value C2-2, the continuous data unit T32-3 corresponding to the unique identifier K3-3 is merged with the first prediction value C1-3 and then input to a preset continuous model M-L2 to calculate and generate a second prediction value C2-3, … …, merging the continuous data units T32-99999999 corresponding to the unique identifications K3-99999999 with the first predicted values C1-99999999, and inputting the merged data units into a preset continuous model M-L2 to calculate and generate second predicted values C2-99999999.
Wherein the second predicted value C2-1 includes: the unique identification K3-1 and the second test result value C2_1; the second predicted value C2-2 includes: the unique identification K3-2 and the second test result value C2_2; … …; the second predicted value C2-99999999 includes: the unique identification K3-99999999 and the second test result value C2_99999999, as shown in Table 14.
TABLE 14
Second predicted value C2-i | Unique identification K3-i | Second test result value C2_i
C2-1 | 00000001 | 0.1346
C2-2 | 00000002 | 0.0293
… … | … … | … …
C2-99999999 | 99999999 | 0.9374
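Step S918 can be sketched as follows. The merge on the unique identifier K3-i follows the text; the scoring function is a hypothetical stand-in for the preset continuous model M-L2 (the text names GBDT), with invented coefficients and a logistic squashing used purely for illustration.

```python
# Step S918 sketch: join the continuous unit T32-i and the first predicted
# value C1-i on the unique identifier K3-i, then score the merged vector with
# a stand-in for the continuous model M-L2. Coefficients are invented.
import math

t32 = {1: {"age": 20, "income": 100000}, 2: {"age": 60, "income": 150000}}
c1 = {1: -2.45, 2: -4.56}               # first predicted values, keyed by K3-i

def merge(k3):
    """Merged input: continuous features plus the discrete model's score C1-i."""
    return {**t32[k3], "C1": c1[k3]}

def continuous_model(x):                # hypothetical M-L2 stand-in, not GBDT
    z = 0.01 * x["age"] + 1e-6 * x["income"] + 0.5 * x["C1"]
    return 1.0 / (1.0 + math.exp(-z))   # second predicted value C2-i in (0, 1)

c2 = {k3: continuous_model(merge(k3)) for k3 in t32}
```

The output values lie in (0, 1), consistent with the second test result values shown in Table 14.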
S919: and generating feature information B3-i corresponding to the data group Di to be predicted according to the data group Di to be predicted corresponding to the unique identifier K3-i and the second predicted value C2-i.
In specific implementation, the feature information B3-1 corresponding to the data group D1 to be predicted is generated according to the data group D1 to be predicted corresponding to the unique identifier K3-1 and the second predicted value C2-1; the feature information B3-2 corresponding to the data group D2 to be predicted is generated according to the data group D2 to be predicted corresponding to the unique identifier K3-2 and the second predicted value C2-2; … …; and the feature information B3-99999999 corresponding to the data group D99999999 to be predicted is generated according to the data group D99999999 to be predicted corresponding to the unique identifier K3-99999999 and the second predicted value C2-99999999, as shown in Table 15.
Table 15
(Table 15 is rendered as an image in the original publication.)
The feature information B3-i takes the value 1 or 0, where i is a positive integer greater than or equal to 1. When the feature information B3-i is 1, the data group Di to be predicted has fraud characteristics and is identified as a fraudulent client; when the feature information B3-i is 0, the data group Di to be predicted has no fraud characteristics and is identified as a non-fraudulent client.
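The mapping in step S919 from second predicted values to feature information can be sketched as follows. The 0.5 cutoff is an assumption made for illustration; the text does not state how C2-i is converted to B3-i.

```python
# Step S919 sketch: map each second predicted value C2-i to feature
# information B3-i in {0, 1}. The 0.5 threshold is an assumption; sample
# scores are taken from Table 14-style toy values.

c2 = {1: 0.1346, 2: 0.0293, 3: 0.9374}   # second predicted values, keyed by K3-i

def feature_info(score, threshold=0.5):
    """1 = fraud characteristics (fraudulent client), 0 = no fraud characteristics."""
    return 1 if score >= threshold else 0

b3 = {k3: feature_info(v) for k3, v in c2.items()}
```

Under this assumed threshold, only the record with the high score 0.9374 would be flagged as a fraudulent client.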
Based on the same application concept as the above feature information identification method, the present invention also provides a feature information identification system, as described in the following embodiments. Because the principle of solving the problems of the characteristic information identification system is similar to that of the characteristic information identification method, the implementation of the characteristic information identification system can refer to the implementation of the characteristic information identification method, and repeated parts are not repeated.
Fig. 20 is a schematic structural diagram of a feature information identification system according to an embodiment of the present application. As shown in fig. 20, the characteristic information identification system includes: an acquisition unit 101, a first generation unit 102, a second generation unit 103, and a third generation unit 104.
The obtaining unit 101 obtains a discrete data unit and a continuous data unit corresponding to a first unique identifier of a data set to be predicted.
The first generating unit 102 is configured to input the discrete data unit corresponding to the first unique identifier into a preset discrete model, and calculate to generate a first predicted value corresponding to the preset discrete model. Wherein the first predicted value includes: a first unique identification.
And the second generating unit 103 is configured to combine the continuous data unit corresponding to the first unique identifier and the first predicted value, input the combined result into a preset continuous model, and calculate and generate a second predicted value corresponding to the preset continuous model. Wherein the second predicted value includes: a first unique identification.
And a third generating unit 104, configured to generate feature information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
Based on the same application concept as the above-mentioned feature information identification method, the present application provides a computer device, as described in the following embodiments. Because the principle of solving the problem of the computer device is similar to the characteristic information identification method, the implementation of the computer device can refer to the implementation of the characteristic information identification method, and repeated parts are not described again.
In one embodiment, an electronic device includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing all the steps of the method in the above embodiments when executing the computer program, e.g. as shown in fig. 1, the processor implementing the following steps when executing the computer program:
S101: and acquiring a discrete data unit and a continuous data unit corresponding to the first unique identifier of the data group to be predicted.
S102: and inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model. Wherein the first predicted value includes: a first unique identification.
S103: and combining the continuous data unit corresponding to the first unique identifier and the first predicted value, and inputting the combined value into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model. Wherein the second predicted value includes: a first unique identification.
S104: and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
Based on the same application concept as the above-described feature information identification method, the present application provides a computer-readable storage medium, as described in the following embodiments. Because the principle of solving the problem of the computer-readable storage medium is similar to the characteristic information identification method, the implementation of the computer-readable storage medium can refer to the implementation of the characteristic information identification method, and repeated parts are not described again.
In one embodiment, the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements all the steps of the feature information identification method in the above embodiments, for example, as shown in fig. 1, the computer program, when executed by the processor, implements the steps of:
S101: and acquiring a discrete data unit and a continuous data unit corresponding to the first unique identifier of the data group to be predicted.
S102: and inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model. Wherein the first predicted value includes: a first unique identification.
S103: and combining the continuous data unit corresponding to the first unique identifier and the first predicted value, and inputting the combined value into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model. Wherein the second predicted value includes: a first unique identification.
S104: and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
The invention provides a method and a system for identifying characteristic information, comprising the following steps: acquiring a discrete data unit and a continuous data unit corresponding to a first unique identifier of a data set to be predicted; inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model; the first predicted value includes: a first unique identifier; combining the continuous data unit corresponding to the first unique identifier and the first predicted value, inputting the combined continuous data unit and the first predicted value into a preset continuous model, and calculating to generate a second predicted value corresponding to the preset continuous model; the second predicted value includes: a first unique identifier; and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value. The method and the device can improve the data processing efficiency of the machine learning algorithm on the data containing both discrete data and continuous data, so that the efficiency of identifying the characteristic information by applying the machine learning algorithm is improved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (21)

1. A method for identifying feature information, comprising:
acquiring a discrete data unit and a continuous data unit corresponding to a first unique identifier of a data set to be predicted;
inputting the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model; the first predicted value includes: the first unique identifier;
combining the continuous data units corresponding to the first unique identifier and the first predicted value, and inputting the combined value into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model; the second predicted value includes: the first unique identifier;
and generating characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
2. The method according to claim 1, wherein the data group to be predicted is a plurality of groups.
3. The method according to claim 2, wherein the obtaining of the discrete data unit and the continuous data unit corresponding to the first unique identifier of the data group to be predicted includes:
respectively setting a first unique identifier for each acquired data group to be predicted; each data group to be predicted comprises: a number of first characteristic data;
and splitting each data group to be predicted into discrete data units and continuous data units corresponding to each data group to be predicted according to the data type of each first characteristic data of each data group to be predicted.
4. The feature information identification method according to claim 3, wherein each of the discrete data units includes: the prediction method comprises the steps of obtaining a first unique identifier and discrete type first feature data in a data group to be predicted corresponding to the first unique identifier.
5. The method according to claim 4, wherein the consecutive data units corresponding to each of the discrete data units comprise: and the first unique identifier and continuous type first feature data in the data group to be predicted corresponding to the first unique identifier.
6. The feature information identification method according to claim 1, further comprising:
acquiring discrete training data units and continuous training data units corresponding to the second unique identifier of the training data set;
inputting the discrete training data unit corresponding to the second unique identifier into each preset discrete algorithm to calculate and generate a discrete training model corresponding to each preset discrete algorithm and a first training predicted value corresponding to each preset discrete algorithm; the first training predictor comprises: the second unique identifier;
combining the continuous training data units corresponding to the second unique identifier with the first training predicted value corresponding to the second unique identifier of each preset discrete algorithm, and inputting the combined values into a preset continuous algorithm to calculate and generate a continuous training model corresponding to the discrete training model of each preset discrete algorithm;
and combining each discrete training model with the continuous training model corresponding to each discrete training model to generate each characteristic information recognition model.
7. The feature information identification method according to claim 6, wherein the training data sets are a plurality of sets.
8. The method according to claim 7, wherein the obtaining discrete training data units and continuous training data units corresponding to the second unique identifier of the training data set comprises:
respectively setting a second unique identifier and first characteristic marking information for each acquired training data set; each of the training data sets includes: a plurality of second characteristic data;
and splitting each training data set according to the data type of each second feature data of each training data set to generate a discrete training data unit and a continuous training data unit corresponding to each training data set.
9. The feature information recognition method according to claim 8, wherein each of the discrete training data units includes: a second unique identifier, second feature data of a discrete type in a training data set corresponding to the second unique identifier, and first feature labeling information of the training data set corresponding to the second unique identifier;
each continuous training data unit corresponding to the discrete training data unit comprises: the second unique identifier, the second feature data of the continuous type in the training data set corresponding to the second unique identifier, and the first feature labeling information of the training data set corresponding to the second unique identifier.
10. The method according to claim 9, wherein the combining the continuous training data units corresponding to the second unique identifier with the first training predicted value corresponding to the second unique identifier of each preset discrete algorithm, and inputting the combined result into a preset continuous algorithm to calculate and generate a continuous training model corresponding to the discrete training model of each preset discrete algorithm, includes:
respectively combining the continuous training data unit corresponding to each second unique identifier with the first training predicted value corresponding to each second unique identifier of each preset discrete algorithm to generate a combined training data unit corresponding to each second unique identifier of each discrete algorithm;
and inputting each combined training data unit into the preset continuous algorithm to calculate and generate a continuous training model corresponding to the discrete training model of each preset discrete algorithm.
11. The feature information identification method according to claim 6, further comprising:
acquiring a third unique identifier of a verification data set, second feature labeling information, and a discrete verification data unit and a continuous verification data unit corresponding to the third unique identifier;
inputting the discrete verification data unit corresponding to the third unique identifier into each discrete training model to calculate and generate a first verification prediction value corresponding to each discrete training model; the first verification prediction value comprises: the third unique identifier;
combining the continuous verification data unit corresponding to the third unique identifier with the first verification predicted value, and inputting the combined continuous verification data unit and the first verification predicted value into a continuous training model corresponding to a discrete training model of each first verification predicted value to calculate and generate a second verification predicted value corresponding to each continuous training model; the second verification prediction value comprises: the third unique identifier;
and generating an optimal feature information identification model according to the second feature labeling information corresponding to the third unique identifier and the second verification predicted value corresponding to the third unique identifier in each continuous training model.
12. The feature information identification method according to claim 11, wherein the verification data sets are a plurality of sets.
13. The method according to claim 12, wherein the obtaining of the discrete verification data unit and the continuous verification data unit corresponding to the third unique identifier, the second feature labeling information, and the third unique identifier of the verification data set comprises:
respectively setting a third unique identifier and second characteristic labeling information for each acquired verification data set; each of the verification data sets includes: a number of third feature data;
and splitting each verification data group according to the data type of each third feature data of each verification data group to generate discrete verification data units and continuous verification data units of each verification data group.
14. The feature information identification method according to claim 13, wherein each of the discrete verification data units includes: a third unique identifier, third feature data of a discrete type in the verification data group corresponding to the third unique identifier, and second feature labeling information of the verification data group corresponding to the third unique identifier;
each continuous verification data unit corresponding to the discrete verification data unit comprises: and the third unique identifier, the continuous third feature data in the verification data group corresponding to the third unique identifier and the second feature marking information of the verification data group corresponding to the third unique identifier.
15. The method according to claim 14, wherein the step of generating the second verification prediction value corresponding to each of the continuous training models by calculating the continuous training model corresponding to the discrete training model that is input to each of the first verification prediction values after combining the continuous verification data unit corresponding to the third unique identifier with the first verification prediction value comprises:
respectively merging the continuous verification data units corresponding to the third unique identifications and the first verification predicted values corresponding to the third unique identifications of each discrete training model to generate merged verification data units corresponding to the third unique identifications of each discrete training model;
and inputting the merged verification data unit corresponding to each third unique identifier of each discrete training model into the continuous training model corresponding to each discrete training model to calculate and generate a second verification predicted value corresponding to each third unique identifier of each continuous training model.
16. The feature information identification method according to claim 11, wherein the generating an optimal feature information identification model according to the second feature labeling information corresponding to the third unique identifier and the second verification predicted value corresponding to the third unique identifier in each of the continuous training models comprises:
respectively differentiating second feature labeling information corresponding to the third unique identifier and a second verification predicted value corresponding to the third unique identifier corresponding to each continuous training model to generate a difference value corresponding to the third unique identifier in each continuous training model;
summing the difference values in each continuous training model to generate a verification value of the characteristic information identification model corresponding to each continuous training model;
and sequencing the verification values, and taking the feature information identification model corresponding to the minimum verification value as the optimal feature information identification model.
17. The feature information recognition method according to claim 11, wherein the optimal feature information recognition model includes: the preset discrete model and the preset continuous model.
18. The feature information identification method according to any one of claims 1 to 17, wherein the feature information includes: fraud characteristic information.
19. A characteristic information identification system, characterized by comprising:
an acquiring unit, configured to acquire a discrete data unit and a continuous data unit corresponding to a first unique identifier of a data group to be predicted;
a first generating unit, configured to input the discrete data unit corresponding to the first unique identifier into a preset discrete model to calculate and generate a first predicted value corresponding to the preset discrete model; the first predicted value comprises: the first unique identifier;
a second generating unit, configured to combine the continuous data unit corresponding to the first unique identifier with the first predicted value and input the combined result into a preset continuous model to calculate and generate a second predicted value corresponding to the preset continuous model; the second predicted value comprises: the first unique identifier;
and a third generating unit, configured to generate the characteristic information corresponding to the data group to be predicted according to the data group to be predicted corresponding to the first unique identifier and the second predicted value.
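The two-stage pipeline of claim 19 (discrete model first, its output merged into the continuous features, then the continuous model) can be sketched as below. Both stand-in models, the field names, and the 0.5 decision threshold are assumptions for illustration only and do not reproduce the patented models.

```python
# Hypothetical sketch of the claim 19 pipeline: a preset discrete model
# scores the discrete data unit to produce the first predicted value;
# that value is combined with the continuous data unit and fed to a
# preset continuous model, which produces the second predicted value
# used to generate the characteristic information.

def discrete_model(discrete_unit):
    """Stand-in preset discrete model: fraction of discrete flags set."""
    return sum(discrete_unit.values()) / max(len(discrete_unit), 1)

def continuous_model(features):
    """Stand-in preset continuous model: mean of the merged features."""
    return sum(features) / len(features)

def identify(record):
    """Generate characteristic information for one first unique identifier."""
    first_pred = discrete_model(record["discrete"])   # first predicted value
    merged = record["continuous"] + [first_pred]      # combine with continuous unit
    second_pred = continuous_model(merged)            # second predicted value
    return {
        "id": record["id"],
        "score": second_pred,
        "flag": second_pred > 0.5,  # illustrative threshold
    }

record = {
    "id": "uid-001",
    "discrete": {"is_new_device": 1, "is_foreign_ip": 1},
    "continuous": [0.2, 0.4],
}
result = identify(record)
print(result["id"], round(result["score"], 2))  # uid-001 0.53
```

The design point the claim captures is that the discrete model's output becomes one more continuous feature, so the continuous model never has to handle raw categorical fields directly.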
20. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the characteristic information identification method according to any one of claims 1 to 18.
21. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the characteristic information identification method according to any one of claims 1 to 18.
CN201910132261.1A 2019-02-22 2019-02-22 Characteristic information identification method and system Active CN109858633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910132261.1A CN109858633B (en) 2019-02-22 2019-02-22 Characteristic information identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910132261.1A CN109858633B (en) 2019-02-22 2019-02-22 Characteristic information identification method and system

Publications (2)

Publication Number Publication Date
CN109858633A CN109858633A (en) 2019-06-07
CN109858633B true CN109858633B (en) 2021-02-02

Family

ID=66898550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910132261.1A Active CN109858633B (en) 2019-02-22 2019-02-22 Characteristic information identification method and system

Country Status (1)

Country Link
CN (1) CN109858633B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110406530B (en) * 2019-07-02 2020-12-01 宁波吉利汽车研究开发有限公司 Automatic driving method, device, equipment and vehicle

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451266A * 2017-07-31 2017-12-08 北京京东尚科信息技术有限公司 Method and device for processing data
CN108805332A * 2018-05-07 2018-11-13 北京奇艺世纪科技有限公司 Feature evaluation method and apparatus

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7246125B2 (en) * 2001-06-21 2007-07-17 Microsoft Corporation Clustering of databases having mixed data attributes
CN104778176A (en) * 2014-01-13 2015-07-15 阿里巴巴集团控股有限公司 Data search processing method and device
CN105354198B * 2014-08-19 2019-07-02 中国移动通信集团湖北有限公司 Data processing method and device
CN106548343B (en) * 2016-10-21 2020-11-10 中国银联股份有限公司 Illegal transaction detection method and device
CN108154430A * 2017-12-28 2018-06-12 上海氪信信息技术有限公司 Credit scoring construction method based on machine learning and big data technology
CN108491408B (en) * 2018-01-24 2021-04-23 北京三快在线科技有限公司 Activity information processing method and device, electronic equipment and storage medium
CN108509627B (en) * 2018-04-08 2021-08-31 腾讯科技(深圳)有限公司 Data discretization model training method and device and data discretization method


Also Published As

Publication number Publication date
CN109858633A (en) 2019-06-07

Similar Documents

Publication Publication Date Title
Liesecke et al. Ranking genome-wide correlation measurements improves microarray and RNA-seq based global and targeted co-expression networks
Zou et al. Survey of MapReduce frame operation in bioinformatics
González et al. Early enhancer establishment and regulatory locus complexity shape transcriptional programs in hematopoietic differentiation
Behr et al. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples
CN112669899B (en) 16S and metagenome sequencing data correlation analysis method, system and equipment
Amin et al. A comparison of two oversampling techniques (smote vs mtdf) for handling class imbalance problem: A case study of customer churn prediction
CN111815432A (en) Financial service risk prediction method and device
CN109858633B (en) Characteristic information identification method and system
CN112861980A (en) Calendar task table mining method based on big data and computer equipment
Muldoon et al. Network inference performance complexity: a consequence of topological, experimental and algorithmic determinants
CN105678323A (en) Image-based-on method and system for analysis of users
Fernández et al. Sample selection procedure in daily trading volume processes
EP2518656B1 (en) Taxonomic classification system
Guttà et al. Applying a GAN-based classifier to improve transcriptome-based prognostication in breast cancer
Nusinow et al. Network-based inference from complex proteomic mixtures using SNIPE
Li et al. RiboDiPA: a novel tool for differential pattern analysis in Ribo-seq data
CN114638693A (en) Method and system for determining service type range of bank outlets
CN113360648A (en) Case classification method and system based on correlation graph learning
CN110021342B (en) Method and system for accelerating identification of variant sites
CN106777262B (en) High-throughput sequencing data quality filtering method and filtering device
CN112163617A (en) Label-free numerical value type feature classification method, device, equipment and readable storage medium
Shovan et al. Improved prediction of glutarylation ptm site using evolutionary features with lightgbm resolving data imbalance issue
Kozierkiewicz et al. The assessing of influence of collective intelligence on the final consensus quality
CN116894209B (en) Sampling point classification method, device, electronic equipment and readable storage medium
JP7238907B2 (en) Machine learning device, method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant