CN116485523A - Decision tree-based data evaluation method, device, equipment and storage medium

Info

Publication number: CN116485523A
Application number: CN202310450497.6A
Authority: CN
Inventors: 潘成挺; 钟红义
Original assignee: Hangzhou Breeze Enterprise Technology Co ltd
Current assignee: Hangzhou Breeze Enterprise Technology Co ltd
Priority date: 2023-04-20
Filing date: 2023-04-20
Publication date: 2023-07-25

Abstract

The application relates to intelligent decision making and discloses a decision tree-based data evaluation method, device, equipment and storage medium. The method comprises: acquiring target data and standardizing the target data through a preset standardization engine to generate target data features corresponding to the target data; generating target indexes of the target data features through a preset rule engine and the target data features; and comparing the target indexes with a preset decision tree model to obtain an evaluation result of the target data based on the preset decision tree model. In this way, the target data are standardized after acquisition, the data indexes are determined from the extracted data features, each data index is compared with the preset indexes in the decision tree model, and the final credit evaluation result is determined from the comparison, which improves the efficiency of granting credit to enterprises.

Description

Decision tree-based data evaluation method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of intelligent decision making technologies, and in particular, to a decision tree-based data evaluation method, apparatus, device, and storage medium.

Background

At present, machine learning and deep learning, combining big data with artificial intelligence, are widely applied across industries. The decision tree is a supervised-learning classification algorithm commonly used in artificial intelligence. Decision tree analysis, also called the probability analysis decision method, is a system analysis method that represents the elements of a decision scheme as a tree and analyzes and selects among schemes on that basis. It is one of the most common methods for risk-based decisions and is particularly useful for analyzing relatively complex problems. It compares the expected benefit values (expected values) of different schemes and chooses a scheme accordingly. Its most notable characteristic is that the decision process of the whole decision problem at its different stages can be displayed graphically, so that the reasoning is clear, the hierarchy is distinct and the result is intuitive. The decision tree has the following advantages: 1. it is easy to understand and implement; 2. data preparation is simple or unnecessary, numerical and categorical attributes can be processed at the same time, feasible and good results can be produced on large data sources in a relatively short time, and the model can be easily evaluated through static tests.

In the bank loan scenario, a bank auditing a loan often uses invoices to analyze the business condition of the enterprise. However, invoices whose underlying transactions may not be authentic are difficult to filter out, so the invoices cannot truly and accurately reflect the business value of the enterprise, and the bank's efficiency in granting credit to small enterprises is low. Therefore, how to improve the efficiency of granting credit to enterprises is a technical problem to be solved.

Disclosure of Invention

The application provides a decision tree-based data evaluation method, device, equipment and storage medium, so as to improve the efficiency of granting credit to enterprises.

In a first aspect, the present application provides a decision tree-based data evaluation method, the decision tree-based data evaluation method comprising:

acquiring target data, and carrying out standardization processing on the target data through a preset standardization engine to generate target data characteristics corresponding to the target data;

generating target indexes of the target data features through a preset rule engine and the data features;

and comparing the preset decision tree model with the target index to obtain an evaluation result of the target data based on the preset decision tree model.

Further, before comparing the preset decision tree model with the target index to obtain the evaluation result of the target data based on the preset decision tree model, the method includes:

acquiring historical data as a training set;

training on the training set to generate the preset decision tree model.

Further, training the training set to generate the preset decision tree model, including:

extracting training data features of the training set based on the preset rule engine and the training set;

determining at least one training index corresponding to the training data features through the preset rule engine;

and calculating the information gain of each training index through a preset information gain function, determining the node position corresponding to each training index according to the information gain, and generating the preset decision tree model.

Further, calculating the information gain of each training index through the preset information gain function includes:

dividing the training data features into a preset number of value intervals (gears) by impurity reduction, and calculating the gear entropy value corresponding to each value interval;

and calculating the total information entropy value of each training index based on the total information entropy function and the value intervals, and calculating the information gain from the total information entropy value and the gear entropy values.

Further, the total information entropy function is:

$$I(X) = -\sum_{i=1}^{n} p_i \log p_i$$

wherein I(X) is the total information entropy value, p_i is the proportion of samples of the i-th class in the current sample set, and n is the number of classes.

Further, the preset information gain function is:

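$$\Delta I(X, f) = I(X) - \left(P_1 I(X_1) + \dots + P_N I(X_N)\right)$$

wherein ΔI(X, f) is the information gain obtained by splitting on feature f, X is the sample set, X_1, …, X_N are the subsets into which X is divided, and P_k is the proportion of the samples of X that fall into subset X_k.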
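The two functions above can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function names, the base-2 logarithm (the source does not state a base) and the toy class counts are all assumptions made for the example.

```python
import math
from typing import Sequence


def entropy(class_counts: Sequence[int], base: float = 2.0) -> float:
    """I(X) = -sum(p_i * log(p_i)) over the classes of a sample set."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    result = 0.0
    for count in class_counts:
        if count > 0:  # convention: an empty class contributes nothing
            p = count / total
            result -= p * math.log(p, base)
    return result


def information_gain(parent_counts: Sequence[int],
                     subset_counts: Sequence[Sequence[int]],
                     base: float = 2.0) -> float:
    """Delta I(X, f) = I(X) - sum(P_k * I(X_k)) for the subsets of a split."""
    total = sum(parent_counts)
    weighted = sum(
        (sum(counts) / total) * entropy(counts, base)
        for counts in subset_counts
    )
    return entropy(parent_counts, base) - weighted


# Hypothetical toy data: 10 samples (5 positive, 5 negative)
# divided by some feature into two subsets of 5 samples each.
parent = [5, 5]
subsets = [[4, 1], [1, 4]]
gain = information_gain(parent, subsets)
print(round(gain, 4))
```

Each entry of `subsets` holds the per-class counts of one subset X_k, so the weight P_k is simply that subset's size divided by the size of X.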

Further, calculating the information gain of each training index through a preset information gain function, and determining the node position corresponding to each training index according to the information gain, so as to generate the preset decision tree model, including:

and performing descending order processing on the information gains, and generating the preset decision tree model by arranging the training indexes corresponding to the information gains according to descending order.

In a second aspect, the present application further provides a decision tree-based data evaluation device, the decision tree-based data evaluation device comprising:

the data normalization module is used for acquiring target data, performing normalization processing on the target data through a preset normalization engine and generating target data characteristics corresponding to the target data;

the index generation module is used for generating target indexes of the target data features through a preset rule engine and the data features;

and the decision tree comparison module is used for comparing a preset decision tree model with the target index to obtain an evaluation result of the target data based on the preset decision tree model.

In a third aspect, the present application also provides an apparatus comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the decision tree-based data evaluation method as described above when the computer program is executed.

In a fourth aspect, the present application also provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement a decision tree based data evaluation method as described above.

The application discloses a decision tree-based data evaluation method, device, equipment and storage medium. The method comprises: acquiring target data and standardizing the target data through a preset standardization engine to generate target data features corresponding to the target data; generating target indexes of the target data features through a preset rule engine and the target data features; and comparing the target indexes with a preset decision tree model to obtain an evaluation result of the target data based on the preset decision tree model. In this way, the target data are standardized after acquisition, the data indexes are determined from the extracted data features, each data index is compared with the preset indexes in the decision tree model, and the final credit evaluation result is determined from the comparison, which improves the efficiency of granting credit to enterprises.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a decision tree-based data evaluation method according to a first embodiment of the present application;

FIG. 2 is a schematic flow chart of a decision tree-based data evaluation method according to a second embodiment of the present application;

FIG. 3 is a schematic flow chart of a decision tree-based data evaluation method according to a third embodiment of the present application;

FIG. 4 is a schematic flow chart of a decision tree-based data evaluation method according to a fourth embodiment of the present application;

FIG. 5 is a schematic diagram of a decision tree model according to an embodiment of the present application;

FIG. 6 is a schematic block diagram of a decision tree based data evaluation apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic block diagram of an apparatus according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.

It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

The embodiments of the application provide a decision tree-based data evaluation method, device, equipment and storage medium. The decision tree-based data evaluation method can be applied to a server. After the target data are acquired, the data are standardized; after the data features are extracted, each data index is determined from the data features; each data index is then compared with the preset indexes in the decision tree model, and the final credit evaluation result is determined from the comparison, which improves the efficiency of granting credit to enterprises. The server may be an independent server or a server cluster.

Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a decision tree-based data evaluation method according to a first embodiment of the present application. The decision tree-based data evaluation method can be applied to a server. It standardizes the target data after acquisition, determines each data index from the extracted data features, compares each data index with the preset indexes in the decision tree model, and determines the final credit evaluation result from the comparison, which improves the efficiency of granting credit to enterprises.

As shown in fig. 1, the decision tree-based data evaluation method specifically includes steps S10 to S30.

Step S10, acquiring target data, and carrying out standardization processing on the target data through a preset standardization engine to generate target data characteristics corresponding to the target data;

Specifically, the data standardization process includes the following steps:

1) Defining the calculation indexes for the enterprise's standardized tax and invoice data;

2) Defining the features and their feature values according to the proportions of the calculated indexes.

The standardization of the tax and invoice data is accomplished as follows:

1) Acquiring the data model template corresponding to the enterprise user;

2) Inputting financial data, tax data and invoice data;

3) The standardization engine processes, merges, cleans and transforms the data according to the data model template to generate standardized data.
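The template-driven merge, clean and transform steps above might look like the following sketch. The template layout, field names and sample values are hypothetical and chosen only for illustration; the source does not describe the engine's internal data structures.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical data model template:
# standard field -> (source name, raw field name, type converter).
TemplateEntry = Tuple[str, str, Callable[[Any], Any]]
TEMPLATE: Dict[str, TemplateEntry] = {
    "annual_revenue": ("financial", "revenue", float),
    "tax_paid": ("tax", "paid_amount", float),
    "invoice_count": ("invoice", "count", int),
}


def standardize(sources: Dict[str, Dict[str, Any]],
                template: Dict[str, TemplateEntry]) -> Dict[str, Optional[Any]]:
    """Merge the raw sources into one record, cleaning and converting each field."""
    record: Dict[str, Optional[Any]] = {}
    for field, (source, raw_field, convert) in template.items():
        raw = sources.get(source, {}).get(raw_field)
        if isinstance(raw, str):
            # Clean: trim whitespace and drop thousands separators.
            raw = raw.strip().replace(",", "")
        if raw is None or raw == "":
            record[field] = None  # keep missing values explicit
            continue
        record[field] = convert(raw)  # transform to the template's type
    return record


# Hypothetical raw inputs for one enterprise.
raw_sources = {
    "financial": {"revenue": " 1,200,000.50 "},
    "tax": {"paid_amount": "85000"},
    "invoice": {"count": None},
}
record = standardize(raw_sources, TEMPLATE)
print(record)
```

Keeping the mapping in a template rather than in code means a different enterprise user can be handled by swapping the template, which matches the per-user data model template described in step 1).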

Step S20, generating target indexes of the target data features through a preset rule engine and the data features;

Specifically, training sample data are obtained to generate a training set, wherein the training sample data include the general index, tax lending index, ticket lending index, three-party index, credit sign index and financial index;

According to the characteristics of the decision tree, the feature selected by calculating the information gain is used as the root node, each non-leaf node of the decision tree is set as a decision node, and each leaf node is set as an output unit, wherein each decision node consists of a sample feature and a corresponding judgment value, and each leaf node corresponds to a loan prediction result.

And step S30, comparing a preset decision tree model with the target index to obtain an evaluation result of the target data based on the preset decision tree model.
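Comparing the target indexes with a preset decision tree amounts to walking the tree from the root, following at each decision node the branch that matches the target's value for that node's index, until a leaf is reached. The sketch below assumes a nested-dictionary tree whose shape, index names, gear values and outcomes are all hypothetical; the tree actually derived in the application is the one shown in FIG. 5 and is not reproduced here.

```python
from typing import Any, Dict

# Hypothetical preset model: a decision node names the index it tests and
# maps each gear value to a child; a leaf is the evaluation result itself.
PRESET_TREE: Dict[str, Any] = {
    "index": "ticket_lending_index",
    "children": {
        1: "reject",
        2: {
            "index": "general_index",
            "children": {1: "reject", 2: "approve", 3: "approve"},
        },
        3: "approve",
    },
}


def evaluate(tree: Any, target_indexes: Dict[str, int]) -> str:
    """Follow the branch matching each target index until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        gear = target_indexes[node["index"]]
        node = node["children"][gear]
    return node


# Hypothetical target indexes produced by the rule engine for one enterprise.
result = evaluate(PRESET_TREE, {"ticket_lending_index": 2, "general_index": 3})
print(result)
```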

In a specific embodiment, the decision tree generation comprises the steps of:

Step 1: treat all the data as a single node, and go to step 2;

Step 2: select one data feature from all the data features to split the node, and go to step 3;

Step 3: generate a plurality of child nodes and check each child node; if the condition for stopping splitting is met, go to step 4; otherwise, return to step 2;

Step 4: set the node as a leaf node, whose output is the class with the largest number of samples in that node.
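Steps 1 to 4 can be sketched as a short recursive procedure. The version below is an ID3-style illustration under stated assumptions: the split feature in step 2 is chosen by information gain with base-2 logarithms, the stopping condition in step 3 is "the node is pure or no features remain" (the source does not specify it), and the feature names and samples are toy data.

```python
import math
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple

Sample = Tuple[Dict[str, Any], str]  # (feature values, class label)


def entropy(labels: Sequence[str]) -> float:
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split(samples: List[Sample], feature: str) -> Dict[Any, List[Sample]]:
    """Group the samples of a node by their value for one feature."""
    groups: Dict[Any, List[Sample]] = {}
    for features, label in samples:
        groups.setdefault(features[feature], []).append((features, label))
    return groups


def gain(samples: List[Sample], feature: str) -> float:
    labels = [label for _, label in samples]
    remainder = sum(
        len(group) / len(samples) * entropy([label for _, label in group])
        for group in split(samples, feature).values()
    )
    return entropy(labels) - remainder


def build_tree(samples: List[Sample], features: List[str]) -> Any:
    labels = [label for _, label in samples]
    # Steps 3 and 4: when splitting stops, the node becomes a leaf whose
    # output is the majority class of its samples.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the feature with the largest information gain.
    best = max(features, key=lambda f: gain(samples, f))
    remaining = [f for f in features if f != best]
    return {
        "feature": best,
        "children": {
            value: build_tree(group, remaining)
            for value, group in split(samples, best).items()
        },
    }


# Step 1: all (toy) data start out as a single node.
toy_samples: List[Sample] = [
    ({"a": 1, "b": 1}, "pos"),
    ({"a": 1, "b": 2}, "pos"),
    ({"a": 2, "b": 1}, "neg"),
    ({"a": 2, "b": 2}, "neg"),
]
tree = build_tree(toy_samples, ["a", "b"])
print(tree)
```

In the toy data feature `a` separates the classes perfectly while `b` carries no information, so the root tests `a` and both children are leaves.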

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a decision tree-based data evaluation method according to a second embodiment of the present application.

Based on the embodiment shown in fig. 1, in this embodiment, as shown in fig. 2, step S30 is preceded by steps S21 to S22.

Step S21, acquiring historical data as a training set;

Step S22, training on the training set to generate the preset decision tree model.

In a specific embodiment, Table 1 defines the indexes and their judgment criteria, as well as the features and the criteria that define the feature values. A feature is determined according to the pass ratio or accuracy ratio of its indexes. The indexes can be customized (added or removed) according to the user's specific business, and the rules can also be set by the user.

TABLE 1
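As a rough illustration of how a feature value could be derived from the pass ratio of an index's rules, the sketch below evaluates a small set of rules and maps the resulting ratio to a gear. The rules, field names, thresholds and the direction of the gear numbering are all hypothetical; the actual criteria are those defined in Table 1 and may be customized by the user.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[Dict[str, float]], bool]

# Hypothetical user-defined rules for one index.
GENERAL_INDEX_RULES: List[Rule] = [
    lambda d: d["years_in_business"] >= 2,
    lambda d: d["annual_revenue"] > 0,
    lambda d: d["tax_arrears"] == 0,
    lambda d: d["invoice_void_ratio"] < 0.2,
]


def pass_ratio(data: Dict[str, float], rules: List[Rule]) -> float:
    """Fraction of the index's rules that the standardized data satisfy."""
    return sum(1 for rule in rules if rule(data)) / len(rules)


def to_gear(ratio: float, thresholds: Tuple[float, float] = (0.5, 0.8)) -> int:
    """Map a pass ratio to gear 1, 2 or 3 using hypothetical cut-offs."""
    low, high = thresholds
    if ratio < low:
        return 1
    if ratio < high:
        return 2
    return 3


# Hypothetical standardized data for one enterprise.
enterprise = {
    "years_in_business": 5,
    "annual_revenue": 1_200_000.0,
    "tax_arrears": 0,
    "invoice_void_ratio": 0.35,
}
ratio = pass_ratio(enterprise, GENERAL_INDEX_RULES)
gear = to_gear(ratio)
print(ratio, gear)
```

Because the rules are plain callables held in a list, adding or removing a rule for a particular business only changes the list, not the evaluation logic.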

Referring to FIG. 3, FIG. 3 is a schematic flowchart of a decision tree-based data evaluation method according to a third embodiment of the present application.

Based on the embodiment shown in FIG. 2, in this embodiment, as shown in FIG. 3, step S22 includes steps S221 to S223.

Step S221, extracting training data features of the training set based on the preset rule engine and the training set;

step S222, determining at least one training index corresponding to the training data features through the preset rule engine;

step S223, calculating the information gain of each training index through a preset information gain function, and determining the node position corresponding to each training index according to the information gain to generate the preset decision tree model.

In a specific embodiment, taking the data in Table 1 as an example, Table 2 is generated after the data are processed according to the rules.

TABLE 2

The total information entropy value is calculated according to Table 2 as follows:

In this sample set, taking the feature "general index" as an example, the feature has 3 values {gear 1, gear 2, gear 3}. The subset with general index = gear 1 contains 13 samples, of which 4 are positive and 9 are negative; the subset with general index = gear 2 contains 5 samples, of which 3 are positive and 2 are negative; and the subset with general index = gear 3 contains 2 samples, of which 2 are positive and 0 are negative.

Referring to FIG. 4, FIG. 4 is a schematic flowchart of a decision tree-based data evaluation method according to a fourth embodiment of the present application.

Based on the embodiment shown in FIG. 2, in this embodiment, as shown in FIG. 4, step S223 includes steps S2231 to S2232.

Step S2231, dividing the training data features into a preset number of value intervals (gears) by impurity reduction, and calculating the gear entropy value corresponding to each value interval;

Entropy of gear 1 of the general index:

Entropy of gear 2 of the general index:

Entropy of gear 3 of the general index:

Information entropy of the general index:

Information gain of the general index: G_1 = I - I_1 = 0.14121.
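As a sketch of how these quantities follow from the total information entropy function, the counts given for the general index (13, 5 and 2 samples in gears 1 to 3) can be substituted directly. The symbols I_{1,k} for the gear entropies are labels introduced here for illustration only, the logarithm base is left unspecified as in the source, the term for an empty class is taken as zero, and the expression for I assumes that the three gears together make up the whole sample set of 20 samples (9 positive, 11 negative):

```latex
\begin{aligned}
I_{1,1} &= -\left(\tfrac{4}{13}\log\tfrac{4}{13} + \tfrac{9}{13}\log\tfrac{9}{13}\right) \\
I_{1,2} &= -\left(\tfrac{3}{5}\log\tfrac{3}{5} + \tfrac{2}{5}\log\tfrac{2}{5}\right) \\
I_{1,3} &= -\left(\tfrac{2}{2}\log\tfrac{2}{2}\right) = 0 \\
I_{1}   &= \tfrac{13}{20}\,I_{1,1} + \tfrac{5}{20}\,I_{1,2} + \tfrac{2}{20}\,I_{1,3} \\
I       &= -\left(\tfrac{9}{20}\log\tfrac{9}{20} + \tfrac{11}{20}\log\tfrac{11}{20}\right) \\
G_{1}   &= I - I_{1}
\end{aligned}
```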

Step S2232, calculating a total information entropy value of each training index based on the total information entropy function and the value interval, and calculating the information gain according to the total information entropy value and each gear entropy value.

Further, the total information entropy function is:

$$I(X) = -\sum_{i=1}^{n} p_i \log p_i$$

Further, the preset information gain function is:

$$\Delta I(X, f) = I(X) - \left(P_1 I(X_1) + \dots + P_N I(X_N)\right)$$

Specifically, Table 3 lists the information gain calculated for each feature.

TABLE 3

Feature	Information gain
General index	0.14121
Tax lending index	0.138462
Ticket lending index	0.506003
Three-party index	0.13457
Credit sign index	0.072314
Financial index	0.066304

As can be seen from Table 3, the information gain of the ticket lending index is the largest, so the ticket lending index is selected as the root node for classification. Taking Tables 1, 2 and 3 above as an example, the resulting decision tree model is shown in FIG. 5, which is a schematic diagram of the decision tree model in the embodiment of the present application.
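The descending-order arrangement and the choice of root node can be illustrated with the values listed in Table 3. This is only a sketch of the ranking step; the variable names are illustrative, and how the remaining indexes are placed below the root is determined by the model in FIG. 5.

```python
# Information gains as listed in Table 3.
information_gains = {
    "General index": 0.14121,
    "Tax lending index": 0.138462,
    "Ticket lending index": 0.506003,
    "Three-party index": 0.13457,
    "Credit sign index": 0.072314,
    "Financial index": 0.066304,
}

# Sort the training indexes by information gain, largest first.
ranked = sorted(information_gains.items(), key=lambda item: item[1], reverse=True)

# The index with the largest gain becomes the root node.
root_feature = ranked[0][0]

for name, value in ranked:
    print(f"{name}: {value}")
```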

Based on the embodiment shown in fig. 2, in this embodiment, step S22 includes:

Referring to fig. 6, fig. 6 is a schematic block diagram of a decision tree-based data evaluation apparatus for performing the decision tree-based data evaluation method according to the embodiment of the present application. The decision tree-based data evaluation device can be configured on a server.

As shown in fig. 6, the decision tree based data evaluation apparatus 400 includes:

the data normalization module 410 is configured to obtain target data, and perform normalization processing on the target data through a preset normalization engine, so as to generate target data features corresponding to the target data;

the index generation module 420 is configured to generate a target index of the target data feature through a preset rule engine and the data feature;

and a decision tree comparison module 430, configured to compare a preset decision tree model with the target index, so as to obtain an evaluation result of the target data based on the preset decision tree model.

Further, the decision tree-based data evaluation device further includes:

the training set module is used for acquiring historical data as a training set;

and the decision tree model generation module is used for training the training set and generating the preset decision tree model.

Further, the decision tree model generation module includes:

the data feature extraction unit is used for extracting training data features of the training set based on the preset rule engine and the training set;

the training index determining unit is used for determining at least one training index corresponding to the training data characteristics through the preset rule engine;

the decision tree model generating unit is used for calculating the information gain of each training index through a preset information gain function, determining the node position corresponding to each training index according to the information gain, and generating the preset decision tree model.

Further, the decision tree model generating unit includes:

the gear entropy value calculating subunit is used for dividing the training data features into a preset number of value intervals (gears) by impurity reduction and calculating the gear entropy value corresponding to each value interval;

and the information gain calculation subunit is used for calculating the total information entropy value of each training index based on the total information entropy function and the value interval, and calculating the information gain according to the total information entropy value and each gear entropy value.

Further, the decision tree model generating unit further includes:

and the index sorting subunit is used for carrying out descending order processing on the information gains, and generating the preset decision tree model according to descending order arrangement of the training indexes corresponding to the information gains.

It should be noted that, for convenience and brevity of description, the specific working process of the apparatus and each module described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

The apparatus described above may be implemented in the form of a computer program which is executable on a device as shown in fig. 7.

Referring to fig. 7, fig. 7 is a schematic block diagram of an apparatus according to an embodiment of the present application. The device may be a server.

Referring to fig. 7, the apparatus includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.

The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform any one of the decision tree-based data evaluation methods.

The processor is used to provide computing and control capabilities to support the operation of the entire device.

The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any one of the decision tree-based data evaluation methods.

The network interface is used for network communication, such as transmitting assigned tasks. It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of the part of the structure related to the present application and does not limit the apparatus to which the present application is applied; a particular apparatus may include more or fewer components than those shown in the drawing, may combine certain components, or may have a different arrangement of components.

It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:

In one embodiment, the comparison between the preset decision tree model and the target index is performed before the evaluation result of the target data based on the preset decision tree model is obtained, and the method is used for realizing:

acquiring historical data as a training set;

training the training set to generate the preset decision tree model.

In one embodiment, the training set is trained to generate the preset decision tree model for implementing:

In one embodiment, the information gain of each training index is calculated by a preset function, so as to realize:

In one embodiment, the information gain of each training index is calculated through a preset information gain function, and the node position corresponding to each training index is determined according to the information gain, so as to generate the preset decision tree model, which is used for realizing:

Embodiments of the present application further provide a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program includes program instructions, and the processor executes the program instructions to implement any of the decision tree-based data evaluation methods provided in the embodiments of the present application.

The computer readable storage medium may be an internal storage unit of the device according to the foregoing embodiment, for example, a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the device.

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various equivalent changes and substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A decision tree-based data evaluation method, characterized in that the decision tree-based data evaluation method comprises:

2. The decision tree-based data evaluation method according to claim 1, wherein the comparing the preset decision tree model with the target index, before obtaining the evaluation result of the target data based on the preset decision tree model, comprises:

acquiring historical data as a training set;

training the training set to generate the preset decision tree model.

3. The decision tree-based data evaluation method according to claim 2, wherein the training set to generate the preset decision tree model comprises:

4. A decision tree based data evaluation method according to claim 3, wherein said calculating the information gain of each of said training metrics by a predetermined function comprises:

5. The decision tree based data evaluation method of claim 4, wherein the total information entropy function is:

$$I(X) = -\sum_{i=1}^{n} p_i \log p_i$$

6. The decision tree based data evaluation method of claim 4, wherein the predetermined information gain function is:

$$\Delta I(X, f) = I(X) - \left(P_1 I(X_1) + \dots + P_N I(X_N)\right)$$

7. The decision tree-based data evaluation method according to claim 3, wherein the calculating the information gain of each training index by a preset information gain function, and determining the node position corresponding to each training index according to the information gain, generating the preset decision tree model comprises:

8. A decision tree based data evaluation device, comprising:

9. An apparatus comprising a memory and a processor;

the memory is used for storing a computer program;

the processor for executing the computer program and for implementing the decision tree based data evaluation method according to any one of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the decision tree based data evaluation method according to any one of claims 1 to 7.