CN112417176B

CN112417176B - Method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics

Info

Publication number: CN112417176B
Application number: CN202011430159.9A
Authority: CN
Inventors: 仇钧; 姚利虎; 韩静; 李志刚
Original assignee: Bank of Communications Co Ltd
Current assignee: Bank of Communications Co Ltd
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2024-04-02
Anticipated expiration: 2040-12-09
Also published as: CN112417176A

Abstract

The invention relates to a method, equipment and medium for mining implicit association relations between enterprises based on graph features, wherein the method comprises the following steps: step 1: identifying and removing super nodes from the existing data in a database to generate an association relation graph; step 2: extracting and generating equity trees, and further processing all the equity trees to obtain equity tree pairs and the corresponding data of the root nodes of each pair; step 3: constructing a feature variable system along two dimensions, the equity tree pair and the equity tree root node pair; step 4: performing index aggregation and model wide-table integration on all the equity tree indexes in the feature variable system to obtain the final data for model training; step 5: training a LightGBM model on the final data, and mining the implicit association relations between enterprises with the trained LightGBM model. The invention effectively improves insight into the relations between enterprise clients and provides a strong reference for risk-related decisions.

Description

Method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics

Technical Field

The invention relates to the technical field of financial science and technology, in particular to a method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics.

Background

At present, China's economy is in a period of transformation, deep adjustment and cyclical gear shifting, and contradictions in economic development are increasingly prominent. Under these conditions, enterprises that band together in groups gain certain advantages, but the practice also carries risks. Compared with a single enterprise, the relationships within an enterprise group are more complicated: internal organizational structures are interwoven across regions, industries and diversified lines of business, and mutual guarantees and multiple loan applications among related enterprises create serious information asymmetry among banks and between banks and enterprises. In pursuit of scale, banks compete to extend credit to enterprise groups, especially listed ones, so that the credit a group obtains from banks can far exceed the maximum debt it can bear. A credit risk event at one enterprise in a group can easily trigger a domino effect, set off a chain reaction and even cause systemic risk. A single bank's credit to a single subsidiary may look reasonable, while the combined credit of several banks to the whole group is not necessarily reasonable. In recent years, banks have frequently suffered heavy losses from the bankruptcy of group enterprises.

Article 216 of the Company Law provides: an association relationship refers to the relationship between a company's controlling shareholder, actual controller, directors, supervisors or senior managers and the enterprises they directly or indirectly control, as well as other relationships that may lead to a transfer of the company's interests. However, state-controlled enterprises are not deemed to be associated with one another merely because they are all controlled by the state.

Therefore, the association relation in the sense of the Company Law emphasizes "control": the relationship between a company's controlling shareholder, actual controller, directors, supervisors or senior managers and the enterprises they directly or indirectly control amounts to the control of different enterprises by the same controller.

However, neither the regulatory rules on group-related credit nor the Company Law defines the associated parties specifically and clearly. In the relevant regulatory documents, the regulator describes an implicit relationship as one in which enterprises show no association on the surface but in fact have a hidden investment relationship, or a relationship of control or influence over business decisions, fund allocation, or production and operation.

According to the regulator's definition of the implicit association relationship, and taking into account the scenarios in which a machine learning model is applicable, the invention defines the implicit association relationship used for modeling as an "implicit control relationship", which has the following characteristics:

1. The bank cannot obtain the control relationship from the enterprise's public equity data;

2. Sole or joint control of an enterprise is achieved by signing an agreement with a third party.

The first prior-art scheme for implicit control relations is credit management of enterprise groups by the bank's operating units. Based on the group headquarters' control over its member units, the group's operating and financial characteristics, the availability of consolidated statements, the closeness of its cooperation with the bank and so on, each unit classifies the enterprise group into types such as total-to-total, top-down and bottom-up in order to approve a credit scheme, and finally decides through quantitative calculation whether to add a new enterprise to the credit group tree.

This scheme relies mainly on credit investigation reports that bank staff prepare on enterprises, covering dimensions such as public business registration information, upstream and downstream supply chain relations, local trade background and the enterprise's financial statements, and uses human experience to judge whether an enterprise has a potential association with an existing credit client of the bank. On the one hand, this consumes a great deal of labor; on the other hand, a single bank can hardly obtain the real operating conditions of an enterprise group across multiple regions, banks and subsidiaries, so key association relations are easily missed or misjudged and the credit group tree ends up incomplete or inaccurate.

The second prior-art scheme uses commercial software that has become widely used in recent years, such as Tianyancha and Qichacha, to perform equity penetration on enterprises. Using a commercial graph database and public business registration data, it traces shareholders upward and subsidiaries downward, and visualizes all of an enterprise's shareholders down to the natural-person and legal-person level.

The main disadvantages of the equity penetration technology in this scheme are as follows:

1) The association relation types used are limited, mainly equity and tenure relations, and other relation types such as funds, trade, guarantee and mortgage are not fully used;

2) As for equity and tenure relations, enterprises can choose not to disclose equity information, and a complex equity structure (such as companies registered overseas) can bypass the domestic business registration process, thereby hiding the real control relationship;

3) The existing equity penetration technology does not make full use of the graph features and graph patterns of the enterprise association network, and does not mine potential associations among enterprises through latent-space features.

At present, the discovery of implicit association relations among enterprises relies mostly on the manual sorting and investigation by experienced credit review staff described in scheme one, which is time-consuming, labor-intensive and cannot be updated in time. Although the equity visualization tools of scheme two are available on the market, their data sources are too narrow and they do not make full use of the graph features and graph patterns of the association network, so implicit relations are insufficiently mined.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a method, equipment and medium for mining the implicit association relation between enterprises based on graph characteristics.

The aim of the invention can be achieved by the following technical scheme:

a method for mining implicit association relation between enterprises based on graph features comprises the following steps:

step 1: performing super node identification and rejection operation on existing data in a database to generate an association relation graph;

step 2: extracting and generating equity trees based on the existing data and the association relation graph, and further processing all the equity trees to obtain equity tree pairs and the corresponding data of the root nodes of each pair;

step 3: based on the association relation graph, the equity tree pairs and the corresponding data of their root nodes, and in combination with the Y variable identification rule, constructing a feature variable system along two dimensions, the equity tree pair and the equity tree root node pair;

step 4: performing index aggregation and model wide-table integration on all the equity tree indexes in the two-dimensional feature variable system to obtain the final data for model training;

step 5: training a LightGBM model on the final data to obtain the trained LightGBM model, namely the implicit association relation mining model, and mining the implicit association relations among enterprises with the trained model.

Further, step 1 specifically includes: preliminarily partitioning the existing data in the database according to the results of the degree centrality, in-degree, out-degree, PageRank and closeness centrality indexes to identify super nodes, and removing them to generate the association relation graph, wherein the database is a TigerGraph database.

Further, the step 2 comprises the following sub-steps:

step 201: extracting the full set of credit-granted client data at the given time point from the existing data, and generating equity trees from this client data through equity penetration rules;

step 202: integrating and de-duplicating the equity tree data, and defining the equity tree levels;

step 203: computing connected components on the association relation graph of step 1, and generating a connected component number for each node;

step 204: removing from the equity tree data the non-root nodes that sit at the same level as a root node, the super nodes, and the nodes whose connected component number is empty;

step 205: based on step 204, removing equity trees that contain only isolated nodes, equity trees whose nodes all belong to different connected components, and equity trees that are isolated from all other trees;

step 206: based on the equity tree data and the connected component numbers produced in steps 202 to 205, combining the equity trees within the same connected component number two by two into equity tree pairs, removing pairs whose root nodes are individuals, and not pairing equity trees that belong to different connected component numbers;

step 207: after pairing is completed, obtaining the equity tree pairs and the corresponding data of the root nodes of each pair.
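
The pairing rule of step 206 can be sketched as follows. This is a minimal illustration, not the patented implementation: the tree identifiers, the dictionary layout and the function name `build_tree_pairs` are assumptions made for the example, and the filter on individual root nodes is omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_tree_pairs(tree_component):
    """Pair equity trees that share a connected component number.

    tree_component maps a (hypothetical) tree id to its component number.
    Trees with an empty component number (None) are skipped, and trees in
    different components are never paired.
    """
    by_component = defaultdict(list)
    for tree_id, comp in tree_component.items():
        if comp is not None:
            by_component[comp].append(tree_id)

    pairs = []
    for comp in sorted(by_component):
        trees = sorted(by_component[comp])
        # Every unordered pair of trees inside the same component.
        pairs.extend((a, b, comp) for a, b in combinations(trees, 2))
    return pairs


# Illustrative data only: three trees in component 1, one in component 2,
# and one tree without a component number.
example = {"T1": 1, "T2": 1, "T3": 1, "T4": 2, "T5": None}
pairs = build_tree_pairs(example)
```

Restricting the combinations to one component keeps the number of candidate pairs far below the number of all possible tree pairs, which matters when the sample set is large.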

Further, step 3 adopts a feature variable system along two dimensions, the equity tree pair and the equity tree root node pair, wherein the feature variables comprise equity tree pair feature variables and equity tree root node pair feature variables; the equity tree pair feature variables comprise intra-tree graph indexes, inter-tree graph indexes, fund transactions between equity trees and graph pattern indexes of the equity tree pair, and the root node pair feature variables comprise graph indexes and graph pattern indexes of the root node pair.

Further, the mining method further includes step 6: balancing the positive and negative samples of the development sample with a two-stage PU-Learning modeling method, and evaluating the implicit association relation mining model on the balanced development sample.

Further, step 5 specifically includes: using the final data as the input data of the LightGBM model, setting the evaluation target, the key hyperparameters to be tuned and the model training strategy, and finally obtaining the model hyperparameters and model results, that is, the trained LightGBM model serving as the implicit association relation mining model, which is then used to mine the implicit association relations among enterprises.

Further, the evaluation target uses AUC and binary_logloss, and the key hyperparameters comprise general parameters, boosting parameters and model learning parameters.
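
As a hedged illustration of such a configuration, the dictionary below uses standard LightGBM parameter names. Only the two metric names come from the text; every numeric value and the grouping of parameters in the comments are assumptions chosen for the example, not values disclosed by the invention.

```python
# Hypothetical LightGBM parameter set; values are illustrative assumptions.
lgbm_params = {
    # General parameters (assumed grouping)
    "objective": "binary",        # two-class task: implicit relation or not
    "boosting_type": "gbdt",      # gradient boosting decision trees
    "verbose": -1,
    # Evaluation target named in the text
    "metric": ["auc", "binary_logloss"],
    # Boosting parameters (assumed grouping)
    "num_leaves": 31,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    # Model learning parameters (assumed grouping)
    "learning_rate": 0.05,
}
```

A dictionary of this shape is what the LightGBM Python package accepts as its `params` argument when training.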

Further, the model training strategy uses hold-out evaluation or K-fold cross-validation.
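
The two strategies differ only in how sample indexes are split. The sketch below shows both with the standard library; the function names, the 30% test ratio and the choice of 5 folds are assumptions for illustration.

```python
import random


def holdout_split(n_samples, test_ratio=0.3, seed=0):
    """Hold-out: one shuffled split into training and test indexes."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(round(n_samples * (1 - test_ratio)))
    return idx[:cut], idx[cut:]


def k_fold_splits(n_samples, k=5, seed=0):
    """K-fold: each fold serves once as validation data."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


train_idx, test_idx = holdout_split(10)
splits = list(k_fold_splits(10, k=5))
```

Hold-out is cheaper on very large samples, while K-fold gives a more stable estimate of the evaluation metric at the cost of training k models.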

The invention also provides a terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the method for mining the implicit association relation between enterprises based on graph characteristics when executing the computer program.

The invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the method for mining the implicit association relation between enterprises based on graph characteristics when being executed by a processor.

Compared with the prior art, the invention has the following advantages:

(1) The invention describes a method for mining the implicit association relation between enterprises by utilizing a machine learning model based on graph characteristics, and by the method, the implicit relation identification is realized, the accuracy, coverage rate and timeliness of prediction are improved, and the bank can better perform group credit management on clients, so that the group risk management capability is enhanced.

(2) The invention provides an enterprise implicit association relation mining model based on graph features. The model is built on big data processing, graph databases, knowledge graphs, machine learning and related technologies, uses multiple data sources inside and outside the bank, combines graph analysis with artificial intelligence, and mines the implicit association relations among enterprise clients in depth. It effectively improves insight into the relations among enterprise clients and provides a strong reference for decisions such as risk management and risk prevention and control.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention;

FIG. 2 is a list of equity tree pair features in the method of the present invention;

FIG. 3 is a list of the characteristics of the root node pairs of the equity tree in the method of the present invention;

FIG. 4 is a schematic diagram of the LightGBM algorithm modeling process according to the method of the present invention;

FIG. 5 is a schematic diagram of a process flow of feature variables in an embodiment of the method of the present invention;

FIG. 6 is a graph of the result of supernode index partitioning in an embodiment of the method of the present invention;

FIG. 7 is a graph of connected component statistics of a behavior graph in an embodiment of the method of the present invention, wherein FIG. 7 (a) is a graph of the results of the statistics and the number of nodes in the connected body, and FIG. 7 (b) is a graph of the results of the number of nodes in the connected body and the number of connected bodies;

FIG. 8 is a schematic diagram of the behavior graph relationship types in an embodiment of the method of the present invention;

FIG. 9 is a schematic diagram of a process for generating an equity tree in an embodiment of the method of the present invention;

FIG. 10 is a schematic diagram of a two-stage model scheme in an embodiment of the method of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

1. The relevant abbreviations and key terms of the technical scheme of the invention are defined as follows:

Implicit association relation: an association in which enterprises show no association on the surface but in fact have a hidden investment relationship, or a relationship of control or influence over business decisions, fund scheduling, or production and operation. Source: the Zhejiang Banking Regulatory Bureau's notice on risk warnings for implicit associations of bank credit clients < 2012-

Credit (trust) group tree (or group relationship): when banks provide credit services, larger enterprises can be placed under group credit management. The members of the tree have been granted credit uniformly as a group, so their membership in the group is confirmed.

Equity penetration tree: a tree-shaped structure that depicts the hierarchy of corporate equity by combining business-registration equity relations, senior management tenure relations and group credit relations. It includes the equity penetration trees of credit groups and those of non-group credit clients.

Knowledge graph: a large-scale semantic network takes entities or concepts as nodes and is connected through semantic relations.

By discovering associations among entities and integrating semi-structured and unstructured data, a knowledge graph helps a machine understand data, explain phenomena and perform knowledge reasoning, thereby uncovering deep relationships and enabling intelligent search and intelligent interaction.

Graph database: a non-relational database that applies graph theory to store entity information and the relationships between entities; the main tools include TigerGraph and Neo4j.

Supervised machine learning: a training set is built from feature variables defined on window-period data and a target variable formed from credit performance at a specific time; a machine learning algorithm is then trained on this set to produce a classification model, and the trained model is finally applied to predict the credit performance of customers.

Time window: according to the required modeling period, the historical data is split along the time dimension into several data sets that supply material for model training. The observation point is set according to when the business application needs the implicit association mining model to predict the implicit associations of enterprises, usually the end of a quarter, half year or similar period. A fixed period before the observation point is taken as the observation period, from which the feature variables (X variables) of the training set are built; a fixed period after the observation point is taken as the performance period, in which the behavior of the client sample after the observation point is collected to build the target variable (Y variable) of the training set.
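
The relation between the observation point and the two periods can be sketched in a few lines. The period lengths (90 and 92 days) and the function name `time_windows` are assumptions for illustration; the text does not state the actual lengths.

```python
from datetime import date, timedelta


def time_windows(observation_point, observation_days, performance_days):
    """Return the observation period (for X variables) and the
    performance period (for the Y variable) around an observation point."""
    obs_start = observation_point - timedelta(days=observation_days)
    perf_end = observation_point + timedelta(days=performance_days)
    return {
        "observation": (obs_start, observation_point),
        "performance": (observation_point, perf_end),
    }


# Hypothetical example: a quarter-end observation point.
w = time_windows(date(2019, 9, 30), observation_days=90, performance_days=92)
```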

2. The complete technical scheme provided by the invention is as follows:

Firstly, the associations between business entities are abstracted into a graph structure and stored in a graph database; secondly, with the equity relation tree as the basis of explicit relations, equity tree pairs are extracted as study samples; then, feature variables of the model are built from graph features on the relation graphs of funds, trade, guarantee, mortgage and so on, together with the business characteristics of fund transactions; finally, a two-stage PU-Learning modeling method with the LightGBM classification algorithm is used to train, validate and apply the model, predicting the probability of an implicit association relationship between any two enterprises.

The modeling steps of each stage in the figure are described in detail below:

1. association graph construction

The association relation graph combines data from inside and outside the bank, contains 7 major relation types covering tenure, guarantee, funds, beneficiary, trade and same address/telephone, and has the super nodes removed.

Because of enterprise economic behavior and the characteristics of the relationship edges, "super nodes" such as Alipay may appear on the relation graph. The super node problem is still an active research topic in academia and industry, and no effective general method for identifying and handling it exists. Super nodes, first, reduce the efficiency of the graph database; second, distort model development and the generation of the prediction customer group; third, make the results of some graph indexes abnormal. Since super nodes may affect the implicit association mining scheme, they need to be identified and handled first.

The invention identifies super nodes mainly by qualitative analysis, supplemented by quantitative analysis. The qualitative analysis starts from the business scenario and, drawing on management experience, locates the nodes that have no application meaning in each relation type, from which node identification rules are derived; examples are third-party payment platforms in the fund relation and government agencies in the equity relation. The quantitative analysis computes six graph indexes for the nodes of the association relation graph (out-degree, in-degree, degree centrality, PageRank, betweenness centrality and closeness centrality) to support the qualitative analysis. For example, third-party payment nodes in the fund relation show a high in-degree and a low out-degree, while government agency nodes in the equity relation show a low in-degree and a high out-degree. The super nodes are determined by combining the qualitative and quantitative results.
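
The quantitative part of this screening can be sketched with simple degree counts. This is a minimal illustration under stated assumptions: the edge list, the node names and the threshold of 4 are invented for the example, only two of the six indexes are computed, and the output is a candidate list that would still need the qualitative review described above.

```python
from collections import Counter


def degree_table(edges):
    """Return {node: (in_degree, out_degree)} for a directed edge list."""
    out_deg, in_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(out_deg) | set(in_deg)
    return {n: (in_deg[n], out_deg[n]) for n in nodes}


def supernode_candidates(edges, degree_threshold):
    """Flag nodes whose in-degree or out-degree reaches the threshold."""
    table = degree_table(edges)
    return sorted(n for n, (i, o) in table.items()
                  if max(i, o) >= degree_threshold)


# Hypothetical graph: "PAY" receives from many nodes (high in-degree),
# "GOV" points to many nodes (high out-degree).
edges = [
    ("A", "PAY"), ("B", "PAY"), ("C", "PAY"), ("D", "PAY"),
    ("A", "B"),
    ("GOV", "A"), ("GOV", "B"), ("GOV", "C"), ("GOV", "D"),
]
candidates = supernode_candidates(edges, degree_threshold=4)
```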

2. Customer group sample extraction and Y variable identification

The subjects of the implicit relation study are the bank's credit-granted clients. Firstly, starting from each credit client, equity and tenure relations are penetrated upward and downward to form a number of equity relation trees. Each equity tree is then classified as a credit group tree or a non-credit-group tree according to the bank's group credit relations. Y variable identification rule: for an equity tree pair in the model development sample, if a group client relationship exists, the pair is labeled "implicit relationship"; otherwise it is labeled "no relationship".
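
A minimal sketch of the labeling rule follows. It assumes that "a group client relationship exists" means both trees of the pair belong to the same credit group; that reading, the mapping `group_of_tree` and the function name are assumptions made for the example.

```python
def label_pair(group_of_tree, tree_a, tree_b):
    """Return 1 ("implicit relationship") if both equity trees belong to
    the same credit group, else 0 ("no relationship").

    group_of_tree maps a tree id to its credit group id; trees that are
    not part of any credit group are absent or mapped to None.
    """
    group_a = group_of_tree.get(tree_a)
    group_b = group_of_tree.get(tree_b)
    return 1 if group_a is not None and group_a == group_b else 0


# Hypothetical data: T1 and T2 share a credit group, T3 has none.
groups = {"T1": "G1", "T2": "G1", "T3": None, "T4": "G2"}
labels = {
    ("T1", "T2"): label_pair(groups, "T1", "T2"),
    ("T1", "T3"): label_pair(groups, "T1", "T3"),
    ("T1", "T4"): label_pair(groups, "T1", "T4"),
}
```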

3. Feature variable design

In line with the characteristics of the target variable Y, firstly, a two-dimensional feature variable system is designed around equity tree pairs and equity tree root node pairs, to analyze comprehensively the degree of association between the equity trees of credit clients; secondly, two kinds of indexes, graph features and non-graph features, enrich the structural information of the graph and improve the accuracy of the model.

As shown in fig. 2, the features designed for the equity tree pair dimension include graph feature variables and graph pattern feature variables. Wherein:

the graph feature variables are based on the topology inside each equity tree and the structure between equity trees, and comprise centrality features, structure features, path features and neighborhood features.

The graph pattern feature variables are based on the path structures, on the behavior graph, between enterprises in different equity penetration trees; they are indexes that reflect the patterns formed by edge relations, mainly covering the 7 major relation types and their combinations.

As shown in fig. 3, the features designed for the equity tree root node pair dimension include graph feature variables, graph pattern feature variables and non-graph feature variables. Wherein:

the graph feature variables are graph indexes calculated from the topology between equity tree root nodes, and comprise neighborhood features, node features and path features.

The graph pattern feature variables describe the relation patterns between nodes based on the path patterns between equity tree root nodes, covering the 7 major relation types and their combinations.

The non-graph feature variables are based on the fund transactions of the equity tree root nodes.

4. Algorithm modeling

The invention uses the LightGBM algorithm, a newer member of the boosting family and an efficient implementation of gradient boosting decision trees, released by Microsoft in 2016. In principle it is similar to XGBoost; compared with XGBoost, it offers faster training, lower memory use and direct support for categorical features. The training process is shown schematically in fig. 4:

The AUC value is selected as the model evaluation index.

The ROC (Receiver Operating Characteristic) curve is a composite index that reflects sensitivity and specificity as continuous variables. Abscissa: the false positive rate (FPR), the proportion of all negative samples that are predicted positive; ordinate: the true positive rate (TPR), the proportion of all positive samples that are predicted positive.
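
One point of the ROC curve corresponds to one score threshold. The sketch below computes TPR and FPR at a single threshold; the labels, scores and the threshold of 0.5 are made-up illustration data.

```python
def tpr_fpr(labels, scores, threshold):
    """Compute (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.7, 0.1]
tpr, fpr = tpr_fpr(labels, scores, threshold=0.5)
```

Sweeping the threshold from high to low and plotting each (FPR, TPR) pair traces out the full curve.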

AUC is the area under the ROC curve. It represents the probability that, when one positive sample and one negative sample are chosen at random, the classifier ranks the positive sample ahead of the negative one according to the computed score. AUC = 1 indicates a perfect classifier, and 0.5 < AUC < 1 indicates a classifier better than random guessing.
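
The ranking interpretation of AUC can be checked directly by counting positive-negative pairs. The sketch below is a brute-force illustration on made-up data, with ties counted as one half; it is quadratic in the sample size and not meant for large data sets.

```python
def auc_by_ranking(labels, scores):
    """AUC as the share of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.7, 0.1]
auc = auc_by_ranking(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the value is 8/9.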

3. Specific examples are as follows:

The main work of implementing the implicit association relation mining model can be divided into two stages: feature variable processing and algorithm modeling.

The characteristic variable processing is shown in fig. 5, wherein:

1) The basic data processing stage comprises super node discovery and construction of the association relation graph;

2) Equity tree pair generation and labeling corresponds to the extraction and sampling of positive and negative samples for the model;

3) Processing of the basic indexes of equity trees and root nodes corresponds to the processing of the several types of feature variables;

4) Index aggregation and model wide-table integration is the final data preparation before model training.

3.1, building an association relationship diagram

The association relation graph of the implicit model is the basis for the subsequent processing of the model's customer group samples and of the two types of graph feature variables, graph indexes and graph patterns. The specific generation steps are as follows:

a. super node culling

Due to the economic behavior of enterprises and the side characteristics of the association relationship, super nodes appear on the association relationship graph. The existence of the super node affects the application efficiency of the graph database; secondly, influencing model development and predicting guest group generation results; thirdly, the calculation result of partial graph indexes is abnormal. In view of the fact that the super nodes possibly affect the hidden association relation mining model scheme, the super node identification and processing work is firstly carried out.

The total number of clients at two time points 20190930 and 20191231 are analyzed according to the index results of the degree centrality, the degree of ingress, the degree of egress, the page ranking and the tight centrality, and the total number of clients is 19351560. The specific index and the dividing rule are shown in fig. 6;

aiming at the identified 55 super nodes, combining with the actual enterprise operation situation, determining 28 clients as super nodes, and adopting a processing mode of eliminating the point-side files.

In order to better know the connectivity degree of nodes in the equity tree on the behavior diagram, connectivity component analysis is introduced. Calculating the connected components of the nodes in the equity tree on the 20191231 time point behavior diagram based on the condition that the identified 28 supernodes are removed and the nodes are limited to be between 0-5 layers in the equity tree, generating 119,300 connected bodies, wherein 104,871 connected bodies only comprise one node, and the profile analysis of the other 14,429 connected bodies is shown in fig. 7 (a) and 7 (b);
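
The kind of statistic reported here (number of connected components and how many contain a single node) can be sketched with a union-find structure. The example treats edges as undirected, which is an assumption, and the node names and edges are invented; a production run on a graph of this size would rely on the graph database's own algorithm rather than on this sketch.

```python
from collections import Counter


def connected_components(nodes, edges):
    """Assign each node a component representative using union-find.
    Edges are treated as undirected and must reference known nodes."""
    parent = {n: n for n in nodes}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return {n: find(n) for n in nodes}


# Hypothetical toy graph with three components: {A, B, C}, {D, E}, {F}.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]
comp = connected_components(nodes, edges)
sizes = Counter(comp.values())
n_components = len(sizes)
n_singletons = sum(1 for size in sizes.values() if size == 1)
```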

based on the analysis of the condition of the communicating body, the behavior diagram is provided with an oversized communicating body consisting of 435,107 weight tree nodes, and the communicating component number is 1. And according to the index results of the degree center degree, the degree input and the degree output, respectively carrying out statistical analysis on the client group with the communication component number of 1 at the time point 20191231 under the aperture of the summarized relation type and the sub relation type, and extracting a super node list.
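
A minimal sketch of how connected component numbers could be assigned, using a union-find structure in plain Python. The text computes them inside the graph database; this stand-in only illustrates the idea. Numbering components by descending size (so that the largest one gets number 1) is an assumption made here to mirror the oversized component described above.

```python
from collections import Counter


def connected_component_ids(nodes, edges):
    """Return {node: component_number}; the largest component is numbered 1."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        if a in parent and b in parent:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    sizes = Counter(find(n) for n in nodes)
    ranked = sorted(sizes, key=lambda r: (-sizes[r], str(r)))
    number = {root: i + 1 for i, root in enumerate(ranked)}
    return {n: number[find(n)] for n in nodes}


nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
comp = connected_component_ids(nodes, edges)
singletons = sum(1 for size in Counter(comp.values()).values() if size == 1)
```

Counting the components of size one, as `singletons` does, corresponds to the single-node components reported in the analysis.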

b. Association relation graph generation

Based on the point and edge data of the 7 classes and 11 subclasses of association relations, with super nodes removed, the behavior graphs at time points 20190930 and 20191231 required by the implicit model are generated in the TigerGraph database. Model customer-group sample processing and the calculation of graph indexes and graph pattern indexes are then carried out on them.

The association relation graph consists of company and individual entities linked by 7 major classes (11 subclasses) of relations, covering tenure, guarantee, funds, beneficiary, trade, and same address/telephone relations, as shown in fig. 8. The guarantee circle, the fund circle and the trade circle are graph indexes obtained by finding cycles within 10 steps with a circle-searching algorithm. The association relation graph does not include the ordinary equity relation: the equity relation is an explicit relation and is used to build the equity trees from which the Y-label samples are extracted, so it is excluded from the model that explores implicit relations.
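
The circle indexes can be illustrated with a depth-bounded search. The text does not specify the circle-searching algorithm, so the sketch below simply uses a depth-first search as a stand-in; `cycles_through` and the toy `guarantees` graph are hypothetical names and data.

```python
def cycles_through(graph, start, max_steps=10):
    """Return simple directed cycles through `start` that use at most max_steps edges."""
    found = []

    def walk(node, path):
        # `path` holds the nodes visited so far; closing it uses len(path) edges.
        for nxt in graph.get(node, ()):
            if nxt == start:
                found.append(path + [start])
            elif nxt not in path and len(path) < max_steps:
                walk(nxt, path + [nxt])

    walk(start, [start])
    return found


# Toy guarantee relations: A guarantees B, B guarantees C, C guarantees A and D.
guarantees = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
circles = cycles_through(guarantees, "A")
in_circle = bool(circles)
```

A node-level index such as "participates in a guarantee circle" would then be the boolean `in_circle`, and the same search could be run on fund or trade edges.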

The prediction sample set of the model is about 210 million, so the graph database adopted by the invention is TigerGraph. Compared with other graph computing schemes such as Neo4j, JanusGraph and Spark, TigerGraph supports native-graph distributed parallel computing and can achieve higher computing performance.

3.2 Customer group sample extraction and Y variable identification

The processing logic for the implicit model customer-group samples is as follows:

1) Loan customer extraction. Extract the data of the full set of legal-person loan customers.

2) Equity tree generation. Starting from the screened legal-person loan customers, equity trees are generated by the equity penetration rule (namely the E-class tree penetration generation rule). The controlling equity relations used in penetration are: controlling shareholders holding more than 50%; the largest shareholder holding 50% or less (inclusive); and tied largest shareholders. The rules for handling the penetration target point are:

If the penetration target point is a government body at any level (for example, state authorities such as the national committee, the finance department, the education department or the health department), fall back to the enterprise on the next layer down to form the controlling equity tree;

If the penetration target point is a public institution (such as a school, hospital, television station or newspaper office), the public institution itself is the target point of the controlling equity tree;

If the penetration target point is an overseas (Hong Kong, Macao or Taiwan) enterprise, that enterprise itself is the target point of the controlling equity tree;

For group customer equity trees, the penetration result must leave at least two equity trees in each group; where a group would have only a single equity tree, the tree must be rolled back to the layer at which it forks. Non-group trees need not follow this rule.

3) Equity tree integration. Based on the equity tree data generated in the previous step, trees with the same root node are merged and de-duplicated.

4) Equity tree level limit. Based on the equity tree data generated in the steps above, the tree depth is limited to three layers below the root node.

5) Connected component number generation. Connected components are computed on the generated association relation graph, giving the connected component number of every node on the behavior graph.

6) Node elimination within equity trees. Remove non-root nodes on the same layer as the root node, super nodes, and nodes whose connected component number is null.

7) Equity tree elimination. Based on the steps above, remove equity trees that contain only isolated nodes, equity trees whose nodes belong to different connected components, and equity trees that are isolated from all other trees.

8) Equity tree pairing. Based on the generated equity tree data and connected component numbers, equity trees within the same connected component are combined two by two into equity tree pairs (no reversed pairs are generated), and pairs whose root node is an individual are removed. Equity trees in different connected components are not paired.

9) Result data generation. The result data fields include the tree numbers of the equity tree pair and, for every node in the equity trees, the customer SID, customer name, starting point, root node and the node's level in its tree.

The corresponding processes of the steps 2) to 9) are shown in fig. 9.
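
The penetration rule of step 2) can be sketched as follows. This is a simplified reading, not the full rule set: the group-customer rule is not modelled, the entity-kind labels (`"government"`, `"public_institution"`, `"hk_macao_taiwan"`) and function names are hypothetical, and tied largest shareholders are handled by following every tied branch. Since a shareholder above 50% is necessarily the largest one, both control cases reduce to "largest stake, ties included".

```python
STOP_KINDS = {"public_institution", "hk_macao_taiwan"}


def controlling_holders(holders):
    """holders: {shareholder: stake in percent}. Return the controlling shareholder(s)."""
    if not holders:
        return []
    top = max(holders.values())
    return sorted(h for h, pct in holders.items() if pct == top)


def find_roots(start, ownership, kind, seen=frozenset()):
    """Walk upward along controlling stakes and return the root node(s) reached."""
    seen = seen | {start}
    roots = set()
    for holder in controlling_holders(ownership.get(start, {})):
        holder_kind = kind.get(holder, "enterprise")
        if holder_kind == "government" or holder in seen:
            roots.add(start)      # fall back to the enterprise one layer below
        elif holder_kind in STOP_KINDS:
            roots.add(holder)     # the institution or overseas enterprise is the root
        else:
            roots |= find_roots(holder, ownership, kind, seen)
    return roots or {start}       # nobody above: the node itself is the root


ownership = {
    "Borrower": {"MidCo": 60.0, "Other": 40.0},
    "MidCo": {"TopCo": 30.0, "Fund": 30.0, "Small": 10.0},
    "TopCo": {"Ministry": 100.0},
}
kind = {"Ministry": "government"}
roots = find_roots("Borrower", ownership, kind)
```

Here `MidCo` has two tied largest shareholders, so two roots are returned, and `TopCo` becomes a root because its only controller is a government body.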

After equity penetration of the roughly 50,000 credit clients, about 540,000 nodes were obtained, belonging to about 33,000 equity trees: 7,358 group trees and 26,088 non-group trees. The 33,000 equity trees were paired two by two and pairs whose root node is an individual were removed; the remaining 250 million or so pairs are the model customer-group samples.

In the generated model customer-group samples, equity tree pairs within the same group are labeled "implicit relation", and all other equity tree pairs are labeled "no relation".
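
Pairing and labeling can be sketched with `itertools.combinations`, which naturally yields each unordered pair once (no reversed pairs). The field names (`tree_id`, `component`, `group`, `root_is_person`) are hypothetical, and dropping a pair when either root is an individual is one reading of the rule.

```python
from itertools import combinations


def build_pairs(trees):
    """Pair trees inside each connected component and attach the Y label."""
    by_component = {}
    for tree in trees:
        by_component.setdefault(tree["component"], []).append(tree)

    samples = []
    for members in by_component.values():
        members = sorted(members, key=lambda t: t["tree_id"])
        for a, b in combinations(members, 2):
            if a["root_is_person"] or b["root_is_person"]:
                continue  # pairs with an individual as root are removed
            same_group = a["group"] is not None and a["group"] == b["group"]
            samples.append((a["tree_id"], b["tree_id"], int(same_group)))
    return samples


trees = [
    {"tree_id": "T1", "component": 1, "group": "G1", "root_is_person": False},
    {"tree_id": "T2", "component": 1, "group": "G1", "root_is_person": False},
    {"tree_id": "T3", "component": 1, "group": None, "root_is_person": False},
    {"tree_id": "T4", "component": 1, "group": None, "root_is_person": True},
    {"tree_id": "T5", "component": 2, "group": "G1", "root_is_person": False},
]
samples = build_pairs(trees)
```

Label 1 stands for "implicit relation" and 0 for "no relation"; `T5` is never paired because it sits in a different connected component.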

3.3 Feature variables

3.3.1 Feature variable overview

On the assumption that information is useful, the design of the feature variables ultimately determines the upper limit of model performance. Feature design therefore requires a good understanding of the risk business and aims to describe the client's credit risk comprehensively, from multiple angles, dimensions and channels.

Based on the confirmed implicit association relation exploration scheme, and in order to explain the degree of association between credit equity trees more comprehensively, a feature variable system with two major dimensions is adopted: equity tree pairs and equity tree root node pairs. The graph structure information is further enriched by constructing two kinds of index, graph indexes and graph pattern indexes, which improves model accuracy. There are 71 feature variables for equity tree pairs and 35 for equity tree root node pairs.

3.3.2 Equity tree pair feature variables

The feature variables are constructed with graph algorithms from the topology of the equity trees on the behavior graph. By variable type they fall into four classes: graph indexes within an equity tree, graph indexes between equity trees, fund transactions between equity trees, and equity tree pair graph pattern indexes.

3.3.2.1 Graph indexes within an equity tree

Based on the topology of the nodes of an equity penetration tree on the behavior graph, graph feature indexes are constructed that reflect the structural characteristics of the nodes themselves. They cover three major groups: centrality features, structural features and path features. The specific table is as follows:

Table-1: Graph indexes within an equity tree

3.3.2.2 Graph indexes between equity trees

Based on the topology of the nodes of the equity penetration trees on the behavior graph, graph feature indexes are constructed that reflect the differences between the nodes of the two trees. They cover four major groups: centrality features, structural features, path features and neighborhood features. The specific table is as follows:

Table-2: Graph indexes between equity trees

3.3.2.3 Fund transactions between equity trees

These variables are derived from fund flow information and focus on the fund transaction behavior between the two trees of an equity tree pair and on its stability. The specific table is as follows:

Table-3: Fund transactions between equity trees

3.3.2.4 Equity tree pair graph pattern indexes

Based on the topology of the equity penetration trees on the behavior graph, indexes are constructed that reflect the patterns of the edge relations between equity trees, mainly covering the 7 classes of relations and their compound relations. The specific table is as follows:

Table-4: Equity tree pair graph pattern indexes

3.3.3 Equity tree root node pair feature variables

The equity tree root node pair feature variables mainly reflect the relation between the two root nodes of an equity tree pair. They are constructed with graph algorithms from the topology of the equity tree root nodes on the behavior graph and consist mainly of graph indexes and graph pattern indexes.

3.3.3.1 Equity tree root node pair graph indexes

The graph feature indexes are calculated from the topology between equity tree root nodes and cover three major groups: node adjacency features, features of the node itself, and node path features. The specific table is as follows:

Table-5: Equity tree root node pair graph indexes

3.3.3.2 Equity tree root node pair graph pattern indexes

The graph pattern feature indexes are calculated from the path patterns between equity tree root nodes and mainly cover the 7 major classes of relations and their compound relations. The specific table is as follows:

Table-6: Equity tree root node pair graph pattern indexes

3.4 Model construction

3.4.1 Two-stage model framework overview

In the development samples of the implicit association model, the negative samples are not accurately identified: positive samples that do have an implicit association are mixed in among them. To safeguard the results of the classification algorithm, the invention adopts a two-stage model method so that positive and negative samples in the development samples are identified accurately. The scheme, which also takes the sample imbalance problem into account (the positive sample ratio is 0.00134%), is shown in fig. 10.

The two-stage model scheme is implemented as follows:

3.4.1.1 Stage-one model development

Because the negative samples are not accurately identified and are mixed with positive samples that do have an implicit association, the invention obtains true negative samples through stage-one model development, as follows:

Data sampling: 10 groups of samples are drawn by sampling without replacement, with a positive-to-negative ratio of 1:20.

Model training: each group in turn is used for training and the other 9 groups for validation, giving 10 models in total together with the average AUC of each on its validation data.

Model selection: of the 10 models, the one with the highest average AUC on the validation data is selected as the stage-one model.
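
The three steps above can be sketched as follows. Several details are assumptions rather than statements from the text: every group is taken to contain all positives, the negatives of the 10 groups are drawn as disjoint slices of one sample without replacement, and `train_and_score` is a hypothetical callback that fits a model on one group and returns its AUC on each of the other groups.

```python
import random


def sample_groups(positives, negatives, ratio, n_groups=10, seed=0):
    """Build n_groups sample groups with a positive:negative ratio of 1:ratio."""
    per_group = len(positives) * ratio
    need = per_group * n_groups
    if need > len(negatives):
        raise ValueError("not enough negatives for disjoint groups")
    drawn = random.Random(seed).sample(negatives, need)  # without replacement
    return [
        {"pos": list(positives), "neg": drawn[i * per_group:(i + 1) * per_group]}
        for i in range(n_groups)
    ]


def pick_best(train_and_score, groups):
    """Train on each group in turn, validate on the rest, keep the best mean AUC."""
    best = None
    for i, train in enumerate(groups):
        others = groups[:i] + groups[i + 1:]
        model, aucs = train_and_score(train, others)
        mean_auc = sum(aucs) / len(aucs)
        if best is None or mean_auc > best[1]:
            best = (model, mean_auc, i)
    return best


positives = ["p0", "p1", "p2"]
negatives = [f"n{i}" for i in range(1000)]
groups = sample_groups(positives, negatives, ratio=20)
```

With the stage-two ratio of 1:100 the same helper would be called with `ratio=100`.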

3.4.1.2 Stage-two model development

Using the stage-one model, clean negative sample data is obtained from the model's predictions and then combined with the full set of positive samples for stage-two training, which yields the final model.

Data sampling: 10 groups of samples are drawn by sampling without replacement, with a positive-to-negative ratio of 1:100.

Threshold determination: the stage-one model scores the 10 groups of samples; for each group the threshold at the bottom 20% of predicted probability is obtained, and the 10 thresholds are averaged.

Training samples: using the average threshold, the samples whose predicted probability is below it are selected from each of the 10 scored groups and combined with the full set of positive samples, giving 10 groups of samples for training the stage-two model.

Model training: each group in turn is used for training and the other 9 groups as validation samples to train the stage-two model, giving 10 models together with the average AUC of each on its validation data.

Model selection: of the 10 models, the one with the highest average AUC on the validation data is selected as the final model.
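
The threshold and sample-selection steps can be sketched as below. The sketch assumes that "the last 20%" means the lowest fifth of predicted probabilities and uses a simple nearest-rank quantile; the function names and scores are hypothetical.

```python
def quantile(values, q):
    """Nearest-rank value below which roughly a fraction q of `values` falls."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, int(q * len(ordered)) - 1))
    return ordered[idx]


def reliable_negatives(scored_groups, q=0.2):
    """scored_groups: one list of (sample_id, stage-one probability) per group."""
    thresholds = [quantile([p for _, p in group], q) for group in scored_groups]
    average = sum(thresholds) / len(thresholds)
    # Keep only negatives the stage-one model scores below the average threshold.
    kept = [[sid for sid, p in group if p < average] for group in scored_groups]
    return average, kept


probs_a = [0.02, 0.04, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90]
probs_b = [0.01, 0.08, 0.09, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.95]
scored = [
    [(f"a{i}", p) for i, p in enumerate(probs_a)],
    [(f"b{i}", p) for i, p in enumerate(probs_b)],
]
average, kept = reliable_negatives(scored)
```

Each list in `kept` would then be merged with the full positive set to form one stage-two training group.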

3.4.2 LightGBM algorithm overview

3.4.2.1 Algorithm description

LightGBM, introduced by Microsoft in 2016, is a newer member of the boosting family and an efficient implementation of gradient boosted trees. In principle it is similar to XGBoost: the negative gradient of the loss function is used as an approximation of the residual of the current decision tree, and a new decision tree is fitted to it. Compared with XGBoost, it trains faster, uses less memory and supports categorical features directly. It has certain advantages in industrial application, as follows:

It places relatively few demands on preprocessing of the feature data and is insensitive to its inputs: it tolerates outliers, handles missing values automatically, and requires neither correlation treatment nor normalization of the feature variables. Data can be fed into the model directly and a baseline result obtained quickly.

It is highly industrialized and widely used in industry. The main reasons are that the underlying language is C++, which is efficient; that parallel computing is supported; and that, for larger data volumes, data compression and partitioning are used to improve algorithm efficiency as far as possible.

It includes several measures against overfitting, including regularization terms, a shrinkage factor and row sampling.

It is highly flexible: users can customize the optimization objective and the evaluation criterion.

LightGBM has optimized support for categorical features, which can be input directly without additional preprocessing.

3.4.2.2 Model development

Because LightGBM is a gradient boosting decision tree (GBDT) method, model tuning consists mainly of determining the training objective, the key model parameters and the training strategy. The key process of model development is shown in fig. 4:

Evaluation objective setting. In line with the business objective of the implicit association relation mining model, the project adopts AUC and binary_logloss as training objectives.

Key parameter setting. There are three main types: 1. General parameters: some can be fixed and usually need no adjustment, while others need detailed optimization. 2. Booster parameters: the parameters of the weak learner, which are the main ones to adjust. 3. Learning parameters: these control the model learning process.

Model training strategy. Given the key settings above, model development mainly selects the parameters with the greatest influence on the model and applies a sound optimization strategy. The invention uses cross-validation to find the optimal hyperparameters for training the LightGBM model.

3.4.2.3 Parameter training

Parameter tuning is something of an art with no fixed standard, and different modelers have different tuning habits; the general principle is coarse adjustment first, then fine adjustment. Based on IBM project experience, the following scheme was adopted for the goals of this project.

1. The following parameters are determined from the business scenario: boosting_type, objective, metric, early_stopping_rounds.

2. Setting n_estimators and learning_rate. Because learning_rate and n_estimators are strongly correlated, learning_rate can be set according to the available computing resources: 0.1 if resources are not particularly ample, otherwise a smaller value. In addition, a large n_estimators is set in combination with early stopping so that training is carried through fully.

3. Keeping n_estimators and learning_rate unchanged, the following parameters are adjusted in sequence:

1) num_leaves, which determines the complexity of the tree and is a key parameter;

2) max_depth and min_child_samples, which determine the complexity of the tree and are the most important parameters;

3) subsample and colsample_bytree, which sample the rows and features and are important parameters for preventing overfitting;

4) lambda and alpha, the regularization parameters, which prevent overfitting although their effect may not be significant.

4. If the initial learning_rate was relatively large, reduce the learning rate and increase n_estimators; training takes longer, and the optimal learning_rate and n_estimators values are obtained, which completes the training.
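
The coarse-to-fine order of the steps above can be sketched as a stage-wise grid search. Parameter names follow the spelling used by LightGBM's scikit-learn wrapper (with `reg_lambda` and `reg_alpha` standing for lambda and alpha); every value and grid below is a placeholder rather than a setting from the patent, `boosting_type="gbdt"` is an assumed choice, and `score` is a hypothetical callback that would return a cross-validated AUC. A toy scoring function is used so that the sketch runs without LightGBM installed.

```python
from itertools import product

# Steps 1 and 2: fixed by the business scenario plus a coarse learning rate.
base_params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": ["auc", "binary_logloss"],
    "learning_rate": 0.1,
    "n_estimators": 2000,  # large on purpose; early stopping trims it
}

# Step 3: one grid per tuning stage, in the order given in the text.
stages = [
    {"num_leaves": [15, 31, 63]},
    {"max_depth": [4, 6, 8], "min_child_samples": [20, 50, 100]},
    {"subsample": [0.7, 0.85, 1.0], "colsample_bytree": [0.7, 0.85, 1.0]},
    {"reg_lambda": [0.0, 1.0, 10.0], "reg_alpha": [0.0, 0.1, 1.0]},
]


def tune_in_stages(base, stages, score):
    """Grid-search one stage at a time, carrying the best values forward."""
    best = dict(base)
    for grid in stages:
        names = list(grid)
        candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]
        winner = max(candidates, key=lambda c: score({**best, **c}))
        best.update(winner)
    return best


def toy_score(params):
    # Stand-in for cross-validated AUC: prefers num_leaves=31 and max_depth=6.
    return -abs(params.get("num_leaves", 31) - 31) - abs(params.get("max_depth", 6) - 6)


tuned = tune_in_stages(base_params, stages, toy_score)
```

Step 4 would follow by lowering `learning_rate` in `tuned`, raising `n_estimators`, and retraining.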

Table-7: introduction to model parameters

3.4.3 Model development results

3.4.3.1 Model hyperparameter setting

Following the parameter training method in the previous section, the GridSearch method was used to tune the parameters of the implicit association relation mining model. The main tuned parameters and the tuning results are shown in the following table:

Table-8: Model parameter setting

3.4.3.2 Feature importance

According to the model feature importance results, 99 of the 106 feature variables entered the model and 7 did not. The variables not selected are 2 equity tree pair graph pattern variables and 5 root node pair graph pattern variables. The distribution of the 99 selected feature variables, split into the top 50 and those ranked after 50, is shown in the following tables:

Table-9: Distribution of the top 50 features by importance

Table-10: Distribution of the features ranked after 50 by importance

No.	Feature variable dimension (ranks 51-99)	Quantity (number designed)
1	Equity tree pair - graph index	14 (42)
2	Equity tree pair - graph pattern	8 (17)
3	Equity tree pair - non-graph index	12 (12)
4	Root node pair - graph index	6 (18)
5	Root node pair - graph pattern	9 (17)

To further evaluate the model results, the distributions of the ten index variables with the highest feature importance (see Tables 1 to 6 above) were analyzed. The ten variables are: TG40, number of common neighbors; TG39, shortest distance between equity trees; TP03, proportion of enterprise legal-person relations between equity trees; TG41, Jaccard similarity index; G17, root node pair Katz distance; G12, root-node-to-origin degree centrality ratio; TP09 and TP01, proportions of fund relations between equity trees; TG42, resource allocation index; and G07, root-node-to-origin closeness centrality. For all ten, the statistics of positive and negative samples differ markedly.
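
Three of the neighborhood indexes named above (common neighbors, Jaccard similarity and resource allocation) have standard link-prediction definitions, sketched here for two single nodes on an undirected graph. How the patent lifts them from node pairs to equity tree pairs is not stated in the text, so this is only an illustration; the helper names and toy edges are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def neighbor_indexes(adj, u, v):
    """Return (common neighbor count, Jaccard similarity, resource allocation index)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    jaccard = len(common) / len(union) if union else 0.0
    # Each shared neighbor contributes the inverse of its own degree.
    resource_allocation = sum(1.0 / len(adj[w]) for w in common)
    return len(common), jaccard, resource_allocation


edges = [("A", "X"), ("B", "X"), ("A", "Y"), ("B", "Y"), ("Y", "Z"), ("A", "W")]
adj = build_adjacency(edges)
common, jaccard, ra = neighbor_indexes(adj, "A", "B")
```

In the toy graph, A and B share X and Y; Y has the higher degree and so contributes less to the resource allocation index than X.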

3.4.3.3 Model evaluation results

3.4.3.3.1 AUC evaluation results

The implicit association relation mining model is evaluated by average AUC. As a rule, the AUC difference between the development and validation samples should not exceed 10%. The invention also calculates the average AUC on the full 20191231 sample to help judge the model's performance. The final evaluation results are shown in the following table:

Table-11: model evaluation results

Taking the evaluation metrics of all the data sets in the table together, the AUC of the implicit association relation mining model reaches 0.85 on the development set, the validation set and the full prediction set. The differences between the validation set and the development set, and between the full prediction set and the development set, are also below 10%, so the model performs well.
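
For reference, AUC can be computed directly from its ranking interpretation, and the 10% check can be written as a relative gap. Whether the text means a relative or an absolute difference is not stated, so the relative form is an assumption; the pairwise loop is quadratic and only suitable for small illustrations.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A win counts 1, a tie counts one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_gap(dev_auc, other_auc):
    """AUC difference expressed as a share of the development AUC."""
    return abs(dev_auc - other_auc) / dev_auc


labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auc = auc(labels, scores)
gap_ok = relative_gap(0.85, 0.80) <= 0.10
```

The values 0.85 and 0.80 in the gap check are illustrative inputs, not results from Table-11.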

3.4.3.3.2 PSI evaluation results

The full sample of the implicit association relation mining model is about 200 million, of which positive samples account for only 0.00264%, so the sample is extremely unbalanced. To test how well the model results generalize to the full sample and to safeguard the model's application, the invention further evaluates model stability. Stability is verified mainly by calculating the PSI index between the development samples and the full prediction samples. PSI measures the degree of change between two populations: in general, a PSI of no more than 0.1 indicates that the model's predictions are stable, while a PSI above 0.25 indicates poor stability whose causes need to be analyzed.

The PSI of the implicit association relation mining model between the development sample and the full sample is 0.003347, which satisfies the stability requirement for the model's predictions and shows that the development sample was drawn without bias. The specific results are shown in the following table.
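
PSI is conventionally computed over shared score bins as the sum of (actual share - expected share) × ln(actual share / expected share). A minimal sketch follows; the bin edges, the small floor for empty bins and the toy scores are assumptions, since the text does not describe the binning used.

```python
import math


def psi(expected, actual, edges):
    """Population stability index between two score samples over shared bins."""

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of the bin holding v
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    exp_share, act_share = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_share, act_share))


edges = [0.25, 0.5, 0.75]
development = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
full_sample = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95]
same = psi(development, development, edges)
shifted = psi(development, full_sample, edges)
```

Identical distributions give a PSI of zero, and the value grows as the two score distributions drift apart.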

Table-12: PSI evaluation results

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A method for mining implicit association relations between enterprises based on graph features, characterized by comprising the following steps:

step 5: training the LightGBM algorithm model by utilizing the final data to obtain a trained LightGBM algorithm model, namely a hidden association relation mining model, and mining the hidden association relation among enterprises by utilizing the trained LightGBM algorithm model;

wherein, step 2 comprises the following sub-steps:

step 207: after the group is completed, obtaining the corresponding data information of the root node of the share right tree pair;

and in step 3, a feature variable system with the two dimensions of equity tree pairs and equity tree root node pairs is adopted, wherein the feature variables comprise equity tree pair feature variables and equity tree root node pair feature variables; the equity tree pair feature variables comprise graph indexes within equity trees, graph indexes between equity trees, fund transactions between equity trees and equity tree pair graph pattern indexes; and the equity tree root node pair feature variables comprise equity tree root node pair graph indexes and equity tree root node pair graph pattern indexes.

2. The method for mining implicit association between enterprises based on graph features according to claim 1, wherein step 1 specifically comprises: performing preliminary division and identification on the existing data in a database according to the centrality, in-degree, out-degree, PageRank and closeness centrality index results to obtain super nodes, and performing a rejection operation to generate an association relation graph, wherein the database is a TigerGraph database.

3. The method for mining implicit association between enterprises based on graph features according to claim 1, wherein the mining method further comprises step 6: performing positive and negative sample data balance processing on the development samples by adopting a two-stage PU-Learning modeling method, and performing model evaluation on the implicit association relation mining model by using the development samples subjected to the positive and negative sample data balance processing.

4. The method for mining implicit association between enterprises based on graph features according to claim 1, wherein the step 5 specifically comprises: setting final data as model-in data of the LightGBM algorithm model, setting an evaluation target, training key super parameters and a model training strategy, finally obtaining model super parameters and model results, namely, correspondingly obtaining the trained LightGBM algorithm model, namely, a hidden association relation mining model, and mining the hidden association relation among enterprises by utilizing the model super parameters and model results.

5. The method for mining implicit association between enterprises based on graph features as claimed in claim 4, wherein the evaluation objective adopts AUC and binary_logloss, and the key hyperparameters include general parameters, booster parameters and model learning parameters.

6. The method for mining implicit association between enterprises based on graph features as claimed in claim 4, wherein the model training strategy adopts hold-out evaluation or K-fold cross-validation.

7. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of a graph-feature-based inter-enterprise implicit association relation mining method according to any one of claims 1 to 6 when the computer program is executed.

8. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of a graph feature-based inter-enterprise implicit association relation mining method according to any one of claims 1 to 6.