CN112417176A

CN112417176A - Graph feature-based method, device and medium for mining implicit association relation between enterprises

Info

Publication number: CN112417176A
Application number: CN202011430159.9A
Authority: CN
Inventors: 仇钧; 姚利虎; 韩静; 李志刚
Original assignee: Bank of Communications Co Ltd
Current assignee: Bank of Communications Co Ltd
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2021-02-26
Anticipated expiration: 2040-12-09
Also published as: CN112417176B

Abstract

The invention relates to a graph-feature-based method, device and medium for mining implicit associations between enterprises. The method comprises the following steps. Step 1: identify and remove super nodes from the existing data in a database to generate an association graph. Step 2: extract and generate equity trees, then process all equity trees to obtain equity tree pairs and the corresponding data of their root nodes. Step 3: construct a feature variable system along two dimensions, equity tree pairs and equity tree root nodes. Step 4: perform index aggregation and model wide-table integration on all equity tree indexes in the feature variable system to obtain the final data for model training. Step 5: train a LightGBM model on the final data and use the trained model to mine implicit associations between enterprises in actual data. The invention effectively improves insight into the relationships among enterprise customers and provides a strong reference for risk-related decisions.

Description

Graph feature-based method, device and medium for mining implicit association relation between enterprises

Technical Field

The invention relates to the technical field of financial technology, in particular to a graph-feature-based method, device and medium for mining implicit associations between enterprises.

Background

China's economy is currently in a period of transformation, deep adjustment and cyclical downshifting, and the tensions within economic development are increasingly prominent. Under these conditions, group enterprises enjoy certain advantages from banding together for mutual support, but they also share one another's losses. Compared with a single enterprise, the relationships within an enterprise group are more complicated: internal organizational structures are intertwined under cross-regional, cross-industry and diversified operating models, and mutual guarantees and multiple loan applications among related enterprises create serious information asymmetry between banks and enterprises. In pursuit of scale, banks compete to extend credit to enterprise groups, especially listed ones, so that the credit lines a group obtains from banks far exceed the maximum debt level it can bear. A credit risk at a single enterprise within the group can easily trigger a "domino effect", forming a chain reaction and even systemic risk. On the surface, the credit that a single bank grants to an individual subsidiary may be reasonable, but the total credit that multiple banks grant to the whole group is not necessarily so. In recent years, cases in which banks suffered huge losses because group enterprises went bankrupt have been common.

Article 216 of the Company Law: an associative relationship refers to the relationship between a company's shareholders, actual controllers, directors, supervisors and senior managers and the enterprises they directly or indirectly control, as well as other relationships that may lead to a transfer of the company's interests. However, state-controlled enterprises are not deemed associated merely because they are all controlled by the state.

It can be seen that the associative relationship in the legal sense emphasizes "control": it stresses the relationship between a company's shareholders, actual controllers, directors, supervisors and senior managers and the enterprises they directly or indirectly control, that is, control of different enterprises by the same controller.

However, neither the regulatory rules on group affiliation nor the Company Law gives a specific, explicit definition of implicit associated parties. Relevant regulatory documents state that an implicit association is a form of association between enterprises that shows no associative relationship on the surface but in fact conceals an investment relationship, or involves control or influence over business decisions, fund scheduling, or production and operation.

Based on the regulator's definition of implicit association and the scenarios in which a machine learning model is applicable, the invention defines the implicit association used for modeling as an "implicit control relationship", which has the following characteristics:

1. the bank cannot identify the control relationship from the enterprise's public equity data;

2. individual or joint control over an enterprise is achieved through an agreement with a third party.

One prior-art approach to implicit control relations is as follows. Each business unit of a bank manages credit for an enterprise group according to factors such as the group headquarters' control over its member units, the group's business and financial characteristics, the availability of consolidated statements, and how closely the group cooperates with the bank. Enterprise groups are divided into types such as "total to total", "top to bottom" and "bottom to top" for approving credit schemes, and quantitative calculations are then used to decide whether to add a new enterprise to a credit group tree.

This approach relies mainly on bank staff's credit investigation reports on the enterprise, which cover dimensions such as public business registration information, upstream and downstream supply chain relationships, local trade background and the enterprise's financial statements; whether the enterprise has a potential association with the bank's existing credit clients is then judged from manual experience. On one hand, this consumes a great deal of labor. On the other hand, a single bank can hardly obtain the real operating conditions of an enterprise group across multiple regions, banks and subsidiaries, so key associations are easily missed or misjudged, and enterprises that should be included in a credit group are left out.

A second prior-art approach uses commercial software that has become widely used in recent years, such as Tianyancha and similar enterprise-lookup tools, to perform equity penetration on an enterprise. Using a commercial graph database and public business registration data, it penetrates upward to shareholder companies and downward to subsidiaries, and visually displays all of the enterprise's shareholders down to the natural-person and legal-person levels.

The equity penetration technique in the above approach has three main disadvantages:

1) the types of association used are limited, mainly equity and job relationships, while other relationship types such as funds, trade, guarantee and mortgage are not fully utilized;

2) as for equity and job relationships, enterprises may choose not to disclose equity information, and a complex equity structure (such as overseas-registered companies) can bypass the domestic business registration process, thereby hiding the real control relationship;

3) existing equity penetration technology does not make full use of the graph features and graph patterns of the enterprise association network, and does not mine potential associations between enterprises through latent-space features.

At present, discovering implicit associations between enterprises mostly depends on manual review and investigation by experienced credit examiners, as in the first approach, which is time-consuming, labor-intensive and cannot be updated in time. Although equity visualization tools of the second kind have appeared on the market, their data sources are too narrow and they do not fully utilize the graph features and graph patterns of the association network, so implicit relationships are insufficiently mined.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provides a graph feature-based mining method, equipment and medium for implicit association between enterprises.

The purpose of the invention can be realized by the following technical scheme:

a graph feature-based mining method for implicit association relation between enterprises comprises the following steps:

step 1: performing super node identification and elimination operation on existing data in a database to generate an association relation graph;

step 2: extracting and generating equity trees based on the existing data and the association graph, and further processing all equity trees to obtain equity tree pairs and the corresponding data of the root nodes of the equity tree pairs;

step 3: based on the association graph, the equity tree pairs and the corresponding data of their root nodes, and in combination with the Y variable identification rule, constructing a feature variable system along two dimensions, equity tree pairs and equity tree root nodes;

step 4: performing index aggregation and model wide-table integration on all equity tree indexes in the two-dimensional feature variable system to obtain the final data for model training;

step 5: training a LightGBM model with the final data to obtain a trained LightGBM model, namely the implicit association mining model, and using this model to mine implicit associations between enterprises in actual data.

Further, step 1 specifically includes: performing a preliminary division and identification of the existing data in the database according to the degree centrality, in-degree, out-degree, PageRank and closeness centrality index results to obtain super nodes, and generating the association graph after removing them, wherein the database is a TigerGraph database.

Further, the step 2 comprises the following sub-steps:

step 201: extracting from the existing data the full set of legal-person clients holding credit accounts at the end-of-period time point, and generating equity trees from the data of these credit clients through the equity penetration rule;

step 202: integrating and de-duplicating the data corresponding to the equity trees, and defining the levels of each equity tree;

step 203: computing connected components on the association graph from step 1, and generating a connected component number for each node;

step 204: removing, from the data corresponding to the equity trees, non-root nodes on the same level as the root node, super node data, and nodes whose connected component number is empty;

step 205: on the basis of step 204, removing equity trees that contain only isolated nodes, equity trees in which every node belongs to a different connected component, and equity trees that are isolated from all other trees;

step 206: based on the equity tree data and connected component numbers produced in steps 202 to 205, generating pairwise combinations of controlling-equity trees within the same connected component number; trees whose root node is an individual and trees with different connected component numbers are excluded and not paired;

step 207: after pairing is complete, obtaining the equity tree pairs and the corresponding data of their root nodes.
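The pairing logic of steps 203 to 206 can be illustrated with a minimal sketch. It assumes that a tree "belongs" to a connected component whenever any of its nodes carries that component number, which is one possible reading of the text; the function names `connected_component_ids` and `pair_trees` and the sample data are hypothetical and not part of the patent.

```python
from collections import defaultdict
from itertools import combinations


def connected_component_ids(nodes, edges):
    """Label each node with a connected component number using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    labels = {}
    return {n: labels.setdefault(find(n), len(labels) + 1) for n in nodes}


def pair_trees(tree_members, component_of):
    """Pair equity trees (keyed by root) that share a connected component number."""
    trees_in_component = defaultdict(set)
    for root, members in tree_members.items():
        for node in members:
            comp = component_of.get(node)
            if comp is not None:  # nodes with an empty component number are skipped
                trees_in_component[comp].add(root)
    pairs = set()
    for roots in trees_in_component.values():
        pairs.update(combinations(sorted(roots), 2))
    return sorted(pairs)


# Illustrative data: equity edges are not part of the association graph,
# so only the (hypothetical) fund edge A1 -> B1 links two trees.
nodes = ["A", "A1", "B", "B1", "C", "C1"]
association_edges = [("A1", "B1")]
trees = {"A": ["A", "A1"], "B": ["B", "B1"], "C": ["C", "C1"]}

components = connected_component_ids(nodes, association_edges)
tree_pairs = pair_trees(trees, components)
print(tree_pairs)
```

Tree C has no association edge to any other tree, so it is left unpaired, which mirrors the removal of isolated trees in step 205.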

Further, the two-dimensional feature variable system used in step 3 comprises equity tree pair feature variables and equity tree root node pair feature variables, wherein the equity tree pair feature variables comprise intra-tree graph indexes, inter-tree fund transactions and equity tree pair graph pattern indexes, and the equity tree root node pair feature variables comprise root node pair graph indexes and root node pair graph pattern indexes.

Further, the mining method further comprises step 6: balancing positive and negative samples in the development sample using a two-stage PU-Learning modeling method, and evaluating the implicit association mining model on the balanced development sample.

Further, step 5 specifically includes: using the final data as the input to the LightGBM model; setting the evaluation target, the key hyperparameters and the model training strategy; and finally obtaining the model hyperparameters and model results, which constitute the trained LightGBM model, namely the implicit association mining model, which is then used to mine implicit associations between enterprises.

Furthermore, the evaluation targets are AUC and binary_logloss, and the key hyperparameters comprise general parameters, booster parameters and learning parameters.

Furthermore, the model training strategy uses hold-out evaluation or K-fold cross-validation.
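As a rough illustration of the evaluation targets and training strategy named above, the sketch below shows a parameter dictionary and a K-fold index splitter. The dictionary keys are standard LightGBM parameter names, but the values are placeholders and not the tuned values of the invention; `k_fold_indices` is a hypothetical helper written with the standard library only.

```python
import random

# Placeholder settings: keys are LightGBM parameter names, values are illustrative.
params = {
    "objective": "binary",                 # binary classification of tree pairs
    "boosting_type": "gbdt",               # gradient boosting decision trees
    "metric": ["auc", "binary_logloss"],   # the two evaluation targets
    "learning_rate": 0.05,
    "num_leaves": 31,
}


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) index lists for K-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


splits = list(k_fold_indices(10, k=5))
```

Hold-out evaluation corresponds to using a single such split; K-fold validation trains once per fold and averages the metrics.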

The invention also provides a terminal device, which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor realizes the steps of the mining method of the implicit relationship between enterprises based on the graph characteristics when executing the computer program.

The invention also provides a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to implement the steps of the graph feature-based mining method for the implicit relationship between enterprises.

Compared with the prior art, the invention has the following advantages:

(1) the invention describes a method for mining implicit association relations among enterprises by utilizing a machine learning model based on graph characteristics, which realizes implicit relation identification, improves the accuracy, coverage rate and timeliness of prediction, and facilitates better group credit management of a bank to customers, thereby enhancing the group risk management capability.

(2) The invention provides a graph-feature-based model for mining implicit associations between enterprises. The model builds on big data processing, graph database, knowledge graph and machine learning technologies, uses multiple data sources from inside and outside the bank, and effectively combines graph analysis with artificial intelligence to mine implicit associations among enterprise customers in depth. It effectively improves insight into the relationships among enterprise customers and provides a strong reference for decisions such as risk management and risk prevention and control.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a list of equity tree pair features in the method of the present invention;

FIG. 3 is a characteristic list of root node pairs of the equity tree in the method of the present invention;

FIG. 4 is a schematic diagram of a LightGBM algorithm modeling process in the method of the present invention;

FIG. 5 is a schematic view of a characteristic variable process flow in an embodiment of the method of the present invention;

FIG. 6 is a diagram of the partition result of the index of the super node in the embodiment of the method of the present invention;

FIG. 7 shows the connected component statistics of the behavior graph in an embodiment of the method of the present invention, where FIG. 7(a) shows statistics of the number of nodes in connected components, and FIG. 7(b) shows the number of nodes per connected component against the number of connected components;

FIG. 8 is a schematic diagram of the relationship types of the behavior graph in a method embodiment of the invention;

FIG. 9 is a diagram illustrating the process of generating an equity tree in an embodiment of the method of the present invention;

FIG. 10 is a schematic diagram of a two-stage model scenario in a method embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

Firstly, the related abbreviations and key terms of the technical scheme of the present invention are defined as follows:

Implicit association: a form of association that shows no associative relationship on the surface but in fact conceals an investment relationship, or involves control or influence over business decisions, fund scheduling, or production and operation. Source: Notice of the Office of the Zhejiang Banking Regulatory Bureau on Risk Warnings Regarding Implicit Associations of Bank Credit Customers (2012).

Credit group tree (or group relationship): when a bank extends credit, it may apply group credit management to larger enterprises. The members of the tree have been granted unified group credit, so it is confirmed that they belong to the group.

Equity penetration tree: a tree-shaped equity structure chart that describes the hierarchy of a company's equity by combining business-registration equity relationships, senior management employment relationships and group credit relationships. Equity penetration trees include both group-credit trees and non-group-credit trees.

Knowledge graph: a large-scale semantic network in which entities or concepts are nodes connected by semantic relations.

By discovering associations between entities and integrating semi-structured and unstructured data, a knowledge graph helps a machine understand data, explain phenomena and perform knowledge reasoning, thereby uncovering deep relationships and enabling intelligent search and intelligent interaction.

Graph database: a non-relational database that applies graph theory to store entity information and the relationships between entities; mainstream products include TigerGraph and Neo4j.

Supervised machine learning: a training set is constructed from feature variables defined on observation-window data and target variables formed from credit performance at a specific time; a machine learning algorithm is then trained on this set to produce a classification model, which is finally applied to predict the credit performance of clients.

Time window: according to the requirements of the modeling period, historical data are cut along the time dimension into several data sets that supply material for model training. The observation point is set according to when the business actually needs the model, namely when the implicit association mining model is to be used to predict implicit associations between enterprises, typically at the end of a quarter or half-year. A fixed period before the observation point is selected as the observation period, from which the feature variables (X variables) of the training set are constructed. A fixed period after the observation point is selected as the performance period, in which the behavior of the customer sample after the observation point is collected to construct the target variable (Y variable) of the training set.
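The time-window split can be sketched as follows. This is only an illustration: the function name `split_window`, the event records and the period lengths (90 and 92 days) are assumptions chosen for the example, with the observation point set to one of the time points used in the embodiment.

```python
from datetime import date, timedelta


def split_window(events, observation_point, observation_days, performance_days):
    """Split dated events into observation-period (X) and performance-period (Y) sets."""
    obs_start = observation_point - timedelta(days=observation_days)
    perf_end = observation_point + timedelta(days=performance_days)
    # Events up to and including the observation point feed the X variables.
    x_events = [e for e in events if obs_start < e["date"] <= observation_point]
    # Events strictly after the observation point feed the Y variable.
    y_events = [e for e in events if observation_point < e["date"] <= perf_end]
    return x_events, y_events


events = [
    {"date": date(2019, 8, 15), "type": "fund_transfer"},
    {"date": date(2019, 11, 20), "type": "group_credit_granted"},
    {"date": date(2020, 5, 1), "type": "fund_transfer"},
]
x_events, y_events = split_window(events, date(2019, 9, 30), 90, 92)
```

Keeping the two periods strictly separated at the observation point prevents information from the performance period leaking into the features.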

Secondly, the complete technical scheme provided by the invention is as follows:

First, the associations between business-registered entities are abstracted into a graph structure and stored in a graph database. Second, with the equity relationship tree as the explicit relationship basis, equity tree pairs are extracted as the research samples. Then, graph features on the fund, trade, guarantee, mortgage and other relationship graphs, together with business features of fund transactions, are used to construct the model's feature variables. Finally, model training, validation and prediction are carried out using a two-stage PU-Learning modeling method with the LightGBM classification algorithm, to predict the probability that an implicit association exists between any two enterprises.

The following describes the modeling steps at each stage in the figure in detail:

1. association graph construction

The association graph of the invention combines internal and external data, comprises 7 relationship types (including job, guarantee, fund, beneficiary, trade, and same address/telephone), and has super nodes removed.

Owing to the economic behavior of enterprises and the characteristics of association edges, "super nodes" such as Alipay appear on the association graph. The super node problem remains a research focus in academia and industry, and no effective general method for identifying and handling it exists. Super nodes, first, reduce the efficiency of graph database applications; second, affect model development and the generation of the prediction customer group; and third, cause abnormal results for some graph indexes. Since such super nodes may affect the implicit association mining scheme, super node identification and handling must be carried out first.

The invention identifies super nodes mainly by qualitative analysis, supplemented by quantitative analysis. The qualitative analysis is based on business scenarios and management experience: it locates nodes that have no application significance in each relationship type and derives node identification rules from them, for example third-party payment platforms in the fund relationship and government agencies in the equity relationship. The quantitative analysis calculates six graph indexes for the nodes on the association graph (out-degree, in-degree, degree centrality, PageRank, betweenness centrality and closeness centrality) to support the qualitative analysis. For example, third-party payment platform nodes in the fund relationship show high in-degree and low out-degree, while government agencies in the equity relationship show low in-degree and high out-degree. The qualitative and quantitative results are combined to determine the super nodes.
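The quantitative side of this screening can be illustrated with a small sketch covering three of the six indexes (in-degree, out-degree and degree centrality). The thresholds `min_in` and `max_out`, the function names and the edge list are hypothetical; the patent's actual division rules are those of FIG. 6, and the final decision also involves qualitative review.

```python
from collections import Counter


def degree_profile(edges):
    """Return {node: (in_degree, out_degree, degree_centrality)} for directed edges."""
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    denom = max(len(nodes) - 1, 1)  # normalise by the maximum possible neighbours
    return {n: (in_deg[n], out_deg[n], (in_deg[n] + out_deg[n]) / denom) for n in nodes}


def super_node_candidates(edges, min_in=3, max_out=0):
    """Flag nodes with high in-degree and low out-degree (illustrative thresholds)."""
    profile = degree_profile(edges)
    return sorted(n for n, (i, o, _) in profile.items() if i >= min_in and o <= max_out)


# Hypothetical fund-flow edges: three enterprises pay into one payment platform.
fund_edges = [("E1", "PAY"), ("E2", "PAY"), ("E3", "PAY"), ("E1", "E2")]
profile = degree_profile(fund_edges)
candidates = super_node_candidates(fund_edges)
print(candidates)
```

Candidates flagged this way would still need the business-side confirmation described above before being removed from the graph.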

2. Customer group sample extraction and Y variable identification

The implicit associations studied by the invention start from the bank's credit clients. First, starting from a credit client, upward and downward penetration along equity and job relationships forms a number of equity relationship trees. Each equity tree can then be classified as a credit group tree or a non-credit group tree according to the bank's group credit relationships. Y variable identification rule: for a controlling-equity tree pair in the model development sample, if the pair is in a group customer relationship it is labeled "implicit relationship"; if it is not, it is labeled "irrelevant".

3. Design of characteristic variables

Combining the characteristics of the target variable Y, a two-dimensional feature variable system is first designed over equity tree pairs and equity tree root node pairs, to comprehensively analyze the degree of association between the controlling-equity trees of credit clients; second, two kinds of indexes, graph features and non-graph features, are used to enrich the graph structure information and improve model accuracy.

As shown in FIG. 2, the features designed for the equity tree pair dimension include graph feature variables and graph pattern feature variables, wherein:

the graph feature variables are based on the topology within each equity tree and the structure between equity trees, and include centrality features, structure features, path features and neighborhood features.

The graph pattern feature variables are based on the path structures, on the behavior graph, between enterprises in different equity penetration trees; indexes reflecting the edge relationships are constructed to form pattern features, which mainly cover the 7 relationship types and their compound relationships.

As shown in FIG. 3, the features designed for the equity tree root node pair dimension include graph feature variables, graph pattern feature variables and non-graph feature variables, wherein:

the graph feature variables are graph indexes calculated from the topology between equity tree root nodes, and include neighborhood features, node-level features and path features.

The graph pattern feature variables describe the relationship patterns between nodes based on the path patterns between equity tree root nodes, and cover the 7 relationship types and their compound relationships.

The non-graph feature variables are based on the fund transactions between equity tree root nodes.

4. Algorithm modeling

The invention adopts the LightGBM algorithm, a newer member of the boosting family and an efficient implementation of gradient boosting decision trees, released by Microsoft in 2016. In principle it is similar to XGBoost; compared with XGBoost, it offers higher training efficiency, lower memory usage and direct support for categorical features. The training process is shown schematically in FIG. 4.

The model uses the AUC value as its discrimination evaluation index.

The ROC (Receiver Operating Characteristic) curve is a composite indicator reflecting sensitivity and specificity as continuous variables. Abscissa: the false positive rate (FPR), the proportion of all negative samples that are predicted positive. Ordinate: the true positive rate (TPR), the proportion of all positive samples that are predicted positive.

AUC is the area under the ROC curve. It equals the probability that, for a randomly chosen positive sample and a randomly chosen negative sample, the classifier's computed score ranks the positive sample ahead of the negative one. AUC = 1 indicates a perfect classifier; 0.5 < AUC < 1 indicates better than random guessing.
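The ranking interpretation of AUC, and the TPR and FPR that define one point on the ROC curve, can be checked with a short sketch. The function names and the four-sample data set are illustrative only; ties between a positive and a negative score are counted as half a correct ranking, which is the usual convention.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_fpr(labels, scores, threshold):
    """Return (TPR, FPR) when samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = auc_by_ranking(labels, scores)        # 3 of the 4 pairs are ranked correctly
point = tpr_fpr(labels, scores, threshold=0.5)
```

Sweeping the threshold over all score values traces out the full ROC curve, whose area matches the pairwise value computed by `auc_by_ranking`.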

3. The specific embodiment is as follows:

The main work of implementing the implicit association mining model is divided into two stages: feature variable processing and algorithm modeling.

The characteristic variable processing is shown in fig. 5, in which:

1) the basic data processing stage comprises super node discovery and association graph construction;

2) equity tree generation and identification corresponds to the extraction and sampling of the model's positive and negative samples;

3) processing of the equity tree/root node basic indexes corresponds to the processing of the several types of feature variables;

4) index aggregation and model wide-table integration are the final data preparation before the data enter model training.
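Item 4) can be pictured with a minimal sketch that rolls node-level indexes up to the tree level and joins them with pair-level features into one wide-table row per equity tree pair. The source does not state which aggregation functions are used, so the choice of max and mean, the column naming, and the helper names `aggregate_tree` and `wide_row` are all assumptions made for illustration.

```python
from statistics import mean


def aggregate_tree(node_metrics, members):
    """Aggregate one node-level index over a tree (max and mean are assumed choices)."""
    values = [node_metrics[n] for n in members if n in node_metrics]
    if not values:
        return {"max": 0.0, "mean": 0.0}
    return {"max": max(values), "mean": mean(values)}


def wide_row(pair, tree_members, node_metrics, pair_features, label=None):
    """Build one wide-table row for an equity tree pair."""
    left, right = pair
    row = {"tree_a": left, "tree_b": right}
    for side, root in (("a", left), ("b", right)):
        for stat, value in aggregate_tree(node_metrics, tree_members[root]).items():
            row[f"pagerank_{stat}_{side}"] = value
    row.update(pair_features)  # pair-level graph and graph-pattern indexes
    row["y"] = label           # target variable, if known
    return row


# Hypothetical inputs for one pair of trees rooted at A and B.
tree_members = {"A": ["A", "A1"], "B": ["B", "B1"]}
pagerank = {"A": 0.25, "A1": 0.75, "B": 0.5, "B1": 0.5}
row = wide_row(("A", "B"), tree_members, pagerank, {"fund_edge_count": 2}, label=1)
```

Each such row has a fixed set of columns, which is what allows the pairs to be stacked into a single table for model training.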

3.1 Association graph construction

The association graph of the implicit model is the basis for subsequent customer group sample processing and for the two kinds of graph feature variables of the implicit model, graph indexes and graph patterns. It is generated as follows:

a. super node culling

Owing to the economic behavior of enterprises and the characteristics of association edges, "super nodes" appear on the association graph. Super nodes, first, reduce the efficiency of graph database applications; second, affect model development and the generation of the prediction customer group; and third, cause abnormal results for some graph indexes. Since super nodes may affect the implicit association mining scheme, super node identification and handling are carried out first.

According to the index results of degree centrality, in-degree, out-degree, PageRank and closeness centrality, the full set of clients at the 20190930 and 20191231 time points, 19,351,560 in total, was analyzed. The specific indexes and division rules are shown in FIG. 6;

of the 55 super node candidates identified, 28 clients were confirmed as super nodes in light of actual enterprise operating conditions, and they were handled by removing them from the vertex and edge files.

To better understand how connected the equity tree nodes are on the behavior graph, connected component analysis is introduced. With the 28 identified super nodes removed and nodes limited to levels 0-5 of the equity trees, connected components were computed on the behavior graph at the 20191231 time point for the equity tree nodes, yielding 119,300 connected components, of which 104,871 contain only one node; an overview of the other 14,429 connected components is shown in FIG. 7(a) and FIG. 7(b);

the above analysis shows that the behavior graph has one oversized connected component, with connected component number 1, composed of 435,107 equity tree nodes. According to the degree centrality, in-degree and out-degree results, statistical analyses were performed on the customer group with connected component number 1 at the 20191231 time point, both across all relationship types combined and by individual relationship type, and a super node list was extracted.

b. Association graph generation

Based on the point and edge data of the 7 major classes and 11 sub-classes of association relations after super node removal, the behavior graphs for the 20190930 and 20191231 time points required by the implicit model are generated in the TigerGraph database. Model customer group sample processing and the calculation of graph indexes and graph pattern indexes are subsequently carried out on these graphs.

The association relation graph is composed of 7 major classes and 11 sub-classes of relations, including company entity, personal entity, job position, guarantee, fund, beneficiary, trade, and same address or telephone, as shown in Fig. 8. The guarantee circle, the fund circle and the trade circle are graph indexes obtained by searching for loops within 10 steps with a circle searching algorithm. The association relation graph does not include the ordinary stock right relation: the stock right relation is an explicit relation and is used to construct the stock right trees from which the Y label samples are extracted, so it is excluded from the model for detecting implicit relations.

The prediction sample set of the model contains about 210 million samples, so the graph database adopted by the invention is TigerGraph. Compared with other graph computing schemes such as Neo4j, JanusGraph and Spark, TigerGraph supports native-graph distributed parallel computing and can achieve higher computing performance.

3.2 Customer group sample extraction and Y variable identification

The customer group samples of the implicit model are processed as follows:

1) Extracting clients with loans. Extract the data of all legal-person clients holding a loan account at the end of the period.

2) Generating the stock right trees. Taking the screened legal persons with loan accounts as starting points, stock right trees are generated through the stock right penetration rule (namely the cross-line E-type tree generation rule). The controlling relationships used during penetration are: controlling shareholders holding more than 50%, and largest shareholders holding 50% or less (inclusive), including tied largest shareholders. The processing rules for the penetration target point are:

if the penetration target point is a government body at any level (such as the state-owned assets commission, the finance department, the education department, the health department and other government institutions at all levels), fall back to the enterprise one layer below to form the controlling stock right tree;

if the penetrating target point is a public institution (such as a school, a hospital, a television station, a newspaper company and the like), the penetrating target point is the public institution to form a stock control right tree;

if the penetration target point is an overseas enterprise (including Hong Kong, Macao and Taiwan), the penetration stops at that overseas enterprise to form the controlling stock right tree;

for group-client stock right trees, the penetration result must ensure that each group has at least two stock right trees; a group with a single stock right tree must be rolled back to the layer at which its structure splits. Non-group trees need not follow this rule.

3) Integrating the stock right trees. Based on the stock right tree data generated in the previous step, stock right trees with the same root node are merged and deduplicated.

4) Defining the stock right tree hierarchy. Based on the stock right tree data generated in the above steps, the tree depth is defined as the three layers below the root node.

5) Generating connected component numbers. The connected components are calculated on the generated association relation graph, producing the connected component number of each node on the behavior graph.

6) Removing nodes inside the stock right trees. Root nodes, non-root nodes on the same layer as the root node, super nodes, and nodes whose connected component number is empty are removed.

7) Removing stock right trees. Based on the above steps, stock right trees that contain only isolated nodes, stock right trees whose nodes each belong to a different connected component, and stock right trees that are isolated from all other trees are removed.

8) Pairing the stock right trees. Based on the generated stock right tree data and the connected component numbers, stock right trees within the same connected component number are combined pairwise (reverse pairs are not generated) to produce controlling stock right tree pairs, and pairs whose root nodes are all individuals are removed. Stock right trees with different connected component numbers are not paired.

9) Generating result data. The result data fields include the tree numbers of the stock right tree pair and, for every node in the stock right trees, the client SID, the client name, whether it is a starting point, whether it is a root node, and the level of the tree to which it belongs.

The corresponding process of the above steps 2) to 9) is shown in fig. 9.

After stock right penetration is performed on about 50,000 loan clients, about 540,000 nodes belonging to about 33,000 stock right trees are obtained, of which 7,358 are group stock right trees and 26,088 are non-group stock right trees. The roughly 33,000 stock right trees are paired pairwise, the pairs whose root nodes are all individuals are removed, and the remaining roughly 250 million pairs form the model customer group samples.

In the generated model customer group samples, stock right tree pairs whose two trees belong to the same group number are labeled "has implicit relation", and all other stock right tree pairs are labeled "no relation".
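The pairing of step 8) and the labeling rule above can be sketched as follows. The record layout (`tree_id`, `component`, `group`, `root_is_person`) is a hypothetical simplification, and the sketch assumes each remaining tree carries a single connected component number, which step 7) is meant to guarantee.

```python
from collections import defaultdict
from itertools import combinations

def pair_and_label(trees):
    """Pair trees within one connected component and attach the Y label."""
    by_component = defaultdict(list)
    for tree in trees:
        by_component[tree["component"]].append(tree)

    samples = []
    for members in by_component.values():
        members.sort(key=lambda t: t["tree_id"])
        # combinations() yields each unordered pair once, so no reverse pairs.
        for a, b in combinations(members, 2):
            if a["root_is_person"] and b["root_is_person"]:
                continue  # drop pairs whose root nodes are all individuals
            same_group = a["group"] is not None and a["group"] == b["group"]
            samples.append((a["tree_id"], b["tree_id"], 1 if same_group else 0))
    return samples

trees = [
    {"tree_id": "T1", "component": 1, "group": "G1", "root_is_person": False},
    {"tree_id": "T2", "component": 1, "group": "G1", "root_is_person": True},
    {"tree_id": "T3", "component": 1, "group": None, "root_is_person": True},
    {"tree_id": "T4", "component": 2, "group": "G1", "root_is_person": False},
]
samples = pair_and_label(trees)
```

T2 and T3 are not paired because both roots are individuals, and T4 is never paired with T1 because it sits in a different connected component, even though they share a group number.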

3.3, characteristic variables

3.3.1 overview of characteristic variables

According to the information usefulness assumption, the design of the characteristic variables ultimately determines the upper limit of model performance. Designing them therefore requires a good understanding of the risk business, and the design should strive to depict the credit risk of a client comprehensively, from multiple angles, dimensions and channels.

Based on the confirmed implicit association relation exploration scheme, and in order to interpret the degree of association between loan clients and stock right trees more comprehensively, a characteristic variable system with two dimensions is designed: stock right tree pairs and stock right tree root node pairs. In addition, two kinds of indexes, graph indexes and graph pattern indexes, are constructed to enrich the graph structure information and improve model accuracy. The stock right tree pair dimension has 71 characteristic variables, and the stock right tree root node dimension has 35 characteristic variables.

3.3.2 stock rights Tree pairs feature variables

These characteristic variables are constructed through graph algorithms, based on the topological structure of the stock right trees on the behavior graph. By variable type they are subdivided into four categories: graph indexes within a stock right tree, graph indexes between stock right trees, fund transactions between stock right trees, and graph pattern indexes of stock right tree pairs.

3.3.2.1 stock right tree inner graph index

Based on the topological structure of the nodes in the stock right penetration tree on the behavior diagram, diagram characteristic indexes reflecting respective structural characteristics of the nodes are constructed, and the diagram characteristic indexes mainly comprise three major indexes of centrality characteristics, structural characteristics and path characteristics. In particular, see the following table:

TABLE-1: stock right tree inner graph index

3.3.2.2 graph index between rights tree

Based on the topological structure of the nodes in the stock right penetration tree on the behavior diagram, diagram characteristic indexes reflecting the difference among the nodes are constructed, and the diagram characteristic indexes mainly comprise four major indexes of centrality characteristics, structure characteristics, path characteristics and neighborhood characteristics. In particular, see the following table:

TABLE-2: equity tree map index

3.3.2.3 trading capital between stock trees

These variables are derived from fund flow information and focus on the fund transaction behavior between the two trees of a stock right tree pair and on its stability. In particular, see the following table:

TABLE-3: trading of funds between equity trees

3.3.2.4 stock right tree map matching mode index

Based on the topological structure of the rights penetration tree on the virtual rights tree behavior diagram, an index reflecting the edge relation mode characteristics of the rights penetration tree is constructed, and the index mainly comprises 7 types of relations and compound relations of the relations. In particular, see the following table:

TABLE-4: stock right tree map matching mode index

3.3.3 stock rights Tree root node pair characteristic variables

The characteristic variables of the root node pairs of the stock right tree mainly reflect the relationship characteristics among the root node pairs of the stock right tree, are constructed through a graph algorithm based on the topological structure of the root node pairs of the stock right tree on a behavior graph, and mainly comprise graph indexes and graph mode indexes.

3.3.3.1 graph index of root node of stock right tree

The characteristic indexes of the calculation graph based on the topological structure among the root nodes of the stock right tree mainly comprise three types of indexes of node adjacency characteristics, node self characteristics and node path characteristics. In particular, see the following table:

TABLE-5: stock right tree root node pair graph index

3.3.3.2 graph model index of root node of stock right tree

The graph mode characteristic indexes are calculated based on the path mode among the root nodes of the stock right tree, and mainly comprise 7 large-class relations and compound relations thereof. In particular, see the following table:

TABLE-6: stock right tree root node pair graph mode index

3.4 model construction

3.4.1 overview of two-stage model framework

In the development sample of the implicit association relation model, the negative samples are not labeled accurately: positive samples that actually have an implicit association relation are mixed in among them. To safeguard the result of the classification algorithm, the invention adopts a two-stage modeling method to make the positive and negative labels in the development sample more accurate. The scheme also takes the sample imbalance of the model into account (positive samples make up 0.00134%). The specific model scheme is shown in Fig. 10:

the two-stage model scheme is implemented as follows:

3.4.1.1, one-stage model development

Because the negative samples are not labeled accurately and contain positive samples with implicit association relations, the one-stage model is developed in order to obtain the real negative samples. The steps are as follows:

Data sampling: 10 groups of samples are drawn by sampling without replacement, at a positive-to-negative ratio of 1:20.

Model training: each group in turn is used for training while the other 9 groups are used for validation, which gives 10 models together with the average AUC of each model on its validation data.

Model selection: of the 10 models, the one with the highest average AUC on the validation data is selected as the one-stage model.
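The data sampling step can be sketched as below. The text does not say how the positives are distributed across the groups, so this sketch assumes that every group reuses all positive samples and that only the negatives are drawn without replacement; the function name and toy data are hypothetical.

```python
import random

def sample_groups(positives, negatives, ratio=20, n_groups=10, seed=0):
    """Build n_groups samples with a positive:negative ratio of 1:ratio.

    Negatives are drawn without replacement, so no negative sample appears
    in more than one group. Assumption: all positives are reused in each group.
    """
    per_group = len(positives) * ratio
    needed = per_group * n_groups
    if needed > len(negatives):
        raise ValueError("not enough negative samples for disjoint groups")
    rng = random.Random(seed)
    drawn = rng.sample(negatives, needed)  # sampling without replacement
    return [
        positives + drawn[i * per_group:(i + 1) * per_group]
        for i in range(n_groups)
    ]

positives = [("pos", i) for i in range(3)]
negatives = [("neg", i) for i in range(1000)]
groups = sample_groups(positives, negatives)
```

Each of the 10 groups then trains one candidate model, and the remaining 9 groups serve as its validation data.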

3.4.1.2 two-stage model development

Based on the one-stage model, purer negative sample data are obtained from its prediction results and then combined with the full set of positive samples for two-stage training, which gives the final model.

Data sampling: 10 groups of samples are drawn by sampling without replacement, at a positive-to-negative ratio of 1:100.

Threshold determination: the model trained in stage one predicts the 10 groups of samples; for each group, the threshold marking the lowest 20% of predicted probabilities is obtained, and the 10 thresholds are averaged to give the average threshold.

Training samples: using the average threshold, the samples whose predicted probability is below the threshold are selected from each of the 10 groups and combined with the full set of positive samples, which gives 10 groups of samples for two-stage model training.

Model training: each group in turn is used for training while the other 9 groups are used for validation, which gives 10 two-stage models together with the average AUC of each model on its validation data.

Model selection: of the 10 models, the one with the highest average AUC on the validation data is selected as the final model.
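The threshold determination and negative-sample filtering can be sketched as follows. The sketch reads "last 20%" as the lowest 20% of predicted probabilities and uses a nearest-rank quantile; both are assumptions, as are the helper names and the toy scores.

```python
import math

def lower_quantile(scores, q=0.2):
    """Nearest-rank quantile: the score below which roughly q of the values lie."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def average_threshold(groups_scores, q=0.2):
    """Average the per-group q-quantile thresholds."""
    thresholds = [lower_quantile(scores, q) for scores in groups_scores]
    return sum(thresholds) / len(thresholds)

def select_reliable_negatives(samples, scores, threshold):
    """Keep samples that the one-stage model scores below the threshold."""
    return [s for s, p in zip(samples, scores) if p < threshold]

# Toy one-stage predictions for two groups of negative-labeled samples.
group_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
group_b = [0.05, 0.15, 0.3, 0.4, 0.5]
threshold = average_threshold([group_a, group_b])
kept = select_reliable_negatives(list("abcdefghij"), group_a, threshold)
```

The retained low-score samples play the role of the purer negatives that are merged with the full set of positive samples for stage two.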

3.4.2 LightGBM Algorithm overview

3.4.2.1 introduction to algorithm

LightGBM is a newer member of the boosting family of algorithms and an efficient implementation of gradient boosting trees, released by Microsoft in 2016. In principle it is similar to XGBoost: the negative gradient of the loss function is used as an approximation of the residual for the current decision tree, and a new decision tree is fitted to it. Compared with XGBoost, it offers higher training efficiency, lower memory usage and direct support for categorical features. It has several advantages in industrial applications, specifically:

The preprocessing requirements for the characteristic variable data are relatively low and the algorithm is insensitive to its input: it tolerates abnormal values, handles missing values automatically, and requires neither correlation processing nor normalization of the characteristic variables. Once the modeling data are available, a baseline result can be obtained quickly.

The degree of industrialization is relatively high and the algorithm is widely used in industry. The main reasons are that the underlying language is C++, which is efficient, that parallel computation is supported, and that data compression and sharding are used when the data volume is large, so as to improve the efficiency of the algorithm as much as possible.

More techniques for preventing overfitting are available, including regularization terms, shrinkage, and row and column sampling.

The method is highly flexible, and the user can customize the optimization objective and the evaluation metric.

LightGBM has optimized support for categorical features: they can be input directly, without additional preprocessing.

3.4.2.2, model development

Because LightGBM is a gradient boosting decision tree (GBDT) method, model tuning, following the principle of the algorithm, mainly consists of determining the model training target, the key model parameters and the model training strategy. The key process of model development is shown in FIG. 4:

Evaluation target setting. Given the business goal of mining implicit association relations, AUC and binary_logloss are adopted as the training targets in this project.

Key parameter setting. There are three main types: 1. General parameters: some can be fixed and generally need no adjustment, while others need detailed tuning. 2. Booster parameters: the parameters related to the weak learners, which are the main focus of tuning. 3. Model learning parameters, which control the model learning process.

Model training strategy. Based on the key settings of model development, tuning concentrates on the parameters that have a large influence on the model and applies a sound optimization strategy. In the invention, the optimal hyper-parameters are searched for by cross validation to train the LightGBM model.

3.4.2.3, parameter training

Parameter tuning is an art with no fixed standard, and different modelers have different tuning habits; the general principle is to tune coarsely first and finely afterwards. Based on IBM project experience, the following scheme is adopted for this project's objective.

1. Determine the following parameters based on the business scenario: boosting_type, objective, metric, early_stopping_rounds.

2. Set n_estimators and learning_rate. Because learning_rate and n_estimators are strongly related, the size of learning_rate can be chosen according to the available computing resources: if the computing resources are not particularly abundant, it is set to 0.1; otherwise a smaller value can be used. In addition, a larger n_estimators is set in combination with early stopping so that the model is trained sufficiently.

3. Keeping n_estimators and learning_rate unchanged, adjust the following parameters in sequence:

1) num_leaves, which determines the complexity of the tree and is a key parameter;

2) max_depth and min_child_samples, which are the most important parameters for determining the complexity of the tree;

3) subsample and colsample_bytree, the sampling of samples and features, which are important parameters for preventing overfitting;

4) lambda and alpha, the regularization parameters, which prevent overfitting, although the effect may not be significant.

4. If learning_rate was set relatively large, reduce it at this point; n_estimators then increases and the training time becomes longer. Once the optimal values of learning_rate and n_estimators are obtained, the tuning is complete.
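The coarse-to-fine procedure above can be sketched as a staged grid search. The parameter names follow LightGBM's scikit-learn-style interface as commonly documented (with `reg_lambda` and `reg_alpha` standing in for lambda and alpha); the candidate values are illustrative assumptions, not the values used in the project. To keep the example self-contained, `toy_evaluate` is a hypothetical stand-in for "train with early stopping and return the validation AUC".

```python
from itertools import product

# Steps 1 and 2: parameters fixed up front (illustrative values).
BASE_PARAMS = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.1,
    "n_estimators": 2000,  # large on purpose; early stopping would cut it short
}

# Step 3: parameter groups tuned one after another.
STAGES = [
    {"num_leaves": [15, 31, 63]},
    {"max_depth": [4, 6, 8], "min_child_samples": [20, 50, 100]},
    {"subsample": [0.7, 0.85, 1.0], "colsample_bytree": [0.7, 0.85, 1.0]},
    {"reg_lambda": [0.0, 1.0, 10.0], "reg_alpha": [0.0, 0.1, 1.0]},
]

def staged_search(base, stages, evaluate):
    """Tune one parameter group at a time, freezing the best values found so far."""
    best = dict(base)
    for grid in stages:
        names = list(grid)
        best_score, best_combo = None, None
        for combo in product(*(grid[n] for n in names)):
            trial = {**best, **dict(zip(names, combo))}
            score = evaluate(trial)
            if best_score is None or score > best_score:
                best_score, best_combo = score, combo
        best.update(zip(names, best_combo))
    return best

def toy_evaluate(params):
    """Hypothetical scorer: peaks at num_leaves=31 and max_depth=6."""
    return -abs(params.get("num_leaves", 31) - 31) - abs(params.get("max_depth", 6) - 6)

tuned = staged_search(BASE_PARAMS, STAGES, toy_evaluate)
```

Tuning group by group keeps the number of trials at the sum of the stage grids rather than their product, at the cost of possibly missing interactions between groups.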

TABLE-7: model parameter introduction

3.4.3 model development results

3.4.3.1 model hyper-parameter setting

Following the parameter tuning method in the previous section, the GridSearch method is adopted to tune the parameters of the implicit association relation mining model. The main tuned parameters and the tuning results are shown in the following table:

TABLE-8: model parameter setting

3.4.3.2, importance of features

According to the model feature importance results, 99 of the 106 candidate characteristic variables are selected by the model and 7 are not. The unselected variables are 2 stock right tree pair graph mode variables and 5 root node pair graph mode variables. The distribution of the 99 selected characteristic variables, split into those ranked in the top 50 and those ranked after the top 50, is shown in the following tables:

TABLE-9: feature importance top-ranked 50 feature distribution

TABLE-10: feature distribution after feature importance ranking 50

No.	Characteristic variable dimension (ranks 51-99)	Quantity (designed quantity)
1	Stock right tree pair: graph index	14 (42)
2	Stock right tree pair: graph mode	8 (17)
3	Stock right tree pair: non-graph index	12 (12)
4	Root node pair: graph index	6 (18)
5	Root node pair: graph mode	9 (17)

To further evaluate the model result, the distributions of the ten index variables with the highest feature importance are analyzed (see Tables 1 to 6 above). The ten variables are: TG40 number of common neighbors; TG39 shortest distance between stock right trees; TP03 percentage of enterprise legal-person relations between stock right trees; TG41 Jaccard similarity index; G17 Katz distance of the node pair; G12 percentage of the degree centrality of the node pair's starting point; TP09 percentage of the top ten fund relations between stock right trees; TP01 percentage of the top ten fund relations between stock right trees; TG42 resource allocation index; and G07 closeness centrality of the node pair's starting point. For all of them, the statistics of the positive and negative samples differ markedly.
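Three of the top-ranked variables (common neighbors, Jaccard similarity, resource allocation index) are standard neighborhood measures. The sketch below computes their textbook definitions for a pair of nodes on an undirected graph; how the patent aggregates them from node pairs to stock right tree pairs is not stated here, so this illustrates only the node-level definitions, with made-up node names.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def neighbor_features(adj, u, v):
    """Common-neighbor count, Jaccard similarity and resource allocation index."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    jaccard = len(common) / len(union) if union else 0.0
    # Resource allocation: each common neighbor contributes 1 / its degree.
    resource_allocation = sum(1.0 / len(adj[w]) for w in common)
    return {
        "common_neighbors": len(common),
        "jaccard": jaccard,
        "resource_allocation": resource_allocation,
    }

edges = [("A", "X"), ("B", "X"), ("A", "Y"), ("B", "Y"), ("Y", "Z"), ("A", "W")]
adj = build_adjacency(edges)
features = neighbor_features(adj, "A", "B")
```

For the toy graph, A and B share the neighbors X and Y out of three distinct neighbors in total, so the Jaccard similarity is 2/3 and the resource allocation index is 1/2 + 1/3.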

3.4.3.3, model evaluation results

3.4.3.3.1, AUC evaluation results

The implicit association relation mining model is evaluated by its average AUC. It is generally preferred that the AUC of the development sample and that of the validation sample differ by no more than 10%. In addition, the average AUC on the full 20191231 sample is calculated to help judge the model performance. The final model evaluation results are shown in the following table:

TABLE-11: results of model evaluation

Taking the evaluation indexes of all data sets in the table together, the AUC of the implicit association relation mining model on the development set, the validation set and the full prediction set reaches 0.85. The differences between the validation set and the development set, and between the full prediction set and the development set, are both below 10%, so the model performs well.
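For reference, AUC can be computed directly from its definition as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below uses the quadratic pairwise form for clarity (a rank-based form would be needed at the scale of this model) and treats the 10% rule as a relative difference, which is an assumption since the text does not say whether the difference is relative or absolute.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def relative_gap(dev_auc, other_auc):
    """Relative AUC difference against the development set (assumed reading of '10%')."""
    return abs(dev_auc - other_auc) / dev_auc

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
value = auc(labels, scores)
gap_ok = relative_gap(0.85, 0.80) <= 0.10  # illustrative numbers only
```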

3.4.3.3.2 PSI evaluation result

The full sample for implicit association relation mining contains about 200 million samples, of which positive samples make up 0.00264%, so the data are extremely imbalanced. In order to test how well the model result generalizes to the full sample and to safeguard the application effect of the model, the stability of the model is further evaluated. Model stability is verified mainly by calculating the PSI index between the development sample and the full prediction sample. The PSI index measures how much the distributions of two groups differ. Under normal conditions a PSI no higher than 0.1 indicates that the model prediction result is stable; a value above 0.25 indicates that the stability of the model is insufficient, and the specific reasons need to be analyzed.
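The PSI index is commonly computed over score bins as the sum of (actual share - expected share) x ln(actual share / expected share). The following is a minimal sketch of that common formula, taking the development sample as "expected" and the full prediction sample as "actual"; the binning scheme and the small floor used to avoid empty bins are assumptions.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index over matching score bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # floor avoids log(0) for empty bins
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total

dev_bins = [50, 50]   # toy bin counts for the development sample
full_bins = [60, 40]  # toy bin counts for the full prediction sample
value = psi(dev_bins, full_bins)
stable = value <= 0.1
```

Identical distributions give a PSI of exactly 0, and the index grows as the two bin distributions drift apart.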

The PSI of the implicit association relation mining model between the development sample and the full sample is 0.003347, which satisfies the stability requirement for the model prediction result and indicates that the model development sample was drawn without bias. Specific results are shown in the following table.

TABLE-12: PSI evaluation results

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A graph feature-based mining method for implicit association between enterprises is characterized by comprising the following steps:

2. The method for mining the implicit relationship between enterprises based on graph characteristics according to claim 1, wherein the step 1 specifically comprises: performing preliminary division and identification on existing data in a database according to the index results of degree centrality, in-degree, out-degree, PageRank and closeness centrality to obtain super nodes, and generating an association relation graph after the removal operation, wherein the database is a TigerGraph database.

3. The method for mining the implicit relationship between enterprises based on graph characteristics as claimed in claim 1, wherein the step 2 comprises the following sub-steps:

4. The method for mining the implicit relationship between enterprises based on graph characteristics as claimed in claim 1, wherein the characteristic variable systems of the two dimensions of the right-of-stock tree pair and the right-of-stock tree root node adopted in the step 3 respectively comprise a right-of-stock tree characteristic variable and a right-of-stock tree root node pair characteristic variable, wherein the right-of-stock tree characteristic variable comprises a right-of-stock tree internal graph index, a right-of-stock tree inter-fund transaction and a right-of-stock tree pair graph pattern index, and the right-of-stock tree root node pair characteristic variable comprises a right-of-stock tree root node pair graph index and a right-of-stock tree root node pair graph pattern index.

5. The mining method of the implicit relationship between enterprises based on graph characteristics as claimed in claim 1, wherein the mining method further comprises step 6: carrying out positive and negative sample data balance processing on the development sample by adopting a two-stage PU-Learning modeling method, and carrying out model evaluation on the implicit association relation mining model by utilizing the development sample subjected to the positive and negative sample data balance processing.

6. The method for mining the implicit relationship between enterprises based on graph characteristics according to claim 1, wherein the step 5 specifically comprises: setting the final data as lightGBM algorithm model entry data, setting an evaluation target, training key hyper-parameters and a model training strategy, finally obtaining model hyper-parameters and model results, namely corresponding to the trained lightGBM algorithm model, namely a recessive association relation mining model, and mining the recessive association relation between enterprises by utilizing the model hyper-parameters and the model results.

7. The method as claimed in claim 6, wherein the evaluation targets are AUC and binary_logloss, and the key hyper-parameters include general parameters, booster parameters and model learning parameters.

8. The method for mining the implicit relationship between enterprises based on graph features as claimed in claim 6, wherein the model training strategy adopts hold-out evaluation or K-fold cross validation.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the graph feature-based mining method for implicit relations between enterprises according to any one of claims 1 to 8 when executing the computer program.

10. A computer-readable storage medium, which stores a computer program, wherein the computer program, when executed by a processor, implements the steps of a graph-feature-based mining method for implicit relations between enterprises according to any one of claims 1 to 8.