CN114219287A

CN114219287A - Taxpayer risk evaluation method based on graph neural network

Info

Publication number: CN114219287A
Application number: CN202111535730.8A
Authority: CN
Inventors: 李超; 马达; 吴石磊; 钟晓刚; 康亚军; 梁少虎; 秦子鹏
Original assignee: China National Software & Service Co ltd
Current assignee: China National Software & Service Co ltd
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2022-03-22

Abstract

The invention discloses a taxpayer risk evaluation method based on a graph neural network, comprising the following steps: 1) constructing a data set for a graph risk propagation model; 2) constructing a graph network based on the basic information of each taxpayer in the data set, the graph network serving as a taxpayer attribute information network, and constructing a taxpayer ticket flow relation information network based on value-added tax special invoice information; 3) merging the taxpayer attribute information network and the taxpayer ticket flow relation information network to obtain a final graph network, and then obtaining the adjacency matrix of the final graph network; 4) constructing a feature vector for each taxpayer in the data set; 5) using the adjacency matrix as a network parameter of the graph risk propagation network, and training the graph risk propagation network with the feature vectors of the taxpayers in the data set; 6) for a group of taxpayers to be evaluated, obtaining the feature vector of each taxpayer in the group and inputting the feature vectors into the trained graph risk propagation network for prediction, to obtain the risk grade of each taxpayer and to determine whether a gang situation exists.

Description

Taxpayer risk evaluation method based on graph neural network

Technical Field

The invention relates to a taxpayer risk evaluation method, in particular to a graph risk propagation model based on a graph convolutional network (GCN), and belongs to the field of artificial intelligence.

Background

Taxpayer risk level evaluation means that the tax authority comprehensively evaluates a taxpayer's risk level according to the taxpayer's historical tax payment record, business operation, sales of goods and the like. At present the evaluation is largely manual: experts assess taxpayer risk from existing data, which cover only the taxpayer's basic information and operating information. Each index of the taxpayer is scored according to fixed rules, and the final score is the sum over all indices. This requires considerable manual work and, more importantly, considers only the taxpayer's own information, ignoring the position in which the taxpayer is embedded in the transaction chain and the relationship network. Such a model has difficulty detecting the "gang" (group-partner) pattern, which is extremely important for tracing how risk propagates and changes during taxpayer operation. Therefore, a graph risk propagation model is proposed that combines taxpayer feature information, ticket flow information and the topology of the relationship network. It evaluates taxpayer risk levels intelligently, reduces manual workload, and can detect gangs by means of graph network algorithms.

Currently, there are two main types of solutions to the risk level evaluation task: 1. schemes based on machine learning, and 2. schemes based on expert modeling.

1. The machine learning scheme models the taxpayer's basic information as machine learning features and then classifies with algorithms such as decision trees, support vector machines (SVM) and gradient boosting decision trees (GBDT). Its advantage is that the influence of each attribute in the basic information on the classification result can be seen directly, so interpretability is strong. Its disadvantages are that taxpayer feature vectors must be designed by hand and, more importantly, that the ticket flow relationship and the sales of goods are not considered: each taxpayer can only be analyzed in isolation, the necessary topological information is missing, the results are inaccurate, and gangs cannot be detected.

2. The expert modeling scheme relies mainly on expert experience. Experts distill from practice a body of business knowledge that judges taxpayer risk relatively accurately, score the taxpayer on a series of characteristics, and finally determine the taxpayer's risk level. For example, a taxpayer is given a high risk score on the basis of characteristics such as sales, imports and night-time invoicing, whereas if the taxpayer is in a business-sensitive industry the risk score is relatively low. The drawbacks of expert scoring are obvious: expert knowledge is required, labor cost is high, and the approach lacks intelligence. Business knowledge also lags behind and cannot cope well with new situations.

Disclosure of Invention

In view of the above technical problems, the invention aims to provide a taxpayer risk evaluation method based on risk propagation in a graph convolutional neural network. A scheme based on graph theory constructs a ticket flow network from invoice data on the goods traded between buyer and seller. Graph algorithms are then applied to the ticket flow network: centrality algorithms (degree centrality, closeness centrality, betweenness centrality, PageRank and the like) and community detection algorithms (Label Propagation, Connected Components and the like). Graph-theoretic algorithms take better account of the ticket flow relationships (upstream and downstream relationships) between enterprises, identify the most influential nodes in the graph, and partition communities. Their advantage is that taxpayers form a relationship network through ticket flow and the analysis is carried out within that network, so gang relationships can be found readily. Their drawback is equally clear: the taxpayer's basic attribute information cannot be used, and that information is also important for predicting the taxpayer's risk level.

Is there a method that considers both the taxpayer's basic information and the topological information? Graph convolutional networks (GCN) can solve this problem.

The taxpayer's basic information (taxpayer name, enterprise age, industry type, credit level, number of employees and the like), the association relationship network (same registered/business address, MAC address, legal representative, tax handler, invoice collector, telephone number and the like) and the invoice flow network (the network formed by invoices passing between taxpayers) can all reflect changes in a taxpayer's risk level. By using the graph convolution risk propagation model, the taxpayer's basic information, the association relationship network and the invoice flow network are combined and risk is propagated among taxpayers, so that taxpayer risk grades can be evaluated accurately and reasonably. This helps the tax authority evaluate taxpayer risk, reduces manual workload, and improves the ability to control risky enterprises in advance. The method is significant for automated, intelligent business processing and is an important step toward building an intelligent tax system.

The method evaluates and predicts taxpayer risk grades with good predictive performance, and issues timely risk alerts for taxpayers that are high-risk and form a gang. The specific flow of the method is shown schematically in FIG. 1.

Step1, constructing the taxpayer attribute information network: a graph network is constructed on the basis of the taxpayers' basic information, specifically: 1. the taxpayer's registered address; 2. the MAC address from which the taxpayer issues invoices; 3. the enterprise telephone number; 4. the identity document number of the taxpayer's legal representative; 5. the names and identity document numbers of the enterprise's accountant and tax clerk. Taxpayers are the graph network nodes. If two taxpayers share any one of the same registered address, the same MAC address, the same telephone number or the same legal-representative identity document number, the two taxpayers are connected by an edge. Taking the MAC address as an example: first all taxpayers with the same MAC address are found, and then a fully connected graph is constructed over them (that is, every entry of their adjacency matrix is 1 except the diagonal, which is 0). The other attributes are processed in the same way.
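The construction in Step1 can be illustrated with a minimal Python sketch. The record layout (dictionaries with `mac` and `phone` keys) and the function name `build_attribute_adjacency` are hypothetical and chosen only for illustration; the source does not prescribe a data format.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def build_attribute_adjacency(records, attribute_keys):
    """Link every pair of taxpayers that share a value for any listed attribute.

    `records` is a list of dicts (hypothetical layout); the result is a
    symmetric 0/1 adjacency matrix with a zero diagonal.
    """
    n = len(records)
    adj = np.zeros((n, n), dtype=np.int8)
    for key in attribute_keys:
        groups = defaultdict(list)
        for idx, rec in enumerate(records):
            value = rec.get(key)
            if value:  # skip missing or empty values
                groups[value].append(idx)
        # Taxpayers sharing one value form a fully connected subgraph.
        for members in groups.values():
            for i, j in combinations(members, 2):
                adj[i, j] = adj[j, i] = 1
    return adj


records = [
    {"mac": "AA:01", "phone": "111"},
    {"mac": "AA:01", "phone": "222"},
    {"mac": "BB:02", "phone": "222"},
    {"mac": "CC:03", "phone": "333"},
]
attr_adj = build_attribute_adjacency(records, ["mac", "phone"])
```

Here taxpayers 0 and 1 are linked through the shared MAC address and taxpayers 1 and 2 through the shared telephone number, while taxpayer 3 stays isolated.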

Step2, constructing the taxpayer ticket flow relation information network on the basis of value-added tax special invoice information: if taxpayer A sells goods to taxpayer B and the total amount within one year exceeds 100,000 (10万), an edge is drawn from taxpayer A to taxpayer B, thereby forming the ticket flow network.
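A minimal sketch of Step2 follows, assuming each invoice is a `(seller, buyer, amount)` tuple already restricted to one year. The tuple layout, the function name and the default threshold of 100,000 (one reading of the amount stated above, exposed as a parameter) are assumptions made for illustration.

```python
from collections import defaultdict

import numpy as np


def build_ticket_flow_adjacency(invoices, taxpayer_ids, min_total=100_000):
    """Directed adjacency: edge seller -> buyer when the yearly total exceeds min_total.

    `invoices` holds hypothetical (seller_id, buyer_id, amount) tuples.
    """
    index = {tid: i for i, tid in enumerate(taxpayer_ids)}
    totals = defaultdict(float)
    for seller, buyer, amount in invoices:
        totals[(seller, buyer)] += amount  # sum amounts per seller/buyer pair
    adj = np.zeros((len(taxpayer_ids), len(taxpayer_ids)), dtype=np.int8)
    for (seller, buyer), total in totals.items():
        if total > min_total and seller in index and buyer in index:
            adj[index[seller], index[buyer]] = 1
    return adj


invoices = [
    ("A", "B", 60_000),
    ("A", "B", 50_000),   # A -> B totals 110,000: edge kept
    ("B", "C", 90_000),   # below the threshold: no edge
    ("C", "A", 150_000),
]
flow_adj = build_ticket_flow_adjacency(invoices, ["A", "B", "C"])
```

Summing per seller/buyer pair before thresholding matters: neither A-to-B invoice alone exceeds the threshold, but their yearly total does.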

Step3, constructing the final graph network. The attribute information network is merged with the ticket flow relation information network. Because the attribute information network is undirected while the ticket flow relation network is directed, the two graphs cannot be merged directly. The attribute information network is therefore first converted into a directed network, each undirected edge being equivalent to two directed edges in opposite directions. The two directed networks are then merged, that is, the adjacency matrices of the two directed graphs are combined element-wise (bitwise OR) to form the final graph network, and the adjacency matrix of the final graph network is used as a parameter of the neural network.
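The merge in Step3 can be sketched as follows, under the assumption that the two adjacency matrices are combined by an element-wise (bitwise) OR, so that an edge exists in the final graph if it exists in either network. The function name `merge_graphs` is illustrative.

```python
import numpy as np


def merge_graphs(attr_adj, flow_adj):
    """Union of the attribute network and the ticket flow network.

    A symmetric 0/1 matrix already encodes each undirected edge as two
    opposite directed edges; np.maximum enforces that symmetry defensively.
    The bitwise OR is an assumed reading of the merge rule.
    """
    attr_directed = np.maximum(attr_adj, attr_adj.T).astype(np.int8)
    return np.bitwise_or(attr_directed, flow_adj.astype(np.int8))


attr_adj = np.array([[0, 1, 0],
                     [1, 0, 0],
                     [0, 0, 0]], dtype=np.int8)
flow_adj = np.array([[0, 0, 0],
                     [0, 0, 1],
                     [1, 0, 0]], dtype=np.int8)
final_adj = merge_graphs(attr_adj, flow_adj)
```

The result is generally asymmetric, because ticket flow edges keep their seller-to-buyer direction.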

Step4, taxpayer name vectorization. Using word vector (Word2vec) technology, the taxpayer name is first segmented with the Python jieba word segmentation library to obtain a token sequence; the tokens are looked up in a pre-trained word vector lookup table to obtain the taxpayer name vectorization matrix. Because the length of the token sequence varies from name to name, global average pooling and global max pooling are applied and their outputs concatenated to give a fixed-length final taxpayer name vector;
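The pooling in Step4 can be sketched as below. The name is assumed to be already segmented into tokens (the source uses jieba for this), and `toy_lookup` is a made-up three-dimensional lookup table standing in for the pre-trained word vectors.

```python
import numpy as np


def name_vector(tokens, lookup, dim):
    """Concatenate global average pooling and global max pooling over token vectors.

    Returns a vector of length 2 * dim regardless of how many tokens there are.
    Tokens missing from the lookup table are skipped (an assumption).
    """
    rows = [lookup[t] for t in tokens if t in lookup]
    if not rows:
        return np.zeros(2 * dim)
    matrix = np.stack(rows)  # shape: (number of tokens, dim)
    return np.concatenate([matrix.mean(axis=0), matrix.max(axis=0)])


# Hypothetical toy lookup table; real word vectors would be pre-trained.
toy_lookup = {
    "北京": np.array([1.0, 0.0, 2.0]),
    "科技": np.array([3.0, 4.0, 0.0]),
    "有限公司": np.array([-1.0, 2.0, 1.0]),
}
vec = name_vector(["北京", "科技", "有限公司"], toy_lookup, dim=3)
```

With concatenation the output length is twice the word vector dimension; using max pooling alone would keep the original dimension.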

Step5, taxpayer registration duration vectorization. The taxpayer's registration date is subtracted from the current time of the survey (2021) to obtain the taxpayer age, which is divided into 5 grades: grade A for 0-1 year, grade B for 1-2 years, grade C for 2-5 years, grade D for 5-7 years, and grade E for more than 7 years. The five grades are vectorized with one-hot coding to obtain a taxpayer age vector of length 5;
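The grading and one-hot coding of Step5 can be sketched with a small helper. The source does not say which grade a boundary value (for example exactly 1 year) belongs to; this sketch assumes left-closed intervals, and the helper name `one_hot_bucket` is illustrative.

```python
from bisect import bisect_right

import numpy as np

AGE_EDGES = [1, 2, 5, 7]  # years; upper bounds of grades A to D


def one_hot_bucket(value, edges):
    """One-hot vector of length len(edges) + 1 marking the bucket holding value.

    Assumption: a value equal to an edge falls into the higher bucket.
    """
    vec = np.zeros(len(edges) + 1, dtype=np.float32)
    vec[bisect_right(edges, value)] = 1.0
    return vec


# A taxpayer registered in 2017 and surveyed in 2021 is 4 years old: grade C.
age_vec = one_hot_bucket(2021 - 2017, AGE_EDGES)
```

The same helper applies to any other graded attribute by changing the edges, for instance `[5, 10, 30, 200, 500, 1000]` for a seven-grade employee count.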

Step6, taxpayer industry code vectorization. The industry code is a four-digit number that encodes different industries; since the code is already numeric, the four-digit industry code is used directly as the industry code vector;

Step7, taxpayer employee number vectorization. The number of employees is divided into 7 grades: grade A for 0-5, grade B for 5-10, grade C for 10-30, grade D for 30-200, grade E for 200-500, grade F for 500-1000 and grade G for more than 1000. The seven grades are vectorized with one-hot coding to obtain an employee number vector of length 7.

Step8, taxpayer credit level vectorization. The taxpayer credit level is obtained from a series of taxpayer credit evaluations and is divided into 5 levels: A, B, C, D and M. Level M indicates that the enterprise has not yet been evaluated, and credit decreases from level A to level D. The five levels are vectorized with one-hot coding to obtain a credit level vector of length 5.

Step9, constructing the taxpayer feature vector. The taxpayer feature vector comprises: 1. taxpayer name, 2. taxpayer registration duration, 3. taxpayer industry code, 4. taxpayer employee number, and 5. taxpayer credit level. The five types of features are vectorized and concatenated into the feature vector.

Step10, constructing the data set of the graph risk propagation model. A portion of the taxpayers in the national risk directory database are selected and their risk value is set to 1.0; taxpayers in the national risk directory are those that have been audited by the tax authority. A portion of the taxpayers identified as abnormal users are selected and their risk value is set to 0.8; taxpayers are identified as abnormal users for reasons such as failing to pay taxes on time. A portion of normal taxpayers are selected and their risk value is set to 0.0. These three parts form the final data set, after which experts fine-tune the risk values of part of the data so that the risk values are distributed over the interval 0-1. Finally, the interval 0-1 is divided into grades: grade A 0-0.1, grade B 0.1-0.3, grade C 0.3-0.5, grade D 0.5-0.7 and grade E 0.7-1.0. The five grades serve as the taxpayer risk labels.
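The labelling rule of Step10 can be sketched as a mapping from risk value to grade. The boundaries follow the intervals listed above, with grade D taken as 0.5-0.7 (the interval between C and E); the treatment of exact boundary values and the function name `risk_grade` are assumptions.

```python
from bisect import bisect_right

GRADE_EDGES = [0.1, 0.3, 0.5, 0.7]  # upper bounds of grades A to D
GRADES = "ABCDE"


def risk_grade(value):
    """Map a risk value in [0, 1] to a grade label A to E.

    Assumption: a value equal to an edge falls into the higher-risk grade.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("risk value must lie in [0, 1]")
    return GRADES[bisect_right(GRADE_EDGES, value)]


# Normal taxpayer, abnormal user, audited taxpayer, expert-adjusted value.
labels = [risk_grade(v) for v in (0.0, 0.8, 1.0, 0.25)]
```

Under this mapping both the abnormal users (0.8) and the audited taxpayers (1.0) start in grade E until experts adjust individual values.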

Step11, training the graph risk propagation network. The invention uses a GCN (graph convolutional network) model. A GCN is a neural network that performs convolution on a graph and effectively combines the local topological structure of a node with the node's features. A GCN can also be regarded as a label propagation algorithm, in which node labels are propagated along the topological network, and it gives relatively accurate results even when label coverage is low. The structure of the GCN is shown in FIG. 2; the mathematical expression of the two-layer GCN is

Z＝f(X,A)＝Softmax(ARelu(AXW⁽⁰⁾)W⁽¹⁾)

wherein X represents the feature vectors of the nodes; A represents the normalized adjacency matrix, namely the adjacency matrix of the graph network obtained in Step 3; Relu represents the Relu activation function; W⁽⁰⁾ represents the parameter matrix of the first graph convolution layer; and W⁽¹⁾ represents the parameter matrix of the second graph convolution layer. The first and second layers are connected through the activation function Relu, and the final result is output through the Softmax function. Different loss functions are selected for different downstream tasks; since the task here is node classification, the cross-entropy loss function is chosen.
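The forward pass of the two-layer expression can be written out directly in NumPy as a sketch. The weights below are random and the adjacency matrix is a uniform placeholder, so the output only demonstrates shapes and the row-wise Softmax; the dimensions 71, 16 and 5 are the feature, hidden and output sizes given in the detailed description.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def gcn_forward(a_norm, x, w0, w1):
    """Z = Softmax(A Relu(A X W0) W1) for a two-layer GCN."""
    hidden = relu(a_norm @ x @ w0)
    return softmax(a_norm @ hidden @ w1)


rng = np.random.default_rng(0)
n_nodes, n_features, n_hidden, n_classes = 4, 71, 16, 5
a_norm = np.full((n_nodes, n_nodes), 1.0 / n_nodes)  # placeholder, not a real graph
x = rng.normal(size=(n_nodes, n_features))
w0 = rng.normal(scale=0.1, size=(n_features, n_hidden))  # untrained weights
w1 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
z = gcn_forward(a_norm, x, w0, w1)
```

Each row of `z` is a probability distribution over the five risk grades for one taxpayer node.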

Step12, model testing: the trained model is tested on a test set to verify its generalization ability.

Step13, model evaluation: using 5-fold cross validation, the data set is divided into five parts; in turn, four parts are used for training and one part for validation, and the average of the 5 results is taken as the estimate of model accuracy. Accuracy is selected as the evaluation index; if the result is unsatisfactory, the model parameters are adjusted and the model is retrained.
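The 5-fold procedure can be sketched as index bookkeeping. The `evaluate` callback is a hypothetical stand-in for "train on the training indices, score on the validation indices"; the shuffling seed and function names are illustrative.

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]


def cross_validate(n_samples, evaluate, k=5):
    """Average the score returned by evaluate(train_idx, val_idx) over k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return float(np.mean(scores))


# Hypothetical scorer: returns the validation share instead of training a model.
mean_score = cross_validate(10, lambda tr, va: len(va) / 10)
splits = list(k_fold_indices(10))
```

In a real run the callback would train the graph risk propagation network with the labels of the training nodes and report the chosen metric on the held-out nodes.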

Step14, taxpayer risk level prediction: for a group of taxpayers to be evaluated, the trained model predicts the risk level of each taxpayer. It is then determined whether high-risk taxpayers and gangs exist: whether the high-risk taxpayers form a gang is judged from the corresponding network structure, by means of the maximum connected subgraph. In actual judgment, the maximum connected subgraph is first pruned according to certain rule logic, and the gang judgment is made after pruning.
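One way to sketch the gang check in Step14 is to look for connected groups among the taxpayers predicted as high-risk. The source does not specify the pruning rules, so none are implemented here; treating edges as undirected and requiring at least `min_size` members are assumptions, as is the function name.

```python
from collections import deque

import numpy as np


def high_risk_groups(adj, high_risk, min_size=2):
    """Connected components among high-risk nodes (edges treated as undirected).

    `high_risk` is a boolean mask; components smaller than min_size are dropped.
    """
    undirected = (adj + adj.T) > 0
    candidates = set(np.flatnonzero(high_risk).tolist())
    seen, groups = set(), []
    for start in sorted(candidates):
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:  # breadth-first search restricted to high-risk nodes
            node = queue.popleft()
            component.append(node)
            for nb in np.flatnonzero(undirected[node]).tolist():
                if nb in candidates and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(component) >= min_size:
            groups.append(sorted(component))
    return groups


adj = np.array([
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
], dtype=np.int8)
high_risk = np.array([True, True, True, False, True])
groups = high_risk_groups(adj, high_risk)
```

Taxpayers 0, 1 and 2 form one candidate gang; taxpayer 4 is high-risk but its only neighbor is not, so it is not reported as a group.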

Compared with the prior art, the invention has the advantages that:

1. In Steps 4 to 8, the vectorized taxpayer basic information includes the taxpayer name, the enterprise age, the credit level, the number of employees and the like. This information is closely related to the evaluation of the taxpayer's risk coefficient.

2. In Steps 1, 2 and 3, a graph network is constructed; by using the ticket flow information (the upstream and downstream invoice relationships) and the taxpayer attribute information (same invoicing MAC address, same registered address and the like), the topological position of each taxpayer in the graph network is determined more accurately.

3. In Step 11, a recent graph convolutional network model, the two-layer GCN, is used. The GCN combines the feature matrix X in the GCN formula, which holds the taxpayers' basic information, with the adjacency matrix, which describes the topological network in which the taxpayers are located. The method thus considers not only the taxpayer's basic information but also the ticket flow relationship, which makes the results reasonable and accurate and makes gangs easy to detect. The parameters are trained with a neural network, giving an end-to-end application, and the model achieves good results, with the average F1 value reaching 85%.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a diagram of a GCN model.

Detailed Description

The present invention is described in further detail below with reference to the attached drawings.

1. Building graph networks

1) Constructing the taxpayer attribute information network. The relevant fields of the tax authority's taxpayer basic information table are used. In the specific implementation, the registered address is taken from the taxpayer basic information table; because registered addresses are sometimes filled in irregularly, they are standardized using an external resource, and the standardized addresses are used for edge association: if the standardized registered addresses of two enterprises are the same, an edge connects the two enterprises. In the same way, the invoicing MAC address, the registered telephone number, the identity document number of the enterprise's legal representative, and the names and identity document numbers of the financial officer, the tax handler and the invoice collector are taken from the taxpayer basic information table. If at least one of these items is the same for two enterprises, an edge connects them. The attribute information network is thus formed.

2) Constructing the ticket flow information network. Value-added tax special invoice data are obtained, including fields such as the seller's taxpayer identification number, the buyer's taxpayer identification number, the goods name and the goods amount. Data from the most recent year are collected, the amounts of invoice records with the same seller and buyer identification numbers are summed, and the aggregated records whose amount is less than 100,000 (10万) are filtered out. The ticket flow information network is built from the remaining invoice data: each record yields a directed edge from the seller taxpayer to the buyer taxpayer, and constructing edges for all records in this way gives the final ticket flow information network.

3) Merging the attribute information network with the ticket flow relation information network. Because the two graphs differ in kind, one being undirected and the other directed, they cannot be merged directly. The undirected graph is converted into a directed graph, with each undirected edge corresponding to two oppositely directed edges. After the attribute information graph has been converted into a directed graph, it is merged with the ticket flow relation information network to obtain the final directed graph, which serves as the topological structure graph of the taxpayers.

2. Constructing taxpayer feature vectors

The taxpayer feature vector expresses the taxpayer's own information; the taxpayer's name, registration duration, industry, number of employees and credit level all have a strong influence on the evaluation of the taxpayer's risk level. The taxpayer name is a character string, which is vectorized with the word vector (word2vec) technique from natural language processing: the name is first segmented into words, and each word is looked up in a word vector table to obtain the corresponding vector. Because the number of words after segmentation may vary, a max pooling layer is added afterwards to obtain the final taxpayer name vector. The word vector lookup table has 50 dimensions, so the resulting taxpayer name vector also has 50 dimensions. The other attributes are graded and one-hot encoded to obtain the corresponding vectors, and finally all the vectors are concatenated to obtain the taxpayer feature vector, whose dimension is 71.
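The dimension of 71 can be checked with a short concatenation sketch: 50 (name) + 5 (registration duration) + 4 (industry code) + 7 (employee number) + 5 (credit level). Representing the four-digit industry code by its four digits is an assumed reading that matches this total; the function names and argument layout are illustrative.

```python
import numpy as np


def one_hot(index, size):
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec


def industry_digits(code):
    """Four-digit industry code -> vector of its four digits (assumed encoding)."""
    text = f"{int(code):04d}"
    if len(text) != 4:
        raise ValueError("industry code must have four digits")
    return np.array([int(ch) for ch in text], dtype=np.float32)


def taxpayer_features(name_vec, age_grade, industry_code, staff_grade, credit_grade):
    """Concatenate the five feature groups into one 71-dimensional vector."""
    return np.concatenate([
        np.asarray(name_vec, dtype=np.float32),   # 50-dim pooled name vector
        one_hot(age_grade, 5),                    # registration duration grade
        industry_digits(industry_code),           # 4 digits
        one_hot(staff_grade, 7),                  # employee number grade
        one_hot("ABCDM".index(credit_grade), 5),  # credit level A, B, C, D or M
    ])


features = taxpayer_features(np.zeros(50), age_grade=2, industry_code="6510",
                             staff_grade=3, credit_grade="B")
```

Stacking one such vector per taxpayer gives the feature matrix X consumed by the GCN.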

3. Graph risk model building

A two-layer GCN network is used as the graph risk propagation model; the algorithm structure of a single GCN layer is shown in FIG. 2. Its mathematical expression is:

H＝Relu(AXW)

wherein A represents the normalized adjacency matrix of the graph, X represents the input feature vectors, W represents the weight coefficients, and Relu represents the nonlinear activation function. H denotes the resulting hidden layer vector.
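The source states that A is normalized but does not give the normalization. As an illustration only, the sketch below adds self-loops and divides each row by its sum, a common choice that remains well defined for directed graphs; other schemes, such as symmetric degree normalization, are equally plausible.

```python
import numpy as np


def normalize_adjacency(adj):
    """Row-normalize A + I so that every row sums to 1 (assumed scheme)."""
    a_tilde = adj.astype(np.float64) + np.eye(adj.shape[0])  # add self-loops
    degree = a_tilde.sum(axis=1, keepdims=True)
    return a_tilde / degree


adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [1, 0, 0]])
a_norm = normalize_adjacency(adj)
```

With this choice each layer replaces a node's features by the average over the node itself and the nodes it points to, and the self-loops guarantee that no row sum is zero.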

To increase the feature extraction capability of the network model, two GCN layers are used to extract features; a two-layer GCN can mine the second-order neighbor information of nodes. The expression of the two-layer GCN model is:

Z＝f(X,A)＝Softmax(ARelu(AXW⁽⁰⁾)W⁽¹⁾)

wherein W⁽¹⁾ is the weight matrix of the second layer, Softmax is the activation function of the second layer, and Z is the resulting node representation vector. Specifically, the dimension of the hidden layer (i.e., the output of the first GCN layer) is set to 16, and the dimension of the final node representation vector is 5.

The invention builds the two-layer model with DGL. DGL is a Python package for conveniently building graph neural networks, which can run with PyTorch, MXNet or TensorFlow as the back end. The invention builds a two-layer GCN taxpayer risk propagation model whose inputs are the topological graph and the taxpayer feature vectors and whose output is the risk grade (grades A, B, C, D and E). The two GCN layers are connected through Relu as the activation function, the Adam algorithm is selected as the optimizer, cross-entropy loss is used as the loss function of the model, and the F1 value is used as the evaluation index of the model.

4. Graph risk model training and evaluation

After the two-layer GCN network is built, model training begins. The batch size is set to 256, and the Adam optimizer parameters are set to beta1 = 0.9, beta2 = 0.999 and epsilon = 10e-8. Since the model performs a classification task, the cross-entropy function is used as the loss, which in its standard form is:

Loss＝−Σᵢ yᵢ·log(ŷᵢ)

where y is the true label and ŷ is the label predicted by the model. The model is evaluated with the F1 value, which in its standard form is:

F1＝2·Precision·Recall/(Precision＋Recall)

The F1 value balances precision and recall and is widely used in classification tasks.
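For illustration, the loss and the metric can be computed as follows. The source does not say how F1 is averaged over the five grades; macro averaging is assumed here, and the function names are illustrative.

```python
import numpy as np


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class of each sample."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2PR / (P + R) simplifies to 2TP / (2TP + FP + FN).
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])
f1 = macro_f1(y_true, y_pred, n_classes=3)
loss = cross_entropy(np.array([[0.5, 0.5]]), np.array([0]))
```

In this toy example the per-class F1 values are 2/3, 0.8 and 1, and the single-sample loss equals log 2.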

The method uses 5-fold cross validation to evaluate the model: the data set is divided into five parts, and in turn four parts are used as training data and one part as test data. Each experiment gives a corresponding F1 value. The average F1 of the 5 results is used as the estimate of model performance; the average F1 of the two-layer GCN model is 85%, which is a satisfactory result.

5. Graph risk propagation model prediction

With the trained two-layer GCN risk propagation model, the taxpayer feature vectors are input into the model to predict the risk grade of each taxpayer; the risk grades are divided into five grades, A, B, C, D and E. Grade A is the highest level and the other grades decrease in sequence. Gangs can be detected from the relationships among the risky taxpayers in the graph, which is of great significance in practical application.

Although specific embodiments of the invention have been disclosed for purposes of illustration, and for purposes of aiding in the understanding of the contents of the invention and its implementation, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims

1. A taxpayer risk evaluation method based on a graph neural network comprises the following steps:

1) constructing a data set of a graph risk propagation model, wherein the data set comprises taxpayers audited by a tax authority and their set risk values, taxpayers identified as abnormal users and their set risk values, and normal taxpayers and their set risk values; setting a plurality of intervals according to the value range of the risk value, wherein each interval corresponds to a taxpayer risk grade label; wherein the risk value of a taxpayer audited by the tax authority is greater than the risk value of a taxpayer identified as an abnormal user, which in turn is greater than the risk value of a normal taxpayer;

2) constructing a graph network based on the basic information of each taxpayer in the data set, wherein the graph network is used as a taxpayer attribute information network; constructing a taxpayer ticket flow relation information network on the basis of value-added tax special invoice information;

3) combining the taxpayer attribute information network and the taxpayer ticket flow relation information network to obtain a final graph network; then acquiring an adjacency matrix of the final graph network;

4) carrying out taxpayer name vectorization, taxpayer registration duration vectorization, taxpayer industry code vectorization, taxpayer employee number vectorization and taxpayer credit level vectorization on each taxpayer in the data set respectively; then constructing a feature vector corresponding to each taxpayer based on the vectorization results;

5) taking the adjacency matrix as a network parameter of the graph risk propagation network, and training the graph risk propagation network by using the characteristic vector of each taxpayer in the data set;

6) for a group of taxpayers to be evaluated, acquiring the feature vector of each taxpayer in the group and inputting the feature vectors into the graph risk propagation network trained in step 5) for prediction, so as to obtain the risk grade of each taxpayer and to determine whether a gang situation exists.

2. The method of claim 1, wherein the graph risk propagation network is a GCN graph convolution network model having the mathematical expression: Z＝f(X,A)＝Softmax(ARelu(AXW⁽⁰⁾)W⁽¹⁾); wherein X represents the feature vectors of the taxpayers, A represents the adjacency matrix, Relu represents the Relu activation function, W⁽⁰⁾ represents the parameter matrix of the first graph convolution layer, and W⁽¹⁾ represents the parameter matrix of the second graph convolution layer; the first layer and the second layer are connected through the activation function Relu, and the final result is output through a Softmax function.

3. The method of claim 2, wherein the loss function selected for training the graph risk propagation network is a cross-entropy loss function.

4. The method of claim 1, 2 or 3, wherein the basic information comprises: the registered address of the taxpayer, the MAC address from which the taxpayer issues invoices, the enterprise telephone number, the identity document number of the taxpayer's legal representative, and the names and identity document numbers of the enterprise's financial officer, tax handler or invoice collector; taxpayers are taken as graph network nodes, and if any item of set information is the same for two taxpayers, the two taxpayers are connected by an edge.

5. The method of claim 4, wherein the set information includes the same registered address, the same MAC address, the same telephone number, and the same legal-representative identity document number.

6. A method as claimed in claim 1, 2 or 3, wherein the taxpayer ticket flow relationship information network is constructed by: if the taxpayer A sells the goods to the taxpayer B and the sum of the goods exceeds the set sum within one year, an edge pointing to the taxpayer B from the taxpayer A is established to form a ticket flow network.

7. The method as claimed in claim 1, wherein taxpayer registration duration vectorization is performed as follows: subtracting the taxpayer's registration date from the current time to obtain the taxpayer's registration duration; dividing the registration duration into a plurality of levels, each level corresponding to one one-hot code; and using the code of the level to which the taxpayer's registration duration belongs as the taxpayer's registration duration vector.