WO2021088499A1

WO2021088499A1 - False invoice issuing identification method and system based on dynamic network representation

Info

Publication number: WO2021088499A1
Application number: PCT/CN2020/113450
Authority: WO
Inventors: 郑庆华; 董博; 阮建飞; 范弘铖
Original assignee: 西安交通大学
Priority date: 2019-11-04
Filing date: 2020-09-04
Publication date: 2021-05-14
Also published as: CN110852856B; CN110852856A

Abstract

Disclosed are a false invoice issuing identification method and system based on dynamic network representation. The method comprises: firstly, organizing enterprise transaction information into a static network by taking an enterprise as a node and a transaction record as an edge; secondly, establishing representation of an enterprise transaction network by taking each day as a time node, establishing a time-sequence window with a duration of 30 days, fusing static network representation of 30 days in the window each time, and gradually fusing the static network representation of all the time nodes by means of moving the time-sequence window so as to obtain a final dynamic network representation result; thirdly, by means of a distributed optimization algorithm, decomposing a represented target function into independent sub-functions, and optimizing the sub-functions in parallel to improve the learning efficiency of a model; and finally, constructing a binary classifier on the basis of LightGBM to identify an enterprise that is suspected of issuing false invoices. In the present invention, an enterprise that is suspected of issuing false invoices is identified on the basis of dynamic network representation, thereby improving the efficiency and accuracy of false invoice issuing identification.

Description

Method and system for recognizing false invoice issuance based on dynamic network representation

【Technical Field】

The invention belongs to the technical field of tax control, and particularly relates to a method and system for identifying false invoice issuance based on dynamic network representation.

【Background Art】

False invoice issuance refers to the use of various behavioral means by enterprises to issue invoices that are inconsistent with actual business conditions in order to achieve the purpose of tax evasion.

False invoice issuance causes huge losses of national tax revenue and severely undermines the socialist economic order. At present, tax bureaus identify enterprises suspected of issuing false invoices mainly through tip-offs, routine supervision and spot checks, and implication by other companies already under investigation; tax inspectors then check the statements provided by the company. Such audits depend largely on chance and cannot systematically analyze and evaluate all enterprises. Moreover, manual verification by tax inspectors alone involves a heavy workload and low efficiency, and the inspection data is limited to the reports provided by a single enterprise, so it cannot be combined with information from upstream, downstream, and associated companies.

Network representation technology offers a solution to these problems. A method of identifying false invoice issuance based on network representation can organize isolated report information into a corporate transaction network, so that all companies are verified systematically; at the same time, it can use the links between enterprises to obtain more corporate information for identifying companies that issue false invoices. The following patents provide reference methods that use network representation technology to identify false invoices automatically by computer:

Literature 1. A detection method for false VAT invoices based on parallel loop detection (201710147850.8);

Literature 2. A method for identifying suspicious taxpayers based on the taxpayer’s interest-related network (201410328391.X);

Literature 1 organizes invoice information into a static network with enterprises as nodes and improves loop detection in the network. The improvement distributes the computing tasks to multiple computers in a cluster through distributed parallel computing, which raises efficiency; the improved loop detection method is then used to detect false VAT invoices.

Literature 2 identifies suspicious taxpayers based on the topological characteristics of the taxpayer interest-related network (TPIN): it analyzes these topological characteristics to obtain each taxpayer's representation in the network, and then applies a C4.5 classifier, so as to identify suspicious taxpayers automatically.

The methods described in the above literature mainly have the following problems. Literature 1 can only detect false invoice issuance in which funds return to the source account after passing through multiple accounts; false invoice issuance takes many forms and is not limited to loops, so the method identifies too narrow a range of cases and the model generalizes poorly. Literature 2 relies only on the topological structure of the taxpayer interest-related network, ignores the attribute information of enterprises, and treats enterprises as homogeneous, so it cannot analyze them from the perspective of enterprise scale, market share, and so on. Literature 1 and Literature 2 are both limited to static networks: they cannot dynamically analyze changes in corporate transactions in combination with historical information and cannot accurately capture dynamic changes, which leaves loopholes for some companies to exploit. For example, the bills of a tax-evading company may show no problem in any single year, yet the company has reported losses for several years while its water and electricity costs have risen year by year. False invoice issuance is usually hidden in such time-series-related characteristics, which a static network cannot capture.

【Summary of the Invention】

In order to improve the efficiency of identifying false invoices, the purpose of the present invention is to provide a method and system for identifying false invoices based on dynamic network representation. The invention adopts dynamic network representation, dynamically analyzes the enterprise transaction network in combination with historical information, and accurately captures the dynamic changes of enterprise transactions. It can identify different kinds of false invoice issuance based on the related information between enterprises. At the same time, it draws on a distributed optimization algorithm to decompose the objective function into independent sub-functions that are executed in parallel, which improves the efficiency of identifying false invoices.

In order to achieve the above objectives, the present invention adopts the following technical solutions to achieve:

A method for identifying false invoice issuance based on dynamic network representation. First, the transaction information of enterprises is organized into a static network with the enterprise as the node and the transaction record as the edge. Second, a representation of the enterprise transaction network is established with each day as a time node, and a 30-day time-sequence window is established; the 30 daily static network representations inside the window are fused each time, and the static network representations of all time nodes are gradually fused by moving the window to obtain the final dynamic network representation. Third, drawing on a distributed optimization algorithm, the objective function of the representation is decomposed into independent sub-functions, which are optimized in parallel to improve the learning efficiency of the model. Finally, a binary classifier is constructed based on LightGBM to identify the enterprises suspected of issuing false invoices.

The further improvement of the present invention lies in:

The method specifically includes the following implementation steps:

Step 1, basic feature extraction

First, the data is preprocessed, and then the basic information of the company is extracted. The basic information of the company falls roughly into three types: text data is converted into vectors by the word2vec algorithm, categorical data is encoded with One-Hot, and numerical data is standardized;

Step 2. Feature extraction based on dynamic network representation

After the basic features of the enterprise are extracted, the enterprise transaction information is organized into a static network, with the enterprise as the node, the basic information of the enterprise as the node attribute, the transaction record as the edge, the transaction information as the edge attribute, and each day as a time node; a time-sequence window of 30 days is then established, the 30 daily static network representations inside the window are fused each time, and the static network representations of all time nodes are gradually fused by moving the window, so that the objective function of the network representation is optimized and the optimal dynamic enterprise transaction network representation is finally obtained;

Step 3. Optimization based on a distributed algorithm

In order to improve the learning efficiency of the dynamic network representation, a distributed optimization algorithm is drawn on to decompose the objective function of the dynamic corporate transaction network representation into independent sub-functions. Optimizing the sub-functions in parallel accelerates the solution of large-scale, complex corporate transaction network representations;

Step 4. Build a classifier to identify false invoices

Construct a binary classification model based on the LightGBM classifier, use the calculated dynamic network representation as the learning data of the classifier, and train the model with the labeled enterprise sample set; then feed the representation results of the enterprise samples to be predicted into the trained model, and finally determine from the output of the prediction model whether the target company has issued false invoices.

The implementation method of step 1 is as follows:

Step 101, data preprocessing

(1) Extract the "Taxpayer Electronic File Number" as the unique identifier of the company's characteristics;

(2) Handle missing values: attributes with severe missing data and attributes unrelated to the false-invoice task are deleted directly; for important attributes with a small number of missing values, the missing values are filled in by same-class mean imputation;

Step 102, processing text data

The processing of text information in the enterprise basic information table includes:

(1) Use Jieba word segmentation tool to segment the text data of the enterprise;

(2) Use the dictionary tree to count the results of word segmentation, and select words with larger weights as keywords;

(3) Convert the extracted N types of keywords into vectors based on word2vec;

Step 103, processing categorical data

Use One-Hot coding for the discrete category data in the basic information table of the enterprise; use the number of attribute values as the length to establish a status bit to mark each specific state;

Step 104, processing numerical data

The numerical data in the basic information table of the enterprise is processed by traditional standardized methods:

(1) Find the mean value of each attribute;

(2) Find the variance of each attribute;

(3) Z-Score standardization.

The implementation method of step 2 is as follows:

Step 201: Establish a static corporate transaction network

A representation model of the corporate transaction network is established for every day, so that companies with similar topological structures or higher transaction weights are closer in the representation space. The objective optimization function is:

$$\min_{h} \sum_{i,j} w_{ij}\,\lVert h_i - h_j \rVert^{2} \qquad (1)$$

where $h_i$ and $h_j$ are the representations of enterprises i and j, and $w_{ij}$ is the weight of the transaction between them. Minimizing $w_{ij}\lVert h_i - h_j\rVert^{2}$ forces the representations of enterprises i and j to be closer the larger the transaction weight $w_{ij}$ is;

Minimizing objective (1) yields the optimized corporate transaction network representation h on that day;

Step 202: Dynamically integrate historical information

Establish a 30-day time-sequence window, fuse the 30 daily static network representations inside the window each time, and then move the window so that all static network representations are gradually fused into the dynamic corporate transaction network representation. The corresponding optimization objective is formula (2).

In formula (2), the per-day symbols denote the representations of enterprises p and q on day t and the weight of the transaction between them; the distance term measures the similarity of the representations of enterprise p and enterprise q; $H_i$ denotes the network representation of the i-th day in the time-sequence window; and the penalty term keeps the matrix learned by the representation as close as possible to the matrix of the original enterprise transaction network. ρ is a parameter that balances the contribution of the structural characteristics of the model against the contribution of the original matrix: the larger ρ is, the more the model emphasizes the time-series network representation; the smaller ρ is, the more it emphasizes the node representation.

Minimizing formula (2) yields the optimized dynamic corporate transaction network representation H.

The implementation method of step 3 is as follows:

Step 301: Decompose the objective function

Rewrite optimization function (2) in a decomposable form, formula (3).

In formula (3), the distance term measures the similarity of the representations of enterprise p and enterprise q, and the penalty term, which in formula (2) approximates the matrix of the original enterprise transaction network, is split so that it is calculated for each individual enterprise;

Minimizing formula (3) yields the optimized dynamic corporate transaction network representation H;

Step 302, execute multiple sub-functions in parallel

Decompose formula (3) into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network, and solve them in parallel to obtain $H_t^{k+1}$. The single sub-optimization function is formula (4).

The symbols in formula (4) denote, in order: an enterprise related to enterprise v; the representation of enterprise v on day t; the representation of enterprise v after k iterations on day t; the weight of the transaction between enterprises v and q on day t; the similarity of the representations of enterprises v and q after iteration (k-1) on day t; and the similarity of the representations of enterprise v on day i and on day t.

The quantity being solved for is the representation of enterprise v on day t. An iterative optimization method is used, checking whether the result meets the required accuracy: the sub-function is solved by gradient descent, and updating stops when either convergence condition is reached, at which point the optimization function is considered to have attained its optimal value. The two conditions are that the results of the k-th and (k-1)-th iterations for an enterprise agree to the required accuracy, or that the iterative result of an enterprise is close enough to that of its affiliated enterprises. The representation obtained at the k-th iteration is then taken as the representation of the enterprise on that day;

Step 303, comprehensively sort the parallel results

Compute the N nodes of the transaction network in parallel to obtain the representation of each enterprise on day t; then, for the dynamic transaction network distributed over time nodes 1 to T, compute the representation of the network at each time node in order.

The implementation method of step 4 is as follows:

Step 401: Combine the basic features obtained in step 1 and the dynamic network features obtained in step 3 as the learning data of the classifier;

Step 402: Construct a two-classification model based on LightGBM, and set the main parameters of the classifier as follows: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;

Step 403: Take the representation results obtained from the sample set of enterprises marked as issuing false invoices and the sample set of normal enterprises as the basic features, and randomly divide them into a training set and a test set at a ratio of 3:1; then randomly divide the training set again to hold out a validation set. Use the training set to train the classification model of step 402 and use the validation set to tune the training; if over-fitting occurs, perform pruning. Select the optimal model and verify the accuracy of the algorithm on the test set;

Step 404: Input the characterization result of the unmarked enterprise sample into the LightGBM-based prediction model of the suspected false invoice issuance enterprise, and finally, based on the output of the prediction model, determine whether the target company has false invoice issuance behavior.
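The sample partitioning of step 403 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_samples`, the fixed random seed, and the choice to remove the validation samples from the training set are assumptions; the 3:1 ratio comes from step 403, and the 10% validation share is the value given in the embodiment.

```python
import random

def split_samples(samples, test_ratio=0.25, val_ratio=0.10, seed=0):
    """Shuffle labeled samples, then split them into train, validation and test sets.

    test_ratio=0.25 corresponds to a 3:1 train/test split.
    val_ratio is the share of the training set held out for validation.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_ratio)
    test, train_all = shuffled[:n_test], shuffled[n_test:]

    n_val = round(len(train_all) * val_ratio)
    val, train = train_all[:n_val], train_all[n_val:]
    return train, val, test

# Example with 400 hypothetical sample identifiers.
train, val, test = split_samples(range(400))
```

The split here is purely random; when the share of false-invoice enterprises is small, a stratified split that preserves the class ratio in each subset may be preferable.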

Compared with the prior art, the present invention has the following beneficial effects:

The present invention is a method for identifying enterprises suspected of issuing false invoices based on the idea of dynamic network representation learning, and has the following advantages:

1. By using dynamic network representation and combining historical information, representation vectors are learned for all time nodes and fused, so the dynamic changes of the enterprise transaction network can be captured accurately and the accuracy of false-invoice identification is improved;

2. Based on the related information between enterprises, it can identify different types of false invoices;

3. With reference to the distributed optimization algorithm, the calculation function is decomposed into independent sub-functions for parallel execution, which reduces the time complexity of computing network representation and improves the efficiency of identifying false invoices.

【Brief Description of the Drawings】

Figure 1 is the overall framework flow chart;

Figure 2 is a schematic diagram of the basic feature extraction process;

Figure 3 is a schematic diagram of a feature extraction process based on dynamic network representation;

Figure 4 is a schematic diagram of the optimization process of the network characterization algorithm;

Figure 5 is a schematic diagram of the process of constructing a classifier to identify false invoices;

Figure 6 is a schematic diagram of a system for identifying false invoice issuance based on dynamic network representation according to an embodiment of the present invention.

【Detailed Description】

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them, and are not intended to limit the scope of the present invention. In addition, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concepts disclosed in the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

The present invention will be further described in detail below in conjunction with the accompanying drawings:

Referring to Figure 1, the method for identifying false invoices based on dynamic network representation includes the following steps:

S101. Basic feature extraction

After preprocessing the data, extract the basic information of the company. The basic information of the company falls roughly into three types: text data is converted into vectors by the word2vec algorithm, categorical data is encoded with One-Hot, and numerical data is standardized.

As shown in Figure 2, the basic feature extraction implementation process specifically includes the following steps:

S201. Data preprocessing

Step 1: Extract the "Taxpayer Electronic File Number" as the unique identifier of the company's characteristics, and delete all other attributes that cannot describe the company's own distribution rules;

Step 2: When an attribute contains a large number of missing values and only a few valid values, for example "taxpayer tax agency code", "financial report type", and "accounting form", for which fewer than 10% of enterprises have a value, the attribute is deleted directly; when an attribute has only a small number of missing values, for example "employees" and "registered capital", which are missing for individual companies, the missing values are filled in by same-class mean imputation.
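The two missing-value rules above can be sketched as follows. This is an illustrative sketch under stated assumptions: "same-class mean" is read here as the mean over enterprises that share a grouping attribute, and the grouping attribute `enterprise_type`, the function name `clean_attributes`, and the sample rows are hypothetical; only the 10% threshold comes from the text.

```python
from statistics import mean

def clean_attributes(rows, group_key, min_valid_ratio=0.10):
    """Drop sparse numeric attributes, then fill remaining gaps with the same-class mean.

    rows: list of dicts; a missing value is represented by None.
    group_key: attribute that defines the "class" of an enterprise (assumed).
    """
    attrs = [a for a in rows[0] if a != group_key]

    # Rule 1: keep an attribute only if enough enterprises have a value for it.
    kept = []
    for attr in attrs:
        valid = sum(1 for r in rows if r[attr] is not None)
        if valid > 0 and valid / len(rows) >= min_valid_ratio:
            kept.append(attr)

    cleaned = [{group_key: r[group_key], **{a: r[a] for a in kept}} for r in rows]

    # Rule 2: fill each remaining gap with the mean of the same class,
    # falling back to the overall mean if the class has no observed value.
    for attr in kept:
        by_class = {}
        for r in cleaned:
            if r[attr] is not None:
                by_class.setdefault(r[group_key], []).append(r[attr])
        overall = mean(v for vals in by_class.values() for v in vals)
        for r in cleaned:
            if r[attr] is None:
                vals = by_class.get(r[group_key])
                r[attr] = mean(vals) if vals else overall
    return cleaned

rows = [
    {"enterprise_type": "A", "employees": 10, "accounting_form": None},
    {"enterprise_type": "A", "employees": None, "accounting_form": None},
    {"enterprise_type": "A", "employees": 30, "accounting_form": None},
    {"enterprise_type": "B", "employees": 100, "accounting_form": None},
]
cleaned = clean_attributes(rows, group_key="enterprise_type")
```

In this example "accounting_form" has no valid value and is dropped, while the missing "employees" value is filled with the mean of the other class-A enterprises.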

S202. Processing text data

Perform data preprocessing and feature extraction on the text data "cargo information" and "business scope" in the basic information table of the enterprise. The specific steps of text data processing include:

Step 1: Use the Jieba word segmentation tool for word segmentation, construct a suitable stop-word list, and remove the stop words from the text. For example, the content of the "business scope" field of an enterprise in this embodiment is "production, sales: ceramics and products; goods import and export, technology import and export". After word segmentation and removal of stop words, the result is "production, sales, ceramics and products, goods import and export, technology import and export";

Step 2: Use the dictionary tree to count the results of step 1, and select words with larger weights as keywords;

Step 3: Convert the N types of keywords extracted in step 2 into vectors based on word2vec.
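The keyword counting of step 2 can be sketched with a small dictionary tree (trie). This is a hedged illustration: the text does not define the word "weight", so occurrence count is assumed here, and the class and function names are hypothetical. The tokens are assumed to be the already segmented, stop-word-free output of step 1; the word2vec conversion of step 3 is not shown.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how many inserted words end at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def counts(self):
        """Return {word: count} for every word stored in the trie."""
        out = {}
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                out[prefix] = node.count
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return out

def top_keywords(tokens, n):
    """Select the n most frequent tokens (ties broken alphabetically)."""
    trie = Trie()
    for tok in tokens:
        trie.insert(tok)
    ranked = sorted(trie.counts().items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

tokens = ["production", "sales", "ceramics", "sales", "production", "sales"]
keywords = top_keywords(tokens, 2)
```

A trie shares storage between words with common prefixes, which is useful when the vocabulary of "cargo information" and "business scope" fields contains many compound terms.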

S203. Processing category data

One-Hot coding is used for the discrete categorical data "enterprise type" and "enterprise status" in the enterprise basic information table. The number of possible values of the attribute is the length of the status bits; exactly one bit is set to 1 and all others to 0 to indicate a specific state. For example, in this embodiment, the "enterprise type" field has four possible values: "sole proprietorship", "partnership", "limited liability company", and "joint-stock limited company". Therefore, the length of the status bits of "enterprise type" is 4, where 1000 means "sole proprietorship", 0100 means "partnership", 0010 means "limited liability company", and 0001 means "joint-stock limited company".
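The status-bit encoding described above can be sketched in a few lines. The function name `one_hot` is hypothetical, and the name of the fourth category in the list is an assumption made for illustration; the bit layout follows the example in the text.

```python
def one_hot(value, categories):
    """Return a status-bit list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]

# Four possible values of "enterprise type" (fourth name assumed).
ENTERPRISE_TYPES = [
    "sole proprietorship",
    "partnership",
    "limited liability company",
    "joint-stock limited company",
]

sole = one_hot("sole proprietorship", ENTERPRISE_TYPES)   # bits 1000
partnership = one_hot("partnership", ENTERPRISE_TYPES)    # bits 0100
```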

S204. Processing numerical data

Standardize the numerical data "registered capital", "total investment" and "number of employees" in the basic information table of the enterprise. This embodiment uses "registered capital" as an example to illustrate:

Step 1: Obtain the mean of the "registered capital" attribute

Let u be the mean of the "registered capital" attribute, calculated as:

$$u = \frac{1}{n}\sum_{j=1}^{n} x^{j}$$

where n is the number of enterprise basic-information samples and $x^{j}$ is the value of the j-th "registered capital" attribute;

Step 2: Obtain the variance of the attribute

Let $\sigma^{2}$ be the variance of the "registered capital" attribute, calculated as:

$$\sigma^{2} = \frac{1}{n}\sum_{j=1}^{n}\left(x^{j}-u\right)^{2}$$

The mean and variance are the basic indicators of a numerical attribute, and a numerical attribute can be standardized by means of them;

Step 3: Z-Score standardization

Let $\delta = (\delta^{1}, \delta^{2}, \ldots, \delta^{n})$ be the standardized values of "registered capital", where $\delta^{j}$ is the standardized value of the j-th "registered capital", calculated as:

$$\delta^{j} = \frac{x^{j}-u}{\sigma},\qquad j = 1, 2, \ldots, n$$
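The three standardization steps can be sketched as follows. This is a minimal illustration: the function name `z_score` and the sample values are hypothetical, and the variance is assumed to be the population variance (divided by n rather than n-1).

```python
from math import sqrt

def z_score(values):
    """Standardize a numeric attribute: subtract the mean, divide by the standard deviation."""
    n = len(values)
    u = sum(values) / n                               # step 1: mean
    variance = sum((x - u) ** 2 for x in values) / n  # step 2: population variance (assumed)
    sigma = sqrt(variance)
    if sigma == 0:
        raise ValueError("attribute is constant; Z-Score is undefined")
    return [(x - u) / sigma for x in values]          # step 3: Z-Score

# Hypothetical "registered capital" values.
standardized = z_score([100.0, 200.0, 300.0])
```

After standardization the attribute has mean 0 and unit variance, so attributes measured on very different scales, such as "registered capital" and "number of employees", contribute comparably to later distance computations.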

S102. Feature extraction based on dynamic network representation

First, establish a static enterprise transaction network with the enterprise as the node, the transaction record as the edge, and each day as a time node; then establish a time-sequence window of 30 days, fuse the 30 daily static network representations inside the window, and gradually fuse the static network representations of all time nodes by moving the window, so that the objective function of the network representation is optimized and the optimal dynamic corporate transaction network representation is obtained.

As shown in Figure 3, the specific steps of the implementation process of feature extraction based on dynamic network representation include:

Step 1: Establish a static corporate transaction network

Establish a representation model of the corporate transaction network for every day. The objective optimization function is:

$$\min_{h} \sum_{i,j} w_{ij}\,\lVert h_i - h_j \rVert^{2} \qquad (1)$$

Minimizing objective (1) yields the representation h of each enterprise on that day, such that enterprises with similar transaction structures or larger transaction weights are closer in the representation space; the representation of the entire enterprise transaction network on that day is then obtained.
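The weighted-distance objective of the static network can be evaluated directly, which makes its effect easy to see. The sketch below is an illustration only: the function name, the dictionary-based data layout, and the toy vectors are assumptions, and only the evaluation of the weighted squared-distance sum is shown, not its optimization.

```python
def static_objective(h, w):
    """Sum of w_ij * ||h_i - h_j||^2 over all weighted enterprise pairs.

    h: {enterprise: representation vector (list of floats)}
    w: {(i, j): transaction weight}
    """
    total = 0.0
    for (i, j), weight in w.items():
        dist_sq = sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
        total += weight * dist_sq
    return total

# Enterprise A trades heavily with B and lightly with C.
w = {("A", "B"): 5.0, ("A", "C"): 1.0}

# Heavily trading pair placed close together.
h_close = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [1.0, 0.0]}
# Heavily trading pair placed far apart.
h_far = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.1, 0.0]}

value_close = static_objective(h_close, w)
value_far = static_objective(h_far, w)
```

The first layout gives the smaller objective value, which is the sense in which minimization pulls enterprises with larger transaction weights closer together in the representation space.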

Step 2: Dynamically integrate historical information

Gradually fuse all static corporate transaction network representations within the time-sequence window to obtain the dynamic corporate transaction network representation. The optimization objective is formula (2).

The length of the time-sequence window is 30 days. The 30 daily static network representations inside the window are fused each time, and the window is then moved so that all static network representations are gradually fused. Minimizing formula (2) yields the representation H of each enterprise on that day. In this embodiment, the effect is found to be best when ρ=0.75; at this value, the time-series network representation and the node representation are more evenly balanced.
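The movement of the 30-day time-sequence window over the daily time nodes can be sketched as follows. This is an assumption-laden illustration: the text does not state how far the window moves at each step, so a step of one day is assumed, and the function name and the 1-based inclusive day indices are hypothetical.

```python
def sliding_windows(num_days, window=30, step=1):
    """Return (first_day, last_day) pairs, 1-based and inclusive, covering days 1..num_days."""
    if num_days <= 0 or window <= 0 or step <= 0:
        raise ValueError("num_days, window and step must be positive")
    if num_days <= window:
        # Not enough history for a full window: use a single shorter window.
        return [(1, num_days)]
    return [(start, start + window - 1)
            for start in range(1, num_days - window + 2, step)]

# 32 daily static networks produce three overlapping 30-day windows.
windows = sliding_windows(32)
```

Each returned pair identifies the daily static representations that would be fused together; consecutive windows overlap in all but one day, which is what lets historical information carry forward as the window moves.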

S103. Optimization based on a distributed algorithm

First decompose the objective function; then execute multiple sub-functions in parallel; finally, synthesize the parallel results.

As shown in Figure 4, the specific steps of the distributed algorithm optimization implementation process include:

S401. Decompose the objective function

Rewrite optimization function (2) in a decomposable form, formula (3).

In this embodiment, the enterprise transaction network involves a total of 3765 companies, so N=3765, and v runs from 1 to 3765, one calculation for each company and its associated transaction network; ρ=0.75 is used so that the time-series network representation and the node representation are weighted in a more balanced manner;

S402. Parallel execution of multiple sub-functions

Formula (3) is decomposed into 3765 sub-optimization functions, one for each enterprise v, which are solved in parallel and finally merged to obtain $H_t^{k+1}$; the single sub-objective optimization function is formula (4).

In this embodiment, ρ=0.75 balances attention between the time-series network representation and the node representation. The sub-functions are evaluated in order to obtain the result of each; solving each sub-function gives the representation of the corresponding enterprise after the k-th iteration on day t, and together these form the representation of the dynamic enterprise transaction network after the k-th iteration on day t;

S403. Organize the parallel results comprehensively

The gradient descent algorithm is used to solve equation (4). In this embodiment, updating stops when either of the two convergence conditions is met; at that point successive representations are approximately equal, and the current representation is taken as the representation of the corporate transaction network on that day. Therefore, for the dynamic transaction network distributed over days 1 to T, the representation of the network is obtained by computing each day in order.
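The decompose, solve-in-parallel, and merge pattern of S401 to S403 can be sketched as follows. Because formula (4) is not reproduced in this text, the per-enterprise sub-objective used below is a hypothetical stand-in, not the patented formula: a weighted distance to trading partners plus a ρ-weighted distance to an assumed anchor vector. The function names, learning rate, tolerance, and the use of a thread pool are likewise assumptions. What the sketch does follow from the text is that each enterprise is updated independently from the previous iteration's values of its partners, and that updating stops once successive iterations agree to the required accuracy.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def update_node(v, h_prev, anchor, neighbors, rho, lr):
    """One gradient step on the stand-in sub-objective of enterprise v:
    sum_q w_vq * ||h_v - h_q||^2 + rho * ||h_v - anchor_v||^2,
    with partners q held at their previous-iteration values."""
    hv = h_prev[v]
    grad = [2.0 * rho * (a - b) for a, b in zip(hv, anchor[v])]
    for q, w_vq in neighbors.get(v, {}).items():
        hq = h_prev[q]
        for d in range(len(hv)):
            grad[d] += 2.0 * w_vq * (hv[d] - hq[d])
    return v, [x - lr * g for x, g in zip(hv, grad)]

def solve_day(anchor, neighbors, rho=0.75, lr=0.05, tol=1e-6,
              max_iter=1000, workers=4):
    """Iterate parallel per-enterprise updates until the largest change is below tol."""
    h = {v: list(vec) for v, vec in anchor.items()}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for k in range(1, max_iter + 1):
            step = partial(update_node, h_prev=h, anchor=anchor,
                           neighbors=neighbors, rho=rho, lr=lr)
            h_next = dict(pool.map(step, list(h)))   # N independent sub-problems
            change = max(abs(a - b)
                         for v in h for a, b in zip(h_next[v], h[v]))
            h = h_next                               # merge the parallel results
            if change < tol:                         # iterations k and k-1 agree
                break
    return h, k

# Toy network: A and B trade with weight 1.0; C has no partners.
anchor = {"A": [0.0], "B": [1.0], "C": [5.0]}
neighbors = {"A": {"B": 1.0}, "B": {"A": 1.0}}
h_day, iterations = solve_day(anchor, neighbors)
```

In the toy network, A and B are pulled toward each other while C, which has no transactions, stays at its anchor. A thread pool is used only to show that the sub-problems are independent; on a real workload the same structure would map onto separate processes or machines.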

S104. Build a classifier to identify false invoices

First, combine the basic features obtained in S101 and the dynamic network features obtained in S102 as the learning data of the classifier; second, build a binary classification model based on the LightGBM classifier; then train the model with the enterprise sample set that has been labeled for false invoices; finally, feed the representation results of the enterprise samples to be predicted into the trained model, and determine from the output of the prediction model whether the target company has issued false invoices.

As shown in Figure 5, the specific steps of the implementation process of constructing a classifier to identify false invoices include:

S501. Obtain the learning data of the classifier

Combine the basic features obtained in S101 and the dynamic network features obtained in S103 as the learning data of the classifier. In this embodiment, the basic feature vector of the enterprise obtained in S101 is directly placed after the dynamic network feature vector obtained in S103, and then combined into a new vector as the learning data of the classifier

S502. Building a two-class model based on LightGBM

The main parameters for setting the classifier are: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;
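The three settings above can be written down as a LightGBM parameter dictionary. This is a configuration sketch only: the keys are LightGBM's documented parameter names, but mapping "number of iterations" to `num_iterations` and choosing the `binary` objective for the binary classification model are assumptions about how the embodiment was configured.

```python
# Hedged parameter sketch for a LightGBM binary classifier.
lightgbm_params = {
    "objective": "binary",    # binary classification (assumed objective)
    "num_leaves": 13,         # "the number of leaves is 13"
    "learning_rate": 0.1,     # "the learning rate is 0.1"
    "num_iterations": 100,    # "the number of iterations is 100" (assumed mapping)
}
```

A small `num_leaves` keeps each tree shallow, which limits over-fitting on a labeled sample set of a few thousand enterprises.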

S503. Training model

Step 1: Take the characterization results obtained by the sample set of enterprises marked as false invoices and the sample set of normal enterprises as basic features, and randomly divide them into two groups as the training set and the test set at a ratio of 3:1.

Step 2: Randomly select 10% of the data in the training set as the validation set.

Step 3: Use the training set to train the classification model built by S502, use the validation set to adjust the training, and perform pruning when over-fitting occurs;

Step 4: Iterative calculation. Since the number of iterations is set to 100, if the convergence condition is not reached for 100 iterations, the iteration is forced to stop, and the result of the last iteration is the calculated representation.

Step 5: Select the optimal model and verify the accuracy of the algorithm on the test set. In this embodiment, the verified accuracy is 0.957, the precision is 0.921, and the recall is 0.87, indicating that the model performs very well on the test set and can meet the requirements for identifying false invoices in actual tax scenarios. Other methods for identifying false invoices based on static network representation achieve an accuracy of 0.876, a precision of 0.856, and a recall of 0.794; relative to them, the method of the present invention improves accuracy by 9.25%, precision by 7.6%, and recall by 9.57%. The improvement is reflected not only in accuracy but also in the efficiency gained from distributed parallel operation: for the data sample in this embodiment, the running time of the distributed algorithm is 684.57 s, which is 28.56% less than the 958.19 s of the non-distributed algorithm.
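The reported percentages are relative changes with respect to the baseline, which the following short calculation reproduces from the figures in Step 5. The variable names are illustrative, and reading the second figure of each triple as precision is an interpretation of the text.

```python
def relative_change(new, old):
    """Change of `new` relative to the baseline `old`, as a fraction."""
    return (new - old) / old

dynamic = {"accuracy": 0.957, "precision": 0.921, "recall": 0.870}
static = {"accuracy": 0.876, "precision": 0.856, "recall": 0.794}

# Relative improvement of the dynamic method over the static baseline, in percent.
improvement = {
    metric: round(100 * relative_change(dynamic[metric], static[metric]), 2)
    for metric in dynamic
}

# Relative reduction in running time (seconds), in percent.
runtime_reduction = round(100 * (958.19 - 684.57) / 958.19, 2)
```

The computed values agree with the stated 9.25%, roughly 7.6%, and 9.57% improvements and with the 28.56% reduction in running time.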

S504. Predict enterprises suspected of issuing false invoices

Input the representation results of the unlabeled enterprise samples into the trained prediction model for enterprises suspected of issuing false invoices, and determine from the output of the prediction model whether the target enterprise has issued false invoices. In this embodiment, the predicted values are sorted from high to low, and the top ten percent are taken as enterprises suspected of issuing false invoices.
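The ranking rule of this embodiment can be sketched as follows. The function name, the enterprise identifiers, and the scores are hypothetical, and the handling of counts that are not a multiple of ten (rounding down, but flagging at least one enterprise) is an assumption the text does not settle.

```python
def top_percent(scores, percent=10):
    """Return the identifiers of the highest-scoring `percent` of enterprises.

    scores: {enterprise_id: predicted value}; higher means more suspicious.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    k = max(1, (len(ranked) * percent) // 100)  # integer arithmetic avoids float rounding
    return [enterprise_id for enterprise_id, _ in ranked[:k]]

# Twenty hypothetical enterprises with increasing predicted values.
scores = {f"e{i:02d}": i / 20 for i in range(20)}
suspected = top_percent(scores)
```

With twenty enterprises, the two with the highest predicted values are flagged for follow-up.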

In another embodiment of the present invention, a system for identifying false invoices based on dynamic network representation is provided, and the system includes:

The enterprise attribute feature extraction module is used to preprocess the data and then extract the basic information of the enterprise. The basic information falls roughly into three types: text data, which is converted into vectors by the word2vec algorithm; categorical data, which is encoded with One-Hot; and numerical data, which is standardized;

The dynamic network characterization building module is used to process the attribute features of the enterprise to obtain a static transaction network characterization with each day as a time node, then establish a 30-day timing window, fuse the static network characterizations within the window through a regularization term, and slide the window along the time series to gradually fuse all static network characterizations into the dynamic network characterization;

The dynamic network characterization parallel optimization module is used to decompose the objective of the enterprise dynamic network characterization into independent sub-objectives and optimize the sub-objectives in parallel, which improves the efficiency of dynamic network characterization and yields the final characterization result more efficiently;

The false invoice issuance recognition module is used to take the obtained enterprise dynamic network characterization as the features of false invoice issuance behavior and input it into the binary classifier based on LightGBM; the labeled enterprise sample set is used to train the recognition model, and the characterization results of the enterprise sample set to be predicted are then input into the trained model to obtain the enterprises suspected of issuing false invoices.

The above content only illustrates the technical ideas of the present invention and cannot be used to limit its scope of protection. Any change made to the technical solutions on the basis of the technical ideas proposed by the present invention falls within the scope of protection of the claims of the present invention.

Claims

A method for identifying false invoice issuance based on dynamic network representation, characterized in that: first, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges; second, a representation of the enterprise transaction network is established with each day as a time node, a 30-day timing window is established, the 30 days of static network representations within the timing window are fused each time, and the static network representations of all time nodes are gradually fused by moving the timing window to obtain the final dynamic network representation result; third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, and the sub-functions are optimized in parallel to improve the learning efficiency of the model; finally, a binary classifier is constructed based on LightGBM to identify enterprises suspected of issuing false invoices.
The method for recognizing false invoice issuance based on dynamic network characterization according to claim 1, wherein the method specifically includes the following implementation steps:

Step 1, basic feature extraction

First, the data is preprocessed, and then the basic information of the enterprise is extracted. The basic information falls roughly into three types: text data is converted into vectors by the word2vec algorithm, categorical data is encoded with One-Hot, and numerical data is standardized;

Step 2. Feature extraction based on dynamic network representation

After the basic features of the enterprise are extracted, the enterprise transaction information is organized into a static network with each day as a time node, in which enterprises are nodes, the basic information of an enterprise is the node attribute, transaction records are edges, and transaction information is the edge attribute; a timing window is then established in units of 30 days, the 30 days of static network representations within the window are fused each time, and the static network representations of all time nodes are gradually fused by moving the timing window, so as to optimize the objective function of the network representation and finally obtain the optimal dynamic enterprise transaction network characterization;

Step 3. Optimization based on a distributed algorithm

In order to improve the learning efficiency of the dynamic network representation, distributed optimization algorithms are drawn on to decompose the objective function of the dynamic enterprise transaction network representation into independent sub-functions, and the sub-functions are optimized in parallel to accelerate the solution of large-scale, complex enterprise transaction network representations;

Step 4. Build a classifier to identify false invoices

Construct a binary classification model based on the LightGBM classifier, use the calculated dynamic network representation as the learning data of the classifier, train the model with the labeled enterprise sample set, then input the characterization results of the enterprise sample set to be predicted into the trained model for prediction, and finally determine from the output of the prediction model whether the target enterprise has issued false invoices.
The method for recognizing false invoice issuance based on dynamic network characterization according to claim 2, wherein the method for implementing step 1 is as follows:

Step 101, data preprocessing

(1) Extract the "Taxpayer Electronic File Number" as the unique identification of corporate characteristics;

(2) Handle missing values: attributes with severe missing data and attributes unrelated to the false invoice identification task are deleted directly, while important attributes with a small number of missing values are filled in by same-class mean interpolation;

Step 102, processing text data

The processing of text information in the enterprise basic information table includes:

(1) Use the Jieba word segmentation tool to segment the text data of the enterprise;

(2) Use a dictionary tree to count the word segmentation results, and select the words with larger weights as keywords;

(3) Convert the extracted N types of keywords into vectors based on word2vec;

Step 103, processing categorical data

Apply One-Hot encoding to the discrete categorical data in the enterprise basic information table: establish status bits whose number equals the number of attribute values, with each bit marking one specific state;

Step 104, processing numerical data

The numerical data in the enterprise basic information table is processed by the conventional standardization method:

(1) Find the mean value of each attribute;

(2) Find the variance of each attribute;

(3) Z-Score standardization.
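By way of illustration only, steps 103 and 104 can be sketched in Python as follows. The category names and numeric values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variance estimator is used.

```python
import statistics

def one_hot(value, categories):
    """One status bit per possible attribute value; exactly one bit is set."""
    return [1 if value == c else 0 for c in categories]

def z_score(values):
    """Standardize a numeric attribute: subtract the mean, divide by the std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return [0.0 for _ in values]  # a constant attribute carries no signal
    return [(v - mean) / std for v in values]

industries = ["retail", "manufacturing", "services"]  # hypothetical categories
encoded = one_hot("manufacturing", industries)
standardized = z_score([10.0, 20.0, 30.0])

print(encoded)
print(standardized)
```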
The method for identifying false invoice issuance based on dynamic network characterization according to claim 3, wherein the method for implementing step 2 is as follows:

Step 201: Establish a static corporate transaction network

A representation model of the corporate transaction network is established every day, so that companies with similar topological structures or higher transaction weights are closer in the representation space. The objective optimization function is:

where $h_i$ and $h_j$ are the representations of enterprises i and j, and $w_{ij}$ is the transaction weight between them; minimizing $w_{ij}\|h_i - h_j\|^2$ forces the representations of enterprises i and j closer together the larger the corresponding transaction weight $w_{ij}$ is;

Minimizing the objective yields the optimized enterprise transaction network representation h for that day;
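As an illustrative sketch, the pairwise term described in step 201 can be evaluated in Python as below. Summing the term over all weighted edges is an assumption, and any further terms or constraints of the objective function are not reproduced here; the enterprise names, weights, and vectors are hypothetical.

```python
def static_objective(h, weights):
    """Sum of w_ij * ||h_i - h_j||^2 over the weighted transaction edges."""
    total = 0.0
    for (i, j), w in weights.items():
        sq_dist = sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
        total += w * sq_dist
    return total

# Hypothetical network: A and B trade heavily, B and C only lightly.
weights = {("A", "B"): 5.0, ("B", "C"): 1.0}
far = {"A": [0.0, 0.0], "B": [2.0, 0.0], "C": [2.0, 1.0]}
near = {"A": [0.0, 0.0], "B": [0.5, 0.0], "C": [2.0, 1.0]}

print(static_objective(far, weights), static_objective(near, weights))
```

Moving B towards its heavy trading partner A lowers the value, which is the behavior the term is meant to reward.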

Step 202: Dynamically integrate historical information

Establish a 30-day timing window, merge 30-day static network representations within the window each time, and then move the timing window to gradually merge all static network representations, and finally obtain dynamic corporate transaction network representations. The corresponding optimization goals are:

where $h_p^t$, $h_q^t$ and $w_{pq}^t$ respectively denote the representations of enterprises p and q on day t and the weight of the transactions between them; $\|h_p^t - h_q^t\|^2$ measures the similarity of the representations of enterprise p and enterprise q; $H_i$ denotes the network representation of the i-th day in the timing window; the penalty term makes the matrix learned by the representation as close as possible to the matrix of the original enterprise transaction network. ρ is a parameter that balances the contribution of the structural characteristics of the model against that of closeness to the original matrix: the larger ρ is, the more the model attends to the time-series network representation; the smaller ρ is, the more it attends to the node representation;

Minimizing the objective yields the optimized dynamic enterprise transaction network characterization H.
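The movement of the 30-day timing window in step 202 can be sketched in Python as follows. The stride of one day and the use of full windows only are assumptions; the fusion of the representations inside each window is not shown.

```python
def sliding_windows(num_days, window=30):
    """Yield (start, end) day indices, end exclusive, for each window position."""
    for end in range(window, num_days + 1):
        yield end - window, end

# With 33 daily static networks, the window takes four positions.
windows = list(sliding_windows(num_days=33))
print(windows)
```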
The method for recognizing false invoice issuance based on dynamic network characterization according to claim 4, wherein the method for implementing step 3 is as follows:

Step 301: Decompose the objective function

Refactor the optimization function (2) and write it in a decomposable form:

where $h_p^t$, $h_q^t$ and $w_{pq}^t$ respectively denote the representations of enterprises p and q on day t and the weight of the transactions between them; $\|h_p^t - h_q^t\|^2$ measures the similarity of the representations of enterprise p and enterprise q; the penalty term, as in formula (2), keeps the learned matrix close to the matrix of the original enterprise transaction network, and is split so that the calculation is carried out for each individual enterprise;

Minimizing the objective yields the optimized dynamic enterprise transaction network characterization H;

Step 302, execute multiple sub-functions in parallel

Decompose formula (3) into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network, and solve them in parallel to obtain

where q denotes an enterprise related to enterprise v; $h_v^t$ denotes the representation of enterprise v on day t; $h_v^{t,(k)}$ denotes the representation of enterprise v after k iterations on day t; $w_{vq}^t$ denotes the weight of the transactions between enterprises v and q on day t; $\|h_v^{t,(k-1)} - h_q^{t,(k-1)}\|^2$ measures the similarity of the representations of enterprise v and enterprise q after iteration (k-1) on day t; and $\|h_v^t - h_v^i\|^2$ measures the similarity of the representations of enterprise v on the i-th day and the t-th day;

where $h_v^t$ is the characterization of enterprise v on day t to be solved for. An iterative optimization method is used, checking whether the calculation result meets the required accuracy: the sub-function is solved by the gradient descent algorithm, and the optimization function is considered to have reached its optimal value when a convergence condition is met, namely when the results of the k-th and (k-1)-th iterations for an enterprise agree to the required accuracy, or when the iterative result of an enterprise is sufficiently close to those of its related enterprises; updating then stops, and the result of the k-th iteration is taken as the characterization of the enterprise on that day;

Step 303, consolidate the parallel results

Calculate the N nodes of the transaction network in parallel to obtain the characterization of each enterprise on day t; then, for the dynamic transaction network distributed over time nodes 1 to T, calculate the characterization of the network at each time node in order.
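The per-enterprise decomposition and iteration of steps 301 to 303 can be sketched in Python as follows. This is an illustration under stated assumptions, not the formula of the claims: the sub-function f_v(x) = sum_q w_vq ||x - h_q||^2 + rho ||x - anchor_v||^2, the `anchor` (standing in for the earlier representation that the penalty term refers to), the values of `rho` and `lr`, and the use of a thread pool are all hypothetical. Only the first stopping condition (change between successive iterations) is implemented, together with the cap of 100 iterations.

```python
from concurrent.futures import ThreadPoolExecutor

def update_node(v, h, neighbors, anchor, rho, lr):
    """One gradient step on node v's sub-function, with its neighbors held fixed."""
    x = h[v]
    # Gradient of the assumed penalty term rho * ||x - anchor_v||^2.
    grad = [2.0 * rho * (xi - ai) for xi, ai in zip(x, anchor[v])]
    # Gradient of the pairwise terms w_vq * ||x - h_q||^2.
    for q, w in neighbors[v].items():
        for d, (xi, qi) in enumerate(zip(x, h[q])):
            grad[d] += 2.0 * w * (xi - qi)
    return v, [xi - lr * g for xi, g in zip(x, grad)]

def optimize_day(h, neighbors, anchor, rho=1.0, lr=0.1, tol=1e-6, max_iter=100):
    """Update all nodes in parallel until the change is below tol or max_iter is hit."""
    with ThreadPoolExecutor() as pool:
        for k in range(1, max_iter + 1):
            results = pool.map(
                lambda v: update_node(v, h, neighbors, anchor, rho, lr), h)
            new_h = dict(results)
            change = max(abs(a - b) for v in h for a, b in zip(new_h[v], h[v]))
            h = new_h
            if change < tol:
                break
    return h, k

# Hypothetical three-enterprise network for a single day t.
h0 = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [4.0, 0.0]}
neighbors = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 0.5}, "C": {"B": 0.5}}
h_day, iterations = optimize_day(h0, neighbors, anchor=h0)
print(iterations, h_day)
```

Each node reads only the previous iteration's values of its neighbors, so the N updates within one iteration are independent of one another. Threads are used here only to show the structure; processing the time nodes 1 to T in order would call `optimize_day` once per day.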
The method for identifying false invoice issuance based on dynamic network characterization according to claim 5, wherein the method for implementing step 4 is as follows:

Step 401: Combine the basic features obtained in step 1 and the dynamic network features obtained in step 3 as the learning data of the classifier;

Step 402: Construct a binary classification model based on LightGBM, and set the main parameters of the classifier as follows: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;

Step 403: Take the characterization results obtained from the sample set of enterprises labeled as issuing false invoices and the sample set of normal enterprises as basic features, and randomly divide them into a training set and a test set at a ratio of 3:1; then randomly divide the training set into two groups, a training set and a validation set. Use the training set to train the classification model of step 402 and use the validation set to tune the training, performing pruning if over-fitting occurs; select the optimal model and verify the accuracy of the algorithm on the test set;

Step 404: Input the characterization results of the unlabeled enterprise samples into the LightGBM-based model for predicting enterprises suspected of issuing false invoices, and finally determine from the output of the prediction model whether the target enterprise exhibits false invoice issuance behavior.
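The parameter settings of step 402 and the data split of step 403 can be sketched in Python as follows. The dictionary uses the keyword names of `lightgbm.LGBMClassifier`, with the "number of iterations" mapped to `n_estimators`; the library itself is not imported here. The sample data are placeholders, and the ratio of the second split (training set to validation set) is an assumption, since only the 3:1 training-to-test ratio is stated.

```python
import random

# Parameters stated in step 402, written with lightgbm.LGBMClassifier keyword
# names; the dict could be passed as LGBMClassifier(**PARAMS).
PARAMS = {
    "objective": "binary",
    "num_leaves": 13,
    "learning_rate": 0.1,
    "n_estimators": 100,
}

def split(samples, ratio=3, seed=0):
    """Randomly split samples into two groups at ratio:1."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * ratio // (ratio + 1)
    return shuffled[:cut], shuffled[cut:]

samples = list(range(400))        # stand-ins for labeled enterprise feature rows
train_val, test = split(samples)  # 3:1 training set to test set, as in step 403
train, val = split(train_val)     # the ratio of this second split is an assumption

print(len(train), len(val), len(test))
```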