WO2021088499A1 - False invoice issuing identification method and system based on dynamic network representation - Google Patents

False invoice issuing identification method and system based on dynamic network representation

Info

Publication number
WO2021088499A1
WO2021088499A1 PCT/CN2020/113450 CN2020113450W WO2021088499A1 WO 2021088499 A1 WO2021088499 A1 WO 2021088499A1 CN 2020113450 W CN2020113450 W CN 2020113450W WO 2021088499 A1 WO2021088499 A1 WO 2021088499A1
Authority
WO
WIPO (PCT)
Prior art keywords
enterprise
network
day
representation
characterization
Prior art date
Application number
PCT/CN2020/113450
Other languages
French (fr)
Chinese (zh)
Inventor
郑庆华
董博
阮建飞
范弘铖
Original Assignee
西安交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安交通大学
Publication of WO2021088499A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Definitions

  • the invention belongs to the technical field of tax control, and particularly relates to a method and system for identifying false invoice issuance based on dynamic network representation.
  • False invoice issuance refers to the use of various behavioral means by enterprises to issue invoices that are inconsistent with actual business conditions in order to achieve the purpose of tax evasion.
  • network characterization technology provides a solution.
  • the method of identifying false invoice issuance based on network representation can organize isolated report information into a corporate transaction network, thereby systematically verifying all companies, and at the same time, it can also use inter-enterprise contacts to obtain more corporate information to identify false invoice companies.
  • the following patents provide reference methods based on network characterization technology to automatically identify false invoices through computers:
  • Literature 1: A detection method for falsely issued VAT special invoices based on parallel loop detection (201710147850.8);
  • Literature 2: A method for identifying suspicious taxpayers based on the taxpayer interest-related network (201410328391.X);
  • Literature 1 organizes invoice information into a static network with enterprises as nodes, and improves loop detection in the network.
  • the improvement is to distribute computing tasks to multiple computers in a distributed cluster through distributed parallel computing to improve efficiency, and finally to use the improved loop detection method to detect false VAT invoices.
  • Literature 2 identifies suspicious taxpayers based on the topological characteristics of the taxpayer interest-related network (TPIN): it analyzes the topological characteristics of the network, obtains each taxpayer's characterization in it, and then applies a C4.5 classifier to automatically identify suspicious taxpayers.
  • TPIN: taxpayer interest-related network
  • Literature 1 can only detect false invoice issuance in which funds return to the source account after passing through multiple accounts, whereas false invoice issuance takes various forms and is not limited to the loop form.
  • the types it can identify are too narrow, and the generalization ability of the model is poor;
  • Literature 2 is based only on the topological structure of taxpayers and their interest relationships; it ignores the attribute information of the enterprise and treats all enterprises as homogeneous, so it cannot analyze them from the perspective of enterprise scale, market share, etc.;
  • Literature 1 and Literature 2 are both limited to static networks: they cannot combine historical information to analyze changes in corporate transactions dynamically or grasp those changes accurately, which leaves loopholes that some companies can exploit.
  • the purpose of the present invention is to provide a method and system for identifying false invoices based on dynamic network representation.
  • the invention adopts dynamic network representation, dynamically analyzes the enterprise transaction network in combination with historical information, and accurately grasps the dynamic changes of enterprise transactions; it can identify different false invoice issuing behaviors based on the related information between enterprises; at the same time, it draws on distributed optimization algorithms to decompose the calculation function into independent sub-functions that are executed in parallel, which improves the efficiency of identifying false invoices.
  • a method for identifying false invoice issuance based on dynamic network representation: first, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges; second, a representation of the enterprise transaction network is established with each day as a time node.
  • a 30-day time sequence window is established; within the window, 30 days of static network representations are merged each time, and the static network representations of all time nodes are gradually merged by moving the window to obtain the final dynamic network representation; third, drawing on distributed optimization algorithms,
  • the objective function of the characterization is decomposed into independent sub-functions, and the sub-functions are optimized in parallel to improve the learning efficiency of the model; finally, a binary classifier is constructed based on LightGBM to identify the enterprises suspected of issuing false invoices.
  • the method specifically includes the following implementation steps:
  • the data is preprocessed, and then the basic information of the company is extracted.
  • the basic information of the company is roughly divided into three types: text data is converted into vectors by the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
  • Step 2 Feature extraction based on dynamic network representation
  • after the basic characteristics of the enterprise are extracted, enterprise transaction information is organized into a static network with enterprises as nodes, basic enterprise information as node attributes, transaction records as edges, transaction information as edge attributes, and each day as a time node; then a 30-day time sequence window is established, 30 days of static network representations are merged within the window each time, and the static network representations at all times are gradually merged by moving the window, optimizing the objective function of the network representation and finally obtaining the optimal dynamic enterprise transaction network characterization
  • Step 3 Algorithm optimization based on a distributed approach
  • Step 4 Build a classifier to identify false invoices
  • The implementation method of step 1 is as follows:
  • Step 101 data preprocessing
  • Step 102 processing text data
  • the processing of text information in the enterprise basic information table includes:
  • Step 103 processing categorical data
  • One-Hot coding is used for the discrete categorical data in the enterprise basic information table: a run of status bits whose length equals the number of attribute values is established, and each bit marks one specific state;
  • Step 104 processing numerical data
  • The implementation method of step 2 is as follows:
  • Step 201 Establish a static corporate transaction network
  • a representation model of the corporate transaction network is established every day, so that companies with similar topological structures or higher transaction weights are closer in the representation space.
  • the objective optimization function is:
  • H_i and H_j are the characterizations of enterprises i and j;
  • w_ij is the weight between the trading enterprises; minimize w_ij
  • Step 202 Dynamically integrate historical information
  • a parameter sets how much the model's structural characteristics contribute relative to the original matrix. The larger this parameter, the more the model attends to the time-series network representation; the smaller it is, the more it attends to the node characterization
  • The implementation method of step 3 is as follows:
  • Step 301 Decompose the objective function
  • Step 302 execute multiple sub-functions in parallel
  • Step 303 comprehensively sort the parallel results
  • The implementation method of step 4 is as follows:
  • Step 401 Combine the basic features obtained in step 1 and the dynamic network features obtained in step 3 as the learning data of the classifier;
  • Step 402 Construct a binary classification model based on LightGBM, and set the main parameters of the classifier as follows: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;
  • Step 403 Take the characterization results obtained from the sample set of enterprises marked as issuing false invoices and the sample set of normal enterprises as basic features, and randomly divide them into a training set and a test set at a ratio of 3:1; then randomly divide the training set into two groups, a training set and a validation set. Use the training set to train the classification model of step 402 and use the validation set to tune the training; if over-fitting occurs, perform pruning. Select the optimal model and verify the accuracy of the algorithm on the test set;
  • Step 404 Input the characterization result of the unmarked enterprise sample into the LightGBM-based prediction model of the suspected false invoice issuance enterprise, and finally, based on the output of the prediction model, determine whether the target company has false invoice issuance behavior.
  • the present invention has the following beneficial effects:
  • the present invention is a method for identifying enterprises suspected of issuing false invoices based on the idea of dynamic network representation learning, and has the following advantages:
  • the calculation function is decomposed into independent sub-functions for parallel execution, which reduces the time complexity of computing network representation and improves the efficiency of identifying false invoices.
  • Figure 1 is the overall framework flow chart
  • Figure 2 is a schematic diagram of the basic feature extraction process
  • Figure 3 is a schematic diagram of a feature extraction process based on dynamic network representation
  • Figure 4 is a schematic diagram of the optimization process of the network characterization algorithm
  • Figure 5 is a schematic diagram of the process of constructing a classifier to identify false invoices
  • Fig. 6 is a schematic diagram of a system for identifying false invoice issuance based on dynamic network representation according to an embodiment of the present invention.
  • the method for identifying false invoices based on dynamic network representation includes the following steps:
  • the basic information of the company is roughly divided into three types: text data is converted into vectors by the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized.
  • the basic feature extraction implementation process specifically includes the following steps:
  • Step 1 Extract the "Taxpayer Electronic File Number" as the unique identifier of the company's characteristics, and delete all other attributes that cannot describe the company's own distribution rules;
  • Step 2 When an attribute contains a large number of missing values and only a few valid values, for example the attributes "taxpayer tax agency code", "financial report type" and "accounting form", for which fewer than 10% of enterprises have a value, the feature is deleted directly; when an attribute has a small number of missing values, for example the "employees" and "registered capital" attributes, which are missing for individual companies, the missing values are filled in by mean imputation.
  • Step 1 Use the Jieba word segmentation tool for word segmentation, construct a suitable stop-word list, and remove the stop words from the text.
  • the content of the "business scope" field of an enterprise in this embodiment is "production, sales: ceramics and products; goods import and export, technology import and export”.
  • the result is "production, sales, ceramics and products, goods import and export, technology import and export”;
  • Step 2 Use the dictionary tree to count the results of step 1, and select words with larger weights as keywords;
  • Step 3 Convert the N types of keywords extracted in step 2 into vectors based on word2vec.
  • One-Hot coding is used for the discrete categorical data "enterprise type” and "enterprise status" in the enterprise basic information table.
  • the number of possible values of the attribute is the length of the status bits; exactly one bit is set to 1 and the others are set to 0 to indicate a specific state.
  • the "enterprise type" field has four possible values: "sole proprietorship", "partnership", "limited liability company" and "limited liability company". Therefore, the length of the status bits of "enterprise type" is 4, where 1000 means "sole proprietorship", 0100 means "partnership", 0010 means "limited liability company", and 0001 means "limited liability company".
  • Step 1 Obtain the mean value of the "registered capital” attribute
  • n represents the number of basic information samples of the enterprise
  • x_j represents the value of the j-th "registered capital" attribute
  • Step 2 Get the variance of each attribute
  • let σ² be the variance of the "registered capital" attribute; its specific calculation form is:
  • Mean and variance are the basic indicators of numerical attributes, and numerical attributes can be standardized through the mean and variance;
  • Step 1 Establish a static corporate transaction network
  • the characterization h of each enterprise on that day can be obtained, so that enterprises with similar transaction structures or higher transaction weights are closer in the characterization space, and the characterization of the entire enterprise transaction network on that day can then be obtained.
  • Step 2 Dynamically integrate historical information
  • the length of the time sequence window is 30 days. Within the window, 30 days of static network characterizations are merged each time, and the window is then moved to gradually merge all static network characterizations so as to minimize the objective function
  • the specific steps of the distributed algorithm optimization implementation process include:
  • the gradient descent algorithm is used to solve equation (4).
  • updating stops when successive iterates are approximately equal, and the representation at that point is taken as the representation of the corporate transaction network on that day. Therefore, for the dynamic transaction network spanning days 1 to T, the network representation can be obtained by computing day by day in order.
  • the basic feature vector of the enterprise obtained in S101 is appended directly after the dynamic network feature vector obtained in S103, and the combined vector is used as the learning data of the classifier.
  • the main parameters for setting the classifier are: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;
  • Step 1 Take the characterization results obtained by the sample set of enterprises marked as false invoices and the sample set of normal enterprises as basic features, and randomly divide them into two groups as the training set and the test set at a ratio of 3:1.
  • Step 2 Randomly select 10% of the data in the training set as the validation set.
  • Step 3 Use the training set to train the classification model built by S502, use the validation set to adjust the training, and perform pruning when over-fitting occurs;
  • Step 4 Iterative calculation. Since the number of iterations is set to 100, if the convergence condition is not reached for 100 iterations, the iteration is forced to stop, and the result of the last iteration is the calculated representation.
  • Step 5 Select the optimal model to verify the accuracy of the algorithm in the test set.
  • the accuracy verified in this embodiment is 0.957, the precision is 0.921, and the recall is 0.87, indicating that the model performs very well on the test set and can meet the requirements for identifying false invoices in actual tax scenarios.
  • the accuracy is 0.876, the precision is 0.856, and the recall is 0.794.
  • the method of the present invention improves accuracy by 9.25%, precision by 7.6%, and recall by 9.57%.
  • the running time of the distributed algorithm for the data sample in this embodiment is 684.57 s, a reduction of 28.56% compared with a running time of 958.19 s.
  • Input the characterization results of the unlabeled enterprise samples into the trained prediction model for enterprises suspected of issuing false invoices, and determine from the output of the prediction model whether the target enterprise has issued false invoices. In this embodiment, the predicted values are sorted from high to low, and the top ten percent are taken as enterprises suspected of issuing false invoices.
  • a system for identifying false invoices based on dynamic network representation includes:
  • the enterprise attribute feature extraction module is used to extract the basic information of the enterprise after preprocessing the data.
  • the basic information of the enterprise is roughly divided into three types: text data is converted into vectors by the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
  • the dynamic network characterization building module is used to process the attribute characteristics of the enterprise to obtain the static transaction network characterization of the enterprise with each day as a time node, then establish a 30-day time sequence window, integrate the static network characterizations within the window through a regularization term, and slide the window along the time series to gradually merge all static network characterizations into the dynamic network characterization;
  • the parallel optimization module for dynamic network characterization is used to decompose the objective of enterprise dynamic network characterization into independent sub-objectives.
  • optimizing the sub-objectives in parallel improves the efficiency of dynamic network characterization and obtains the final characterization result more efficiently;
  • the invoice false issuance recognition module is used to use the obtained enterprise dynamic network as the characteristics of the invoice false issuance behavior and input it into the binary classifier based on LightGBM, and use the marked enterprise sample set to train the invoice false issuance recognition model.
  • the characterization results of the sample set of enterprises for prediction are input into the trained model for prediction, and then the enterprises suspected of issuing false invoices are obtained.
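
The improvement figures quoted in the embodiment above appear to be relative changes against the comparison values, which can be checked with a few lines of arithmetic. This is only a sketch: the helper name `relative_change` is illustrative, and reading the percentages as relative (not absolute) differences is an assumption that happens to match the reported numbers.

```python
def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# (metric, value reported for the method, comparison value) from the embodiment
reported = [
    ("accuracy", 0.957, 0.876),
    ("precision", 0.921, 0.856),
    ("recall", 0.870, 0.794),
]

for name, new, old in reported:
    print(f"{name}: {relative_change(new, old):+.2f}%")

# Running time: 684.57 s for the distributed algorithm versus 958.19 s.
print(f"running time: {relative_change(684.57, 958.19):+.2f}%")
```

Under this reading the precision gain comes out at roughly 7.59%, which agrees with the quoted 7.6% after rounding to one decimal place.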

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Technology Law (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed are a false invoice issuing identification method and system based on dynamic network representation. The method comprises: firstly, organizing enterprise transaction information into a static network by taking an enterprise as a node and a transaction record as an edge; secondly, establishing a representation of the enterprise transaction network by taking each day as a time node, establishing a time-sequence window with a duration of 30 days, fusing 30 days of static network representations in the window each time, and gradually fusing the static network representations of all the time nodes by moving the time-sequence window so as to obtain a final dynamic network representation result; thirdly, by means of a distributed optimization algorithm, decomposing the objective function of the representation into independent sub-functions, and optimizing the sub-functions in parallel to improve the learning efficiency of the model; and finally, constructing a binary classifier on the basis of LightGBM to identify an enterprise that is suspected of issuing false invoices. In the present invention, an enterprise that is suspected of issuing false invoices is identified on the basis of dynamic network representation, thereby improving the efficiency and accuracy of false invoice issuing identification.

Description

Method and system for identifying false invoice issuance based on dynamic network representation
【Technical Field】
The invention belongs to the technical field of tax control, and particularly relates to a method and system for identifying false invoice issuance based on dynamic network representation.
【Background Art】
False invoice issuance refers to enterprises using various means to issue invoices that do not match their actual business operations in order to evade taxes.
False invoice issuance causes huge losses of national tax revenue and seriously undermines the socialist economic order. At present, tax bureaus identify enterprises suspected of issuing false invoices mainly through tip-offs, routine supervisory spot checks, and implication by other problem enterprises, after which tax inspectors check the statements provided by the enterprise. Such inspections are highly incidental and cannot systematically analyze and evaluate all enterprises. Moreover, manual verification by tax inspectors alone involves a heavy workload and low efficiency, and the data examined is limited to the statements provided by a single enterprise, so related upstream and downstream enterprises cannot be taken into account.
To address the problems currently faced in identifying false invoices, network characterization technology provides a solution. A method of identifying false invoice issuance based on network representation can organize isolated report information into an enterprise transaction network and thereby check all enterprises systematically; it can also use the connections between enterprises to obtain more enterprise information for identifying enterprises that issue false invoices. The following patents provide reference methods that use network characterization technology to identify false invoices automatically by computer:
Literature 1. A detection method for falsely issued VAT special invoices based on parallel loop detection (201710147850.8);
Literature 2. A method for identifying suspicious taxpayers based on the taxpayer interest-related network (201410328391.X);
Literature 1 organizes invoice information into a static network with enterprises as nodes and improves loop detection in the network. The improvement is to distribute computing tasks to multiple computers in a distributed cluster through distributed parallel computing to improve efficiency; the improved loop detection method is then used to detect falsely issued VAT special invoices.
Literature 2 identifies suspicious taxpayers based on the topological characteristics of the taxpayer interest-related network (TPIN). It analyzes the topological characteristics of the network, obtains each taxpayer's characterization in it, and then applies a C4.5 classifier to automatically identify suspicious taxpayers.
The methods described in the above literature have the following main problems. Literature 1 can only detect false invoice issuance in which funds return to the source account after passing through multiple accounts, whereas false invoice issuance takes various forms and is not limited to loops; the types it can identify are too narrow, and the model generalizes poorly. Literature 2 is based only on the topological structure of taxpayers and their interest relationships; it ignores the attribute information of enterprises and treats them as homogeneous, so it cannot analyze them from the perspective of enterprise scale, market share, etc. Literature 1 and Literature 2 are both limited to static networks: they cannot combine historical information to analyze changes in enterprise transactions dynamically or grasp those changes accurately, which leaves loopholes that some enterprises can exploit. For example, the annual accounts of a tax-evading enterprise may look unproblematic when each year is viewed in isolation: it reports losses for several consecutive years, yet its water and electricity costs rise year after year. False invoice issuance is usually hidden in such time-series-related features, which a static network cannot capture.
【Summary of the Invention】
In order to improve the efficiency of identifying false invoices, the purpose of the present invention is to provide a method and system for identifying false invoice issuance based on dynamic network representation. The invention adopts dynamic network representation, dynamically analyzes the enterprise transaction network in combination with historical information, and accurately grasps the dynamic changes of enterprise transactions; it can identify different false invoice issuing behaviors based on the related information between enterprises; and it draws on distributed optimization algorithms to decompose the calculation function into independent sub-functions that are executed in parallel, which improves the efficiency of identifying false invoices.
In order to achieve the above objectives, the present invention adopts the following technical solutions:
A method for identifying false invoice issuance based on dynamic network representation. First, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges. Second, a representation of the enterprise transaction network is established with each day as a time node; a time sequence window 30 days long is established, 30 days of static network representations are merged within the window each time, and the static network representations of all time nodes are gradually merged by moving the window to obtain the final dynamic network representation. Third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, and the sub-functions are optimized in parallel to improve the learning efficiency of the model. Finally, a binary classifier is constructed based on LightGBM to identify the enterprises suspected of issuing false invoices.
Further improvements of the present invention are as follows:
The method specifically includes the following implementation steps:
Step 1, basic feature extraction
The data is first preprocessed, and then the enterprise's basic information is extracted. The basic information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
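The categorical and numerical parts of Step 1 can be illustrated with a minimal standard-library sketch. The helper names `one_hot` and `standardize`, the sample category list, and the use of the population variance (dividing by n) are assumptions made for illustration; the word2vec conversion of text fields is omitted because it needs a trained embedding model.

```python
import math


def one_hot(value, categories):
    """Return status bits with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def standardize(values):
    """Z-score standardization using the mean and population variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(variance)
    if std == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical sample data, not taken from the patent's data set.
enterprise_types = ["sole proprietorship", "partnership", "limited liability company"]
print(one_hot("partnership", enterprise_types))

registered_capital = [100.0, 200.0, 300.0]
print(standardize(registered_capital))
```

After standardization the attribute has zero mean and unit variance, so attributes measured on very different scales (for example registered capital and number of employees) contribute comparably to later steps.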
步骤2,基于动态网络表征的特征提取 Step 2. Feature extraction based on dynamic network representation
提取企业基本特征后,以企业为节点,企业基本信息为节点属性,以交易记录为边,交易信息为边的属性,以每一天为时间节点,把企业交易信息组织成静态网络;然后以30天为单位建立时序窗口,在窗口内每次融合30天的静态网络表征,并通过移动时序窗口逐步融合所有时间的静态网络表征,优化网络表征的目标函数,最后得到最优的动态企业交易网络表征;After extracting the basic characteristics of the enterprise, the enterprise is the node, the basic information of the enterprise is the node attribute, the transaction record is the edge, and the transaction information is the attribute of the edge, and each day is the time node, and the enterprise transaction information is organized into a static network; then 30 A time sequence window is established in units of days, and 30-day static network representations are merged within the window each time, and static network representations at all times are gradually merged through the moving time sequence window to optimize the objective function of the network representation, and finally obtain the optimal dynamic enterprise transaction network Characterization
Step 3, distributed algorithm optimization
To improve the learning efficiency of the dynamic network representation, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network representation is decomposed into independent sub-functions; optimizing the sub-functions in parallel speeds up the solution of large-scale, complex enterprise transaction network representations;
Step 4, building a classifier to identify false invoice issuance
A binary classification model is built on the LightGBM classifier. The computed dynamic network representation serves as the classifier's learning data, and the model is trained on the labeled enterprise sample set. The representations of the enterprise samples to be predicted are then fed into the trained model, and the model's output determines whether a target enterprise has issued false invoices.
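The network construction described in step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout `(day, seller_id, buyer_id, amount)` and the choice of summed transaction amount as the edge weight are assumptions made only for the example.

```python
from collections import defaultdict

def build_daily_networks(records):
    """Group transaction records into one weighted edge map per day.

    records: iterable of (day, seller_id, buyer_id, amount) tuples (assumed layout).
    Returns {day: {(seller_id, buyer_id): total_amount}}.
    """
    networks = defaultdict(lambda: defaultdict(float))
    for day, seller, buyer, amount in records:
        # Edge weight = summed amount between the two enterprises (assumption).
        networks[day][(seller, buyer)] += amount
    return {day: dict(edges) for day, edges in networks.items()}

records = [
    (1, "A", "B", 100.0),
    (1, "A", "B", 50.0),
    (1, "B", "C", 20.0),
    (2, "A", "C", 70.0),
]
daily = build_daily_networks(records)
```

Each key of `daily` is one time node, and each value is the static network for that day.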
Step 1 is implemented as follows:
Step 101, data preprocessing
(1) Extract the "taxpayer electronic file number" as the unique identifier of each enterprise;
(2) Handle missing values: attributes with severe data loss and attributes unrelated to the false-invoice task are deleted outright; important attributes with a small number of missing values are completed by class-mean imputation;
Step 102, processing text data
The text information in the basic enterprise information table is processed as follows:
(1) Segment the enterprise's text data into words with the Jieba word segmentation tool;
(2) Tally the segmentation results with a dictionary tree (trie) and select the words with the largest weights as keywords;
(3) Convert the N classes of extracted keywords into vectors with word2vec;
Step 103, processing flag-type (categorical) data
Discrete categorical data in the basic enterprise information table are One-Hot encoded: a string of status bits whose length equals the number of values the attribute can take is created, and each bit marks one specific state;
Step 104, processing numerical data
Numerical data in the basic enterprise information table are processed with the conventional standardization method:
(1) Compute the mean of each attribute;
(2) Compute the variance of each attribute;
(3) Apply Z-Score standardization.
Step 2 is implemented as follows:
Step 201: building the static enterprise transaction network
A representation model of the enterprise transaction network is built for each day so that enterprises with similar topological structure or higher transaction weights lie closer together in the representation space. The objective function is:
Figure PCTCN2020113450-appb-000001
where $h_i$ and $h_j$ are the representations of enterprises i and j, and $w_{ij}$ is the weight of the transactions between them. Minimizing $w_{ij}\|h_i-h_j\|^2$ forces the representations of enterprises i and j closer together the larger their transaction weight $w_{ij}$ is;
Minimizing the objective
Figure PCTCN2020113450-appb-000002
yields the optimized enterprise transaction network representation h for that day;
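The weighted-distance term described in step 201 can be evaluated on a toy network as below. This is only a sketch of the term $w_{ij}\|h_i-h_j\|^2$ named in the text; the full objective appears in the figure and may contain further terms that are not reproduced here. The names `h`, `w`, and `smoothness_objective` are illustrative.

```python
def smoothness_objective(h, w):
    """Sum of w[i, j] * ||h[i] - h[j]||^2 over all weighted edges."""
    total = 0.0
    for (i, j), weight in w.items():
        sq_dist = sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
        total += weight * sq_dist
    return total

# Toy representations for three enterprises and two weighted edges.
h = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 2.0]}
w = {("A", "B"): 3.0, ("A", "C"): 0.5}
loss = smoothness_objective(h, w)  # 3.0 * 1.0 + 0.5 * 4.0

# Pulling the heavily weighted pair (A, B) together lowers the term.
h_closer = {"A": [0.0, 0.0], "B": [0.5, 0.0], "C": [0.0, 2.0]}
loss_closer = smoothness_objective(h_closer, w)
```

The heavier edge dominates the sum, which is why minimizing it pulls strongly connected enterprises together first.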
Step 202: dynamically fusing historical information
A time-series window 30 days long is established; the 30 daily static network representations inside the window are fused each time, and the window is then moved so that all static network representations are fused step by step, finally giving the dynamic enterprise transaction network representation. The corresponding optimization objective is:
Figure PCTCN2020113450-appb-000003
where
Figure PCTCN2020113450-appb-000004
denote, respectively, the representations of enterprises p and q on day t and the weight of the transactions between them, and
Figure PCTCN2020113450-appb-000005
denotes how close the representations of enterprise p and enterprise q are; $H_i$ denotes the network representation of day i within the time-series window; the penalty term
Figure PCTCN2020113450-appb-000006
keeps the matrix learned by the representation as close as possible to the matrix of the original enterprise transaction network; ρ is a parameter that sets how much the model's structural term and its closeness to the original matrix each contribute: the larger ρ is, the more the model emphasizes the time-series network representation, and the smaller ρ is, the more it emphasizes the node representations;
Minimizing the objective
Figure PCTCN2020113450-appb-000007
yields the optimized dynamic enterprise transaction network representation H.
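The moving 30-day window of step 202 can be enumerated as in the sketch below. The text does not state how far the window moves each time, so the `stride` of one day is an assumption, as is the 1-indexed inclusive day numbering.

```python
def sliding_windows(num_days, window=30, stride=1):
    """Yield (first_day, last_day) pairs, 1-indexed and inclusive."""
    for start in range(1, num_days - window + 2, stride):
        yield (start, start + window - 1)

# With 32 days of daily networks there are three window positions.
windows = list(sliding_windows(32))
```

Each pair identifies the daily static representations that would be fused at that window position.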
Step 3 is implemented as follows:
Step 301, decomposing the objective function
The optimization function (2) is restructured and written in a decomposable form:
Figure PCTCN2020113450-appb-000008
where
Figure PCTCN2020113450-appb-000009
denote, respectively, the representations of enterprises p and q on day t and the weight of the transactions between them, and
Figure PCTCN2020113450-appb-000010
denotes how close the representations of enterprise p and enterprise q are; the penalty term
Figure PCTCN2020113450-appb-000011
builds on the approximation in formula (2) to the matrix of the original enterprise transaction network and splits the data so that the computation is carried out per enterprise;
Minimizing the objective
Figure PCTCN2020113450-appb-000012
yields the optimized dynamic enterprise transaction network representation H;
Step 302, executing multiple sub-functions in parallel
Formula (3) is decomposed into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network; these are solved in parallel to obtain $H_t^{k+1}$:
Figure PCTCN2020113450-appb-000013
where
Figure PCTCN2020113450-appb-000014
denotes the enterprises associated with enterprise v,
Figure PCTCN2020113450-appb-000015
denotes the representation of enterprise v on day t,
Figure PCTCN2020113450-appb-000016
denotes the representation of enterprise v on day t after k iterations,
Figure PCTCN2020113450-appb-000017
denotes the weight of the transactions between enterprises v and q on day t,
Figure PCTCN2020113450-appb-000018
denotes how close the representations of enterprise v and enterprise q are after (k-1) iterations on day t, and
Figure PCTCN2020113450-appb-000019
denotes how close the representations of enterprise v on day i and on day t are;
Here,
Figure PCTCN2020113450-appb-000020
is the representation of enterprise v on day t that is to be solved for. An iterative optimization method is used to judge whether the result has reached the required precision: the problem is solved by gradient descent, and the optimization function attains its optimum when the convergence condition
Figure PCTCN2020113450-appb-000021
or
Figure PCTCN2020113450-appb-000022
is met. That is, updating stops when the results of the k-th and (k-1)-th iterations for an enterprise agree to the required precision, or when the iterate for an enterprise is close enough to those of its associated enterprises; the representation from the k-th iteration is then taken as that enterprise's representation for the day;
Step 303, collecting the parallel results
Computing the N nodes of the transaction network in parallel gives the representation of every enterprise on day t; for the dynamic transaction network spread over time nodes 1 to T, the network representation at each time node is then computed in sequence.
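The decompose, solve in parallel, and collect pattern of steps 301 to 303 might look roughly like the sketch below. The per-enterprise sub-objective used here is an assumed stand-in (a neighbor term plus a term tying each enterprise to its previous-day representation), because the actual formula (4) is only available as a figure. The stopping rule mirrors the first convergence condition in the text: stop when successive iterates agree to a tolerance. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def update_node(v, h_iter, neighbors, h_prev_day, rho, lr=0.05, steps=50):
    """Gradient descent on an ASSUMED sub-objective for enterprise v:
        sum_q w_vq * ||h - h_q||^2 + rho * ||h - h_prev_day[v]||^2
    Neighbour vectors h_q are frozen at the previous outer iteration,
    so every node can be updated independently of the others.
    The fixed step size lr is only stable for small total edge weights."""
    h = list(h_iter[v])
    dims = range(len(h))
    for _ in range(steps):
        grad = [2.0 * rho * (h[d] - h_prev_day[v][d]) for d in dims]
        for q, w_vq in neighbors.get(v, {}).items():
            for d in dims:
                grad[d] += 2.0 * w_vq * (h[d] - h_iter[q][d])
        h = [h[d] - lr * grad[d] for d in dims]
    return v, h

def optimize_day(h_init, neighbors, h_prev_day, rho=0.75, tol=1e-6, max_iter=100):
    """Repeat parallel per-node updates until no coordinate moves by more than tol."""
    h = {v: list(vec) for v, vec in h_init.items()}
    with ThreadPoolExecutor() as pool:
        for k in range(1, max_iter + 1):
            current = h  # read-only snapshot shared by all workers this round
            results = pool.map(
                lambda v: update_node(v, current, neighbors, h_prev_day, rho),
                list(current),
            )
            new_h = dict(results)
            change = max(
                abs(a - b) for v in current for a, b in zip(new_h[v], current[v])
            )
            h = new_h
            if change < tol:
                break
    return h, k

prev_day = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [2.0, 0.0]}
neighbors = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
h_day, n_iter = optimize_day(prev_day, neighbors, prev_day)
```

Threads are used only to keep the sketch self-contained; in CPython a process pool or a distributed framework would be needed for a real speed-up on CPU-bound updates.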
Step 4 is implemented as follows:
Step 401: the basic features obtained in step 1 and the dynamic network features obtained in step 3 are combined as the learning data of the classifier;
Step 402: a binary classification model is built on LightGBM, with the main classifier parameters set as follows: 13 leaves, a learning rate of 0.1, and 100 iterations;
Step 403: the representations obtained for the sample set of enterprises labeled as issuing false invoices and for the sample set of normal enterprises are used as the base features and randomly divided at a ratio of 3:1 into a training set and a test set; ten percent of the training set is then randomly set aside as a validation set. The classification model from step 402 is trained on the training set and tuned on the validation set, with pruning applied if overfitting appears; the best model is selected and the accuracy of the algorithm is verified on the test set;
Step 404: the representations of the unlabeled enterprise samples are fed into the LightGBM-based prediction model for enterprises suspected of false invoice issuance, and the model's output determines whether a target enterprise has issued false invoices.
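The data split and parameter settings of steps 402 and 403 can be sketched as below. The split logic uses only the standard library. The dictionary keys are the parameter names of LightGBM's scikit-learn interface, and mapping "number of iterations" to `n_estimators` is an interpretation rather than something the text states. The helper name `split_samples` and the fixed seed are illustrative.

```python
import random

# Settings from step 402, expressed with LightGBM's scikit-learn parameter names.
# They could be passed as lightgbm.LGBMClassifier(**LGBM_PARAMS) if LightGBM is installed.
LGBM_PARAMS = {"num_leaves": 13, "learning_rate": 0.1, "n_estimators": 100}

def split_samples(samples, seed=0):
    """Random 3:1 train/test split, then 10% of the training part as validation."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 4                 # 3:1 ratio -> one quarter for testing
    test_set, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = len(train_full) // 10               # ten percent of the training set
    val_set, train_set = train_full[:n_val], train_full[n_val:]
    return train_set, val_set, test_set

samples = list(range(400))
train_set, val_set, test_set = split_samples(samples)
```

With 400 labeled samples this gives 270 training, 30 validation, and 100 test samples.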
Compared with the prior art, the present invention has the following beneficial effects:
The present invention is a method, based on dynamic network representation learning, for identifying enterprises suspected of issuing false invoices, and has the following advantages:
1. It uses dynamic network representation together with historical information, learning and fusing representation vectors for the networks at all time nodes, so that it captures the dynamic changes of the enterprise transaction network accurately and improves the accuracy of false-invoice identification;
2. Because it uses the relational information between enterprises, it can identify different types of false invoice issuance;
3. Drawing on distributed optimization algorithms, it decomposes the computation into independent sub-functions executed in parallel, which reduces the time complexity of computing the network representation and makes false-invoice identification more efficient.
[Description of the drawings]
Figure 1 is a flow chart of the overall framework;
Figure 2 is a schematic diagram of the basic feature extraction process;
Figure 3 is a schematic diagram of the feature extraction process based on dynamic network representation;
Figure 4 is a schematic diagram of the optimization process for the network representation algorithm;
Figure 5 is a schematic diagram of the process of building a classifier to identify false invoice issuance;
Figure 6 is a schematic diagram of a system for identifying false invoice issuance based on dynamic network representation according to an embodiment of the present invention.
[Detailed description of the embodiments]
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention and are not intended to limit the scope of the disclosure. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts disclosed herein. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The present invention is described in further detail below with reference to the accompanying drawings:
Referring to Figure 1, the method for identifying false invoice issuance based on dynamic network representation comprises the following steps:
S101. Basic feature extraction
After the data have been preprocessed, the basic enterprise information is extracted. The information falls into roughly three types, each handled differently: text data are converted into vectors with the word2vec algorithm, categorical data are One-Hot encoded, and numerical data are standardized.
As shown in Figure 2, basic feature extraction comprises the following steps:
S201. Data preprocessing
Step 1: extract the "taxpayer electronic file number" as the unique identifier of each enterprise, and delete all other attributes that cannot characterize the enterprise's own distribution patterns;
Step 2: when an attribute consists mostly of missing values with only a few valid ones (for example, fewer than 10% of enterprises have values for "taxpayer tax authority code", "financial statement type", and "accounting form"), the feature is deleted outright; when an attribute has only a few missing values (for example, "number of employees" and "registered capital" are missing for individual enterprises), the missing values are completed by class-mean imputation.
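Class-mean imputation as used in S201 can be sketched as follows. The text does not say which attribute defines the "class", so grouping by `enterprise_type` is an assumption for illustration, as are the field names and the sample rows.

```python
from collections import defaultdict

def impute_class_mean(rows, group_key, field):
    """Fill missing (None) values of `field` with the mean of the same group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        if row[field] is not None:
            sums[row[group_key]] += row[field]
            counts[row[group_key]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        group = new_row[group_key]
        if new_row[field] is None and counts[group] > 0:
            new_row[field] = sums[group] / counts[group]
        filled.append(new_row)
    return filled

rows = [
    {"enterprise_type": "partnership", "registered_capital": 100.0},
    {"enterprise_type": "partnership", "registered_capital": 300.0},
    {"enterprise_type": "partnership", "registered_capital": None},
    {"enterprise_type": "sole proprietorship", "registered_capital": 50.0},
]
filled = impute_class_mean(rows, "enterprise_type", "registered_capital")
```

A value stays missing if its group has no observed values at all, which leaves that case for the caller to handle.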
S202. Processing text data
The text fields "goods information" and "business scope" in the basic enterprise information table are preprocessed and their features extracted. The specific steps are:
Step 1: segment the text with the Jieba word segmentation tool, build a suitable stop-word list, and remove the stop words from the text. For example, the "business scope" field of one enterprise in this embodiment reads "production, sales: ceramic products; import and export of goods, import and export of technology". After segmentation and stop-word removal the result is "production, sales, ceramic products, import and export of goods, import and export of technology";
Step 2: tally the results of step 1 with a dictionary tree (trie) and select the words with the largest weights as keywords;
Step 3: convert the N classes of keywords extracted in step 2 into vectors with word2vec.
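The dictionary-tree tally of step 2 can be illustrated with a small trie that counts already-segmented words. The text does not define the "weight" of a word, so raw frequency is assumed here; segmentation itself (done with Jieba in the text) is taken as given, and the token lists are made up for the example.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # number of times a word ends at this node

class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def counts(self):
        """Return {word: frequency} for every word stored in the trie."""
        out, stack = {}, [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                out[prefix] = node.count
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return out

def top_keywords(token_lists, k):
    """Pick the k most frequent words (frequency as weight is an assumption)."""
    trie = WordTrie()
    for tokens in token_lists:
        for tok in tokens:
            trie.add(tok)
    ranked = sorted(trie.counts().items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

segmented = [
    ["production", "sales", "ceramic products"],
    ["sales", "import and export of goods"],
    ["sales", "production"],
]
keywords = top_keywords(segmented, 2)
```

Ties are broken alphabetically so that the selection is deterministic.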
S203. Processing categorical data
The discrete categorical fields "enterprise type" and "enterprise status" in the basic enterprise information table are One-Hot encoded. The number of values the attribute can take gives the length of the status-bit string; setting one bit to 1 and all the others to 0 denotes one specific state. For example, the "enterprise type" field in this embodiment has four possible values: "sole proprietorship", "partnership", "limited liability company", and "joint-stock limited company". The status-bit string for "enterprise type" therefore has length 4, where 1000 denotes "sole proprietorship", 0100 denotes "partnership", 0010 denotes "limited liability company", and 0001 denotes "joint-stock limited company".
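The encoding of the four-valued "enterprise type" field can be reproduced with a few lines. The English category labels are translations chosen for this sketch, and the helper name `one_hot` is illustrative.

```python
ENTERPRISE_TYPES = [
    "sole proprietorship",
    "partnership",
    "limited liability company",
    "joint-stock limited company",
]

def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]

# Bit strings for every category, in the order listed above.
codes = {t: "".join(str(b) for b in one_hot(t, ENTERPRISE_TYPES)) for t in ENTERPRISE_TYPES}
```

The vector length always equals the number of categories, so adding a category changes the width of every encoded row.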
S204. Processing numerical data
The numerical fields "registered capital", "total investment", and "number of employees" in the basic enterprise information table are standardized. This embodiment uses "registered capital" as the example:
Step 1: obtain the mean of the "registered capital" attribute
Let u be the mean of the "registered capital" attribute, computed as:
Figure PCTCN2020113450-appb-000023
where n is the number of basic enterprise information samples and $x_j$ is the j-th value of the "registered capital" attribute;
Step 2: obtain the variance of the attribute
Let $\sigma^2$ be the variance of the "registered capital" attribute, computed as:
Figure PCTCN2020113450-appb-000024
The mean and variance are the basic statistics of a numerical attribute, and together they allow the attribute to be standardized;
Step 3: Z-Score standardization
Let δ be the standardized "registered capital" values, where $\delta = (\delta_1, \delta_2, \ldots, \delta_n)$ and $\delta_j$ is the standardized value of the j-th "registered capital"; $\delta_j$ is computed as:
$\delta_j = (x_j - u)/\sigma, \quad j = 1, 2, \ldots, n$
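The three standardization steps can be checked numerically with the standard library. Because the variance formula is only given as a figure, the population form (dividing by n) is assumed here; the sample values are made up.

```python
import statistics

def z_score(values):
    """Standardize values as (x - mean) / standard deviation."""
    u = statistics.fmean(values)
    sigma = statistics.pstdev(values, mu=u)  # population form (assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(x - u) / sigma for x in values]

registered_capital = [100.0, 200.0, 300.0]
scaled = z_score(registered_capital)
```

After standardization the attribute has mean 0 and standard deviation 1, which puts "registered capital", "total investment", and "number of employees" on a comparable scale.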
S102. Feature extraction based on dynamic network representation
First, static enterprise transaction networks are built with enterprises as nodes, transaction records as edges, and each day as a time node. A time-series window is then established in units of 30 days; the 30 daily static network representations inside the window are fused each time, and the window is moved so that the static network representations of all time nodes are fused step by step. The objective function of the network representation is optimized to obtain the optimal dynamic enterprise transaction network representation.
As shown in Figure 3, feature extraction based on dynamic network representation comprises the following steps:
Step 1: building the static enterprise transaction network
A representation model of the enterprise transaction network is built for each day; the objective function is:
Figure PCTCN2020113450-appb-000025
Minimizing the objective
Figure PCTCN2020113450-appb-000026
gives the representation h of each enterprise for that day, such that enterprises with similar transaction structure or large transaction weights lie closer together in the representation space, and hence gives the representation of the whole enterprise transaction network for that day.
Step 2: dynamically fusing historical information
All static enterprise transaction network representations are fused step by step within the time-series window, finally giving the dynamic enterprise transaction network representation. The optimization objective is:
Figure PCTCN2020113450-appb-000027
The time-series window is 30 days long. The 30 daily static network representations inside the window are fused each time, and the window is then moved so that all static network representations are fused step by step. Minimizing the objective
Figure PCTCN2020113450-appb-000028
gives the representation H of each enterprise for that day. In this embodiment, ρ=0.75 was found to work best, giving a fairly balanced emphasis on the time-series network representation and the node representations;
S103. Distributed algorithm optimization
First the objective function is decomposed; then the sub-functions are executed in parallel; finally the parallel results are collected.
As shown in Figure 4, distributed algorithm optimization comprises the following steps:
S401. Decomposing the objective function
The optimization function (2) is restructured and written in a decomposable form:
Figure PCTCN2020113450-appb-000029
In this embodiment the enterprise transaction network involves 3765 enterprises, so N=3765, and v runs from 1 to 3765 so that each enterprise and its associated transaction network is computed; ρ=0.75 is used, which gives a fairly balanced emphasis on the time-series network representation and the node representations;
S402. Executing multiple sub-functions in parallel
Formula (3) is decomposed by enterprise v into 3765 sub-optimization functions, which are solved in parallel and finally merged to obtain $H_t^{k+1}$. A single sub-objective function is:
Figure PCTCN2020113450-appb-000030
In this embodiment, ρ=0.75 is used, which gives a fairly balanced emphasis on the time-series network representation and the node representations. Computing in sequence gives the result of each sub-function:
Figure PCTCN2020113450-appb-000031
is the representation of each enterprise after the k-th iteration on day t, obtained by solving the individual sub-functions, from which
Figure PCTCN2020113450-appb-000032
the representation of the dynamic enterprise transaction network after the k-th iteration on day t, is obtained;
S403. Collecting the parallel results
Formula (4) is solved by gradient descent. In this embodiment, updating stops when
Figure PCTCN2020113450-appb-000033
or
Figure PCTCN2020113450-appb-000034
holds; the representation at which the two sides are approximately equal is taken as the representation of the enterprise transaction network for that day. For the dynamic transaction network spread over days 1 to T, computing in sequence then gives the network representation for each day.
S104. Building a classifier to identify false invoice issuance
First, the basic features obtained in S101 and the dynamic network features obtained in S103 are combined as the learning data of the classifier. Second, a binary classification model is built on the LightGBM classifier. The model is then trained on the sample set of enterprises labeled as issuing or not issuing false invoices. Finally, the representations of the enterprise samples to be predicted are fed into the trained model, and the model's output determines whether a target enterprise has issued false invoices.
As shown in Figure 5, building a classifier to identify false invoice issuance comprises the following steps:
S501. Obtaining the learning data of the classifier
The basic features obtained in S101 and the dynamic network features obtained in S103 are combined as the learning data of the classifier. In this embodiment the basic enterprise feature vector from S101 is appended directly after the dynamic network feature vector from S103, and the combined vector is used as the classifier's learning data.
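The concatenation order described in S501 amounts to the following. The vector contents and the helper name `build_feature_vector` are made up for illustration.

```python
def build_feature_vector(network_vec, basic_vec):
    """Place the basic feature vector after the dynamic network feature vector."""
    return list(network_vec) + list(basic_vec)

network_vec = [0.12, -0.40, 0.93]   # dynamic network representation (toy values)
basic_vec = [0, 1, 0, 0, -1.22]     # one-hot bits plus a standardized value (toy values)
features = build_feature_vector(network_vec, basic_vec)
```

Every enterprise must use the same ordering and vector lengths so that each column of the learning data has a fixed meaning.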
S502. Building a binary classification model on LightGBM
The main classifier parameters are set as follows: 13 leaves, a learning rate of 0.1, and 100 iterations;
S503. Training the model
Step 1: the representations obtained for the sample set of enterprises labeled as issuing false invoices and for the sample set of normal enterprises are used as the base features and randomly divided at a ratio of 3:1 into a training set and a test set.
Step 2: ten percent of the training set is randomly set aside as a validation set.
Step 3: the classification model built in S502 is trained on the training set and tuned on the validation set, with pruning applied when overfitting appears;
Step 4: iterate. Because the number of iterations is set to 100, iteration is forcibly stopped if the convergence condition has not been reached after 100 iterations, and the result of the last iteration is taken as the computed representation.
Step 5: the best model is selected and the algorithm is verified on the test set. In this embodiment the measured accuracy is 0.957, the precision 0.921, and the recall 0.87, which indicates that the model performs very well on the test set and can meet the requirements of false-invoice identification in real tax scenarios. Compared with other false-invoice identification methods based on static network representation (accuracy 0.876, precision 0.856, recall 0.794), the method of the present invention improves accuracy by 9.25%, precision by 7.6%, and recall by 9.57% in relative terms. The improvement shows not only in accuracy but also in the identification efficiency gained from distributed parallel computation: on the data sample of this embodiment the distributed algorithm runs in 684.57 s, which is 28.56% less than the 958.19 s of the non-distributed algorithm.
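The reported percentages are relative changes, which the short calculation below reproduces from the figures quoted in the text. The helper name `relative_gain` is illustrative.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# (this method, static-network baseline) as quoted in the text
reported = {
    "accuracy": (0.957, 0.876),
    "precision": (0.921, 0.856),
    "recall": (0.87, 0.794),
}
gains = {name: round(relative_gain(new, old), 2) for name, (new, old) in reported.items()}

# Running time: share of the non-distributed time that is saved.
time_saved = round((958.19 - 684.57) / 958.19 * 100.0, 2)
```

The precision gain comes out at 7.59%, which the text rounds to 7.6%.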
S504. Predicting enterprises suspected of false invoice issuance
The representations of the unlabeled enterprise samples are fed into the trained prediction model for enterprises suspected of false invoice issuance, and the model's output determines whether a target enterprise has issued false invoices. In this embodiment the predicted values are sorted from high to low, and the top ten percent are taken as enterprises suspected of issuing false invoices.
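The ranking rule in S504 can be sketched as below. The scores and enterprise identifiers are made up, and rounding the ten-percent cut-off up to a whole enterprise is an assumption, since the text does not say how fractions are handled.

```python
def flag_top_percent(scores, percent=10):
    """Return the ids with the highest predicted values, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    n_flag = -(-len(ranked) * percent // 100)  # integer ceiling (assumption)
    return [enterprise_id for enterprise_id, _ in ranked[:n_flag]]

# Toy predicted values for 20 enterprises: E00 scores 0.00, ..., E19 scores 0.95.
scores = {f"E{i:02d}": i / 20 for i in range(20)}
suspects = flag_top_percent(scores)
```

A fixed share of flagged enterprises is a policy choice; a score threshold would be an alternative when the expected proportion of offenders is unknown.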
In another embodiment of the present invention, a false invoice issuance identification system based on dynamic network representation is provided. The system includes:
an enterprise attribute feature extraction module, which preprocesses the data and then extracts the basic enterprise information. The basic information falls roughly into three types, each handled differently: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
a dynamic network representation construction module, which takes each day as a time node and processes the enterprise attribute features to obtain the static transaction network representation of the enterprises, then establishes a time window of length 30 days, fuses the static network representations within the window through a regularization term, and slides the window along the time series to progressively fuse all static network representations into the dynamic network representation;
a parallel optimization module for the dynamic network representation, which decomposes the objective of the enterprise dynamic network representation into independent sub-objectives and optimizes them in parallel, improving the efficiency of the dynamic network representation and obtaining the final representation result more efficiently;
a false invoice issuance identification module, which takes the obtained enterprise dynamic network representation as the features of false invoice issuance behavior and inputs it into a binary classifier built on LightGBM. The identification model is trained with the labeled enterprise sample set, and the representation results of the enterprise sample set to be predicted are then input into the trained model for prediction, yielding the enterprises suspected of false invoice issuance.
The above content only illustrates the technical idea of the present invention and does not limit its scope of protection. Any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the claims of the present invention.

Claims (6)

  1. A false invoice issuance identification method based on dynamic network representation, characterized in that: first, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges; second, a representation of the enterprise transaction network is built with each day as a time node, a time window of length 30 days is established, 30 days of static network representations are fused within the window at a time, and the window is moved to progressively fuse the static network representations of all time nodes into the final dynamic network representation; third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, which are optimized in parallel to improve the learning efficiency of the model; finally, a binary classifier is built on LightGBM to identify enterprises suspected of false invoice issuance.
  2. The false invoice issuance identification method based on dynamic network representation according to claim 1, characterized in that the method specifically includes the following steps:
    Step 1, basic feature extraction
    The data is first preprocessed and the basic enterprise information is then extracted. The basic information falls roughly into three types, each handled differently: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
    Step 2, feature extraction based on dynamic network representation
    After the basic enterprise features are extracted, the enterprise transaction information is organized into a static network with enterprises as nodes, basic enterprise information as node attributes, transaction records as edges, transaction information as edge attributes, and each day as a time node; a time window is then established in units of 30 days, 30 days of static network representations are fused within the window at a time, and the window is moved to progressively fuse the static network representations of all time nodes; the objective function of the network representation is optimized, finally yielding the optimal dynamic enterprise transaction network representation;
    Step 3, optimization based on a distributed algorithm
    To improve the learning efficiency of the dynamic network representation, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network representation is decomposed into independent sub-functions; optimizing the sub-functions in parallel accelerates the solution of large-scale, complex enterprise transaction network representations;
    Step 4, building a classifier to identify false invoice issuance
    A binary classification model is built on the LightGBM classifier, with the computed dynamic network representation as the learning data of the classifier. The model is trained with the labeled enterprise sample set, the representation results of the enterprise sample set to be predicted are then fed into the trained model for prediction, and finally it is determined from the output of the prediction model whether a target enterprise has engaged in false invoice issuance.
  3. The false invoice issuance identification method based on dynamic network representation according to claim 2, characterized in that step 1 is implemented as follows:
    Step 101, data preprocessing
    (1) Extract the "taxpayer electronic file number" as the unique identifier of the enterprise features;
    (2) Handle missing values: attributes with severe data loss and attributes unrelated to the false invoice issuance task are deleted directly; important attributes with a small number of missing values are completed by same-class mean imputation;
    Step 102, processing text data
    The processing of text information in the enterprise basic information table includes:
    (1) segmenting the text data of the enterprise into words with the Jieba word segmentation tool;
    (2) counting the segmentation results with a dictionary tree (trie) and selecting the words with larger weights as keywords;
    (3) converting the extracted N categories of keywords into vectors based on word2vec;
    Step 103, processing flag-type (categorical) data
    Discrete categorical data in the enterprise basic information table is One-Hot encoded: a vector of status bits is created whose length equals the number of values the attribute can take, and each bit marks one specific state;
    Step 104, processing numerical data
    Numerical data in the enterprise basic information table is processed with the conventional standardization method:
    (1) compute the mean of each attribute;
    (2) compute the variance of each attribute;
    (3) apply Z-Score standardization.
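Steps 103 and 104 can be illustrated with the short sketch below. It is an assumption-laden example rather than the claimed implementation: the function names are hypothetical, category bits are ordered alphabetically, and the population standard deviation is used because the text does not say whether the population or sample variance is intended.

```python
from statistics import fmean, pstdev

def one_hot(values: list[str]) -> list[list[int]]:
    """One status bit per distinct value; vector length = number of distinct values."""
    categories = sorted(set(values))          # assumption: alphabetical bit order
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

def z_score(values: list[float]) -> list[float]:
    """Subtract the attribute mean and divide by its standard deviation."""
    mean = fmean(values)
    std = pstdev(values)                      # assumption: population deviation
    if std == 0:
        return [0.0 for _ in values]          # constant attribute carries no signal
    return [(v - mean) / std for v in values]

encoded = one_hot(["A", "B", "A", "C"])
scaled = z_score([1.0, 2.0, 3.0])
print(encoded, scaled)
```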
  4. The false invoice issuance identification method based on dynamic network representation according to claim 3, characterized in that step 2 is implemented as follows:
    Step 201: Establish the static enterprise transaction network
    A representation model of the enterprise transaction network is built for each day, so that enterprises with similar topological structure or higher transaction weights lie closer together in the representation space. The objective function to be optimized is:
    Figure PCTCN2020113450-appb-100001
    where h_i and h_j are the representations of enterprises i and j, and w_ij is the weight of the transactions between them. Minimizing w_ij ||h_i - h_j||^2 forces the representations of enterprises i and j closer together the larger the transaction weight w_ij is;
    Minimizing the objective
    Figure PCTCN2020113450-appb-100002
    yields the optimized enterprise transaction network representation h for that day;
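The pairwise term described in step 201 can be evaluated as in the sketch below. Only the weighted squared distance w_ij ||h_i - h_j||^2 is taken from the text; the complete objective is given in the figure and may contain further terms or constraints that are not reproduced here. The function name and the example data are hypothetical.

```python
def static_objective(h: dict[str, list[float]],
                     w: dict[tuple[str, str], float]) -> float:
    """Sum of w_ij * ||h_i - h_j||^2 over all weighted edges (i, j)."""
    total = 0.0
    for (i, j), w_ij in w.items():
        sq_dist = sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
        total += w_ij * sq_dist
    return total

# Hypothetical two-dimensional representations of three enterprises.
h = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0]}
w = {("a", "b"): 3.0, ("a", "c"): 0.5}   # a trades far more heavily with b

before = static_objective(h, w)          # 3.0 * 1 + 0.5 * 4
h["b"] = [0.5, 0.0]                      # pull the heavy trading partner closer
after = static_objective(h, w)           # 3.0 * 0.25 + 0.5 * 4
print(before, after)
```

Moving the heavily weighted pair closer lowers the value far more than moving the lightly weighted pair would, which is the behavior the step describes.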
    Step 202: Dynamically fuse historical information
    A time window of length 30 days is established; within the window, 30 days of static network representations are fused at a time, and the window is then moved to progressively fuse all static network representations, finally yielding the dynamic enterprise transaction network representation. The corresponding optimization objective is:
    Figure PCTCN2020113450-appb-100003
    where
    Figure PCTCN2020113450-appb-100004
    denote, respectively, the representations of enterprises p and q on day t and the weight of the transactions between them;
    Figure PCTCN2020113450-appb-100005
    denotes how close the representations of enterprise p and enterprise q are; H_i denotes the network representation on day i within the time window; the penalty term
    Figure PCTCN2020113450-appb-100006
    keeps the learned representation matrix as close as possible to the matrix of the original enterprise transaction network; ρ is a parameter that governs the structural characteristics of the model and how much the approximation to the original matrix contributes: the larger ρ is, the more the model emphasizes the temporal network representation, and the smaller ρ is, the more it emphasizes the node representations;
    Minimizing the objective
    Figure PCTCN2020113450-appb-100007
    yields the optimized dynamic enterprise transaction network representation H.
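The window movement in step 202 can be sketched as an index generator. The 30-day length comes from the text; the stride of one day and the handling of series shorter than one window are assumptions, and `sliding_windows` is a hypothetical helper name.

```python
from typing import Iterator

def sliding_windows(num_days: int, length: int = 30, step: int = 1) -> Iterator[range]:
    """Yield day-index ranges [start, start + length) covering days 0..num_days-1."""
    if num_days < length:
        # Assumption: a series shorter than one window is fused in a single pass.
        yield range(num_days)
        return
    for start in range(0, num_days - length + 1, step):
        yield range(start, start + length)

windows = list(sliding_windows(32))
print([(win[0], win[-1]) for win in windows])
```

Each yielded range identifies the daily static representations that would be fused together before the window advances.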
  5. The false invoice issuance identification method based on dynamic network representation according to claim 4, characterized in that step 3 is implemented as follows:
    Step 301, decompose the objective function
    Optimization function (2) is restructured and written in a decomposable form:
    Figure PCTCN2020113450-appb-100008
    where
    Figure PCTCN2020113450-appb-100009
    denote, respectively, the representations of enterprises p and q on day t and the weight of the transactions between them;
    Figure PCTCN2020113450-appb-100010
    denotes how close the representations of enterprise p and enterprise q are; the penalty term
    Figure PCTCN2020113450-appb-100011
    builds on the approximation to the matrix of the original enterprise transaction network in formula (2) and splits the data so that the computation is carried out per enterprise;
    Minimizing the objective
    Figure PCTCN2020113450-appb-100012
    yields the optimized dynamic enterprise transaction network representation H;
    Step 302, execute multiple sub-functions in parallel
    Formula (3) is decomposed into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network. The sub-functions are solved in parallel to obtain
    Figure PCTCN2020113450-appb-100013
    Figure PCTCN2020113450-appb-100014
    where
    Figure PCTCN2020113450-appb-100015
    denotes the enterprises associated with enterprise v,
    Figure PCTCN2020113450-appb-100016
    denotes the representation of enterprise v on day t,
    Figure PCTCN2020113450-appb-100017
    denotes the representation of enterprise v on day t after k iterations,
    Figure PCTCN2020113450-appb-100018
    denotes the weight of the transactions between enterprises v and q on day t,
    Figure PCTCN2020113450-appb-100019
    denotes how close the representations of enterprise v and enterprise q are on day t after (k-1) iterations;
    Figure PCTCN2020113450-appb-100020
    denotes how close the representations of enterprise v on day i and on day t are;
    where
    Figure PCTCN2020113450-appb-100021
    is the representation of enterprise v on day t that is to be solved. An iterative optimization method is used to judge whether the computed result has reached the required precision: the sub-function is solved by gradient descent, and when the convergence condition
    Figure PCTCN2020113450-appb-100022
    or
    Figure PCTCN2020113450-appb-100023
    is met, the optimization function attains its optimal value. That is, updating stops either when the results of the k-th and (k-1)-th iterations for an enterprise agree to the required precision, or when the iteration result for an enterprise is sufficiently close to those of its associated enterprises; the representation obtained at the k-th iteration is then the representation of that enterprise for that day;
    Step 303, consolidate the parallel results
    Computing the N nodes of the transaction network in parallel yields the representation of every enterprise on day t; then, for the dynamic transaction network distributed over time nodes 1 to T, the network representation at each time node is computed in order.
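The decomposition in steps 301 to 303 can be illustrated with a per-enterprise gradient descent that runs in parallel across nodes and sequentially across days. The sub-objective below is an assumed form pieced together from the verbal definitions (a weighted distance to associated enterprises plus a ρ-weighted distance to the enterprise's own earlier representations); the exact formulas are in the figures and are not reproduced. All names, the learning rate, and the tolerance are hypothetical, and only the first convergence condition (change between successive iterations) is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_node(v, h_prev, neighbors, weights, history, rho,
               lr=0.05, tol=1e-8, max_iter=10_000):
    """Gradient descent on the assumed sub-objective
    f_v(h) = sum_q w_vq * ||h - h_q||^2 + rho * sum_i ||h - h_v(day i)||^2,
    holding neighbor representations fixed at their previous values."""
    h = list(h_prev[v])
    for _ in range(max_iter):
        grad = [0.0] * len(h)
        for q in neighbors[v]:
            for d, x in enumerate(h):
                grad[d] += 2.0 * weights[(v, q)] * (x - h_prev[q][d])
        for past in history[v]:
            for d, x in enumerate(h):
                grad[d] += 2.0 * rho * (x - past[d])
        new_h = [x - lr * g for x, g in zip(h, grad)]
        change = sum((a - b) ** 2 for a, b in zip(new_h, h)) ** 0.5
        h = new_h
        if change < tol:  # successive iterates agree to the required precision
            break
    return v, h

def solve_day(h_prev, neighbors, weights, history, rho, workers=4):
    """Solve the N independent sub-problems of one day in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda v: solve_node(v, h_prev, neighbors, weights, history, rho),
            h_prev)
        return dict(results)

def solve_all_days(h0, neighbors_by_day, weights_by_day, rho, window=30):
    """Process time nodes in order, keeping at most `window` past days per node."""
    history = {v: [] for v in h0}
    h, out = h0, []
    for neighbors, weights in zip(neighbors_by_day, weights_by_day):
        h = solve_day(h, neighbors, weights, history, rho)
        out.append(h)
        for v in h:
            history[v] = (history[v] + [h[v]])[-window:]
    return out

# Hypothetical two-enterprise network.
h0 = {"a": [0.0, 0.0], "b": [2.0, 2.0]}
neighbors = {"a": ["b"], "b": ["a"]}
weights = {("a", "b"): 3.0, ("b", "a"): 3.0}
history = {"a": [[0.0, 0.0]], "b": [[2.0, 2.0]]}
h1 = solve_day(h0, neighbors, weights, history, rho=1.0)
days = solve_all_days(h0, [neighbors], [weights], rho=1.0)
```

With these inputs each sub-problem has the closed-form minimizer (3 * neighbor + 1 * own history) / 4, so `h1["a"]` approaches [1.5, 1.5] and `h1["b"]` approaches [0.5, 0.5]. A thread pool is used only to keep the sketch self-contained; for CPU-bound work in CPython a process pool or a cluster framework would be the more likely choice.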
  6. The false invoice issuance identification method based on dynamic network representation according to claim 5, characterized in that step 4 is implemented as follows:
    Step 401, combine the basic features obtained in step 1 with the dynamic network features obtained in step 3 as the learning data of the classifier;
    Step 402, build a binary classification model on LightGBM, with the main parameters of the classifier set as follows: number of leaves 13, learning rate 0.1, number of iterations 100;
    Step 403, take the representation results obtained for the sample set of enterprises labeled as issuing false invoices and for the sample set of normal enterprises as the base features, and randomly divide them in a 3:1 ratio into a training set and a test set; a further ten percent of the training set is randomly split off as a validation set; the classification model of step 2 is trained on the training set and the training is adjusted with the validation set, with pruning applied if overfitting occurs; the best model is selected and the accuracy of the algorithm is verified on the test set;
    Step 404, input the representation results of the unlabeled enterprise samples into the LightGBM-based prediction model for enterprises suspected of false invoice issuance, and finally determine, based on the output of the prediction model, whether a target enterprise has engaged in false invoice issuance.
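The data split of step 403 and the parameter settings of step 402 can be sketched as follows. The helper names and the fixed random seed are hypothetical, and mapping "number of iterations" to the `n_estimators` argument of LightGBM's scikit-learn style `LGBMClassifier` is an assumption. The third-party `lightgbm` package is imported lazily so that the split logic runs without it.

```python
import random

def split_indices(n: int, seed: int = 0):
    """3:1 train/test split, then 10% of the training part held out for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = n // 4                        # one part in four goes to the test set
    test, train_full = idx[:n_test], idx[n_test:]
    n_val = len(train_full) // 10          # ten percent of the training part
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test

def build_classifier():
    """Binary LightGBM classifier with the parameters stated in step 402."""
    from lightgbm import LGBMClassifier    # third-party; imported only when called
    return LGBMClassifier(objective="binary", num_leaves=13,
                          learning_rate=0.1, n_estimators=100)

train, val, test = split_indices(400)
print(len(train), len(val), len(test))
```

For 400 labeled samples this gives 270 training, 30 validation, and 100 test indices. A classifier returned by `build_classifier()` could then be fitted on the training rows with the validation rows passed as `eval_set`, and its predicted probabilities used as the predicted values in step 404.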
PCT/CN2020/113450 2019-11-04 2020-09-04 False invoice issuing identification method and system based on dynamic network representation WO2021088499A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911066791.7A CN110852856B (en) 2019-11-04 2019-11-04 Invoice false invoice identification method based on dynamic network representation
CN201911066791.7 2019-11-04

Publications (1)

Publication Number Publication Date
WO2021088499A1 true WO2021088499A1 (en) 2021-05-14

Family

ID=69598895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113450 WO2021088499A1 (en) 2019-11-04 2020-09-04 False invoice issuing identification method and system based on dynamic network representation

Country Status (2)

Country Link
CN (1) CN110852856B (en)
WO (1) WO2021088499A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326377A (en) * 2021-06-02 2021-08-31 上海生腾数据科技有限公司 Name disambiguation method and system based on enterprise incidence relation
CN113642735A (en) * 2021-07-28 2021-11-12 浪潮软件科技有限公司 Continuous learning method for pseudo-tax payer identification
CN114219287A (en) * 2021-12-15 2022-03-22 中国软件与技术服务股份有限公司 Taxpayer risk evaluation method based on graph neural network
CN115334005A (en) * 2022-03-31 2022-11-11 北京邮电大学 Encrypted flow identification method based on pruning convolution neural network and machine learning

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852856B (en) * 2019-11-04 2022-10-25 西安交通大学 Invoice false invoice identification method based on dynamic network representation
CN111382843B (en) * 2020-03-06 2023-10-20 浙江网商银行股份有限公司 Method and device for establishing enterprise upstream and downstream relationship identification model and mining relationship
CN111966889B (en) * 2020-05-20 2023-04-28 清华大学深圳国际研究生院 Generation method of graph embedded vector and generation method of recommended network model
CN111724241B (en) * 2020-06-05 2024-03-29 西安交通大学 Enterprise invoice virtual issuing detection method based on dynamic edge feature graph annotation meaning network
CN112215616B (en) * 2020-11-30 2021-04-30 四川新网银行股份有限公司 Method and system for automatically identifying abnormal fund transaction based on network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140108461A1 (en) * 2009-01-07 2014-04-17 Oracle International Corporation Generic Ontology Based Semantic Business Policy Engine
CN104103011A (en) * 2014-07-10 2014-10-15 西安交通大学 Suspicious taxpayer recognition method based on taxpayer interest incidence network
CN106920162A (en) * 2017-03-14 2017-07-04 西京学院 A kind of detection method of writing out falsely special invoices of increasing taxes based on loap-paralled track detection
CN109583978A (en) * 2018-11-30 2019-04-05 税友软件集团股份有限公司 The method, device and equipment of invoice enterprise is write out falsely in a kind of identification
CN110852856A (en) * 2019-11-04 2020-02-28 西安交通大学 Invoice false invoice identification method based on dynamic network representation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2679209C2 (en) * 2014-12-15 2019-02-06 Общество с ограниченной ответственностью "Аби Продакшн" Processing of electronic documents for invoices recognition
CN106780001A (en) * 2016-12-26 2017-05-31 税友软件集团股份有限公司 A kind of invoice writes out falsely enterprise supervision recognition methods and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU HONGCHAO; HE HUAN; ZHENG QINGHUA; DONG BO: "TaxVis: a Visual System for Detecting Tax Evasion Group", THE WORLD WIDE WEB CONFERENCE, ACM, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, 13 May 2019 (2019-05-13) - 17 May 2019 (2019-05-17), pages 3610 - 3614, XP058471442, ISBN: 978-1-4503-6674-8, DOI: 10.1145/3308558.3314144 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326377A (en) * 2021-06-02 2021-08-31 上海生腾数据科技有限公司 Name disambiguation method and system based on enterprise incidence relation
CN113326377B (en) * 2021-06-02 2023-10-13 上海生腾数据科技有限公司 Name disambiguation method and system based on enterprise association relationship
CN113642735A (en) * 2021-07-28 2021-11-12 浪潮软件科技有限公司 Continuous learning method for pseudo-tax payer identification
CN113642735B (en) * 2021-07-28 2023-07-18 浪潮软件科技有限公司 Continuous learning method for identifying virtual tax payers
CN114219287A (en) * 2021-12-15 2022-03-22 中国软件与技术服务股份有限公司 Taxpayer risk evaluation method based on graph neural network
CN115334005A (en) * 2022-03-31 2022-11-11 北京邮电大学 Encrypted flow identification method based on pruning convolution neural network and machine learning
CN115334005B (en) * 2022-03-31 2024-03-22 北京邮电大学 Encryption flow identification method based on pruning convolutional neural network and machine learning

Also Published As

Publication number Publication date
CN110852856B (en) 2022-10-25
CN110852856A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
WO2021088499A1 (en) False invoice issuing identification method and system based on dynamic network representation
CN110415111A (en) Merge the method for logistic regression credit examination & approval with expert features based on user data
Mehta et al. Stock price prediction using machine learning and sentiment analysis
CN104850868A (en) Customer segmentation method based on k-means and neural network cluster
CN108734567A (en) A kind of asset management system and its appraisal procedure based on big data artificial intelligence air control
CN111783829A (en) Financial anomaly detection method and device based on multi-label learning
CN111754317A (en) Financial investment data evaluation method and system
CN110689437A (en) Communication construction project financial risk prediction method based on random forest
CN113590807A (en) Scientific and technological enterprise credit evaluation method based on big data mining
Ding et al. A novel hybrid method for oil price forecasting with ensemble thought
CN111626331B (en) Automatic industry classification device and working method thereof
CN112329862A (en) Decision tree-based anti-money laundering method and system
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field
Zhai et al. Big data analysis of accounting forecasting based on machine learning
CN111724241A (en) Enterprise invoice virtual invoice detection method based on dynamic edge feature enhanced graph attention network
WO2022143431A1 (en) Method and apparatus for training anti-money laundering model
Zhang A model combining LightGBM and neural network for high-frequency realized volatility forecasting
Mao et al. Information system construction and research on preference of model by multi-class decision tree regression
Najadat et al. Performance evaluation of industrial firms using DEA and DECORATE ensemble method.
Yang et al. Reform and competitive selection in China: An analysis of firm exits
Hong et al. Early warning of enterprise financial risk based on decision tree algorithm
Yang et al. An evidential reasoning rule-based ensemble learning approach for evaluating credit risks with customer heterogeneity
Guo et al. Statistical decision research of long-term deposit subscription in banks based on decision tree
CN114187081A (en) Estimated value table processing method and device, electronic equipment and computer readable storage medium
CN111967937A (en) E-commerce recommendation system based on time series analysis and implementation method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20884592

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20884592

Country of ref document: EP

Kind code of ref document: A1