CN110852856A - Invoice false invoice identification method based on dynamic network representation - Google Patents


Info

Publication number
CN110852856A
Authority
CN
China
Prior art keywords
enterprise
network
characterization
invoice
transaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911066791.7A
Other languages
Chinese (zh)
Other versions
CN110852856B (en)
Inventor
董博
郑庆华
范弘铖
田雨润
高宇达
袁靖松
阮建飞
张发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201911066791.7A (patent CN110852856B)
Publication of CN110852856A
Priority to PCT/CN2020/113450 (WO2021088499A1)
Application granted
Publication of CN110852856B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Technology Law (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for identifying false invoicing based on dynamic network representation. First, enterprise transaction information is organized into a static network, with enterprises as nodes and transaction records as edges. Second, a representation of the enterprise transaction network is built with each day as a time node: a time window of 30 days is established, the 30 daily static network representations inside the window are fused each time, and the window is moved so that the static network representations of all time nodes are gradually fused into the final dynamic network representation. Third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions that are optimized in parallel, which improves the learning efficiency of the model. Finally, a LightGBM-based classifier is built to identify enterprises suspected of false invoicing. By identifying suspect enterprises from the dynamic network representation, the method improves both the efficiency and the accuracy of false-invoicing identification.

Description

Invoice false invoice identification method based on dynamic network representation
Technical Field
The invention belongs to the technical field of tax control, and in particular relates to a method for identifying false invoicing based on dynamic network representation.
Background
False invoicing means that an enterprise uses various means to issue invoices that do not match its actual business operations, with the aim of evading tax.
False invoicing causes large losses of national revenue and seriously disrupts the national economic order. Tax authorities currently identify enterprises suspected of false invoicing mainly through tip-offs, routine supervision and spot checks, and leads from enterprises already found to be problematic; tax inspectors then examine the reports submitted by the enterprise. Such inspections are highly incidental and cannot systematically analyze and evaluate all enterprises. Moreover, manual inspection is laborious and inefficient, the data examined is limited to the reports of a single enterprise, and related upstream and downstream enterprises cannot be considered together.
Network representation techniques offer a solution to these problems. A false-invoicing identification method based on network representation organizes isolated report information into an enterprise transaction network, so that all enterprises can be checked systematically, and uses the connections among enterprises to obtain more information for identifying false-invoicing enterprises. The following patents provide related reference methods for computer-automated identification of false invoicing based on network representation techniques:
Document 1. A false value-added tax special invoice detection method based on parallel loop detection (201710147850.8);
Document 2. A suspected taxpayer identification method based on a taxpayer benefit correlation network (201410328391.X);
Document 1 organizes invoice information into a static network with enterprises as nodes and improves loop detection in that network: a distributed parallel computing method assigns the computation to multiple computers in a cluster to improve efficiency, and the improved loop detection is then used to detect falsely issued special value-added tax invoices.
Document 2 identifies suspected taxpayers from the topological features of the taxpayer benefit correlation network (TPIN). It analyzes these topological features to obtain a representation of each taxpayer in the network and then runs experiments with a C4.5 classifier, thereby identifying suspected taxpayers automatically.
The methods described in these documents have the following main problems. Document 1 can only detect false invoicing in which funds pass through several accounts and return to the source account; false invoicing takes many forms beyond loops, so the method covers too narrow a range of cases and the model generalizes poorly. Document 2 ignores enterprise attribute information and characterizes enterprises only by the topology of taxpayers and their interest relations, so it cannot analyze an enterprise in terms of its scale, market share, and similar attributes. Documents 1 and 2 are both limited to static networks: they cannot analyze changes in enterprise transactions dynamically using historical information and cannot accurately capture those changes, which leaves some enterprises room to evade detection. For example, each annual statement of a tax-evading enterprise may show nothing wrong when viewed alone while the enterprise reports losses for years, yet its water and electricity costs rise year after year. False invoicing is often hidden in such time-related features, which a static network cannot capture.
Disclosure of Invention
To improve the efficiency of false-invoicing identification, the invention aims to provide a method for identifying false invoicing based on dynamic network representation. The invention uses dynamic network representation to analyze the enterprise transaction network dynamically in combination with historical information and to capture changes in enterprise transactions accurately; it can identify different kinds of false invoicing from the association information among enterprises; and, drawing on distributed optimization algorithms, it decomposes the computation into independent sub-functions that are executed in parallel, which improves identification efficiency.
To achieve this purpose, the invention adopts the following technical solution:
A false-invoicing identification method based on dynamic network representation: first, enterprise transaction information is organized into a static network, with enterprises as nodes and transaction records as edges; second, a representation of the enterprise transaction network is built with each day as a time node, a time window of 30 days is established, the 30 daily static network representations in the window are fused each time, and the window is moved so that the static network representations of all time nodes are gradually fused into the final dynamic network representation; third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions that are optimized in parallel to improve the learning efficiency of the model; finally, a LightGBM-based classifier is built to identify enterprises suspected of false invoicing.
The invention is further improved in that the method specifically comprises the following implementation steps:
1) Basic feature extraction
The data is first preprocessed, and basic enterprise information is then extracted. This information falls roughly into three types, each handled differently: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
2) Feature extraction based on dynamic network representation
After the basic features are extracted, the enterprise transaction information is organized into a static network for each day, with each day as a time node, enterprises as nodes, basic enterprise information as node attributes, transaction records as edges, and transaction information as edge attributes. A time window of 30 days is then established; the 30 daily static network representations in the window are fused each time, and the window is moved so that the static network representations of all time nodes are gradually fused. The objective function of the network representation is optimized to obtain the optimal dynamic enterprise transaction network representation;
3) Distributed algorithm optimization
To improve the learning efficiency of the dynamic network representation, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network representation is decomposed into independent sub-functions. Optimizing the sub-functions in parallel accelerates the solution for large, complex enterprise transaction networks;
4) Constructing a classifier to identify false invoicing
A binary classification model is built on a LightGBM classifier, with the computed dynamic network representation as its learning data. The model is trained on a labeled enterprise sample set; the representations of the enterprises to be predicted are then fed into the trained model, and whether a target enterprise has engaged in false invoicing is determined from the model's output.
The invention is further improved in that step 1) is implemented as follows:
Step 101: Data preprocessing
(1) Extract the 'taxpayer electronic file number' as the unique identifier of an enterprise;
(2) Handle missing values: attributes with severe data loss and attributes irrelevant to the false-invoicing task are deleted directly, while important attributes with a small number of missing values are filled in by same-class mean interpolation;
Step 102: Processing text data
The text information in the enterprise basic information table is processed as follows:
(1) Segment the enterprise's text data with the Jieba word segmentation tool;
(2) Count the segmentation results with a dictionary tree (trie) and select the words with higher weight as keywords;
(3) Convert the N extracted types of keywords into vectors with word2vec;
Step 103: Processing categorical data
Discrete data in the enterprise basic information table is One-Hot encoded: the number of possible attribute values is taken as the length of the state bits, and each specific state is marked by its own bit;
Step 104: Processing numerical data
Numerical data in the enterprise basic information table is processed by conventional standardization:
(1) compute the mean of each attribute;
(2) compute the variance of each attribute;
(3) apply Z-Score normalization.
The invention is further improved in that step 2) is implemented as follows:
Step 201: Establishing static enterprise transaction networks
A representation model of the enterprise transaction network is established for each day, so that enterprises with similar topological structures or higher transaction weights lie closer together in the representation space. The objective function is:
$$\min_{h} \sum_{i,j} w_{ij}\,\lVert h_i - h_j \rVert^2 \qquad (1)$$
where $h_i$ and $h_j$ are the representations of enterprises i and j, and $w_{ij}$ is the weight of the transactions between them. Minimizing $w_{ij}\lVert h_i - h_j \rVert^2$ forces the representations $h_i$ and $h_j$ closer together the larger the transaction weight $w_{ij}$ is.
Minimizing objective (1) yields the optimized enterprise transaction network representation h for that day;
Step 202: Dynamically fusing historical information
A time window of 30 days is established, and the 30 daily static network representations in the window are fused each time. The window is then moved so that all static network representations are gradually fused, finally yielding the dynamic enterprise transaction network representation. The corresponding optimization objective is:
[Equation (2): shown as an image in the original publication and not reproduced here]
where $h_t^p$, $h_t^q$, and $w_t^{pq}$ respectively denote the representations of enterprises p and q and the weight of the transactions between them on day t, and $w_t^{pq}\lVert h_t^p - h_t^q \rVert^2$ measures how close the representations of enterprise p and enterprise q are; $H_i$ denotes the network representation for day i within the time window. The penalty term makes the matrix learned by the representation approach the matrix of the original enterprise transaction network as closely as possible. Here ρ is a parameter that balances the model's structural features against its closeness to the original matrix: the larger ρ is, the more the model emphasizes the time-series network representation, and the smaller ρ is, the more it emphasizes the node representation.
Minimizing objective (2) yields the optimized dynamic enterprise transaction network representation H.
The invention is further improved in that step 3) is implemented as follows:
Step 301: Decomposing the objective function
Optimization function (2) is reconstructed and written in decomposable form:
[Equation (3): shown as an image in the original publication and not reproduced here]
where $h_t^p$, $h_t^q$, and $w_t^{pq}$ respectively denote the representations of enterprises p and q and the weight of the transactions between them on day t, and $w_t^{pq}\lVert h_t^p - h_t^q \rVert^2$ measures how close the representations of enterprise p and enterprise q are. The penalty term, in addition to pulling the learned matrix toward the matrix of the original enterprise transaction network as in formula (2), partitions the data by individual enterprise for computation.
Minimizing objective (3) yields the optimized dynamic enterprise transaction network representation H;
Step 302: Executing multiple sub-functions in parallel
Formula (3) is decomposed into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network. The N sub-optimization functions are solved in parallel to obtain $H_t^{k+1}$.
The symbols of the sub-optimization function (shown as images in the original publication and not reproduced here) denote, in order: the set of enterprises associated with enterprise v; $h_t^v$, the representation of enterprise v on day t; the representation of enterprise v on day t after k iterations; $w_t^{vq}$, the weight of the transactions between enterprises v and q on day t; the closeness of the representations of enterprise v and enterprise q on day t after (k-1) iterations; and the closeness of the representations of enterprise v on day i and on day t.
The unknown to be solved is the representation of enterprise v on day t. An iterative optimization method is used, with a check on whether the result meets the required accuracy. The problem is solved with a gradient descent algorithm, and the optimization function is taken to have reached its optimum when either of two convergence conditions (given as images in the original publication) is satisfied: the results of the k-th and (k-1)-th iterations for an enterprise agree to the required accuracy, or the iteration result of an enterprise is sufficiently close to those of its related enterprises. Updating then stops, and the result of the k-th iteration is taken as the representation of the enterprise for that day;
Step 303: Consolidating the parallel results
Computing the N nodes of the transaction network in parallel gives the representation of every enterprise on day t. For a dynamic transaction network distributed over time nodes 1 to T, the network representation at each time node is then computed in sequence.
The invention is further improved in that step 4) is implemented as follows:
Step 401: The basic features obtained in step 1) and the dynamic network features obtained in step 3) are combined as the learning data of the classifier;
Step 402: A binary classification model is built on LightGBM, with the main classifier parameters set as follows: number of leaves 13, learning rate 0.1, and number of iterations 100;
Step 403: The representations of the enterprise samples labeled as false-invoicing and of the normal enterprise samples are used as features and randomly split 3:1 into a training set and a test set; ten percent of the training set is then randomly set aside as a validation set. The classification model from step 402 is trained on the training set and tuned on the validation set, with pruning applied if overfitting occurs. The best model is selected and its accuracy is verified on the test set;
Step 404: The representations of unlabeled enterprise samples are fed into the LightGBM-based prediction model for enterprises suspected of false invoicing, and whether a target enterprise has engaged in false invoicing is determined from the model's output.
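The data split and classifier settings of steps 402 and 403 can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the function name `split_samples`, the fixed seed, and the integer sample IDs are assumptions, and the parameter dictionary simply restates the three values from step 402 using the names the LightGBM scikit-learn interface accepts.

```python
import random

# Parameter values from step 402; the dictionary itself is illustrative and could be
# passed as lightgbm.LGBMClassifier(**LIGHTGBM_PARAMS) where LightGBM is installed.
LIGHTGBM_PARAMS = {
    "objective": "binary",
    "num_leaves": 13,
    "learning_rate": 0.1,
    "n_estimators": 100,  # number of boosting iterations
}

def split_samples(samples, seed=0):
    """Split labeled samples 3:1 into train/test, then hold out 10% of train for validation."""
    rng = random.Random(seed)       # fixed seed is an assumption, for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 4     # one part in four goes to the test set
    test = shuffled[:n_test]
    train_full = shuffled[n_test:]
    n_val = len(train_full) // 10   # ten percent of the training data
    val = train_full[:n_val]
    train = train_full[n_val:]
    return train, val, test

samples = list(range(400))          # hypothetical sample IDs
train, val, test = split_samples(samples)
print(len(train), len(val), len(test))
```

The three subsets are disjoint and together cover every sample, so no enterprise used for tuning or testing is seen during training.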
The invention has at least the following beneficial technical effects:
The invention provides a method for identifying enterprises suspected of false invoicing based on dynamic network representation learning, with the following advantages:
1. By using dynamic network representation in combination with historical information, representation vectors are learned and fused across the networks of all time nodes, so that changes in the enterprise transaction network are captured accurately and the accuracy of false-invoicing identification is improved;
2. Different types of false invoicing can be identified from the association information among enterprises;
3. Drawing on distributed optimization algorithms, the computation is decomposed into independent sub-functions that are executed in parallel, which reduces the time complexity of computing the network representation and improves identification efficiency.
Drawings
FIG. 1 is an overall framework flow diagram.
Fig. 2 is a schematic diagram of a basic feature extraction process.
Fig. 3 is a schematic diagram of a feature extraction process based on dynamic network characterization.
Fig. 4 is a schematic diagram of a network characterization algorithm optimization flow.
FIG. 5 is a schematic diagram of a process of constructing a classifier to identify false invoices.
Detailed Description
The method for identifying false invoicing based on dynamic network representation is described in detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the method includes the following steps:
S101. Extracting basic features
After the data is preprocessed, basic enterprise information is extracted. This information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized.
As shown in Fig. 2, basic feature extraction is implemented in the following steps:
S201. Data preprocessing
Step 1: Extract the "taxpayer electronic file number" as the unique identifier of an enterprise, and directly delete the other attributes that cannot describe the enterprise's own distribution pattern;
Step 2: When an attribute has many missing values and only very few valid values, the feature is deleted directly; for example, the attributes "taxpayer tax authority code", "financial statement type", and "accounting form" have values for fewer than 10% of the enterprises. When an attribute has few missing values, the missing values are filled in by same-class mean interpolation; for example, the attributes "number of employees" and "registered capital" are missing only for individual enterprises.
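The mean interpolation in Step 2 can be sketched as below. The patent does not say how "same class" is defined, so grouping by an `industry` field is an assumption made here for illustration, as are the function name and the sample records.

```python
from collections import defaultdict

def impute_class_mean(rows, group_key, attr):
    """Return copies of `rows` with missing `attr` values replaced by the group mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[attr] is not None:
            sums[row[group_key]] += row[attr]
            counts[row[group_key]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input records untouched
        group = new_row[group_key]
        if new_row[attr] is None and counts[group] > 0:
            new_row[attr] = sums[group] / counts[group]
        filled.append(new_row)
    return filled

# Hypothetical records; the field names are illustrative only
enterprises = [
    {"id": "e1", "industry": "ceramics", "registered_capital": 100.0},
    {"id": "e2", "industry": "ceramics", "registered_capital": 300.0},
    {"id": "e3", "industry": "ceramics", "registered_capital": None},
    {"id": "e4", "industry": "software", "registered_capital": 50.0},
]
completed = impute_class_mean(enterprises, "industry", "registered_capital")
print(completed[2]["registered_capital"])
```

A value stays missing if its whole group has no valid entry, which signals that the attribute may be a candidate for deletion instead.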
S202. Processing text data
The text data "cargo information" and "business range" in the enterprise basic information table is preprocessed and its features are extracted. Text data is processed in the following steps:
Step 1: Segment the text with the Jieba word segmentation tool, build a suitable stop-word list, and remove the stop words from the text. For example, in this embodiment the "business range" field of a certain enterprise reads "production, sales: combining the ceramic products; import and export of goods, import and export of technology". After word segmentation and stop-word removal, the result is "import and export technology of producing and selling ceramic and goods";
Step 2: Count the results of Step 1 with a dictionary tree (trie) and select the words with larger weight as keywords;
Step 3: Convert the N types of keywords extracted in Step 2 into vectors with word2vec.
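Step 2 can be illustrated with a small dictionary tree that counts tokens and returns the most frequent ones. This sketch assumes the text has already been segmented (for example by Jieba) and treats raw frequency as the keyword weight, which the patent does not specify; the class and function names are hypothetical, and the word2vec conversion of Step 3 is not shown.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how many inserted words end at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def counts(self):
        """Return {word: count} for every word stored in the trie."""
        result = {}
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                result[prefix] = node.count
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return result

def top_keywords(tokens, n):
    trie = Trie()
    for token in tokens:
        trie.insert(token)
    # Highest count first; ties broken alphabetically for a deterministic order
    ranked = sorted(trie.counts().items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Hypothetical token stream after segmentation and stop-word removal
tokens = ["ceramics", "sales", "ceramics", "import", "export", "import", "ceramics"]
keywords = top_keywords(tokens, 2)
print(keywords)
```

Because tokens sharing a prefix share trie nodes, the structure stays compact when the vocabulary contains many related terms.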
S203. Processing categorical data
The discrete data "enterprise type" and "enterprise state" in the enterprise basic information table is One-Hot encoded. The number of possible values of an attribute gives the length of the state bits; one bit is set to 1 and the rest to 0 to represent a specific state. For example, the "enterprise type" field in this embodiment has four possible values: "sole proprietorship", "partnership", "limited liability company", and "joint-stock limited company". The state-bit length of "enterprise type" is therefore 4, where 1000 denotes "sole proprietorship", 0100 denotes "partnership", 0010 denotes "limited liability company", and 0001 denotes "joint-stock limited company".
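The encoding just described can be written in a few lines. The English category labels below are illustrative translations, and the helper name `one_hot` is an assumption.

```python
# The four "enterprise type" values of the embodiment, in state-bit order
ENTERPRISE_TYPES = [
    "sole proprietorship",
    "partnership",
    "limited liability company",
    "joint-stock limited company",
]

def one_hot(value, categories):
    """Return a state-bit list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]

code = one_hot("partnership", ENTERPRISE_TYPES)
print("".join(str(bit) for bit in code))
```

Raising an error for an unseen category is a design choice of this sketch; a production pipeline might instead reserve an extra bit for unknown values.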
S204. Processing numerical data
The numerical data "registered capital", "total investment", and "number of workers" in the enterprise basic information table is standardized. This embodiment takes "registered capital" as an example:
Step 1: Compute the mean of the "registered capital" attribute
Let u be the mean of the "registered capital" attribute, computed as:
$$u = \frac{1}{n}\sum_{j=1}^{n} x_j$$
where n is the number of enterprise basic information samples and $x_j$ is the j-th "registered capital" value;
Step 2: Compute the variance of the attribute
Let $\sigma^2$ be the variance of the "registered capital" attribute, computed as:
$$\sigma^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - u)^2$$
The mean and the variance are basic statistics of a numerical attribute, and the attribute can be standardized with them;
Step 3: Z-Score normalization
Let δ be the normalized value of "registered capital", where $\delta = (\delta_1, \delta_2, \ldots, \delta_n)$ and $\delta_j$ is the normalized value of the j-th "registered capital", computed as:
$$\delta_j = (x_j - u)/\sigma, \quad j = 1, 2, \ldots, n$$
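The three steps above amount to the following computation. The sketch assumes the population variance (division by n) and returns zeros for a constant attribute; both choices, like the sample values, are assumptions made for illustration.

```python
import math

def z_score(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    n = len(values)
    u = sum(values) / n                               # Step 1: mean
    variance = sum((x - u) ** 2 for x in values) / n  # Step 2: population variance
    sigma = math.sqrt(variance)
    if sigma == 0:
        return [0.0 for _ in values]                  # constant attribute: nothing to scale
    return [(x - u) / sigma for x in values]          # Step 3: Z-Score

registered_capital = [100.0, 200.0, 300.0]            # hypothetical values
delta = z_score(registered_capital)
print(delta)
```

After standardization, attributes measured on very different scales, such as capital and headcount, contribute comparably to the downstream classifier.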
S102. Feature extraction based on dynamic network representation
First, a static enterprise transaction network is established for each day, with enterprises as nodes, transaction records as edges, and each day as a time node. A time window of 30 days is then established; the 30 daily static network representations in the window are fused each time, and the window is moved so that the static network representations of all time nodes are gradually fused. The objective function of the network representation is optimized to obtain the optimal dynamic enterprise transaction network representation.
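Organizing transaction records into one static network per day might look like the sketch below. The record layout `(day, seller_id, buyer_id, amount)` and the use of the summed amount as the edge weight are assumptions; the patent only states that transaction records form the edges and transaction information forms the edge attributes.

```python
from collections import defaultdict

def build_daily_networks(transactions):
    """Group transaction records into one weighted edge map per day.

    Returns {day: {(seller_id, buyer_id): total_amount}}.
    """
    networks = defaultdict(lambda: defaultdict(float))
    for day, seller, buyer, amount in transactions:
        # Several invoices between the same pair on the same day are summed
        networks[day][(seller, buyer)] += amount
    return {day: dict(edges) for day, edges in networks.items()}

# Hypothetical records
records = [
    (1, "A", "B", 100.0),
    (1, "A", "B", 50.0),
    (1, "B", "C", 30.0),
    (2, "A", "C", 70.0),
]
daily = build_daily_networks(records)
print(daily[1])
```

Each day's edge map can then serve as the weights w_ij of that day's static network.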
As shown in Fig. 3, feature extraction based on dynamic network representation is implemented in the following steps:
Step 1: Establishing static enterprise transaction networks
A representation model of the enterprise transaction network is established for each day, with the following objective function:
$$\min_{h} \sum_{i,j} w_{ij}\,\lVert h_i - h_j \rVert^2 \qquad (1)$$
Minimizing objective (1) yields the representation h of each enterprise for the day, such that enterprises with similar transaction structures or high transaction weights lie closer together in the representation space; this in turn gives the representation of the whole enterprise transaction network for that day.
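To make the effect of this objective concrete, the sketch below evaluates a weighted sum of squared distances between connected enterprises and takes one gradient-descent step. The exact objective in the patent is given only as an image, so this form is an assumption based on the term w_ij||h_i - h_j||^2 described in the text; the source text does not state how a trivial solution (all representations equal) is prevented, so no constraint is modeled here.

```python
def static_objective(h, w):
    """Sum of w_ij * ||h_i - h_j||^2 over the weighted edges in `w`."""
    total = 0.0
    for (i, j), w_ij in w.items():
        total += w_ij * sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
    return total

def gradient_step(h, w, lr):
    """Return new representations after one gradient-descent step on the objective."""
    grad = {node: [0.0] * len(vec) for node, vec in h.items()}
    for (i, j), w_ij in w.items():
        for d, (a, b) in enumerate(zip(h[i], h[j])):
            g = 2.0 * w_ij * (a - b)
            grad[i][d] += g
            grad[j][d] -= g
    return {
        node: [x - lr * g for x, g in zip(vec, grad[node])]
        for node, vec in h.items()
    }

# Hypothetical 2-dimensional representations and transaction weights
h = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 2.0]}
w = {("A", "B"): 5.0, ("A", "C"): 1.0}
before = static_objective(h, w)
h_next = gradient_step(h, w, lr=0.05)
after = static_objective(h_next, w)
print(before, after)
```

In this toy example the heavily weighted pair A and B is pulled together much faster than the lightly weighted pair A and C, which is the behavior the text attributes to the objective.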
Step 2: dynamically fusing historical information
Gradually fusing all static enterprise transaction network representations in a time sequence window to finally obtain a dynamic enterprise transaction network representation, wherein the optimization goal is as follows:
the length of the time sequence window is 30 days, static network representations of 30 days are fused in the time sequence window every time, then the time sequence window is moved, all the static network representations are gradually fused, and the target is minimized
Figure BDA0002259626750000104
The day can be obtainedCharacterization H of each business. In this embodiment, it is found that when ρ is 0.75, the effect is the best, and in this case, the network characterization of the time sequence and the characterization of the node are considered in a more balanced manner;
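The movement of the 30-day window over the daily representations can be sketched as follows. The one-day step size is an assumption, and `fuse_mean` is only a stand-in: the patent fuses the window's representations by minimizing its objective, not by averaging.

```python
def sliding_windows(days, length=30, step=1):
    """Yield consecutive windows of `length` days, advancing by `step` days."""
    for start in range(0, len(days) - length + 1, step):
        yield days[start:start + length]

def fuse_mean(window_reprs):
    """Stand-in fusion: element-wise mean of the daily vectors in one window."""
    n = len(window_reprs)
    dim = len(window_reprs[0])
    return [sum(vec[d] for vec in window_reprs) / n for d in range(dim)]

days = list(range(1, 33))  # hypothetical time nodes: days 1 to 32
windows = list(sliding_windows(days))
print(len(windows), windows[0][0], windows[-1][-1])
```

With 32 days of data the window takes three positions, covering days 1 to 30, 2 to 31, and 3 to 32.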
s103. algorithm optimization based on distribution
Firstly, decomposing an objective function; then executing a plurality of sub-functions in parallel; and finally, comprehensively finishing the parallel results.
As shown in fig. 4, the specific steps of the optimization implementation process based on the distributed algorithm include:
s401, decomposing an objective function
Reconstructing the optimization function (2), writing it in decomposable form:
Figure BDA0002259626750000105
in this embodiment, 3765 enterprises are involved in the enterprise transaction network, so that each enterprise and its associated transaction network are calculated by taking N as 3765 and v as 1 to 3765; taking rho as 0.75, network characteristics and node characteristics which pay attention to time sequences in a balanced manner;
s402, executing a plurality of sub-functions in parallel
Decomposing the formula (3) into 3765 sub-optimization functions according to each enterprise v, solving the sub-optimization functions in parallel and finally merging to obtain Ht k +1Wherein the single sub-goal optimization function is:
Figure BDA0002259626750000111
In this embodiment, ρ is set to 0.75, which weights the time-sequence network characterization and the node characterization in a more balanced manner. The results of the sub-functions can be obtained by sequential calculation, where
Figure BDA0002259626750000112
is the characterization of each enterprise on day t after the k-th iteration, obtained by solving each sub-function, and $H_t^{k+1}$ is the characterization of the dynamic enterprise transaction network on day t after the k-th iteration;
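The decompose, solve in parallel, then merge pattern of S401 and S402 can be sketched as below. The per-enterprise sub-problem here is a stand-in chosen for illustration, not formula (4): it minimizes RHO·||x − h_v||² + Σ_q w_vq·||x − h_q||² using only the previous iterate, which is what makes the sub-problems independent. All names and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Vector = List[float]
RHO = 0.75  # value used in the embodiment


def solve_for_enterprise(v: int, h_prev: Dict[int, Vector],
                         weights: Dict[int, Dict[int, float]]) -> Vector:
    """Stand-in sub-problem for enterprise v (NOT formula (4)).

    Minimizes RHO*||x - h_v||^2 + sum_q w_vq*||x - h_q||^2 in closed form,
    using only the previous iterate, so every v can be solved independently.
    """
    neighbours = weights.get(v, {})
    denom = RHO + sum(neighbours.values())
    dim = len(h_prev[v])
    return [
        (RHO * h_prev[v][d]
         + sum(w * h_prev[q][d] for q, w in neighbours.items())) / denom
        for d in range(dim)
    ]


def parallel_iteration(h_prev: Dict[int, Vector],
                       weights: Dict[int, Dict[int, float]],
                       workers: int = 4) -> Dict[int, Vector]:
    """Solve all per-enterprise sub-problems in parallel, then merge."""
    enterprises = sorted(h_prev)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda v: solve_for_enterprise(v, h_prev, weights), enterprises)
        return dict(zip(enterprises, results))


# Toy data: three enterprises with one-dimensional characterizations.
h_k = {1: [0.0], 2: [1.0], 3: [4.0]}
w = {1: {2: 1.25}, 2: {1: 1.25, 3: 0.25}, 3: {2: 0.25}}
h_next = parallel_iteration(h_k, w)
```

A thread pool is used only to keep the sketch self-contained; for CPU-bound work in CPython, a process pool or a cluster framework would be needed to obtain a real speed-up.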
S403. Consolidating the parallel results
Formula (4) is solved with a gradient descent algorithm. In this embodiment, when
Figure BDA0002259626750000113
or
Figure BDA0002259626750000114
holds, the updates are stopped, and the nearly unchanged representation is taken as the representation of the enterprise transaction network for that day. Thus, for the dynamic transaction networks on days 1 to T, the network characterization for each day can be obtained by sequential calculation.
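The stop-when-nearly-unchanged rule can be sketched with a generic gradient descent loop. The tolerance, learning rate, iteration cap and toy objective below are illustrative assumptions; the actual thresholds of the embodiment appear only in the figures and are not reproduced.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def gradient_descent(grad: Callable[[Vector], Vector], x0: Vector,
                     lr: float = 0.1, tol: float = 1e-6,
                     max_iter: int = 100) -> Tuple[Vector, int]:
    """Iterate x <- x - lr*grad(x) until the update is smaller than `tol`.

    `tol` and `lr` are illustrative; the thresholds used in the
    embodiment are given in the figures and are not reproduced here.
    """
    x = list(x0)
    for k in range(1, max_iter + 1):
        g = grad(x)
        x_new = [xi - lr * gi for xi, gi in zip(x, g)]
        change = sum((a - b) ** 2 for a, b in zip(x_new, x)) ** 0.5
        x = x_new
        if change < tol:
            return x, k
    return x, max_iter  # forced stop: keep the last iterate


# Toy objective f(x) = ||x - c||^2 with gradient 2*(x - c).
c = [1.0, -2.0]
x_star, iters = gradient_descent(
    lambda x: [2 * (xi - ci) for xi, ci in zip(x, c)], [0.0, 0.0])
```

On this toy problem the loop stops well before the cap, with `x_star` close to `c`; if the tolerance is never met, the last iterate is returned.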
S104. Constructing a classifier to identify false invoices
First, the basic features obtained in step S101 and the dynamic network features obtained in step S102 are combined as the learning data of the classifier; second, a binary classification model is constructed based on the LightGBM classifier; then the model is trained with the enterprise sample set labeled as to whether invoices were falsely issued; finally, the characterization results of the enterprise sample set to be predicted are fed into the trained model, and whether a target enterprise has false invoicing behavior is determined from the output of the prediction model.
As shown in fig. 5, the specific steps for constructing a classifier to identify false invoicing include:
S501. Obtaining the learning data of the classifier
The basic features obtained in step S101 and the dynamic network features obtained in step S103 are combined as the learning data of the classifier. In this embodiment, the enterprise basic feature vector obtained in S101 is concatenated directly with the dynamic network feature vector obtained in S103 to form a new vector, which is used as the learning data of the classifier.
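A minimal sketch of the feature combination in S501, assuming both feature tables are keyed by an enterprise identifier. The identifiers, the feature values and the choice to skip enterprises missing from either table are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def build_learning_data(basic: Dict[str, Vector],
                        dynamic: Dict[str, Vector]) -> Dict[str, Vector]:
    """Concatenate basic and dynamic-network features per enterprise.

    Enterprises missing from either table are skipped (an assumption).
    """
    return {eid: basic[eid] + dynamic[eid] for eid in basic if eid in dynamic}


# Hypothetical enterprise ids and feature values.
basic_features = {"E001": [0.2, 1.0, 0.0], "E002": [0.7, 0.0, 1.0]}
network_features = {"E001": [0.11, -0.35], "E002": [0.42, 0.08]}
learning_data = build_learning_data(basic_features, network_features)
```

Each resulting vector holds the basic features first, followed by the dynamic network features.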
S502, constructing a binary classification model based on LightGBM
The main parameters for setting the classifier are: the number of leaves is 13, the learning rate is 0.1, and the iteration number is 100;
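The stated settings could be expressed for the LightGBM scikit-learn interface roughly as follows. Reading "iteration number" as the number of boosting rounds (`n_estimators`) is an assumption, as are the names `LGBM_PARAMS` and `build_classifier`.

```python
LGBM_PARAMS = {
    "objective": "binary",   # two classes: false invoicing vs. normal
    "num_leaves": 13,        # "number of leaves"
    "learning_rate": 0.1,
    "n_estimators": 100,     # "iteration number", read as boosting rounds
}


def build_classifier():
    """Create the binary classifier; requires the `lightgbm` package."""
    # Imported lazily so the settings can be inspected without the package.
    import lightgbm as lgb
    return lgb.LGBMClassifier(**LGBM_PARAMS)
```

Calling `build_classifier().fit(X_train, y_train)` would then train the model on the combined feature vectors, with `X_train` and `y_train` supplied by the caller.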
S503. Training the model
Step 1: The characterization results obtained for the enterprise sample set marked as falsely invoicing and for the normal enterprise sample set are taken as the basic features and randomly divided at a ratio of 3:1 into a training set and a test set.
Step 2: Ten percent of the data in the training set is randomly held out as the validation set.
Step 3: The classification model constructed in S502 is trained on the training set and tuned on the validation set, and pruning is performed when over-fitting occurs;
Step 4: Iterative computation is performed with the number of iterations set to 100; if the convergence condition is not reached within 100 iterations, iteration is forcibly stopped and the result of the last iteration is taken as the computed representation.
Step 5: The optimal model is selected and the accuracy of the algorithm is verified on the test set. In this embodiment the verified accuracy is 0.957, the precision is 0.921, and the recall is 0.87, so the model performs well on the test set and can meet the requirements of false invoice identification in an actual tax scenario. Compared with other false invoice identification methods based on static network representation, which reach an accuracy of 0.876, a precision of 0.856, and a recall of 0.794, the accuracy is improved by 9.25%, the precision by 7.6%, and the recall by 9.57% in relative terms. The method of the invention improves not only the accuracy of false invoice identification but also its efficiency through distributed parallel operation: the running time on the data sample with the distributed algorithm is 684.57 s, which is 28.56% shorter than the 958.19 s of the non-distributed algorithm.
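The data split of Steps 1 and 2 can be sketched in plain Python. The random seed, the integer rounding and the sample count of 1000 are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], seed: int = 0
                  ) -> Tuple[List[T], List[T], List[T]]:
    """Return (train, validation, test) following Steps 1 and 2.

    3:1 train/test split, then 10% of the training part held out
    for validation. The seed and the rounding are assumptions.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) // 4            # 3:1 -> one quarter for testing
    test_part, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = len(train_full) // 10          # ten percent of the training set
    val_part, train_part = train_full[:n_val], train_full[n_val:]
    return train_part, val_part, test_part


# Hypothetical sample ids 0..999.
train_set, val_set, test_set = split_samples(range(1000))
```

For 1000 labeled samples this gives 675 training, 75 validation and 250 test samples, with no sample appearing in more than one set.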
S504. Predicting enterprises suspected of false invoicing
The characterization results of unlabeled enterprise samples are input into the trained prediction model for enterprises suspected of false invoicing, and whether a target enterprise has false invoicing behavior is determined from the output of the prediction model. In this embodiment, the predicted values are sorted from high to low and the top ten percent are taken as enterprises suspected of false invoicing.
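The ranking rule of S504 amounts to a sort followed by a cut-off. A minimal sketch, assuming the model output is a score per enterprise; the identifiers, the scores and the rounding up of the cut-off count are hypothetical.

```python
import math
from typing import Dict, List


def top_suspects(scores: Dict[str, float], fraction: float = 0.10) -> List[str]:
    """Return enterprise ids with the highest predicted values.

    Rounding up the cut-off count is an assumption.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = math.ceil(len(ranked) * fraction)
    return ranked[:k]


# Hypothetical predicted values for 20 enterprises.
scores = {f"E{i:03d}": i / 20 for i in range(20)}
suspects = top_suspects(scores)
```

With 20 scored enterprises, the two with the highest predicted values are returned as suspects.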

Claims (6)

1. A false invoice identification method based on dynamic network representation, characterized in that: first, enterprises are taken as nodes and transaction records as edges, and enterprise transaction information is organized into a static network; second, a representation of the enterprise transaction network is established with each day as a time node, a time sequence window with a length of 30 days is established, the static network representations of 30 days are fused within the window each time, and the static network representations of all time nodes are fused step by step by moving the time sequence window to obtain the final dynamic network representation; third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, which are optimized in parallel to improve the learning efficiency of the model; and finally, a classifier based on LightGBM is constructed to identify enterprises suspected of false invoicing.
2. The method for identifying the false invoices on the basis of the dynamic network characterization according to claim 1 is characterized by comprising the following implementation steps:
1) basic feature extraction
First, the data is preprocessed, and then basic enterprise information is extracted, which is roughly divided into three types: text-type data, which is converted into vectors with the word2vec algorithm; category-type data, which is One-Hot encoded; and numerical data, which is standardized;
2) feature extraction based on dynamic network characterization
After the basic features of the enterprises are extracted, the enterprise transaction information is organized into a static network with each day as a time node, taking the enterprise as a node, the basic enterprise information as the node attribute, the transaction record as an edge and the transaction information as the edge attribute; then a time sequence window is established in units of 30 days, the static network representations of 30 days are fused within the window each time, the static network representations of all time nodes are fused step by step by moving the time sequence window, the objective function of the network representation is optimized, and the optimal dynamic enterprise transaction network representation is finally obtained;
3) distributed-based algorithm optimization
In order to improve the learning efficiency of the dynamic network representation, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network representation is decomposed into independent sub-functions, and the sub-functions are optimized in parallel to accelerate the solution of the large-scale, complex enterprise transaction network representation;
4) constructing a classifier to identify false invoicing
A binary classification model is constructed based on the LightGBM classifier, the computed dynamic network representation is taken as the learning data of the classifier, the model is trained with the labeled enterprise sample set, the characterization results of the enterprise sample set to be predicted are then fed into the trained model for prediction, and whether a target enterprise has false invoicing behavior is finally determined from the output of the prediction model.
3. The false invoice identification method based on dynamic network characterization according to claim 2, characterized in that step 1) is implemented as follows:
step 101: data pre-processing
(1) Extracting 'taxpayer electronic file number' as an enterprise characteristic unique identifier;
(2) processing missing values: attributes with severe data loss and attributes irrelevant to the false invoicing task are deleted directly, and important attributes with a small amount of missing data are filled in using similar-mean interpolation;
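One possible reading of similar-mean interpolation is to fill a missing value with the mean of the same attribute over similar samples. The sketch below assumes that similarity is given by a grouping column; the column names `industry` and `capital` and the helper name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def fill_with_group_mean(rows: List[Dict[str, object]],
                         attr: str, group_key: str) -> None:
    """Fill missing `attr` values in place with the mean of the same group."""
    sums: Dict[object, float] = defaultdict(float)
    counts: Dict[object, int] = defaultdict(int)
    for row in rows:
        if row[attr] is not None:
            sums[row[group_key]] += row[attr]
            counts[row[group_key]] += 1
    for row in rows:
        # Groups with no observed value are left unfilled.
        if row[attr] is None and counts[row[group_key]]:
            row[attr] = sums[row[group_key]] / counts[row[group_key]]


# Hypothetical columns: "industry" as the grouping, "capital" as the attribute.
rows = [
    {"industry": "retail", "capital": 100.0},
    {"industry": "retail", "capital": 300.0},
    {"industry": "retail", "capital": None},
    {"industry": "steel", "capital": 900.0},
]
fill_with_group_mean(rows, "capital", "industry")
```

Here the missing retail value is filled with 200.0, the mean of the two observed retail values.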
step 102: processing text-type data
The processing of the text information in the enterprise basic information table comprises the following steps:
(1) segmenting the text type data of the enterprise by using a Jieba segmentation tool;
(2) counting the result of word segmentation by using a dictionary tree, and selecting a word with higher weight as a keyword;
(3) converting the extracted N types of keywords into vectors based on word2vec;
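The keyword counting of sub-step (2) can be sketched with a small dictionary tree (trie). The sketch assumes the text has already been segmented into tokens, for example by Jieba, and uses word frequency as the weight; both the sample tokens and the frequency-as-weight choice are assumptions, and the word2vec conversion is not shown.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # number of times a word ending here was inserted


class WordTrie:
    """Dictionary tree that counts how often each segmented word occurs."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def counts(self) -> List[Tuple[str, int]]:
        """Collect (word, frequency) pairs by walking the tree."""
        out: List[Tuple[str, int]] = []
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.count:
                out.append((prefix, node.count))
            for ch, child in node.children.items():
                stack.append((prefix + ch, child))
        return out

    def top_keywords(self, n: int) -> List[str]:
        """Most frequent words first; ties broken alphabetically."""
        ranked = sorted(self.counts(), key=lambda item: (-item[1], item[0]))
        return [word for word, _ in ranked[:n]]


# Hypothetical tokens standing in for the output of a word segmenter.
tokens = ["steel", "wholesale", "steel", "retail", "steel", "wholesale"]
trie = WordTrie()
for token in tokens:
    trie.add(token)
keywords = trie.top_keywords(2)
```

The two most frequent tokens, `steel` and `wholesale`, are selected as keywords.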
step 103: processing flag-type data
One-Hot encoding is adopted for the discrete data in the enterprise basic information table: a state-bit vector whose length equals the number of attribute values marks each specific state;
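A minimal sketch of the One-Hot encoding described in this step. The attribute values are hypothetical, and ordering the states alphabetically is an arbitrary choice.

```python
from typing import Dict, List, Sequence


def one_hot_encoder(values: Sequence[str]) -> Dict[str, List[int]]:
    """Map each distinct attribute value to a state-bit vector."""
    states = sorted(set(values))
    return {
        state: [1 if i == pos else 0 for i in range(len(states))]
        for pos, state in enumerate(states)
    }


# Hypothetical attribute: taxpayer type.
encoding = one_hot_encoder(["general", "small-scale", "general", "other"])
```

With three distinct values, each state is encoded by a vector of length three containing a single 1.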
step 104: processing numerical data
The numerical data in the enterprise basic information table is processed by adopting a traditional standardization method:
(1) calculating the average value of each attribute;
(2) solving the variance of each attribute;
(3) Z-Score normalization.
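The three sub-steps (mean, variance, Z-Score) can be sketched for a single attribute as follows. Using the population variance and mapping a constant attribute to zeros are assumptions.

```python
from typing import List


def z_score(values: List[float]) -> List[float]:
    """Standardize one attribute: (x - mean) / standard deviation."""
    mean = sum(values) / len(values)
    # Population variance (dividing by n) is an assumption.
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    std = variance ** 0.5
    if std == 0.0:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(x - mean) / std for x in values]


standardized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

For this sample the mean is 5 and the standard deviation is 2, so the value 9.0 maps to 2.0 and the value 2.0 maps to -1.5.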
4. The false invoice identification method based on dynamic network characterization according to claim 3, characterized in that step 2) is implemented as follows:
step 201: establishing static enterprise transaction networks
Establishing a characterization model of an enterprise transaction network every day, so that enterprises with similar topological structures or higher transaction weights are closer to each other in a characterization space, and an objective optimization function is as follows:
Figure FDA0002259626740000031
wherein $h_i$, $h_j$ are the characterizations of enterprises i and j; $w_{ij}$ is the weight of the transaction between them; minimizing $w_{ij}\|h_i - h_j\|^2$ forces the enterprise characterizations $h_i$, $h_j$ to be closer together the larger the transaction weight $w_{ij}$ is;
minimizing the target
Figure FDA0002259626740000032
yields the optimized enterprise transaction network representation h for that day;
step 202: dynamically fusing historical information
Establishing a time sequence window with the length of 30 days, fusing 30-day static network representations in the window each time, then moving the time sequence window, gradually fusing all the static network representations to finally obtain a dynamic enterprise transaction network representation, wherein the corresponding optimization target is as follows:
Figure FDA0002259626740000033
wherein
Figure FDA0002259626740000034
respectively represent the characterizations of enterprises p and q on day t and the weight of the transaction between them,
Figure FDA0002259626740000035
represents the degree of approximation between the characterizations of enterprise p and enterprise q; $H_i$ represents the network characterization for day i within the time sequence window; the penalty term
Figure FDA0002259626740000036
makes the matrix learned by the characterization approach the matrix of the original enterprise transaction network as closely as possible, where ρ is a parameter that balances the contribution of the structural characteristics of the model against the closeness to the original matrix: the larger ρ is, the more the model emphasizes the time-sequence network characterization, and the smaller ρ is, the more it emphasizes the node characterization;
minimizing the target
Figure FDA0002259626740000037
yields the optimized dynamic enterprise transaction network representation H.
5. The false invoice identification method based on dynamic network characterization according to claim 4, characterized in that step 3) is implemented as follows:
step 301: decomposing an objective function
The optimization function (2) is reconstructed and written in a decomposable form:
Figure FDA0002259626740000041
wherein
Figure FDA0002259626740000042
respectively represent the characterizations of enterprises p and q on day t and the weight of the transaction between them,
Figure FDA0002259626740000043
represents the degree of approximation between the characterizations of enterprise p and enterprise q; the penalty term
Figure FDA0002259626740000044
keeps the approach of formula (2) to the matrix of the original enterprise transaction network while dividing the data into single enterprises for calculation;
minimizing the target
Figure FDA0002259626740000045
yields the optimized dynamic enterprise transaction network representation H;
step 302: parallel execution of multiple sub-functions
Formula (3) is decomposed into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network, and the N sub-optimization functions are solved in parallel to obtain $H_t^{k+1}$:
Figure FDA0002259626740000046
wherein
Figure FDA0002259626740000047
represents the enterprises associated with enterprise v,
Figure FDA0002259626740000048
represents the characterization of enterprise v on day t,
Figure FDA0002259626740000049
represents the characterization of enterprise v on day t after k iterations,
Figure FDA00022596267400000410
represents the weight of the transaction between enterprises v and q on day t,
Figure FDA00022596267400000411
represents the degree of approximation between the characterizations of enterprise v and enterprise q on day t after (k-1) iterations;
Figure FDA00022596267400000412
represents the degree of approximation between the characterizations of enterprise v on day i and on day t;
wherein
Figure FDA00022596267400000413
is the characterization of enterprise v on day t to be solved. An iterative optimization method is used to judge whether the calculation result meets the required accuracy: the problem is solved by a gradient descent algorithm, and when the convergence condition
Figure FDA00022596267400000414
or
Figure FDA00022596267400000415
is reached, the optimization function attains its optimal value; that is, when the results obtained for an enterprise after the k-th and the (k-1)-th iterations reach the required accuracy, or when the iteration result of an enterprise is close enough to those of its related enterprises, updating stops and the characterization obtained in the k-th iteration is taken as the characterization of the enterprise for that day;
step 303: comprehensive collation of parallel results
By computing the N nodes of the transaction network in parallel, the characterization of each enterprise on day t is obtained; then, for the dynamic transaction network distributed over time nodes 1 to T, the network characterization at each time node is solved in sequence.
6. The false invoice identification method based on dynamic network characterization according to claim 5, characterized in that step 4) is implemented as follows:
step 401: combining the basic features obtained in the step 1) and the dynamic network features obtained in the step 3) together to be used as learning data of a classifier;
step 402: constructing a two-classification model based on LightGBM, and setting main parameters of a classifier as follows: the number of leaves is 13, the learning rate is 0.1, and the iteration number is 100;
step 403: the characterization results obtained for the enterprise sample set marked as falsely invoicing and for the normal enterprise sample set are taken as the basic features and randomly divided at a ratio of 3:1 into a training set and a test set, and ten percent of the data in the training set is randomly held out as a validation set; the classification model of step 402 is trained on the training set and tuned on the validation set, and pruning is performed if over-fitting occurs; the optimal model is selected and the accuracy of the algorithm is verified on the test set;
step 404: the characterization results of unlabeled enterprise samples are input into the LightGBM-based prediction model for enterprises suspected of false invoicing, and whether a target enterprise has false invoicing behavior is finally determined from the output of the prediction model.
CN201911066791.7A 2019-11-04 2019-11-04 Invoice false invoice identification method based on dynamic network representation Active CN110852856B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911066791.7A CN110852856B (en) 2019-11-04 2019-11-04 Invoice false invoice identification method based on dynamic network representation
PCT/CN2020/113450 WO2021088499A1 (en) 2019-11-04 2020-09-04 False invoice issuing identification method and system based on dynamic network representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911066791.7A CN110852856B (en) 2019-11-04 2019-11-04 Invoice false invoice identification method based on dynamic network representation

Publications (2)

Publication Number Publication Date
CN110852856A true CN110852856A (en) 2020-02-28
CN110852856B CN110852856B (en) 2022-10-25

Family

ID=69598895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911066791.7A Active CN110852856B (en) 2019-11-04 2019-11-04 Invoice false invoice identification method based on dynamic network representation

Country Status (2)

Country Link
CN (1) CN110852856B (en)
WO (1) WO2021088499A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382843A (en) * 2020-03-06 2020-07-07 浙江网商银行股份有限公司 Method and device for establishing upstream and downstream relation recognition model of enterprise and relation mining
CN111724241A (en) * 2020-06-05 2020-09-29 西安交通大学 Enterprise invoice virtual invoice detection method based on dynamic edge feature enhanced graph attention network
CN111966889A (en) * 2020-05-20 2020-11-20 清华大学深圳国际研究生院 Method for generating graph embedding vector and method for generating recommended network model
CN112215616A (en) * 2020-11-30 2021-01-12 四川新网银行股份有限公司 Method and system for automatically identifying abnormal fund transaction based on network
WO2021088499A1 (en) * 2019-11-04 2021-05-14 西安交通大学 False invoice issuing identification method and system based on dynamic network representation
CN114297319A (en) * 2021-12-23 2022-04-08 税友信息技术有限公司 Data identification method and related device

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326377B (en) * 2021-06-02 2023-10-13 上海生腾数据科技有限公司 Name disambiguation method and system based on enterprise association relationship
CN113642735B (en) * 2021-07-28 2023-07-18 浪潮软件科技有限公司 Continuous learning method for identifying virtual tax payers
CN114219287A (en) * 2021-12-15 2022-03-22 中国软件与技术服务股份有限公司 Taxpayer risk evaluation method based on graph neural network
CN115334005B (en) * 2022-03-31 2024-03-22 北京邮电大学 Encryption flow identification method based on pruning convolutional neural network and machine learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140108461A1 (en) * 2009-01-07 2014-04-17 Oracle International Corporation Generic Ontology Based Semantic Business Policy Engine
US20160171627A1 (en) * 2014-12-15 2016-06-16 Abbyy Development Llc Processing electronic documents for invoice recognition
CN106780001A (en) * 2016-12-26 2017-05-31 税友软件集团股份有限公司 A kind of invoice writes out falsely enterprise supervision recognition methods and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104103011B (en) * 2014-07-10 2015-04-29 西安交通大学 Suspicious taxpayer recognition method based on taxpayer interest incidence network
CN106920162B (en) * 2017-03-14 2021-01-29 西京学院 False-open value-added tax special invoice detection method based on parallel loop detection
CN109583978A (en) * 2018-11-30 2019-04-05 税友软件集团股份有限公司 The method, device and equipment of invoice enterprise is write out falsely in a kind of identification
CN110852856B (en) * 2019-11-04 2022-10-25 西安交通大学 Invoice false invoice identification method based on dynamic network representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HONGCHAO YU 等: ""TaxVis: a Visual System for Detecting Tax Evasion Group"", 《IN PROCEEDINGS OF THE 2019 WORLD WIDE WEB CONFERENCE》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021088499A1 (en) * 2019-11-04 2021-05-14 西安交通大学 False invoice issuing identification method and system based on dynamic network representation
CN111382843A (en) * 2020-03-06 2020-07-07 浙江网商银行股份有限公司 Method and device for establishing upstream and downstream relation recognition model of enterprise and relation mining
CN111382843B (en) * 2020-03-06 2023-10-20 浙江网商银行股份有限公司 Method and device for establishing enterprise upstream and downstream relationship identification model and mining relationship
CN111966889A (en) * 2020-05-20 2020-11-20 清华大学深圳国际研究生院 Method for generating graph embedding vector and method for generating recommended network model
CN111966889B (en) * 2020-05-20 2023-04-28 清华大学深圳国际研究生院 Generation method of graph embedded vector and generation method of recommended network model
CN111724241A (en) * 2020-06-05 2020-09-29 西安交通大学 Enterprise invoice virtual invoice detection method based on dynamic edge feature enhanced graph attention network
CN111724241B (en) * 2020-06-05 2024-03-29 西安交通大学 Enterprise invoice virtual issuing detection method based on dynamic edge feature graph annotation meaning network
CN112215616A (en) * 2020-11-30 2021-01-12 四川新网银行股份有限公司 Method and system for automatically identifying abnormal fund transaction based on network
CN112215616B (en) * 2020-11-30 2021-04-30 四川新网银行股份有限公司 Method and system for automatically identifying abnormal fund transaction based on network
CN114297319A (en) * 2021-12-23 2022-04-08 税友信息技术有限公司 Data identification method and related device

Also Published As

Publication number Publication date
WO2021088499A1 (en) 2021-05-14
CN110852856B (en) 2022-10-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant