CN110852856A - Invoice false invoice identification method based on dynamic network representation - Google Patents
Invoice false invoice identification method based on dynamic network representation
- Publication number
- CN110852856A (application CN201911066791.7A)
- Authority
- CN
- China
- Prior art keywords
- enterprise
- network
- characterization
- invoice
- transaction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention discloses a false invoice identification method based on dynamic network representation. First, enterprises are taken as nodes and transaction records as edges, and enterprise transaction information is organized into a static network. Second, a representation of the enterprise transaction network is built with each day as a time node; a time-series window 30 days long is established, the 30 daily static network representations inside the window are fused each time, and by moving the window the static network representations of all time nodes are fused step by step into the final dynamic network representation. Third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions that are optimized in parallel, which improves the learning efficiency of the model. Finally, a LightGBM-based classifier is constructed to identify enterprises suspected of false invoicing. By identifying suspected enterprises from the dynamic network representation, the method improves both the efficiency and the accuracy of false invoice identification.
Description
Technical Field
The invention belongs to the technical field of tax control, and particularly relates to a false invoice identification method based on dynamic network representation.
Background
False invoicing means that an enterprise uses various means to issue invoices that are inconsistent with its actual business operations in order to evade tax.
False invoicing causes huge losses of national tax revenue and seriously disrupts the national economic order. At present, tax bureaus identify enterprises suspected of false invoicing mainly through tip-offs, routine supervision and spot checks, and implication by other problem enterprises, after which tax inspectors examine the reports provided by the enterprise. Such inspections are highly incidental and cannot systematically analyze and evaluate all enterprises; moreover, manual checking imposes a heavy workload on tax inspectors and is inefficient, the data checked are limited to the reports provided by a single enterprise, and upstream and downstream related enterprises cannot be considered together.
To address these problems, network characterization technology offers a solution. A false invoice identification method based on network characterization can organize isolated report information into an enterprise transaction network, so that all enterprises are checked systematically, while the links between enterprises provide additional information for identifying false-invoicing enterprises. The following patents provide related methods for automatic, computer-based false invoice identification using network characterization techniques:
The methods described in the above documents mainly have the following problems: document 1 can only detect false invoicing in which funds pass through several accounts and then return to the source account, but false invoicing takes many forms and is not limited to loops, so the method identifies too narrow a range of cases and the model generalizes poorly; document 2 ignores the attribute information of enterprises and considers only the topological structure of taxpayers and their interest relations, so it cannot analyze an enterprise from the perspective of its scale, market share and the like; documents 1 and 2 are both limited to static networks, cannot dynamically analyze changes in enterprise transactions in combination with historical information, and cannot accurately capture those changes, which leaves loopholes for some enterprises to exploit. For example, each annual statement of a tax-evading enterprise may look unproblematic on its own, yet the enterprise reports losses year after year while its water and electricity costs rise year by year; false invoicing is often hidden in such time-dependent features, which a static network cannot capture.
Disclosure of Invention
In order to improve the efficiency of false invoice identification, the invention aims to provide a false invoice identification method based on dynamic network representation. The invention adopts dynamic network representation, dynamically analyzes the enterprise transaction network in combination with historical information, and accurately captures the dynamic changes of enterprise transactions; different false invoicing behaviors can be identified from the association information among enterprises; meanwhile, drawing on distributed optimization algorithms, the calculation function is decomposed into independent sub-functions that are executed in parallel, which improves the efficiency of false invoice identification.
To achieve this purpose, the invention adopts the following technical scheme:
A false invoice identification method based on dynamic network representation, comprising: first, taking enterprises as nodes and transaction records as edges, and organizing enterprise transaction information into a static network; second, building a representation of the enterprise transaction network with each day as a time node, establishing a time-series window 30 days long, fusing the 30 daily static network representations inside the window each time, and fusing the static network representations of all time nodes step by step by moving the window to obtain the final dynamic network representation; third, drawing on distributed optimization algorithms, decomposing the objective function of the representation into independent sub-functions and optimizing them in parallel to improve the learning efficiency of the model; and finally, constructing a LightGBM-based classifier to identify enterprises suspected of false invoicing.
The invention is further improved in that the method specifically comprises the following implementation steps:
1) basic feature extraction
First, the data are preprocessed and basic enterprise information is extracted. This information falls roughly into three types: text data, which are converted into vectors with the word2vec algorithm; categorical data, which are One-Hot encoded; and numerical data, which are standardized;
2) feature extraction based on dynamic network characterization
After the basic features of the enterprises are extracted, the enterprise transaction information is organized into a static network for each day (each day being a time node), with enterprises as nodes, basic enterprise information as node attributes, transaction records as edges and transaction information as edge attributes; then a time-series window of 30 days is established, the 30 daily static network representations inside the window are fused each time, the static network representations of all time nodes are fused step by step by moving the window, and the objective function of the network representation is optimized to obtain the optimal dynamic enterprise transaction network representation;
3) algorithm optimization based on distributed computing
To improve the learning efficiency of the dynamic network representation, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network representation is decomposed into independent sub-functions; optimizing the sub-functions in parallel accelerates the solution of large-scale, complex enterprise transaction network representations;
4) constructing a classifier to identify false invoicing
A binary classification model is constructed based on a LightGBM classifier, with the computed dynamic network representation as the learning data of the classifier; the model is trained with a labeled enterprise sample set, the representation results of the enterprise sample set to be predicted are then fed into the trained model, and whether a target enterprise exhibits false invoicing behavior is finally determined from the model output.
The further improvement of the invention is that the implementation method of the step 1) is as follows:
step 101: data pre-processing
(1) Extracting 'taxpayer electronic file number' as an enterprise characteristic unique identifier;
(2) processing missing values: attributes with severe data loss and attributes irrelevant to the false invoice identification task are deleted directly, and important attributes with only a few missing values are filled in by same-class mean interpolation;
step 102: processing text-type data
The processing of the text information in the enterprise basic information table comprises the following steps:
(1) segmenting the text type data of the enterprise by using a Jieba segmentation tool;
(2) counting the result of word segmentation by using a dictionary tree, and selecting a word with higher weight as a keyword;
(3) converting the extracted N types of keywords into vectors based on word2vec;
step 103: processing flag-type data
One-Hot encoding is applied to the discrete data in the enterprise basic information table: a status-bit vector whose length equals the number of attribute values is created, and each specific state is marked by one bit;
step 104: processing numerical data
The numerical data in the enterprise basic information table is processed by adopting a traditional standardization method:
(1) calculating the average value of each attribute;
(2) solving the variance of each attribute;
(3) Z-Score normalization.
The further improvement of the invention is that the implementation method of the step 2) is as follows:
step 201: establishing static enterprise transaction networks
A characterization model of the enterprise transaction network is established for each day, so that enterprises with similar topological structures or higher transaction weights lie closer together in the characterization space. The objective function is:
$$\min_{h} \sum_{i,j} w_{ij}\,\lVert h_i - h_j \rVert^2 \tag{1}$$
where $h_i$, $h_j$ are the characterizations of enterprises $i$ and $j$, and $w_{ij}$ is the weight of the transaction between them. Minimizing $w_{ij}\lVert h_i - h_j\rVert^2$ forces the characterizations $h_i$, $h_j$ closer together the larger the transaction weight $w_{ij}$ is;
Minimizing objective (1) yields the optimized enterprise transaction network characterization $h$ for that day;
step 202: dynamically fusing historical information
A time-series window 30 days long is established, and the 30 daily static network representations inside the window are fused each time; the window is then moved so that all static network representations are fused step by step, finally yielding the dynamic enterprise transaction network representation. The corresponding optimization objective, formula (2), uses the following notation:
$h_p^t$, $h_q^t$ and $w_{pq}^t$ denote the characterizations of enterprises $p$ and $q$ and the weight of the transaction between them on day $t$, and $\lVert h_p^t - h_q^t\rVert^2$ measures how close the characterizations of enterprise $p$ and enterprise $q$ are; $h_i$ denotes the network characterization for day $i$ within the time-series window; the penalty term keeps the matrix learned by the characterization as close as possible to the matrix of the original enterprise transaction network; $\rho$ is a parameter that balances the structural characteristics of the model against its closeness to the original matrix: the larger $\rho$ is, the more the model emphasizes the time-series network characterization, and the smaller $\rho$ is, the more it emphasizes node characterization;
Minimizing objective (2) yields the optimized dynamic enterprise transaction network representation $H$.
The further improvement of the invention is that the implementation method of the step 3) is as follows:
step 301: decomposing an objective function
The optimization function (2) is reconstructed and written in a decomposable form, formula (3), with the following notation:
$h_p^t$, $h_q^t$ and $w_{pq}^t$ denote the characterizations of enterprises $p$ and $q$ and the weight of the transaction between them on day $t$, and $\lVert h_p^t - h_q^t\rVert^2$ measures how close the characterizations of enterprise $p$ and enterprise $q$ are; the penalty term, while still keeping the learned matrix close to the matrix of the original enterprise transaction network as in formula (2), splits the data by individual enterprise for calculation;
Minimizing objective (3) yields the optimized dynamic enterprise transaction network representation $H$;
step 302: parallel execution of multiple sub-functions
Formula (3) is decomposed into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network; the N sub-optimization functions are solved in parallel to obtain $H_t^{k+1}$ through formula (4).
The notation of formula (4) covers: the set of enterprises associated with enterprise $v$; $h_v^t$, the characterization of enterprise $v$ on day $t$; the characterization of enterprise $v$ on day $t$ after $k$ iterations; the weight of the transaction between enterprises $v$ and $q$ on day $t$; the closeness of the characterizations of enterprise $v$ and enterprise $q$ on day $t$ after $(k-1)$ iterations; and the closeness of the characterizations of enterprise $v$ on day $i$ and on day $t$;
The characterization of enterprise $v$ on day $t$ is the unknown to be solved. The sub-problem is solved by gradient descent, and an iterative test decides whether the result has reached the required accuracy: the optimization function is taken to have reached its optimum when either of two convergence conditions is satisfied, namely when the results of the $k$-th and $(k-1)$-th iterations for an enterprise agree to the required accuracy, or when the iteration result for an enterprise is sufficiently close to those of its related enterprises. Updating then stops, and the result of the $k$-th iteration is taken as the characterization of the enterprise for that day;
step 303: comprehensive collation of parallel results
Computing the N nodes of the transaction network in parallel gives the characterization of every enterprise on day $t$; for a dynamic transaction network distributed over time nodes 1 to T, the network characterization at each time node is then solved in sequence.
The further improvement of the invention is that the implementation method of the step 4) is as follows:
step 401: combining the basic features obtained in the step 1) and the dynamic network features obtained in the step 3) together to be used as learning data of a classifier;
step 402: constructing a two-classification model based on LightGBM, and setting main parameters of a classifier as follows: the number of leaves is 13, the learning rate is 0.1, and the iteration number is 100;
step 403: the characterization results obtained from the enterprise sample set labeled as false-invoicing and from the normal enterprise sample set are taken as basic features and randomly divided at a ratio of 3:1 into a training set and a test set, and ten percent of the data in the training set are randomly split off as a validation set; the classification model of step 402 is trained with the training set and tuned with the validation set, and pruning is performed if overfitting occurs; the optimal model is selected and the accuracy of the algorithm is verified on the test set;
step 404: the characterization results of unlabeled enterprise samples are input into the LightGBM-based prediction model for enterprises suspected of false invoicing, and whether a target enterprise exhibits false invoicing behavior is finally determined from the output of the prediction model.
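The data split of step 403 and the classifier settings of step 402 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_samples`, the fixed random seed and the integer rounding are hypothetical, and mapping "number of leaves", "learning rate" and "iteration number" to the LightGBM scikit-learn parameters `num_leaves`, `learning_rate` and `n_estimators` is an assumption about how the stated values would be configured.

```python
import random

# Assumed mapping of the stated classifier settings to LightGBM parameter names.
LIGHTGBM_PARAMS = {
    "objective": "binary",   # two-class model: false-invoicing vs. normal
    "num_leaves": 13,
    "learning_rate": 0.1,
    "n_estimators": 100,     # "iteration number" interpreted as boosting rounds
}

def split_samples(samples, seed=0):
    """Shuffle and split into train / validation / test sets.

    Train:test is 3:1, then 10% of the training part becomes the validation set.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 4            # one quarter for testing (3:1 ratio)
    test_set = shuffled[:n_test]
    train_full = shuffled[n_test:]
    n_val = len(train_full) // 10          # ten percent of training data
    val_set = train_full[:n_val]
    train_set = train_full[n_val:]
    return train_set, val_set, test_set

# Hypothetical sample identifiers standing in for labeled enterprises.
train_set, val_set, test_set = split_samples(range(400))
```

If the `lightgbm` package is installed, a dictionary like this could be passed as `lightgbm.LGBMClassifier(**LIGHTGBM_PARAMS)`; the sketch stops short of training so that it stays runnable without third-party dependencies.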
The invention has at least the following beneficial technical effects:
Based on the idea of dynamic network representation learning, the invention provides a method for identifying enterprises suspected of false invoicing, which has the following advantages:
1. by adopting dynamic network representation combined with historical information, representation vectors are learned for the networks of all time nodes and fused, so that the dynamic changes of the enterprise transaction network are accurately captured and the accuracy of false invoice identification is improved;
2. based on the association information among enterprises, different types of false invoicing behavior can be identified;
3. drawing on distributed optimization algorithms, the calculation function is decomposed into independent sub-functions that are executed in parallel, which reduces the time complexity of computing the network representation and improves the efficiency of false invoice identification.
Drawings
FIG. 1 is an overall framework flow diagram.
Fig. 2 is a schematic diagram of a basic feature extraction process.
Fig. 3 is a schematic diagram of a feature extraction process based on dynamic network characterization.
Fig. 4 is a schematic diagram of a network characterization algorithm optimization flow.
FIG. 5 is a schematic diagram of a process of constructing a classifier to identify false invoices.
Detailed Description
The false invoice identification method based on dynamic network representation according to the present invention is described in detail below with reference to the accompanying drawings and embodiments.
As shown in fig. 1, the false invoice identification method based on dynamic network characterization includes the following steps:
S101. Basic feature extraction
After the data are preprocessed, basic enterprise information is extracted. This information falls roughly into three types: text data are converted into vectors with the word2vec algorithm, categorical data are One-Hot encoded, and numerical data are standardized.
As shown in fig. 2, the implementation process of the basic feature extraction specifically includes the following steps:
S201. Data preprocessing
Step 1: extract the "taxpayer electronic file number" as the unique identifier of an enterprise's features, and directly delete the other attributes that cannot describe the enterprise's own distribution pattern;
Step 2: when an attribute has a large number of missing values and only very few valid values (for example, the attributes "taxpayer tax authority code", "financial statement type" and "accounting form" have values for fewer than 10% of enterprises), the feature is deleted directly; when an attribute has only a few missing values (for example, "number of employees" and "registered capital" are missing for individual enterprises), same-class mean interpolation is used to fill in the missing values.
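The same-class mean interpolation mentioned in the preprocessing step can be illustrated with a small sketch. It assumes that "same class" means enterprises sharing the value of some grouping attribute; the grouping key `enterprise_type`, the function name and the sample records are all hypothetical, since the source does not state which attribute defines a class.

```python
from collections import defaultdict

def impute_same_class_mean(rows, group_key, field):
    """Fill missing `field` values with the mean over rows that share `group_key`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        value = row.get(field)
        if value is not None:
            sums[row[group_key]] += value
            counts[row[group_key]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input records are left untouched
        group = new_row[group_key]
        if new_row.get(field) is None and counts[group] > 0:
            new_row[field] = sums[group] / counts[group]
        filled.append(new_row)
    return filled

# Hypothetical records; None marks a missing "registered capital" value.
enterprises = [
    {"id": "A", "enterprise_type": "partnership", "registered_capital": 100.0},
    {"id": "B", "enterprise_type": "partnership", "registered_capital": None},
    {"id": "C", "enterprise_type": "partnership", "registered_capital": 300.0},
    {"id": "D", "enterprise_type": "limited", "registered_capital": 50.0},
]
imputed = impute_same_class_mean(enterprises, "enterprise_type", "registered_capital")
```

A value stays missing when no enterprise of the same class has a valid value, which leaves that decision to a later step rather than guessing.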
S202. Processing text data
The text data "cargo information" and "business scope" in the enterprise basic information table are preprocessed and their features extracted. Text data processing specifically comprises the following steps:
Step 1: perform word segmentation with the Jieba segmentation tool, construct a suitable stop-word list, and remove the stop words from the text. For example, in this embodiment the "business scope" field of one enterprise reads "production, sales: ceramic products; import and export of goods, import and export of technology"; after word segmentation and stop-word removal, the result is "production, sales, ceramic products, goods, import and export, technology, import and export";
Step 2: count the results of Step 1 with a dictionary tree (trie), and select the words with larger weights as keywords;
Step 3: convert the N classes of keywords extracted in Step 2 into vectors based on word2vec.
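The dictionary-tree counting step can be sketched as below. The sketch assumes the text has already been segmented into tokens and that a word's weight is simply its frequency; the source does not define the weighting, and the class names, the function `top_keywords` and the English tokens are hypothetical stand-ins.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how many inserted words end at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def counts(self):
        """Return {word: count} for every word stored in the trie."""
        result = {}
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                result[prefix] = node.count
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return result

def top_keywords(tokens, n):
    """Pick the n most frequent tokens; ties are broken alphabetically."""
    trie = Trie()
    for token in tokens:
        trie.insert(token)
    ranked = sorted(trie.counts().items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Hypothetical tokens standing in for a segmented "business scope" field.
tokens = ["production", "sales", "ceramic", "goods",
          "import-export", "technology", "import-export"]
keywords = top_keywords(tokens, 2)
```

In a full pipeline the selected keywords would then be mapped to vectors by a trained word2vec model, which is outside the scope of this sketch.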
S203. Processing categorical data
One-Hot encoding is applied to the discrete data "enterprise type" and "enterprise state" in the enterprise basic information table. The number of possible values of the attribute gives the length of the status bits; one bit is set to 1 and the rest to 0 to represent a specific state. For example, the "enterprise type" field in this embodiment has four possible values: "sole proprietorship", "partnership", "limited liability company" and "joint-stock limited company". The status-bit length of "enterprise type" is therefore 4, where 1000 denotes "sole proprietorship", 0100 denotes "partnership", 0010 denotes "limited liability company", and 0001 denotes "joint-stock limited company".
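The One-Hot scheme for the four-valued "enterprise type" field can be written as a short sketch. The category strings are English renderings of the four enterprise types named in the text, and the function name `one_hot` is hypothetical.

```python
ENTERPRISE_TYPES = [
    "sole proprietorship",
    "partnership",
    "limited liability company",
    "joint-stock limited company",
]

def one_hot(value, categories):
    """Return a status-bit list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]

code = one_hot("partnership", ENTERPRISE_TYPES)  # [0, 1, 0, 0]
```

Raising an error on an unknown value is a design choice of this sketch; a production encoder might instead reserve an extra bit for unseen categories.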
S204. Processing numerical data
The numerical data "registered capital", "total investment" and "number of workers" in the enterprise basic information table are standardized; this embodiment takes "registered capital" as an example:
Step 1: taking the mean of the "registered capital" attribute
Let $u$ be the mean of the "registered capital" attribute, calculated as:
$$u = \frac{1}{n}\sum_{j=1}^{n} x_j$$
where $n$ is the number of enterprise basic information samples and $x_j$ is the $j$-th "registered capital" value;
step 2: obtaining variance of each attribute
Let us note sigma2Is the variance of the "registered capital" attribute, which is calculated specifically in the form of:
the mean value and the variance are basic indexes of the numerical attribute, and the numerical attribute can be standardized through the mean value and the variance;
Step 3: Z-Score normalization
Let $\delta = (\delta_1, \delta_2, \dots, \delta_n)$ be the normalized values of "registered capital", where $\delta_j$ denotes the normalized value of the $j$-th "registered capital", calculated as:
$$\delta_j = \frac{x_j - u}{\sigma}, \quad j = 1, 2, \dots, n$$
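The three standardization steps (mean, variance, Z-Score) can be checked with a small sketch. It assumes the population variance (division by n), which the source text does not state explicitly, and the sample values are hypothetical.

```python
import math

def z_score(values):
    """Standardize values as (x_j - u) / sigma using the population variance."""
    n = len(values)
    u = sum(values) / n                                # Step 1: mean
    variance = sum((x - u) ** 2 for x in values) / n   # Step 2: variance (1/n is an assumption)
    sigma = math.sqrt(variance)
    if sigma == 0:
        raise ValueError("all values are identical; Z-Score is undefined")
    return [(x - u) / sigma for x in values]           # Step 3: Z-Score

# Hypothetical "registered capital" values.
registered_capital = [100.0, 200.0, 300.0]
delta = z_score(registered_capital)
```

After standardization the values have mean 0 and unit variance, so attributes measured on very different scales, such as capital and head count, become comparable.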
S102. Feature extraction based on dynamic network characterization
First, a static enterprise transaction network is established for each day (each day being a time node), with enterprises as nodes and transaction records as edges; then a time-series window of 30 days is established, the 30 daily static network representations inside the window are fused each time, the static network representations of all time nodes are fused step by step by moving the window, and the objective function of the network representation is optimized to obtain the optimal dynamic enterprise transaction network representation.
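Building one static network per day from transaction records can be sketched as follows. The record layout `(day, seller, buyer, amount)`, the use of undirected edges and the choice of summed transaction amount as the edge weight are all assumptions made for illustration; the source only states that enterprises are nodes and transaction records are edges.

```python
from collections import defaultdict

def build_daily_networks(transactions):
    """Group transaction records into one weighted edge map per day.

    Returns {day: {(enterprise_a, enterprise_b): total_amount}}.
    """
    networks = defaultdict(lambda: defaultdict(float))
    for day, seller, buyer, amount in transactions:
        edge = tuple(sorted((seller, buyer)))  # undirected edge (assumption)
        networks[day][edge] += amount          # weight = summed amount (assumption)
    return {day: dict(edges) for day, edges in networks.items()}

# Hypothetical transaction records.
records = [
    (1, "A", "B", 10.0),
    (1, "B", "A", 5.0),
    (1, "B", "C", 2.0),
    (2, "A", "C", 7.0),
]
daily_networks = build_daily_networks(records)
```

Each daily edge map can then serve as the weight matrix of that day's static network when its representation is learned.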
As shown in fig. 3, the specific steps of the implementation process of feature extraction based on dynamic network characterization include:
step 1: establishing static enterprise transaction networks
Establishing a characterization model of one enterprise transaction network every day, wherein an objective optimization function is as follows:
minimization of the targetThe representation h of each enterprise in the day can be obtained, so that enterprises with similar transaction structures or high transaction weight are closer to each other in the representation space, and further the representation of the whole enterprise transaction network in the day is obtained.
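The effect of the daily objective can be illustrated with a sketch that evaluates the weighted term $w_{ij}\lVert h_i - h_j\rVert^2$ summed over edges and takes one gradient-descent step. This assumes the objective is exactly that weighted sum; minimized without any constraint it would collapse all characterizations to a single point, and the source text does not say which constraint or regularizer prevents this, so the sketch shows only one step rather than a full solver. The function names, weights, vectors and learning rate are hypothetical.

```python
def objective(h, weights):
    """Sum of w_ij * ||h_i - h_j||^2 over the weighted edges."""
    total = 0.0
    for (i, j), w in weights.items():
        total += w * sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
    return total

def gradient_step(h, weights, lr):
    """Take one gradient-descent step on the objective and return new vectors."""
    grad = {node: [0.0] * len(vec) for node, vec in h.items()}
    for (i, j), w in weights.items():
        for d, (a, b) in enumerate(zip(h[i], h[j])):
            g = 2.0 * w * (a - b)
            grad[i][d] += g
            grad[j][d] -= g
    return {
        node: [x - lr * g for x, g in zip(vec, grad[node])]
        for node, vec in h.items()
    }

# Hypothetical transaction weights and initial 2-dimensional characterizations.
weights = {("A", "B"): 5.0, ("B", "C"): 1.0}
h0 = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [1.0, 1.0]}
h1 = gradient_step(h0, weights, lr=0.05)
```

In this example the heavily weighted pair A and B moves together much faster than the lightly weighted pair B and C, which is the behavior the text describes.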
Step 2: dynamically fusing historical information
All static enterprise transaction network representations inside the time-series window are fused step by step to finally obtain the dynamic enterprise transaction network representation, with formula (2) as the optimization objective.
The time-series window is 30 days long; each time, the 30 daily static network representations inside the window are fused, the window is then moved, and all static network representations are fused step by step. Minimizing the objective gives the characterization $H$ of each enterprise for that day. In this embodiment the best effect is obtained when $\rho = 0.75$, in which case the time-series network characterization and the node characterization are weighted in a more balanced way;
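The movement of the 30-day window over the time nodes can be sketched as a generator of day ranges. A step of one day per move is an assumption, since the text only says the window is moved; the function name and the handling of series shorter than one window are also hypothetical. The fusion performed inside each window is not shown.

```python
def sliding_windows(num_days, window=30, step=1):
    """Yield (start_day, end_day), 1-based and inclusive, for each window position."""
    if num_days < window:
        # Fewer days than one full window: use everything available.
        yield (1, num_days)
        return
    for start in range(1, num_days - window + 2, step):
        yield (start, start + window - 1)

windows = list(sliding_windows(33))
```

For 33 days of data this produces four window positions, each covering 30 consecutive daily static networks.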
S103. Distributed algorithm optimization
First, the objective function is decomposed; then the sub-functions are executed in parallel; finally, the parallel results are consolidated.
As shown in fig. 4, the specific steps of the optimization implementation process based on the distributed algorithm include:
S401. Decomposing the objective function
The optimization function (2) is reconstructed and written in the decomposable form of formula (3).
In this embodiment the enterprise transaction network involves 3765 enterprises, so N = 3765 and v = 1, ..., 3765, and each enterprise is computed together with its associated transaction network; $\rho = 0.75$ is taken so that the time-series network characterization and the node characterization are weighted in a balanced way;
S402. Executing the sub-functions in parallel
Formula (3) is decomposed by enterprise $v$ into 3765 sub-optimization functions, which are solved in parallel and finally merged to obtain $H_t^{k+1}$; each single sub-objective is given by formula (4).
In this embodiment, $\rho = 0.75$ is taken so that the time-series network characterization and the node characterization are weighted in a more balanced way. The results of the sub-functions are computed in turn: solving each sub-function gives the characterization of each enterprise on day $t$ after the $k$-th iteration, and merging them gives $H_t^{k+1}$, the characterization of the dynamic enterprise transaction network on day $t$ after the $k$-th iteration;
s403. comprehensive arrangement of parallel results
Solving the formula (4) by using a gradient descent algorithm, in the embodiment, the equation is setOrThe updates are stopped, indicating that their nearly equal representations are of the enterprise trading network for that day. Thus, for dynamic trading networks distributed on days 1 to T, the network characterization for each day can be obtained by sequential calculation.
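The decompose, solve in parallel, merge pattern of S401 to S403 can be sketched as follows. Because formulas (3) and (4) are not reproduced in the text, the local objective used here is a stand-in chosen for illustration, not the patent's formula: for enterprise v it is the sum over associated enterprises q of w_vq * ||x - h_q||^2 plus (ρ/2) * ||x - prev_v||^2, where `prev` is a hypothetical historical characterization. All function names, the learning rate, the tolerance and the toy data are assumptions; only ρ = 0.75 and the cap of 100 iterations come from the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Vector = List[float]
RHO = 0.75  # value reported for the embodiment


def solve_subproblem(v: int, h: Dict[int, Vector], prev: Dict[int, Vector],
                     neighbors: Dict[int, Dict[int, float]],
                     lr: float = 0.05, steps: int = 20) -> Tuple[int, Vector]:
    """Gradient descent on the stand-in local objective of enterprise v,
    holding the neighbours' characterizations from the current iteration fixed."""
    x = list(h[v])
    for _ in range(steps):
        grad = [RHO * (xd - pd) for xd, pd in zip(x, prev[v])]
        for q, w in neighbors[v].items():
            for d in range(len(x)):
                grad[d] += 2.0 * w * (x[d] - h[q][d])
        x = [xd - lr * gd for xd, gd in zip(x, grad)]
    return v, x


def fit(h: Dict[int, Vector], prev: Dict[int, Vector],
        neighbors: Dict[int, Dict[int, float]],
        tol: float = 1e-4, max_iter: int = 100) -> Tuple[Dict[int, Vector], int]:
    """Repeat: solve all sub-problems in parallel, merge, check convergence."""
    iterations = 0
    with ThreadPoolExecutor() as pool:
        for iterations in range(1, max_iter + 1):
            tasks = [pool.submit(solve_subproblem, v, h, prev, neighbors) for v in h]
            new_h = dict(task.result() for task in tasks)  # merge step
            change = max(abs(a - b) for v in h for a, b in zip(new_h[v], h[v]))
            h = new_h
            if change < tol:  # successive iterations nearly equal
                break
    return h, iterations


# Hypothetical toy data: three enterprises on a chain, one-dimensional characterizations.
prev = {0: [0.0], 1: [1.0], 2: [2.0]}
neighbors = {0: {1: 1.0}, 1: {0: 1.0, 2: 0.5}, 2: {1: 0.5}}
h_final, n_iter = fit({v: list(p) for v, p in prev.items()}, prev, neighbors)
```

A thread pool is used only to keep the sketch self-contained; in CPython it shows the structure of the computation rather than a real speed-up, for which a process pool or a cluster framework would be the more likely choice.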
S104. Constructing a classifier to identify false invoicing
First, the basic features obtained in step S101 and the dynamic network features obtained in step S102 are combined as the learning data of the classifier; second, a binary classification model is constructed based on a LightGBM classifier; the model is then trained on the set of enterprise samples labeled as false-invoicing or not; finally, the characterization results of the enterprise samples to be predicted are fed into the trained model, and whether a target enterprise exhibits false-invoicing behavior is determined from the model output.
As shown in Fig. 5, constructing the classifier to identify false invoicing comprises the following steps:
S501. Obtaining the learning data of the classifier
The basic features obtained in step S101 and the dynamic network features obtained in step S103 are combined as the learning data of the classifier. In this embodiment, the enterprise basic feature vector obtained in S101 is concatenated directly with the dynamic network feature vector obtained in S103 to form a new vector, which serves as the learning data of the classifier.
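A minimal sketch of this concatenation is given below. The function name, the enterprise identifiers and the feature values are hypothetical, and placing the basic features before the dynamic network features is an assumption, since the text does not fix the order.

```python
from typing import Dict, List


def build_learning_data(basic: Dict[str, List[float]],
                        dynamic: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Concatenate basic and dynamic feature vectors for enterprises present in both."""
    shared = sorted(set(basic) & set(dynamic))
    return {eid: list(basic[eid]) + list(dynamic[eid]) for eid in shared}


# Hypothetical feature vectors keyed by enterprise identifier.
basic = {"E1": [1.0, 0.0, 0.3], "E2": [0.0, 1.0, -1.2]}
dynamic = {"E1": [0.12, 0.80], "E2": [0.55, 0.10]}
learning_data = build_learning_data(basic, dynamic)
```

Each resulting vector has length 5 here: three basic features followed by two dynamic network features.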
S502. Constructing a binary classification model based on LightGBM
The main classifier parameters are set as follows: number of leaves 13, learning rate 0.1, number of iterations 100.
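Expressed with LightGBM's scikit-learn style interface, these settings might look like the sketch below. Mapping "number of iterations" to `n_estimators` and choosing the `binary` objective are assumptions about how the embodiment was configured; `LGBM_PARAMS` and `build_classifier` are hypothetical names.

```python
# Parameters stated in the embodiment, written as LGBMClassifier keyword arguments.
LGBM_PARAMS = {
    "objective": "binary",   # assumed: two classes, false-invoicing vs. normal
    "num_leaves": 13,
    "learning_rate": 0.1,
    "n_estimators": 100,     # assumed mapping of "number of iterations"
}


def build_classifier():
    """Create the classifier; lightgbm is a third-party package, imported lazily."""
    import lightgbm as lgb
    return lgb.LGBMClassifier(**LGBM_PARAMS)
```

The model returned by `build_classifier()` would then be trained with `fit(X, y)` on the concatenated feature vectors and their labels.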
S503. Training the model
Step 1: The characterization results of the enterprise samples labeled as false-invoicing and of the normal enterprise samples are used as the base data and randomly divided in a 3:1 ratio into a training set and a test set.
Step 2: Ten percent of the training set is randomly held out as the validation set.
Step 3: The classification model constructed in S502 is trained on the training set and tuned using the validation set; pruning is performed if overfitting occurs.
Step 4: Iterative computation is performed with the number of iterations set to 100; if the convergence condition has not been reached within 100 iterations, iteration is forcibly stopped and the result of the last iteration is taken as the computed characterization.
Step 5: The optimal model is selected and the accuracy of the algorithm is verified on the test set. In this embodiment the verified accuracy is 0.957, the precision is 0.921 and the recall is 0.87, which indicates that the model performs well on the test set and can meet the requirements of false-invoice identification in a real tax scenario. By comparison, other false-invoice identification methods based on static network characterization achieve an accuracy of 0.876, a precision of 0.856 and a recall of 0.794; relative to these, the present method improves accuracy by 9.25%, precision by 7.6% and recall by 9.57%. The method improves not only identification accuracy but also efficiency through distributed parallel operation: the running time on the data sample with the distributed algorithm is 684.57 s, which is 28.56% shorter than the 958.19 s of the non-distributed algorithm.
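The percentages reported in Step 5 are relative changes, which can be checked with a few lines of arithmetic. The helper name is hypothetical; the numbers are the ones given in the embodiment.

```python
def relative_change_pct(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


accuracy_gain = relative_change_pct(0.957, 0.876)
precision_gain = relative_change_pct(0.921, 0.856)
recall_gain = relative_change_pct(0.87, 0.794)
# A shorter running time is a negative change, so the sign is flipped to get the saving.
runtime_saving = -relative_change_pct(684.57, 958.19)
```

Rounded, these reproduce the stated 9.25%, 7.6%, 9.57% and 28.56%.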
S504. Predicting enterprises suspected of false invoicing
The characterization results of unlabeled enterprise samples are input into the trained prediction model for enterprises suspected of false invoicing, and whether a target enterprise exhibits false-invoicing behavior is determined from the model output. In this embodiment, the predicted values are sorted from high to low and the top ten percent are taken as enterprises suspected of false invoicing.
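The ranking rule of this embodiment can be sketched as below. The function name, the enterprise identifiers and the scores are hypothetical, and rounding the cut-off down (with a minimum of one enterprise) is an assumption, since the text does not say how a fractional ten percent is handled.

```python
from typing import Dict, List


def flag_top_percent(scores: Dict[str, float], top_percent: int = 10) -> List[str]:
    """Sort predicted values from high to low and return the top `top_percent` percent."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    k = max(1, (len(ranked) * top_percent) // 100)
    return [enterprise_id for enterprise_id, _ in ranked[:k]]


# Hypothetical predicted values for 20 unlabeled enterprises.
scores = {f"E{i:02d}": i / 20 for i in range(20)}
suspects = flag_top_percent(scores)
```

For these 20 enterprises the two highest-scoring ones are flagged.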
Claims (6)
1. A false invoice identification method based on dynamic network characterization, characterized in that: first, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges; second, a characterization of the enterprise transaction network is established with each day as a time node, a time-series window with a length of 30 days is established, the static network characterizations of 30 days are fused within the window each time, and the static network characterizations of all time nodes are fused step by step by moving the window to obtain the final dynamic network characterization; third, drawing on distributed optimization algorithms, the objective function of the characterization is decomposed into independent sub-functions that are optimized in parallel to improve the learning efficiency of the model; and finally, a LightGBM-based classifier is constructed to identify enterprises suspected of false invoicing.
2. The method for identifying the false invoices on the basis of the dynamic network characterization according to claim 1 is characterized by comprising the following implementation steps:
1) basic feature extraction
First, the data are preprocessed; then basic enterprise information is extracted. The basic information falls into three broad types, handled as follows: text data are converted into vectors with the word2vec algorithm, categorical data are One-Hot encoded, and numerical data are standardized;
2) feature extraction based on dynamic network characterization
After the basic enterprise features are extracted, the enterprise transaction information is organized into a static network for each day, with enterprises as nodes, basic enterprise information as node attributes, transaction records as edges, transaction information as edge attributes, and each day as a time node; then a time-series window of 30 days is established, the static network characterizations of 30 days are fused within the window each time, the static network characterizations of all time nodes are fused step by step by moving the window, and the objective function of the network characterization is optimized to obtain the optimal dynamic enterprise transaction network characterization;
3) distributed-based algorithm optimization
To improve the learning efficiency of the dynamic network characterization, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network characterization is decomposed into independent sub-functions, which are optimized in parallel to accelerate the solution of the characterization of large-scale, complex enterprise transaction networks;
4) constructing a classifier to identify false invoicing
A binary classification model is constructed based on a LightGBM classifier, with the computed dynamic network characterization as the learning data of the classifier; the model is trained on the labeled enterprise sample set; the characterization results of the enterprise samples to be predicted are then fed into the trained model, and whether a target enterprise exhibits false-invoicing behavior is finally determined from the model output.
3. The false invoice identification method based on dynamic network characterization according to claim 2, characterized in that step 1) is implemented as follows:
step 101: data pre-processing
(1) Extracting 'taxpayer electronic file number' as an enterprise characteristic unique identifier;
(2) processing missing values: attributes with severe data loss and attributes irrelevant to the false-invoicing task are deleted directly, and important attributes with a small number of missing values are filled using same-class mean imputation;
step 102: processing text-type data
The processing of the text information in the enterprise basic information table comprises the following steps:
(1) segmenting the text type data of the enterprise by using a Jieba segmentation tool;
(2) counting the word segmentation results with a dictionary tree (trie) and selecting the words with higher weights as keywords;
(3) converting the extracted N types of keywords into vectors based on word2vec;
step 103: processing flag-type data
One-Hot encoding is applied to discrete data in the enterprise basic information table: a status register whose length equals the number of attribute values is established, with each status bit marking one specific state;
step 104: processing numerical data
The numerical data in the enterprise basic information table is processed by adopting a traditional standardization method:
(1) calculating the average value of each attribute;
(2) solving the variance of each attribute;
(3) Z-Score normalization.
4. The false invoice identification method based on dynamic network characterization according to claim 3, characterized in that step 2) is implemented as follows:
step 201: establishing static enterprise transaction networks
A characterization model of the enterprise transaction network is established for each day, so that enterprises with similar topological structures or higher transaction weights are closer to each other in the characterization space; the objective optimization function is formula (1),
where $h_i$, $h_j$ are the characterizations of enterprises i and j, and $w_{ij}$ is the weight of the transaction between them; minimizing $w_{ij}\lVert h_i - h_j\rVert^2$ forces the characterizations $h_i$, $h_j$ closer together the larger the transaction weight $w_{ij}$ is;
minimizing the objective yields the optimized enterprise transaction network characterization h for that day;
step 202: dynamically fusing historical information
A time-series window with a length of 30 days is established; the static network characterizations of 30 days are fused within the window each time, the window is then moved, and all static network characterizations are fused step by step to obtain the final dynamic enterprise transaction network characterization; the corresponding optimization objective is formula (2),
where the terms respectively denote the characterizations of enterprises p and q on day t and the weight of the transaction between them, and the distance between the two characterizations then represents the degree of approximation of the characterizations of enterprise p and enterprise q; $H_i$ denotes the network characterization for day i within the time-series window; the penalty term makes the matrix learned by the characterization approach the matrix of the original enterprise transaction network as closely as possible; ρ is a parameter that sets the relative contribution of the model's structural characteristics and of the approximation to the original matrix: the larger ρ is, the more the model emphasizes the time-series network characterization, and the smaller ρ is, the more it emphasizes the node characterization;
5. The false invoice identification method based on dynamic network characterization according to claim 4, characterized in that step 3) is implemented as follows:
step 301: decomposing an objective function
Optimization function (2) is restructured and written in a decomposable form, formula (3),
where the terms respectively denote the characterizations of enterprises p and q on day t and the weight of the transaction between them, and the distance between the two characterizations then represents the degree of approximation of the characterizations of enterprise p and enterprise q; the penalty term, in addition to approaching the matrix of the original enterprise transaction network as in formula (2), splits the data so that each enterprise is computed individually;
minimizing the objective yields the optimized dynamic enterprise transaction network characterization H;
step 302: parallel execution of multiple sub-functions
Formula (3) is decomposed into N sub-optimization functions, where N is the number of network nodes, i.e., the number of enterprises in the enterprise transaction network; the N sub-optimization functions are solved in parallel to obtain $H_t^{k+1}$,
where the symbols denote, in order: the enterprises associated with enterprise v; the characterization of enterprise v on day t; the characterization of enterprise v on day t after k iterations; the weight of the transaction between enterprises v and q on day t; the degree of approximation between the characterizations of enterprise v and enterprise q on day t after (k-1) iterations; and the degree of approximation between the characterizations of enterprise v on day i and on day t;
the quantity to be solved is the characterization of enterprise v on day t, and an iterative optimization method is used to judge whether the result meets the required accuracy: the problem is solved with a gradient descent algorithm, and the optimization function attains its optimal value when either convergence condition is reached, that is, when the results obtained for an enterprise after the k-th and (k-1)-th iterations agree to the required accuracy, or when the iteration result of an enterprise is sufficiently close to those of its associated enterprises; updating then stops, and the characterization obtained at the k-th iteration is taken as the characterization of the enterprise for that day;
step 303: comprehensive collation of parallel results
The characterization of each enterprise on day t is obtained by computing the N nodes of the transaction network in parallel; then, for the dynamic transaction network distributed over time nodes 1 to T, the network characterization at each time node is solved in sequence.
6. The false invoice identification method based on dynamic network characterization according to claim 5, characterized in that step 4) is implemented as follows:
step 401: combining the basic features obtained in the step 1) and the dynamic network features obtained in the step 3) together to be used as learning data of a classifier;
step 402: constructing a two-classification model based on LightGBM, and setting main parameters of a classifier as follows: the number of leaves is 13, the learning rate is 0.1, and the iteration number is 100;
step 403: the characterization results of the enterprise samples labeled as false-invoicing and of the normal enterprise samples are used as the base data and randomly divided in a 3:1 ratio into a training set and a test set, and ten percent of the training set is randomly held out as a validation set; the classification model of step 402 is trained on the training set and tuned using the validation set, and pruning is performed if overfitting occurs; the optimal model is selected and the accuracy of the algorithm is verified on the test set;
step 404: the characterization results of unlabeled enterprise samples are input into the LightGBM-based prediction model for enterprises suspected of false invoicing, and whether a target enterprise exhibits false-invoicing behavior is finally determined from the output of the prediction model.
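The One-Hot encoding and Z-Score standardization named in steps 103 and 104 of claim 3 can be sketched with the standard library alone. The function names and sample values are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics
from typing import List, Sequence, Tuple


def z_score(values: Sequence[float]) -> List[float]:
    """Standardize one numerical attribute: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant attribute carries no information
    return [(v - mean) / std for v in values]


def one_hot(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Encode one categorical attribute; vector length equals the number of attribute values."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))] for v in values]
    return categories, rows


# Hypothetical attribute columns from an enterprise basic information table.
standardized = z_score([2.0, 4.0, 6.0])
categories, encoded = one_hot(["A", "B", "A"])
```

The standardized column has mean 0 and unit standard deviation, and each encoded row contains exactly one 1.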
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911066791.7A CN110852856B (en) | 2019-11-04 | 2019-11-04 | Invoice false invoice identification method based on dynamic network representation |
PCT/CN2020/113450 WO2021088499A1 (en) | 2019-11-04 | 2020-09-04 | False invoice issuing identification method and system based on dynamic network representation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911066791.7A CN110852856B (en) | 2019-11-04 | 2019-11-04 | Invoice false invoice identification method based on dynamic network representation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110852856A true CN110852856A (en) | 2020-02-28 |
CN110852856B CN110852856B (en) | 2022-10-25 |
Family
ID=69598895
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911066791.7A Active CN110852856B (en) | 2019-11-04 | 2019-11-04 | Invoice false invoice identification method based on dynamic network representation |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110852856B (en) |
WO (1) | WO2021088499A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382843A (en) * | 2020-03-06 | 2020-07-07 | 浙江网商银行股份有限公司 | Method and device for establishing upstream and downstream relation recognition model of enterprise and relation mining |
CN111724241A (en) * | 2020-06-05 | 2020-09-29 | 西安交通大学 | Enterprise invoice virtual invoice detection method based on dynamic edge feature enhanced graph attention network |
CN111966889A (en) * | 2020-05-20 | 2020-11-20 | 清华大学深圳国际研究生院 | Method for generating graph embedding vector and method for generating recommended network model |
CN112215616A (en) * | 2020-11-30 | 2021-01-12 | 四川新网银行股份有限公司 | Method and system for automatically identifying abnormal fund transaction based on network |
WO2021088499A1 (en) * | 2019-11-04 | 2021-05-14 | 西安交通大学 | False invoice issuing identification method and system based on dynamic network representation |
CN114297319A (en) * | 2021-12-23 | 2022-04-08 | 税友信息技术有限公司 | Data identification method and related device |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113326377B (en) * | 2021-06-02 | 2023-10-13 | 上海生腾数据科技有限公司 | Name disambiguation method and system based on enterprise association relationship |
CN113642735B (en) * | 2021-07-28 | 2023-07-18 | 浪潮软件科技有限公司 | Continuous learning method for identifying virtual tax payers |
CN114219287A (en) * | 2021-12-15 | 2022-03-22 | 中国软件与技术服务股份有限公司 | Taxpayer risk evaluation method based on graph neural network |
CN115334005B (en) * | 2022-03-31 | 2024-03-22 | 北京邮电大学 | Encryption flow identification method based on pruning convolutional neural network and machine learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140108461A1 (en) * | 2009-01-07 | 2014-04-17 | Oracle International Corporation | Generic Ontology Based Semantic Business Policy Engine |
US20160171627A1 (en) * | 2014-12-15 | 2016-06-16 | Abbyy Development Llc | Processing electronic documents for invoice recognition |
CN106780001A (en) * | 2016-12-26 | 2017-05-31 | 税友软件集团股份有限公司 | A kind of invoice writes out falsely enterprise supervision recognition methods and system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104103011B (en) * | 2014-07-10 | 2015-04-29 | 西安交通大学 | Suspicious taxpayer recognition method based on taxpayer interest incidence network |
CN106920162B (en) * | 2017-03-14 | 2021-01-29 | 西京学院 | False-open value-added tax special invoice detection method based on parallel loop detection |
CN109583978A (en) * | 2018-11-30 | 2019-04-05 | 税友软件集团股份有限公司 | The method, device and equipment of invoice enterprise is write out falsely in a kind of identification |
CN110852856B (en) * | 2019-11-04 | 2022-10-25 | 西安交通大学 | Invoice false invoice identification method based on dynamic network representation |
- 2019-11-04: CN application CN201911066791.7A, patent CN110852856B, status Active
- 2020-09-04: WO application PCT/CN2020/113450, publication WO2021088499A1, status Application Filing
Non-Patent Citations (1)
Title |
---|
HONGCHAO YU et al.: "TaxVis: a Visual System for Detecting Tax Evasion Group", in Proceedings of the 2019 World Wide Web Conference * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021088499A1 (en) * | 2019-11-04 | 2021-05-14 | 西安交通大学 | False invoice issuing identification method and system based on dynamic network representation |
CN111382843A (en) * | 2020-03-06 | 2020-07-07 | 浙江网商银行股份有限公司 | Method and device for establishing upstream and downstream relation recognition model of enterprise and relation mining |
CN111382843B (en) * | 2020-03-06 | 2023-10-20 | 浙江网商银行股份有限公司 | Method and device for establishing enterprise upstream and downstream relationship identification model and mining relationship |
CN111966889A (en) * | 2020-05-20 | 2020-11-20 | 清华大学深圳国际研究生院 | Method for generating graph embedding vector and method for generating recommended network model |
CN111966889B (en) * | 2020-05-20 | 2023-04-28 | 清华大学深圳国际研究生院 | Generation method of graph embedded vector and generation method of recommended network model |
CN111724241A (en) * | 2020-06-05 | 2020-09-29 | 西安交通大学 | Enterprise invoice virtual invoice detection method based on dynamic edge feature enhanced graph attention network |
CN111724241B (en) * | 2020-06-05 | 2024-03-29 | 西安交通大学 | Enterprise invoice virtual issuing detection method based on dynamic edge feature graph annotation meaning network |
CN112215616A (en) * | 2020-11-30 | 2021-01-12 | 四川新网银行股份有限公司 | Method and system for automatically identifying abnormal fund transaction based on network |
CN112215616B (en) * | 2020-11-30 | 2021-04-30 | 四川新网银行股份有限公司 | Method and system for automatically identifying abnormal fund transaction based on network |
CN114297319A (en) * | 2021-12-23 | 2022-04-08 | 税友信息技术有限公司 | Data identification method and related device |
Also Published As
Publication number | Publication date |
---|---|
WO2021088499A1 (en) | 2021-05-14 |
CN110852856B (en) | 2022-10-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||