CN103377454B - Abnormal tax return data detection method based on cosine similarity - Google Patents

Abnormal tax return data detection method based on cosine similarity

Info

Publication number
CN103377454B
Authority
CN
China
Prior art keywords
taxpayer
data
avg
tax declaration
cosine similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310291896.9A
Other languages
Chinese (zh)
Other versions
CN103377454A (en)
Inventor
刘烃
刘杨
桂宇虹
郑庆华
屈宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Servyou Software Group Co., Ltd.
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201310291896.9A
Publication of CN103377454A
Application granted
Publication of CN103377454B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method for detecting abnormal tax return data based on cosine similarity. Using the tax return data of taxpayers in the same industry/region, the method calculates statistical indicators of each taxpayer's tax return data and statistical features of the tax return data across that industry/region. By calculating the cosine similarity between each taxpayer's indicators and the industry/region statistical features, it detects abnormal data and identifies suspicious taxpayers. The method can effectively improve the detection accuracy for abnormal tax return data, reduce computational complexity, and identify suspicious taxpayers.

Description

Abnormal tax return data detection method based on cosine similarity
Technical field:
The present invention relates to the field of data monitoring, and in particular to a method for detecting abnormal tax return data.
Background:
Tax audit is the general term for the inspection and handling work that tax authorities carry out, in accordance with the law, on how taxpayers and withholding agents fulfill their obligations to pay and withhold tax. Tax laws and regulations are complex and audit points are numerous: there are generally more than 2,000 of them. The volume of data to be audited is also huge. The financial voucher data of a large enterprise alone amounts to roughly tens of millions of entries, and with the traditional manual approach, auditing one relatively large enterprise generally takes a team of 5-10 people 6 months. How to automatically analyze taxpayers' tax return data, screen out abnormal tax return data and taxpayers, and reduce the volume of data that must be audited manually has become one of the urgent problems in the field of tax audit.
Summary of the invention:
The main purpose of the present invention is to provide a method for detecting abnormal tax return data based on cosine similarity. The method builds a feature vector from each taxpayer's tax return data, builds statistical feature vectors from the tax return data of taxpayers in the same region/industry, and calculates the cosine similarity between each taxpayer's vector and the statistical features. It thereby detects whether a taxpayer's tax return data are abnormal and identifies suspicious taxpayers.
The purpose of the present invention is achieved through the following technical solution:
A method for detecting abnormal tax return data based on cosine similarity, comprising the following steps:
S100: collect the tax return data of m taxpayers in the same industry/region for the same tax filing period;
S101: from the tax return data of taxpayer i collected in step S100 for that filing period, calculate each tax return statistical indicator, denoted $S_1(i), S_2(i), \ldots, S_n(i)$, and use them to form the statistical indicator vector of taxpayer i, $S(i) = (S_1(i), S_2(i), \ldots, S_n(i))$, where n is the total number of kinds of tax return statistical indicators;
S102: for all taxpayers 1, 2, ..., m in the same industry/region, calculate the arithmetic mean AVG and the total-sales-weighted mean WAVG of their statistical indicator vectors, using the formulas:
$$\mathrm{AVG} = \frac{1}{m} \sum_{i=1}^{m} S(i)$$
$$\mathrm{WAVG} = \frac{1}{\sum_{i=1}^{m} o(i)} \sum_{i=1}^{m} o(i) \cdot S(i)$$
where o(i) is the total sales of taxpayer i;
S103: calculate the cosine similarity between the statistical indicator vector of each of the m taxpayers and the industry/region statistical features AVG and WAVG. A taxpayer whose similarity is greater than the cosine similarity threshold has normal tax return data; a taxpayer whose similarity is less than or equal to the cosine similarity threshold has abnormal tax return data.
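Step S102 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `industry_features` and the toy input values are hypothetical, and indicator vectors are represented as plain lists.

```python
from typing import List, Sequence, Tuple


def industry_features(
    indicators: Sequence[Sequence[float]],
    sales: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (AVG, WAVG) for m taxpayers with n indicators each.

    indicators[i] plays the role of S(i); sales[i] plays the role of o(i).
    """
    m = len(indicators)
    if m == 0 or m != len(sales):
        raise ValueError("need one sales figure per taxpayer")
    n = len(indicators[0])
    total_sales = sum(sales)
    # Arithmetic mean of each indicator component over all taxpayers.
    avg = [sum(s[j] for s in indicators) / m for j in range(n)]
    # Mean of each component weighted by the taxpayer's total sales.
    wavg = [
        sum(o * s[j] for o, s in zip(sales, indicators)) / total_sales
        for j in range(n)
    ]
    return avg, wavg


# Hypothetical toy input: 3 taxpayers, 2 indicators each.
avg, wavg = industry_features(
    [[0.60, 0.02], [0.64, 0.01], [0.50, 0.15]],
    [10.0, 30.0, 20.0],
)
```

Because WAVG weights each taxpayer by sales, a large taxpayer pulls the industry feature toward its own indicator values, while AVG treats every taxpayer equally.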
In a further improvement of the present invention, for each taxpayer found in step S103 to have abnormal data, the component-wise relative error between each of its tax return statistical indicator components and the industry/region statistical features AVG and WAVG is calculated.
In a further improvement of the present invention, the cosine similarity threshold is 0.96.
In a further improvement of the present invention, the tax return statistical indicators in step S101 comprise one or more of: the tax burden ratio of each tax category, income tax amount, offset-item tax amount, and operating cost.
In a further improvement of the present invention, the cosine similarity in step S103 is calculated as follows. The cosine similarities between the statistical indicator vector S(i) of the i-th taxpayer and the industry/region statistical features AVG and WAVG are, respectively:
$$\mathrm{Similarity}(S(i), \mathrm{AVG}) = \frac{S(i) \cdot \mathrm{AVG}}{\|S(i)\| \cdot \|\mathrm{AVG}\|} = \frac{\sum_{j=1}^{n} S_j(i) \times \mathrm{AVG}_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2} \times \sqrt{\sum_{j=1}^{n} \mathrm{AVG}_j^2}}$$
$$\mathrm{Similarity}(S(i), \mathrm{WAVG}) = \frac{S(i) \cdot \mathrm{WAVG}}{\|S(i)\| \cdot \|\mathrm{WAVG}\|} = \frac{\sum_{j=1}^{n} S_j(i) \times \mathrm{WAVG}_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2} \times \sqrt{\sum_{j=1}^{n} \mathrm{WAVG}_j^2}}$$
where $\mathrm{AVG}_j$ is the j-th component of the arithmetic mean AVG, $\mathrm{WAVG}_j$ is the j-th component of the total-sales-weighted mean WAVG, and $S_j(i)$ is the j-th tax return statistical indicator of the i-th taxpayer.
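The cosine similarity formula and the threshold test of step S103 might be written as follows. This is a sketch only: the function names `cosine_similarity` and `is_normal` are hypothetical, the zero-vector check is an added safeguard that the text does not mention, and the default threshold is the 0.96 value given in the text.

```python
import math
from typing import Sequence

THRESHOLD = 0.96  # cosine similarity threshold stated in the text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as an input error (cosine is undefined).
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_normal(s: Sequence[float], feature: Sequence[float],
              threshold: float = THRESHOLD) -> bool:
    """True when the similarity is strictly greater than the threshold (step S103)."""
    return cosine_similarity(s, feature) > threshold
```

Cosine similarity depends only on the direction of the indicator vector, not its length, so it compares the overall proportions of a taxpayer's indicators with those of the industry/region feature.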
In a further improvement of the present invention, for the j-th statistical indicator $S_j(i)$ of the i-th taxpayer with abnormal data, the component-wise relative error with respect to the industry/region statistical features AVG and WAVG is calculated as:
$$\mathrm{Error}(S_j(i), \mathrm{AVG}_j) = \frac{|S_j(i) - \mathrm{AVG}_j|}{|S_j(i) + \mathrm{AVG}_j|}$$
$$\mathrm{Error}(S_j(i), \mathrm{WAVG}_j) = \frac{|S_j(i) - \mathrm{WAVG}_j|}{|S_j(i) + \mathrm{WAVG}_j|}$$
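The component relative error can be illustrated with a short Python sketch. The function names `relative_error` and `component_errors` are hypothetical, and raising an error when the denominator is zero is an assumption, since the text does not say how that case is handled.

```python
from typing import List, Sequence


def relative_error(s_j: float, ref_j: float) -> float:
    """|s_j - ref_j| / |s_j + ref_j| for one indicator component."""
    denom = abs(s_j + ref_j)
    if denom == 0.0:
        # Assumption: the text does not define this case, so reject it.
        raise ValueError("relative error is undefined when s_j + ref_j == 0")
    return abs(s_j - ref_j) / denom


def component_errors(s: Sequence[float], feature: Sequence[float]) -> List[float]:
    """Relative error of every component of s against AVG or WAVG."""
    return [relative_error(x, r) for x, r in zip(s, feature)]
```

For non-negative indicators the result lies between 0 and 1, which makes errors of differently scaled indicators comparable with one another.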
Compared with the prior art, the beneficial effects of the present invention are:
(1) Low algorithmic complexity, which favors large-scale use. The number of taxpayers nationwide is on the order of ten million, so the complexity of the analysis algorithm directly affects its practical usefulness. The present invention uses cosine similarity to describe how similar a taxpayer's tax return data are to the statistical features of the data of different taxpayers in the same industry/region, in order to identify suspicious taxpayers and detect abnormal data. The algorithm has low complexity and runs fast, and can support data analysis for a large number of taxpayers;
(2) High detection accuracy. Existing detection methods evaluate a single feature against a threshold. Because a taxpayer's overall tax return data can change markedly under the influence of its business model and short-term performance fluctuations, existing methods suffer from a high false-positive rate. The present invention calculates cosine similarity over multi-dimensional indicators of the tax return data and measures the overall similarity between a taxpayer's data and the statistical features, which can effectively reduce false positives and give tax auditors more accurate anomaly detection results.
Brief description of the drawings:
Fig. 1 is a block diagram of the abnormal tax return data detection method based on cosine similarity.
Detailed description of the embodiments:
Embodiments of the present invention are described in detail below with reference to the accompanying drawing and an example.
Referring to Fig. 1, the abnormal tax return data detection method based on cosine similarity of the present invention comprises the following steps:
Step S100: collect the tax return data of multiple taxpayers in the same industry/region;
Example: the 2010 tax return data of 7 units in the thermal power industry of a certain region, in which unit 7 concealed sales revenue and altered the tax category. The tax return statistics of the 7 units are shown in Table 1.
Table 1: 2010 tax return data of 7 units in the thermal power industry of a certain region

| Item | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Value-added tax ratio | 62.1% | 64.7% | 59.8% | 61.4% | 69.1% | 67.7% | 47.4% |
| Business tax ratio | 1.1% | 1.2% | 1.7% | 1.5% | 0.8% | 0.9% | 17.9% |
| 2010 annual sales (100 million yuan) | 15.6 | 12.6 | 36.2 | 67.1 | 11.5 | 27.1 | 18.9 |
| 2009 annual sales (100 million yuan) | 7.8 | 4.2 | 9.1 | 13.4 | 1.9 | 3.9 | 2.4 |
| Gross profit margin | 23.3% | 21.8% | 26.1% | 22.9% | 18.7% | 17.1% | 17.9% |
Step S101: from the collected taxpayer tax return data, select the value-added tax ratio (S1), business tax ratio (S2), year-on-year sales growth (S3), and gross profit margin (S4) as statistical indicators. The values are shown in Table 2;
Table 2: 2010 tax return statistical indicators of 7 units in the thermal power industry of a certain region

| Indicator | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S1 | 62.1% | 64.7% | 59.8% | 61.4% | 69.1% | 67.7% | 47.4% |
| S2 | 1.1% | 1.2% | 1.7% | 1.5% | 0.8% | 0.9% | 17.9% |
| S3 | 13.5% | 16.7% | 13.1% | 17.5% | 11.9% | 13.1% | 18.8% |
| S4 | 23.3% | 21.8% | 26.1% | 22.9% | 18.7% | 17.1% | 17.9% |
Step S102: applying the formulas for the arithmetic mean and the total-sales-weighted mean of the industry/region statistical indicator vectors, the arithmetic mean AVG and the total-sales-weighted mean WAVG of the 2010 data of the 7 units are
$$\mathrm{AVG} = \frac{1}{7} \sum_{i=1}^{7} S(i) = \begin{pmatrix} 61.7\% \\ 3.6\% \\ 14.9\% \\ 21.1\% \end{pmatrix}, \quad \mathrm{WAVG} = \frac{1}{\sum_{i=1}^{7} o(i)} \sum_{i=1}^{7} o(i) \cdot S(i) = \begin{pmatrix} 61.3\% \\ 3.0\% \\ 15.4\% \\ 21.9\% \end{pmatrix}$$
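As a check on one component, the S2 (business tax ratio) entries of AVG and WAVG can be recomputed by hand from the S2 row of Table 2. The weighted version assumes that o(i) is the 2010 annual sales figure from Table 1, which the text does not state explicitly but which reproduces the reported value:

$$\mathrm{AVG}_2 = \frac{1.1 + 1.2 + 1.7 + 1.5 + 0.8 + 0.9 + 17.9}{7}\,\% = \frac{25.1}{7}\,\% \approx 3.6\%$$

$$\mathrm{WAVG}_2 = \frac{15.6 \times 1.1 + 12.6 \times 1.2 + 36.2 \times 1.7 + 67.1 \times 1.5 + 11.5 \times 0.8 + 27.1 \times 0.9 + 18.9 \times 17.9}{15.6 + 12.6 + 36.2 + 67.1 + 11.5 + 27.1 + 18.9}\,\% = \frac{566.37}{189.0}\,\% \approx 3.0\%$$

Unit 7 alone contributes 17.9 of the 25.1 in the unweighted sum, which is why its S2 value stands so far from the industry feature.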
Step S103: calculate the cosine similarity between the statistical indicator vectors of the 7 units and the industry/region statistical features AVG and WAVG. The results are shown in Table 3. In this example the cosine similarity threshold is set to 0.96: a similarity greater than the threshold is recorded as "Normal", and a similarity less than or equal to the threshold is recorded as "Abnormal". The detection results are shown in Table 4: units 1-6 are all normal, and the data of unit 7 are abnormal;
Table 3: Cosine similarity

| Feature | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AVG | 0.9987 | 0.9984 | 0.9967 | 0.9992 | 0.9963 | 0.9958 | 0.9550 |
| WAVG | 0.9991 | 0.9983 | 0.9977 | 0.9997 | 0.9954 | 0.9949 | 0.9537 |

Table 4: Anomaly detection results

| Feature | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AVG | Normal | Normal | Normal | Normal | Normal | Normal | Abnormal |
| WAVG | Normal | Normal | Normal | Normal | Normal | Normal | Abnormal |
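The classification in Table 4 follows directly from the Table 3 values and the 0.96 threshold, as the following Python sketch shows. The helper name `classify` is hypothetical, and flagging a unit when it is abnormal under either feature is an assumption: the text does not say how the AVG and WAVG results are combined, and in this example they agree.

```python
THRESHOLD = 0.96  # threshold used in this example

# Cosine similarities copied from Table 3 (units 1 to 7).
similarity = {
    "AVG": [0.9987, 0.9984, 0.9967, 0.9992, 0.9963, 0.9958, 0.9550],
    "WAVG": [0.9991, 0.9983, 0.9977, 0.9997, 0.9954, 0.9949, 0.9537],
}


def classify(values, threshold=THRESHOLD):
    """Label each similarity: strictly above the threshold is 'Normal'."""
    return ["Normal" if v > threshold else "Abnormal" for v in values]


results = {name: classify(vals) for name, vals in similarity.items()}

# Assumption: a unit is flagged if it is abnormal under AVG or under WAVG.
flagged = sorted(
    {
        unit
        for labels in results.values()
        for unit, label in enumerate(labels, start=1)
        if label == "Abnormal"
    }
)
```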
Step S104: for unit 7, whose data are abnormal, calculate the relative error between each of its components and the statistical features. The results are shown in Table 5. The relative error of indicator S2 is as high as 0.6678 with respect to AVG and 0.7146 with respect to WAVG, so indicator S2 is judged to be abnormal.
Table 5: Component relative errors of the statistical indicators of unit 7

| Indicator | AVG | WAVG |
| --- | --- | --- |
| S1(7) | 0.1356 | 0.1324 |
| S2(7) | 0.6678 | 0.7146 |
| S3(7) | 0.1248 | 0.1084 |
| S4(7) | 0.0796 | 0.0974 |
Step S105: output the result "Suspicious taxpayer 7 detected; its business tax ratio indicator is abnormal".
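Steps S104 and S105 can be sketched from the Table 5 values. The text does not state a rule for deciding which indicator is abnormal, so choosing the indicator with the largest relative error is an assumption made here for illustration; the names `most_deviating` and `report` are likewise hypothetical.

```python
# Indicator names as used in step S101 of the example.
INDICATOR_NAMES = {
    "S1": "value-added tax ratio",
    "S2": "business tax ratio",
    "S3": "year-on-year sales growth",
    "S4": "gross profit margin",
}

# Relative errors of unit 7 copied from Table 5: (vs AVG, vs WAVG).
errors_unit7 = {
    "S1": (0.1356, 0.1324),
    "S2": (0.6678, 0.7146),
    "S3": (0.1248, 0.1084),
    "S4": (0.0796, 0.0974),
}


def most_deviating(errors):
    """Assumed rule: pick the indicator whose larger error is the greatest."""
    return max(errors, key=lambda name: max(errors[name]))


def report(taxpayer_id, errors):
    """Build an output message in the spirit of step S105."""
    name = INDICATOR_NAMES[most_deviating(errors)]
    return f"Suspicious taxpayer {taxpayer_id} detected; its {name} indicator is abnormal"


message = report(7, errors_unit7)
```

A fixed error cut-off would be an equally plausible rule and could report more than one indicator per taxpayer.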

Claims (3)

1. A method for detecting abnormal tax return data based on cosine similarity, characterized in that it comprises the following steps:
S100: collect the tax return data of m taxpayers in the same industry/region for the same tax filing period;
S101: from the tax return data of taxpayer i collected in step S100 for that filing period, calculate each tax return statistical indicator, denoted $S_1(i), S_2(i), \ldots, S_n(i)$, and use them to form the statistical indicator vector of taxpayer i, $S(i) = (S_1(i), S_2(i), \ldots, S_n(i))$, where n is the total number of kinds of tax return statistical indicators;
S102: for all taxpayers 1, 2, ..., m in the same industry/region, calculate the arithmetic mean AVG and the total-sales-weighted mean WAVG of their statistical indicator vectors, using the formulas:
$$\mathrm{AVG} = \frac{1}{m} \sum_{i=1}^{m} S(i)$$
$$\mathrm{WAVG} = \frac{1}{\sum_{i=1}^{m} o(i)} \sum_{i=1}^{m} o(i) \cdot S(i)$$
where o(i) is the total sales of taxpayer i;
S103: calculate the cosine similarity between the statistical indicator vector of each of the m taxpayers and the industry/region statistical features AVG and WAVG; a taxpayer whose similarity is greater than the cosine similarity threshold has normal tax return data, and a taxpayer whose similarity is less than or equal to the cosine similarity threshold has abnormal tax return data;
for each taxpayer found in step S103 to have abnormal data, the component-wise relative error between each of its tax return statistical indicator components and the industry/region statistical features AVG and WAVG is calculated;
the tax return statistical indicators in step S101 comprise one or more of: the tax burden ratio of each tax category, income tax amount, offset-item tax amount, and operating cost;
in step S103, the cosine similarities between the statistical indicator vector S(i) of the i-th taxpayer and the industry/region statistical features AVG and WAVG are calculated, respectively, as:
$$\mathrm{Similarity}(S(i), \mathrm{AVG}) = \frac{S(i) \cdot \mathrm{AVG}}{\|S(i)\| \cdot \|\mathrm{AVG}\|} = \frac{\sum_{j=1}^{n} S_j(i) \times \mathrm{AVG}_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2} \sqrt{\sum_{j=1}^{n} \mathrm{AVG}_j^2}}$$
$$\mathrm{Similarity}(S(i), \mathrm{WAVG}) = \frac{S(i) \cdot \mathrm{WAVG}}{\|S(i)\| \cdot \|\mathrm{WAVG}\|} = \frac{\sum_{j=1}^{n} S_j(i) \times \mathrm{WAVG}_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2} \sqrt{\sum_{j=1}^{n} \mathrm{WAVG}_j^2}}$$
where $\mathrm{AVG}_j$ is the j-th component of the arithmetic mean AVG, $\mathrm{WAVG}_j$ is the j-th component of the total-sales-weighted mean WAVG, and $S_j(i)$ is the j-th tax return statistical indicator of the i-th taxpayer;
the abnormal tax return data detection method based on cosine similarity is carried out automatically by a computer.
2. The abnormal tax return data detection method based on cosine similarity according to claim 1, characterized in that the cosine similarity threshold is 0.96.
3. The abnormal tax return data detection method based on cosine similarity according to claim 1, characterized in that, for the j-th statistical indicator $S_j(i)$ of the i-th taxpayer with abnormal data, the component-wise relative error with respect to the industry/region statistical features AVG and WAVG is calculated as:
$$\mathrm{Error}(S_j(i), \mathrm{AVG}_j) = \frac{|S_j(i) - \mathrm{AVG}_j|}{|S_j(i) + \mathrm{AVG}_j|}$$
$$\mathrm{Error}(S_j(i), \mathrm{WAVG}_j) = \frac{|S_j(i) - \mathrm{WAVG}_j|}{|S_j(i) + \mathrm{WAVG}_j|}$$
CN201310291896.9A 2013-07-11 2013-07-11 Abnormal tax return data detection method based on cosine similarity Active CN103377454B (en)

Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201310291896.9A, CN103377454B (en) | 2013-07-11 | 2013-07-11 | Abnormal tax return data detection method based on cosine similarity |

Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN103377454A (en) | 2013-10-30 |
| CN103377454B (en) | 2015-11-11 |

Family

ID=49462524

Country Status (1)

Country Link
CN (1) CN103377454B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN104166934A (en) * | 2014-08-29 | 2014-11-26 | 税友软件集团股份有限公司 | Tax revenue analysis method and system of index model for industries and tax categories |
| CN106933814A (en) * | 2015-12-28 | 2017-07-07 | 航天信息股份有限公司 | Tax data exception analysis method and system |
| CN106021479A (en) * | 2016-05-18 | 2016-10-12 | 广东源恒软件科技有限公司 | Project key index automatic association method and system |
| CN110659948A (en) * | 2018-06-13 | 2020-01-07 | 中国软件与技术服务股份有限公司 | Calculation method for matching degree of commodity sold and false invoice risk discovery method |
| CN111695979A (en) * | 2020-06-18 | 2020-09-22 | 税友软件集团股份有限公司 | Method, device and equipment for analyzing relation between raw material and finished product |
| CN112613929A (en) * | 2020-12-17 | 2021-04-06 | 山东浪潮商用系统有限公司 | Invoice false invoice recognition method and system based on semantic analysis |
| CN114445207B (en) * | 2022-04-11 | 2022-07-26 | 广东企数标普科技有限公司 | Tax administration system based on digital RMB |

Citations (5)

* Cited by examiner, † Cited by third party
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| JP2000259719A (en) * | 1999-03-08 | 2000-09-22 | Internatl Business Mach Corp <Ibm> | Method and device for calculating probability of default on obligation |
| WO2005101265A2 (en) * | 2004-04-06 | 2005-10-27 | Pricewaterhousecoopers, Llp | Systems and methods for investigation of financial reporting information |
| CN102609874A (en) * | 2012-02-15 | 2012-07-25 | 江苏壹格信息科技有限公司 | Tax-involved risk assessment method for real-estate project |
| CN102890803A (en) * | 2011-07-21 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Method and device for determining abnormal transaction process of electronic commodity |
| CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |

Also Published As

Publication number Publication date
CN103377454A (en) 2013-10-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160415

Address after: Tax Building, No. 3738 South Ring Road, Binjiang District, Hangzhou, Zhejiang 310053

Patentee after: Servyou Software Group Co., Ltd.

Address before: No. 28 Xianning West Road, Shaanxi, China, 710049

Patentee before: Xi'an Jiaotong University