CN103377454B - Method for detecting abnormal tax return data based on cosine similarity - Google Patents
- Publication number
- CN103377454B
- Authority
- CN
- China
- Prior art keywords
- taxpayer
- data
- avg
- tax declaration
- cosine similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a method for detecting abnormal tax return data based on cosine similarity. Starting from the tax return data of taxpayers in the same industry/region, the method computes the statistical indicators of each taxpayer's tax return data and the statistical features of the tax return data of all taxpayers in that industry/region. By computing the cosine similarity between each taxpayer's indicators and the statistical features of the same industry/region, it detects abnormal data and identifies suspicious taxpayers. The method improves the accuracy of detecting abnormal tax return data, reduces computational complexity, and identifies suspicious taxpayers.
Description
Technical field:
The present invention relates to the field of data monitoring, and in particular to a method for detecting abnormal tax return data.
Background technology:
Tax audit is the general term for the inspection and handling work that tax authorities carry out, in accordance with the law, on how taxpayers and withholding agents fulfil their obligations to pay and to withhold tax. Tax laws and regulations are complex and the audit points are numerous, generally more than 2,000. The volume of data to be audited is also huge: for a large enterprise, financial voucher data alone amounts to tens of millions of records, and with the traditional manual approach, auditing one relatively large enterprise generally takes a team of 5 to 10 people 6 months. How to analyze taxpayers' tax return data automatically, screen out abnormal tax return data and taxpayers, and reduce the volume of data that must be audited manually has become one of the urgent problems in the field of tax audit.
Summary of the invention:
The main purpose of the present invention is to provide a method for detecting abnormal tax return data based on cosine similarity. The method builds a feature vector from each taxpayer's tax return data and a statistical feature vector from the tax return data of taxpayers in the same region/industry, computes the cosine similarity between each taxpayer's vector and the statistical features, and thereby detects whether a taxpayer's tax return data is abnormal, so as to identify suspicious taxpayers.
The object of the present invention is achieved through the following technical solution:
A method for detecting abnormal tax return data based on cosine similarity comprises the following steps:
S100: collect the tax return data of m taxpayers in the same industry/region for the same tax filing period;
S101: from the tax return data of taxpayer i collected in step S100 for that filing period, compute each tax return statistical indicator, denoted $S_1(i), S_2(i), \ldots, S_n(i)$, and form the statistical indicator vector of taxpayer i, $S(i) = (S_1(i), S_2(i), \ldots, S_n(i))$, where n is the total number of kinds of tax return statistical indicators;
S102: for all taxpayers 1, 2, ..., m in the same industry/region, compute the arithmetic mean AVG and the total-sales-weighted mean WAVG of their tax return statistical indicator vectors:
$$AVG = \frac{1}{m}\sum_{i=1}^{m} S(i), \qquad WAVG = \frac{\sum_{i=1}^{m} o(i)\,S(i)}{\sum_{i=1}^{m} o(i)}$$
where o(i) is the total sales of taxpayer i;
S103: compute the cosine similarity between the statistical indicator vector of each of the m taxpayers and the same-industry/region statistical features AVG and WAVG. If the similarity is greater than the cosine similarity threshold, the tax return data of the corresponding taxpayer is normal; if the similarity is less than or equal to the threshold, the tax return data of the corresponding taxpayer contains abnormal data.
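Step S102 can be sketched in a few lines of Python. This is a minimal illustration, assuming AVG is the component-wise arithmetic mean of the indicator vectors and WAVG is the component-wise mean weighted by each taxpayer's total sales o(i). The function name `mean_vectors` and the toy numbers are hypothetical and do not come from the patent.

```python
from typing import List, Sequence, Tuple


def mean_vectors(
    S: Sequence[Sequence[float]], o: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (AVG, WAVG) for m indicator vectors S and total sales o.

    Assumption: AVG_j = (1/m) * sum_i S_j(i) and
    WAVG_j = sum_i o(i) * S_j(i) / sum_i o(i).
    """
    m = len(S)
    if m == 0 or len(o) != m:
        raise ValueError("S and o must be non-empty and of equal length")
    total_sales = sum(o)
    if total_sales == 0:
        raise ValueError("total sales must be non-zero")
    n = len(S[0])
    avg = [sum(S[i][j] for i in range(m)) / m for j in range(n)]
    wavg = [sum(o[i] * S[i][j] for i in range(m)) / total_sales for j in range(n)]
    return avg, wavg


# Hypothetical toy data: two taxpayers, two indicators each.
S_toy = [[0.60, 0.02], [0.70, 0.04]]
o_toy = [1.0, 3.0]  # the second taxpayer has three times the sales
avg_toy, wavg_toy = mean_vectors(S_toy, o_toy)
print(avg_toy, wavg_toy)
```

Because the second taxpayer carries three quarters of the total sales, WAVG is pulled toward its indicators while AVG treats both taxpayers equally.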
A further improvement of the invention: for each taxpayer found in step S103 to have abnormal data, compute the component relative error between each component of its tax return statistical indicators and the same-industry/region statistical features AVG and WAVG.
A further improvement of the invention: the cosine similarity threshold is 0.96.
A further improvement of the invention: the tax return statistical indicators in step S101 include one or more of the tax burden ratio of each tax category, the input tax amount, the output tax amount, and the operating cost.
A further improvement of the invention: in step S103, the cosine similarity between the statistical indicator vector S(i) of the i-th taxpayer and each of the same-industry/region statistical features AVG and WAVG is computed as:
$$\cos\bigl(S(i), AVG\bigr) = \frac{\sum_{j=1}^{n} S_j(i)\,AVG_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2}\,\sqrt{\sum_{j=1}^{n} AVG_j^2}}, \qquad \cos\bigl(S(i), WAVG\bigr) = \frac{\sum_{j=1}^{n} S_j(i)\,WAVG_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2}\,\sqrt{\sum_{j=1}^{n} WAVG_j^2}}$$
where $AVG_j$ is the j-th component of the arithmetic mean AVG, $WAVG_j$ is the j-th component of the total-sales-weighted mean WAVG, and $S_j(i)$ is the j-th tax return statistical indicator of the i-th taxpayer.
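The similarity test of step S103 can be illustrated as follows. This is a sketch that assumes the standard definition of cosine similarity (dot product divided by the product of the Euclidean norms). The helper names `cosine_similarity` and `is_normal` are hypothetical. The default threshold of 0.96 and the strict "greater than" rule are taken from the text.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_normal(similarity: float, threshold: float = 0.96) -> bool:
    """Normal only if strictly greater than the threshold (per step S103)."""
    return similarity > threshold


# Hypothetical vectors, chosen only to exercise the function.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(is_normal(0.9987), is_normal(0.96))
```

Cosine similarity depends only on the direction of the indicator vector, not on its length, so a taxpayer whose indicators are all scaled by the same factor relative to the industry mean still scores 1.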
A further improvement of the invention: for the j-th statistical indicator $S_j(i)$ of an i-th taxpayer with abnormal data, the component relative error with respect to the same-industry/region statistical features AVG and WAVG is computed by the formula:
[formula not reproduced in this text]
Compared with the prior art, the invention has the following beneficial effects:
(1) Low algorithmic complexity, which favors large-scale use. The number of taxpayers nationwide is on the order of ten million, so the complexity of the analysis algorithm directly affects its practical usefulness. The invention uses cosine similarity to describe how closely a taxpayer's tax return data resembles the statistical features of the data of other taxpayers in the same industry/region, in order to identify suspicious taxpayers and detect abnormal data. The algorithm has low complexity, runs quickly, and can support data analysis over a large number of taxpayers.
(2) High detection accuracy. Existing detection methods assess a single feature against a threshold. Because business practices and short-term performance fluctuate, a taxpayer's overall tax return data can change significantly, and existing methods therefore suffer from a high false-alarm rate. The invention computes cosine similarity over multi-dimensional indicators of the tax return data and examines the overall similarity between the taxpayer's data and the statistical features, which effectively reduces false alarms and gives tax auditors more accurate anomaly detection results.
Brief description of the drawings:
Fig. 1 is a block diagram of the method for detecting abnormal tax return data based on cosine similarity.
Embodiment:
Embodiments of the present invention are described in detail below with reference to the accompanying drawing and an example.
Referring to Fig. 1, the method of the present invention for detecting abnormal tax return data based on cosine similarity comprises the following steps:
Step S100: collect the tax return data of multiple taxpayers in the same industry/region;
Example: the 2010 tax return data of 7 units in the thermal power industry of a certain region, in which unit 7 concealed sales revenue and altered tax categories. The tax return data of the 7 units is shown in Table 1.
Table 1. 2010 tax return data of 7 units in the thermal power industry of a certain region
Item | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Value-added tax ratio | 62.1% | 64.7% | 59.8% | 61.4% | 69.1% | 67.7% | 47.4% |
Business tax ratio | 1.1% | 1.2% | 1.7% | 1.5% | 0.8% | 0.9% | 17.9% |
2010 sales (100 million yuan) | 15.6 | 12.6 | 36.2 | 67.1 | 11.5 | 27.1 | 18.9 |
2009 sales (100 million yuan) | 7.8 | 4.2 | 9.1 | 13.4 | 1.9 | 3.9 | 2.4 |
Gross profit margin | 23.3% | 21.8% | 26.1% | 22.9% | 18.7% | 17.1% | 17.9% |
Step S101: from the collected tax return data, select the value-added tax ratio (S1), the business tax ratio (S2), the year-on-year sales growth (S3) and the gross profit margin (S4) as statistical indicators. The values are shown in Table 2;
Table 2. 2010 tax return statistical indicators of 7 units in the thermal power industry of a certain region
Indicator | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
S1 | 62.1% | 64.7% | 59.8% | 61.4% | 69.1% | 67.7% | 47.4% |
S2 | 1.1% | 1.2% | 1.7% | 1.5% | 0.8% | 0.9% | 17.9% |
S3 | 13.5% | 16.7% | 13.1% | 17.5% | 11.9% | 13.1% | 18.8% |
S4 | 23.3% | 21.8% | 26.1% | 22.9% | 18.7% | 17.1% | 17.9% |
Step S102: using the formulas for the arithmetic mean and the total-sales-weighted mean of the same-industry/region statistical indicator vectors, compute the arithmetic mean AVG and the total-sales-weighted mean WAVG of the 2010 data of the 7 units;
Step S103: compute the cosine similarity between the statistical indicator vector of each of the 7 units and the same-industry/region statistical features AVG and WAVG. The results are shown in Table 3. In this example the cosine similarity threshold is set to 0.96. A similarity greater than the threshold is recorded as "normal" and a similarity less than the threshold is recorded as "abnormal". The detection results are shown in Table 4: units 1 to 6 are all normal, and the data of unit 7 is abnormal;
Table 3. Cosine similarity
Feature | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
AVG | 0.9987 | 0.9984 | 0.9967 | 0.9992 | 0.9963 | 0.9958 | 0.9550 |
WAVG | 0.9991 | 0.9983 | 0.9977 | 0.9997 | 0.9954 | 0.9949 | 0.9537 |
Table 4. Anomaly detection results
Feature | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
AVG | Normal | Normal | Normal | Normal | Normal | Normal | Abnormal |
WAVG | Normal | Normal | Normal | Normal | Normal | Normal | Abnormal |
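Steps S102 and S103 of the example can be retraced with a short script. This is a sketch under stated assumptions: the indicator values are those printed in Table 2, the weights o(i) are assumed to be the 2010 sales from Table 1, AVG and WAVG are the component-wise plain and sales-weighted means, and the similarity is the standard cosine. The variable and function names are illustrative only.

```python
import math

# Indicators (S1, S2, S3, S4) in percent for units 1..7, as printed in Table 2.
S = [
    (62.1, 1.1, 13.5, 23.3),
    (64.7, 1.2, 16.7, 21.8),
    (59.8, 1.7, 13.1, 26.1),
    (61.4, 1.5, 17.5, 22.9),
    (69.1, 0.8, 11.9, 18.7),
    (67.7, 0.9, 13.1, 17.1),
    (47.4, 17.9, 18.8, 17.9),
]
# Assumption: o(i) is the 2010 sales figure from Table 1 (100 million yuan).
o = [15.6, 12.6, 36.2, 67.1, 11.5, 27.1, 18.9]
THRESHOLD = 0.96


def cosine(u, v):
    """Standard cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


m, n = len(S), len(S[0])
# Step S102: arithmetic mean and total-sales-weighted mean, per component.
avg = [sum(row[j] for row in S) / m for j in range(n)]
wavg = [sum(w * row[j] for w, row in zip(o, S)) / sum(o) for j in range(n)]

# Step S103: similarity of every unit to both features, then thresholding.
sim_avg = [cosine(row, avg) for row in S]
sim_wavg = [cosine(row, wavg) for row in S]
normal_avg = [s > THRESHOLD for s in sim_avg]
normal_wavg = [s > THRESHOLD for s in sim_wavg]

for unit, (a, w) in enumerate(zip(sim_avg, sim_wavg), start=1):
    label = "normal" if a > THRESHOLD and w > THRESHOLD else "abnormal"
    print(f"unit {unit}: cos(AVG)={a:.4f} cos(WAVG)={w:.4f} -> {label}")
```

With the values as printed, this sketch classifies units 1 to 6 as normal and unit 7 as abnormal under both AVG and WAVG, in agreement with Table 4. The similarities it prints may differ from Table 3 in the third or fourth decimal place, since the inputs are rounded and the exact weighting used in the patent is an assumption here.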
Step S104: for unit 7, whose data is abnormal, compute the relative error between each of its components and the statistical features. The results are shown in Table 5. The relative error of indicator S2 is as high as 0.6678 relative to AVG and 0.7146 relative to WAVG, so indicator S2 is judged to be abnormal.
Table 5. Component relative errors of the statistical indicators of unit 7
Indicator | AVG | WAVG |
---|---|---|
S1(7) | 0.1356 | 0.1324 |
S2(7) | 0.6678 | 0.7146 |
S3(7) | 0.1248 | 0.1084 |
S4(7) | 0.0796 | 0.0974 |
Step S105: output the result "Suspicious taxpayer 7 detected; its business tax ratio indicator is abnormal".
Claims (3)
1. A method for detecting abnormal tax return data based on cosine similarity, characterized by comprising the following steps:
S100: collect the tax return data of m taxpayers in the same industry/region for the same tax filing period;
S101: from the tax return data of taxpayer i collected in step S100 for that filing period, compute each tax return statistical indicator, denoted $S_1(i), S_2(i), \ldots, S_n(i)$, and form the statistical indicator vector of taxpayer i, $S(i) = (S_1(i), S_2(i), \ldots, S_n(i))$, where n is the total number of kinds of tax return statistical indicators;
S102: for all taxpayers 1, 2, ..., m in the same industry/region, compute the arithmetic mean AVG and the total-sales-weighted mean WAVG of their tax return statistical indicator vectors:
$$AVG = \frac{1}{m}\sum_{i=1}^{m} S(i), \qquad WAVG = \frac{\sum_{i=1}^{m} o(i)\,S(i)}{\sum_{i=1}^{m} o(i)}$$
where o(i) is the total sales of taxpayer i;
S103: compute the cosine similarity between the statistical indicator vector of each of the m taxpayers and the same-industry/region statistical features AVG and WAVG; if the similarity is greater than the cosine similarity threshold, the tax return data of the corresponding taxpayer is normal; if the similarity is less than or equal to the threshold, the tax return data of the corresponding taxpayer contains abnormal data;
for each taxpayer found in step S103 to have abnormal data, compute the component relative error between each component of its tax return statistical indicators and the same-industry/region statistical features AVG and WAVG;
the tax return statistical indicators in step S101 include one or more of the tax burden ratio of each tax category, the input tax amount, the output tax amount, and the operating cost;
in step S103, the cosine similarity between the statistical indicator vector S(i) of the i-th taxpayer and each of the same-industry/region statistical features AVG and WAVG is computed as:
$$\cos\bigl(S(i), AVG\bigr) = \frac{\sum_{j=1}^{n} S_j(i)\,AVG_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2}\,\sqrt{\sum_{j=1}^{n} AVG_j^2}}, \qquad \cos\bigl(S(i), WAVG\bigr) = \frac{\sum_{j=1}^{n} S_j(i)\,WAVG_j}{\sqrt{\sum_{j=1}^{n} S_j(i)^2}\,\sqrt{\sum_{j=1}^{n} WAVG_j^2}}$$
where $AVG_j$ is the j-th component of the arithmetic mean AVG, $WAVG_j$ is the j-th component of the total-sales-weighted mean WAVG, and $S_j(i)$ is the j-th tax return statistical indicator of the i-th taxpayer;
the method for detecting abnormal tax return data based on cosine similarity is carried out automatically by a computer.
2. The method for detecting abnormal tax return data based on cosine similarity according to claim 1, characterized in that the cosine similarity threshold is 0.96.
3. The method for detecting abnormal tax return data based on cosine similarity according to claim 1, characterized in that, for the j-th statistical indicator $S_j(i)$ of an i-th taxpayer with abnormal data, the component relative error with respect to the same-industry/region statistical features AVG and WAVG is computed by the formula:
[formula not reproduced in this text]
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310291896.9A CN103377454B (en) | 2013-07-11 | 2013-07-11 | Method for detecting abnormal tax return data based on cosine similarity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103377454A (en) | 2013-10-30 |
CN103377454B (en) | 2015-11-11 |
Family
ID=49462524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310291896.9A Active CN103377454B (en) | Method for detecting abnormal tax return data based on cosine similarity | 2013-07-11 | 2013-07-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103377454B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104166934A (en) * | 2014-08-29 | 2014-11-26 | 税友软件集团股份有限公司 | Tax revenue analysis method and system of index model for industries and tax categories |
CN106933814A (en) * | 2015-12-28 | 2017-07-07 | 航天信息股份有限公司 | Tax data exception analysis method and system |
CN106021479A (en) * | 2016-05-18 | 2016-10-12 | 广东源恒软件科技有限公司 | Project key index automatic association method and system |
CN110659948A (en) * | 2018-06-13 | 2020-01-07 | 中国软件与技术服务股份有限公司 | Calculation method for matching degree of commodity sold and false invoice risk discovery method |
CN111695979A (en) * | 2020-06-18 | 2020-09-22 | 税友软件集团股份有限公司 | Method, device and equipment for analyzing relation between raw material and finished product |
CN112613929A (en) * | 2020-12-17 | 2021-04-06 | 山东浪潮商用系统有限公司 | Invoice false invoice recognition method and system based on semantic analysis |
CN114445207B (en) * | 2022-04-11 | 2022-07-26 | 广东企数标普科技有限公司 | Tax administration system based on digital RMB |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000259719A (en) * | 1999-03-08 | 2000-09-22 | Internatl Business Mach Corp <Ibm> | Method and device for calculating probability of default on obligation |
WO2005101265A2 (en) * | 2004-04-06 | 2005-10-27 | Pricewaterhousecoopers, Llp | Systems and methods for investigation of financial reporting information |
CN102609874A (en) * | 2012-02-15 | 2012-07-25 | 江苏壹格信息科技有限公司 | Tax-involved risk assessment method for real-estate project |
CN102890803A (en) * | 2011-07-21 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Method and device for determining abnormal transaction process of electronic commodity |
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
Also Published As
Publication number | Publication date |
---|---|
CN103377454A (en) | 2013-10-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C41 | Transfer of patent application or patent right or utility model | ||
TR01 | Transfer of patent right |
Effective date of registration: 2016-04-15; Address after: 310053, Tax Building, No. 3738 South Ring Road, Binjiang District, Hangzhou, Zhejiang; Patentee after: Servyou Software Group Co., Ltd.; Address before: 710049, No. 28 Xianning West Road, Shaanxi, China; Patentee before: Xi'an Jiaotong University |