CN109858922A - Improper taxpayer recognition method and device - Google Patents
Improper taxpayer recognition method and device
- Publication number: CN109858922A
- Application number: CN201811584029.3A
- Authority: CN (China)
- Prior art keywords: taxpayer, improper, identified, selected feature, characteristic value
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an improper taxpayer recognition method and device. The method comprises: obtaining selected information of a taxpayer to be identified; obtaining, from the selected information, the feature value of at least one selected feature of the taxpayer to be identified; inputting the feature value of the at least one selected feature into each of a first set number of trained xgboost models in turn, to obtain a first set number of probability values for the taxpayer to be identified; and obtaining the recognition result of the taxpayer to be identified based on the first set number of probability values. The scheme uses machine learning algorithms and big data technology to build an improper taxpayer identification model that identifies whether a taxpayer is normal.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to an improper taxpayer recognition method and device.
Background art
Tax revenue is the most important form and source of national fiscal revenue. Although the widespread use of the value-added tax (VAT) anti-counterfeiting tax control system has given tax collection and administration a powerful tool and has increased state revenue, tax risk management remains inadequate. It still relies mainly on the business experience of tax analysts, which is highly subjective, not very accurate, and inefficient, especially when distinguishing enterprises that falsely issue invoices from enterprises that abscond. Using machine learning algorithms and big data technology to build an improper taxpayer identification model that identifies whether a taxpayer is normal can not only improve the monitoring and identification of suspicious enterprises and the efficiency of that identification, but also help maintain normal tax and economic order.
Summary of the invention
Embodiments of the present invention provide an improper taxpayer recognition method and device that use machine learning algorithms and big data technology to build an improper taxpayer identification model that identifies whether a taxpayer is normal.
According to an embodiment of the present invention, an improper taxpayer recognition method is provided, the method comprising:
Obtaining selected information of a taxpayer to be identified;
Obtaining, from the selected information of the taxpayer to be identified, the feature value of at least one selected feature of the taxpayer to be identified;
Inputting the feature value of the at least one selected feature of the taxpayer to be identified into each of a first set number of trained xgboost models in turn, to obtain a first set number of probability values for the taxpayer to be identified;
Obtaining the recognition result of the taxpayer to be identified based on the first set number of probability values of the taxpayer to be identified.
Specifically, obtaining the feature value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified specifically includes:
Obtaining, from the selected information of the taxpayer to be identified, the initial feature value of at least one selected feature of the taxpayer to be identified;
Changing any unreasonable initial feature value among the initial feature values of the at least one selected feature of the taxpayer to be identified to a set value, marking any initial feature value that represents a category as a category, and standardizing the initial feature values of the at least one selected feature of the taxpayer to be identified, to obtain the feature value of the at least one selected feature of the taxpayer to be identified.
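The three preprocessing operations described above can be sketched in Python as follows. This is only a minimal illustration using the standard library. The rule for what counts as "unreasonable" (missing, NaN or infinite), the set value 0.0, the integer category codes, the use of z-score standardization, and all function names and sample data are assumptions, since the text does not specify them.

```python
import math
import statistics


def clean_value(value, set_value=0.0):
    """Replace an unreasonable value (assumed here: None, NaN or infinity) with the set value."""
    if value is None or not math.isfinite(value):
        return set_value
    return value


def encode_categories(values):
    """Mark category-type values as categories by mapping them to integer codes."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def standardize(values):
    """Z-score standardization; a constant column maps to all zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical initial feature values for four taxpayers.
raw_amounts = [100.0, None, 300.0, float("nan")]
industry_codes = ["F51", "C13", "F51", "I65"]

amounts = standardize([clean_value(v) for v in raw_amounts])
industries = encode_categories(industry_codes)
```

When a single taxpayer is identified later, the mean and standard deviation would presumably be taken from the training data rather than recomputed; the text does not say which.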
Specifically, obtaining the recognition result based on the first set number of probability values specifically includes:
Calculating the mean of the first set number of probability values;
Comparing the mean with a set threshold;
If the mean is greater than or equal to the set threshold, determining that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determining that the taxpayer to be identified is a normal taxpayer.
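The decision rule above amounts to averaging the model outputs and comparing the average with the threshold. A minimal Python sketch follows, in which each trained model is represented by a plain callable that returns a probability. The function name, the stand-in models and the threshold value 0.5 are assumptions made for illustration.

```python
from statistics import fmean


def identify_taxpayer(models, features, threshold=0.5):
    """Average the probability from every model and compare it with the set threshold."""
    probabilities = [model(features) for model in models]
    mean_probability = fmean(probabilities)
    label = "improper" if mean_probability >= threshold else "normal"
    return label, mean_probability


# Stand-ins for the first set number of trained models (here three).
models = [lambda x: 0.9, lambda x: 0.4, lambda x: 0.8]
label, mean_probability = identify_taxpayer(models, features=[0.1, 1.0, 3.0])
```

Note that a mean exactly equal to the threshold is classified as improper, matching the "greater than or equal to" condition in the text.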
Specifically, the method further includes:
Obtaining selected information of a second set number of normal taxpayers and selected information of a third set number of improper taxpayers;
Obtaining, from the selected information of the second set number of normal taxpayers, the feature value of at least one selected feature of each corresponding normal taxpayer, and adding the feature value of the at least one selected feature of each normal taxpayer, together with a normal taxpayer label, to a feature-value label wide table;
Obtaining, from the selected information of the third set number of improper taxpayers, the feature value of at least one selected feature of each corresponding improper taxpayer, and adding the feature value of the at least one selected feature of each improper taxpayer, together with an improper taxpayer label, to the feature-value label wide table;
Obtaining a test sample set and a first set number of training sample sets from the feature-value label wide table;
Inputting the feature values of the at least one selected feature of each taxpayer in each of the first set number of training sample sets, together with the corresponding labels, into a respective initial xgboost model, to obtain a first set number of candidate xgboost models;
Testing the first set number of candidate xgboost models using the feature values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;
Determining precision and recall based on the test result;
If both the precision and the recall meet the standard, determining the first set number of candidate xgboost models to be the first set number of trained xgboost models.
Specifically, obtaining the feature value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers specifically includes:
Obtaining, from the selected information of the second set number of normal taxpayers, the initial feature value of at least one selected feature of each corresponding normal taxpayer;
Changing any unreasonable initial feature value among the initial feature values of the at least one selected feature of the second set number of normal taxpayers to a set value, marking any initial feature value that represents a category as a category, and standardizing the initial feature values of the at least one selected feature of the second set number of normal taxpayers, to obtain the feature value of the at least one selected feature of each corresponding normal taxpayer.
Specifically, obtaining the feature value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers specifically includes:
Obtaining, from the selected information of the third set number of improper taxpayers, the initial feature value of at least one selected feature of each corresponding improper taxpayer;
Changing any unreasonable initial feature value among the initial feature values of the at least one selected feature of the third set number of improper taxpayers to a set value, marking any initial feature value that represents a category as a category, and standardizing the initial feature values of the at least one selected feature of the third set number of improper taxpayers, to obtain the feature value of the at least one selected feature of each corresponding improper taxpayer.
Specifically, obtaining the test sample set and the first set number of training sample sets from the feature-value label wide table specifically includes:
Dividing the improper taxpayers in the feature-value label wide table into two parts according to a preset ratio, to obtain a first improper taxpayer set and a second improper taxpayer set;
For each of the first set number of training sample sets, performing: extracting from the feature-value label wide table normal taxpayers that have not been extracted before, equal in number to the improper taxpayers in the first improper taxpayer set; and combining the feature values of the at least one selected feature of the extracted normal taxpayers and the corresponding labels with the feature values of the at least one selected feature of the improper taxpayers in the first improper taxpayer set and the corresponding labels, to obtain one training sample set;
Combining the feature values of the at least one selected feature of the normal taxpayers in the feature-value label wide table that were not extracted, and the corresponding labels, with the feature values of the at least one selected feature of the improper taxpayers in the second improper taxpayer set and the corresponding labels, to obtain the test sample set.
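The sampling scheme above pairs one fixed group of improper taxpayers with several disjoint, equally sized groups of normal taxpayers, so that every training sample set is balanced. A minimal Python sketch follows, assuming each row of the wide table is a `(features, label)` tuple with label 1 for improper and 0 for normal. The function name, the 0.8 split ratio, the random shuffling and the sample data are all assumptions.

```python
import random


def build_sample_sets(normal, improper, n_sets, train_ratio=0.8, seed=0):
    """Return (training_sets, test_set) built as described in the text."""
    rng = random.Random(seed)

    # Split the improper taxpayers into two parts according to the preset ratio.
    improper = improper[:]
    rng.shuffle(improper)
    cut = int(len(improper) * train_ratio)
    first_improper, second_improper = improper[:cut], improper[cut:]

    # Draw disjoint groups of normal taxpayers, one group per training set.
    pool = normal[:]
    rng.shuffle(pool)
    size = len(first_improper)
    needed = n_sets * size
    if needed > len(pool):
        raise ValueError("not enough normal taxpayers for the requested training sets")

    training_sets = [pool[i * size:(i + 1) * size] + first_improper
                     for i in range(n_sets)]

    # Normal taxpayers never extracted, plus the second improper part, form the test set.
    test_set = pool[needed:] + second_improper
    return training_sets, test_set


# Hypothetical wide-table rows: 50 normal and 10 improper taxpayers.
normal_rows = [(f"n{i}", 0) for i in range(50)]
improper_rows = [(f"a{i}", 1) for i in range(10)]
training_sets, test_set = build_sample_sets(normal_rows, improper_rows, n_sets=3)
```

With these numbers each training sample set holds 8 normal and 8 improper taxpayers, and the test sample set holds the remaining 26 normal and 2 improper taxpayers.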
Specifically, determining the precision and the recall based on the test result specifically includes:
The precision is calculated as follows:
Precision = TP / (TP + FP);
The recall is calculated as follows:
Recall = TP / (TP + FN);
Here, improper taxpayer samples are taken as positive samples and normal taxpayer samples as negative samples. TP is the number of samples whose test result is positive and which are actually positive, FP is the number of samples whose test result is positive but which are actually negative, and FN is the number of samples whose test result is negative but which are actually positive.
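The two formulas can be checked on a small example. The Python sketch below counts TP, FP and FN from actual and predicted labels (1 for improper, 0 for normal) and then applies the acceptance check. The minimum values used in `meets_standard` are hypothetical, since the text does not state what "meeting the standard" requires.

```python
def precision_recall(actual, predicted):
    """Compute precision and recall with improper taxpayers (label 1) as positives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_standard(precision, recall, min_precision=0.7, min_recall=0.7):
    """Accept the candidate models only if both metrics reach their (assumed) minimums."""
    return precision >= min_precision and recall >= min_recall


actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall = precision_recall(actual, predicted)
accepted = meets_standard(precision, recall)
```

In this example TP = 3, FP = 1 and FN = 1, so both precision and recall are 0.75.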
According to an embodiment of the present invention, an improper taxpayer identification device is also provided, the device comprising:
A first obtaining module, configured to obtain selected information of a taxpayer to be identified;
A second obtaining module, configured to obtain, from the selected information of the taxpayer to be identified, the feature value of at least one selected feature of the taxpayer to be identified;
An input module, configured to input the feature value of the at least one selected feature of the taxpayer to be identified into each of a first set number of trained xgboost models in turn, to obtain a first set number of probability values for the taxpayer to be identified;
An identification module, configured to obtain the recognition result of the taxpayer to be identified based on the first set number of probability values of the taxpayer to be identified.
Specifically, the second obtaining module, when obtaining the feature value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified, is specifically configured to:
Obtain, from the selected information of the taxpayer to be identified, the initial feature value of at least one selected feature of the taxpayer to be identified;
Change any unreasonable initial feature value among the initial feature values of the at least one selected feature of the taxpayer to be identified to a set value, mark any initial feature value that represents a category as a category, and standardize the initial feature values of the at least one selected feature of the taxpayer to be identified, to obtain the feature value of the at least one selected feature of the taxpayer to be identified.
Specifically, the identification module, when obtaining the recognition result based on the first set number of probability values, is specifically configured to:
Calculate the mean of the first set number of probability values;
Compare the mean with a set threshold;
If the mean is greater than or equal to the set threshold, determine that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determine that the taxpayer to be identified is a normal taxpayer.
Specifically, the device further includes:
A third obtaining module, configured to obtain selected information of a second set number of normal taxpayers and selected information of a third set number of improper taxpayers;
An adding module, configured to obtain, from the selected information of the second set number of normal taxpayers, the feature value of at least one selected feature of each corresponding normal taxpayer, and add the feature value of the at least one selected feature of each normal taxpayer, together with a normal taxpayer label, to a feature-value label wide table; and to obtain, from the selected information of the third set number of improper taxpayers, the feature value of at least one selected feature of each corresponding improper taxpayer, and add the feature value of the at least one selected feature of each improper taxpayer, together with an improper taxpayer label, to the feature-value label wide table;
A fourth obtaining module, configured to obtain a test sample set and a first set number of training sample sets from the feature-value label wide table;
A training module, configured to input the feature values of the at least one selected feature of each taxpayer in each of the first set number of training sample sets, together with the corresponding labels, into a respective initial xgboost model, to obtain a first set number of candidate xgboost models;
A test module, configured to test the first set number of candidate xgboost models using the feature values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;
A first determining module, configured to determine precision and recall based on the test result;
A second determining module, configured to determine the first set number of candidate xgboost models to be the first set number of trained xgboost models if both the precision and the recall meet the standard.
Specifically, the adding module, when obtaining the feature value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers, is specifically configured to:
Obtain, from the selected information of the second set number of normal taxpayers, the initial feature value of at least one selected feature of each corresponding normal taxpayer;
Change any unreasonable initial feature value among the initial feature values of the at least one selected feature of the second set number of normal taxpayers to a set value, mark any initial feature value that represents a category as a category, and standardize the initial feature values of the at least one selected feature of the second set number of normal taxpayers, to obtain the feature value of the at least one selected feature of each corresponding normal taxpayer.
Specifically, the adding module, when obtaining the feature value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers, is specifically configured to:
Obtain, from the selected information of the third set number of improper taxpayers, the initial feature value of at least one selected feature of each corresponding improper taxpayer;
Change any unreasonable initial feature value among the initial feature values of the at least one selected feature of the third set number of improper taxpayers to a set value, mark any initial feature value that represents a category as a category, and standardize the initial feature values of the at least one selected feature of the third set number of improper taxpayers, to obtain the feature value of the at least one selected feature of each corresponding improper taxpayer.
Specifically, the fourth obtaining module, when obtaining the test sample set and the first set number of training sample sets from the feature-value label wide table, is specifically configured to:
Divide the improper taxpayers in the feature-value label wide table into two parts according to a preset ratio, to obtain a first improper taxpayer set and a second improper taxpayer set;
For each of the first set number of training sample sets, perform: extracting from the feature-value label wide table normal taxpayers that have not been extracted before, equal in number to the improper taxpayers in the first improper taxpayer set; and combining the feature values of the at least one selected feature of the extracted normal taxpayers and the corresponding labels with the feature values of the at least one selected feature of the improper taxpayers in the first improper taxpayer set and the corresponding labels, to obtain one training sample set;
Combine the feature values of the at least one selected feature of the normal taxpayers in the feature-value label wide table that were not extracted, and the corresponding labels, with the feature values of the at least one selected feature of the improper taxpayers in the second improper taxpayer set and the corresponding labels, to obtain the test sample set.
Specifically, the first determining module, when determining the precision and the recall based on the test result, is specifically configured such that:
The precision is calculated as follows:
Precision = TP / (TP + FP);
The recall is calculated as follows:
Recall = TP / (TP + FN);
Here, improper taxpayer samples are taken as positive samples and normal taxpayer samples as negative samples. TP is the number of samples whose test result is positive and which are actually positive, FP is the number of samples whose test result is positive but which are actually negative, and FN is the number of samples whose test result is negative but which are actually positive.
The beneficial effects of the present invention are as follows:
Embodiments of the present invention provide an improper taxpayer recognition method and device. Selected information of a taxpayer to be identified is obtained; the feature value of at least one selected feature of the taxpayer to be identified is obtained from that selected information; the feature value of the at least one selected feature is input into each of a first set number of trained xgboost models in turn, to obtain a first set number of probability values for the taxpayer to be identified; and the recognition result of the taxpayer to be identified is obtained based on the first set number of probability values. In this scheme, a first set number of xgboost models can be trained in advance, the trained models are then used to obtain a first set number of probability values for the taxpayer to be identified, and those probability values are used to determine whether the taxpayer to be identified is an improper taxpayer. Machine learning algorithms and big data technology are thereby used to build an improper taxpayer identification model that identifies whether a taxpayer is normal.
Brief description of the drawings
Fig. 1 is a flowchart of an improper taxpayer recognition method in an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of an improper taxpayer identification device in an embodiment of the present invention.
Detailed description of the embodiments
In order to use machine learning algorithms and big data technology to build an improper taxpayer identification model that identifies whether a taxpayer is normal, an embodiment of the present invention provides an improper taxpayer recognition method. The xgboost algorithm is an ensemble learning model with better generalization ability than a single learner. It supports parallel computation and adds a regularization term to the loss function to prevent over-fitting, giving it outstanding efficiency and high prediction accuracy, and it is frequently used both in industry and in Kaggle competitions. VAT invoice data is huge in volume and grows quickly, and the diversity of and differences among enterprises leave a large number of missing values in the extracted feature variables. Handling missing values is another characteristic of xgboost: it can automatically learn the split direction for them. This embodiment therefore proposes an improper taxpayer recognition method based on the xgboost algorithm, with the aim of changing the traditional practice in which risk identification relies solely on experience, improving recognition efficiency, and contributing to better VAT tax risk management.
The flow of the above improper taxpayer recognition method is shown in Fig. 1, and the steps are as follows:
S11: Obtain selected information of a taxpayer to be identified.
The taxpayers in this embodiment are mainly enterprises, so the selected information of the taxpayer to be identified may include, but is not limited to, enterprise information, VAT invoice data, commodity detail data, and so on.
S12: Obtain, from the selected information of the taxpayer to be identified, the feature value of at least one selected feature of the taxpayer to be identified.
Based on the enterprise information, VAT invoice data, commodity detail data and so on of the taxpayer to be identified, selected feature variables are designed from multiple angles: enterprise information, operating condition, comparison with the development of other enterprises in the same industry, invoice issuing, invoice voiding, invoice collection, invoice authentication, differences between input and output commodity details, transactions inside and outside the province, and upstream and downstream transactions.
From the enterprise information angle, the selected feature variables are: the industry code, whether the registration type is an individual or private enterprise, whether the enterprise is a general taxpayer, whether the legal representative is a non-local person, the time the enterprise was established, whether multiple enterprises are registered at the enterprise's address, whether the legal representative is shared with many enterprises, whether the tax handler is shared by many enterprises, whether the financial staff are shared by many enterprises, and whether the legal representative and the financial staff hold each other's positions.
From the operating condition angle, the selected feature variables are: statistics of the rate of change of the output tax amount, statistics of the rate of change of the input tax amount, statistics of the rate of change of the tax payable, statistics of the rate of change of the invoice amount, statistics of the rate of change of the total price and tax, statistics of the rate of change of profit, statistics of the rate of change of the tax burden, and the average output amount and average profit in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6). The statistics mentioned above comprise the median, variance and mean, and T is the cutoff time of the model data extraction.
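As an illustration of the "statistics of the rate of change" features, the Python sketch below computes the median, variance and mean of month-over-month change rates for one monthly series. The text does not define the rate of change or the type of variance, so the relative month-over-month formula, the population variance, the function names and the sample figures are all assumptions.

```python
from statistics import fmean, median, pvariance


def change_rates(monthly_values):
    """Relative month-over-month change; pairs whose earlier month is zero are skipped."""
    return [(cur - prev) / prev
            for prev, cur in zip(monthly_values, monthly_values[1:])
            if prev != 0]


def change_rate_statistics(monthly_values):
    """Median, variance and mean of the change rates (needs at least one valid rate)."""
    rates = change_rates(monthly_values)
    return {
        "median": median(rates),
        "variance": pvariance(rates),
        "mean": fmean(rates),
    }


# Hypothetical monthly output tax amounts for one enterprise.
output_tax = [100.0, 110.0, 121.0, 60.5]
stats = change_rate_statistics(output_tax)
```

The same helper could be applied to the other series listed above, such as input tax, tax payable, profit or tax burden.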
From the angle of comparison with the development of other enterprises in the same industry, the selected feature variables are: statistics of the difference rate of the total price and tax, statistics of the difference rate of profit, and statistics of the difference rate of the tax burden. The statistics here comprise the median, variance and mean.
From the invoice issuing angle, the selected feature variables are: the proportion of abnormal invoices by count or amount, the proportion of months in which no invoice was issued, the proportion of months with no input invoice, the total number of invoicing counterparties or regions, the total number of invoice-receiving counterparties or regions, the ratio of red-letter to blue-letter invoices by count, amount or tax amount, the proportion of invoices issued at the maximum face amount by count or amount, and the proportion of months with output invoices but no input invoices. These features are calculated over different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6), where T is the cutoff time of the model data extraction. In addition, a feature with a minimum time granularity of 5 days is designed, namely the proportion of voided invoices by count or amount in the most recent 5 days.
From the invoice voiding angle, the selected feature variables are: the proportion of voided invoices by count or amount and the proportion of abnormal voided invoices by count or amount in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6), where T is the cutoff time of the model data extraction.
From the invoice collection angle, the selected feature variables are: whether invoices were collected frequently in the most recent 1 month, the most recent 3 months, the 3 months from T-6 to T-3, or the 3 months from T-9 to T-6, and the number of invoices purchased in the most recent 1 month. T here is the cutoff time of the model data extraction.
From the invoice authentication angle, the selected feature variables are: the proportion, by count or amount, of invoices that were out of control at the time of authentication or became out of control after authentication in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6). T here is the cutoff time of the model data extraction.
From the angle of differences between input and output commodity details, the selected feature variables are: the degree of difference between input and output tax amounts, the degree of difference between input and output amounts, and the degree of difference between input and output goods in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6). T here is the cutoff time of the model data extraction.
From the angle of intra-provincial and extra-provincial transactions and of upstream and downstream transactions, the selected feature variables include, for different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6): the mean and variance of the share of sales or purchase amounts transacted with enterprises inside or outside the province; statistics of the rate of change in the number of downstream/upstream out-of-province trading enterprises; statistics of the rate of change in downstream/upstream out-of-province transaction amounts; upstream/downstream enterprise-count divergence; upstream/downstream transaction-amount divergence; and upstream/downstream transaction stability. The statistics mentioned above include the median, variance, and mean.
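The windowed features described above can be sketched as follows. This is a minimal illustration only, assuming a hypothetical invoice record with `issued` and `voided` fields and a cut-off date T that falls on the first day of a month; none of these names come from the source.

```python
from datetime import date

def months_before(t: date, n: int) -> date:
    """Return the first day of the month that lies n months before t."""
    idx = t.year * 12 + (t.month - 1) - n
    return date(idx // 12, idx % 12 + 1, 1)

def voided_share(invoices, start: date, end: date) -> float:
    """Share of voided invoices issued in [start, end); 0.0 for an empty window."""
    window = [inv for inv in invoices if start <= inv["issued"] < end]
    if not window:
        return 0.0
    return sum(inv["voided"] for inv in window) / len(window)

def window_features(invoices, t: date) -> dict:
    """Voided-invoice share for the three windows named in the text."""
    # (months back to window start, months back to window end)
    bounds = {"recent_3m": (3, 0), "t6_to_t3": (6, 3), "t9_to_t6": (9, 6)}
    return {
        name: voided_share(invoices, months_before(t, a), months_before(t, b))
        for name, (a, b) in bounds.items()
    }

# Hypothetical invoice records for one taxpayer.
invoices = [
    {"issued": date(2018, 8, 15), "voided": True},
    {"issued": date(2018, 9, 1), "voided": False},
    {"issued": date(2018, 5, 10), "voided": True},
]
features = window_features(invoices, date(2018, 10, 1))
```

The same windowing could be reused for the other angles (receiving, authentication, purchase/sales differences) by swapping the per-window statistic.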
S13: The characteristic values of the at least one selected feature of the taxpayer to be identified are input in turn into a first set number of trained xgboost models to obtain a first set number of probability values for the taxpayer to be identified.
S14: A recognition result for the taxpayer to be identified is obtained based on the first set number of probability values of the taxpayer to be identified.
In this scheme, a first set number of xgboost models can be trained in advance. The trained xgboost models are then used to obtain a first set number of probability values for the taxpayer to be identified, and whether the taxpayer to be identified is an improper taxpayer is determined from those probability values. In this way, an improper taxpayer identification model is built with machine learning algorithms and big data technology and is used to identify whether a taxpayer is normal.
Specifically, obtaining the characteristic value of the at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified in S12 above includes:
Obtaining initial characteristic values of the at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;
Changing unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the taxpayer to be identified to a set value, marking the initial characteristic values that represent a category as categorical, and standardizing the initial characteristic values of the at least one selected feature of the taxpayer to be identified, to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified.
The initial characteristic values of the at least one selected feature obtained from the selected information of the taxpayer to be identified do not necessarily all conform to the standard, so they can be normalized in the ways described above to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified. Only three ways are listed above; other ways may also be used and are not enumerated here.
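The three normalization steps can be illustrated with a short sketch. The validity rule, the default "set value", the integer category codes, and z-score standardization are all assumptions made for illustration; the source does not say how each step is carried out.

```python
from statistics import mean, pstdev

def clean_column(values, is_valid, default):
    """Replace unreasonable initial values with a set (default) value."""
    return [v if is_valid(v) else default for v in values]

def encode_categories(values):
    """Mark categorical values by mapping each distinct value to an integer code."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

def standardize(values):
    """Z-score standardization; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical columns: a share that must lie in [0, 1], an industry code, an amount.
shares = clean_column([0.2, -1.0, 1.5], lambda v: 0.0 <= v <= 1.0, 0.0)
industries = encode_categories(["retail", "manufacturing", "retail"])
amounts = standardize([1.0, 3.0])
```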
Specifically, obtaining the recognition result based on the first set number of probability values in S14 above includes:
Calculating the mean of the first set number of probability values;
Comparing the mean with a set threshold;
If the mean is greater than or equal to the set threshold, determining that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determining that the taxpayer to be identified is a normal taxpayer.
The prediction results of the first set number of xgboost models are averaged, and the average is taken as the probability that the taxpayer to be identified is predicted to be an improper taxpayer. The set threshold can be chosen according to actual needs; for example, it can be, but is not limited to, 0.9. If the probability value is ≥ 0.9, the taxpayer to be identified is determined to be an improper taxpayer; otherwise, the taxpayer to be identified is determined to be a normal taxpayer.
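The averaging and threshold comparison can be written in a few lines. The function name and return shape below are illustrative choices; the 0.9 default is the example threshold given in the text.

```python
def identify(probabilities, threshold=0.9):
    """Average the per-model probabilities and compare the mean with the threshold.

    Returns a (label, mean) pair, where label is "improper" when the mean is
    greater than or equal to the threshold and "normal" otherwise.
    """
    if not probabilities:
        raise ValueError("at least one probability value is required")
    avg = sum(probabilities) / len(probabilities)
    label = "improper" if avg >= threshold else "normal"
    return label, avg

# Hypothetical outputs of three trained models for one taxpayer.
label, avg = identify([0.95, 0.92, 0.98])
```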
Optionally, the method further includes:
Obtaining selected information of a second set number of normal taxpayers and selected information of a third set number of improper taxpayers;
Obtaining the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers, and adding the characteristic value of the at least one selected feature of each normal taxpayer, together with a normal taxpayer label, to a characteristic value label wide table;
Obtaining the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers, and adding the characteristic value of the at least one selected feature of each improper taxpayer, together with an improper taxpayer label, to the characteristic value label wide table;
Obtaining a test sample set and a first set number of training sample sets from the characteristic value label wide table;
Inputting the characteristic value of the at least one selected feature of each taxpayer in each of the first set number of training sample sets, together with the corresponding label, into a respective initial xgboost model, to obtain a first set number of candidate xgboost models;
Testing the first set number of candidate xgboost models using the characteristic value of the at least one selected feature of each taxpayer in the test sample set and the corresponding label;
Determining the precision and the recall based on the test results;
If the precision and the recall both meet the target, determining the first set number of candidate xgboost models to be the first set number of trained xgboost models.
Parameters can be set for the first set number of xgboost models, which are then trained on the first set number of training sample sets to obtain the first set number of candidate xgboost models; these candidate xgboost models are then applied to the test sample set for prediction. The parameters of the xgboost model can include, but are not limited to:
'objective': 'binary:logistic', logistic regression for binary classification; the output is a probability
'max_depth': depth of the constructed trees
'eta': shrinkage step size used in the update process to prevent overfitting
'silent': 0, run-information output switch (in XGBoost, 0 prints run messages and 1 suppresses them)
'eval_metric': 'map', evaluation metric; map denotes mean average precision
'lambda': L2 regularization term parameter, which controls model complexity
'min_child_weight': minimum sum of second-order derivatives in each leaf node
'nthread': number of CPU threads
'num_rounds': number of iterations.
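As a rough illustration, the parameters listed above could be collected into a Python dictionary as below. The source names the parameters but gives values only for 'objective', 'silent', and 'eval_metric'; every other value here is a placeholder assumption, not the patent's setting.

```python
# Parameter dictionary for one candidate model.
# Values marked "assumed" are placeholders chosen for illustration only.
params = {
    "objective": "binary:logistic",  # binary classification, outputs a probability
    "max_depth": 6,                  # assumed tree depth
    "eta": 0.1,                      # assumed shrinkage step size
    "silent": 0,                     # as listed in the text; newer releases use "verbosity"
    "eval_metric": "map",            # mean average precision
    "lambda": 1.0,                   # assumed L2 regularization strength
    "min_child_weight": 1,           # assumed minimum hessian sum per leaf
    "nthread": 4,                    # assumed number of CPU threads
}
num_rounds = 100                     # assumed number of boosting iterations

# With the xgboost package installed, training one candidate model on one
# training sample set would look roughly like this (features and labels are
# hypothetical arrays taken from that training sample set):
#   import xgboost as xgb
#   dtrain = xgb.DMatrix(features, label=labels)
#   model = xgb.train(params, dtrain, num_boost_round=num_rounds)
```

Note that 'num_rounds' is kept outside the dictionary because the training call takes the iteration count as a separate argument.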
Specifically, obtaining the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers includes:
Obtaining initial characteristic values of the at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers;
Changing unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the second set number of normal taxpayers to a set value, marking the initial characteristic values that represent a category as categorical, and standardizing the initial characteristic values of the at least one selected feature of the second set number of normal taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding normal taxpayer.
Specifically, obtaining the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers includes:
Obtaining initial characteristic values of the at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers;
Changing unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the third set number of improper taxpayers to a set value, marking the initial characteristic values that represent a category as categorical, and standardizing the initial characteristic values of the at least one selected feature of the third set number of improper taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding improper taxpayer.
Specifically, obtaining the test sample set and the first set number of training sample sets from the characteristic value label wide table includes:
Dividing the improper taxpayers in the characteristic value label wide table into two parts according to a preset ratio, to obtain a first improper taxpayer set and a second improper taxpayer set;
For each of the first set number of training sample sets, performing the following: extracting from the characteristic value label wide table normal taxpayers that have not been extracted before, equal in number to the improper taxpayers in the first improper taxpayer set, and combining the characteristic values of the at least one selected feature and the corresponding labels of the extracted normal taxpayers with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the first improper taxpayer set, to obtain one training sample set;
Combining the characteristic values of the at least one selected feature and the corresponding labels of the normal taxpayers not extracted from the characteristic value label wide table with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the second improper taxpayer set, to obtain the test sample set.
Because improper taxpayers are few, the samples are severely imbalanced. To preserve the effectiveness of the xgboost models, the improper taxpayers are divided into two parts according to the preset ratio, giving the first improper taxpayer set and the second improper taxpayer set. For each of the first set number of training sample sets, normal taxpayers that have not been extracted before, equal in number to the improper taxpayers in the first improper taxpayer set, are extracted from the characteristic value label wide table; their characteristic values and corresponding labels are combined with those of the improper taxpayers in the first improper taxpayer set to obtain one training sample set. Each training sample set is therefore balanced and shares the same improper taxpayers while containing different normal taxpayers.
The characteristic values of the at least one selected feature and the corresponding labels of the normal taxpayers not extracted from the characteristic value label wide table are combined with those of the improper taxpayers in the second improper taxpayer set to obtain the test sample set.
The preset ratio must ensure that the ratio of normal taxpayers to improper taxpayers in the final test sample set is close to the actual ratio in real life.
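The sampling procedure above can be sketched as follows. This is a simplified illustration that works on taxpayer identifiers rather than full feature rows; the function name, the shuffling, the seed, and the label encoding (1 for improper, 0 for normal) are assumptions.

```python
import random

def build_sample_sets(normal_ids, improper_ids, k, train_ratio, seed=0):
    """Build k balanced training sets and one test set.

    The improper taxpayers are split by train_ratio into a first (training)
    part and a second (test) part. Each training set pairs the first part with
    an equally sized, previously unused draw of normal taxpayers. Normal
    taxpayers never drawn go to the test set with the second part.
    """
    rng = random.Random(seed)
    improper = list(improper_ids)
    rng.shuffle(improper)
    cut = int(len(improper) * train_ratio)
    first, second = improper[:cut], improper[cut:]

    normal = list(normal_ids)
    rng.shuffle(normal)
    if k * len(first) > len(normal):
        raise ValueError("not enough normal taxpayers for k disjoint draws")

    training_sets = []
    for i in range(k):
        drawn = normal[i * len(first):(i + 1) * len(first)]
        training_sets.append([(n, 0) for n in drawn] + [(p, 1) for p in first])

    remaining = normal[k * len(first):]
    test_set = [(n, 0) for n in remaining] + [(p, 1) for p in second]
    return training_sets, test_set

# Hypothetical population: 100 normal and 10 improper taxpayers, 3 models.
train_sets, test_set = build_sample_sets(
    list(range(100)), list(range(100, 110)), k=3, train_ratio=0.5
)
```

In this example each training set holds 5 normal and 5 improper taxpayers, and the test set holds the 85 unused normal taxpayers plus the other 5 improper ones, so the test set stays heavily skewed toward normal taxpayers.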
Specifically, determining the precision and the recall based on the test results includes:
The precision is calculated as follows:
Precision = TP / (TP + FP);
The recall is calculated as follows:
Recall = TP / (TP + FN);
Here improper taxpayer samples are taken as positive samples and normal taxpayer samples as negative samples. TP is the number of samples that test as positive and are actually positive, FP is the number of samples that test as positive but are actually negative, and FN is the number of samples that test as negative but are actually positive.
Precision and recall are not positively correlated: raising precision often lowers recall, and raising recall can likewise lower precision. The parameters therefore need to be adjusted through repeated trials to find a parameter combination that gives an acceptable compromise between precision and recall. Under the best parameter combination, the feature importance ranking is output. The final results show that features such as upstream/downstream transaction stability and industry play an important role in distinguishing whether a taxpayer is an improper taxpayer.
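The two formulas can be checked with a small sketch. Labels follow the convention in the text (1 for improper, the positive class; 0 for normal); the function name and the zero-denominator handling are illustrative assumptions.

```python
def precision_recall(y_true, y_pred):
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical test results: TP = 2, FP = 1, FN = 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```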
Based on the same inventive concept, an embodiment of the present invention provides an improper taxpayer identification device. The structure of the device is shown in Figure 2 and comprises:
A first obtaining module 21, configured to obtain selected information of a taxpayer to be identified;
A second obtaining module 22, configured to obtain the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;
An input module 23, configured to input the characteristic values of the at least one selected feature of the taxpayer to be identified in turn into a first set number of trained xgboost models, to obtain a first set number of probability values for the taxpayer to be identified;
An identification module 24, configured to obtain a recognition result for the taxpayer to be identified based on the first set number of probability values of the taxpayer to be identified.
In this scheme, a first set number of xgboost models can be trained in advance. The trained xgboost models are then used to obtain a first set number of probability values for the taxpayer to be identified, and whether the taxpayer to be identified is an improper taxpayer is determined from those probability values. In this way, an improper taxpayer identification model is built with machine learning algorithms and big data technology and is used to identify whether a taxpayer is normal.
Specifically, the second obtaining module, in obtaining the characteristic value of the at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified, is specifically configured to:
Obtain initial characteristic values of the at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;
Change unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the taxpayer to be identified to a set value, mark the initial characteristic values that represent a category as categorical, and standardize the initial characteristic values of the at least one selected feature of the taxpayer to be identified, to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified.
Specifically, the identification module, in obtaining the recognition result based on the first set number of probability values, is specifically configured to:
Calculate the mean of the first set number of probability values;
Compare the mean with a set threshold;
If the mean is greater than or equal to the set threshold, determine that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determine that the taxpayer to be identified is a normal taxpayer.
Optionally, the device further includes:
A third obtaining module, configured to obtain selected information of a second set number of normal taxpayers and selected information of a third set number of improper taxpayers;
An adding module, configured to obtain the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers, and add the characteristic value of the at least one selected feature of each normal taxpayer, together with a normal taxpayer label, to a characteristic value label wide table; and to obtain the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers, and add the characteristic value of the at least one selected feature of each improper taxpayer, together with an improper taxpayer label, to the characteristic value label wide table;
A fourth obtaining module, configured to obtain a test sample set and a first set number of training sample sets from the characteristic value label wide table;
A training module, configured to input the characteristic value of the at least one selected feature of each taxpayer in each of the first set number of training sample sets, together with the corresponding label, into a respective initial xgboost model, to obtain a first set number of candidate xgboost models;
A test module, configured to test the first set number of candidate xgboost models using the characteristic value of the at least one selected feature of each taxpayer in the test sample set and the corresponding label;
A first determining module, configured to determine the precision and the recall based on the test results;
A second determining module, configured to determine the first set number of candidate xgboost models to be the first set number of trained xgboost models if the precision and the recall both meet the target.
Specifically, the adding module, in obtaining the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers, is specifically configured to:
Obtain initial characteristic values of the at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers;
Change unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the second set number of normal taxpayers to a set value, mark the initial characteristic values that represent a category as categorical, and standardize the initial characteristic values of the at least one selected feature of the second set number of normal taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding normal taxpayer.
Specifically, the adding module, in obtaining the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers, is specifically configured to:
Obtain initial characteristic values of the at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers;
Change unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the third set number of improper taxpayers to a set value, mark the initial characteristic values that represent a category as categorical, and standardize the initial characteristic values of the at least one selected feature of the third set number of improper taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding improper taxpayer.
Specifically, the fourth obtaining module, in obtaining the test sample set and the first set number of training sample sets from the characteristic value label wide table, is specifically configured to:
Divide the improper taxpayers in the characteristic value label wide table into two parts according to a preset ratio, to obtain a first improper taxpayer set and a second improper taxpayer set;
For each of the first set number of training sample sets, perform the following: extract from the characteristic value label wide table normal taxpayers that have not been extracted before, equal in number to the improper taxpayers in the first improper taxpayer set, and combine the characteristic values of the at least one selected feature and the corresponding labels of the extracted normal taxpayers with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the first improper taxpayer set, to obtain one training sample set;
Combine the characteristic values of the at least one selected feature and the corresponding labels of the normal taxpayers not extracted from the characteristic value label wide table with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the second improper taxpayer set, to obtain the test sample set.
Specifically, the first determining module, in determining the precision and the recall based on the test results, is specifically configured to use the following:
The precision is calculated as follows:
Precision = TP / (TP + FP);
The recall is calculated as follows:
Recall = TP / (TP + FN);
Here improper taxpayer samples are taken as positive samples and normal taxpayer samples as negative samples. TP is the number of samples that test as positive and are actually positive, FP is the number of samples that test as positive but are actually negative, and FN is the number of samples that test as negative but are actually positive.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although optional embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be interpreted as including the optional embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (16)
1. An improper taxpayer identification method, characterized in that the method comprises:
Obtaining selected information of a taxpayer to be identified;
Obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;
Inputting the characteristic values of the at least one selected feature of the taxpayer to be identified in turn into a first set number of trained xgboost models, to obtain a first set number of probability values for the taxpayer to be identified;
Obtaining a recognition result for the taxpayer to be identified based on the first set number of probability values of the taxpayer to be identified.
2. The method of claim 1, characterized in that obtaining the characteristic value of the at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified specifically comprises:
Obtaining initial characteristic values of the at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;
Changing unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the taxpayer to be identified to a set value, marking the initial characteristic values that represent a category as categorical, and standardizing the initial characteristic values of the at least one selected feature of the taxpayer to be identified, to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified.
3. The method of claim 1, characterized in that obtaining the recognition result based on the first set number of probability values specifically comprises:
Calculating the mean of the first set number of probability values;
Comparing the mean with a set threshold;
If the mean is greater than or equal to the set threshold, determining that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determining that the taxpayer to be identified is a normal taxpayer.
4. The method of any one of claims 1-3, characterized by further comprising:
Obtaining selected information of a second set number of normal taxpayers and selected information of a third set number of improper taxpayers;
Obtaining the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers, and adding the characteristic value of the at least one selected feature of each normal taxpayer, together with a normal taxpayer label, to a characteristic value label wide table;
Obtaining the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers, and adding the characteristic value of the at least one selected feature of each improper taxpayer, together with an improper taxpayer label, to the characteristic value label wide table;
Obtaining a test sample set and the first set number of training sample sets from the characteristic value label wide table;
Inputting the characteristic value of the at least one selected feature of each taxpayer in each of the first set number of training sample sets, together with the corresponding label, into a respective initial xgboost model, to obtain the first set number of candidate xgboost models;
Testing the first set number of candidate xgboost models using the characteristic value of the at least one selected feature of each taxpayer in the test sample set and the corresponding label;
Determining the precision and the recall based on the test results;
If the precision and the recall both meet the target, determining the first set number of candidate xgboost models to be the first set number of trained xgboost models.
5. The method of claim 4, characterized in that obtaining the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers specifically comprises:
Obtaining initial characteristic values of the at least one selected feature of each corresponding normal taxpayer from the selected information of the second set number of normal taxpayers;
Changing unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the second set number of normal taxpayers to a set value, marking the initial characteristic values that represent a category as categorical, and standardizing the initial characteristic values of the at least one selected feature of the second set number of normal taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding normal taxpayer.
6. The method of claim 4, characterized in that obtaining the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers specifically comprises:
Obtaining initial characteristic values of the at least one selected feature of each corresponding improper taxpayer from the selected information of the third set number of improper taxpayers;
Changing unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the third set number of improper taxpayers to a set value, marking the initial characteristic values that represent a category as categorical, and standardizing the initial characteristic values of the at least one selected feature of the third set number of improper taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding improper taxpayer.
7. The method as claimed in claim 4, characterized in that obtaining the test sample set and the first setting quantity of training sample sets from the characteristic value label wide table specifically includes:
dividing the improper taxpayers in the characteristic value label wide table into two parts according to a preset ratio, to obtain a first improper taxpayer set and a second improper taxpayer set;
for each training sample set among the first setting quantity of training sample sets, executing: extracting, from the characteristic value label wide table, normal taxpayers that have not been extracted before and that are equal in number to the improper taxpayers included in the first improper taxpayer set; and combining the characteristic values of at least one selected feature and the corresponding labels of the extracted normal taxpayers with the characteristic values of at least one selected feature and the corresponding labels of the improper taxpayers in the first improper taxpayer set, to obtain one training sample set;
combining the characteristic values of at least one selected feature and the corresponding labels of the normal taxpayers that have not been extracted from the characteristic value label wide table with the characteristic values of at least one selected feature and the corresponding labels of the improper taxpayers in the second improper taxpayer set, to obtain the test sample set.
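The sampling scheme recited above can be sketched as follows. This is an illustrative sketch and not part of the claims: the function name `build_sample_sets`, the representation of each row as a `(features, label)` tuple, the random shuffling, and the 0.8 default ratio are assumptions made for the example.

```python
import random


def build_sample_sets(normal, improper, n_sets, train_ratio=0.8, seed=0):
    """Build n_sets balanced training sets and one test set.

    normal / improper: lists of (features, label) rows (hypothetical layout).
    train_ratio: assumed stand-in for the claim's "preset ratio".
    """
    rng = random.Random(seed)

    # Divide the improper taxpayers into two parts by the preset ratio.
    improper = improper[:]
    rng.shuffle(improper)
    cut = int(len(improper) * train_ratio)
    first, second = improper[:cut], improper[cut:]

    if n_sets * len(first) > len(normal):
        raise ValueError("not enough normal taxpayers for non-overlapping draws")

    # Draw normal taxpayers without reuse: each training set gets a fresh slice.
    pool = normal[:]
    rng.shuffle(pool)
    size = len(first)
    training_sets = [pool[i * size:(i + 1) * size] + first for i in range(n_sets)]

    # Normal taxpayers never drawn, plus the second improper part, form the test set.
    test_set = pool[n_sets * size:] + second
    return training_sets, test_set


normal_rows = [((float(i),), 0) for i in range(20)]         # hypothetical data
improper_rows = [((100.0 + i,), 1) for i in range(5)]
train_sets, test_set = build_sample_sets(normal_rows, improper_rows, n_sets=3)
```

Every training set shares the same improper taxpayers but pairs them with a different slice of normal taxpayers, so each set is balanced one-to-one even when normal taxpayers greatly outnumber improper ones.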
8. The method as claimed in claim 4, characterized in that determining the accurate rate (precision) and the recall rate based on the test result specifically includes:
the calculation formula of the accurate rate is as follows:
Precision = TP / (TP + FP);
the calculation formula of the recall rate is as follows:
Recall = TP / (TP + FN);
wherein improper taxpayer samples are taken as positive samples and normal taxpayer samples as negative samples; TP denotes the number of samples whose test result is positive and which are actually positive; FP denotes the number of samples whose test result is positive but which are actually negative; and FN denotes the number of samples whose test result is negative but which are actually positive.
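The two formulas above can be computed directly from the test results, as in the following illustrative sketch (not part of the claims). Labels use 1 for improper taxpayers (positive samples) and 0 for normal taxpayers (negative samples); the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def precision_recall(y_true, y_pred):
    # Count the three outcomes that appear in the formulas.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical test result: TP = 2, FP = 1, FN = 1.
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(actual, predicted)  # both equal 2/3
```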
9. An improper taxpayer identification device, characterized in that the device includes:
a first obtaining module, for obtaining the selected information of a taxpayer to be identified;
a second obtaining module, for obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;
an input module, for sequentially inputting the characteristic value of the at least one selected feature of the taxpayer to be identified into the first setting quantity of trained xgboost models, to obtain the first setting quantity of probability values of the taxpayer to be identified;
an identification module, for obtaining the recognition result of the taxpayer to be identified based on the first setting quantity of probability values of the taxpayer to be identified.
10. The device as claimed in claim 9, characterized in that the second obtaining module, for obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified, is specifically used for:
obtaining the initial characteristic values of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;
changing the unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the taxpayer to be identified to a setting value, identifying the initial characteristic values that indicate a category as categories, and standardizing the initial characteristic values of the at least one selected feature of the taxpayer to be identified, to obtain the characteristic value of at least one selected feature of the taxpayer to be identified.
11. The device as claimed in claim 9, characterized in that the identification module, for obtaining the recognition result based on the first setting quantity of probability values, is specifically used for:
calculating the mean value of the first setting quantity of probability values;
comparing the mean value with a given threshold;
if the mean value is greater than or equal to the given threshold, determining that the taxpayer to be identified is an improper taxpayer;
if the mean value is less than the given threshold, determining that the taxpayer to be identified is a normal taxpayer.
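The decision rule recited above (average the probability values, then compare with the threshold) can be sketched as follows. This is an illustrative sketch and not part of the claims: the function name `identify`, the 0.5 default threshold, and the example probability values are assumptions; in the device, the probabilities would come from the trained xgboost models.

```python
from statistics import mean


def identify(probabilities, threshold=0.5):
    # probabilities: one value per trained model for a single taxpayer.
    if not probabilities:
        raise ValueError("at least one probability value is required")
    average = mean(probabilities)
    # A mean greater than or equal to the threshold means "improper".
    return "improper" if average >= threshold else "normal"


result_a = identify([0.9, 0.4, 0.7])   # mean is about 0.667
result_b = identify([0.2, 0.5, 0.4])   # mean is about 0.367
result_c = identify([0.5, 0.5])        # mean equals the threshold exactly
```

Averaging across models means that a single model's outlying probability does not decide the result on its own.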
12. The device as claimed in any one of claims 9 to 11, characterized in that it further includes:
a third obtaining module, for obtaining the selected information of the second setting quantity of normal taxpayers and the selected information of the third setting quantity of improper taxpayers;
an adding module, for obtaining the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second setting quantity of normal taxpayers, and adding the characteristic value of at least one selected feature of each normal taxpayer together with a normal taxpayer label to the characteristic value label wide table; and for obtaining the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third setting quantity of improper taxpayers, and adding the characteristic value of at least one selected feature of each improper taxpayer together with an improper taxpayer label to the characteristic value label wide table;
a fourth obtaining module, for obtaining the test sample set and the first setting quantity of training sample sets from the characteristic value label wide table;
a training module, for inputting the characteristic value of at least one selected feature of each taxpayer in each of the first setting quantity of training sample sets, together with the corresponding label, into an initial xgboost model, to obtain the first setting quantity of candidate xgboost models;
a test module, for testing the first setting quantity of candidate xgboost models using the characteristic value of at least one selected feature of each taxpayer in the test sample set and the corresponding label;
a first determining module, for determining the accurate rate and the recall rate based on the test result;
a second determining module, for determining, if the accurate rate and the recall rate reach the standard, the first setting quantity of candidate xgboost models as the first setting quantity of trained xgboost models.
13. The device as claimed in claim 12, characterized in that the adding module, for obtaining the characteristic value of at least one selected feature of each corresponding normal taxpayer from the selected information of the second setting quantity of normal taxpayers, is specifically used for:
obtaining, from the selected information of the second setting quantity of normal taxpayers, the initial characteristic values of at least one selected feature of each corresponding normal taxpayer;
changing the unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the second setting quantity of normal taxpayers to a setting value, identifying the initial characteristic values that indicate a category as categories, and standardizing the initial characteristic values of the at least one selected feature of the second setting quantity of normal taxpayers, to obtain the characteristic value of at least one selected feature of each corresponding normal taxpayer.
14. The device as claimed in claim 12, characterized in that the adding module, for obtaining the characteristic value of at least one selected feature of each corresponding improper taxpayer from the selected information of the third setting quantity of improper taxpayers, is specifically used for:
obtaining, from the selected information of the third setting quantity of improper taxpayers, the initial characteristic values of at least one selected feature of each corresponding improper taxpayer;
changing the unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the third setting quantity of improper taxpayers to a setting value, identifying the initial characteristic values that indicate a category as categories, and standardizing the initial characteristic values of the at least one selected feature of the third setting quantity of improper taxpayers, to obtain the characteristic value of at least one selected feature of each corresponding improper taxpayer.
15. The device as claimed in claim 12, characterized in that the fourth obtaining module, for obtaining the test sample set and the first setting quantity of training sample sets from the characteristic value label wide table, is specifically used for:
dividing the improper taxpayers in the characteristic value label wide table into two parts according to a preset ratio, to obtain a first improper taxpayer set and a second improper taxpayer set;
for each training sample set among the first setting quantity of training sample sets, executing: extracting, from the characteristic value label wide table, normal taxpayers that have not been extracted before and that are equal in number to the improper taxpayers included in the first improper taxpayer set; and combining the characteristic values of at least one selected feature and the corresponding labels of the extracted normal taxpayers with the characteristic values of at least one selected feature and the corresponding labels of the improper taxpayers in the first improper taxpayer set, to obtain one training sample set;
combining the characteristic values of at least one selected feature and the corresponding labels of the normal taxpayers that have not been extracted from the characteristic value label wide table with the characteristic values of at least one selected feature and the corresponding labels of the improper taxpayers in the second improper taxpayer set, to obtain the test sample set.
16. The device as claimed in claim 12, characterized in that the first determining module, for determining the accurate rate (precision) and the recall rate based on the test result, is specifically used for:
the calculation formula of the accurate rate is as follows:
Precision = TP / (TP + FP);
the calculation formula of the recall rate is as follows:
Recall = TP / (TP + FN);
wherein improper taxpayer samples are taken as positive samples and normal taxpayer samples as negative samples; TP denotes the number of samples whose test result is positive and which are actually positive; FP denotes the number of samples whose test result is positive but which are actually negative; and FN denotes the number of samples whose test result is negative but which are actually positive.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811584029.3A CN109858922A (en) | 2018-12-24 | 2018-12-24 | Improper taxpayer's recognition methods and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811584029.3A CN109858922A (en) | 2018-12-24 | 2018-12-24 | Improper taxpayer's recognition methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109858922A (en) | 2019-06-07 |
Family
ID=66891967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811584029.3A Pending CN109858922A (en) | 2018-12-24 | 2018-12-24 | Improper taxpayer's recognition methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109858922A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104103011A (en) * | 2014-07-10 | 2014-10-15 | 西安交通大学 | Suspicious taxpayer recognition method based on taxpayer interest incidence network |
CN104102706A (en) * | 2014-07-10 | 2014-10-15 | 西安交通大学 | Hierarchical clustering-based suspicious taxpayer detection method |
CN104598634A (en) * | 2015-02-06 | 2015-05-06 | 浪潮集团有限公司 | Electronic commerce tax fund management analysis method |
CN107909433A (en) * | 2017-11-14 | 2018-04-13 | 重庆邮电大学 | A kind of Method of Commodity Recommendation based on big data mobile e-business |
CN108199795A (en) * | 2017-12-29 | 2018-06-22 | 北京百分点信息科技有限公司 | The monitoring method and device of a kind of equipment state |
CN108805583A (en) * | 2018-05-18 | 2018-11-13 | 连连银通电子支付有限公司 | Electric business fraud detection method, device, equipment and medium based on address of cache |
CN109063931A (en) * | 2018-09-06 | 2018-12-21 | 盈盈(杭州)网络技术有限公司 | A kind of model method for predicting freight logistics driver Default Probability |
Non-Patent Citations (5)
Title |
---|
刘晗;余小清;万旺根;马秀丽;: "基于粗糙集理论与支持向量机的纳税评估模型", vol. 26, no. 12, pages 253 - 256 * |
徐迪: "一种基于XGBoost的恶意HTTP请求识别方法", vol. 31, no. 12, pages 22 - 27 * |
樊重俊: "《大数据分析与应用》", 立信会计出版社, pages: 189 - 196 * |
石涛: "基于XGBoost的企业倒闭风险预测", no. 8, pages 102 - 104 * |
郑树旺;李和琴;: "数据挖掘技术支持下小型民营企业纳税评估模型研究", no. 12, pages 55 - 59 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111222766A (en) * | 2019-12-29 | 2020-06-02 | 航天信息股份有限公司 | Method and system for early warning of enterprise false invoicing |
CN112036997A (en) * | 2020-08-28 | 2020-12-04 | 山东浪潮商用系统有限公司 | Method and device for predicting abnormal user in taxpayer |
CN112036997B (en) * | 2020-08-28 | 2023-08-04 | 浪潮软件科技有限公司 | Method and device for predicting abnormal users in taxpayers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190607 |