CN109858922A

CN109858922A - Method and device for identifying improper taxpayers

Info

Publication number: CN109858922A
Application number: CN201811584029.3A
Authority: CN
Inventors: 刘芬; 舒南飞; 林文辉; 王志刚
Original assignee: Aisino Corp
Current assignee: Aisino Corp
Priority date: 2018-12-24
Filing date: 2018-12-24
Publication date: 2019-06-07

Abstract

The invention discloses a method and a device for identifying improper taxpayers. The method comprises: obtaining selected information of a taxpayer to be identified; obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from that selected information; inputting the characteristic value of the at least one selected feature into each of a first set quantity of trained xgboost models in turn, to obtain a first set quantity of probability values for the taxpayer to be identified; and obtaining the recognition result of the taxpayer to be identified based on the first set quantity of probability values. The scheme uses machine learning algorithms and big data technology to construct an improper-taxpayer identification model that identifies whether a taxpayer is normal.

Description

Method and device for identifying improper taxpayers

Technical field

The present invention relates to the technical field of information processing, and in particular to a method and a device for identifying improper taxpayers.

Background technique

Tax revenue is the most important form and source of national public finance income. Although the widespread adoption of the value-added tax anti-counterfeiting tax control system provides a strong means for tax collection and administration and for increasing national revenue, tax risk management remains inadequate. It still relies mainly on the business experience of tax analysts, which is highly subjective, inaccurate and inefficient, especially when distinguishing enterprises that falsely issue invoices from absconding (zoutao) enterprises. Using machine learning algorithms and big data technology to construct an improper-taxpayer identification model that identifies whether a taxpayer is normal can not only improve the monitoring and recognition effect and the efficiency of identifying suspicious enterprises, but also help maintain normal tax and economic order.

Summary of the invention

The embodiments of the present invention provide a method and a device for identifying improper taxpayers, so as to construct, by means of machine learning algorithms and big data technology, an improper-taxpayer identification model that identifies whether a taxpayer is normal.

According to an embodiment of the present invention, a method for identifying improper taxpayers is provided, the method comprising:

Obtaining selected information of a taxpayer to be identified;

Obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

Inputting the characteristic value of the at least one selected feature of the taxpayer to be identified into each of a first set quantity of trained xgboost models in turn, to obtain a first set quantity of probability values for the taxpayer to be identified;

Obtaining the recognition result of the taxpayer to be identified based on the first set quantity of probability values of the taxpayer to be identified.

Specifically, obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified comprises:

Obtaining the initial characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

Changing any unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the taxpayer to be identified to a set value, marking the initial characteristic values that represent categories as categorical, and standardizing the initial characteristic values of the at least one selected feature, to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified.

Specifically, obtaining the recognition result based on the first set quantity of probability values comprises:

Calculating the mean of the first set quantity of probability values;

Comparing the mean with a set threshold;

If the mean is greater than or equal to the set threshold, determining that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determining that the taxpayer to be identified is a normal taxpayer.

Specifically, the method further comprises:

Obtaining selected information of a second set quantity of normal taxpayers and selected information of a third set quantity of improper taxpayers;

Obtaining, from the selected information of the second set quantity of normal taxpayers, the characteristic value of at least one selected feature of each corresponding normal taxpayer, and adding the characteristic value of the at least one selected feature of each normal taxpayer, together with a normal-taxpayer label, to a characteristic value-label wide table;

Obtaining, from the selected information of the third set quantity of improper taxpayers, the characteristic value of at least one selected feature of each corresponding improper taxpayer, and adding the characteristic value of the at least one selected feature of each improper taxpayer, together with an improper-taxpayer label, to the characteristic value-label wide table;

Obtaining a test sample set and a first set quantity of training sample sets from the characteristic value-label wide table;

Inputting the characteristic values of the at least one selected feature of each taxpayer in each of the first set quantity of training sample sets, together with the corresponding labels, into a separate initial xgboost model, to obtain a first set quantity of candidate xgboost models;

Testing the first set quantity of candidate xgboost models using the characteristic values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;

Determining the precision and the recall based on the test results;

If both the precision and the recall meet the required standard, determining the first set quantity of candidate xgboost models to be the first set quantity of trained xgboost models.

Specifically, obtaining, from the selected information of the second set quantity of normal taxpayers, the characteristic value of at least one selected feature of each corresponding normal taxpayer comprises:

Obtaining, from the selected information of the second set quantity of normal taxpayers, the initial characteristic value of at least one selected feature of each corresponding normal taxpayer;

Changing any unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the second set quantity of normal taxpayers to a set value, marking the initial characteristic values that represent categories as categorical, and standardizing the initial characteristic values of the at least one selected feature of the second set quantity of normal taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding normal taxpayer.

Specifically, obtaining, from the selected information of the third set quantity of improper taxpayers, the characteristic value of at least one selected feature of each corresponding improper taxpayer comprises:

Obtaining, from the selected information of the third set quantity of improper taxpayers, the initial characteristic value of at least one selected feature of each corresponding improper taxpayer;

Changing any unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the third set quantity of improper taxpayers to a set value, marking the initial characteristic values that represent categories as categorical, and standardizing the initial characteristic values of the at least one selected feature of the third set quantity of improper taxpayers, to obtain the characteristic value of the at least one selected feature of each corresponding improper taxpayer.

Specifically, obtaining the test sample set and the first set quantity of training sample sets from the characteristic value-label wide table comprises:

Dividing the improper taxpayers in the characteristic value-label wide table into two parts according to a preset ratio, to obtain a first improper-taxpayer set and a second improper-taxpayer set;

For each of the first set quantity of training sample sets, performing the following: extracting from the characteristic value-label wide table normal taxpayers that have not been extracted before, equal in number to the improper taxpayers in the first improper-taxpayer set, and combining the characteristic values of the at least one selected feature of the extracted normal taxpayers and the corresponding labels with the characteristic values of the at least one selected feature of the improper taxpayers in the first improper-taxpayer set and the corresponding labels, to obtain one training sample set;

Combining the characteristic values of the at least one selected feature of the normal taxpayers in the characteristic value-label wide table that were not extracted, and the corresponding labels, with the characteristic values of the at least one selected feature of the improper taxpayers in the second improper-taxpayer set and the corresponding labels, to obtain the test sample set.
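The sampling procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `build_sample_sets`, the `train_ratio` default of 0.8 (standing in for the unspecified preset ratio), the random shuffling and the 0/1 label encoding are all hypothetical.

```python
import random


def build_sample_sets(normal, improper, n_sets, train_ratio=0.8, seed=0):
    """Build n_sets balanced training sets and one test set.

    normal / improper: lists of feature rows for labelled taxpayers.
    train_ratio: hypothetical stand-in for the "preset ratio" in the text.
    Labels: 1 = improper taxpayer (positive), 0 = normal taxpayer (negative).
    """
    rng = random.Random(seed)

    # Split the improper taxpayers into a first and a second set.
    shuffled_improper = improper[:]
    rng.shuffle(shuffled_improper)
    cut = int(len(shuffled_improper) * train_ratio)
    first_improper = shuffled_improper[:cut]
    second_improper = shuffled_improper[cut:]

    k = len(first_improper)
    if n_sets * k > len(normal):
        raise ValueError("not enough normal taxpayers for non-repeating draws")

    pool = normal[:]
    rng.shuffle(pool)

    train_sets = []
    for i in range(n_sets):
        # Each draw takes k normal taxpayers that were not drawn before.
        drawn = pool[i * k:(i + 1) * k]
        rows = [(x, 0) for x in drawn] + [(x, 1) for x in first_improper]
        train_sets.append(rows)

    # Normal taxpayers never drawn, plus the second improper set, form the test set.
    remaining = pool[n_sets * k:]
    test_set = [(x, 0) for x in remaining] + [(x, 1) for x in second_improper]
    return train_sets, test_set
```

Every training set reuses the same first improper-taxpayer set but a different slice of normal taxpayers, so each set is balanced and the normal samples do not repeat across sets.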

Specifically, determining the precision and the recall based on the test results comprises:

The precision is calculated as follows:

Precision = TP / (TP + FP);

The recall is calculated as follows:

Recall = TP / (TP + FN);

Here, improper-taxpayer samples are taken as positive samples and normal-taxpayer samples as negative samples. TP is the number of samples that the test predicts as positive and that are actually positive, FP is the number of samples that the test predicts as positive but that are actually negative, and FN is the number of samples that the test predicts as negative but that are actually positive.
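As an illustration of the two formulas, the following sketch counts TP, FP and FN from true and predicted labels. The function name `precision_recall`, the 1/0 label encoding (1 for improper taxpayer) and the choice to return 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) with 1 = improper (positive), 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Assumption: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Example: 2 true positives, 1 false positive, 1 false negative.
p, r = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```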

According to an embodiment of the present invention, a device for identifying improper taxpayers is also provided, the device comprising:

A first obtaining module, configured to obtain selected information of a taxpayer to be identified;

A second obtaining module, configured to obtain the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

An input module, configured to input the characteristic value of the at least one selected feature of the taxpayer to be identified into each of a first set quantity of trained xgboost models in turn, to obtain a first set quantity of probability values for the taxpayer to be identified;

An identification module, configured to obtain the recognition result of the taxpayer to be identified based on the first set quantity of probability values of the taxpayer to be identified.

Specifically, the second obtaining module, configured to obtain the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified, is specifically configured to:

Specifically, the identification module, configured to obtain the recognition result based on the first set quantity of probability values, is specifically configured to:

Compare the mean of the first set quantity of probability values with a set threshold;

If the mean is greater than or equal to the set threshold, determine that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determine that the taxpayer to be identified is a normal taxpayer.

Specifically, the device further comprises:

A third obtaining module, configured to obtain selected information of a second set quantity of normal taxpayers and selected information of a third set quantity of improper taxpayers;

An adding module, configured to obtain, from the selected information of the second set quantity of normal taxpayers, the characteristic value of at least one selected feature of each corresponding normal taxpayer, and add the characteristic value of the at least one selected feature of each normal taxpayer, together with a normal-taxpayer label, to a characteristic value-label wide table; and to obtain, from the selected information of the third set quantity of improper taxpayers, the characteristic value of at least one selected feature of each corresponding improper taxpayer, and add the characteristic value of the at least one selected feature of each improper taxpayer, together with an improper-taxpayer label, to the characteristic value-label wide table;

A fourth obtaining module, configured to obtain a test sample set and a first set quantity of training sample sets from the characteristic value-label wide table;

A training module, configured to input the characteristic values of the at least one selected feature of each taxpayer in each of the first set quantity of training sample sets, together with the corresponding labels, into a separate initial xgboost model, to obtain a first set quantity of candidate xgboost models;

A test module, configured to test the first set quantity of candidate xgboost models using the characteristic values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;

A first determining module, configured to determine the precision and the recall based on the test results;

A second determining module, configured to determine, if both the precision and the recall meet the required standard, the first set quantity of candidate xgboost models to be the first set quantity of trained xgboost models.

Specifically, the adding module, configured to obtain, from the selected information of the second set quantity of normal taxpayers, the characteristic value of at least one selected feature of each corresponding normal taxpayer, is specifically configured to:

Specifically, the adding module, configured to obtain, from the selected information of the third set quantity of improper taxpayers, the characteristic value of at least one selected feature of each corresponding improper taxpayer, is specifically configured to:

Specifically, the fourth obtaining module, configured to obtain the test sample set and the first set quantity of training sample sets from the characteristic value-label wide table, is specifically configured to:

Specifically, the first determining module, configured to determine the precision and the recall based on the test results, is specifically configured to:

The precision is calculated as follows:

Precision = TP / (TP + FP);

The recall is calculated as follows:

Recall = TP / (TP + FN);

The beneficial effects of the present invention are as follows:

The embodiments of the present invention provide a method and a device for identifying improper taxpayers: selected information of a taxpayer to be identified is obtained; the characteristic value of at least one selected feature of the taxpayer to be identified is obtained from that selected information; the characteristic value of the at least one selected feature is input into each of a first set quantity of trained xgboost models in turn, to obtain a first set quantity of probability values for the taxpayer to be identified; and the recognition result of the taxpayer to be identified is obtained based on the first set quantity of probability values. In this scheme, a first set quantity of xgboost models can be trained in advance; the trained models are then used to obtain a first set quantity of probability values for the taxpayer to be identified, and whether the taxpayer to be identified is an improper taxpayer is determined from those probability values. In this way, machine learning algorithms and big data technology are used to construct an improper-taxpayer identification model that identifies whether a taxpayer is normal.

Brief description of the drawings

Fig. 1 is a flow chart of a method for identifying improper taxpayers in an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of a device for identifying improper taxpayers in an embodiment of the present invention.

Detailed description of the embodiments

In order to construct, by means of machine learning algorithms and big data technology, an improper-taxpayer identification model that identifies whether a taxpayer is normal, an embodiment of the present invention provides a method for identifying improper taxpayers. As an ensemble learning model with better generalization ability than a single learner, the xgboost algorithm supports parallel computation and adds a regularization term to the loss function to prevent over-fitting. It offers outstanding efficiency and high prediction accuracy, and is frequently used both in industry and in Kaggle competitions. Value-added tax invoice data is huge in volume and grows quickly, and the diversity of and differences between enterprises leave a large number of missing values in the extracted feature variables. Handling missing values is another characteristic of xgboost, which can automatically learn the split direction for them. The present embodiment therefore proposes a method for identifying improper taxpayers based on the xgboost algorithm, which aims to change the traditional practice of relying solely on experience for risk identification, improve recognition efficiency, and contribute to better value-added tax risk management.

The flow of the above method for identifying improper taxpayers is shown in Fig. 1, and the steps are as follows:

S11: Obtain the selected information of the taxpayer to be identified.

The taxpayers in the present embodiment are mainly enterprises, so the selected information of the taxpayer to be identified may include, but is not limited to, enterprise information, value-added tax invoice data and commodity detail data.

S12: Obtain the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified.

Based on the enterprise information, value-added tax invoice data, commodity detail data and so on of the taxpayer to be identified, the selected feature variables are designed from multiple angles: enterprise information, operating conditions, comparison with the development of other enterprises in the same industry, invoice issuance, invoice voiding, invoice requisition, invoice authentication, difference analysis of input and output commodity details, transactions inside and outside the province, and upstream and downstream transactions.

From the angle of enterprise information, the selected feature variables include: the industry code; whether the registration type is an individual or private enterprise; whether the enterprise is a general taxpayer; whether the legal representative is non-local; the time the enterprise was established; whether one address is shared by multiple business licenses; whether the legal representative is shared with many enterprises; whether the tax handler is shared by many enterprises; whether the financial staff are shared by many enterprises; and whether the legal representative and the financial staff serve in each other's roles.

From the angle of the enterprise's operating conditions, the selected feature variables include statistics of the rate of change of the output tax amount, the input tax amount, the tax payable, the invoiced amount, the price-and-tax total, the profit and the tax burden, as well as the average output amount and the average profit in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6). The statistics mentioned above comprise the median, the variance and the mean, and T is the cut-off time of model data extraction.
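The operating-condition features can be illustrated with a small sketch. The text does not define the rate of change or the variance estimator, so the month-over-month definition, the use of population variance, and the names `change_rates`, `change_rate_stats` and `window_averages` are assumptions made for illustration only.

```python
from statistics import mean, median, pvariance


def change_rates(monthly_values):
    """Month-over-month rate of change; pairs with a zero base month are skipped."""
    return [
        (cur - prev) / prev
        for prev, cur in zip(monthly_values, monthly_values[1:])
        if prev != 0
    ]


def change_rate_stats(monthly_values):
    """Median, variance and mean of the rate of change, or None if undefined."""
    rates = change_rates(monthly_values)
    if not rates:
        return None
    return {
        "median": median(rates),
        "variance": pvariance(rates),  # assumption: population variance
        "mean": mean(rates),
    }


def window_averages(monthly_values):
    """Averages over the three 3-month windows ending at the cut-off month T.

    monthly_values is ordered oldest to newest and must hold at least 9 months.
    """
    if len(monthly_values) < 9:
        raise ValueError("need at least 9 monthly values")
    last9 = monthly_values[-9:]
    return {
        "T-9_to_T-6": mean(last9[0:3]),
        "T-6_to_T-3": mean(last9[3:6]),
        "last_3_months": mean(last9[6:9]),
    }
```

For example, `change_rate_stats` could be applied to a monthly output tax series, and `window_averages` to monthly output amounts or profits.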

From the angle of comparison with the development of other enterprises in the same industry, the selected feature variables include statistics of the difference rate of the price-and-tax total, the profit and the tax burden; the statistics here comprise the median, the variance and the mean.

From the angle of invoice issuance, the selected feature variables include: the share of abnormal invoices by count or by amount; the share of months in which no invoices were issued; the share of months without input invoices; the total number of invoice-issuing counterparties or regions; the total number of invoice-receiving counterparties or regions; the ratio of red-letter to blue-letter invoices by count, amount or tax amount; the share of invoices issued at the maximum permitted amount, by count or by amount; and the share of months with output invoices but no input invoices. These features are calculated in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6), where T is the cut-off time of model data extraction. In addition, a feature with a minimum time granularity of 5 days is designed, namely the share of voided invoices, by count or by amount, in the most recent 5 days.

From the angle of invoice voiding, the selected feature variables include the share of voided invoices, by count or by amount, in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6), and the share of abnormally voided invoices by count or by amount, where T is the cut-off time of model data extraction.

From the angle of invoice requisition, the selected feature variables include whether invoices were frequently requisitioned in the most recent 1 month, the most recent 3 months, the 3 months from T-6 to T-3, or the 3 months from T-9 to T-6, and the number of invoices purchased in the most recent 1 month, where T is the cut-off time of model data extraction.

From the angle of invoice authentication, the selected feature variables include the share of invoices, by count or by amount, that were out of control at the time of authentication or became out of control after authentication, in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6), where T is the cut-off time of model data extraction.

From the angle of difference analysis of input and output commodity details, the selected feature variables include the input-output tax amount difference, the input-output amount difference and the input-output goods difference in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6), where T is the cut-off time of model data extraction.

From the angle of transactions inside and outside the province and upstream and downstream transactions, the selected feature variables include, in different time periods (the most recent 3 months, the 3 months from T-6 to T-3, and the 3 months from T-9 to T-6): the mean and variance of the share of the output amount or input amount from transactions with enterprises inside or outside the province; statistics of the rate of change of the number of downstream/upstream trading enterprises in other provinces; statistics of the rate of change of the downstream/upstream transaction volume with other provinces; the dispersion of the number of upstream/downstream enterprises; the dispersion of the upstream/downstream transaction volume; and the upstream/downstream transaction stability. The statistics above comprise the median, the variance and the mean.

S13: Input the characteristic value of the at least one selected feature of the taxpayer to be identified into each of a first set quantity of trained xgboost models in turn, to obtain a first set quantity of probability values for the taxpayer to be identified.

S14: Obtain the recognition result of the taxpayer to be identified based on the first set quantity of probability values of the taxpayer to be identified.

In this scheme, a first set quantity of xgboost models can be trained in advance; the trained models are then used to obtain a first set quantity of probability values for the taxpayer to be identified, and whether the taxpayer to be identified is an improper taxpayer is determined from those probability values. In this way, machine learning algorithms and big data technology are used to construct an improper-taxpayer identification model and thereby identify whether a taxpayer is normal.

Specifically, obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified in S12 above comprises:

Obtaining the initial characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

Changing any unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the taxpayer to be identified to a set value, marking the initial characteristic values that represent categories as categorical, and standardizing the initial characteristic values of the at least one selected feature, to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified.

The initial characteristic values of the at least one selected feature obtained from the selected information of the taxpayer to be identified do not necessarily all conform to the standard, so they can be normalized in the manner described above to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified. Only three processing modes are listed above; other modes may also be used and are not enumerated here.
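The three processing modes can be sketched as below. The text does not say what counts as unreasonable, which set value is used, how categories are marked, or which standardization is applied, so the caller-supplied `is_reasonable` predicate, the integer category codes and the z-score standardization are all assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def replace_unreasonable(values, is_reasonable, set_value):
    """Replace values that fail the caller-supplied check with a set value."""
    return [v if is_reasonable(v) else set_value for v in values]


def encode_categories(values):
    """Mark categorical values by mapping each distinct category to an integer code."""
    codes = {c: i for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def standardize(values):
    """Z-score standardization; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Example: a missing or negative amount is treated as unreasonable and set to 0.0.
amounts = replace_unreasonable(
    [10.0, -1.0, None, 30.0],
    lambda v: v is not None and v >= 0,
    0.0,
)
scaled = standardize(amounts)
industry_codes, mapping = encode_categories(["B", "A", "B"])
```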

Specifically, obtaining the recognition result based on the first set quantity of probability values in S14 above comprises:

Comparing the mean of the first set quantity of probability values with a set threshold;

If the mean is greater than or equal to the set threshold, determining that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determining that the taxpayer to be identified is a normal taxpayer.

The prediction results of the first set quantity of xgboost models are averaged, and the average is taken as the probability that the taxpayer to be identified is predicted to be an improper taxpayer. The set threshold can be chosen according to actual needs; for example, it may be, but is not limited to being, set to 0.9. If the probability value is greater than or equal to 0.9, the taxpayer to be identified is determined to be an improper taxpayer; otherwise, the taxpayer to be identified is determined to be a normal taxpayer.
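The averaging and threshold decision can be sketched as follows. Here each model is represented by a plain callable that returns a probability, which is a hypothetical stand-in for a trained xgboost model; the function name `identify` is likewise an assumption, and the default threshold of 0.9 is the example value given in the text.

```python
from statistics import mean


def identify(features, models, threshold=0.9):
    """Average the per-model probabilities and compare with the set threshold.

    models: callables mapping a feature row to the probability that the
    taxpayer is improper (stand-ins for the trained xgboost models).
    """
    probabilities = [model(features) for model in models]
    average = mean(probabilities)
    label = "improper" if average >= threshold else "normal"
    return label, average


# Example with three hypothetical models that return fixed probabilities.
stub_models = [lambda row: 0.95, lambda row: 0.90, lambda row: 0.88]
label, average = identify({"industry_code": 1}, stub_models)
```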

Optionally, the method further comprises:

Obtaining, from the selected information of the second set quantity of normal taxpayers, the characteristic value of at least one selected feature of each corresponding normal taxpayer, and adding the characteristic value of the at least one selected feature of each normal taxpayer, together with a normal-taxpayer label, to the characteristic value-label wide table;

Obtaining, from the selected information of the third set quantity of improper taxpayers, the characteristic value of at least one selected feature of each corresponding improper taxpayer, and adding the characteristic value of the at least one selected feature of each improper taxpayer, together with an improper-taxpayer label, to the characteristic value-label wide table;

Inputting the characteristic values of the at least one selected feature of each taxpayer in each of the first set quantity of training sample sets, together with the corresponding labels, into a separate initial xgboost model, to obtain a first set quantity of candidate xgboost models;

Testing the first set quantity of candidate xgboost models using the characteristic values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;

Determining the precision and the recall based on the test results;

If both the precision and the recall meet the required standard, determining the first set quantity of candidate xgboost models to be the first set quantity of trained xgboost models.

The parameters of the first set quantity of xgboost models can be set, and the first set quantity of training sample sets are used for training to obtain the first set quantity of candidate xgboost models; these models are then applied to the test sample set for prediction. The parameters of the xgboost model may include, but are not limited to:

'objective': 'binary:logistic', logistic regression for binary classification, with probability as the output

'max_depth': the maximum depth of each tree

'eta': the shrinkage step size used during updates to prevent over-fitting

'silent': 0, run-time message switch (in xgboost, 0 prints run-time messages and 1 suppresses them)

'eval_metric': 'map', the evaluation metric; map denotes mean average precision

'lambda': the L2 regularization term parameter, which controls model complexity

'min_child_weight': the minimum sum of second-order derivatives (Hessian) required in each leaf node

'nthread': the number of CPU threads

'num_rounds': the number of iterations.
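For illustration, the parameters listed above could be collected in a Python dictionary as sketched below. The text gives values only for 'objective', 'silent' and 'eval_metric'; every other value here is a hypothetical placeholder, and the helper `training_kwargs` is an assumption of this example. In the xgboost Python package the number of iterations is normally passed to `xgboost.train` as `num_boost_round` rather than inside the parameter dictionary, and newer releases use `verbosity` in place of `silent`.

```python
# Values marked "placeholder" are hypothetical; the source does not state them.
params = {
    "objective": "binary:logistic",  # binary classification, probability output
    "max_depth": 6,                  # placeholder
    "eta": 0.1,                      # placeholder
    "silent": 0,                     # as listed in the text
    "eval_metric": "map",            # mean average precision
    "lambda": 1.0,                   # placeholder
    "min_child_weight": 1,           # placeholder
    "nthread": 4,                    # placeholder
}
num_rounds = 100                     # placeholder


def training_kwargs(params, num_rounds):
    """Bundle the settings in the shape xgboost.train(params, dtrain, num_boost_round=...) takes."""
    return {"params": dict(params), "num_boost_round": num_rounds}


kwargs = training_kwargs(params, num_rounds)
```

One such set of settings would be used for each of the first set quantity of models, each trained on its own training sample set.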

Specifically, obtaining corresponding normal taxpayer in the selected information of the above-mentioned normal taxpayer from the second setting quantity At least one selected feature characteristic value, specifically include:

At least one of corresponding normal taxpayer is obtained from the selected information of the normal taxpayer of the second setting quantity The initial characteristic values of selected feature；

By in the initial characteristic values of at least one selected feature of the normal taxpayer of the second setting quantity it is unreasonable just Beginning characteristic value is changed to the initial characteristic values of at least one selected feature of setting value, the normal taxpayer for setting quantity for second The middle initial characteristic values for indicating classification are identified as classification, at least one selected feature by the normal taxpayer of the second setting quantity Initial characteristic values be standardized, obtain the characteristic value of at least one selected feature of corresponding normal taxpayer.

Specifically, obtaining the characteristic values of at least one selected feature of each corresponding improper taxpayer from the selected information of the improper taxpayers of the third setting quantity includes:

Obtaining the initial characteristic values of at least one selected feature of each corresponding improper taxpayer from the selected information of the improper taxpayers of the third setting quantity;

Changing the unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the improper taxpayers of the third setting quantity to a setting value; identifying as categories those initial characteristic values that indicate a category; and standardizing the initial characteristic values of the at least one selected feature of the improper taxpayers of the third setting quantity, to obtain the characteristic values of the at least one selected feature of each corresponding improper taxpayer.
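The three preprocessing steps described above (replacing unreasonable values, marking category values, standardizing) can be sketched as follows. This is an illustration under stated assumptions, not the method fixed by the source: the source does not define what counts as "unreasonable", which setting value is used, how categories are encoded, or which standardization is applied. Here a value is treated as unreasonable when it is missing or negative, the setting value is 0.0, categories are mapped to integer codes, and z-score standardization is used. The function names and sample data are hypothetical.

```python
from statistics import mean, pstdev


def preprocess_column(values, is_valid, set_value):
    """Replace unreasonable values with set_value, then z-score standardize."""
    cleaned = [v if is_valid(v) else set_value for v in values]
    mu = mean(cleaned)
    sigma = pstdev(cleaned)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in cleaned]
    return [(v - mu) / sigma for v in cleaned]


def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


# Hypothetical raw feature columns for four taxpayers.
raw_amounts = [100.0, -5.0, None, 300.0]          # negative / missing = unreasonable (assumed)
raw_industry = ["retail", "mining", "retail", "transport"]

amounts = preprocess_column(
    raw_amounts,
    is_valid=lambda v: v is not None and v >= 0,   # assumed validity rule
    set_value=0.0,                                  # assumed setting value
)
industry = encode_categories(raw_industry)
```

The same functions would be applied to normal taxpayers, improper taxpayers and the taxpayer to be identified, so that all three pass through identical preprocessing.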

Specifically, obtaining the test sample set and the training sample sets of the first setting quantity from the characteristic value label wide table includes:

Dividing the improper taxpayers in the characteristic value label wide table into two parts according to a preset ratio, to obtain a first improper taxpayer set and a second improper taxpayer set;

For each training sample set among the training sample sets of the first setting quantity, executing: extracting, from the characteristic value label wide table, normal taxpayers that have not been extracted before and whose number equals the number of improper taxpayers in the first improper taxpayer set; and combining the characteristic values of the at least one selected feature and the corresponding labels of the extracted normal taxpayers with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the first improper taxpayer set, to obtain one training sample set;

Combining the characteristic values of the at least one selected feature and the corresponding labels of the normal taxpayers not extracted from the characteristic value label wide table with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the second improper taxpayer set, to obtain the test sample set.

Because the number of improper taxpayers is very small, the samples are severely imbalanced. To safeguard the performance of the xgboost models, the improper taxpayers are divided into two parts according to the preset ratio, giving the first improper taxpayer set and the second improper taxpayer set. For each training sample set among the training sample sets of the first setting quantity, normal taxpayers that have not been extracted before, equal in number to the improper taxpayers in the first improper taxpayer set, are extracted from the characteristic value label wide table; their characteristic values and corresponding labels are combined with the characteristic values and corresponding labels of the improper taxpayers in the first improper taxpayer set to obtain one training sample set. Each training sample set is therefore balanced between normal and improper taxpayers.

The preset ratio should ensure that the ratio of normal taxpayers to improper taxpayers in the final test sample set is close to the actual ratio in real life.
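The sampling scheme described above can be sketched in a few lines. This is a minimal illustration, assuming each row of the wide table is a `(taxpayer_id, label)` pair with label 0 for normal and 1 for improper; the function name, the random shuffling and the sample data are assumptions, not details from the source.

```python
import random


def build_sample_sets(normal, improper, ratio, n_sets, seed=0):
    """Split improper rows by `ratio`; build n_sets balanced training sets and one test set."""
    rng = random.Random(seed)

    # Divide the improper taxpayers into two parts according to the preset ratio.
    improper = improper[:]
    rng.shuffle(improper)
    cut = int(len(improper) * ratio)
    first, second = improper[:cut], improper[cut:]

    # Draw normal taxpayers without replacement across training sets.
    pool = normal[:]
    rng.shuffle(pool)
    if n_sets * len(first) > len(pool):
        raise ValueError("not enough normal taxpayers for the requested sets")

    training_sets = []
    for i in range(n_sets):
        drawn = pool[i * len(first):(i + 1) * len(first)]
        training_sets.append(drawn + first)   # same improper set in every training set

    # Normal taxpayers never drawn, plus the second improper set, form the test set.
    test_set = pool[n_sets * len(first):] + second
    return training_sets, test_set


# Hypothetical wide table: 20 normal and 4 improper taxpayers.
normal_rows = [("n%d" % i, 0) for i in range(20)]
improper_rows = [("a%d" % i, 1) for i in range(4)]
training_sets, test_set = build_sample_sets(normal_rows, improper_rows,
                                            ratio=0.5, n_sets=3)
```

Every training set reuses the same first improper taxpayer set but a fresh slice of normal taxpayers, which is what lets several balanced models be trained from a heavily imbalanced table.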

Specifically, determining the accurate rate and the recall rate based on the test results includes:

The calculation formula of the accurate rate is as follows:

Precision = TP / (TP + FP);

The calculation formula of the recall rate is as follows:

Recall = TP / (TP + FN);

Here, improper taxpayer samples are assumed to be positive samples and normal taxpayer samples are assumed to be negative samples. TP denotes the number of samples whose test result is positive and which are actually positive; FP denotes the number of samples whose test result is positive but which are actually negative; FN denotes the number of samples whose test result is negative but which are actually positive.
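The two formulas can be checked with a short worked example. The sketch below counts TP, FP and FN from true and predicted labels (1 for an improper taxpayer, 0 for a normal one); returning 0.0 when a denominator is zero is an assumption made here for convenience, and the label lists are made-up data.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) with improper taxpayers (label 1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention for 0/0
    return precision, recall


# Made-up test results: 3 improper and 5 normal taxpayers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
# Here TP = 2, FP = 1, FN = 1, so both values equal 2/3.
```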

The accurate rate and the recall rate are not positively correlated: improving the accurate rate often lowers the recall rate, and improving the recall rate can likewise lower the accurate rate. It is therefore necessary to try repeatedly, adjusting the parameters to find a parameter combination that strikes a compromise between the accurate rate and the recall rate. Under the optimal parameter combination, the feature importance ranking is output. The final results show that features such as the stability of upstream and downstream transactions and the taxpayer's industry play an important role in distinguishing whether a taxpayer is an improper taxpayer.
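One possible way to make the compromise concrete is sketched below. The source does not say how the compromise is scored; using the F1 score (the harmonic mean of the two rates) among the combinations that meet minimum thresholds is an assumption introduced here, as are the function name and the numbers, which are made up and are not results from the source.

```python
def pick_compromise(results, min_precision, min_recall):
    """Among combinations meeting both thresholds, return the one with the highest F1."""
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    up_to_standard = [
        r for r in results
        if r["precision"] >= min_precision and r["recall"] >= min_recall
    ]
    if not up_to_standard:
        return None  # no combination is up to standard; keep adjusting parameters
    return max(up_to_standard, key=lambda r: f1(r["precision"], r["recall"]))


# Made-up evaluation results for three hypothetical parameter combinations.
results = [
    {"params": {"max_depth": 4, "eta": 0.3}, "precision": 0.9, "recall": 0.4},
    {"params": {"max_depth": 6, "eta": 0.1}, "precision": 0.8, "recall": 0.7},
    {"params": {"max_depth": 8, "eta": 0.05}, "precision": 0.6, "recall": 0.9},
]
best = pick_compromise(results, min_precision=0.6, min_recall=0.6)
```

The first combination is excluded because its recall is below the threshold, which reflects the trade-off described above: the highest accurate rate alone does not make a combination acceptable.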

Based on the same inventive concept, an embodiment of the present invention provides an improper taxpayer identification device, the structure of which is shown in Figure 2, comprising:

A first obtaining module 21, configured to obtain the selected information of the taxpayer to be identified;

A second obtaining module 22, configured to obtain the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

An input module 23, configured to input the characteristic value of the at least one selected feature of the taxpayer to be identified in turn into the trained xgboost models of the first setting quantity, to obtain probability values of the first setting quantity for the taxpayer to be identified;

An identification module 24, configured to obtain the recognition result of the taxpayer to be identified based on the probability values of the first setting quantity of the taxpayer to be identified.

In this scheme, xgboost models of the first setting quantity can be trained in advance. The trained xgboost models of the first setting quantity are then used to obtain probability values of the first setting quantity for the taxpayer to be identified, and whether the taxpayer to be identified is an improper taxpayer is determined based on these probability values. In this way, an improper taxpayer identification model can be constructed by means of machine learning algorithms and big data technology to identify whether a taxpayer is normal.
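The identification flow described above (several trained models, one probability value per model, a decision from their mean against a set threshold) can be sketched as follows. The trained models are represented here by plain callables that return a probability; these stand-ins, the function name and the threshold of 0.5 are assumptions for illustration. With the xgboost package, each callable would wrap a trained booster's prediction on the taxpayer's feature vector.

```python
def identify(features, models, threshold):
    """Average the probability values from all models and compare with the threshold."""
    probabilities = [model(features) for model in models]
    mean_probability = sum(probabilities) / len(probabilities)
    # Mean >= threshold: improper taxpayer; otherwise: normal taxpayer.
    label = "improper" if mean_probability >= threshold else "normal"
    return label, mean_probability


# Stand-in "trained models": each returns a fixed probability (hypothetical).
models = [lambda f: 0.9, lambda f: 0.6, lambda f: 0.3]
features = [0.12, -1.05, 2.0]   # hypothetical preprocessed characteristic values

label, mean_probability = identify(features, models, threshold=0.5)
```

Averaging over models that were each trained on a different slice of normal taxpayers reduces the dependence of the result on any single balanced training sample set.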

Specifically, the second obtaining module, in obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified, is specifically configured to:

Specifically, the identification module, in obtaining the recognition result based on the probability values of the first setting quantity, is specifically configured to:

Compare the mean of the probability values of the first setting quantity with a set threshold;

Optionally, the device further comprises:

An adding module, configured to: obtain the characteristic values of at least one selected feature of each corresponding normal taxpayer from the selected information of the normal taxpayers of the second setting quantity, and add the characteristic values of the at least one selected feature of each normal taxpayer, together with a normal taxpayer label, to the characteristic value label wide table; and obtain the characteristic values of at least one selected feature of each corresponding improper taxpayer from the selected information of the improper taxpayers of the third setting quantity, and add the characteristic values of the at least one selected feature of each improper taxpayer, together with an improper taxpayer label, to the characteristic value label wide table;

A fourth obtaining module, configured to obtain the test sample set and the training sample sets of the first setting quantity from the characteristic value label wide table;

A training module, configured to input the characteristic values of the at least one selected feature of each taxpayer in the training sample sets of the first setting quantity, together with the corresponding labels, into initial xgboost models respectively, to obtain candidate xgboost models of the first setting quantity;

A test module, configured to test the candidate xgboost models of the first setting quantity using the characteristic values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;

A second determining module, configured to determine the candidate xgboost models of the first setting quantity as the trained xgboost models of the first setting quantity if the accurate rate and the recall rate are up to standard.

Specifically, the adding module, in obtaining the characteristic values of at least one selected feature of each corresponding normal taxpayer from the selected information of the normal taxpayers of the second setting quantity, is specifically configured to:

Specifically, the adding module, in obtaining the characteristic values of at least one selected feature of each corresponding improper taxpayer from the selected information of the improper taxpayers of the third setting quantity, is specifically configured to:

Specifically, the fourth obtaining module, in obtaining the test sample set and the training sample sets of the first setting quantity from the characteristic value label wide table, is specifically configured to:

Specifically, the first determining module, in determining the accurate rate and the recall rate based on the test results, is specifically configured to:

The calculation formula of the accurate rate is as follows:

Precision = TP / (TP + FP);

The calculation formula of the recall rate is as follows:

Recall = TP / (TP + FN);

Here, improper taxpayer samples are assumed to be positive samples and normal taxpayer samples are assumed to be negative samples. TP denotes the number of samples whose test result is positive and which are actually positive; FP denotes the number of samples whose test result is positive but which are actually negative; FN denotes the number of samples whose test result is negative but which are actually positive.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although optional embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as including the optional embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these modifications and variations.

Claims

1. An improper taxpayer recognition method, characterized in that the method comprises:

Obtaining the selected information of a taxpayer to be identified;

Obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

Inputting the characteristic value of the at least one selected feature of the taxpayer to be identified in turn into trained xgboost models of a first setting quantity, to obtain probability values of the first setting quantity for the taxpayer to be identified;

Obtaining the recognition result of the taxpayer to be identified based on the probability values of the first setting quantity of the taxpayer to be identified.

2. The method of claim 1, characterized in that obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified specifically comprises:

Obtaining the initial characteristic values of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

Changing the unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the taxpayer to be identified to a setting value; identifying as categories those initial characteristic values that indicate a category; and standardizing the initial characteristic values of the at least one selected feature of the taxpayer to be identified, to obtain the characteristic value of the at least one selected feature of the taxpayer to be identified.

3. The method of claim 1, characterized in that obtaining the recognition result based on the probability values of the first setting quantity specifically comprises:

Comparing the mean of the probability values of the first setting quantity with a set threshold;

If the mean is greater than or equal to the set threshold, determining that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determining that the taxpayer to be identified is a normal taxpayer.

4. The method of any one of claims 1-3, characterized by further comprising:

Obtaining the selected information of normal taxpayers of a second setting quantity and the selected information of improper taxpayers of a third setting quantity;

Inputting the characteristic values of the at least one selected feature of each taxpayer in the training sample sets of the first setting quantity, together with the corresponding labels, into initial xgboost models respectively, to obtain candidate xgboost models of the first setting quantity;

Testing the candidate xgboost models of the first setting quantity using the characteristic values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;

Determining the accurate rate and the recall rate based on the test results;

If the accurate rate and the recall rate are up to standard, determining the candidate xgboost models of the first setting quantity as the trained xgboost models of the first setting quantity.

5. The method of claim 4, characterized in that obtaining the characteristic values of at least one selected feature of each corresponding normal taxpayer from the selected information of the normal taxpayers of the second setting quantity specifically comprises:

Changing the unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the normal taxpayers of the second setting quantity to a setting value; identifying as categories those initial characteristic values that indicate a category; and standardizing the initial characteristic values of the at least one selected feature of the normal taxpayers of the second setting quantity, to obtain the characteristic values of the at least one selected feature of each corresponding normal taxpayer.

6. The method of claim 4, characterized in that obtaining the characteristic values of at least one selected feature of each corresponding improper taxpayer from the selected information of the improper taxpayers of the third setting quantity specifically comprises:

Changing the unreasonable initial characteristic values among the initial characteristic values of the at least one selected feature of the improper taxpayers of the third setting quantity to a setting value; identifying as categories those initial characteristic values that indicate a category; and standardizing the initial characteristic values of the at least one selected feature of the improper taxpayers of the third setting quantity, to obtain the characteristic values of the at least one selected feature of each corresponding improper taxpayer.

7. The method of claim 4, characterized in that obtaining the test sample set and the training sample sets of the first setting quantity from the characteristic value label wide table specifically comprises:

For each training sample set among the training sample sets of the first setting quantity, executing: extracting, from the characteristic value label wide table, normal taxpayers that have not been extracted before and whose number equals the number of improper taxpayers in the first improper taxpayer set; and combining the characteristic values of the at least one selected feature and the corresponding labels of the extracted normal taxpayers with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the first improper taxpayer set, to obtain one training sample set;

Combining the characteristic values of the at least one selected feature and the corresponding labels of the normal taxpayers not extracted from the characteristic value label wide table with the characteristic values of the at least one selected feature and the corresponding labels of the improper taxpayers in the second improper taxpayer set, to obtain the test sample set.

8. The method of claim 4, characterized in that determining the accurate rate and the recall rate based on the test results specifically comprises:

The calculation formula of the accurate rate is as follows:

Precision = TP / (TP + FP);

The calculation formula of the recall rate is as follows:

Recall = TP / (TP + FN);

Here, improper taxpayer samples are assumed to be positive samples and normal taxpayer samples are assumed to be negative samples. TP denotes the number of samples whose test result is positive and which are actually positive; FP denotes the number of samples whose test result is positive but which are actually negative; FN denotes the number of samples whose test result is negative but which are actually positive.

9. An improper taxpayer identification device, characterized in that the device comprises:

A second obtaining module, configured to obtain the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified;

An input module, configured to input the characteristic value of the at least one selected feature of the taxpayer to be identified in turn into trained xgboost models of a first setting quantity, to obtain probability values of the first setting quantity for the taxpayer to be identified;

An identification module, configured to obtain the recognition result of the taxpayer to be identified based on the probability values of the first setting quantity of the taxpayer to be identified.

10. The device of claim 9, characterized in that the second obtaining module, in obtaining the characteristic value of at least one selected feature of the taxpayer to be identified from the selected information of the taxpayer to be identified, is specifically configured to:

11. The device of claim 9, characterized in that the identification module, in obtaining the recognition result based on the probability values of the first setting quantity, is specifically configured to:

Compare the mean of the probability values of the first setting quantity with a set threshold;

If the mean is greater than or equal to the set threshold, determine that the taxpayer to be identified is an improper taxpayer; if the mean is less than the set threshold, determine that the taxpayer to be identified is a normal taxpayer.

12. The device of any one of claims 9-11, characterized by further comprising:

A third obtaining module, configured to obtain the selected information of normal taxpayers of a second setting quantity and the selected information of improper taxpayers of a third setting quantity;

An adding module, configured to: obtain the characteristic values of at least one selected feature of each corresponding normal taxpayer from the selected information of the normal taxpayers of the second setting quantity, and add the characteristic values of the at least one selected feature of each normal taxpayer, together with a normal taxpayer label, to the characteristic value label wide table; and obtain the characteristic values of at least one selected feature of each corresponding improper taxpayer from the selected information of the improper taxpayers of the third setting quantity, and add the characteristic values of the at least one selected feature of each improper taxpayer, together with an improper taxpayer label, to the characteristic value label wide table;

A fourth obtaining module, configured to obtain the test sample set and the training sample sets of the first setting quantity from the characteristic value label wide table;

A test module, configured to test the candidate xgboost models of the first setting quantity using the characteristic values of the at least one selected feature of each taxpayer in the test sample set and the corresponding labels;

A second determining module, configured to determine the candidate xgboost models of the first setting quantity as the trained xgboost models of the first setting quantity if the accurate rate and the recall rate are up to standard.

13. The device of claim 12, characterized in that the adding module, in obtaining the characteristic values of at least one selected feature of each corresponding normal taxpayer from the selected information of the normal taxpayers of the second setting quantity, is specifically configured to:

14. The device of claim 12, characterized in that the adding module, in obtaining the characteristic values of at least one selected feature of each corresponding improper taxpayer from the selected information of the improper taxpayers of the third setting quantity, is specifically configured to:

15. The device of claim 12, characterized in that the fourth obtaining module, in obtaining the test sample set and the training sample sets of the first setting quantity from the characteristic value label wide table, is specifically configured to:

16. The device of claim 12, characterized in that the first determining module, in determining the accurate rate and the recall rate based on the test results, is specifically configured to:

The calculation formula of the accurate rate is as follows:

Precision = TP / (TP + FP);

The calculation formula of the recall rate is as follows:

Recall = TP / (TP + FN);