CN112270553A - Malicious registered enterprise behavior identification method and system based on isolated forest algorithm - Google Patents
Malicious registered enterprise behavior identification method and system based on isolated forest algorithm
- Publication number
- CN112270553A (application CN202011237306.0A)
- Authority
- CN
- China
- Prior art keywords
- malicious
- enterprise
- registered
- information
- forest algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/018—Certifying business or products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method and system for identifying maliciously registered enterprise behavior based on an isolated forest algorithm. It belongs to the field of public credit and addresses the technical problem that malicious registration behavior in that field is difficult to identify. The technical scheme is as follows: measurable detection features are extracted from market entity registration information, tax registration information, business operation status, daily water and electricity usage information, employee social security contribution information, invoice receipt information, and invoice issuance information; an enterprise data set is constructed from these features; the isolated forest algorithm then randomly partitions the data set with hyperplanes, the points isolated earliest being treated as abnormal points, and abnormal enterprises are screened out accordingly. The specific steps are: data collection; feature selection; data preprocessing; model training; result pushing; feedback result analysis; and model release.
Description
Technical Field
The invention relates to the field of public credit, in particular to a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm.
Background
Isolated forest algorithm (commonly known as isolation forest): an unsupervised anomaly detection method suitable for continuous data. Unlike other anomaly detection algorithms, which describe how far a sample is separated from the others through quantitative indexes such as distance and density, the isolated forest algorithm detects abnormal values by isolating sample points. Specifically, the algorithm isolates samples using a binary search tree structure known as an isolation tree. Because abnormal values are few in number and lie apart from most samples, they are isolated earlier; that is, abnormal values end up closer to the root node, while normal values end up farther from it. In addition, compared with traditional algorithms such as LOF and K-means, the isolated forest algorithm is more robust on high-dimensional data. The algorithm is suited to scenarios with few abnormal samples and many normal samples; since maliciously registered enterprises make up only a small share of all registered enterprises, the algorithm fits this scenario.
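The isolation idea described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used by the invention: it builds every tree on the full data set (no subsampling), uses a toy two-dimensional data set, and all function and variable names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def expected_depth(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Split the points recursively with random axis-parallel cuts."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # the chosen feature is constant in this node; stop splitting
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {
        "dim": dim,
        "cut": cut,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(node, point, depth=0):
    """Number of cuts needed to reach the leaf that holds the point."""
    if "dim" not in node:
        # Unsplit leaves with several points get the usual depth correction.
        return depth + expected_depth(node["size"])
    side = "left" if point[node["dim"]] < node["cut"] else "right"
    return path_length(node[side], point, depth + 1)


def mean_path_length(points, target, n_trees=100, seed=0):
    """Average path length of the target over a forest of random trees."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(max(len(points), 2)))
    trees = [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]
    return sum(path_length(t, target) for t in trees) / n_trees


# Toy data: a dense 7 x 7 grid of "normal" points plus one distant point.
cluster = [(x * 0.1, y * 0.1) for x in range(-3, 4) for y in range(-3, 4)]
outlier = (5.0, 5.0)
data = cluster + [outlier]

outlier_len = mean_path_length(data, outlier)
inlier_len = mean_path_length(data, (0.0, 0.0))
print(outlier_len, inlier_len)
```

The distant point is usually cut off by the first or second random split, so its mean path length is much shorter than that of a point in the middle of the cluster, which is what "closer to the root node" means above.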
Feature selection for the traditional isolated forest algorithm relies largely on expert experience. This approach lacks reliable data support, which easily leads to low model identification accuracy, and because the algorithm is unsupervised, its results are difficult to verify.
Disclosure of Invention
The invention provides a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm, and aims to solve the problem that malicious registered behaviors in the field of public credit are difficult to identify.
The technical task of the invention is achieved as follows. A method for identifying maliciously registered enterprise behavior based on an isolated forest algorithm extracts measurable detection features from market entity registration information, tax registration information, business operation status, daily water and electricity usage information, employee social security contribution information, invoice receipt information, and invoice issuance information, and constructs an enterprise data set from these features. The isolated forest algorithm then randomly partitions the data set with hyperplanes; the points isolated earliest are abnormal points, and abnormal enterprises are screened out accordingly.
Preferably, the method is specifically as follows:
Data collection: enterprise information is collected and extracted to a big data platform with a data extraction tool, where it awaits analysis;
Feature selection: based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, features highly associated with malicious registration behavior are selected, and new features are derived from the existing data;
Data preprocessing: the data is preprocessed to form a standard data set;
Model training: the standard data set is used to train the isolated forest algorithm and obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
Result pushing: the resulting list of suspected maliciously registered enterprises is pushed to the market supervision department and the tax department through a public credit information platform; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
Feedback result analysis: using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, the characteristics of maliciously registered enterprises are analyzed, new highly correlated features are added, and model training continues;
Model release: after multiple rounds of training and feedback, the model becomes stable and its identification accuracy is high, and it can be released for identifying malicious registration by new enterprises.
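The model training step treats the points that are isolated earliest as abnormal, but the text gives no scoring formula. As an assumption, the score commonly used with this algorithm can be written as follows, where $h(x)$ is the number of splits needed to isolate sample $x$ in one tree, $E[h(x)]$ is its average over all trees, and $n$ is the number of samples used to build each tree:

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln i + 0.5772156649
```

Under this formulation, a score close to 1 marks a sample that is isolated after very few splits (a suspected maliciously registered enterprise), while scores well below 0.5 indicate normal samples.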
Preferably, the enterprise information includes enterprise registration information and business operation status from the market supervision department; tax registration information, invoice receipt information, and invoice issuance information from the tax department; and daily water and electricity usage information and employee social security contribution information from other departments.
Preferably, the data preprocessing is specifically as follows:
(1) cleaning the key feature fields;
(2) manually filling in missing values, and removing records whose missing values cannot be filled in;
(3) manually correcting abnormal values, and removing records whose abnormal values cannot be corrected;
(4) feature calculation: computing the derived features according to the feature calculation formulas;
(5) feature normalization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming the standard data set.
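As a rough illustration of steps (2) and (5), the sketch below drops records that still have missing values and then standardizes each feature to zero mean and unit variance. The manual filling and correction steps are not modeled, z-score scaling is only one possible reading of "standardized", and the field names are hypothetical.

```python
from statistics import mean, pstdev


def preprocess(records, feature_names):
    """Drop incomplete records, then z-score every feature."""
    # Records whose missing values could not be filled in are removed.
    complete = [
        r for r in records
        if all(r.get(name) is not None for name in feature_names)
    ]
    if not complete:
        return []

    # Per-feature mean and population standard deviation.
    stats = {}
    for name in feature_names:
        values = [r[name] for r in complete]
        stats[name] = (mean(values), pstdev(values))

    standardized = []
    for r in complete:
        row = {"id": r["id"]}
        for name in feature_names:
            mu, sigma = stats[name]
            # A constant feature carries no information; map it to 0.
            row[name] = 0.0 if sigma == 0 else (r[name] - mu) / sigma
        standardized.append(row)
    return standardized


# Hypothetical records; "invoices_received" is an illustrative field name.
records = [
    {"id": "A", "age_days": 10, "invoices_received": 100},
    {"id": "B", "age_days": 20, "invoices_received": 200},
    {"id": "C", "age_days": 30, "invoices_received": 300},
    {"id": "D", "age_days": None, "invoices_received": 50},  # removed
]
standard_set = preprocess(records, ["age_days", "invoices_received"])
print([row["id"] for row in standard_set])
```

Standardizing matters here because the random cut points are drawn between each feature's minimum and maximum, so features on very different scales are easier to compare and inspect once they share a common scale.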
Preferably, the feedback result analysis is specifically as follows:
(1) using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, calculating the correlation between each feature and malicious registration behavior, eliminating features with low or no correlation, and increasing the proportion of features with high correlation;
(2) analyzing other characteristics of the maliciously registered enterprises and adding new highly correlated features;
(3) performing model training again, so that the identification result becomes more accurate.
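Step (1) of the feedback analysis can be sketched as follows. The text does not name a correlation measure or a cut-off; this sketch assumes the Pearson correlation between each feature and a 0/1 feedback label, and a hypothetical threshold of 0.3. The feature names and data are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, feature_names, threshold=0.3):
    """Keep features whose absolute correlation with the feedback label
    reaches the (assumed) threshold; return them with their correlation."""
    kept = {}
    for name in feature_names:
        r = pearson([row[name] for row in rows], labels)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


# Hypothetical feedback: 1 = confirmed malicious registration, 0 = not confirmed.
labels = [1, 1, 0, 0, 0, 0]
rows = [
    {"invoice_burst": 9, "name_length": 5},
    {"invoice_burst": 8, "name_length": 7},
    {"invoice_burst": 1, "name_length": 5},
    {"invoice_burst": 2, "name_length": 7},
    {"invoice_burst": 1, "name_length": 6},
    {"invoice_burst": 2, "name_length": 6},
]
kept = select_features(rows, labels, ["invoice_burst", "name_length"])
print(sorted(kept))
```

The features that survive this filter, together with any newly added ones, form the feature set for the next training round.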
A malicious registered enterprise behavior identification system based on an isolated forest algorithm comprises:
a data acquisition module, used for collecting enterprise information and extracting it to a big data platform with a data extraction tool, where it awaits analysis; the enterprise information includes enterprise registration information and business operation status from the market supervision department, tax registration information, invoice receipt information, and invoice issuance information from the tax department, and daily water and electricity usage information and employee social security contribution information from other departments;
a feature selection module, used for selecting features highly associated with malicious registration behavior based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, and for deriving new features from the existing data;
a data preprocessing module, used for preprocessing the data to form a standard data set;
a model training module, used for training the isolated forest algorithm on the standard data set to obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
a result pushing module, used for pushing the resulting list of suspected maliciously registered enterprises to the market supervision department and the tax department through a public credit information platform; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
a feedback result analysis module, used for analyzing the characteristics of maliciously registered enterprises with the confirmed results fed back by the market supervision and tax departments, adding new highly correlated features, and continuing model training;
and a model release module, used for releasing the model once, after multiple rounds of training and feedback, it has become stable and its identification accuracy is high; the released model can be used for identifying malicious registration by new enterprises.
Preferably, the data preprocessing module specifically comprises the following working processes:
(1) cleaning the key feature fields;
(2) manually filling in missing values, and removing records whose missing values cannot be filled in;
(3) manually correcting abnormal values, and removing records whose abnormal values cannot be corrected;
(4) feature calculation: computing the derived features according to the feature calculation formulas;
(5) feature normalization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming the standard data set.
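The feature calculation formulas of step (4) are not given in the text. Purely as an illustration, the sketch below derives a few measurable features from traits that the description attributes to maliciously registered enterprises (one person holding several roles, a short time since establishment, many registrations by the same person, heavy invoice intake, and sales out of line with electricity usage). Every field name and formula here is an assumption.

```python
from datetime import date


def derive_features(enterprise, as_of, registrations_by_person):
    """Turn raw enterprise fields into numeric detection features.

    All field names are hypothetical; the formulas are illustrative only.
    """
    # 1.0 if the legal representative, finance officer, and tax handler
    # are the same person, else 0.0.
    same_person = float(
        enterprise["legal_rep_id"]
        == enterprise["finance_id"]
        == enterprise["tax_handler_id"]
    )
    # Days since the enterprise was established.
    age_days = (as_of - enterprise["registered_on"]).days
    # Number of enterprises registered by the same legal representative.
    sibling_count = registrations_by_person.get(enterprise["legal_rep_id"], 0)
    # Invoices received per day of existence (guard against division by zero).
    invoices_per_day = enterprise["invoices_received"] / max(age_days, 1)
    # Invoiced sales per kWh; very high values suggest sales without production.
    sales_per_kwh = enterprise["invoiced_sales"] / max(enterprise["electricity_kwh"], 1.0)
    return {
        "same_person": same_person,
        "age_days": age_days,
        "sibling_count": sibling_count,
        "invoices_per_day": invoices_per_day,
        "sales_per_kwh": sales_per_kwh,
    }


example = {
    "legal_rep_id": "P001",
    "finance_id": "P001",
    "tax_handler_id": "P001",
    "registered_on": date(2020, 10, 2),
    "invoices_received": 300,
    "invoiced_sales": 500000.0,
    "electricity_kwh": 50.0,
}
features = derive_features(example, date(2020, 11, 1), {"P001": 12})
print(features)
```

Each derived value is numeric, so the resulting rows can be standardized in step (5) and passed to the isolated forest algorithm.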
Preferably, the working process of the feedback result analysis module is as follows:
(1) using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, calculating the correlation between each feature and malicious registration behavior, eliminating features with low or no correlation, and increasing the proportion of features with high correlation;
(2) analyzing other characteristics of the maliciously registered enterprises and adding new highly correlated features;
(3) performing model training again, so that the identification result becomes more accurate.
An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executes the computer program stored in the memory, such that the at least one processor performs the malicious registered enterprise behavior identification method based on the isolated forest algorithm as described above.
A computer-readable storage medium having stored thereon a computer program executable by a processor to implement a method for malicious registered enterprise behavior identification based on isolated forest algorithms as described above.
The malicious registered enterprise behavior identification method and system based on the isolated forest algorithm have the following advantages:
(1) the invention provides a model and technique that uses the isolated forest algorithm to identify maliciously registered enterprise behavior, distributes results and collects feedback on them through a public credit information platform, and adjusts feature selection based on that feedback; because the detection features are highly related to malicious registration behavior, abnormal enterprises are pushed to the market supervision department and the tax department as suspected maliciously registered enterprises, and the inspection results fed back by the two departments are collected as the basis for subsequent feature screening, which improves the identification accuracy of the model; the invention can effectively discover potential maliciously registered enterprises and assist the market supervision and tax departments in their daily supervision work; the invention is also general, in that the whole analysis process is applicable to anomaly detection in other fields and can achieve good results there;
(2) the invention applies the isolated forest algorithm to the identification of maliciously registered enterprises in the field of public credit: preliminary features are determined from expert experience, the algorithm identifies abnormal points automatically, and the abnormal points are pushed to the market supervision and tax departments as suspected maliciously registered enterprises;
(3) a feedback mechanism is added to anomaly detection with an unsupervised algorithm: multiple rounds of feature screening and feature addition are carried out using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, so that feature selection better matches reality and the accuracy of the model improves;
(4) the feature set is built from the characteristics of maliciously registered enterprises, so that the feature set used in the first round of training helps the isolated forest algorithm identify anomalies;
(5) the feature set is gradually optimized using the real data fed back by the market supervision and tax departments, so that model identification becomes stable and accurate;
(6) the identified suspected maliciously registered enterprises are pushed to the market supervision and tax departments, reminding them to carry out focused supervision and investigation.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a malicious registered enterprise behavior identification method based on an isolated forest algorithm.
Detailed Description
The malicious registered enterprise behavior identification method and system based on the isolated forest algorithm are described in detail below with reference to the drawings and specific embodiments of the specification.
Example 1:
As shown in FIG. 1, the malicious registered enterprise behavior identification method based on the isolated forest algorithm extracts measurable detection features from market entity registration information, tax registration information, business operation status, daily water and electricity usage information, employee social security contribution information, invoice receipt information, and invoice issuance information, and constructs an enterprise data set from these features; the isolated forest algorithm then randomly partitions the data set with hyperplanes, the points isolated earliest being abnormal points, so that abnormal enterprises are screened out. The specific steps are as follows:
S1, data collection: enterprise information is collected and extracted to a big data platform with a data extraction tool, where it awaits analysis; the enterprise information includes enterprise registration information and business operation status from the market supervision department, tax registration information, invoice receipt information, and invoice issuance information from the tax department, and daily water and electricity usage information and employee social security contribution information from other departments;
S2, feature selection: based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, features highly associated with malicious registration behavior are selected, and new features are derived from the existing data;
wherein the characteristics of maliciously registered enterprises include: no actual business operation; fraudulent use of identity cards; the legal representative, finance officer, and tax handler being the same person; production energy consumption, such as electricity charges, being seriously inconsistent with sales; a short time since establishment; a large number of enterprises registered within a short period; and a large number of invoices received, together with concentrated issuance of invoices at the quota limit, within a short period; specifically as follows:
S3, data preprocessing: the data is preprocessed to form a standard data set; the specific steps are as follows:
(1) cleaning the key feature fields;
(2) manually filling in missing values, and removing records whose missing values cannot be filled in;
(3) manually correcting abnormal values, and removing records whose abnormal values cannot be corrected;
(4) feature calculation: computing the derived features according to the feature calculation formulas;
(5) feature normalization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming the standard data set.
S4, model training: the standard data set is used to train the isolated forest algorithm and obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
S5, result pushing: the list of suspected maliciously registered enterprises obtained in step S4 is pushed to the market supervision department and the tax department through a public credit information platform; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
S6, feedback result analysis: using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, the characteristics of maliciously registered enterprises are analyzed, new highly correlated features are added, and the method returns to step S4 to continue model training; the specific steps are as follows:
(1) using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, calculating the correlation between each feature and malicious registration behavior, eliminating features with low or no correlation, and increasing the proportion of features with high correlation;
(2) analyzing other characteristics of the maliciously registered enterprises and adding new highly correlated features;
(3) performing model training again, so that the identification result becomes more accurate.
S7, model release: after multiple rounds of training and feedback, the model becomes stable and its identification accuracy is high, and it can be released for identifying malicious registration by new enterprises.
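Steps S4 to S7 form a loop that ends when the model "becomes stable" with "high" accuracy, but no numeric criterion is stated. The sketch below shows one hypothetical way to make that decision: measure, for each round, the share of pushed enterprises that the departments confirmed, and release once the last few rounds are both high and close to each other. The window, minimum precision, and tolerance values are assumptions.

```python
def round_precision(pushed_ids, confirmed_ids):
    """Share of pushed (suspected) enterprises that were confirmed as malicious."""
    pushed = set(pushed_ids)
    if not pushed:
        return 0.0
    return len(pushed & set(confirmed_ids)) / len(pushed)


def ready_for_release(precisions, window=3, min_precision=0.8, tolerance=0.05):
    """Release when the last `window` rounds are all accurate and stable.

    All three parameters are illustrative assumptions, not values from the text.
    """
    if len(precisions) < window:
        return False
    recent = precisions[-window:]
    accurate = min(recent) >= min_precision
    stable = max(recent) - min(recent) <= tolerance
    return accurate and stable


# Hypothetical history of five training and feedback rounds.
history = [0.40, 0.60, 0.82, 0.85, 0.84]
print(ready_for_release(history[:3]), ready_for_release(history))
```

With this history, the first three rounds do not yet qualify because the early rounds are inaccurate, while the last three rounds satisfy both conditions.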
Example 2:
The malicious registered enterprise behavior identification system based on the isolated forest algorithm of the invention comprises:
a data acquisition module, used for collecting enterprise information and extracting it to a big data platform with a data extraction tool, where it awaits analysis; the enterprise information includes enterprise registration information and business operation status from the market supervision department, tax registration information, invoice receipt information, and invoice issuance information from the tax department, and daily water and electricity usage information and employee social security contribution information from other departments;
a feature selection module, used for selecting features highly associated with malicious registration behavior based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, and for deriving new features from the existing data;
a data preprocessing module, used for preprocessing the data to form a standard data set; the working process of the data preprocessing module is as follows:
(1) cleaning the key feature fields;
(2) manually filling in missing values, and removing records whose missing values cannot be filled in;
(3) manually correcting abnormal values, and removing records whose abnormal values cannot be corrected;
(4) feature calculation: computing the derived features according to the feature calculation formulas;
(5) feature normalization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming the standard data set.
a model training module, used for training the isolated forest algorithm on the standard data set to obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
a result pushing module, used for pushing the resulting list of suspected maliciously registered enterprises to the market supervision department and the tax department through a public credit information platform; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
a feedback result analysis module, used for analyzing the characteristics of maliciously registered enterprises with the confirmed results fed back by the market supervision and tax departments, adding new highly correlated features, and continuing model training; the working process of the feedback result analysis module is as follows:
(1) using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, calculating the correlation between each feature and malicious registration behavior, eliminating features with low or no correlation, and increasing the proportion of features with high correlation;
(2) analyzing other characteristics of the maliciously registered enterprises and adding new highly correlated features;
(3) performing model training again, so that the identification result becomes more accurate.
and a model release module, used for releasing the model once, after multiple rounds of training and feedback, it has become stable and its identification accuracy is high; the released model can be used for identifying malicious registration by new enterprises.
Example 3:
an embodiment of the present invention further provides an electronic device, including: a memory and at least one processor;
wherein the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory to cause the at least one processor to perform the method for identifying malicious registered enterprise behaviors based on isolated forest algorithms in any of the embodiments of the present invention.
Example 4:
the embodiment of the invention also provides a computer-readable storage medium, wherein a plurality of instructions are stored, and the instructions are loaded by the processor, so that the processor executes the malicious registered enterprise behavior identification method based on the isolated forest algorithm in any embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium may be written to a memory provided on an expansion board inserted into the computer or in an expansion unit connected to the computer, after which a CPU or the like mounted on the expansion board or the expansion unit performs part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A malicious registered enterprise behavior identification method based on an isolated forest algorithm, characterized in that measurable detection features are extracted from market subject registration information, tax registration information, business conditions, daily water and electricity consumption information, employee social security payment information, invoice receiving information and invoice issuing information; an enterprise data set is established from the measurable detection features; and the isolated forest algorithm randomly partitions the data set with hyperplanes, the points that are isolated earliest being abnormal points, so that abnormal enterprises are screened out.
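The idea that the earliest isolated points are the abnormal ones can be illustrated with a minimal sketch of an isolation forest (the "isolated forest" of this text). It is written under stated assumptions and is not the implementation of the invention: each split is axis-parallel (one randomly chosen feature and one random threshold, a special case of a hyperplane), and the score normalization follows the commonly published isolation forest formulation. The function names and the defaults `n_trees=100` and `sample_size=256` are illustrative.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Split rows recursively on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop if the chosen feature cannot split the rows
        return {"size": len(rows)}
    threshold = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(node, row, depth=0):
    """Number of splits needed to reach a leaf, plus a correction for unsplit leaves."""
    if "size" in node:
        return depth + c_factor(node["size"])
    side = "left" if row[node["feature"]] < node["threshold"] else "right"
    return path_length(node[side], row, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Return one score in (0, 1) per row; higher means isolated earlier (more abnormal)."""
    if len(rows) < 2:
        raise ValueError("at least two rows are required")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    scores = []
    for row in rows:
        mean_depth = sum(path_length(tree, row) for tree in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores
```

Calling `anomaly_scores(rows)` on a list of equal-length numeric feature vectors, one per enterprise, yields the scores; the enterprises with the highest scores correspond to the points that are isolated after the fewest splits.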
2. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in claim 1, characterized by comprising the following steps:
data collection: enterprise information is collected and extracted to a big data platform with a data extraction tool, where it awaits analysis;
feature selection: based on expert experience and a summary of the characteristics of previous malicious registered enterprises, features highly associated with malicious registration behavior are selected, and new features are derived from the existing data;
data preprocessing: the data is preprocessed to form a standard data set;
model training: the standard data set is trained with the isolated forest algorithm to obtain a model result, the set of abnormal points in the model result being the suspected malicious registered enterprises; the trained model is used directly for suspected malicious registration recognition of new enterprises;
result pushing: the resulting list of suspected malicious registered enterprises is pushed to the market supervision department and the tax department through a public credit information platform; the two departments place the suspected malicious registered enterprises under key supervision in their daily supervision work and feed back the confirmed malicious registered enterprises to the big data platform;
feedback result analysis: the characteristics of the malicious registered enterprises are analyzed using the confirmed malicious registered enterprises fed back by the market supervision department and the tax department, new highly correlated features are added, and model training continues;
model release: after multiple rounds of training and feedback, the model becomes stable and its recognition accuracy is high, and the model can be released for malicious registration recognition of new enterprises.
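The release condition in the last step is stated only qualitatively. One possible way to make it checkable, sketched here under explicit assumptions, is to compute for each round the share of pushed enterprises that the departments later confirm, and to release once that share has been both high and steady over the most recent rounds. The function names and the parameters `window`, `min_precision` and `max_spread` are hypothetical; the text specifies no metric or threshold.

```python
def round_precision(pushed_ids, confirmed_ids):
    """Share of pushed (suspected) enterprises confirmed as malicious in one feedback round."""
    pushed = set(pushed_ids)
    if not pushed:
        return 0.0
    return len(pushed & set(confirmed_ids)) / len(pushed)


def ready_for_release(precisions, window=3, min_precision=0.8, max_spread=0.05):
    """Assumed release rule: the last `window` rounds are all accurate and close together."""
    if len(precisions) < window:
        return False
    recent = precisions[-window:]
    return min(recent) >= min_precision and max(recent) - min(recent) <= max_spread
```

In use, `round_precision` would be appended to a history list after each feedback round, and `ready_for_release(history)` would gate the model release step.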
3. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in claim 1, wherein the enterprise information comprises enterprise registration information and business conditions from the market supervision department; tax registration information, invoice receiving information and invoice issuing information from the tax department; and daily water and electricity consumption information and employee social security payment information from other departments.
4. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in claim 1, wherein the data preprocessing is specifically as follows:
(1) cleaning the key feature fields;
(2) manually supplementing missing values, and removing those that cannot be supplemented;
(3) manually correcting abnormal values, and removing those that cannot be corrected;
(4) feature calculation: calculating the features to be derived according to the feature calculation formulas;
(5) feature standardization: standardizing the features to form a standard data set.
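The automatable part of the preprocessing above can be sketched as follows. The sketch assumes that records are dictionaries, that a missing value is represented as `None`, and that standardization means z-scoring; manual supplementation and correction happen outside the code. The field names (`monthly_kwh`, `insured_employees`) and the derived feature `kwh_per_employee` are hypothetical examples, since the text does not give the feature calculation formulas.

```python
import statistics


def drop_incomplete(records, fields):
    """Remove records in which any key field is still missing after manual supplementation."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


def add_derived_feature(records):
    """Hypothetical derived feature: monthly electricity use per insured employee."""
    derived = []
    for record in records:
        record = dict(record)  # copy so the input records stay unchanged
        record["kwh_per_employee"] = record["monthly_kwh"] / max(record["insured_employees"], 1)
        derived.append(record)
    return derived


def standardize(records, fields):
    """Z-score every field so that features measured on different scales are comparable."""
    stats = {}
    for field in fields:
        values = [r[field] for r in records]
        std = statistics.pstdev(values)
        stats[field] = (statistics.fmean(values), std if std > 0 else 1.0)
    return [[(r[f] - stats[f][0]) / stats[f][1] for f in fields] for r in records]
```

Chaining `drop_incomplete`, `add_derived_feature` and `standardize` yields a list of numeric rows that can serve as the standard data set for model training.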
5. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in any one of claims 1-4, wherein the feedback result analysis is specifically as follows:
(1) calculating the correlation between each feature and malicious registration behavior, using the confirmed malicious registered enterprises fed back by the market supervision department and the tax department, eliminating features with low or no correlation, and increasing the proportion of highly correlated features;
(2) analyzing other characteristics of the malicious registered enterprises and adding new highly correlated features;
(3) retraining the model so that the recognition result becomes more accurate.
6. A malicious registered enterprise behavior recognition system based on an isolated forest algorithm, characterized by comprising:
the data acquisition module, used for collecting enterprise information and extracting it to a big data platform with a data extraction tool, where it awaits analysis; the enterprise information comprises enterprise registration information and business conditions from the market supervision department, tax registration information, invoice receiving information and invoice issuing information from the tax department, and daily water and electricity consumption information and employee social security payment information from other departments;
the feature selection module, used for selecting features highly associated with malicious registration behavior based on expert experience and a summary of the characteristics of previous malicious registered enterprises, and for deriving new features from the existing data;
the data preprocessing module, used for preprocessing the data to form a standard data set;
the model training module, used for training the standard data set with the isolated forest algorithm to obtain a model result, the set of abnormal points in the model result being the suspected malicious registered enterprises; the trained model is used directly for suspected malicious registration recognition of new enterprises;
the result pushing module, used for pushing the resulting list of suspected malicious registered enterprises to the market supervision department and the tax department through a public credit information platform, the two departments placing the suspected malicious registered enterprises under key supervision in their daily supervision work and feeding back the confirmed malicious registered enterprises to the big data platform;
the feedback result analysis module, used for analyzing the characteristics of the malicious registered enterprises using the confirmed malicious registered enterprises fed back by the market supervision department and the tax department, adding new highly correlated features, and continuing model training;
and the model release module, used for releasing the model once, after multiple rounds of training and feedback, the model has become stable and its recognition accuracy is high, the released model being usable for malicious registration recognition of new enterprises.
7. The malicious registered enterprise behavior recognition system based on the isolated forest algorithm as claimed in claim 6, wherein the working process of the data preprocessing module is specifically as follows:
(1) cleaning the key feature fields;
(2) manually supplementing missing values, and removing those that cannot be supplemented;
(3) manually correcting abnormal values, and removing those that cannot be corrected;
(4) feature calculation: calculating the features to be derived according to the feature calculation formulas;
(5) feature standardization: because the features are measured on different scales, the features are standardized to achieve a better partitioning effect, forming a standard data set.
8. The malicious registered enterprise behavior recognition system based on the isolated forest algorithm as claimed in claim 6 or 7, wherein the working process of the feedback result analysis module is specifically as follows:
(1) calculating the correlation between each feature and malicious registration behavior, using the confirmed malicious registered enterprises fed back by the market supervision department and the tax department, eliminating features with low or no correlation, and increasing the proportion of highly correlated features;
(2) analyzing other characteristics of the malicious registered enterprises and adding new highly correlated features;
(3) retraining the model so that the recognition result becomes more accurate.
9. An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executes the computer program stored in the memory, causing the at least one processor to perform the malicious registered enterprise behavior identification method based on the isolated forest algorithm of any one of claims 1 to 5.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program is executable by a processor to implement the malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011237306.0A CN112270553A (en) | 2020-11-09 | 2020-11-09 | Malicious registered enterprise behavior identification method and system based on isolated forest algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011237306.0A CN112270553A (en) | 2020-11-09 | 2020-11-09 | Malicious registered enterprise behavior identification method and system based on isolated forest algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112270553A true CN112270553A (en) | 2021-01-26 |
Family
ID=74339698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011237306.0A Pending CN112270553A (en) | 2020-11-09 | 2020-11-09 | Malicious registered enterprise behavior identification method and system based on isolated forest algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112270553A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113240010A (en) * | 2021-05-14 | 2021-08-10 | 烟台海颐软件股份有限公司 | Abnormity detection method and system supporting non-independent distribution of mixed data |
CN113327037A (en) * | 2021-05-31 | 2021-08-31 | 平安国际智慧城市科技股份有限公司 | Model-based risk identification method and device, computer equipment and storage medium |
CN114495137A (en) * | 2022-04-15 | 2022-05-13 | 深圳高灯计算机科技有限公司 | Bill abnormity detection model generation method and bill abnormity detection method |
CN116681358A (en) * | 2023-08-04 | 2023-09-01 | 深圳中科闻歌科技有限公司 | XGBoost model-based new registration abnormal enterprise detection method |
CN116720787A (en) * | 2023-08-04 | 2023-09-08 | 深圳中科闻歌科技有限公司 | XGBoost model-based new change abnormal enterprise detection method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108510006A (en) * | 2018-04-08 | 2018-09-07 | 重庆邮电大学 | A kind of analysis of business electrical amount and prediction technique based on data mining |
CN111192140A (en) * | 2020-01-02 | 2020-05-22 | 北京明略软件系统有限公司 | Method and device for predicting customer default probability |
WO2020111571A1 (en) * | 2018-11-26 | 2020-06-04 | (주) 위세아이텍 | Artificial intelligence-based severe delinquency detection device and method |
CN111833172A (en) * | 2020-05-25 | 2020-10-27 | 百维金科(上海)信息科技有限公司 | Consumption credit fraud detection method and system based on isolated forest |
2020-11-09: CN application CN202011237306.0A (publication CN112270553A), status: Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108510006A (en) * | 2018-04-08 | 2018-09-07 | 重庆邮电大学 | A kind of analysis of business electrical amount and prediction technique based on data mining |
WO2020111571A1 (en) * | 2018-11-26 | 2020-06-04 | (주) 위세아이텍 | Artificial intelligence-based severe delinquency detection device and method |
CN111192140A (en) * | 2020-01-02 | 2020-05-22 | 北京明略软件系统有限公司 | Method and device for predicting customer default probability |
CN111833172A (en) * | 2020-05-25 | 2020-10-27 | 百维金科(上海)信息科技有限公司 | Consumption credit fraud detection method and system based on isolated forest |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113240010A (en) * | 2021-05-14 | 2021-08-10 | 烟台海颐软件股份有限公司 | Abnormity detection method and system supporting non-independent distribution of mixed data |
CN113240010B (en) * | 2021-05-14 | 2023-10-24 | 烟台海颐软件股份有限公司 | Anomaly detection method and system supporting non-independent distribution mixed data |
CN113327037A (en) * | 2021-05-31 | 2021-08-31 | 平安国际智慧城市科技股份有限公司 | Model-based risk identification method and device, computer equipment and storage medium |
CN114495137A (en) * | 2022-04-15 | 2022-05-13 | 深圳高灯计算机科技有限公司 | Bill abnormity detection model generation method and bill abnormity detection method |
CN116681358A (en) * | 2023-08-04 | 2023-09-01 | 深圳中科闻歌科技有限公司 | XGBoost model-based new registration abnormal enterprise detection method |
CN116720787A (en) * | 2023-08-04 | 2023-09-08 | 深圳中科闻歌科技有限公司 | XGBoost model-based new change abnormal enterprise detection method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112270553A (en) | Malicious registered enterprise behavior identification method and system based on isolated forest algorithm | |
EP3686756A1 (en) | Method and apparatus for grouping data records | |
CN108629413A (en) | Neural network model training, trading activity Risk Identification Method and device | |
CN113204603B (en) | Category labeling method and device for financial data assets | |
CN108268886B (en) | Method and system for identifying plug-in operation | |
US20150221045A1 (en) | System and method of normalizing vendor data | |
CN105096195A (en) | Account money amount processing method and system based on internet application platform | |
CN114186760A (en) | Analysis method and system for stable operation of enterprise and readable storage medium | |
CN113989859B (en) | Fingerprint similarity identification method and device for anti-flashing equipment | |
CN116485519A (en) | Data processing method, device, equipment and storage medium | |
CN109636378B (en) | Account identification method and device and electronic equipment | |
CN112990989B (en) | Value prediction model input data generation method, device, equipment and medium | |
CN112966728B (en) | Transaction monitoring method and device | |
CN113505117A (en) | Data quality evaluation method, device, equipment and medium based on data indexes | |
CN112365352A (en) | Anti-cash-out method and device based on graph neural network | |
CN116361488A (en) | Method and device for mining risk object based on knowledge graph | |
CN114817518B (en) | License handling method, system and medium based on big data archive identification | |
CN114495137B (en) | Bill abnormity detection model generation method and bill abnormity detection method | |
CN115392351A (en) | Risk user identification method and device, electronic equipment and storage medium | |
CN111460052B (en) | Low-security fund supervision method and system based on supervised data correlation analysis | |
CN112949752B (en) | Training method and device of business prediction system | |
CN115392206B (en) | Method, device and equipment for quickly querying data based on WPS/EXCEL and storage medium | |
CN113362151B (en) | Data processing method and device for financial business, electronic equipment and storage medium | |
CN116739752A (en) | Message reminding method and device, electronic equipment and storage medium | |
CN117893321A (en) | Account anomaly detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210126 |