CN112270553A - Malicious registered enterprise behavior identification method and system based on isolated forest algorithm

Info

Publication number
CN112270553A
Authority
CN
China
Prior art keywords
malicious
enterprise
registered
information
forest algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011237306.0A
Other languages
Chinese (zh)
Inventor
曲金涛
彭光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN202011237306.0A
Publication of CN112270553A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • Biomedical Technology (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Biophysics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention discloses a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm, belonging to the field of public credit. It addresses the technical problem that malicious registration behavior in the field of public credit is difficult to identify. The technical scheme is as follows: measurable detection features are extracted from market entity registration information, tax registration information, business operation status, daily water and electricity consumption information, employee social security payment information, invoice receipt information and invoice issuance information; an enterprise data set is constructed from these features; the isolated forest algorithm then randomly partitions the data set with hyperplanes, and the points isolated earliest are treated as abnormal points, so that abnormal enterprises are screened out. The specific steps are: data collection; feature selection; data preprocessing; model training; result pushing; feedback result analysis; and model release.

Description

Malicious registered enterprise behavior identification method and system based on isolated forest algorithm
Technical Field
The invention relates to the field of public credit, in particular to a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm.
Background
Isolated forest (isolation forest) algorithm: an unsupervised anomaly detection method suitable for continuous data. Unlike other anomaly detection algorithms, which describe how far apart samples are through measures such as distance and density, the isolated forest algorithm detects abnormal values by isolating sample points. Specifically, the algorithm isolates samples using a binary search tree structure called an isolation tree. Because outliers are few and lie apart from the majority of samples, they are isolated earlier; that is, outliers end up closer to the root node, while normal values end up farther from it. In addition, compared with traditional algorithms such as LOF and K-means, the isolated forest algorithm is more robust on high-dimensional data. The algorithm suits scenarios in which abnormal samples are few and normal samples are many. Maliciously registered enterprises make up only a small share of all registered enterprises, so the isolated forest algorithm fits this scenario.
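The isolation idea described above can be illustrated with a minimal sketch. It is not the patented implementation: it only traces how many random axis-parallel cuts are needed to separate one point from the rest, and the synthetic data, the function names and the trial count are all assumptions made for illustration.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random axis-parallel splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                               # cannot split on a constant feature
        return depth
    split = rng.uniform(lo, hi)                # pick a random cut between min and max
    # Keep only the samples that fall on the same side of the cut as `point`.
    if point[dim] < split:
        same_side = [row for row in data if row[dim] < split]
    else:
        same_side = [row for row in data if row[dim] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_isolation_depth(point, data, n_trials=200, seed=0):
    """Average the isolation depth over many independent random partitions."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trials)) / n_trials

# Synthetic data (hypothetical): 200 "normal" points near the origin plus one distant point.
gen = random.Random(42)
normal_points = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = normal_points + [outlier]
typical = min(normal_points, key=lambda p: p[0] ** 2 + p[1] ** 2)

outlier_depth = mean_isolation_depth(outlier, data)
typical_depth = mean_isolation_depth(typical, data)
print(f"outlier: {outlier_depth:.1f} splits, typical point: {typical_depth:.1f} splits")
```

The distant point needs far fewer cuts on average than a point in the dense region, which corresponds to the statement that outliers sit closer to the root node. A full isolation forest additionally builds each tree on a random subsample and converts path lengths into a score.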
In conventional applications of the isolated forest algorithm, feature selection relies almost entirely on expert experience. Without reliable data to support it, this approach easily leads to low model identification accuracy, and because the algorithm is unsupervised, verifying its results is also difficult.
Disclosure of Invention
The invention provides a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm, and aims to solve the problem that malicious registered behaviors in the field of public credit are difficult to identify.
The technical task of the invention is achieved as follows. The malicious registered enterprise behavior identification method based on the isolated forest algorithm extracts measurable detection features from market entity registration information, tax registration information, business operation status, daily water and electricity consumption information, employee social security payment information, invoice receipt information and invoice issuance information, and constructs an enterprise data set from these features. The isolated forest algorithm then randomly partitions the data set with hyperplanes; the points isolated earliest are abnormal points, and abnormal enterprises are screened out accordingly.
Preferably, the method is specifically as follows:
(1) data collection: enterprise information is collected and extracted to a big data platform with a data extraction tool, where it awaits analysis;
(2) feature selection: based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, features highly associated with malicious registration behavior are selected, and new features are derived from the existing data;
(3) data preprocessing: the data are preprocessed to form a standard data set;
(4) model training: the isolated forest algorithm is trained on the standard data set to obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
(5) result pushing: the resulting list of suspected maliciously registered enterprises is pushed through the public credit information platform to the market supervision department and the tax department; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
(6) feedback result analysis: the features of maliciously registered enterprises are analyzed using the confirmed results fed back by the market supervision and tax departments, new highly relevant features are added, and model training continues;
(7) model release: after multiple rounds of training and feedback, the model becomes stable and its identification accuracy is high, and it can be released for identifying malicious registration by new enterprises.
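The model training step above refers to an abnormal point set without saying how it is derived from the trees. In the standard isolation forest formulation from the published literature (an assumption here, not something stated in the patent text), each sample receives an anomaly score computed from its average path length:

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln i + 0.5772156649
```

Here $h(x)$ is the path length of sample $x$ in one isolation tree, $E[h(x)]$ is its average over all trees, $n$ is the subsample size used to build each tree, and $c(n)$ normalizes by the average path length of an unsuccessful binary search tree lookup. Scores close to 1 indicate likely anomalies, while scores well below 0.5 indicate normal samples; the cutoff that defines the abnormal point set would have to be chosen by the implementer.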
Preferably, the enterprise information includes enterprise registration information and business operation status from the market supervision department; tax registration information, invoice receipt information and invoice issuance information from the tax department; and daily water and electricity consumption information and employee social security payment information from other departments.
Preferably, the data preprocessing is specifically as follows:
(1) key feature fields are cleaned;
(2) missing values are filled in manually, and records whose missing values cannot be filled in are removed;
(3) abnormal values are corrected manually, and records whose abnormal values cannot be corrected are removed;
(4) feature calculation: the features to be derived are calculated according to the feature calculation formulas;
(5) feature standardization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming a standard data set.
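The removal and standardization steps above can be sketched as follows. This is a minimal illustration, assuming z-score standardization (the text does not name a specific method) and hypothetical field names; manual filling and correction are outside the scope of the sketch.

```python
from statistics import mean, pstdev

def preprocess(records, feature_names):
    """Drop records with unfilled values, then z-score each feature column."""
    # Remove records whose missing values could not be filled in.
    complete = [r for r in records if all(r.get(f) is not None for f in feature_names)]
    if not complete:
        return []
    # Per-feature mean and population standard deviation.
    stats = {}
    for f in feature_names:
        values = [r[f] for r in complete]
        stats[f] = (mean(values), pstdev(values))
    # Standardize so that features on different scales become comparable.
    standardized = []
    for r in complete:
        row = []
        for f in feature_names:
            mu, sigma = stats[f]
            row.append((r[f] - mu) / sigma if sigma > 0 else 0.0)
        standardized.append(row)
    return standardized

# Hypothetical records; the field names are illustrative only.
records = [
    {"days_since_registration": 10, "invoices_received": 500},
    {"days_since_registration": 400, "invoices_received": 20},
    {"days_since_registration": 700, "invoices_received": None},  # removed
    {"days_since_registration": 900, "invoices_received": 35},
]
features = ["days_since_registration", "invoices_received"]
standard_data_set = preprocess(records, features)
```

Each row of `standard_data_set` is a list of standardized feature values in the order given by `features`, which is the shape an isolation forest implementation would typically consume.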
Preferably, the feedback result analysis is specifically as follows:
(1) using the confirmed malicious registration results fed back by the market supervision and tax departments, the correlation between each feature and malicious registration behavior is calculated; features with low or no correlation are eliminated, and the weight of highly correlated features is increased;
(2) other characteristics of the maliciously registered enterprises are analyzed, and new highly relevant features are added;
(3) model training is carried out again so that the identification result becomes more accurate.
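The correlation-based screening described above could look like the following sketch. The text does not specify a correlation measure or a cutoff, so the Pearson coefficient, the 0.3 threshold and the feature names are assumptions; the sketch covers only the elimination of weakly correlated features, not the reweighting.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_features(rows, labels, feature_names, min_abs_corr=0.3):
    """Keep features whose correlation with the confirmed labels is strong enough."""
    kept = {}
    for j, name in enumerate(feature_names):
        r = pearson([row[j] for row in rows], labels)
        if abs(r) >= min_abs_corr:      # hypothetical cutoff
            kept[name] = r
    return kept

# Hypothetical feedback: label 1 = confirmed malicious registration, 0 = not confirmed.
rows = [[0.9, 1.0], [0.8, 2.0], [0.1, 1.0], [0.2, 2.0]]
labels = [1, 1, 0, 0]
kept = select_features(rows, labels, ["invoice_burst", "office_floor"])
```

In this toy data the first feature tracks the labels closely and is kept, while the second is uncorrelated with them and is dropped before the next training round.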
A malicious registered enterprise behavior identification system based on an isolated forest algorithm comprises:
a data collection module, used for collecting enterprise information and extracting it to a big data platform with a data extraction tool, where it awaits analysis; the enterprise information includes enterprise registration information and business operation status from the market supervision department, tax registration information, invoice receipt information and invoice issuance information from the tax department, and daily water and electricity consumption information and employee social security payment information from other departments;
a feature selection module, used for selecting features highly associated with malicious registration behavior based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, and for deriving new features from the existing data;
a data preprocessing module, used for preprocessing the data to form a standard data set;
a model training module, used for training the isolated forest algorithm on the standard data set to obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
a result pushing module, used for pushing the resulting list of suspected maliciously registered enterprises through the public credit information platform to the market supervision department and the tax department; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
a feedback result analysis module, used for analyzing the features of maliciously registered enterprises using the confirmed results fed back by the market supervision and tax departments, adding new highly relevant features, and continuing model training;
and a model release module, used for releasing the model once, after multiple rounds of training and feedback, it has become stable and its identification accuracy is high; the released model is used for identifying malicious registration by new enterprises.
Preferably, the data preprocessing module works as follows:
(1) key feature fields are cleaned;
(2) missing values are filled in manually, and records whose missing values cannot be filled in are removed;
(3) abnormal values are corrected manually, and records whose abnormal values cannot be corrected are removed;
(4) feature calculation: the features to be derived are calculated according to the feature calculation formulas;
(5) feature standardization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming a standard data set.
Preferably, the feedback result analysis module works as follows:
(1) using the confirmed malicious registration results fed back by the market supervision and tax departments, the correlation between each feature and malicious registration behavior is calculated; features with low or no correlation are eliminated, and the weight of highly correlated features is increased;
(2) other characteristics of the maliciously registered enterprises are analyzed, and new highly relevant features are added;
(3) model training is carried out again so that the identification result becomes more accurate.
An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executes the computer program stored in the memory, so that the at least one processor performs the malicious registered enterprise behavior identification method based on the isolated forest algorithm as described above.
A computer-readable storage medium having stored thereon a computer program executable by a processor to implement a method for malicious registered enterprise behavior identification based on isolated forest algorithms as described above.
The malicious registered enterprise behavior identification method and system based on the isolated forest algorithm have the following advantages:
(1) the invention provides a model and technique that uses the isolated forest algorithm to identify malicious enterprise registration behavior, relies on the public credit information platform to distribute results and collect feedback on the detection results, and uses that feedback to adjust feature selection; because the detection features are highly related to malicious registration behavior, abnormal enterprises are pushed to the market supervision department and the tax department as suspected maliciously registered enterprises, and the detection results fed back by the two departments are collected as the basis for subsequent feature screening, which improves model identification accuracy; the invention can effectively uncover potential maliciously registered enterprises and assist the market supervision and tax departments in their daily supervision work; it is also general, in that the whole analysis process is applicable to anomaly detection in other fields and can achieve good results there;
(2) the invention applies the isolated forest algorithm to the identification of maliciously registered enterprises in the field of public credit: preliminary features are determined from expert experience, the algorithm automatically identifies abnormal points, and these points are pushed to the market supervision and tax departments as suspected maliciously registered enterprises;
(3) a feedback mechanism is added to anomaly detection with an unsupervised algorithm: multiple rounds of feature screening and feature addition are carried out using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, so that feature selection better reflects actual conditions and model accuracy improves;
(4) the feature set is built from the characteristics of maliciously registered enterprises, so the feature set used in the first round of training already helps the isolated forest algorithm identify anomalies;
(5) the feature set is optimized step by step using the real data fed back by the market supervision and tax departments, so that model identification becomes stable and accurate;
(6) the identified suspected maliciously registered enterprises are pushed to the market supervision and tax departments, prompting them to carry out focused supervision and investigation.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a malicious registered enterprise behavior identification method based on an isolated forest algorithm.
Detailed Description
The malicious registered enterprise behavior identification method and system based on the isolated forest algorithm are described in detail below with reference to the drawings and specific embodiments of the specification.
Example 1:
As shown in FIG. 1, the malicious registered enterprise behavior identification method based on the isolated forest algorithm extracts measurable detection features from market entity registration information, tax registration information, business operation status, daily water and electricity consumption information, employee social security payment information, invoice receipt information and invoice issuance information, and constructs an enterprise data set from these features; the isolated forest algorithm then randomly partitions the data set with hyperplanes, and the points isolated earliest are abnormal points, so that abnormal enterprises are screened out. The specific steps are as follows:
S1, data collection: enterprise information is collected and extracted to a big data platform with a data extraction tool, where it awaits analysis; the enterprise information includes enterprise registration information and business operation status from the market supervision department, tax registration information, invoice receipt information and invoice issuance information from the tax department, and daily water and electricity consumption information and employee social security payment information from other departments.
S2, feature selection: based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, features highly associated with malicious registration behavior are selected, and new features are derived from the existing data;
The characteristics of maliciously registered enterprises include: no actual business operation; fraudulent use of identity cards; the legal representative, the financial officer and the tax handler being the same person; production energy consumption (for example, electricity charges) seriously inconsistent with sales; a short time since establishment; a large number of enterprises registered within a short period; and a large number of invoices received within a short period together with concentrated issuing of invoices at the quota limit. The details are as follows:
[Figure BDA0002767128530000051: table of malicious registration features; image not reproduced]
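The feature table itself is an image that is not available in this text, so the exact features and formulas are unknown. As a hedged illustration of how a few of the characteristics listed above might be turned into measurable features, the following sketch uses entirely hypothetical field names and formulas.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EnterpriseRecord:
    # All field names are hypothetical; the original feature table is not available.
    legal_representative_id: str
    finance_contact_id: str
    tax_contact_id: str
    registration_date: date
    monthly_electricity_kwh: float
    monthly_sales_amount: float
    invoices_received_30d: int

def derive_features(rec, as_of):
    """Turn raw registration, utility and invoice data into numeric features."""
    # 1 if the legal representative, finance contact and tax contact are one person.
    same_person = int(
        rec.legal_representative_id == rec.finance_contact_id == rec.tax_contact_id
    )
    # Time since establishment, in days.
    days_established = (as_of - rec.registration_date).days
    # Sales per kWh: very high values suggest sales without matching production.
    sales_per_kwh = rec.monthly_sales_amount / max(rec.monthly_electricity_kwh, 1.0)
    return {
        "same_person": same_person,
        "days_established": days_established,
        "sales_per_kwh": sales_per_kwh,
        "invoices_received_30d": rec.invoices_received_30d,
    }

rec = EnterpriseRecord("ID-001", "ID-001", "ID-001", date(2020, 10, 1), 5.0, 900000.0, 300)
feature_row = derive_features(rec, as_of=date(2020, 11, 1))
```

Features of this kind would then go through the preprocessing step before being passed to the isolated forest algorithm.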
S3, data preprocessing: the data are preprocessed to form a standard data set; specifically:
(1) key feature fields are cleaned;
(2) missing values are filled in manually, and records whose missing values cannot be filled in are removed;
(3) abnormal values are corrected manually, and records whose abnormal values cannot be corrected are removed;
(4) feature calculation: the features to be derived are calculated according to the feature calculation formulas;
(5) feature standardization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming a standard data set.
S4, model training: the isolated forest algorithm is trained on the standard data set to obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
S5, result pushing: the list of suspected maliciously registered enterprises obtained in step S4 is pushed through the public credit information platform to the market supervision department and the tax department; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
S6, feedback result analysis: the features of maliciously registered enterprises are analyzed using the confirmed results fed back by the market supervision and tax departments, new highly relevant features are added, and the process returns to step S4 to continue model training; specifically:
(1) using the confirmed malicious registration results fed back by the market supervision and tax departments, the correlation between each feature and malicious registration behavior is calculated; features with low or no correlation are eliminated, and the weight of highly correlated features is increased;
(2) other characteristics of the maliciously registered enterprises are analyzed, and new highly relevant features are added;
(3) model training is carried out again so that the identification result becomes more accurate.
S7, model release: after multiple rounds of training and feedback, the model becomes stable and its identification accuracy is high, and it can be released for identifying malicious registration by new enterprises.
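The text gives no numeric criterion for when the model "becomes stable" in step S7. One possible way to make the release decision concrete, offered purely as an assumption, is to track the precision of each round against the departmental feedback and release once it stops changing and is high enough. The thresholds and function names below are hypothetical.

```python
def precision(flagged, confirmed):
    """Share of flagged enterprises that the departments confirmed as malicious."""
    if not flagged:
        return 0.0
    return len(flagged & confirmed) / len(flagged)

def ready_to_release(precisions, min_precision=0.8, tolerance=0.02, stable_rounds=2):
    """True if precision changed by at most `tolerance` over the last rounds
    and the latest value reaches `min_precision` (all thresholds hypothetical)."""
    if len(precisions) < stable_rounds + 1:
        return False
    recent = precisions[-(stable_rounds + 1):]
    stable = all(abs(b - a) <= tolerance for a, b in zip(recent, recent[1:]))
    return stable and recent[-1] >= min_precision

# One round of S5/S6: enterprises flagged by the model vs. those confirmed by feedback.
round_precision = precision({"A", "B", "C", "D"}, {"A", "C", "X"})

# Precision per training round (illustrative values, not measured results).
history = [0.40, 0.70, 0.85, 0.86, 0.86]
release_now = ready_to_release(history)
```

With the illustrative history above, the last three rounds differ by no more than the tolerance and the latest precision exceeds the minimum, so the sketch would signal release; with only the first three rounds it would not.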
Example 2:
The malicious registered enterprise behavior identification system based on the isolated forest algorithm of the invention comprises:
a data collection module, used for collecting enterprise information and extracting it to a big data platform with a data extraction tool, where it awaits analysis; the enterprise information includes enterprise registration information and business operation status from the market supervision department, tax registration information, invoice receipt information and invoice issuance information from the tax department, and daily water and electricity consumption information and employee social security payment information from other departments;
a feature selection module, used for selecting features highly associated with malicious registration behavior based on expert experience and a summary of the characteristics of previously identified maliciously registered enterprises, and for deriving new features from the existing data;
a data preprocessing module, used for preprocessing the data to form a standard data set; the data preprocessing module works as follows:
(1) key feature fields are cleaned;
(2) missing values are filled in manually, and records whose missing values cannot be filled in are removed;
(3) abnormal values are corrected manually, and records whose abnormal values cannot be corrected are removed;
(4) feature calculation: the features to be derived are calculated according to the feature calculation formulas;
(5) feature standardization: because the features are measured on different scales, they are standardized to achieve a better partitioning effect, forming a standard data set;
a model training module, used for training the isolated forest algorithm on the standard data set to obtain a model result, in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;
a result pushing module, used for pushing the resulting list of suspected maliciously registered enterprises through the public credit information platform to the market supervision department and the tax department; the two departments place these enterprises under focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;
a feedback result analysis module, used for analyzing the features of maliciously registered enterprises using the confirmed results fed back by the market supervision and tax departments, adding new highly relevant features, and continuing model training; the feedback result analysis module works as follows:
(1) using the confirmed malicious registration results fed back by the market supervision and tax departments, the correlation between each feature and malicious registration behavior is calculated; features with low or no correlation are eliminated, and the weight of highly correlated features is increased;
(2) other characteristics of the maliciously registered enterprises are analyzed, and new highly relevant features are added;
(3) model training is carried out again so that the identification result becomes more accurate;
and a model release module, used for releasing the model once, after multiple rounds of training and feedback, it has become stable and its identification accuracy is high; the released model is used for identifying malicious registration by new enterprises.
Example 3:
An embodiment of the present invention further provides an electronic device, including: a memory and at least one processor;
wherein the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory to cause the at least one processor to perform the method for identifying malicious registered enterprise behaviors based on isolated forest algorithms in any of the embodiments of the present invention.
Example 4:
An embodiment of the invention also provides a computer-readable storage medium storing a plurality of instructions, which are loaded by a processor so that the processor performs the malicious registered enterprise behavior identification method based on the isolated forest algorithm in any embodiment of the invention. Specifically, a system or apparatus may be provided with a storage medium that stores software program code implementing the functions of any of the above-described embodiments, and a computer (or a CPU or MPU) of the system or apparatus reads out and executes the program code stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A malicious registered enterprise behavior identification method based on an isolated forest algorithm, characterized in that measurable detection features are extracted from market subject registration information, tax registration information, business conditions, daily water and electricity consumption information, employee social security payment information, invoice receiving information and invoice issuing information; an enterprise data set is established from the measurable detection features; and the isolated forest algorithm randomly partitions the data set with hyperplanes, the points that are isolated earliest being abnormal points, so that abnormal enterprises are screened out.
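The random partitioning recited in claim 1 can be illustrated with a minimal sketch, assuming the standard form of the algorithm (usually called isolation forest), in which each split is an axis-parallel hyperplane at a random threshold on a random feature. The function names, the tree count, the sample size and the score normalization below are illustrative assumptions and are not taken from the claims. A point that is isolated after few splits has a short average path and therefore a score close to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        # Simplification: stop when the chosen feature cannot separate the rows.
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, row):
    """Number of splits needed to reach a leaf, plus a correction for unsplit leaves."""
    depth = 0
    while "feature" in tree:
        tree = tree["left"] if row[tree["feature"]] < tree["split"] else tree["right"]
        depth += 1
    return depth + average_path_length(tree["size"])


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees random trees, each on a random subsample of the enterprise data set."""
    if len(rows) < 2:
        raise ValueError("at least two rows are required")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(forest, row):
    """Score in (0, 1); values near 1 mean the row is isolated early (suspected anomaly)."""
    trees, sample_size = forest
    mean_depth = sum(path_length(t, row) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / average_path_length(sample_size))
```

In practice each row would be the standardized feature vector of one enterprise, and the enterprises with the highest scores would form the suspected list.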
2. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in claim 1, which is characterized by comprising the following steps:
collecting data: enterprise information is collected, and a data extraction tool is used for extracting the enterprise information to a big data platform to wait for analysis;
feature selection: based on expert experience and a summary of the characteristics of previously identified malicious registered enterprises, features highly associated with malicious registration behavior are selected, and new features are derived from the existing data;
data preprocessing: preprocessing the data to form a standard data set;
model training: training a standard data set by using an isolated forest algorithm to obtain a model result, wherein an abnormal point set in the model result is a suspected malicious registered enterprise; the trained model is directly used for suspected malicious registration recognition of a new enterprise;
result pushing: pushing the obtained list of suspected malicious registered enterprises to a market supervision department and a tax department through a public credit information platform, wherein the two departments place the suspected malicious registered enterprises under key supervision in their daily supervision work and feed back the real malicious registered enterprises to the big data platform;
feedback result analysis: analyzing the characteristics of the malicious registered enterprises in combination with the real malicious registered enterprise results fed back by the market supervision and tax departments, adding new high-relevance features, and continuing model training;
model release: once, through multiple rounds of training and feedback, the model has stabilized and its recognition accuracy is high, the model is released for malicious registration recognition of new enterprises.
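The release condition in the last step of claim 2 is stated only qualitatively. As a hedged sketch of how it might be checked, the example below measures, for each round, the share of pushed enterprises that the departments confirmed as malicious, and treats the model as stable when the last few rounds are both accurate and close to one another. The function names and the thresholds (`min_precision`, `tolerance`, `rounds`) are assumptions chosen for illustration; the claims do not give numeric criteria.

```python
def precision(flagged, confirmed):
    """Share of pushed (flagged) enterprises that were confirmed as malicious."""
    flagged = set(flagged)
    if not flagged:
        return 0.0
    return len(flagged & set(confirmed)) / len(flagged)


def ready_for_release(history, min_precision=0.8, tolerance=0.05, rounds=3):
    """Hypothetical criterion: the last `rounds` precisions are high and nearly constant."""
    recent = history[-rounds:]
    if len(recent) < rounds:
        return False
    return min(recent) >= min_precision and max(recent) - min(recent) <= tolerance
```

Each training and feedback round would append one precision value to `history`; training continues until `ready_for_release(history)` returns `True`.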
3. The method for identifying malicious registered enterprise behaviors based on an isolated forest algorithm as claimed in claim 1, wherein the enterprise information comprises enterprise registration information and business conditions from the market supervision department, tax registration information, invoice receiving information and invoice issuing information from the tax department, and daily water and electricity consumption information and employee social security payment information from other departments.
4. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in claim 1, wherein the data preprocessing is specifically as follows:
(1) cleaning key feature fields;
(2) manually supplementing missing values, and removing those that cannot be supplemented;
(3) manually correcting abnormal values, and removing those that cannot be corrected;
(4) feature calculation: computing the features to be derived according to the feature calculation formula;
(5) feature standardization: standardizing the features to form a standard data set.
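The preprocessing steps of claim 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the field names in `RAW_FIELDS`, the derived feature (invoiced amount per insured employee) and the choice of z-score standardization are hypothetical, since the claim does not name a schema, a feature calculation formula or a standardization method. The manual supplementing and correcting steps are not automated here; only the removal of records that remain incomplete is shown.

```python
import statistics

# Hypothetical field names; the claims do not specify a schema.
RAW_FIELDS = ("registered_capital", "employees_insured", "invoiced_amount")


def drop_incomplete(records):
    """Remove records whose missing values could not be supplemented."""
    return [r for r in records if all(r.get(f) is not None for f in RAW_FIELDS)]


def add_derived_feature(record):
    """Illustrative feature calculation: invoiced amount per insured employee."""
    per_employee = record["invoiced_amount"] / max(record["employees_insured"], 1)
    return [float(record[f]) for f in RAW_FIELDS] + [per_employee]


def standardize(rows):
    """Z-score each column so that features on different scales become comparable."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 so it maps to 0.0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def build_standard_dataset(records):
    """Cleaned records -> feature rows -> standardized data set."""
    return standardize([add_derived_feature(r) for r in drop_incomplete(records)])
```

Standardizing before training keeps a feature with a large numeric range, such as registered capital, from dominating the range from which random split thresholds are drawn.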
5. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in any one of claims 1-4, wherein the feedback result analysis is specifically as follows:
(1) calculating the correlation between each feature and malicious registration behavior in combination with the real malicious registered enterprise results fed back by the market supervision and tax departments, eliminating features with low or no correlation, and increasing the weight of features with high correlation;
(2) analyzing other characteristics of the malicious registered enterprises, and adding new high-relevance features;
(3) performing model training again, so that the recognition result becomes more accurate.
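The correlation step of claim 5 can be illustrated with a short sketch. It assumes that the fed-back results are encoded as a 0/1 label per enterprise and that correlation means the Pearson coefficient between a feature column and that label; the claim names neither a correlation measure nor a threshold, so `min_abs_corr` and the weighting rule (weights proportional to the absolute correlation) are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, names, min_abs_corr=0.3):
    """Keep features correlated with the confirmed labels and weight them by |r|."""
    correlations = {
        name: pearson([row[j] for row in rows], labels)
        for j, name in enumerate(names)
    }
    kept = {n: abs(r) for n, r in correlations.items() if abs(r) >= min_abs_corr}
    total = sum(kept.values())
    # Normalize so that the weights of the retained features sum to 1.
    return {n: w / total for n, w in kept.items()} if total else {}
```

The returned mapping lists the retained features with their relative weights; features absent from it would be dropped before the model is trained again.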
6. A malicious registered enterprise behavior recognition system based on an isolated forest algorithm, characterized by comprising:
a data acquisition module, used for collecting enterprise information and extracting it to a big data platform for analysis by means of a data extraction tool; the enterprise information comprises enterprise registration information and business conditions from the market supervision department, tax registration information, invoice receiving information and invoice issuing information from the tax department, and daily water and electricity consumption information and employee social security payment information from other departments;
a feature selection module, used for selecting features highly associated with malicious registration behavior based on expert experience and a summary of the characteristics of previously identified malicious registered enterprises, and for deriving new features from the existing data;
a data preprocessing module, used for preprocessing the data to form a standard data set;
a model training module, used for training on the standard data set with the isolated forest algorithm to obtain a model result, the set of abnormal points in the model result being the suspected malicious registered enterprises; the trained model is directly used for suspected malicious registration recognition of new enterprises;
a result pushing module, used for pushing the obtained list of suspected malicious registered enterprises to a market supervision department and a tax department through a public credit information platform, wherein the two departments place the suspected malicious registered enterprises under key supervision in their daily supervision work and feed back the real malicious registered enterprises to the big data platform;
a feedback result analysis module, used for analyzing the characteristics of the malicious registered enterprises in combination with the real malicious registered enterprise results fed back by the market supervision and tax departments, adding new high-relevance features, and continuing model training;
a model release module, used for releasing the model for malicious registration recognition of new enterprises once, through multiple rounds of training and feedback, the model has stabilized and its recognition accuracy is high.
7. The system for identifying malicious registered enterprise behaviors based on an isolated forest algorithm according to claim 6, wherein the data preprocessing module specifically comprises the following working processes:
(1) cleaning key feature fields;
(2) manually supplementing missing values, and removing those that cannot be supplemented;
(3) manually correcting abnormal values, and removing those that cannot be corrected;
(4) feature calculation: computing the features to be derived according to the feature calculation formula;
(5) feature standardization: because the features are measured on different scales, the features are standardized to achieve a better partitioning effect, forming a standard data set.
8. The malicious registered enterprise behavior recognition system based on the isolated forest algorithm as claimed in claim 6 or 7, wherein the feedback result analysis module specifically comprises the following working processes:
(1) calculating the correlation between each feature and malicious registration behavior in combination with the real malicious registered enterprise results fed back by the market supervision and tax departments, eliminating features with low or no correlation, and increasing the weight of features with high correlation;
(2) analyzing other characteristics of the malicious registered enterprises, and adding new high-relevance features;
(3) performing model training again, so that the recognition result becomes more accurate.
9. An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executes the computer program stored in the memory, causing the at least one processor to perform the isolated forest algorithm-based malicious registered enterprise behavior identification method of any one of claims 1 to 5.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program is executable by a processor to implement the method for identifying malicious registered enterprise behaviors based on an isolated forest algorithm as claimed in any one of claims 1 to 5.
CN202011237306.0A 2020-11-09 2020-11-09 Malicious registered enterprise behavior identification method and system based on isolated forest algorithm Pending CN112270553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011237306.0A CN112270553A (en) 2020-11-09 2020-11-09 Malicious registered enterprise behavior identification method and system based on isolated forest algorithm

Publications (1)

Publication Number Publication Date
CN112270553A (en) 2021-01-26

Family

ID=74339698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011237306.0A Pending CN112270553A (en) 2020-11-09 2020-11-09 Malicious registered enterprise behavior identification method and system based on isolated forest algorithm

Country Status (1)

Country Link
CN (1) CN112270553A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210126)