CN112270553A - Malicious registered enterprise behavior identification method and system based on isolated forest algorithm

Info

Publication number: CN112270553A
Application number: CN202011237306.0A
Authority: CN
Inventors: 曲金涛; 彭光
Original assignee: Inspur Software Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2020-11-09
Filing date: 2020-11-09
Publication date: 2021-01-26

Abstract

The invention discloses a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm. It belongs to the field of public credit and addresses the technical problem that malicious registration behavior in that field is difficult to identify. The technical scheme is as follows: measurable detection features are extracted from market entity registration information, tax registration information, operating conditions, daily water and electricity usage information, employee social security payment information, invoice receiving information and billing information; an enterprise data set is constructed from these features; the isolated forest algorithm then randomly partitions the data set with hyperplanes, the points isolated earliest being abnormal points, so that abnormal enterprises are screened out. The specific steps are: data collection; feature selection; data preprocessing; model training; result pushing; feedback result analysis; and model release.

Description

Malicious registered enterprise behavior identification method and system based on isolated forest algorithm

Technical Field

The invention relates to the field of public credit, in particular to a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm.

Background

Isolated forest (isolation forest) algorithm: an unsupervised anomaly detection method suitable for continuous data. Unlike other anomaly detection algorithms, which describe the degree of separation between samples through quantitative indexes such as distance and density, the isolated forest algorithm detects abnormal values by isolating sample points. Specifically, the algorithm isolates samples using a binary search tree structure called an isolation tree. Because outliers are few in number and lie apart from most samples, they are isolated earlier; that is, outliers end up closer to the root node, while normal values end up farther from it. In addition, compared with traditional algorithms such as LOF and K-means, the isolated forest algorithm is more robust on high-dimensional data. The isolated forest algorithm is suited to situations where abnormal samples are few and normal samples are many; maliciously registered enterprises make up only a small share of all registered enterprises, so the algorithm fits this scenario.
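The isolation idea described above can be illustrated with a minimal sketch. It is not the implementation used by the invention: it assumes axis-aligned random cuts, uses the whole toy data set for every tree instead of sub-sampling, and all names (`isolation_depth`, `mean_depth`, the toy points) are hypothetical.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Number of random axis-aligned cuts needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                               # this feature cannot separate the rows
        return depth
    split = rng.uniform(lo, hi)                # pick a random cut value
    # Keep only the rows that fall on the same side of the cut as `point`.
    if point[dim] < split:
        same_side = [row for row in data if row[dim] < split]
    else:
        same_side = [row for row in data if row[dim] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth of `point` over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data: 60 "normal" points clustered near the origin plus one distant point.
_rng = random.Random(42)
normal_points = [(_rng.gauss(0, 1), _rng.gauss(0, 1)) for _ in range(60)]
outlier = (9.0, 9.0)
dataset = normal_points + [outlier]

outlier_depth = mean_depth(outlier, dataset)
typical_depth = sum(mean_depth(p, dataset) for p in normal_points) / len(normal_points)
print(outlier_depth, typical_depth)
```

On this toy data the distant point is separated after far fewer cuts than a typical clustered point, which is the sense in which outliers sit closer to the root node.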

Feature selection in the traditional isolated forest algorithm depends largely on expert experience. This approach lacks reliable data support, which easily leads to low model identification accuracy, and because the algorithm is unsupervised, verifying the results is also difficult.

Disclosure of Invention

The invention provides a malicious registered enterprise behavior identification method and system based on an isolated forest algorithm, and aims to solve the problem that malicious registered behaviors in the field of public credit are difficult to identify.

The technical task of the invention is realized as follows. The malicious registered enterprise behavior identification method based on the isolated forest algorithm extracts measurable detection features from the market entity's registration information, tax registration information, operating conditions, daily water and electricity consumption information, employee social security payment information, invoice receiving information and billing information; constructs an enterprise data set from these features; and then uses the isolated forest algorithm to randomly partition the data set with hyperplanes, the points isolated earliest being abnormal points, so that abnormal enterprises are screened out.

Preferably, the method is specifically as follows:

data collection: enterprise information is collected and extracted with a data extraction tool to a big data platform, where it awaits analysis;

feature selection: based on expert experience and a summary of the characteristics of previous maliciously registered enterprises, features highly associated with malicious registration behavior are selected, and new features are derived from existing data;

data preprocessing: the data is preprocessed to form a standard data set;

model training: the standard data set is used to train the isolated forest algorithm, yielding a model result in which the set of abnormal points represents the suspected maliciously registered enterprises; the trained model can be used directly to identify suspected malicious registration by new enterprises;

result pushing: the resulting list of suspected maliciously registered enterprises is pushed through the public credit information platform to the market supervision department and the tax department; the two departments give these enterprises focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;

feedback result analysis: the characteristics of maliciously registered enterprises are analyzed using the confirmed results fed back by the market supervision and tax departments, new highly correlated features are added, and model training continues;

model release: after multiple rounds of training and feedback, the model becomes stable and its recognition accuracy is high, at which point it can be released for malicious registration recognition of new enterprises.
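For the model training step, the text does not say how the abnormal point set is cut off from the rest. A sketch under stated assumptions: the usual isolation forest formulation turns the mean isolation depth E[h(x)] into a score s = 2^(-E[h(x)] / c(n)), where c(n) is the average path length for a sample of size n. The threshold of 0.6, the sample size of 256, the enterprise IDs and the depth values below are all hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    """Score in (0, 1]: close to 1 suggests an anomaly, 0.5 or below suggests a normal point."""
    return 2.0 ** (-mean_depth / average_path_length(n))

def suspected_enterprises(mean_depths, n, threshold=0.6):
    """Return the IDs whose score exceeds the (hypothetical) threshold, highest score first."""
    scores = {eid: anomaly_score(depth, n) for eid, depth in mean_depths.items()}
    flagged = [eid for eid, score in scores.items() if score > threshold]
    return sorted(flagged, key=lambda eid: scores[eid], reverse=True)

# Hypothetical mean isolation depths per enterprise, for trees built on 256 samples each.
depths = {"E001": 8.0, "E002": 2.1, "E003": 8.4, "E004": 9.0}
suspects = suspected_enterprises(depths, n=256)
print(suspects)
```

In practice the threshold would be tuned against the confirmed cases fed back by the supervising departments rather than fixed in advance.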

Preferably, the enterprise information includes: enterprise registration information and operating conditions from the market supervision department; tax registration information, invoice receiving information and billing information from the tax department; and daily water and electricity consumption information and employee social security payment information from other departments.

Preferably, the data preprocessing is specifically as follows:

(1) cleaning the key feature fields;

(2) manually filling in missing values, and removing those that cannot be filled in;

(3) manually correcting abnormal values, and removing those that cannot be corrected;

(4) feature calculation: computing the features to be derived according to the feature calculation formula;

(5) feature standardization: because the features are measured on different scales, the features are standardized to achieve a better partitioning effect, forming a standard data set.
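The removal and standardization steps above can be sketched as follows. The text does not name a specific standardization; z-score scaling is assumed here, and the column meanings and values are hypothetical.

```python
import math

def drop_incomplete(rows):
    """Keep only records in which every feature value is present (not None)."""
    return [row for row in rows if all(value is not None for value in row)]

def standardize(rows):
    """Rescale each feature column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical columns: months since registration, invoices received, electricity use (kWh).
raw_rows = [
    (24, 30, 5200.0),
    (3, 400, 15.0),
    (36, None, 7100.0),   # missing value that could not be filled in manually
    (18, 45, 4300.0),
]
clean_rows = drop_incomplete(raw_rows)
standard_rows = standardize(clean_rows)
for row in standard_rows:
    print([round(v, 3) for v in row])
```

After scaling, a cut on the invoice count and a cut on electricity use cover comparable ranges, so no single feature dominates the random partitioning merely because of its unit.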

Preferably, the feedback result analysis is specifically as follows:

(1) using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, calculating the correlation between each feature and malicious registration behavior, eliminating features with low or no correlation, and increasing the proportion of highly correlated features;

(2) analyzing other characteristics of the maliciously registered enterprises and adding new highly correlated features;

(3) performing model training again so that the recognition result becomes more accurate.
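The correlation-based screening in the feedback analysis might look like the sketch below. The text does not specify a correlation measure; Pearson correlation against a 0/1 label (1 for a confirmed malicious registration) is assumed, and the feature names, values and the 0.3 cut-off are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def screen_features(features, labels, min_abs_corr=0.3):
    """Keep features whose absolute correlation with the feedback labels reaches the cut-off."""
    corr = {name: pearson(values, labels) for name, values in features.items()}
    kept = [name for name, r in corr.items() if abs(r) >= min_abs_corr]
    return kept, corr

# Feedback from the departments: 1 = confirmed malicious registration, 0 = not confirmed.
labels = [1, 1, 0, 0, 0, 0]
features = {
    "invoices_first_month": [400, 350, 20, 35, 10, 50],
    "months_since_registration": [2, 3, 30, 24, 40, 18],
    "name_length": [5, 8, 6, 7, 8, 5],
}
kept, corr = screen_features(features, labels)
print(kept)
```

The absolute value is used so that a strongly negative relationship, such as a short time since registration, counts as informative too.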

A malicious registered enterprise behavior recognition system based on an isolated forest algorithm comprises,

the data acquisition module is used for acquiring enterprise information, and extracting the enterprise information to a big data platform for analysis by using a data extraction tool; the enterprise information comprises enterprise registration information and operation conditions of a market supervision department, tax registration information, invoice receiving information and invoice making information of a tax department, daily water and electricity consumption information of other departments and employee social security payment information;

the characteristic selection module is used for selecting characteristics highly associated with malicious registration behaviors according to expert experience and characteristic summary of the past malicious registered enterprises, and extracting new characteristics according to the existing data;

the data preprocessing module is used for preprocessing the data to form a standard data set;

the model training module is used for training a standard data set by utilizing an isolated forest algorithm to obtain a model result, and an abnormal point set in the model result is a suspected malicious registered enterprise; the trained model is directly used for suspected malicious registration recognition of a new enterprise;

the result pushing module is used for pushing the obtained list of the suspected malicious registered enterprises to a market supervision department and a tax department through a public credit information platform, and the two departments perform key supervision on the suspected malicious registered enterprises in daily supervision work and feed back real malicious registered enterprises to a big data platform;

the feedback result analysis module is used for analyzing the characteristics of the malicious registered enterprises, adding new high-relevance characteristics and continuing to perform model training by combining the real malicious registered enterprise results fed back by the market supervision and tax departments;

and the model release module is used for releasing the model once, after multiple rounds of training and feedback, the model has become stable and its recognition accuracy is high; the released model can be used for malicious registration recognition of new enterprises.

Preferably, the data preprocessing module specifically comprises the following working processes:

cleaning key characteristic fields;

Preferably, the working process of the feedback result analysis module is as follows:

(1) using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, calculating the correlation between each feature and malicious registration behavior, eliminating features with low or no correlation, and increasing the proportion of highly correlated features;

(2) analyzing other characteristics of the maliciously registered enterprises and adding new highly correlated features;

(3) performing model training again so that the recognition result becomes more accurate.

An electronic device, comprising: a memory and at least one processor;

wherein the memory has stored thereon a computer program;

the at least one processor executes the computer program stored in the memory, such that the at least one processor performs the isolated forest algorithm-based malicious registered enterprise behavior identification method as described above.

A computer-readable storage medium having stored thereon a computer program executable by a processor to implement a method for malicious registered enterprise behavior identification based on isolated forest algorithms as described above.

The malicious registered enterprise behavior identification method and system based on the isolated forest algorithm have the following advantages:

(1) the invention provides a model and technique that uses the isolated forest algorithm to identify malicious registered enterprise behavior, distributes results and collects feedback on the detection results through the public credit information platform, and adjusts feature selection using that feedback; because the detection features are highly related to malicious registration behavior, the abnormal enterprises are pushed as suspected maliciously registered enterprises to the market supervision department and the tax department, and the verification results fed back by the two departments are collected as the basis for subsequent feature screening, which improves the model's identification accuracy; the invention can effectively discover potential maliciously registered enterprises and assist the market supervision and tax departments in their daily supervision work; the invention is also general, in that the whole analysis process is applicable to anomaly detection in other fields and can achieve a good effect there;

(2) the invention applies the isolated forest algorithm to the identification of maliciously registered enterprises in the field of public credit, determines preliminary features from expert experience, automatically identifies abnormal points with the algorithm, and pushes the abnormal points as suspected maliciously registered enterprises to the market supervision and tax departments;

(3) a feedback mechanism is added to the unsupervised anomaly detection: multiple rounds of feature screening and feature addition are carried out using the confirmed maliciously registered enterprises fed back by the market supervision and tax departments, so that feature selection better matches reality and the accuracy of the model improves;

(4) the feature set is built starting from the characteristics of maliciously registered enterprises, so that the feature set used in the first round of training already helps the isolated forest algorithm identify anomalies;

(5) the feature set is gradually optimized using the real data fed back by the market supervision and tax departments, so that model identification becomes stable and accurate;

(6) the identified suspected maliciously registered enterprises are pushed to the market supervision and tax departments as a reminder to carry out focused supervision and investigation.

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a flow chart of a malicious registered enterprise behavior identification method based on an isolated forest algorithm.

Detailed Description

The malicious registered enterprise behavior identification method and system based on the isolated forest algorithm are described in detail below with reference to the drawings and specific embodiments of the specification.

Example 1:

As shown in FIG. 1, the malicious registered enterprise behavior identification method based on the isolated forest algorithm extracts measurable detection features from market entity registration information, tax registration information, operating conditions, daily water and electricity consumption information, employee social security payment information, invoice receiving information and billing information; constructs an enterprise data set from these features; and uses the isolated forest algorithm to randomly partition the data set with hyperplanes, the points isolated earliest being abnormal points, so that abnormal enterprises are screened out; the specific steps are as follows:

S1, collecting data: enterprise information is collected and extracted with a data extraction tool to a big data platform, where it awaits analysis; the enterprise information comprises enterprise registration information and operating conditions from the market supervision department; tax registration information, invoice receiving information and billing information from the tax department; and daily water and electricity consumption information and employee social security payment information from other departments.

S2, selecting characteristics: according to expert experience and characteristic summary of previous malicious registered enterprises, characteristics highly associated with malicious registration behaviors are selected, and new characteristics are extracted according to existing data;

wherein the characteristics of maliciously registered enterprises include: no actual business operations; fraudulent use of ID cards; the legal representative, financial staff and tax handler being the same person; production energy consumption (for example, electricity charges) seriously inconsistent with sales; a short time since establishment; a large number of enterprises registered within a short period; and a large number of invoices received, and concentrated issuing of invoices at the quota limit, within a short period; specifically as follows:

S3, preprocessing data: preprocessing the data to form a standard data set; the specific steps are as follows:

cleaning key characteristic fields;

S4, model training: training a standard data set by using an isolated forest algorithm to obtain a model result, wherein an abnormal point set in the model result is a suspected malicious registered enterprise; the trained model is directly used for suspected malicious registration recognition of a new enterprise;

S5, result pushing: the list of suspected maliciously registered enterprises obtained in step S4 is pushed through the public credit information platform to the market supervision department and the tax department; the two departments give these enterprises focused supervision in their daily supervision work and feed the confirmed maliciously registered enterprises back to the big data platform;

S6, feedback result analysis: the characteristics of maliciously registered enterprises are analyzed using the confirmed results fed back by the market supervision and tax departments, new highly correlated features are added, and the process returns to step S4 to continue model training; the specific steps are as follows:

S7, model release: through multiple rounds of training and feedback, the model tends to be stable, the recognition accuracy is high, and the model can be released for malicious registration recognition of new enterprises.
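The characteristics named in step S2 have to become measurable features before the isolated forest algorithm can use them. The sketch below shows one way a few of them could be derived from a raw record; it is an illustration only, the record fields and feature names are hypothetical, and the feature calculation formulas actually used by the method are not given in the text.

```python
from datetime import date

def derive_features(record, as_of):
    """Map one raw enterprise record (hypothetical field names) to numeric features."""
    registered = record["registered_on"]
    months_open = (as_of.year - registered.year) * 12 + (as_of.month - registered.month)
    same_person = (
        record["legal_rep_id"] == record["finance_staff_id"] == record["tax_handler_id"]
    )
    sales = record["sales_amount"]
    return {
        # 1 when the legal representative, financial staff and tax handler are one person.
        "same_person_roles": int(same_person),
        # A short time since establishment shows up as a small value.
        "months_since_registration": months_open,
        # Many invoices received soon after registration gives a large value.
        "invoices_per_month": record["invoices_received"] / max(months_open, 1),
        # Energy use per unit of sales; a very small value means sales with little production.
        "kwh_per_sales_unit": record["electricity_kwh"] / sales if sales else 0.0,
    }

# Hypothetical record, evaluated as of a hypothetical analysis date.
example = {
    "legal_rep_id": "P1",
    "finance_staff_id": "P1",
    "tax_handler_id": "P1",
    "registered_on": date(2020, 8, 15),
    "invoices_received": 400,
    "electricity_kwh": 20.0,
    "sales_amount": 1_000_000.0,
}
features = derive_features(example, as_of=date(2020, 11, 9))
print(features)
```

Characteristics that span several records, such as one person registering many enterprises within a short period, would need a grouping step across the whole data set and are left out of this sketch.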

Example 2:

the invention relates to a malicious registered enterprise behavior recognition system based on an isolated forest algorithm, which comprises,

the data preprocessing module is used for preprocessing the data to form a standard data set; the working process of the data preprocessing module is as follows:

cleaning key characteristic fields;

the feedback result analysis module is used for analyzing the characteristics of the malicious registered enterprises, adding new high-relevance characteristics and continuing to perform model training by combining the real malicious registered enterprise results fed back by the market supervision and tax departments; the working process of the feedback result analysis module is as follows:

Example 3:

an embodiment of the present invention further provides an electronic device, including: a memory and at least one processor;

wherein the memory stores computer-executable instructions;

the at least one processor executes the computer-executable instructions stored in the memory to cause the at least one processor to perform the method for identifying malicious registered enterprise behaviors based on isolated forest algorithms in any of the embodiments of the present invention.

Example 4:

the embodiment of the invention also provides a computer-readable storage medium, wherein a plurality of instructions are stored, and the instructions are loaded by the processor, so that the processor executes the malicious registered enterprise behavior identification method based on the isolated forest algorithm in any embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A malicious registered enterprise behavior identification method based on an isolated forest algorithm, characterized in that measurable detection features are extracted from market entity registration information, tax registration information, operating conditions, daily water and electricity consumption information, employee social security payment information, invoice receiving information and billing information; an enterprise data set is constructed from the measurable detection features; and the isolated forest algorithm is used to randomly partition the data set with hyperplanes, the points isolated earliest being abnormal points, so that abnormal enterprises are screened out.

2. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in claim 1, which is characterized by comprising the following steps:

data preprocessing: preprocessing the data to form a standard data set;

3. The method for identifying malicious registered enterprise behaviors based on an isolated forest algorithm as claimed in claim 1, wherein the enterprise information comprises enterprise registration information and operating conditions from a market supervision department; tax registration information, invoice receiving information and billing information from a tax department; and daily water and electricity consumption information and employee social security payment information from other departments.

4. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in claim 1, wherein the data preprocessing is specifically as follows:

cleaning key characteristic fields;

characteristic normalization treatment: and carrying out standardization processing on the features to form a standard data set.

5. The malicious registered enterprise behavior identification method based on the isolated forest algorithm as claimed in any one of claims 1-4, wherein the feedback result analysis is specifically as follows:

6. A malicious registered enterprise behavior recognition system based on an isolated forest algorithm is characterized by comprising,

7. The system for identifying malicious registered enterprise behaviors based on an isolated forest algorithm according to claim 6, wherein the data preprocessing module specifically comprises the following working processes:

cleaning key characteristic fields;

8. The malicious registered enterprise behavior recognition system based on the isolated forest algorithm as claimed in claim 6 or 7, wherein the feedback result analysis module specifically comprises the following working processes:

9. An electronic device, comprising: a memory and at least one processor;

wherein the memory has stored thereon a computer program;

the at least one processor executing the computer program stored in the memory causes the at least one processor to perform the isolated forest algorithm-based malicious registered enterprise behavior identification method of any one of claims 1 to 5.

10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program is executable by a processor to implement the method for identifying malicious registered enterprise behaviors based on an isolated forest algorithm as claimed in any one of claims 1 to 5.