CN109657947B

CN109657947B - Enterprise industry classification-oriented anomaly detection method

Info

Publication number: CN109657947B
Application number: CN201811489291.XA
Authority: CN
Inventors: 郑庆华; 高宇达; 阮建飞; 赵珮瑶; 董博; 孙铭潞; 田雨润
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2018-12-06
Filing date: 2018-12-06
Publication date: 2021-03-16
Anticipated expiration: 2038-12-06
Also published as: CN109657947A

Abstract

The invention discloses an anomaly detection method for enterprise industry classification, comprising the following steps: first, extracting the text and non-text information to be mined from taxpayer industry information, and performing feature processing and encoding; second, constructing a deep network structure suited to the industry classification anomaly detection problem, and determining the number of neurons in the input and output layers of the network according to the feature dimension of the encoded data; third, based on the constructed deep network structure, training separate networks for industry major classes and industry details with different training strategies, using cross-validation; and finally, detecting anomalies in industry major classes by combining the dimension-reduction property of the industry major-class network with the SOS anomaly detection algorithm, and detecting anomalies in industry details according to the reconstruction property of the industry detail network. By applying the TADM model to detect anomalies in the original data, the invention supports more reasonable and accurate analysis in macro-level management work such as national statistics, taxation, and industrial and commercial administration.

Description

Enterprise industry classification-oriented anomaly detection method

Technical Field

The invention belongs to the field of data mining, and particularly relates to an anomaly detection method for enterprise industry classification based on TADM (Two-level Anomaly Detection Model).

Background

Since the reform and opening-up, China's national economy has developed rapidly, the market economy has continued to flourish, the national economic structure has gradually improved, and the division of enterprises into industries has become progressively finer. In the new period, research on enterprise industry classification plays a fundamental role in supporting the management of finance, taxation and national standards, and provides a basis for analyzing the current state of the national economy's industries and grasping its development trend. The Industrial Classification for National Economic Activities (GB/T 4754-2017), issued by the General Administration of Quality Supervision, Inspection and Quarantine and the Standardization Administration, specifies the industry classification and codes of enterprise economic activities, and comprises 97 industry major classes and 1380 industry details. When an enterprise registers, the industrial and commercial administration department must determine its national economic industry classification from information such as its business scope. However, enterprise industry classification is currently performed mainly by hand; it is limited by the professional knowledge and experience of the staff, and classification errors are frequent when large numbers of enterprises must be classified. Incorrect enterprise industry classification adversely affects national statistics, taxation, and industrial and commercial administration, so how to detect and identify anomalies in enterprise industry classification with a computer program has become a problem that urgently needs to be solved.

At present, no published research provides a solution specifically for detecting anomalies in enterprise industry classification. The published techniques aim at general-purpose anomaly detection, and representative works are as follows:

Document 1: distributed outlier detection method and system based on an autoencoder (201410225026.6)

Document 2: local outlier detection method based on density (201710559390.X)

Document 3: multi-dimensional data anomaly detection method and device (201710411852.3)

Document 1 proposes a distributed outlier detection method based on an autoencoder, which updates the model parameters of the autoencoder using distributed computing and performs anomaly detection according to the reconstruction errors of samples.

Document 2 designs a density-based local outlier detection method, which considers the degree of dispersion between a sample point and its neighboring sample points, defines a k-neighborhood dispersion degree from the expectation and variance of the distances between the sample point and its neighbors, redefines the local outlier coefficient using the k-neighborhood dispersion degree, and judges whether a sample is anomalous by calculating the neighborhood density of the sample point.

Document 3 uses a reconstruction network to perform anomaly detection on high-dimensional data, constructs a reconstruction model, and determines an anomaly of a sample according to multi-dimensional reconstruction data.

Although these traditional methods can solve specific anomaly detection problems, they are difficult to extend directly to anomaly detection for industry classification, because that problem is both multi-class and multi-level. First, enterprise industry classification is a multi-class problem, and the large number of categories and the large data volume make anomaly detection complicated. The autoencoder network structures of documents 1 and 3 are too simple: they have only one hidden layer, cannot effectively extract the detailed characteristics of the data, and generalize poorly on large-scale data sets. Document 2 defines a local outlier coefficient using k-neighborhood dispersion, but with many industry categories and a large amount of industry information data, selecting the value of k becomes extremely difficult. Second, industry major classes and industry details stand in a hierarchical membership relation: they belong to different levels, an industry detail is an expansion and refinement of an industry major class, and every enterprise corresponds to one industry major class and one industry detail. Anomaly analysis therefore needs data of different information granularities (granularity reflecting how detailed the information is), and documents 1 to 3 offer no solution to multi-level anomaly detection for industry classification.

To address the fact that industry major classes and industry details belong to different levels, a deep autoencoder network model is introduced. A deep autoencoder network also has a clear hierarchical character: during encoding and decoding, the outputs of different network layers correspond to different feature spaces of the network, and the information granularities represented by those feature spaces differ, which matches the hierarchical character of industry classification. The deep autoencoder network can therefore address anomaly detection for industry major classes and industry details at the same time.

Disclosure of Invention

The invention aims to provide an anomaly detection method for enterprise industry classification based on TADM. First, the text and non-text information to be mined is extracted from taxpayer industry information, and feature processing and encoding are performed; second, a deep network structure suited to the industry classification anomaly detection problem is constructed, and the number of neurons in the input and output layers of the network is determined according to the feature dimension of the encoded data; third, based on the constructed deep network structure, separate networks for industry major classes and industry details are trained with different training strategies, using cross-validation; and finally, anomalies in industry major classes are detected by combining the dimension-reduction property of the industry major-class network with the SOS (Stochastic Outlier Selection) algorithm, and anomalies in industry details are detected according to the reconstruction property of the industry detail network. The two processes are independent of each other, so that anomaly detection for industry major classes and industry details is carried out synchronously.

The invention is realized by adopting the following technical scheme:

an enterprise industry classification-oriented anomaly detection method comprises the following steps:

first, extracting the text and non-text information to be mined from taxpayer industry information, and performing feature processing and encoding; second, constructing a deep network structure suited to the industry classification anomaly detection problem, and determining the number of neurons in the input and output layers of the network according to the feature dimension of the encoded data; third, based on the constructed deep network structure, training separate networks for industry major classes and industry details with different training strategies, using cross-validation; and finally, detecting anomalies in industry major classes by combining the dimension-reduction property of the industry major-class network with the SOS anomaly detection algorithm, and detecting anomalies in industry details according to the reconstruction property of the industry detail network.

The invention has the further improvement that the method specifically comprises the following implementation steps:

1) taxpayer text attribute processing

Analyzing text information in the taxpayer industry information table, extracting representative text attributes, and performing anomaly detection by using the extracted attributes;

2) non-text attribute processing

The non-text attribute of the taxpayer industry information comprises two parts: a numerical type attribute and a categorical type attribute; processing the numerical type attribute by using a z-score standardization method, and encoding the category type attribute by using One-Hot;

3) anomaly indicator generation and analysis

Anomaly indicators are generated and analyzed using a deep-learning autoencoder network as the prototype, and a TADM-based method for calculating industry classification anomaly indicators is designed on the principle that industry information at different levels carries different information granularities;

4) anomaly evaluation

The industry major-class anomaly indicator at the first level is obtained from the TADM network together with the SOS anomaly detection algorithm. The model calculates the anomaly probability of each sample, and an anomaly probability threshold is given; the anomaly probabilities of all taxpayer feature samples are compared with the threshold, and a sample whose anomaly probability is greater than the threshold is judged to be anomalous data within its industry major class;

the industry detail anomaly indicator at the second level is obtained from the reconstruction performed by the TADM network. The model calculates the reconstruction error of each sample, and a reconstruction error threshold is given; the reconstruction errors of all industry detail samples are compared with the threshold, and a sample whose reconstruction error is greater than the threshold is judged to be anomalous data within its industry detail.
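The two threshold comparisons in step 4) can be sketched as follows. This is a minimal illustration only: the function name `flag_anomalies`, the sample scores, and the threshold values 0.7 and 0.2 are assumptions, since the text does not fix any particular threshold.

```python
import numpy as np

def flag_anomalies(scores, threshold):
    """Return a boolean mask that is True where a score strictly exceeds the threshold."""
    scores = np.asarray(scores, dtype=float)
    return scores > threshold

# Hypothetical scores for five taxpayer samples (illustrative values only).
outlier_prob = np.array([0.10, 0.92, 0.35, 0.71, 0.05])   # level 1: anomaly probability
recon_error = np.array([0.02, 0.03, 0.40, 0.01, 0.25])    # level 2: reconstruction error

# The thresholds are assumed; in practice they would be chosen per data set.
major_class_anomalies = flag_anomalies(outlier_prob, threshold=0.7)
detail_anomalies = flag_anomalies(recon_error, threshold=0.2)
```

Because the two masks are computed independently, a sample may be flagged at either level, at both, or at neither, which mirrors the independence of the two detection processes.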

The invention has the further improvement that in the step 1), the taxpayer text attribute processing specifically comprises the following steps:

step1. preprocessing of text information

Text preprocessing standardizes the taxpayer industry information, and specifically comprises: (1) deleting garbled characters in the database; (2) deleting numbers and quantifiers (measure words) in the text attributes; (3) deleting data marked as null in the database;

step2, carrying out word segmentation based on Ansj word segmentation device

An industry classification professional dictionary is constructed from the national economic industry classification; a stop word dictionary is constructed from the place-name lexicon of the national four-level administrative divisions (province, city, county, district); the texts are segmented with the Ansj segmenter according to the constructed dictionaries, and a segmented corpus is established;

step3. construction of word vectors

The words of all samples are weighted according to the proportion that the texts of different categories occupy in the segmented corpus; words with larger weights are selected, the N keywords with the largest weights in each corpus are retained, and these N keywords are converted into word vectors using the word2vec tool.

The invention further improves that in the step 2), the non-text attribute processing of the taxpayer industry information comprises two parts: numerical attribute processing and category attribute processing;

the method for processing the numerical attributes by using the z-score standardization method comprises the following specific steps:

step1. calculate the mean of the various attributes

Let $u = (u_1, u_2, \ldots, u_m)$ be the mean vector, where $m$ is the number of numerical attributes and $u_j$ is the mean of the $j$-th numerical attribute, calculated as:

$$u_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \quad j = 1, 2, \ldots, m$$

where $n$ is the number of taxpayer industry information samples and $x_{ij}$ is the value of the $j$-th numerical attribute of the $i$-th sample;

step2. calculate the variance of each attribute

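Let $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_m)$ describe the dispersion of each numerical attribute, where $m$ is the number of numerical attributes and $\sigma_j$ is the standard deviation (the square root of the variance) of the $j$-th numerical attribute, calculated as:

$$\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - u_j\right)^2}$$

where $n$ is the number of samples, $x_{ij}$ is the value of the $j$-th numerical attribute of the $i$-th sample, and $u_j$ is the mean of the $j$-th numerical attribute. The mean and the variance are the basic statistics of a numerical attribute, and the attribute can be standardized using them;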

step3, standardizing the data

The sample data are standardized using the mean and dispersion of each numerical attribute calculated in the two preceding steps, in the following form:

$$\tilde{X}_j = \frac{X_j - u_j}{\sigma_j}$$

where $\tilde{X}_j$ is the result after z-score processing, $X_j$ is the column vector corresponding to the $j$-th numerical attribute, $u_j$ is the mean of the $j$-th numerical attribute, and $\sigma_j$ is its standard deviation (the square root of its variance);
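The three z-score steps can be illustrated with a short NumPy sketch. The function name `z_score`, the sample matrix, and the guard for constant columns are assumptions added for illustration; the sketch uses the population standard deviation (division by n).

```python
import numpy as np

def z_score(X):
    """Standardize each column (one numerical attribute) of an n-by-m matrix."""
    X = np.asarray(X, dtype=float)
    u = X.mean(axis=0)        # step1: mean of each attribute
    sigma = X.std(axis=0)     # step2: population standard deviation of each attribute
    # Assumed guard: leave constant columns unscaled instead of dividing by zero.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - u) / sigma    # step3: standardize

# Hypothetical data: 3 samples, 2 numerical attributes on very different scales.
X = np.array([[100.0, 5.0],
              [200.0, 15.0],
              [300.0, 10.0]])
Z = z_score(X)
```

After this transformation each attribute has zero mean and unit standard deviation, so attributes measured on very different scales contribute comparably to distance calculations.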

Categorical attributes are encoded using One-Hot; the detailed steps are as follows:

step1, determining the number of discrete values of the attribute; supposing the attribute has M discrete values, an M-bit state register is used to encode the M states, where each state has its own register bit and only one bit is active at a time;

step2, setting the M-bit state codes so that exactly one bit of each code is 1 and the rest are 0; this converts differences between categorical values into distances in Euclidean space;

step3, mapping the M state codes one-to-one to the M discrete values, so that each attribute value is represented by an M-dimensional vector, which is the One-Hot code of that value.

A further improvement of the invention is that step 3) specifically comprises the construction of the TADM network, the generation of industry major-class anomaly indicators, and the generation of industry detail anomaly indicators, as follows:

step1. TADM network construction

The TADM network is a five-layer neural network. The input layer and the output layer have the same number of neurons, equal to the total dimension of the data obtained by processing the text and non-text attributes of the taxpayer information. The TADM network takes the taxpayer information feature vector as its input and the reconstruction result as its output;

specifically, the TADM network is a standard feedforward neural network whose weights are updated by the BP (back-propagation) algorithm, which minimizes the overall reconstruction error $OF$ and updates the network parameters. Formally:

$$\min_{\Omega} OF = \min_{\Omega} \frac{1}{n}\sum_{i=1}^{n} OF_i$$

where $\Omega$ denotes the network parameters, $n$ is the number of samples, and $OF_i$ is the reconstruction error of the $i$-th sample; the encoding and decoding capability of the network is determined by the network parameters $\Omega$;
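A five-layer autoencoder trained by back-propagation can be sketched in plain NumPy as below. This is only an illustrative sketch under stated assumptions: the layer sizes `[8, 6, 3, 6, 8]`, the tanh hidden activations, the linear output layer, the learning rate, the random data, and all function names are assumptions, and the per-sample error is taken as the mean squared difference between input and output.

```python
import numpy as np

def init_params(sizes, rng):
    """One (W, b) pair for each connection between consecutive layers."""
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, X):
    """Return the activations of every layer (tanh hidden layers, linear output)."""
    acts = [X]
    for k, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        acts.append(z if k == len(params) - 1 else np.tanh(z))
    return acts

def train_step(params, X, lr):
    """One full-batch gradient-descent step on the overall reconstruction error."""
    acts = forward(params, X)
    n, N = X.shape
    loss = np.mean(np.sum((acts[-1] - X) ** 2, axis=1) / N)
    delta = 2.0 * (acts[-1] - X) / (n * N)   # gradient w.r.t. the linear output
    new_params = list(params)
    for k in reversed(range(len(params))):
        W, b = params[k]
        grad_W = acts[k].T @ delta
        grad_b = delta.sum(axis=0)
        if k > 0:
            # Back-propagate through W, then through the tanh of the previous layer.
            delta = (delta @ W.T) * (1.0 - acts[k] ** 2)
        new_params[k] = (W - lr * grad_W, b - lr * grad_b)
    return new_params, loss

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))                 # 64 hypothetical samples, 8 features
params = init_params([8, 6, 3, 6, 8], rng)   # input, three hidden layers, output
losses = []
for _ in range(200):
    params, loss = train_step(params, X, lr=0.1)
    losses.append(loss)
```

In this sketch `forward(params, X)[2]` is the output of the middle layer, a reduced-dimension representation of each sample, and `forward(params, X)[-1]` is the reconstruction.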

step2. Generation of industry major anomaly indicators

Using the dimension-reduction property of the intermediate layer of the TADM network, the principal-component spatial features of the intermediate layer are extracted and combined with the SOS anomaly detection algorithm to detect anomalies in industry major classes;

the SOS anomaly detection algorithm is a similarity-based anomaly detection method: it judges whether a sample is anomalous by calculating its anomaly probability, with a suitable threshold selected as the criterion for anomalous data. The SOS anomaly detection algorithm comprises the following steps:

(1) generating a dissimilarity matrix between taxpayer characteristics

For the taxpayer features, the dissimilarity between every pair of taxpayer features is calculated. The dissimilarity represents the degree of difference between two data points and is expressed as the Euclidean distance:

$$d_{ij} = \sqrt{\sum_{k=1}^{m}\left(x_{ik} - x_{jk}\right)^2}$$

where the feature matrix of the original data is $X = (x_{ij})_{n \times m}$, each row of $X$ being the feature vector of one sample; $n$ is the number of rows of the matrix, that is, the number of samples; $m$ is the number of columns, that is, the number of attributes of each sample; the dissimilarity matrix is $D = (d_{ij})_{n \times n}$, whose element $d_{ij}$ represents the degree of difference between sample $i$ and sample $j$;

(2) generating a matrix of degrees of association between taxpayer characteristics from dissimilarity information

The degree of association between each taxpayer feature and the other taxpayer features is expressed by an affinity. The affinity matrix is generated by constructing a nonlinear mapping that maps the dissimilarity matrix $D$ to the affinity matrix $A$; the mapping is:

$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^2}{2\sigma_i^2}\right), & i \neq j \\ 0, & i = j \end{cases}$$

where $\sigma_i^2$ is the variance associated with the feature vector $X_i$, representing the dispersion of the data corresponding to the $i$-th principal-component feature in the taxpayer industry information, and $a_{ij}$ is the affinity between sample $i$ and sample $j$;

(3) generation of association probability matrix between taxpayer characteristics through association degree information

The association probabilities between taxpayer features are represented by the matrix $B = (b_{ij})_{n \times n}$. In calculating the association probability matrix, each sample is abstracted as a node of a graph model, and the weight of the edge between two nodes represents their association probability. Formally, the association probability is calculated as:

$$b_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{ik}}$$

where $b_{ij}$ is the association probability between node $i$ and node $j$, and satisfies $\sum_{j=1}^{n} b_{ij} = 1$;

(4) Deriving abnormal probabilities between taxpayer features by correlating probabilities

In graph-theoretic terms, the probability that a taxpayer feature sample is an outlier is the probability that the in-degree of the corresponding node in the graph model is zero. The anomaly probability is derived from the association probabilities between samples as follows:

$$P(X_i \in \text{Outlier}) = \prod_{j \neq i}\left(1 - b_{ji}\right)$$

where $P(X_i \in \text{Outlier})$ is the probability that sample $X_i$ is anomalous data; anomalies in industry major classes are then evaluated against a given threshold;
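The four SOS steps can be sketched in NumPy as follows. One simplification is assumed and should be noted: the published Stochastic Outlier Selection algorithm chooses each variance adaptively from a perplexity parameter, whereas this sketch takes `sigma` as a given constant or per-sample array. The function name, the demo data and `sigma=1.0` are illustrative assumptions.

```python
import numpy as np

def sos_outlier_probability(X, sigma=1.0):
    """Outlier probability of each row of X, following steps (1) to (4)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # (1) Dissimilarity matrix D: pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # (2) Affinity matrix A: a_ij = exp(-d_ij^2 / (2 sigma_i^2)), with a_ii = 0.
    sig = np.broadcast_to(np.asarray(sigma, dtype=float), (n,))
    A = np.exp(-D ** 2 / (2.0 * sig[:, None] ** 2))
    np.fill_diagonal(A, 0.0)
    # (3) Association probability matrix B: each row of A normalized to sum to 1.
    #     (Assumes no row of A underflows to all zeros.)
    B = A / A.sum(axis=1, keepdims=True)
    # (4) Outlier probability of sample i: product over j of (1 - b_ji),
    #     i.e. a product down column i of (1 - B); b_ii = 0 contributes a factor 1.
    return np.prod(1.0 - B, axis=0)

# Hypothetical 2-D features: four clustered points and one distant point.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
p = sos_outlier_probability(X_demo, sigma=1.0)
```

In this toy example the distant point receives almost no association probability from the others, so its outlier probability is close to 1. In the method described above, the rows of `X` would be the intermediate-layer features of the TADM network rather than raw features.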

step3. Generation of industry detail anomaly indicators

Anomaly detection for industry details requires more feature information than for industry major classes. The TADM network constructed in step1 is able to encode and decode normal data: encoding is the process of feeding data into the network and compressing it into a low-dimensional space, and decoding is the process of mapping the low-dimensional data back to the high-dimensional space as output. Anomalies in industry details are judged from the reconstruction error of each sample, calculated as:

$$OF_i = \frac{1}{N}\sum_{j=1}^{N}\left(x_{ij} - o_{ij}\right)^2$$

where $OF_i$ is the reconstruction error of the $i$-th sample, $N$ is the number of neurons in the input layer and in the output layer, $x_{ij}$ is the value of the $j$-th input-layer neuron for the $i$-th sample, and $o_{ij}$ is the value of the $j$-th output-layer neuron for the $i$-th sample.
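A per-sample reconstruction error can be computed as in the sketch below. It assumes the error is the squared difference between input and output averaged over the N neurons; the function name and the example matrices are illustrative.

```python
import numpy as np

def reconstruction_error(X, O):
    """Per-sample mean squared difference between inputs X and outputs O (both n-by-N)."""
    X = np.asarray(X, dtype=float)
    O = np.asarray(O, dtype=float)
    return np.mean((X - O) ** 2, axis=1)

# Hypothetical inputs and network outputs for two samples with N = 3 neurons.
X_in = np.array([[1.0, 0.0, 2.0],
                 [0.5, 0.5, 0.5]])
O_out = np.array([[1.0, 0.0, 2.0],    # reconstructed exactly
                  [0.5, 1.5, 0.5]])   # one neuron off by 1.0
errors = reconstruction_error(X_in, O_out)
```

A sample that the network reconstructs poorly has a large error and is a candidate industry detail anomaly once the error is compared with the chosen threshold.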

The invention has the following beneficial technical effects:

the anomaly detection method for enterprise industry classification provided by the invention supplies more accurate and reliable data for industry classification and for other enterprise industry data analysis problems. It improves existing anomaly detection techniques so that they are suited to industry anomaly detection. Compared with the prior art, the invention has the following advantages:

(1) The invention combines the text and non-text information of taxpayers for industry classification. The prior art generally performs semantic analysis on the text information only and therefore uses the information incompletely. The invention applies natural language processing to the text information and feature vectorization to the non-text information, integrating the important attributes of the taxpayers' business information and using that information more comprehensively.

(2) The invention applies an autoencoder network model from deep learning to anomaly detection for industry classification and discloses the TADM anomaly detection method. It uses the reconstruction property of the network to detect anomalies in industry details, and uses the dimension-reduction property of the network together with the SOS anomaly detection algorithm to detect anomalies in industry major classes, thereby addressing two-level anomaly detection for industry classification.

(3) The network structure is reusable: the same network structure can be used for anomaly detection on both industry major classes and industry details, without redesign.

Drawings

FIG. 1 is an overall framework flow diagram.

FIG. 2 is a text attribute processing flow diagram.

FIG. 3 is a non-text property processing flow diagram.

FIG. 4 is a schematic diagram of two-level anomalous data in taxpayer information.

FIG. 5 is a schematic diagram of two-level anomaly detection performed by the TADM model.

FIG. 6 is a flow chart of the TADM model implementation.

FIG. 7 is a flowchart of the calculation of the industry major-class anomaly indicators.

Detailed Description

The invention is further described below with reference to the following figures and examples.

Taxpayer information registered with the national tax bureau of a certain region from 2011 to 2017 is selected. It contains sample data for 25 industry major classes and 112 industry details, each industry major class containing several different industry details. The invention is described in further detail below with reference to the accompanying drawings, experimental examples and embodiments. All techniques realized on the basis of the present disclosure fall within the scope of the invention.

As shown in FIG. 1, in the embodiment of the invention, two-level anomaly detection for taxpayer industry classification comprises the following steps:

step1, text attribute processing

The taxpayer industry information table contains much valuable information and knowledge that can be used for industry classification, part of which is stored in the database as text. Nine text attributes are extracted from the registered taxpayer information table and the registered taxpayer information extension table: {HY_DM, ZY, JY, JYFS, CBBZL, FSHY1_DM, FSHY2_DM, FSHY3_DM, JYFW}, which respectively mean: {industry code, main business, concurrent business, business mode, financial statement type, affiliated industry 1, affiliated industry 2, affiliated industry 3, business scope}. As shown in FIG. 2, the text attribute processing specifically comprises the following steps:

s101, preprocessing text information

The taxpayer industry information is standardized, specifically by: (1) deleting garbled characters in the database; (2) deleting numbers and quantifiers (measure words) in the text; (3) deleting data marked as null.
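The three cleaning operations of S101 might look like the following sketch. Everything specific here is an assumption for illustration: the quantifier list, the set of null markers, treating the Unicode replacement character and control characters as garbled text, and the choice to delete a quantifier only when it directly follows a number.

```python
import re

# Hypothetical list of Chinese measure words; the source does not enumerate them.
QUANTIFIERS = ["个", "件", "台", "套", "吨"]
# Assumed representations of a null field in the database export.
NULL_MARKERS = {"", "null", "NULL", "None", "无"}

# A number (optionally with a decimal part) plus an optional trailing quantifier.
NUMBER_WITH_QUANTIFIER = re.compile(r"\d+(?:\.\d+)?(?:%s)?" % "|".join(QUANTIFIERS))

def clean_text(value):
    """Return the cleaned text, or None if the field is null or ends up empty."""
    if value is None or value.strip() in NULL_MARKERS:
        return None                                   # (3) drop null-marked data
    text = value.replace("\ufffd", "")                # (1) drop replacement characters
    text = re.sub(r"[\x00-\x1f\x7f]", "", text)       # (1) drop control characters
    text = NUMBER_WITH_QUANTIFIER.sub("", text)       # (2) drop numbers and quantifiers
    text = text.strip()
    return text or None

cleaned = clean_text("销售电脑3台、软件")
```

Records for which `clean_text` returns `None` would be excluded before word segmentation.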

S102, performing word segmentation based on Ansj word segmenter

Personal names, place names, industry descriptions and similar words in the taxpayer registration information often fall outside the coverage of the built-in dictionary of a word segmentation tool. To prevent the taxpayer information from being cut into semantically incomplete fragments during segmentation, an industry classification professional dictionary is constructed from the national economic industry classification standard, and a stop word dictionary of the place names in the taxpayer information is constructed from the place-name lexicon of the national four-level administrative divisions (province, city, county, district). The text attributes are segmented with the Ansj segmenter according to the constructed dictionaries, and the segmented corpus is generated.

In this embodiment, each corpus is segmented with the Ansj segmenter using the collected professional dictionary of the national economic industry classification standard and the place-name dictionary, and stop words are removed. For example, the JYFW (business scope) attribute value of one company is: "sales of energy-saving and environment-friendly products, equipment and accessories, building decoration materials, communication equipment, mechanical equipment, hardware power distribution, electronic products and daily-use general goods; development and sales of computer software and hardware; cleaning services". After word segmentation, the text is divided into independent words, and the segmentation result is: "energy-saving environment-friendly product, equipment, accessory, building decoration material, communication equipment, mechanical equipment, hardware, communication electronic product, daily-used general merchandise, sales computer software and hardware, development, sales and cleaning service".

S103, constructing word vectors

The words of all samples are weighted according to the proportion that the words of different classes occupy in the corpus, the words with larger weights are selected, and the N keywords with the largest weights in the text of each sample are retained. This embodiment sets N to 25 and finally converts the 25 words of each sample into a numerical vector using the word2vec tool.

Step2. non-text attribute processing

Besides text information, the taxpayer registration information database also comprises some non-text information which has more intuitive characteristics and also has important values for taxpayer industry classification, cluster analysis and anomaly detection.

As shown in fig. 3, the detailed processing steps of the non-text attribute of this embodiment include:

s201, processing numerical attribute

Although the values of numerical attributes can be used directly for calculation, different attributes generally have different dimensions (units) and orders of magnitude. To make the distribution of the processed data conform to a normal distribution as far as possible and to eliminate the influence of differing dimensions, this embodiment processes the numerical attributes with the z-score method.

The registered taxpayer information table and the registered taxpayer information extension table in the taxpayer industry information database are queried, and the numerical attributes {ZCZB, TZZE, CYRS, WJRS, HHRS, GDRS, ZRRTZBL, WZTZBL, GYTZBL} are extracted, which respectively mean: {registered capital, total investment, number of employees, number of foreign employees, number of partners, number of fixed employees, natural-person investment ratio, foreign investment ratio, state-owned investment ratio}. The z-score processing is then applied to these 9 numerical attributes.

Specifically, in this embodiment, the z-score processing is calculated as:

$$\tilde{X}_i = \frac{X_i - u_i}{\sigma_i}$$

where $X_i$ is the vector of values of the $i$-th numerical attribute of the taxpayer information, $u_i$ is the mean of the $i$-th numerical attribute, $\sigma_i$ is the standard deviation (the square root of the variance) of the $i$-th numerical attribute, and $\tilde{X}_i$ is the vector after z-score processing. After z-score processing, the distribution of each numerical attribute is close to the standard normal distribution.

S202, processing type attribute characteristics

An anomaly detection algorithm must measure distances between data points. The values of categorical attributes, however, are discrete, and each discrete value is an identifier for a category rather than a number, so the categorical attributes need to be re-encoded before their values can be used for distance measurement.

The registered taxpayer information table and the registered taxpayer information extension table in the taxpayer industry information database are queried, and 7 categorical attributes are extracted: {DJZCLX_DM, ZJG_BZ, GGHBZ, ZZLB_DM, HYMX_DM, SFCLGJXZJZHY, DZFPQY_BZ}, which respectively mean: {registration type, head office flag, whether jointly administered by the national and local tax authorities, license category code, industry detail code, whether engaged in industries restricted or prohibited by the state, electronic invoice enterprise flag}. These categorical attributes are then encoded.

In this embodiment, the 7 categorical attributes are encoded using One-Hot encoding. Taking the attribute ZJG_BZ (head-office flag) as an example, the detailed encoding steps are as follows:

(1) Determine the number of discrete values of the head-office flag. According to the taxpayer industry information table, the attribute takes 3 values: Y (head office), N (not a head office) and F (branch office).

(2) Set up a 3-bit status register in which each state code has exactly one bit set to 1 and all remaining bits set to 0; the set of 3-bit state codes is {001, 010, 100}.

(3) Map the 3 state codes one-to-one onto the 3 discrete values: Y (head office), N (not a head office) and F (branch office) correspond to 001, 010 and 100 respectively, which determines the final One-Hot code of the head-office flag.
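The three encoding steps can be sketched as follows. The function name `build_one_hot_codes` is hypothetical, and the bit order is chosen only so that the result reproduces the codes 001, 010 and 100 given in the example.

```python
def build_one_hot_codes(values):
    """Map each discrete value to a one-hot vector, in the order the values are given."""
    m = len(values)  # step (1): number of discrete values
    codes = {}
    for index, value in enumerate(values):
        code = [0] * m  # step (2): an m-bit state code with a single 1
        # Place the 1 from the right so the first value gets 001, the second 010, ...
        code[m - 1 - index] = 1
        codes[value] = code  # step (3): one-to-one mapping value -> state code
    return codes

# ZJG_BZ takes the values Y (head office), N (not a head office), F (branch office).
zjg_bz_codes = build_one_hot_codes(["Y", "N", "F"])
encoded = [zjg_bz_codes[v] for v in ["N", "Y", "F"]]
```

Because every pair of distinct codes differs in exactly two positions, no artificial ordering is imposed on the categories when distances are computed later.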

S203. Merging the feature vectors

After the non-text attributes and the text attributes have been processed through steps S201 and S202, feature vectors suitable for calculation are obtained. These feature vectors are concatenated into a single space to form the complete feature of each sample.
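As a minimal sketch of the merge, assuming simple concatenation in a fixed order (the text does not state the ordering), with a hypothetical helper `merge_features` and toy dimensions far smaller than the real feature space:

```python
def merge_features(text_vector, numeric_vector, one_hot_vectors):
    """Concatenate all per-sample feature vectors into one flat feature vector."""
    merged = list(text_vector) + list(numeric_vector)
    for code in one_hot_vectors:
        merged.extend(code)
    return merged

# Hypothetical toy sample: 2 text dimensions, 2 numerical attributes,
# and 2 categorical attributes encoded with 3 and 2 bits respectively.
sample = merge_features([0.12, -0.40], [1.3, -0.7], [[0, 0, 1], [1, 0]])
```

The length of the merged vector fixes the number of input and output neurons of the network built in the next step.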

Step 3. Anomaly indicator generation and analysis

First, the detailed task of anomaly detection is described. This embodiment takes 2-level classification as an example, but the scope of application of the invention is not limited to anomaly detection in 2-level classification. FIG. 4 is a schematic diagram of anomalous data in a 2-level industry classification. P and Q are two clusters of first-level industry major categories. In the first-level cluster Q, A, B, C, D and E are second-level industry detail clusters, and samples a and f are industry major-category anomalies in cluster Q. Sample e is an industry detail anomaly in the second-level cluster D, and sample m is an industry detail anomaly in the second-level cluster A. Similarly, in the first-level cluster P, F, G, H, J and T are second-level industry detail clusters, sample q is an industry major-category anomaly in P, and sample t is an industry detail anomaly in the second-level cluster J. The ultimate aim of the invention is to identify anomalous data at both the industry major-category level and the industry detail level and to remove it from the original data, thereby providing more accurate and reliable original data for taxpayer industry classification.

As shown in FIG. 5, anomaly indicator generation and analysis comprises three parts: construction of the deep autoencoder network, generation of the industry major-category anomaly indicator, and generation of the industry detail anomaly indicator. When generating the industry major-category anomaly indicator, the middle layer of the deep autoencoder network extracts the principal component features of the industry major categories, and these features are then fused with the SOS (Stochastic Outlier Selection) anomaly detection algorithm to compute the industry major-category anomaly indicator. When generating the industry detail anomaly indicator, the reconstruction result that the autoencoder network outputs for each sample is used as the industry detail anomaly indicator. The steps of anomaly indicator generation and analysis are shown in FIG. 6, and the detailed construction process is as follows:

S301. Network structure design

First, the TADM network structure is determined. The number of input and output neurons of the TADM network is set according to the dimension of the sample feature space obtained in Step 2; this dimension equals N in FIG. 5. The network is designed with 5 layers. The input layer and the output layer both have N neurons, and this embodiment finally determines N to be 65. The second layer is a hidden layer with M neurons; M is finally determined to be 30 through experiments in this embodiment. The third layer is the middle hidden layer with K neurons, whose output is used as the basis for detecting industry major-category anomalies; in this embodiment, K is determined to be 12. The fourth layer has the same structure as the second layer. The output layer has the same number of neurons as the input layer, and all layers are fully connected.

Finally, TADM network training yields a model for anomaly detection, determined by the network parameters ω. The model is able to encode and decode normal data in the sample space, so the network copies normal samples from the input to the output more easily. Anomalous data differ greatly in distribution from normal data, so the network reconstructs anomalous samples poorly; this property of the model can therefore be used for anomaly detection and identification. Although different training strategies are used for industry major-category and industry detail anomaly detection during model training, the two tasks can share the same network structure, so the TADM does not need a redesigned network structure.

S302. Setting network parameters

After the TADM network structure is determined, the specific network parameters must be set. In this embodiment, all network layers are fully connected, and the activation function of the second and fourth layers is the hyperbolic tangent function:

$$f(x) = \tanh(x)$$

The activation function of the middle layer differs from that of the other layers: it is a ladder-like (staircase) activation function. In its formal expression, k denotes the index of the network layer, N denotes the number of steps of the activation function (not the input dimension of S301), and the parameter a controls the transition rate from one step to the next. Specifically, in this embodiment, N is 10, k is 3, and a is 100. The activation function quantizes the data into N discrete values, so that the original continuous data are mapped into a vector space of discrete values, which achieves data compression.
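The exact expression of the ladder-like activation is not reproduced in this text. The sketch below therefore assumes the staircase form commonly used in replicator neural networks, a sum of N - 1 shifted hyperbolic tangents rescaled to the interval [0, 1]; the function name `staircase` and this particular form are assumptions, not the confirmed formula of the embodiment.

```python
import math

def staircase(theta, n_steps=10, a=100.0):
    """Assumed ladder-like activation with n_steps output levels in [0, 1].

    Each tanh term contributes one step located at j / n_steps; the parameter a
    controls how sharp the transition between neighbouring steps is.
    """
    total = sum(math.tanh(a * (theta - j / n_steps)) for j in range(1, n_steps))
    return 0.5 + total / (2 * (n_steps - 1))

# With n_steps = 10 the outputs cluster around 0, 1/9, 2/9, ..., 1.
levels = [round(staircase(t), 3) for t in (0.05, 0.55, 0.95)]
```

With a = 100 the transitions are nearly vertical, so an input between 0.5 and 0.6 maps to approximately 5/9 regardless of where in that interval it falls, which is the quantization effect described above.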

The output layer uses a sigmoid activation function, formally expressed as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

S303. Network training strategy

In this embodiment, data from 25 industry major categories and the 112 industry detail categories under them are selected for the experiment. During network training, the data are divided into a training set, a validation set and a test set in the ratio 6:1:1, and the TADM network is then trained using cross-validation. Different network parameters are trained for different tasks, and the training strategies used for anomaly detection of industry major categories and of industry details are different.

Specifically, for industry detail anomaly detection, one network is trained for the industry details belonging to each industry major category. In this embodiment this network is called tiny-net_i, where i denotes the index of the corresponding industry major category. tiny-net_i is used specifically to detect anomalies among the industry details subordinate to the ith industry major category. For each tiny-net_i, the data are divided into a training set, a validation set and a test set in the ratio 6:1:1, the corresponding network is trained to update its parameters, and anomaly detection is then performed on the industry details subordinate to that industry major category.

For industry major-category anomaly detection, a single network is trained, referred to in this embodiment as large-net. Training this network requires data from all 25 industry major categories in the database. The data usage policy is as follows: the data in each industry major category are randomly divided in the ratio 6:1:1, the portions extracted from each category are then combined, and large-net is trained on the combined data using cross-validation.
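The per-category 6:1:1 division used for large-net can be sketched as below. This shows a single random split only, not the full cross-validation procedure; the helper names and the toy category labels "A" and "B" are hypothetical.

```python
import random

def split_6_1_1(samples, seed=0):
    """Randomly split one category's samples into train/validation/test at 6:1:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 8
    n_val = n // 8
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

def build_large_net_sets(data_by_major_class, seed=0):
    """Split every major category separately, then merge the parts across categories."""
    train, val, test = [], [], []
    for major_class in sorted(data_by_major_class):
        tr, va, te = split_6_1_1(data_by_major_class[major_class], seed)
        train.extend(tr)
        val.extend(va)
        test.extend(te)
    return train, val, test

# Hypothetical toy data: two major categories with 16 and 8 samples.
toy = {"A": list(range(16)), "B": list(range(100, 108))}
train, val, test = build_large_net_sets(toy)
```

Splitting inside each category before merging keeps every industry major category represented in all three sets in the same proportion, which a single global shuffle would not guarantee for small categories. For tiny-net_i, `split_6_1_1` would be applied to the data of major category i alone.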

S304. Generation of the industry major-category anomaly indicator

Data are distributed more sparsely in a high-dimensional space than in a low-dimensional space, so the clustering structure is lost and anomalous data are difficult to detect. Industry major-category anomaly detection therefore needs to reduce the dimension of the original data and extract the principal component information.

The large-net network obtained in step S303 is able to extract the principal component features of a sample. Industry major-category anomaly detection mainly uses the third layer of large-net: this layer extracts the principal component features, which are then fused with the SOS anomaly detection algorithm to perform industry major-category anomaly detection. The flow of the SOS anomaly detection algorithm is shown in FIG. 7, and the detailed detection steps are as follows:

(1) Generating the dissimilarity matrix between taxpayer features

large-net reduces the dimensionality of the taxpayer industry information data of the 25 industry major categories; the principal component feature matrix after dimensionality reduction is X. The dissimilarity matrix is denoted D = (d_ij)_{n×n}, where the matrix element d_ij indicates the degree of difference between the ith sample and the jth sample.

The dissimilarity in this embodiment takes the form of the Euclidean distance, expressed as:

$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$

where x_ij denotes the value of the jth feature of the ith sample, n denotes the number of samples, and m denotes the dimension of the feature vector of each sample.

The degree of association between taxpayer features is represented by the matrix A = (a_ij)_{n×n}. The association matrix is generated by constructing a nonlinear mapping that maps the dissimilarity matrix D to the association matrix A. Following the SOS algorithm, the mapping takes the form:

$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^2}{2\sigma_i^2}\right), & i \neq j \\ 0, & i = j \end{cases}$$

where σ_i^2 is the variance of the principal component feature vector X_i, which expresses the degree of dispersion of the data corresponding to the ith principal component feature in the taxpayer industry information, and a_ij indicates the degree of association between sample i and sample j.

The association probability between taxpayer features is represented by the matrix B = (b_ij)_{n×n}. In computing the association probability matrix, each sample is abstracted as a vertex of a graph model, and the weight of the edge between two vertices represents their association probability. In this embodiment, following the SOS algorithm, the association probability is computed as:

$$b_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{ik}}$$

In graph-theoretic terms, the probability that a sample is an outlier is the probability that the in-degree of its vertex is zero. The anomaly probability is derived from the association probabilities between taxpayer features and, following the SOS algorithm, is computed as:

$$P(X_i \in \text{Outlier}) = \prod_{j \neq i} (1 - b_{ji})$$

where P(X_i ∈ Outlier) is the probability that sample X_i is anomalous data. Once the anomaly probabilities have been computed, the industry major-category data are evaluated for anomalies against a given threshold.
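The SOS steps above can be sketched end to end as follows. The sketch assumes the standard Stochastic Outlier Selection definitions (Gaussian affinity, row-normalized binding probability, product over incoming edges) and takes the per-sample variances as an input, because the text does not spell out how they are estimated; the function name and the toy data are hypothetical.

```python
import math

def sos_outlier_probabilities(X, variances):
    """Return P(X_i is an outlier) for every row of X (assumed standard SOS form)."""
    n = len(X)
    # (1) Dissimilarity matrix D: Euclidean distance between samples i and j.
    D = [[math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
          for j in range(n)] for i in range(n)]
    # (2) Association (affinity) matrix A: Gaussian of the distance, zero on the diagonal.
    A = [[0.0 if i == j else math.exp(-D[i][j] ** 2 / (2.0 * variances[i]))
          for j in range(n)] for i in range(n)]
    # (3) Association probability matrix B: each row of A normalized to sum to 1.
    B = []
    for i in range(n):
        row_sum = sum(A[i])
        B.append([a / row_sum if row_sum > 0 else 0.0 for a in A[i]])
    # (4) Outlier probability: probability that no other sample links to sample i.
    return [math.prod(1.0 - B[j][i] for j in range(n) if j != i) for i in range(n)]

# Hypothetical 2-dimensional principal component features: three close points, one far away.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
probs = sos_outlier_probabilities(X, variances=[1.0, 1.0, 1.0, 1.0])
```

In this toy example the isolated fourth point receives almost no incoming association probability, so its outlier probability is close to 1, while the three clustered points bind to one another and score much lower. In the embodiment, X would hold the 12-dimensional third-layer outputs of large-net.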

S305. Generation of the industry detail anomaly indicator

The tiny-net_i of each industry major category obtained in S303 (i denotes the index of the industry major category) is able to self-encode industry detail data. tiny-net_i performs self-encoding on all industry detail samples in industry major category i and computes the reconstruction error of each industry detail sample. In the formal expression of the reconstruction error, N represents the number of neurons in the input and output layers, x_ij represents the value of the ith sample at the jth neuron of the input layer, and o_ij represents the value of the ith sample at the jth neuron of the output layer.
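The exact expression of the reconstruction error is not reproduced in this text. The sketch below assumes the mean squared difference between input and output over the N neurons, a common choice for autoencoder-based outlier scores; both the form and the name `reconstruction_error` are assumptions.

```python
def reconstruction_error(x, o):
    """Assumed reconstruction error: mean squared difference over the N neurons."""
    n = len(x)
    if n != len(o):
        raise ValueError("input and output vectors must have the same length")
    return sum((xi - oi) ** 2 for xi, oi in zip(x, o)) / n

# Hypothetical 3-neuron example: input vector x and the network's reconstruction o.
x = [0.2, 0.8, 0.5]
o = [0.2, 0.6, 0.9]
error = reconstruction_error(x, o)
```

A sample that tiny-net_i reproduces almost exactly yields an error near zero, whereas a sample unlike the data the network was trained on yields a larger error and becomes a candidate industry detail anomaly.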

Step 4. Anomaly evaluation

Once the TADM model has been constructed, anomaly results must be produced. The TADM model provides specific calculation schemes for the anomaly indicators of industry major categories and industry details, but the indicators produced by the model do not directly state whether a data item is anomalous, so the results require further evaluation.

In the anomaly evaluation of industry major categories, an anomaly probability threshold θ is set. If the anomaly probability P of a sample is greater than θ, the sample is judged to be anomalous data of the industry major category. In this embodiment, θ is set to 0.15.

In the anomaly evaluation of industry details, a reconstruction error threshold ε is set. If the reconstruction error of a sample is greater than ε, the sample is judged to be anomalous data. In this embodiment, ε is set to 0.08.

Step 5. Anomalous data output

The evaluation results for industry major categories and industry details from Step 4 are combined, the anomalous data in the taxpayer industry information are marked, and the identifiers of the anomalous data are output. The data marked as anomalous are then deleted from the original data, thereby providing more accurate and reliable data for the industry classification task.
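Steps 4 and 5 can be sketched together as follows. The thresholds are the values given for this embodiment; the function names, the record identifiers and the rule of flagging a sample when either indicator exceeds its threshold are assumptions made for illustration.

```python
THETA = 0.15    # anomaly probability threshold for industry major categories
EPSILON = 0.08  # reconstruction error threshold for industry details

def flag_anomalies(major_probs, detail_errors, theta=THETA, epsilon=EPSILON):
    """Return the indices of samples flagged by either of the two indicators."""
    flagged = []
    for index, (p, e) in enumerate(zip(major_probs, detail_errors)):
        # Strictly greater than the threshold, as stated in Step 4.
        if p > theta or e > epsilon:
            flagged.append(index)
    return flagged

def remove_anomalies(records, flagged):
    """Return the records with the flagged indices removed."""
    drop = set(flagged)
    return [r for i, r in enumerate(records) if i not in drop]

# Hypothetical indicators for four taxpayer records.
records = ["taxpayer-0", "taxpayer-1", "taxpayer-2", "taxpayer-3"]
major_probs = [0.05, 0.40, 0.10, 0.12]
detail_errors = [0.01, 0.02, 0.20, 0.08]
flagged = flag_anomalies(major_probs, detail_errors)
cleaned = remove_anomalies(records, flagged)
```

Here the second record is flagged at the major-category level and the third at the detail level; the fourth record sits exactly on the detail threshold and is kept because the comparison is strict.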

It will be understood by those skilled in the art that the foregoing is only exemplary of the method of the present invention and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. An enterprise industry classification-oriented anomaly detection method is characterized by comprising the following steps:

firstly, extracting the text information and non-text information to be mined from taxpayer industry information, and performing feature processing and encoding; secondly, constructing a deep network structure suited to the industry classification anomaly detection problem, and determining the number of neurons of the input layer and the output layer of the network according to the feature dimension of the encoded data; thirdly, based on the constructed deep network structure, training the industry major-category network and the industry detail networks separately, using different training strategies with cross-validation; finally, performing anomaly detection on industry major categories by using the dimension reduction property of the industry major-category network fused with an SOS anomaly detection algorithm, and performing anomaly detection on industry details according to the reconstruction property of the industry detail networks; the method specifically comprises the following implementation steps:

1) taxpayer text attribute processing

Analyzing text information in the taxpayer industry information table, extracting representative text attributes, and performing anomaly detection by using the extracted attributes; the taxpayer text attribute processing specifically comprises the following steps:

step1. preprocessing of text information

step2. word segmentation based on the Ansj word segmenter

step3. construction of word vectors

Weighting words of all samples according to the proportion of different types of texts in the word segmentation corpus; screening out words with larger weights, reserving N keywords with the largest weights in each corpus, and converting the N keywords into word vectors by using a word2vec tool;

2) non-text attribute processing

The non-text attributes of the taxpayer industry information comprise two parts: numerical attributes and categorical attributes; the numerical attributes are processed using the z-score standardization method, and the categorical attributes are encoded using One-Hot; accordingly, the non-text attribute processing of taxpayer industry information comprises two parts: numerical attribute processing and categorical attribute processing;

step1. calculate the mean of each attribute

$$u_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$$

wherein n represents the number of taxpayer industry information samples, and x_ij represents the jth numerical attribute value of the ith sample;

step2. calculate the variance of each attribute

step3. standardize the data

$$X'_i = \frac{X_i - u_i}{\sigma_i}$$

wherein X'_i is the result after z-score processing, X_i is the column vector corresponding to the ith numerical attribute, u_i represents the mean of the ith numerical attribute, and σ_i represents the standard deviation (the square root of the variance) of the ith numerical attribute;

step3. map the M state codes one-to-one onto the M discrete values, so that the value of each attribute is determined as an M-dimensional vector representing the One-Hot code of that attribute value;

3) anomaly indicator generation and analysis

4) anomaly evaluation

The first-level industry major-category anomaly indicator is obtained by the TADM network and the SOS anomaly detection algorithm; the SOS anomaly detection algorithm finally computes the anomaly probability of each sample, and an anomaly probability threshold is given; the anomaly probabilities of all taxpayer feature samples are compared with the threshold, and if the anomaly probability of a sample is greater than the threshold, the sample is judged to be anomalous data within the industry major category;

the second-level industry detail anomaly indicator is obtained by reconstruction with the TADM network; the TADM network finally computes the reconstruction error of each sample, and a reconstruction error threshold is given; the reconstruction errors of all industry detail samples are compared with the threshold, and if the reconstruction error of a sample is greater than the threshold, the sample is judged to be anomalous data within the industry detail.

2. The enterprise industry classification-oriented anomaly detection method according to claim 1, wherein in step 3), the TADM-based calculation of the industry classification anomaly indicators specifically comprises constructing the TADM network, generating the industry major-category anomaly indicator, and generating the industry detail anomaly indicator, as follows:

step1. TADM network construction

step2. Generation of industry major anomaly indicators

(1) generating a dissimilarity matrix between taxpayer characteristics

wherein,

step3. Generation of industry detail anomaly indicators