CN109657947B - Enterprise industry classification-oriented anomaly detection method - Google Patents
- Publication number
- CN109657947B (application number CN201811489291.XA)
- Authority
- CN
- China
- Prior art keywords
- industry
- network
- taxpayer
- attribute
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention discloses an anomaly detection method for enterprise industry classification, comprising the following steps. First, the text and non-text information to be mined is extracted from taxpayer industry information and subjected to feature processing and encoding. Second, a deep network structure suited to the industry classification anomaly detection problem is constructed, and the number of neurons in the input and output layers is determined from the feature dimension of the encoded data. Third, based on this deep network structure, separate networks for industry major categories and industry details are trained with different training strategies using cross-validation. Finally, anomaly detection for industry major categories is performed by combining the dimension-reduction property of the major-category network with the SOS anomaly detection algorithm, and anomaly detection for industry details is performed using the reconstruction property of the detail network. By applying the TADM model to detect anomalies in the original data, the invention supports more reasonable and accurate analysis in macro-level administrative work such as national statistics, taxation, and industrial and commercial administration.
Description
Technical Field
The invention belongs to the field of data mining, and particularly relates to an anomaly detection method for enterprise industry classification based on TADM (Two-level Anomaly Detection Model).
Background
Since the reform and opening-up, China's national economy has developed rapidly, the market economy has continued to flourish, the national economic structure has gradually improved, and the division of enterprises into industries has become increasingly fine-grained. In this new period, research on enterprise industry classification plays a fundamental role in promoting the management of finance, taxation and national standards, and provides a basis for further analyzing the current state of national economic industries and grasping development trends in the national economy. The Industrial Classification for National Economic Activities (GB/T 4754-2017), issued by the General Administration of Quality Supervision, Inspection and Quarantine and the Standardization Administration of China, specifies the industry classification and codes of enterprise economic activities, comprising 97 industry major categories and 1380 industry details. When an enterprise registers, the industrial and commercial administration must determine its national economic industry classification from information such as its business scope. However, enterprise industry classification is currently performed mainly by hand; it is limited by the professional knowledge and experience of staff, and classification errors are frequent when large numbers of enterprises must be classified. Incorrect classification has a series of adverse effects on national statistics, taxation, industrial and commercial administration and similar work, so detecting and identifying anomalies in enterprise industry classification by computer program has become an urgent problem.
At present, no related research offers a solution specifically for detecting anomalies in enterprise industry classification. The published techniques aim at general-purpose anomaly detection; representative works are:
Document 1: Distributed outlier detection method and system based on an autoencoder (201410225026.6)
Document 2: local outlier detection method based on density (201710559390.X)
Document 3: multi-dimensional data anomaly detection method and device (201710411852.3)
Although these existing methods can solve specific anomaly detection problems, they are difficult to extend directly to industry classification, because anomaly detection for industry classification is both multi-class and multi-level. First, enterprise industry classification is a multi-class problem, and the large number of categories and the large volume of data complicate anomaly detection. The autoencoder networks of Documents 1 and 3 are too simple: with only one hidden layer they cannot effectively extract detailed features of the data, and they generalize poorly on large-scale datasets. Document 2 defines a local outlier coefficient using k-neighborhood dispersion, but with many industry categories and a large amount of industry information data, choosing the value of k becomes extremely difficult. Second, industry major categories and industry details are hierarchically related and belong to different levels: an industry detail is an extension and refinement of an industry major category, and every enterprise corresponds to one industry major category and one industry detail. Anomaly analysis therefore requires data at different information granularities (that is, different levels of detail), and Documents 1 to 3 offer no solution for multi-level anomaly detection in industry classification.
To address the fact that industry major categories and industry details belong to different levels, a deep autoencoder network model is introduced. A deep autoencoder network also has a distinct hierarchical character: during encoding and decoding, the outputs of different network layers correspond to different feature spaces of the network, and these feature spaces represent information at different granularities, which matches the hierarchical character of industry classification. A deep autoencoder network can therefore address anomaly detection for industry major categories and for industry details at the same time.
Disclosure of Invention
The invention aims to provide a TADM-based anomaly detection method for enterprise industry classification. First, the text and non-text information to be mined is extracted from taxpayer industry information and subjected to feature processing and encoding. Second, a deep network structure suited to the industry classification anomaly detection problem is constructed, and the number of neurons in the input and output layers is determined from the feature dimension of the encoded data. Third, based on this deep network structure, separate networks for industry major categories and industry details are trained with different training strategies using cross-validation. Finally, anomaly detection for industry major categories is performed by combining the dimension-reduction property of the major-category network with the SOS (Stochastic Outlier Selection) algorithm, and anomaly detection for industry details is performed using the reconstruction property of the detail network. The two processes are independent of each other, so anomaly detection for industry major categories and for industry details can proceed simultaneously.
The invention is realized by adopting the following technical scheme:
An anomaly detection method for enterprise industry classification, comprising the following steps:
First, the text and non-text information to be mined is extracted from taxpayer industry information and subjected to feature processing and encoding. Second, a deep network structure suited to the industry classification anomaly detection problem is constructed, and the number of neurons in the input and output layers is determined from the feature dimension of the encoded data. Third, based on this deep network structure, separate networks for industry major categories and industry details are trained with different training strategies using cross-validation. Finally, anomaly detection for industry major categories is performed by combining the dimension-reduction property of the major-category network with the SOS anomaly detection algorithm, and anomaly detection for industry details is performed using the reconstruction property of the detail network.
In a further improvement of the invention, the method specifically comprises the following implementation steps:
1) Taxpayer text attribute processing
The text information in the taxpayer industry information table is analyzed, representative text attributes are extracted, and the extracted attributes are used for anomaly detection;
2) Non-text attribute processing
The non-text attributes of the taxpayer industry information are of two kinds: numerical attributes and categorical attributes. Numerical attributes are processed with the z-score standardization method, and categorical attributes are encoded with One-Hot encoding;
3) Anomaly indicator generation and analysis
Anomaly indicators are generated and analyzed using a deep-learning autoencoder network as the prototype. Based on the principle that industry information at different levels carries different information granularity, a TADM-based method for calculating industry classification anomaly indicators is designed;
4) Anomaly evaluation
The first-level anomaly indicator, for industry major categories, is obtained from the TADM network and the SOS anomaly detection algorithm. The model calculates the anomaly probability of each sample and an anomaly probability threshold is given; the anomaly probability of every taxpayer feature sample is compared with the threshold, and a sample whose anomaly probability exceeds the threshold is judged to be anomalous data within its industry major category;
The second-level anomaly indicator, for industry details, is obtained from reconstruction by the TADM network. The model calculates the reconstruction error of each sample and a reconstruction error threshold is given; the reconstruction error of every industry-detail sample is compared with the threshold, and a sample whose reconstruction error exceeds the threshold is judged to be anomalous data within its industry detail.
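The two threshold comparisons in step 4) can be sketched as below. This is a minimal illustration only: the function name `flag_anomalies`, the sample scores and both threshold values are hypothetical, and the text does not say how the thresholds are chosen.

```python
from typing import List, Sequence


def flag_anomalies(scores: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of samples whose score is strictly greater than the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


# Hypothetical level-1 scores: one SOS anomaly probability per sample.
major_probs = [0.12, 0.91, 0.40, 0.77]
# Hypothetical level-2 scores: one reconstruction error per sample.
detail_errors = [0.03, 0.02, 0.58, 0.04]

# Threshold values are assumptions chosen only for this example.
major_anomalies = flag_anomalies(major_probs, threshold=0.75)
detail_anomalies = flag_anomalies(detail_errors, threshold=0.30)
```

Because the two levels are scored independently, a sample may be flagged at one level, at both, or at neither.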
In a further improvement of the invention, in step 1), the taxpayer text attribute processing specifically comprises the following steps:
Step1. Preprocessing of text information
Text preprocessing standardizes the taxpayer industry information. Specifically: (1) deleting garbled characters in the database; (2) deleting numerals and measure words in the text attributes; (3) deleting records marked as null in the database;
Step2. Word segmentation based on the Ansj word segmenter
An industry classification domain dictionary is constructed from the national economic industry classification, and a stop-word dictionary is constructed from a lexicon of place names for the four levels of national administrative divisions (province, city, county, district). Using the constructed dictionaries, the text is segmented with the Ansj segmenter and a segmented corpus is built;
Step3. Construction of word vectors
The words of all samples are weighted according to the proportions of the different types of text in the segmented corpus. Words with larger weights are selected, the N keywords with the largest weights in each corpus are retained, and these N keywords are converted into word vectors with the word2vec tool.
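The three cleaning operations in this preprocessing step might look like the sketch below. Everything specific here is an assumption made for illustration: the function names, the short list of measure words, the rule that only CJK ideographs, ASCII letters and a few separators count as valid characters, and the treatment of the literal string "NULL" as a null marker.

```python
import re
from typing import Iterable, List, Optional

# Assumption: a small illustrative set of Chinese measure words; a real list would be longer.
MEASURE_WORDS = "个件台套吨米"
# Digits (ASCII or full-width), an optional decimal part, and an optional trailing measure word.
NUMBER_PATTERN = re.compile(r"[0-9０-９]+(?:\.[0-9０-９]+)?[" + MEASURE_WORDS + r"]?")
# Assumption: anything other than CJK ideographs, ASCII letters and common separators is garbled.
GARBLED_PATTERN = re.compile(r"[^\u4e00-\u9fffA-Za-z、，,；; ]")


def clean_text(raw: Optional[str]) -> Optional[str]:
    """Clean one text attribute value; return None when the value is null or empties out."""
    if raw is None or raw.strip() == "" or raw.strip().upper() == "NULL":
        return None
    text = NUMBER_PATTERN.sub("", raw)      # (2) drop numerals and measure words
    text = GARBLED_PATTERN.sub("", text)    # (1) drop garbled characters
    text = text.strip()
    return text or None


def clean_records(records: Iterable[Optional[str]]) -> List[str]:
    """Clean every record and discard those that are null after cleaning (3)."""
    cleaned = (clean_text(record) for record in records)
    return [text for text in cleaned if text is not None]
```

Numerals are removed before the garbled-character pass so that a measure word attached to a number is removed together with it.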
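The keyword selection step can be sketched as follows. The text does not give the weighting formula, so the one used here is an assumption: a word's weight in a category is its relative frequency within that category multiplied by the share of its total occurrences that fall in that category. The function name `top_keywords` and the toy corpora are hypothetical, and the word2vec conversion is not shown.

```python
from collections import Counter
from typing import Dict, List


def top_keywords(corpora: Dict[str, List[List[str]]], n: int) -> Dict[str, List[str]]:
    """For each category, return the n words with the largest assumed weight."""
    class_counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in corpora.items()
    }
    total_counts: Counter = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    result: Dict[str, List[str]] = {}
    for label, counts in class_counts.items():
        class_total = sum(counts.values())
        # Assumed weight: in-category frequency times the word's concentration in this category.
        weights = {
            word: (count / class_total) * (count / total_counts[word])
            for word, count in counts.items()
        }
        # Sort by descending weight, breaking ties alphabetically for determinism.
        ranked = sorted(weights, key=lambda word: (-weights[word], word))
        result[label] = ranked[:n]
    return result


# Hypothetical segmented corpora, one list of token lists per category.
corpora = {
    "retail": [["clothing", "retail", "sales"], ["clothing", "sales"]],
    "construction": [["building", "engineering", "sales"], ["building", "works"]],
}
keywords = top_keywords(corpora, n=2)
```

Words that appear across many categories, such as "sales" here, are down-weighted relative to words concentrated in one category.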
In a further improvement of the invention, in step 2), the non-text attribute processing of the taxpayer industry information comprises two parts: numerical attribute processing and categorical attribute processing;
The numerical attributes are processed with the z-score standardization method, in the following steps:
Step1. Calculate the mean of each attribute
Let $u = (u_1, u_2, \ldots, u_m)$ be the mean vector, where $m$ is the number of numerical attributes and $u_j$ is the mean of the $j$-th numerical attribute, calculated as:
$$u_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$
where $n$ is the number of taxpayer industry information samples and $x_{ij}$ is the $j$-th numerical attribute value of the $i$-th sample;
Step2. Calculate the variance of each attribute
Let $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_m)$, where $m$ is the number of numerical attributes and $\sigma_j^2$ is the variance of the $j$-th numerical attribute, calculated as:
$$\sigma_j^{2} = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - u_j)^{2}$$
The mean and the variance are the basic statistics of a numerical attribute, and the attribute can be standardized using them;
Step3. Standardize the data
The sample data is standardized using the mean and variance calculated in the two preceding steps:
$$X_j' = \frac{X_j - u_j}{\sigma_j}$$
where $X_j'$ is the result of z-score processing, $X_j$ is the column vector of the $j$-th numerical attribute, $u_j$ is the mean of the $j$-th numerical attribute, and $\sigma_j$ is the square root of its variance;
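A minimal sketch of z-score standardization applied column by column is given below. It assumes the population variance (division by n) and maps a constant column to zeros to avoid division by zero; neither choice is stated in the text, and the function name `zscore_columns` is hypothetical.

```python
import math
from typing import List


def zscore_columns(rows: List[List[float]]) -> List[List[float]]:
    """Standardize each column of a row-major table to zero mean and unit spread."""
    n = len(rows)
    m = len(rows[0])
    # Step 1: mean of each numerical attribute (column).
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    # Step 2: variance of each attribute; assumption: population variance (divide by n).
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
        for j in range(m)
    ]
    # Step 3: standardize; assumption: a constant column is mapped to 0.0.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(m)]
        for row in rows
    ]


# Hypothetical data: three samples, two numerical attributes (the second is constant).
z = zscore_columns([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
```

After this step, attributes measured on very different scales contribute comparably to the distances used later.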
The categorical attributes are encoded with One-Hot encoding, in the following steps:
Step1. Determine the number of discrete values of the attribute. If the attribute has M discrete values, the M states are encoded with an M-bit state register, in which each state has its own register bit and only one bit is active at a time;
Step2. Set up the M-bit state register so that, for each state, exactly one bit is 1 and the rest are 0; this turns differences between categorical values into distances in Euclidean space;
Step3. Map the M state codes one-to-one onto the M discrete values, so that each attribute value is represented by an M-dimensional vector, which is the One-Hot code of that value.
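The One-Hot steps can be sketched as below. The function name `fit_one_hot`, the example category values and the choice to order categories alphabetically are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def fit_one_hot(values: Sequence[str]) -> Dict[str, List[int]]:
    """Map each distinct value to an M-dimensional vector with exactly one bit set."""
    # Assumption: categories are ordered alphabetically to make the mapping deterministic.
    categories = sorted(set(values))
    size = len(categories)
    return {
        category: [1 if position == index else 0 for position in range(size)]
        for index, category in enumerate(categories)
    }


# Hypothetical categorical attribute with M = 3 distinct values.
codes = fit_one_hot(["wholesale", "retail", "agency", "retail"])
```

Any two distinct codes differ in exactly two positions, so every pair of categories is equally far apart in Euclidean space.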
In a further improvement of the invention, step 3) specifically comprises the construction of the TADM network, the generation of industry major-category anomaly indicators, and the generation of industry detail anomaly indicators, as follows:
Step1. TADM network construction
The TADM network is a five-layer neural network. The input and output layers have the same number of neurons, equal to the total dimension of the data obtained from processing the taxpayer text and non-text attributes. The network takes the taxpayer information feature vector as input and produces its reconstruction as output;
Specifically, the TADM network is a standard feedforward neural network whose weights are updated with the BP (back-propagation) algorithm, which minimizes the overall reconstruction error OF and updates the network parameters. Formally:
$$\min_{\Omega} OF = \min_{\Omega} \frac{1}{n}\sum_{i=1}^{n} OF_i$$
where $\Omega$ denotes the network parameters, $n$ the number of samples, and $OF_i$ the reconstruction error of the $i$-th sample; the encoding and decoding capability of the network is determined by $\Omega$;
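The shape of a five-layer feedforward network with equal input and output widths can be sketched as below. This shows a forward pass only, with untrained random weights; back-propagation training is not shown. The layer sizes, the tanh activation, the weight range and all names are assumptions made for illustration and are not taken from the text.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create one fully connected layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x: List[float], layers: List[Layer]) -> List[List[float]]:
    """Return the activations of every layer, starting with the input itself."""
    activations = [list(x)]
    for weights, biases in layers:
        previous = activations[-1]
        activations.append([
            math.tanh(sum(w * v for w, v in zip(row, previous)) + bias)
            for row, bias in zip(weights, biases)
        ])
    return activations


rng = random.Random(0)
# Hypothetical sizes: five layers of neurons, input width equal to output width.
sizes = [8, 4, 2, 4, 8]
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

activations = forward([0.1] * 8, layers)
middle_features = activations[2]   # low-dimensional code from the middle layer
reconstruction = activations[-1]   # output with the same width as the input
```

The middle-layer activations play the role of the reduced features, and the final activations play the role of the reconstruction.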
Step2. Generation of industry major-category anomaly indicators
Using the dimension-reduction property of the middle layer of the TADM network, the principal-component spatial features of the middle layer are extracted and combined with the SOS anomaly detection algorithm to detect anomalies in industry major categories;
The SOS anomaly detection algorithm is a similarity-based anomaly detection method: it decides whether a sample is anomalous by calculating its anomaly probability, with a suitable threshold chosen as the criterion for evaluating anomalous data. It comprises the following steps:
(1) Generating the dissimilarity matrix between taxpayer features
The dissimilarity between each pair of taxpayer features is calculated; it expresses the degree of difference between two data points and is given by the Euclidean distance:
$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}}$$
where the feature matrix of the original data is $X = (x_{ij})_{n \times m}$, each row of $X$ being the feature vector of one sample; $n$ is the number of rows, i.e. the number of samples; $m$ is the number of columns, i.e. the number of attributes of each sample; the dissimilarity matrix is $D = (d_{ij})_{n \times n}$, whose element $d_{ij}$ is the degree of difference between sample $i$ and sample $j$;
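A minimal sketch of the pairwise Euclidean dissimilarity matrix follows. The function name `dissimilarity_matrix` and the sample points are hypothetical.

```python
import math
from typing import List, Sequence


def dissimilarity_matrix(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the n x n matrix of Euclidean distances between the rows of X."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            # The matrix is symmetric, so each distance is computed once and mirrored.
            D[i][j] = distance
            D[j][i] = distance
    return D


# Hypothetical feature vectors for three samples with two attributes each.
D = dissimilarity_matrix([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
```

The diagonal is zero because every sample has zero distance to itself.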
(2) Generating the association degree matrix between taxpayer features from the dissimilarity information
The association degree expresses how closely each taxpayer feature is related to the other taxpayer features. The association degree matrix is generated by constructing a nonlinear mapping from the dissimilarity matrix $D$ to the association degree matrix $A = (a_{ij})_{n \times n}$; the mapping is:
$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^{2}}{2\sigma_i^{2}}\right), & i \neq j \\ 0, & i = j \end{cases}$$
where $\sigma_i^{2}$ is the variance associated with feature vector $X_i$, representing the dispersion of the data corresponding to the $i$-th principal-component feature in the taxpayer industry information; $a_{ij}$ is the association degree between sample $i$ and sample $j$;
(3) Generating the association probability matrix between taxpayer features from the association degree information
The association probabilities between taxpayer features are represented by the matrix $B = (b_{ij})_{n \times n}$. In calculating this matrix, each sample is abstracted as a point in a graph model, and the weight of the edge between two points represents their association probability. Formally:
$$b_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{ik}}$$
where $b_{ij}$ is the association probability between point $i$ and point $j$, satisfying $\sum_{j=1}^{n} b_{ij} = 1$;
(4) Deriving the anomaly probability of each taxpayer feature from the association probabilities
In graph-theoretic terms, the probability that a taxpayer feature sample is an anomalous point is the probability that the corresponding point in the graph model has in-degree zero. The anomaly probability is derived from the association probabilities between samples as:
$$P(X_i \in \mathrm{Outlier}) = \prod_{j \neq i} (1 - b_{ji})$$
where $P(X_i \in \mathrm{Outlier})$ is the probability that sample $X_i$ is anomalous data; industry major-category anomalies are then evaluated against a given threshold;
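Steps (2) to (4) can be sketched end to end as below, following the standard Stochastic Outlier Selection formulation: a Gaussian affinity from each distance, row normalization to obtain association probabilities, and a product over the other samples to obtain the outlier probability. One simplification is an assumption of this sketch: the per-sample variances are passed in as fixed numbers, whereas standard SOS tunes them from a perplexity setting. The function name and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def sos_outlier_probabilities(
    D: Sequence[Sequence[float]], variances: Sequence[float]
) -> List[float]:
    """Compute an outlier probability per sample from a dissimilarity matrix."""
    n = len(D)
    # (2) Association degree: Gaussian of the distance, zero on the diagonal.
    A = [
        [0.0 if i == j else math.exp(-D[i][j] ** 2 / (2.0 * variances[i])) for j in range(n)]
        for i in range(n)
    ]
    # (3) Association probability: normalize each row so that it sums to 1.
    B = []
    for row in A:
        row_sum = sum(row)
        # Assumption: a row that underflows to all zeros is left as zeros.
        B.append([a / row_sum if row_sum > 0 else 0.0 for a in row])
    # (4) Outlier probability: chance that no other sample links to sample i.
    probabilities = []
    for i in range(n):
        p = 1.0
        for j in range(n):
            if j != i:
                p *= 1.0 - B[j][i]
        probabilities.append(p)
    return probabilities


# Hypothetical one-dimensional samples at positions 0, 1, 2 and 10.
points = [0.0, 1.0, 2.0, 10.0]
D = [[abs(a - b) for b in points] for a in points]
probs = sos_outlier_probabilities(D, variances=[1.0, 1.0, 1.0, 1.0])
```

The isolated sample at position 10 receives almost no association probability from the others, so its outlier probability is close to 1.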
Step3. Generation of industry detail anomaly indicators
Detecting anomalies in industry details requires more feature information than detecting anomalies in industry major categories. The TADM network constructed in Step1 is able to encode and decode normal data: encoding is the process of feeding data into the network and compressing it into a low-dimensional space, and decoding is the process of mapping the low-dimensional data back to the high-dimensional space as output. Anomalies in industry details are judged from the reconstruction error of each sample, calculated as:
$$OF_i = \frac{1}{N}\sum_{j=1}^{N} (x_{ij} - o_{ij})^{2}$$
where $OF_i$ denotes the reconstruction error of the $i$-th sample, $N$ denotes the number of neurons in the input layer and in the output layer, $x_{ij}$ is the value of the $j$-th input-layer neuron for the $i$-th sample, and $o_{ij}$ is the value of the $j$-th output-layer neuron for the $i$-th sample.
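A per-sample reconstruction error can be sketched as below. It assumes the error is the mean squared difference between input and output neuron values; that form matches the symbols defined above, but it is an assumption of this sketch. The function name and the sample values are hypothetical.

```python
from typing import List, Sequence


def reconstruction_errors(
    inputs: Sequence[Sequence[float]], outputs: Sequence[Sequence[float]]
) -> List[float]:
    """Return one error per sample: the mean squared input/output difference."""
    errors = []
    for x, o in zip(inputs, outputs):
        # Assumption: squared differences are averaged over the N neurons.
        errors.append(sum((xj - oj) ** 2 for xj, oj in zip(x, o)) / len(x))
    return errors


# Hypothetical inputs and network outputs for two samples with N = 3 neurons.
inputs = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
outputs = [[1.0, 0.0, 0.5], [1.0, 0.0, 0.0]]
errors = reconstruction_errors(inputs, outputs)
```

A sample that the network reproduces exactly has error 0, while a poorly reproduced sample has a large error and becomes a candidate anomaly.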
The invention has the following beneficial technical effects:
The anomaly detection method for enterprise industry classification provided by the invention supplies more accurate and reliable data for industry classification and other analyses of enterprise industry data. It improves on existing anomaly detection techniques so that they suit anomaly detection in industry classification. Compared with the prior art, the invention has the following advantages:
(1) The invention combines taxpayers' text and non-text information for industry classification. The prior art generally performs only semantic analysis of text information and so uses the available information incompletely. The invention applies natural language processing to the text information and feature vectorization to the non-text information, integrating the important attributes in the taxpayer industry information and using that information more fully.
(2) The invention applies an autoencoder network model from deep learning to anomaly detection in industry classification and discloses the TADM anomaly detection method, which uses the reconstruction property of the network to detect anomalies in industry details and the dimension-reduction property of the network, together with the SOS anomaly detection algorithm, to detect anomalies in industry major categories, thereby effectively solving two-level anomaly detection in industry classification.
(3) The network structure is reusable: the same structure can be used for anomaly detection in both industry major categories and industry details, without redesign.
Drawings
FIG. 1 is an overall framework flow diagram.
FIG. 2 is a text attribute processing flow diagram.
FIG. 3 is a non-text attribute processing flow diagram.
FIG. 4 is a schematic diagram of two-level anomalous data in taxpayer information.
Fig. 5 is a schematic diagram of two-level anomaly detection performed by the TADM model.
Fig. 6 is a flow chart of a TADM model implementation.
FIG. 7 is a flowchart of the calculation of industry major-category anomaly indicators.
Detailed Description
The invention is further described below with reference to the following figures and examples.
Taxpayer information registered with the national tax authority of a certain region from 2011 to 2017 is selected. It comprises sample data for 25 industry major categories and 112 industry details, and each industry major category contains several industry details. The invention is described in further detail below with reference to the accompanying drawings, experimental examples and embodiments. All techniques implemented on the basis of this disclosure fall within the scope of the invention.
As shown in FIG. 1, in this embodiment of the invention, the anomaly detection process for 2-level taxpayer industry classification comprises the following steps:
Step1. Text attribute processing
The taxpayer industry information table contains much valuable information and knowledge that can be used for industry classification, part of which is stored in the database as text. Nine text attributes are extracted from the registered taxpayer information table and the registered taxpayer information expansion table: {HY_DM, ZY, JY, JYFS, CBBZL, FSHY1_DM, FSHY2_DM, FSHY3_DM, JYFW}, which respectively denote: {industry code, main business, concurrent business, business mode, financial statement type, national subject industry 1, national subject industry 2, national subject industry 3, business scope}. As shown in FIG. 2, text attribute processing comprises the following steps:
S101. Preprocessing text information
The taxpayer industry information is standardized as follows: (1) garbled characters in the database are deleted; (2) numbers and measure words in the text are deleted; (3) records whose value is a null identifier are deleted.
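The three preprocessing rules of S101 can be sketched as follows. This is a minimal illustration only: the function name `preprocess_text`, the set of null markers, the characters treated as garbled, and the short list of measure words are all assumptions and are not taken from the patent.

```python
import re

# Hypothetical null markers; the real database convention is not stated in the text.
NULL_MARKERS = {"", "null", "none"}

def preprocess_text(value):
    """Return a cleaned text value, or None if the record should be dropped."""
    # Rule (3): null-identified data is deleted.
    if value is None or value.strip().lower() in NULL_MARKERS:
        return None
    # Rule (1): drop the Unicode replacement character and control characters,
    # which are assumed here to be the markers of garbled text.
    value = re.sub(r"[\ufffd\x00-\x1f]", "", value)
    # Rule (2): drop numbers and an immediately following measure word
    # (the measure-word list is a short illustrative sample).
    value = re.sub(r"\d+(?:\.\d+)?\s*(?:个|台|件|吨|套)?", "", value)
    return value.strip() or None
```

A value that becomes empty after cleaning is treated like a null value and dropped.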
S102. Word segmentation based on the Ansj word segmenter
Personal names, place names, industry descriptions and similar words in the taxpayer registration information often fall outside the coverage of the dictionary bundled with the word segmentation tool. To prevent the taxpayer information from being split into semantically incomplete fragments during segmentation, an industry classification professional dictionary is built from the national economic industry classification standard, and a stop-word dictionary of place names appearing in the taxpayer information is built from the national lexicon of four-level administrative division names (province, city, county and district). The text attributes are then segmented with the Ansj segmenter using the constructed dictionaries, producing a segmented corpus.
In this embodiment, each text is segmented with the Ansj segmenter, and stop words are removed, using the collected professional dictionary of the national economic industry classification standard and the place-name dictionary. For example, the JYFW (business scope) attribute value of one company is: "sales of energy-saving and environment-friendly products, equipment and accessories, building decoration materials, communication equipment, mechanical equipment, hardware power distribution, electronic products and daily-use general merchandise; development and sales of computer software and hardware; cleaning services". After word segmentation, the text is divided into independent words, and the segmentation result is: "energy-saving environment-friendly product, equipment, accessory, building decoration material, communication equipment, mechanical equipment, hardware, communication electronic product, daily-use general merchandise, sales computer software and hardware, development, sales and cleaning service".
S103. Constructing word vectors
The words of all samples are weighted according to the proportion that words occupy in the different classes of the corpus. Words with larger weights are retained: for each sample, the N keywords with the largest weights in its text are kept. This embodiment sets N = 25, and the 25 words of each sample are finally converted into a numerical vector with the word2vec tool.
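The keyword selection in S103 can be illustrated with a small sketch. The patent does not give the exact weighting formula, so the example below substitutes a plain TF-IDF weight as a stand-in; the function name `top_keywords` and the smoothing in the logarithm are assumptions. The word2vec conversion itself is not shown.

```python
import math
from collections import Counter

def top_keywords(docs, n_keep=25):
    """docs: list of token lists. Returns the n_keep highest-weighted tokens per document.

    The weight is TF-IDF, used here only as an assumed stand-in for the
    class-proportion weighting described in the text.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        weights = {
            tok: (count / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[tok]))
            for tok, count in tf.items()
        }
        # Sort by descending weight; ties are broken alphabetically for determinism.
        ranked = sorted(weights, key=lambda tok: (-weights[tok], tok))
        result.append(ranked[:n_keep])
    return result
```

With `n_keep=25` this matches the number of keywords retained per sample in this embodiment.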
Step2. Non-text attribute processing
Besides text information, the taxpayer registration information database also contains non-text information, which has more intuitive characteristics and is also valuable for taxpayer industry classification, cluster analysis and anomaly detection.
As shown in FIG. 3, the non-text attribute processing of this embodiment comprises the following steps:
S201. Processing numerical attributes
Although the values of numerical attributes can be used directly in calculation, different attributes generally have different dimensions (units) and orders of magnitude. To make the distribution of the processed data conform to a normal distribution as far as possible and to eliminate the influence of different dimensions, this embodiment processes the numerical attributes with the z-score method.
The registered taxpayer information table and the registered taxpayer information expansion table in the taxpayer industry information database are queried, and the numerical attributes {ZCZB, TZZE, CYRS, WJRS, HHRS, GDRS, ZRRTZBL, WZTZBL, GYTZBL} are extracted. Their meanings are, respectively: {registered capital, total investment, number of employees, number of foreign employees, number of partners, number of fixed employees, investment proportion of natural persons, investment proportion of foreign capital, investment proportion of state-owned capital}. The z-score processing is then applied to these 9 numerical attributes.
Specifically, in this embodiment, the z-score processing is calculated as:
$$X_i' = \frac{X_i - u_i}{\sigma_i}$$
where $X_i$ is the vector of values of the i-th numerical attribute of the taxpayer information, $u_i$ is the mean of the i-th numerical attribute, $\sigma_i$ is the standard deviation of the i-th numerical attribute, and $X_i'$ is the vector after z-score processing. After z-score processing, the distribution of the numerical attributes is close to the standard normal distribution.
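The z-score processing of the numerical attributes can be sketched with NumPy as below. This is a minimal illustration: the function name `z_score` and the handling of constant columns (divisor replaced by 1) are assumptions, and the population standard deviation is used.

```python
import numpy as np

def z_score(X):
    """Standardize each column (numerical attribute) of X to zero mean and unit spread."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation per attribute
    std[std == 0] = 1.0      # assumption: leave constant attributes at zero instead of dividing by 0
    return (X - mean) / std
```

Each row is one taxpayer sample and each column one of the 9 numerical attributes.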
S202. Processing categorical attributes
An anomaly detection algorithm has to measure distances between data points. The values of categorical attributes, however, are discrete and represent type identifiers rather than numerical quantities. The categorical attributes therefore need to be re-encoded so that the encoded values can be used for distance measurement.
The registered taxpayer information table and the registered taxpayer information expansion table in the taxpayer industry information database are queried, and 7 categorical attributes are extracted: {DJZCLX_DM, ZJG_BZ, GGHBZ, ZZLB_DM, HYMX_DM, SFCLGJXZJZHY, DZFPQY_BZ}. Their meanings are, respectively: {registration type, head-office flag, whether jointly administered by the national and local tax authorities, license category code, industry detail code, whether engaged in nationally restricted or prohibited industries, electronic invoice enterprise flag}. These categorical attributes are then encoded.
In this embodiment, the 7 categorical attributes are encoded with One-Hot encoding. Taking the attribute ZJG_BZ (head-office flag) as an example, the detailed encoding steps are as follows:
(1) Determine the number of discrete values of the head-office flag. According to the taxpayer industry information table, the attribute has 3 values: Y (head office), N (non-head office) and F (branch office).
(2) Set up a 3-bit status register for each value, in which exactly one bit is 1 and all others are 0. The set of 3-bit status codes is {001, 010, 100}.
(3) Map the 3 status codes to the 3 discrete values: Y (head office), N (non-head office) and F (branch office) correspond to 001, 010 and 100 respectively, which determines the final One-Hot code of the head-office flag.
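The three encoding steps for ZJG_BZ can be sketched as follows. The function name `one_hot` is hypothetical; the bit order is chosen so that Y, N and F map to 001, 010 and 100 as stated in the text.

```python
def one_hot(value, categories=("Y", "N", "F")):
    """Return the One-Hot code of `value` as a list of bits.

    The k-th category sets the k-th bit counted from the right,
    so ("Y", "N", "F") map to 001, 010 and 100.
    """
    m = len(categories)          # step (1): number of discrete values
    codes = {}
    for k, cat in enumerate(categories):
        bits = [0] * m           # step (2): an m-bit register with a single 1
        bits[m - 1 - k] = 1
        codes[cat] = bits        # step (3): one code per discrete value
    return codes[value]
```

Any two distinct codes are the same Euclidean distance apart, so no artificial ordering is introduced between categories.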
S203. Merging the feature vectors
After the text attributes (Step1) and the non-text attributes (S201 and S202) have been processed, feature vectors usable for calculation are obtained. They are concatenated into a single space to form the complete feature vector of each sample.
Step3. Anomaly index generation and analysis
First, the task of anomaly detection is described in detail. This embodiment takes 2-level classification as an example, but the application of the invention is not limited to anomaly detection in 2-level classification. FIG. 4 is a schematic diagram of anomaly data for a 2-level industry classification. P and Q are two clusters of first-level industry major categories. In the first-level cluster Q, A, B, C, D and E are second-level industry detail clusters, and samples a and f are industry major category anomaly data in cluster Q. Sample e is industry detail anomaly data in second-level cluster D, and sample m is industry detail anomaly data in second-level cluster A. Similarly, in the first-level cluster P, F, G, H, J and T are second-level industry detail clusters, sample q is industry major category anomaly data in P, and sample t is industry detail anomaly data in second-level cluster J. The ultimate aim of the invention is to identify the anomaly data of industry major categories and industry details and remove them from the original data, thereby providing more accurate and reliable original data for taxpayer industry classification.
As shown in FIG. 5, anomaly index generation and analysis comprises three parts: construction of the deep self-encoding network, generation of the industry major category anomaly index, and generation of the industry detail anomaly index. To generate the industry major category anomaly index, the middle layer of the deep self-encoding network extracts the principal component features of the industry major categories, and these features are then fused with the SOS (Stochastic Outlier Selection) anomaly detection algorithm to calculate the index. To generate the industry detail anomaly index, the reconstruction error of each sample output by the self-encoding network is used as the index. The steps of anomaly index generation and analysis are shown in FIG. 6, and the detailed construction process comprises:
S301. Network structure design
First, the TADM network structure is determined. The number of input and output neurons of the TADM network is determined by the dimension of the sample feature space obtained in Step2, which equals N in FIG. 5, and a 5-layer network is designed. The input layer and the output layer both have N neurons; this embodiment finally sets N = 65. The second layer is a hidden layer with M neurons; M = 30 was finally determined by experiment in this embodiment. The third layer is the middle hidden layer with K neurons, whose output serves as the basis for detecting industry major category anomalies; this embodiment sets K = 12. The fourth layer has the same structure as the second layer. The output layer has the same number of neurons as the input layer, and all layers are fully connected.
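The 65-30-12-30-65 layout described above can be sketched as a plain NumPy forward pass. This only illustrates the layer shapes and where the middle-layer code is taken from; the names `init_params` and `forward`, the weight initialization, and the use of tanh for every hidden layer (including the middle one) are assumptions made for brevity, and no training is shown.

```python
import numpy as np

LAYER_SIZES = [65, 30, 12, 30, 65]  # N, M, K, M, N in this embodiment

def init_params(sizes=LAYER_SIZES, seed=0):
    """Create one (weights, bias) pair per fully connected layer transition."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    """Return (middle-layer code, reconstruction) for a batch x of shape (n, 65)."""
    h = np.asarray(x, dtype=float)
    code = None
    for idx, (W, b) in enumerate(params):
        z = h @ W + b
        is_output = idx == len(params) - 1
        # Assumption: tanh in hidden layers, sigmoid at the output layer.
        h = 1.0 / (1.0 + np.exp(-z)) if is_output else np.tanh(z)
        if idx == 1:
            code = h  # output of the third (middle) layer, K = 12 values per sample
    return code, h
```

The `code` output corresponds to the low-dimensional features used for major category detection, and the second output to the reconstruction used for industry detail detection.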
Finally, training the TADM network yields a model for anomaly detection that is determined by the network parameters Ω. The model is able to encode and decode the normal data in the sample space, so normal samples are copied from the input to the output more easily. Abnormal data differ greatly in distribution from normal data, so the network reconstructs abnormal samples poorly, and this property can be used for anomaly detection and identification. Although different training strategies are used for industry major category and industry detail anomaly detection, both share the same network structure, so TADM does not need to redesign the network.
S302. Setting network parameters
After the TADM network structure is determined, the specific network parameters need to be set. In this embodiment, all layers are fully connected, and the second and fourth layers use the hyperbolic tangent activation function:
f(x)=tanh(x)
The activation function of the middle layer differs from that of the other layers: it is a ladder-like (staircase) activation function, formally expressed as:
where k denotes the index of the network layer, N denotes the number of steps of the activation function (distinct from the number of input neurons), and the parameter a controls the transition rate from one step to the next. In this embodiment, N = 10, k = 3 and a = 100. The activation function quantizes the data into N discrete values, so that the original continuous data are mapped into a vector space of discrete values, which achieves data compression.
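The behaviour of a ladder-like activation can be illustrated with one common staircase construction, a sum of steep tanh transitions. The exact formula of this embodiment is not reproduced in the text, so the form below, f(x) = 1/2 + (1 / (2(N-1))) Σ_{j=1}^{N-1} tanh(a (x - j/N)), and the name `staircase` are assumptions chosen to match the described parameters N and a.

```python
import numpy as np

def staircase(x, n_steps=10, a=100.0):
    """Assumed staircase activation: quantizes inputs in [0, 1] to n_steps levels.

    Each tanh term switches from -1 to +1 around j / n_steps; with a large `a`
    the output is close to one of the values 0, 1/(n_steps-1), ..., 1.
    """
    x = np.asarray(x, dtype=float)
    j = np.arange(1, n_steps)                       # transition points j / n_steps
    steps = np.tanh(a * (x[..., None] - j / n_steps))
    return 0.5 + steps.sum(axis=-1) / (2 * (n_steps - 1))
```

With N = 10 and a = 100, inputs between two neighbouring transition points are mapped to nearly the same output level, which is the quantization effect the text describes.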
The output layer uses the sigmoid activation function, formally expressed as:
$$f(x) = \frac{1}{1 + e^{-x}}$$
S303. Network training strategy
In this embodiment, the data of 25 industry major categories and their 112 industry details are selected for the experiment. During network training, the data are divided into a training set, a validation set and a test set in the ratio 6:1:1, and the TADM network is then trained with cross-validation. Different network parameters are trained for different tasks, and the training strategies for industry major category and industry detail anomaly detection differ.
Specifically, for industry detail anomaly detection, one network is trained for the industry details belonging to each industry major category. This network is called tiny-net_i in this embodiment, where i denotes the index of the corresponding industry major category. tiny-net_i is used specifically to detect anomalies in the industry details under the i-th industry major category. For each tiny-net_i, the data are divided into a training set, a validation set and a test set in the ratio 6:1:1, the corresponding network is trained to update its parameters, and anomaly detection is then performed on the industry details under that major category.
For industry major category anomaly detection, a single network is trained, called large-net in this embodiment. Training this network requires data from all 25 industry major categories in the database. The data are used as follows: the data of each industry major category are randomly divided in the ratio 6:1:1, the data drawn from all major categories are merged, and large-net is trained on the merged data with cross-validation.
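The per-category 6:1:1 split used for large-net can be sketched as below. The function name `split_611`, the rounding rule (one eighth each for validation and test, rounded down, remainder to training) and the fixed seed are assumptions; the cross-validation loop itself is not shown.

```python
import random
from collections import defaultdict

def split_611(labels, seed=0):
    """Split sample indices 6:1:1 within each industry major category, then merge.

    labels: list with the major-category label of every sample.
    Returns three index lists: (train, validation, test).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = len(idxs) // 8          # 1 part of 8
        n_test = len(idxs) // 8         # 1 part of 8
        n_train = len(idxs) - n_val - n_test
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

The same function applied to the samples of a single major category, with industry detail labels, gives the split for one tiny-net_i.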
S304. Generation of industry major category anomaly index
Data in a high-dimensional space are more sparsely distributed than in a low-dimensional space and lose their clustering characteristics, which makes abnormal data difficult to detect. Industry major category anomaly detection therefore needs to reduce the dimension of the original data and extract the principal component space information.
The large-net network obtained in step S303 is able to extract the principal component features of a sample. Industry major category anomaly detection mainly uses the third layer of large-net: the third layer extracts the principal component features, which are then fused with the SOS anomaly detection algorithm to detect industry major category anomalies. The flow of the SOS anomaly detection algorithm is shown in FIG. 7, and the detailed detection steps are as follows:
(1) Generating a dissimilarity matrix between taxpayer features
large-net reduces the dimension of the taxpayer industry information data of the 25 industry major categories, and the principal component feature matrix after dimension reduction is X. The dissimilarity matrix is denoted $D = (d_{ij})_{n \times n}$, where the element $d_{ij}$ indicates the degree of difference between the i-th sample and the j-th sample.
The dissimilarity in this embodiment is the Euclidean distance, expressed as:
$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$
where $x_{ij}$ denotes the value of the j-th feature of the i-th sample, n denotes the number of samples, and m denotes the dimension of the feature vector of each sample.
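The Euclidean dissimilarity matrix D can be computed for all sample pairs at once, as in the following sketch. The function name `dissimilarity_matrix` is hypothetical, and the broadcasting approach is chosen for clarity rather than memory efficiency, since it builds an n × n × m intermediate array.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Return D with D[i, j] = Euclidean distance between rows i and j of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # shape (n, n, m): all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Here X would be the n × 12 matrix of middle-layer features produced by large-net.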
(2) Generating an association degree matrix between taxpayer features from the dissimilarity information
The association degree between taxpayer features is represented by the matrix $A = (a_{ij})_{n \times n}$. The association degree matrix is generated by constructing a nonlinear mapping that maps the dissimilarity matrix D to the association degree matrix A. The mapping has the form:
$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^2}{2\sigma_i^2}\right), & i \neq j \\ 0, & i = j \end{cases}$$
where $\sigma_i^2$ is the variance of the principal component feature vector $X_i$, which expresses the degree of dispersion of the data corresponding to the i-th principal component feature in the taxpayer industry information, and $a_{ij}$ indicates the association degree between sample i and sample j.
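The mapping from dissimilarity to association degree can be sketched as below, assuming the Gaussian-kernel form used by the standard SOS algorithm, a_ij = exp(-d_ij² / (2σ_i²)) with a_ii = 0. The function name `affinity_matrix` is hypothetical, and how the per-sample variances σ_i² are obtained is left to the caller.

```python
import numpy as np

def affinity_matrix(D, sigma2):
    """Map a dissimilarity matrix D to an association degree (affinity) matrix A.

    sigma2[i] is the variance assigned to sample i; row i of A uses sigma2[i].
    """
    D = np.asarray(D, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    A = np.exp(-(D ** 2) / (2.0 * sigma2[:, None]))
    np.fill_diagonal(A, 0.0)   # a sample has no affinity with itself
    return A
```

Because each row uses its own variance, A is in general not symmetric even though D is.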
(3) Generating an association probability matrix between taxpayer features from the association degree information
The association probability between taxpayer features is represented by the matrix $B = (b_{ij})_{n \times n}$. In computing the association probability matrix, each sample is abstracted as a point in a graph model, and the weight of the edge between two points represents their association probability. In this embodiment, the association probability is calculated as:
$$b_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{ik}}$$
where $b_{ij}$ denotes the association probability between point i and point j, and satisfies $\sum_{j=1}^{n} b_{ij} = 1$.
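The association probabilities can be obtained by normalizing each row of the association degree matrix, as in standard SOS. The sketch below assumes that every row of A contains at least one non-zero entry; the function name `binding_matrix` is hypothetical.

```python
import numpy as np

def binding_matrix(A):
    """Row-normalize the association degree matrix A so that each row sums to 1."""
    A = np.asarray(A, dtype=float)
    # Assumption: no row of A is entirely zero, otherwise this divides by zero.
    return A / A.sum(axis=1, keepdims=True)
```

Row i of the result can be read as the outgoing edge weights of point i in the graph model.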
(4) Deriving the anomaly probability of taxpayer features from the association probabilities
According to graph theory, the probability that a sample is an outlier is the probability that the in-degree of its point is zero. The anomaly probability is derived from the association probabilities between taxpayer features as follows:
$$P(X_i \in \mathrm{Outlier}) = \prod_{j \neq i} (1 - b_{ji})$$
where $P(X_i \in \mathrm{Outlier})$ denotes the probability that sample $X_i$ is abnormal data. The industry major category anomaly is then evaluated against a given threshold.
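The "in-degree is zero" interpretation can be sketched as a product over the incoming edges of each point, following the standard SOS formulation P(X_i) = Π_{j≠i} (1 - b_ji). The function name `outlier_probability` is hypothetical.

```python
import numpy as np

def outlier_probability(B):
    """Return, for each sample i, the probability that no other sample binds to it.

    B[j, i] is the association probability from point j to point i, so the
    product runs down column i, skipping the diagonal entry.
    """
    B = np.asarray(B, dtype=float)
    one_minus = 1.0 - B
    np.fill_diagonal(one_minus, 1.0)   # exclude j = i from the product
    return one_minus.prod(axis=0)
```

A sample that few other samples associate with keeps a probability close to 1 and is a candidate major category anomaly.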
S305. Generation of industry detail anomaly index
The tiny-net_i of each industry major category obtained in S303 (i denotes the index of the industry major category) is able to self-encode industry detail data. tiny-net_i performs self-encoding on all industry detail samples in industry major category i and calculates their reconstruction errors. The reconstruction error is formally expressed as:
$$OF_i = \frac{1}{N}\sum_{j=1}^{N} (x_{ij} - o_{ij})^2$$
where $OF_i$ denotes the reconstruction error of the i-th sample, N denotes the number of input and output layer neurons, $x_{ij}$ denotes the value of the i-th sample at the j-th neuron of the input layer, and $o_{ij}$ denotes the value of the i-th sample at the j-th neuron of the output layer.
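The per-sample reconstruction error can be computed directly from the network input and output, as sketched below. The mean of squared differences over the N neurons is assumed here as the error measure, and the function name `reconstruction_error` is hypothetical.

```python
import numpy as np

def reconstruction_error(X, O):
    """Return one error per sample: mean squared difference between input and output.

    X: inputs of shape (n_samples, N); O: network outputs of the same shape.
    """
    X = np.asarray(X, dtype=float)
    O = np.asarray(O, dtype=float)
    return ((X - O) ** 2).mean(axis=1)
```

Samples that the network of their major category cannot reproduce well receive a large error and are candidates for industry detail anomalies.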
After the TADM model is constructed, anomaly results need to be produced. The TADM model provides specific calculation schemes for the anomaly indexes of industry major categories and industry details, but the anomaly indexes themselves do not directly state whether a data item is abnormal, so the results need to be evaluated further.
In the anomaly evaluation of industry major categories, an anomaly probability threshold θ is set; if the anomaly probability P of a sample is greater than θ, the sample is judged to be industry major category anomaly data. In this embodiment, θ = 0.15.
In the anomaly evaluation of industry details, a reconstruction error threshold ε is set; if the reconstruction error of a sample is greater than ε, the sample is judged to be abnormal data. In this embodiment, ε = 0.08.
Finally, the evaluation results for industry major categories and industry details are combined, the abnormal data in the taxpayer industry information are marked, and the record numbers of the abnormal data are output. The data marked as abnormal are deleted from the original data, thereby providing more accurate and reliable data for the industry classification task.
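The final evaluation step can be sketched as two threshold comparisons followed by a union of the flags. The function name `flag_anomalies` is hypothetical, and it is assumed that both index lists are aligned so that position k refers to the same taxpayer record.

```python
THETA = 0.15    # anomaly probability threshold for industry major categories
EPSILON = 0.08  # reconstruction error threshold for industry details

def flag_anomalies(outlier_prob, recon_error, theta=THETA, epsilon=EPSILON):
    """Return (flags, kept): per-record anomaly flags and indices of retained records.

    A record is flagged if it exceeds either threshold (assumed combination rule).
    """
    flags = [p > theta or e > epsilon for p, e in zip(outlier_prob, recon_error)]
    kept = [k for k, flagged in enumerate(flags) if not flagged]
    return flags, kept
```

The indices in `kept` identify the records that remain available for the industry classification task.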
It will be understood by those skilled in the art that the foregoing is only exemplary of the method of the present invention and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (2)
1. An enterprise industry classification-oriented anomaly detection method is characterized by comprising the following steps:
first, the text information and non-text information to be mined are extracted from the taxpayer industry information and subjected to feature processing and encoding; second, a deep network structure suited to the industry classification anomaly detection problem is constructed, and the numbers of neurons in the input and output layers of the network are determined according to the feature dimension of the encoded data; third, based on the constructed deep network structure, the networks for industry major categories and industry details are trained separately with different training strategies through cross-validation; finally, anomaly detection is performed on industry major categories by using the dimension-reduction property of the industry major category network fused with the SOS anomaly detection algorithm, and on industry details according to the reconstruction property of the industry detail network; the method specifically comprises the following implementation steps:
1) Taxpayer text attribute processing
The text information in the taxpayer industry information table is analyzed, representative text attributes are extracted, and the extracted attributes are used for anomaly detection; the taxpayer text attribute processing specifically comprises the following steps:
Step1. Preprocessing of text information
Text preprocessing standardizes the taxpayer industry information and is implemented as follows: (1) garbled characters in the database are deleted; (2) numbers and measure words in the text attributes are deleted; (3) data with a null identifier in the database are deleted;
Step2. Word segmentation based on the Ansj word segmenter
An industry classification professional dictionary is constructed based on the national economic industry classification, a stop-word dictionary is constructed based on the national lexicon of four-level administrative division names (province, city, county and district), the text is segmented with the Ansj segmenter according to the constructed dictionaries, and a segmented corpus is established;
Step3. Construction of word vectors
The words of all samples are weighted according to the proportion of different classes of text in the segmented corpus; words with larger weights are selected, the N keywords with the largest weights in each text are retained, and the N keywords are converted into word vectors with the word2vec tool;
2) Non-text attribute processing
The non-text attributes of the taxpayer industry information comprise two parts: numerical attributes and categorical attributes; the numerical attributes are processed with the z-score standardization method, and the categorical attributes are encoded with One-Hot encoding; accordingly, the non-text attribute processing of taxpayer industry information comprises two parts: numerical attribute processing and categorical attribute processing;
the numerical attributes are processed with the z-score standardization method in the following specific steps:
Step1. Calculate the mean of each attribute
Let $u = (u_1, u_2, \ldots, u_m)$ be the mean vector, where m denotes the number of numerical attributes and $u_i$ denotes the mean of the i-th numerical attribute, calculated as:
$$u_i = \frac{1}{n}\sum_{j=1}^{n} x_{ji}$$
where n denotes the number of taxpayer industry information samples, and $x_{ji}$ denotes the i-th numerical attribute value of the j-th sample;
Step2. Calculate the standard deviation of each attribute
Let $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_m)$ be the standard deviations of the numerical attributes, where m denotes the number of numerical attributes and $\sigma_i$ denotes the standard deviation of the i-th numerical attribute, calculated as:
$$\sigma_i = \sqrt{\frac{1}{n}\sum_{j=1}^{n} (x_{ji} - u_i)^2}$$
the mean and the standard deviation are basic statistics of a numerical attribute, and a numerical attribute can be standardized with them;
Step3. Standardize the data
The sample data are standardized according to the mean and standard deviation of the numerical attributes calculated in the two preceding steps, in the form:
$$X_i' = \frac{X_i - u_i}{\sigma_i}$$
where $X_i'$ is the result after z-score processing, $X_i$ is the column vector corresponding to the i-th numerical attribute, $u_i$ denotes the mean of the i-th numerical attribute, and $\sigma_i$ denotes the standard deviation of the i-th numerical attribute;
the categorical attributes are encoded with One-Hot encoding in the following detailed steps:
Step1. Determine the number of discrete values of the attribute; suppose the attribute has M discrete values; the M states are encoded with an M-bit status register, in which each state has its own register bit and only one bit is active at a time;
Step2. Set up the M-bit status registers, each with exactly one bit equal to 1 and the rest equal to 0; this setting converts differences between categorical values into distances in Euclidean space;
Step3. Map the M status codes one-to-one to the M discrete values, so that each attribute value is represented by an M-dimensional vector, which is the One-Hot code of that value;
3) Anomaly index generation and analysis
Anomaly indexes are generated and analyzed using the deep-learning self-encoding network as a prototype; based on the principle that industry information at different levels carries different information granularities, a TADM-based calculation method for industry classification anomaly indexes is designed;
4) Anomaly evaluation
The first-level industry major category anomaly index is obtained by the TADM network and the SOS anomaly detection algorithm: the SOS anomaly detection algorithm finally calculates the anomaly probability of each sample, an anomaly probability threshold is given, the anomaly probabilities of all taxpayer feature samples are compared with the threshold, and if the anomaly probability is greater than the threshold, the sample is judged to be abnormal data in the industry major category;
the second-level industry detail anomaly index is obtained from the reconstruction of the TADM network: the TADM network finally calculates the reconstruction error of each sample, a reconstruction error threshold is given, the reconstruction errors of all industry detail samples are compared with the threshold, and if the reconstruction error is greater than the threshold, the sample is judged to be abnormal data in the industry detail.
2. The enterprise industry classification-oriented anomaly detection method according to claim 1, wherein in step 3), the TADM-based calculation method for industry classification anomaly indexes specifically comprises TADM network construction, generation of the industry major category anomaly index and generation of the industry detail anomaly index, as follows:
Step1. TADM network construction
The TADM network is a five-layer neural network; the input layer and the output layer have the same number of neurons, equal to the total dimension of the data obtained by processing the text attributes and non-text attributes of the taxpayer information; the TADM network takes the taxpayer information feature vector as its input and the reconstruction result as its output;
specifically, the TADM network is a standard feedforward neural network whose weights are updated with the BP (back-propagation) algorithm; the BP algorithm is used to minimize the overall reconstruction error OF and update the network parameters, formally expressed as:
where Ω denotes the network parameters, n denotes the number of samples, and $OF_i$ denotes the reconstruction error of the i-th sample; the encoding and decoding capability of the network is determined by the network parameters Ω;
Step2. Generation of the industry major category anomaly index
Using the dimension-reduction property of the middle layer of the TADM network, the principal component space features of the middle layer are extracted and fused with the SOS anomaly detection algorithm to detect industry major category anomalies;
the SOS anomaly detection algorithm is a similarity-based anomaly detection method that judges whether a sample is abnormal by calculating its anomaly probability, with a suitable threshold selected as the criterion for evaluating abnormal data; the SOS anomaly detection algorithm comprises the following steps:
(1) Generating a dissimilarity matrix between taxpayer features
For the taxpayer features, the dissimilarity between different taxpayer features is calculated; the dissimilarity expresses the degree of difference between two data points and is given by the Euclidean distance:
$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$
where the feature matrix of the original data is $X = (x_{ij})_{n \times m}$, and each row of X is the feature vector of one sample; n denotes the number of rows of the matrix, i.e. the number of samples; m denotes the number of columns, i.e. the number of attributes of each sample; the dissimilarity matrix is $D = (d_{ij})_{n \times n}$, whose element $d_{ij}$ denotes the degree of difference between sample i and sample j;
(2) Generating an association degree matrix between taxpayer features from the dissimilarity information
The association between each taxpayer feature and the other taxpayer features is expressed by an association degree; the association degree matrix is generated by constructing a nonlinear mapping that maps the dissimilarity matrix D to the association degree matrix A; the mapping relation is:
$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^2}{2\sigma_i^2}\right), & i \neq j \\ 0, & i = j \end{cases}$$
where $\sigma_i^2$ is the variance of the feature vector $X_i$, which expresses the degree of dispersion of the data corresponding to the i-th principal component feature in the taxpayer industry information; $a_{ij}$ is the association degree between sample i and sample j;
(3) Generating the association probability matrix between taxpayer features from the association degree information
The association probability between taxpayer features is represented by the matrix $B = (b_{ij})_{n \times n}$. When the association probability matrix is calculated, each sample is abstracted as a point in a graph model, and the weight of the edge between two points represents their association probability; formally, the association probability is calculated as:
where $b_{ij}$ denotes the association probability between point i and point j, and satisfies
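The normalization formula is not reproduced in the text. A common way to turn association degrees into edge probabilities is to normalize each row so that it sums to one, $b_{ij} = a_{ij} / \sum_{k} a_{ik}$. The sketch below assumes that form; the function name and the handling of all-zero rows are illustrative choices, not taken from the patent.

```python
import numpy as np

def association_probability(A):
    """Row-normalize an association degree matrix A into probabilities B.

    ASSUMPTION: b_ij = a_ij / sum_k a_ik, so each row of B sums to 1.
    """
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    # Rows with no association mass stay all-zero instead of dividing by zero.
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

# Hypothetical association degrees with a zero diagonal
A_example = np.array([[0.0, 1.0, 3.0],
                      [2.0, 0.0, 2.0],
                      [1.0, 1.0, 0.0]])
B_example = association_probability(A_example)
print(B_example)
```

Under this assumption, row i of B can be read as the distribution over which other point is linked from point i in the graph model.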
(4) Deriving the anomaly probability of taxpayer features from the association probabilities
According to graph theory, the probability that a taxpayer feature sample is an abnormal point is the probability that the in-degree of the corresponding point in the graph model is zero. The anomaly probability is derived from the association probabilities among the samples, and is calculated as follows:
where $P(X_i \in \text{Outlier})$ denotes the probability that sample $X_i$ is abnormal data; once it has been calculated, the industry large-class anomaly is evaluated against a given threshold;
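The calculation formula is not reproduced in the text. If every point j is assumed to link to point i with probability $b_{ji}$ independently of the other points, then point i has in-degree zero exactly when no other point links to it, which gives $P(X_i \in \text{Outlier}) = \prod_{j \ne i} (1 - b_{ji})$. The sketch below implements this assumed product form (the same one used in Stochastic Outlier Selection); the function names, sample matrix, and threshold value are hypothetical.

```python
import numpy as np

def outlier_probability(B):
    """Probability that each point has in-degree zero in the graph model.

    ASSUMPTION: independent edges, so P_i = prod over j != i of (1 - b_ji).
    """
    B = np.asarray(B, dtype=float)
    one_minus = 1.0 - B
    np.fill_diagonal(one_minus, 1.0)  # exclude the j == i term from the product
    return one_minus.prod(axis=0)     # product down column i, i.e. over all sources j

def flag_outliers(B, threshold):
    """Mark samples whose outlier probability exceeds the given threshold."""
    return outlier_probability(B) > threshold

# Hypothetical association probability matrix (rows sum to 1)
B_demo = np.array([[0.0, 0.5, 0.5],
                   [0.5, 0.0, 0.5],
                   [1.0, 0.0, 0.0]])
p_demo = outlier_probability(B_demo)
flags_demo = flag_outliers(B_demo, threshold=0.4)
print(p_demo, flags_demo)
```

In this example point 0 is certain to receive a link from point 2, so its outlier probability is 0, while point 1 is linked only from point 0 with probability 0.5 and is the one flagged.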
Step 3. Generation of industry detail anomaly indicators
Detecting industry detail anomalies requires more feature information than detecting industry large-class anomalies. The TADM network constructed in Step 1 is able to encode and decode normal data: encoding is the process of feeding data into the network and compressing it into a low-dimensional space, and decoding is the process of mapping the low-dimensional data back to a high-dimensional space as output. Industry detail anomalies are judged from the reconstruction error of each sample, which is calculated as follows:
where $OF_i$ denotes the reconstruction error of the i-th sample, N denotes the number of neurons in the input layer and in the output layer, $x_{ij}$ denotes the value of the j-th input-layer neuron for the i-th sample, and $o_{ij}$ denotes the value of the j-th output-layer neuron for the i-th sample.
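The reconstruction-error formula is not reproduced in the text. Given the symbols defined, one plausible form is the mean squared difference between input and output neurons, $OF_i = \frac{1}{N} \sum_{j=1}^{N} (x_{ij} - o_{ij})^2$. The sketch below assumes that form; the function name and the sample arrays are hypothetical, and the outputs stand in for whatever a trained encoder-decoder network would produce.

```python
import numpy as np

def reconstruction_error(X_in, X_out):
    """Per-sample reconstruction error between network inputs and outputs.

    ASSUMPTION: OF_i = (1/N) * sum_j (x_ij - o_ij)^2; the source text does not
    reproduce the exact formula.
    """
    X_in = np.asarray(X_in, dtype=float)
    X_out = np.asarray(X_out, dtype=float)
    if X_in.shape != X_out.shape:
        raise ValueError("input and output layers must have the same shape")
    return ((X_in - X_out) ** 2).mean(axis=1)

# Hypothetical inputs and reconstructions: 2 samples, N = 2 neurons
inputs = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
outputs = np.array([[1.0, 2.0],
                    [1.0, 4.0]])
errors = reconstruction_error(inputs, outputs)
print(errors)
```

A network trained only on normal data should reconstruct normal samples closely, so a sample with a large error, such as the second one here, would be a candidate industry detail anomaly.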
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811489291.XA CN109657947B (en) | 2018-12-06 | 2018-12-06 | Enterprise industry classification-oriented anomaly detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109657947A (en) | 2019-04-19 |
CN109657947B (en) | 2021-03-16 |
Family
ID=66112636
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201811489291.XA CN109657947B (en) | Active | 2018-12-06 | 2018-12-06 | Enterprise industry classification-oriented anomaly detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657947B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110430224B (en) * | 2019-09-12 | 2021-11-16 | 贵州电网有限责任公司 | Communication network abnormal behavior detection method based on random block model |
CN110705607B (en) * | 2019-09-12 | 2022-10-25 | 西安交通大学 | Industry multi-label noise reduction method based on cyclic re-labeling self-service method |
CN110609905A (en) * | 2019-09-12 | 2019-12-24 | 深圳众赢维融科技有限公司 | Method and device for recognizing type of over-point and processing graph data |
CN111178615B (en) * | 2019-12-24 | 2023-10-27 | 成都数联铭品科技有限公司 | Method and system for constructing enterprise risk identification model |
CN111914090B (en) * | 2020-08-18 | 2021-05-04 | 生态环境部环境规划院 | Method and device for enterprise industry classification identification and characteristic pollutant identification |
CN112084764B (en) * | 2020-09-02 | 2022-06-17 | 北京字节跳动网络技术有限公司 | Data detection method, device, storage medium and equipment |
CN112215288B (en) * | 2020-10-13 | 2024-04-30 | 中国光大银行股份有限公司 | Method and device for determining category of target enterprise, storage medium and electronic device |
CN112434518B (en) * | 2020-11-30 | 2023-08-15 | 北京师范大学 | Text report scoring method and system |
CN112767106B (en) * | 2021-01-14 | 2023-11-07 | 中国科学院上海高等研究院 | Automatic auditing method, system, computer readable storage medium and auditing equipment |
CN113077100A (en) * | 2021-04-16 | 2021-07-06 | 西安交通大学 | Online learning potential exit prediction method based on automatic coding machine |
CN113283972A (en) * | 2021-05-06 | 2021-08-20 | 胡立禄 | System and method for constructing tax big data model |
CN114330370B (en) * | 2022-03-17 | 2022-05-20 | 天津思睿信息技术有限公司 | Natural language processing system and method based on artificial intelligence |
CN114997978B (en) * | 2022-06-08 | 2023-06-30 | 深圳多有米网络技术有限公司 | High-quality taxpayer identification method based on taxpayer operation characteristics |
CN115794798B (en) * | 2022-12-12 | 2023-09-15 | 江苏省工商行政管理局信息中心 | Market supervision informatization standard management and dynamic maintenance system and method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104008420A (en) * | 2014-05-26 | 2014-08-27 | 中国科学院信息工程研究所 | Distributed outlier detection method and system based on automatic coding machine |
CN104869126A (en) * | 2015-06-19 | 2015-08-26 | 中国人民解放军61599部队计算所 | Network intrusion anomaly detection method |
CN107688822A (en) * | 2017-07-18 | 2018-02-13 | 中国科学院计算技术研究所 | Newly-increased classification recognition methods based on deep learning |
CN108809974A (en) * | 2018-06-07 | 2018-11-13 | 深圳先进技术研究院 | A kind of Network Abnormal recognition detection method and device |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6237699B2 (en) * | 2015-05-25 | 2017-11-29 | トヨタ自動車株式会社 | Anomaly detection device |
CN106294823B (en) * | 2016-08-17 | 2019-03-22 | 上海云信留客信息科技有限公司 | The method of abnormality detection and elimination for big data cleaning |
CN107122394B (en) * | 2017-03-10 | 2020-02-14 | 博彦科技股份有限公司 | Abnormal data detection method and device |
CN108287782A (en) * | 2017-06-05 | 2018-07-17 | 中兴通讯股份有限公司 | A kind of multidimensional data method for detecting abnormality and device |
CN107391443B (en) * | 2017-06-28 | 2020-12-25 | 北京航空航天大学 | Sparse data anomaly detection method and device |
CN107944480B (en) * | 2017-11-16 | 2020-11-24 | 广州探迹科技有限公司 | Enterprise industry classification method |
Also Published As
Publication number | Publication date |
---|---|
CN109657947A (en) | 2019-04-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109657947B (en) | Enterprise industry classification-oriented anomaly detection method | |
Benabdellah et al. | A survey of clustering algorithms for an industrial context | |
CN113822494B (en) | Risk prediction method, device, equipment and storage medium | |
CN110532542B (en) | Invoice false invoice identification method and system based on positive case and unmarked learning | |
Govaert et al. | Mutual information, phi-squared and model-based co-clustering for contingency tables | |
US20070094216A1 (en) | Uncertainty management in a decision-making system | |
CN106611375A (en) | Text analysis-based credit risk assessment method and apparatus | |
JP2018503206A (en) | Technical and semantic signal processing in large unstructured data fields | |
CN113255321B (en) | Financial field chapter-level event extraction method based on article entity word dependency relationship | |
CN115878904A (en) | Intellectual property personalized recommendation method, system and medium based on deep learning | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN117473431A (en) | Airport data classification and classification method and system based on knowledge graph | |
Manziuk et al. | Structural Alignment Method of Conceptual Categories of Ontology and Formalized Domain. | |
Hamad et al. | Sentiment analysis of restaurant reviews in social media using naïve bayes | |
CN115496571A (en) | Interpretable invoice false-invoice detection method based on mesology | |
CN115618297A (en) | Method and device for identifying abnormal enterprise | |
CN115545437A (en) | Financial enterprise operation risk early warning method based on multi-source heterogeneous data fusion | |
CN114579761A (en) | Information security knowledge entity relation connection prediction method, system and medium | |
Kulothungan | Loan Forecast by Using Machine Learning | |
Shen et al. | Stock trends prediction by hypergraph modeling | |
CN113643141A (en) | Method, device and equipment for generating explanatory conclusion report and storage medium | |
Wang et al. | Preprocessing and feature extraction methods for microfinance overdue data | |
Pachot et al. | Identify Potential Diversification to Companies through Collaborative Filtering | |
KR102663632B1 (en) | Device and method for artwork trend data prediction using artificial intelligence | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||