CN109657947B - Enterprise industry classification-oriented anomaly detection method - Google Patents

Enterprise industry classification-oriented anomaly detection method

Info

Publication number
CN109657947B
Authority
CN
China
Prior art keywords
industry
network
taxpayer
attribute
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811489291.XA
Other languages
Chinese (zh)
Other versions
CN109657947A (en)
Inventor
郑庆华
高宇达
阮建飞
赵珮瑶
董博
孙铭潞
田雨润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811489291.XA
Publication of CN109657947A
Application granted
Publication of CN109657947B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Abstract

The invention discloses an anomaly detection method for enterprise industry classification, comprising the following steps: first, extracting the text and non-text information to be mined from taxpayer industry information and applying feature processing and encoding; second, constructing a deep network structure suited to the industry classification anomaly detection problem, with the number of input-layer and output-layer neurons determined by the feature dimension of the encoded data; third, on the basis of that structure, training separate networks for industry major classes and industry details with different training strategies, using cross-validation; and finally, detecting industry major-class anomalies by combining the dimension-reduction property of the major-class network with the SOS anomaly detection algorithm, and detecting industry detail anomalies from the reconstruction property of the detail network. By applying the TADM model to detect anomalies in the original data, the invention supports more reasonable and accurate analysis in macro-management work such as national statistics, taxation, and industrial and commercial administration.

Description

Enterprise industry classification-oriented anomaly detection method
Technical Field
The invention belongs to the field of data mining, and particularly relates to an anomaly detection method for enterprise industry classification based on TADM (Two-level Anomaly Detection Model).
Background
Since the reform and opening-up, China's national economy has developed rapidly, the market economy has continued to flourish, the national economic structure has gradually improved, and the division of enterprises into industries has become steadily finer. In this new period, research on enterprise industry classification plays a fundamental role in the administration of finance, taxation and national standards, and provides a basis for analyzing the current state of the national economic industries and grasping the development trend of the national economy. The national standard Industrial Classification for National Economic Activities (GB/T 4754-2017), issued by the General Administration of Quality Supervision, Inspection and Quarantine and the Standardization Administration, specifies the industry classification and codes of enterprise economic activities, comprising 97 industry major classes and 1380 industry details. When an enterprise registers, the industrial and commercial administration department must determine its national economic industry classification from information such as its business scope. However, enterprise industry classification is currently carried out mainly by hand; it is limited by the professional knowledge and experience of the staff, and classification errors are common when large numbers of enterprises must be classified. A wrong industry classification adversely affects national statistics, taxation, industrial and commercial administration and related work, so detecting and identifying abnormal enterprise industry classifications by computer program has become an urgent problem.
At present, no published research offers a solution aimed specifically at detecting anomalies in enterprise industry classification. The disclosed techniques aim at general-purpose anomaly detection; representative works are:
Document 1: Distributed outlier detection method and system based on an autoencoder (201410225026.6)
Document 2: Local outlier detection method based on density (201710559390.X)
Document 3: Multi-dimensional data anomaly detection method and device (201710411852.3)
Document 1 proposes a distributed outlier detection method based on an autoencoder, which updates the autoencoder's model parameters with distributed computing and detects anomalies from the reconstruction errors of the samples.
Document 2 designs a density-based local outlier detection method. It considers the dispersion between a sample point and its neighboring sample points, defines a k-neighborhood dispersion from the expectation and variance of the distances between the sample point and its neighbors, redefines the local outlier coefficient using this k-neighborhood dispersion, and judges whether a sample is abnormal by calculating the neighborhood density of the sample point.
Document 3 uses a reconstruction network to detect anomalies in high-dimensional data: it constructs a reconstruction model and determines whether a sample is abnormal from the multi-dimensional reconstructed data.
Although these methods can solve specific anomaly detection problems, they are difficult to extend directly to industry classification, because anomaly detection for industry classification is both multi-class and multi-level. First, enterprise industry classification is a multi-class problem, and the large number of categories and the large data volume complicate anomaly detection. The autoencoder structures of Documents 1 and 3 are too simple: they have only one hidden layer, cannot effectively extract the detailed features of the data, and generalize poorly on large-scale data sets. Document 2 defines the local outlier coefficient through the k-neighborhood dispersion, but with many industry categories and a large amount of industry information data, choosing the value of k becomes extremely difficult. Second, industry major classes and industry details stand in a hierarchical membership relation and belong to different levels: an industry detail is an expansion and refinement of an industry major class, and every enterprise corresponds to one major class and one detail. Anomaly analysis at the two levels therefore needs data of different information granularity (that is, different levels of detail), and Documents 1 to 3 offer no solution to multi-level anomaly detection for industry classification.
Because industry major classes and industry details belong to different levels, the invention introduces a deep autoencoder network model. A deep autoencoder network also has a clear hierarchical structure: during encoding and decoding, the outputs of different network layers correspond to different feature spaces, and the information granularity represented by these feature spaces differs in a way that matches the hierarchy of industry classification. A deep autoencoder network can therefore address anomaly detection for industry major classes and industry details at the same time.
Disclosure of Invention
The invention aims to provide an anomaly detection method for enterprise industry classification based on TADM. First, the text and non-text information to be mined is extracted from taxpayer industry information, and feature processing and encoding are applied; second, a deep network structure suited to the industry classification anomaly detection problem is constructed, with the number of input-layer and output-layer neurons determined by the feature dimension of the encoded data; third, on the basis of that structure, separate networks for industry major classes and industry details are trained with different training strategies, using cross-validation; finally, industry major-class anomalies are detected by combining the dimension-reduction property of the major-class network with the SOS (Stochastic Outlier Selection) algorithm, and industry detail anomalies are detected from the reconstruction property of the detail network. The two processes are independent of each other, so anomaly detection for industry major classes and industry details is carried out in parallel.
The invention is realized by adopting the following technical scheme:
An anomaly detection method for enterprise industry classification comprises the following steps:
first, extracting the text and non-text information to be mined from taxpayer industry information, and applying feature processing and encoding; second, constructing a deep network structure suited to the industry classification anomaly detection problem, with the number of input-layer and output-layer neurons determined by the feature dimension of the encoded data; third, on the basis of that structure, training separate networks for industry major classes and industry details with different training strategies, using cross-validation; and finally, detecting industry major-class anomalies by combining the dimension-reduction property of the major-class network with the SOS anomaly detection algorithm, and detecting industry detail anomalies from the reconstruction property of the detail network.
The invention has the further improvement that the method specifically comprises the following implementation steps:
1) Taxpayer text attribute processing
The text information in the taxpayer industry information table is analyzed, representative text attributes are extracted, and the extracted attributes are used for anomaly detection;
2) Non-text attribute processing
The non-text attributes of the taxpayer industry information fall into two groups: numerical attributes and categorical attributes. Numerical attributes are processed with z-score standardization, and categorical attributes are encoded with One-Hot encoding;
3) Anomaly indicator generation and analysis
Anomaly indicators are generated and analyzed using a deep-learning autoencoder network as the prototype. Based on the premise that industry information at different levels carries different information granularity, a TADM-based method for calculating industry classification anomaly indicators is designed;
4) Anomaly evaluation
The first-level indicator, the industry major-class anomaly indicator, is obtained from the TADM network together with the SOS anomaly detection algorithm. The model calculates the anomaly probability of each sample and an anomaly probability threshold is given; the anomaly probability of every taxpayer feature sample is compared with the threshold, and a sample whose anomaly probability exceeds the threshold is judged to be abnormal data within its industry major class;
The second-level indicator, the industry detail anomaly indicator, is obtained from the reconstruction performed by the TADM network. The model calculates the reconstruction error of each sample and a reconstruction error threshold is given; the reconstruction error of every industry detail sample is compared with the threshold, and a sample whose reconstruction error exceeds the threshold is judged to be abnormal data within its industry detail.
In a further improvement, the taxpayer text attribute processing in step 1) specifically comprises the following steps:
Step1. Preprocessing of text information
Text preprocessing standardizes the taxpayer business information. It comprises: (1) deleting garbled characters in the database; (2) deleting numbers and measure words in the text attributes; (3) deleting records that the database marks as null;
Step2. Word segmentation based on the Ansj segmenter
An industry classification professional dictionary is constructed from the national economic industry classification, and a stop-word dictionary is constructed from a lexicon of place names covering the four levels of national administrative division (province, city, district and county). The texts are segmented with the Ansj segmenter, the words in the stop-word dictionary are filtered out, and a segmented corpus is built;
Step3. Construction of word vectors
The words of all samples are weighted according to their proportion in the different categories of text in the segmented corpus; the words with the larger weights are selected, the N keywords with the largest weights are retained for each corpus entry, and these N keywords are converted into word vectors with the word2vec tool.
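The preprocessing and keyword-selection steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source does not give the weighting formula, so TF-IDF is assumed here as one plausible weight; `clean_text` and `top_keywords` are hypothetical helper names; the texts are assumed to be already segmented (the Ansj step is not reproduced); measure-word removal is omitted because it needs a language-specific word list; and the word2vec conversion is left out.

```python
import math
import re
from collections import Counter


def clean_text(text):
    """Strip replacement characters and digits; return None for empty records."""
    if text is None or not text.strip():
        return None  # null or blank record: to be dropped by the caller
    text = text.replace("\ufffd", "")  # Unicode replacement char left by bad decoding
    text = re.sub(r"\d+", "", text)    # remove numbers
    return text.strip() or None


def top_keywords(docs, n_keep):
    """Keep the n_keep highest-weighted tokens of each segmented document.

    docs is a list of token lists. The weight is TF-IDF (an assumption);
    ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    selected = []
    for tokens in docs:
        if not tokens:
            selected.append([])
            continue
        term_freq = Counter(tokens)
        weight = {
            w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
            for w, c in term_freq.items()
        }
        ranked = sorted(weight, key=lambda w: (-weight[w], w))
        selected.append(ranked[:n_keep])
    return selected
```

With this weighting, a word that occurs in every record (for example "sales") receives weight zero and is pushed to the end of the ranking, which is the intended effect of keeping only the most distinctive keywords.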
In a further improvement, the non-text attribute processing of the taxpayer industry information in step 2) comprises two parts: numerical attribute processing and categorical attribute processing.
The numerical attributes are processed with the z-score standardization method, as follows:
Step1. Calculate the mean of each attribute
Let $u = (u_1, u_2, \ldots, u_m)$ be the mean vector, where $m$ is the number of numerical attributes and $u_i$ is the mean of the $i$-th numerical attribute, calculated as:
$$u_i = \frac{1}{n}\sum_{j=1}^{n} x_{ji}$$
where $n$ is the number of taxpayer industry information samples and $x_{ji}$ is the value of the $i$-th numerical attribute in the $j$-th sample;
Step2. Calculate the variance of each attribute
Let $\sigma^2 = (\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2)$ be the variances of the numerical attributes, where $m$ is the number of numerical attributes and $\sigma_i^2$ is the variance of the $i$-th numerical attribute, calculated as:
$$\sigma_i^2 = \frac{1}{n}\sum_{j=1}^{n} \left(x_{ji} - u_i\right)^2$$
The mean and the variance are the basic statistics of a numerical attribute, and the attribute can be standardized with them;
Step3. Standardize the data
The sample data are standardized using the mean and variance calculated in the two steps above:
$$X_i' = \frac{X_i - u_i}{\sigma_i}$$
where $X_i'$ is the result after z-score processing, $X_i$ is the column vector of the $i$-th numerical attribute, $u_i$ is the mean of the $i$-th numerical attribute, and $\sigma_i$ is the square root of its variance $\sigma_i^2$ (its standard deviation);
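The three z-score steps can be sketched with NumPy as below. This is a minimal sketch under stated assumptions: the function name `zscore` is hypothetical, the population form of the variance (division by n) is assumed, and a constant column is left at zero rather than divided by zero, a guard the source does not discuss.

```python
import numpy as np


def zscore(X):
    """Standardize each column (numerical attribute) of X to mean 0 and std 1.

    X has one row per sample and one column per numerical attribute.
    Returns the standardized matrix together with the per-column mean and
    standard deviation so the same transform can be reused on new samples.
    """
    X = np.asarray(X, dtype=float)
    u = X.mean(axis=0)          # Step1: mean of each attribute
    sigma = X.std(axis=0)       # Step2: square root of the population variance
    safe_sigma = np.where(sigma == 0, 1.0, sigma)  # avoid dividing by zero
    return (X - u) / safe_sigma, u, sigma          # Step3: standardize
```

Returning `u` and `sigma` matters in practice: samples scored later should be standardized with the statistics of the training data, not with their own.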
The categorical attributes are encoded with One-Hot encoding, as follows:
Step1. Determine the number of discrete values of the attribute. If the attribute has M discrete values, the M states are encoded with an M-bit state register, in which each state has its own register bit and only one bit is active at a time;
Step2. Set up the M-bit state registers so that exactly one bit of each is 1 and the rest are 0. This converts differences between categorical values into distances in Euclidean space;
Step3. Map the M state codes one-to-one to the M discrete values, so that each attribute value becomes an M-dimensional vector, the One-Hot code of that value.
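The One-Hot steps above can be illustrated in a few lines of plain Python. This is a sketch only: `one_hot_encode` is a hypothetical name, and sorting the distinct values to fix the bit order is an assumption made so the mapping is reproducible.

```python
def one_hot_encode(values):
    """Encode a list of categorical values as M-dimensional 0/1 vectors.

    M is the number of distinct values. Returns the ordered category list
    (bit position -> category) and one code per input value.
    """
    categories = sorted(set(values))                 # Step1: the M discrete values
    position = {c: i for i, c in enumerate(categories)}
    codes = []
    for v in values:
        bits = [0] * len(categories)                 # Step2: M bits, all zero ...
        bits[position[v]] = 1                        # ... except the one active bit
        codes.append(bits)                           # Step3: value -> M-dim vector
    return categories, codes
```

Any two distinct categories end up at the same Euclidean distance (the square root of 2) from each other, so the encoding does not impose an artificial ordering on the categories.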
In a further improvement, step 3) specifically comprises the construction of the TADM network, the generation of the industry major-class anomaly indicator, and the generation of the industry detail anomaly indicator, as follows:
Step1. TADM network construction
The TADM network is a five-layer neural network. The input layer and the output layer have the same number of neurons, equal to the total dimension of the data obtained after processing the text and non-text attributes of the taxpayer information. The network takes the taxpayer information feature vector as its input and produces the reconstruction of that vector as its output;
Specifically, the TADM network is a standard feedforward neural network whose weights are updated with the back-propagation (BP) algorithm. BP is used to minimize the overall reconstruction error OF and to update the network parameters, formally:
$$\min_{\Omega} OF = \min_{\Omega} \frac{1}{n}\sum_{i=1}^{n} OF_i$$
where $\Omega$ denotes the network parameters, $n$ is the number of samples, and $OF_i$ is the reconstruction error of the $i$-th sample; the encoding and decoding capability of the network is determined by the parameters $\Omega$;
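A five-layer feedforward autoencoder trained by back-propagation can be sketched in NumPy as follows. This is an illustrative sketch, not the patent's network: the hidden-layer widths, the tanh activations, the linear output layer, the learning rate and the use of mean squared error as the per-sample reconstruction error are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def init_params(d_in, d_hidden, d_code, seed=0):
    """Weights for input -> hidden -> code -> hidden -> output (five layers)."""
    rng = np.random.default_rng(seed)
    sizes = [d_in, d_hidden, d_code, d_hidden, d_in]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(params, X):
    """Return the activations of all five layers (tanh hidden, linear output)."""
    acts = [X]
    for k, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        acts.append(z if k == len(params) - 1 else np.tanh(z))
    return acts


def overall_error(params, X):
    """Mean over samples of the per-sample mean squared reconstruction error."""
    out = forward(params, X)[-1]
    return float(np.mean(np.mean((X - out) ** 2, axis=1)))


def train_step(params, X, lr=0.05):
    """One full-batch gradient-descent step computed by back-propagation."""
    acts = forward(params, X)
    n, d = X.shape
    delta = 2.0 * (acts[-1] - X) / (n * d)   # gradient at the linear output layer
    updated = list(params)
    for k in range(len(params) - 1, -1, -1):
        W, b = params[k]
        grad_W = acts[k].T @ delta
        grad_b = delta.sum(axis=0)
        if k > 0:
            # propagate through the weights and the tanh of the layer below
            delta = (delta @ W.T) * (1.0 - acts[k] ** 2)
        updated[k] = (W - lr * grad_W, b - lr * grad_b)
    return updated
```

The third entry of `forward(...)`, that is `acts[2]`, is the low-dimensional code of the middle layer; in the scheme described here that is the representation handed to the major-class detector, while `acts[-1]` is the reconstruction used for the detail-level detector.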
Step2. Generation of the industry major-class anomaly indicator
Using the dimension-reduction property of the middle layer of the TADM network, the principal-component features of the middle layer are extracted and combined with the SOS anomaly detection algorithm to detect industry major-class anomalies;
The SOS anomaly detection algorithm is a similarity-based anomaly detection method: it judges whether a sample is abnormal by calculating its anomaly probability, with a suitable threshold chosen as the criterion for abnormal data. It comprises the following steps:
(1) generating a dissimilarity matrix between taxpayer characteristics
The dissimilarity between each pair of taxpayer features is calculated. The dissimilarity represents the degree of difference between two data points and is expressed as a Euclidean distance:
$$d_{ij} = \sqrt{\sum_{k=1}^{m} \left(x_{ik} - x_{jk}\right)^2}$$
where the feature matrix of the original data is $X = (x_{ij})_{n \times m}$ and each row of $X$ is the feature vector of one sample; $n$ is the number of rows, that is, the number of samples; $m$ is the number of columns, that is, the number of attributes of each sample; the dissimilarity matrix is $D = (d_{ij})_{n \times n}$, whose element $d_{ij}$ represents the degree of difference between sample $i$ and sample $j$;
(2) generating a matrix of degrees of association between taxpayer characteristics from dissimilarity information
The degree of association expresses how closely each taxpayer feature is related to the others. The association matrix is generated by a nonlinear mapping that takes the dissimilarity matrix $D$ to the association matrix $A$:
$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^{2}}{2\sigma_i^{2}}\right), & i \neq j \\ 0, & i = j \end{cases}$$
where $\sigma_i^2$ is the variance of the feature vector $X_i$, representing the dispersion of the data corresponding to the $i$-th principal-component feature in the taxpayer industry information, and $a_{ij}$ is the degree of association between sample $i$ and sample $j$;
(3) generation of association probability matrix between taxpayer characteristics through association degree information
The association probabilities between taxpayer features are given by the matrix $B = (b_{ij})_{n \times n}$. In calculating this matrix, each sample is abstracted as a node of a graph model, and the weight of the edge between two nodes represents their association probability. Formally:
$$b_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{ik}}$$
where $b_{ij}$ is the association probability between node $i$ and node $j$, and satisfies
$$\sum_{j=1}^{n} b_{ij} = 1$$
(4) Deriving abnormal probabilities between taxpayer features by correlating probabilities
In graph-theoretic terms, the probability that a taxpayer feature sample is an outlier is the probability that the in-degree of its node in the graph model is zero. The anomaly probability is derived from the association probabilities between samples as follows:
$$P(X_i \in \text{Outlier}) = \prod_{j \neq i} \left(1 - b_{ji}\right)$$
where $P(X_i \in \text{Outlier})$ is the probability that sample $X_i$ is abnormal data; industry major-class anomalies are then evaluated against a given threshold;
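The four SOS steps (dissimilarity, association degree, association probability, anomaly probability) can be sketched compactly in NumPy. This is a minimal sketch under assumptions: the function name is hypothetical, the per-sample spread `sigma` is supplied by the caller rather than estimated (the published SOS algorithm tunes it through a perplexity parameter, which is not reproduced here), and the input is taken to be the middle-layer features of the network.

```python
import numpy as np


def sos_outlier_probability(X, sigma):
    """Outlier probability of every row of X following the four SOS steps.

    X     : (n, m) feature matrix, one row per sample.
    sigma : scalar or length-n array, the spread used for each sample's row
            of the association matrix (supplied by the caller; an assumption).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (n,))

    # (1) dissimilarity matrix D: pairwise Euclidean distances
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))

    # (2) association matrix A, with zero self-association
    A = np.exp(-(D ** 2) / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(A, 0.0)

    # (3) association probabilities B: each row normalized to sum to 1
    B = A / A.sum(axis=1, keepdims=True)

    # (4) outlier probability of sample i: product over j of (1 - b_ji),
    #     i.e. the product down column i of (1 - B)
    return np.prod(1.0 - B, axis=0)
```

A sample is flagged when its probability exceeds the chosen threshold. If `sigma` is very small relative to the distances, a row of the association matrix can underflow to zero and the normalization in step (3) breaks down, so the spread needs to be on the scale of the data.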
step3. Generation of industry detail anomaly indicators
Detecting industry detail anomalies requires more feature information than detecting industry major-class anomalies. The TADM network constructed in Step1 is able to encode and decode normal data: encoding is the process of feeding data into the network and compressing it into a low-dimensional space, and decoding is the process of mapping the low-dimensional data back to the high-dimensional space as output. Industry detail anomalies are judged from the reconstruction error of each sample, calculated as:
$$OF_i = \frac{1}{N}\sum_{j=1}^{N} \left(x_{ij} - o_{ij}\right)^2$$
where $OF_i$ is the reconstruction error of the $i$-th sample, $N$ is the number of neurons in the input layer (equal to that of the output layer), $x_{ij}$ is the value of the $j$-th input-layer neuron for the $i$-th sample, and $o_{ij}$ is the value of the $j$-th output-layer neuron for the $i$-th sample.
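The per-sample reconstruction error and the threshold comparison can be sketched as below. The sketch assumes the error is the mean squared difference between input and output neurons, and both function names are hypothetical; how the threshold is chosen is not specified in the source and is left to the caller.

```python
import numpy as np


def reconstruction_errors(X, O):
    """Per-sample reconstruction error: mean squared input/output difference.

    X : (n, N) inputs to the network, one row per sample.
    O : (n, N) corresponding network outputs (reconstructions).
    """
    X = np.asarray(X, dtype=float)
    O = np.asarray(O, dtype=float)
    return np.mean((X - O) ** 2, axis=1)


def flag_anomalies(errors, threshold):
    """Boolean mask: True where the reconstruction error exceeds the threshold."""
    return np.asarray(errors) > threshold
```

For example, a sample reconstructed exactly has error 0 and is never flagged, while a sample the network cannot reproduce, because it resembles few of the normal samples seen during training, receives a large error and is flagged.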
The invention has the following beneficial technical effects:
The anomaly detection method for enterprise industry classification provided by the invention supplies more accurate and reliable data for industry classification and for other analyses of enterprise industry data. It adapts existing anomaly detection techniques to the problem of industry anomaly detection. Compared with the prior art, the invention has the following advantages:
(1) The invention combines the text and non-text information of taxpayers for industry classification. The prior art generally performs semantic analysis on the text information only and therefore uses the available information incompletely. The invention applies natural language processing to the text information and feature vectorization to the non-text information, integrating the important attributes of the taxpayers' business information and using that information more fully.
(2) The invention applies an autoencoder network model from deep learning to anomaly detection for industry classification and proposes the TADM anomaly detection method. It uses the reconstruction property of the network to detect industry detail anomalies and the dimension-reduction property of the network, together with the SOS anomaly detection algorithm, to detect industry major-class anomalies, thereby addressing two-level anomaly detection for industry classification.
(3) The network structure is reusable: the same structure can be used for anomaly detection on both industry major classes and industry details, without redesigning the network.
Drawings
FIG. 1 is an overall framework flow diagram.
FIG. 2 is a text attribute processing flow diagram.
FIG. 3 is a non-text attribute processing flow diagram.
FIG. 4 is a schematic diagram of two-level abnormal data in taxpayer information.
FIG. 5 is a schematic diagram of two-level anomaly detection performed by the TADM model.
FIG. 6 is a flow chart of the TADM model implementation.
FIG. 7 is a flowchart of the calculation of the industry major-class anomaly indicator.
Detailed Description
The invention is further described below with reference to the following figures and examples.
The taxpayer information registered in the national tax of a certain area from 2011 to 2017 is selected, the taxpayer information comprises sample data of 25 industry major categories and 112 industry details, and each industry major category belongs to a plurality of different industry details. The present invention will be described in further detail with reference to the accompanying drawings, in conjunction with experimental examples and embodiments. All the technologies realized based on the present disclosure belong to the scope of the present invention.
As shown in fig. 1, in the embodiment of the present invention, the anomaly detection process for classifying the taxpayer industry level 2 includes the following steps:
step1, text attribute processing
The taxpayer industry information table has a lot of valuable information and knowledge which can be used for industry classification, wherein part of the information is stored in a database in a text mode. The method for extracting 9 text attributes from the register taxpayer information and the register taxpayer information expansion table comprises the following steps: { HY _ DM, ZY, JY, JYFS, CBBZL, FSHY1_ DM, FSHY2_ DM, FSHY3_ DM and JYFW }, which respectively represent the following meanings: { industry code, main business, concurrent business, business mode, financial statement type, national subject industry 1, national subject industry 2, national subject industry 3 and business scope }. As shown in fig. 2, the text attribute processing implementation process specifically includes the following steps:
s101, preprocessing text information
Carrying out standardized processing on taxpayer industry information, and specifically implementing the standardized processing comprises the following steps: (1) deleting the character messy codes in the database; (2) deleting the numbers and the quantifier in the text; (3) the null-identified data is deleted.
S102. Word segmentation based on the Ansj word segmenter
Personal names, place names, industry descriptions and similar terms in the taxpayer registration information often exceed the coverage of the dictionary built into the word segmentation tool. To prevent taxpayer information from being split into semantically incomplete word fragments, an industry classification professional dictionary is constructed from the national economic industry classification standard, and a stop-word dictionary of the place names in the taxpayer information is constructed from the name lexicon of the national four-level administrative divisions (province, city, county, district). The text attributes are then segmented with the Ansj segmenter using the constructed dictionaries, producing a segmented corpus.
In this embodiment, each corpus is segmented and its stop words are removed with the Ansj segmenter, using the collected professional dictionary of the national economic industry classification standard and the place-name dictionary. For example, the JYFW (business scope) attribute value of one company is: "sales of energy-saving and environment-friendly products, equipment and accessories, building decoration materials, communication equipment, mechanical equipment, hardware power distribution, electronic products and daily-use general goods; development and sales of computer software and hardware; cleaning services". After word segmentation, the text is divided into independent words, and the result is: "energy-saving environment-friendly product, equipment, accessory, building decoration material, communication equipment, mechanical equipment, hardware, communication electronic product, daily-used general merchandise, sales computer software and hardware, development, sales and cleaning service".
S103. Constructing word vectors
The words of all samples are weighted according to the proportion that the words of the different classes occupy in the corpus. The words with larger weights are screened out, and the N keywords with the largest weights are retained for the text of each sample. This embodiment sets N = 25 and finally converts the 25 words of each sample into a numerical vector with the word2vec tool.
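The keyword selection in S103 can be sketched as below. The source does not give the exact weighting formula, so the weight used here (the share of a word within its class multiplied by a logarithmic penalty for words that occur in many classes, similar in spirit to TF-IDF) is an assumption, and `top_keywords` and the toy corpus are hypothetical. The word2vec conversion that follows is not shown.

```python
import math
from collections import Counter


def top_keywords(sample_words, class_label, class_word_counts, n_keep=25):
    """Keep the n_keep highest-weighted distinct words of one sample."""
    n_classes = len(class_word_counts)
    counts = class_word_counts[class_label]
    total = sum(counts.values())

    def weight(word):
        share = counts[word] / total  # proportion of the word in its class
        spread = sum(1 for c in class_word_counts.values() if word in c)
        return share * math.log(1 + n_classes / max(1, spread))

    distinct = list(dict.fromkeys(sample_words))  # keep order, drop repeats
    return sorted(distinct, key=weight, reverse=True)[:n_keep]


# Toy segmented corpus grouped by class label (made-up data).
corpus = {
    "retail": [["sales", "hardware", "sales"], ["sales", "cleaning"]],
    "software": [["software", "development", "sales"]],
}
class_word_counts = {
    label: Counter(w for doc in docs for w in doc)
    for label, docs in corpus.items()
}
keywords = top_keywords(corpus["retail"][0], "retail", class_word_counts, n_keep=1)
```

With `n_keep=25`, as in the embodiment, a sample with fewer than 25 distinct words simply keeps all of them.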
Step 2. Non-text attribute processing
Besides text information, the taxpayer registration information database also contains non-text information, which has more intuitive characteristics and is equally important for taxpayer industry classification, cluster analysis and anomaly detection.
As shown in fig. 3, the non-text attributes are processed in this embodiment as follows:
S201. Processing numerical attributes
Although the values of numerical attributes can be used directly in calculation, different attributes generally have different dimensions and orders of magnitude. To eliminate the influence of these differing dimensions, and to bring the distribution of the processed data as close as possible to a normal distribution, this embodiment processes the numerical attributes with the z-score method.
The registered taxpayer information table and the registered taxpayer information expansion table in the taxpayer industry information database are queried, and the numerical attributes { ZCZB, TZZE, CYRS, WJRS, HHRS, GDRS, ZRRTZBL, WZTZBL, GYTZBL } are extracted. Their meanings are respectively: { registered capital, total investment, number of employees, number of foreign employees, number of partners, number of fixed workers, natural-person investment proportion, foreign-capital investment proportion, state-owned investment proportion }. z-score processing is then applied to these 9 numerical attributes.
Specifically, in this embodiment, z-score processing is calculated as:
$$\tilde{X}_i = \frac{X_i - u_i}{\sigma_i}$$
where $X_i$ is the vector of values of the i-th numerical attribute of the taxpayer information, $u_i$ is the mean of the i-th numerical attribute, $\sigma_i$ is the standard deviation of the i-th numerical attribute (the square root of its variance), and $\tilde{X}_i$ is the vector after z-score processing. After z-score processing, the distribution of each numerical attribute is close to the standard normal distribution.
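As a minimal sketch of the z-score step, the function below standardizes one numerical attribute column. It assumes the population standard deviation is used (the source does not say whether the population or sample form applies), and the registered-capital values are made up for illustration.

```python
from statistics import fmean, pstdev


def z_score(column):
    """Standardize one numerical attribute: subtract the mean, divide by the std."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation (an assumption)
    if std == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


registered_capital = [100.0, 200.0, 300.0]  # made-up ZCZB values
scaled = z_score(registered_capital)
```

After this step every attribute has zero mean and unit standard deviation, so attributes measured in very different units contribute comparably to distance calculations.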
S202. Processing categorical attributes
Anomaly detection algorithms must measure the distance between data points. The values of categorical attributes, however, are discrete: each value is an identifier of a type rather than a number. Categorical attributes therefore have to be re-encoded so that the encoded values can be used for distance measurement.
The registered taxpayer information table and the registered taxpayer information expansion table in the taxpayer industry information database are queried, and 7 categorical attributes are extracted: { DJZCLX_DM, ZJG_BZ, GGHBZ, ZZLB_DM, HYMX_DM, SFCLGJXZJZHY, DZFPQY_BZ }. Their meanings are respectively: { registration type, head office flag, whether jointly administered by the national and local tax authorities, license category code, industry detail code, whether engaged in a nationally restricted or prohibited industry, electronic invoice enterprise flag }. These categorical attributes are then encoded.
In this embodiment, the 7 categorical attributes are encoded with One-Hot encoding. Taking the attribute ZJG_BZ (head office flag) as an example, the detailed encoding steps are as follows:
(1) Determine the number of discrete values of the head office flag. According to the taxpayer industry information table, the attribute takes 3 values: Y (head office), N (not a head office) and F (branch office).
(2) Set up a 3-bit status register in which each state code has exactly one bit equal to 1 and all other bits equal to 0; the set of 3-bit state codes is {001, 010, 100}.
(3) Map the 3 state codes to the 3 discrete values: Y, N and F correspond to 001, 010 and 100 respectively, which determines the final One-Hot code of the head office flag.
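The three encoding steps can be sketched as follows for ZJG_BZ. The helper name `one_hot_codes` is hypothetical; the bit order is chosen so that the result matches the codes 001, 010 and 100 given above.

```python
def one_hot_codes(values):
    """Map each discrete value to a bit string with exactly one bit set."""
    m = len(values)  # step (1): number of discrete values
    # steps (2) and (3): the k-th value sets the k-th bit counted from the right
    return {v: format(1 << k, f"0{m}b") for k, v in enumerate(values)}


codes = one_hot_codes(["Y", "N", "F"])
# Numeric vector for a taxpayer whose ZJG_BZ value is "N"
vector = [int(bit) for bit in codes["N"]]
```

Any two distinct codes differ in exactly two bit positions, so all category values are equally far apart in Euclidean space, which is the property the distance-based detection relies on.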
S203. Merging the feature vectors
After the text attributes have been processed in Step 1 and the non-text attributes in steps S201 and S202, the resulting feature vectors are concatenated into a single space to form the complete feature vector of each sample.
Step 3. Anomaly index generation and analysis
The task of anomaly detection is described first. This embodiment takes 2-level classification as an example, but the application range of the present invention is not limited to anomaly detection in 2-level classification. Fig. 4 is a schematic of anomalous data in a 2-level industry classification. P and Q are two clusters of first-level industry major categories. In the first-level cluster Q, A, B, C, D and E are second-level industry detail clusters, and samples a and f are industry major category anomalies in cluster Q. Sample e is an industry detail anomaly in the second-level cluster D, and sample m is an industry detail anomaly in the second-level cluster A. Similarly, in the first-level cluster P, F, G, H, J and T are second-level industry detail clusters, sample q is an industry major category anomaly in P, and sample t is an industry detail anomaly in the second-level cluster J. The final aim of the invention is to identify the anomalous data of industry major categories and industry details and remove them from the original data, thereby providing more accurate and reliable original data for taxpayer industry classification.
As shown in fig. 5, anomaly index generation and analysis comprises three parts: deep self-coding network construction, generation of the industry major category anomaly index, and generation of the industry detail anomaly index. To generate the industry major category anomaly index, the middle layer of the deep self-coding network extracts the principal component features of the industry major categories, and these features are then combined with the SOS (Stochastic Outlier Selection) anomaly detection algorithm to calculate the index. To generate the industry detail anomaly index, the reconstruction error of each sample output by the self-coding network is used as the index. The steps of anomaly index generation and analysis are shown in fig. 6, and the detailed process includes:
S301. Network structure design
First, the TADM network structure is determined. The number of input and output neurons of the TADM network is set according to the dimension of the sample feature space obtained in Step 2, which equals N in fig. 5, and a network with 5 layers is designed. The input layer and the output layer both have N neurons, and this embodiment finally sets N to 65. The second layer is a hidden layer with M neurons, and M is finally set to 30 through experiments in this embodiment. The third layer is the intermediate hidden layer with K neurons; its output is the basis for detecting industry major category anomalies, and K is set to 12 in this embodiment. The fourth layer has the same structure as the second layer. All layers are fully connected.
Finally, a model for anomaly detection is obtained through TADM network training and is determined by the network parameters Ω. The model is able to encode and decode the normal data in the sample space, so the network copies normal samples from the input end to the output end more easily. Anomalous data are distributed very differently from normal data, so the network reconstructs anomalous samples poorly, and this property can be used for anomaly detection and identification. Although different training strategies are adopted for industry major category and industry detail anomaly detection, the two tasks share the same network structure, so the TADM does not need a redesigned structure for each.
S302. Setting network parameters
After the TADM network structure is determined, specific network parameters need to be determined. In this embodiment, all network layers are full-connection networks, and the activation functions of the second layer and the fourth layer are in the form of hyperbolic tangent functions, which are represented in the form:
f(x)=tanh(x)
The activation function of the middle layer differs from that of the other layers. It is a staircase-like activation function, expressed as:
$$S_k(\theta) = \frac{1}{2} + \frac{1}{2(N-1)} \sum_{j=1}^{N-1} \tanh\left[a_k\left(\theta - \frac{j}{N}\right)\right]$$
where k denotes the index of the network layer, N denotes the number of steps of the activation function, and the parameter $a_k$ controls the transition rate from one step to the next. Specifically, in this embodiment, N is 10, k is 3, and a is 100. The activation function quantizes the data into N discrete values:
$$0, \frac{1}{N-1}, \frac{2}{N-1}, \ldots, 1$$
The original continuous data are thereby mapped into a vector space of discrete values, which achieves data compression.
The output layer adopts a sigmoid activation function, expressed as:
$$f(x) = \frac{1}{1 + e^{-x}}$$
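The activation functions of S302 can be sketched as follows. The staircase form used here, S(θ) = 1/2 + 1/(2(N-1)) Σ_{j=1}^{N-1} tanh(a(θ - j/N)), is an assumption taken from the usual replicator-network formulation, because the source gives the formula only as a figure; the parameter values N = 10 and a = 100 are those of the embodiment.

```python
import math


def staircase(theta, n_steps=10, a=100.0):
    """Staircase-like activation (assumed form): a sum of steep tanh steps."""
    total = sum(math.tanh(a * (theta - j / n_steps)) for j in range(1, n_steps))
    return 0.5 + total / (2 * (n_steps - 1))


def sigmoid(x):
    """Output-layer activation."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_tanh(x):
    """Activation of the second and fourth layers."""
    return math.tanh(x)


# Values between two step positions collapse onto one of the N levels.
level = staircase(0.15)
```

With a = 100 the transitions are so steep that inputs lying between the step positions j/N are mapped almost exactly onto the levels 0, 1/9, 2/9, ..., 1, which is how the middle layer quantizes its 12 outputs.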
S303. Network training strategy
In this embodiment, data from 25 industry major categories and their 112 industry details are selected for the experiment. During network training, the data are divided into a training set, a validation set and a test set at a ratio of 6:1:1, and the TADM network is then trained by cross validation. Different network parameters are trained for different tasks, and the training strategies for anomaly detection of industry major categories and of industry details differ.
Specifically, for industry detail anomaly detection, one network is trained for each industry major category using the industry details that belong to it. This network is called tiny-net_i in this embodiment, where i denotes the index of the corresponding industry major category. tiny-net_i is used to detect anomalies among the industry details subordinate to the i-th industry major category. For each tiny-net_i, the data are divided into a training set, a validation set and a test set at a ratio of 6:1:1, the corresponding network is trained to update its parameters, and anomaly detection is then carried out on the industry details subordinate to that industry major category.
For industry major category anomaly detection, a single network is trained, called large-net in this embodiment. Training this network requires data from all 25 industry major categories in the database. The data are used as follows: the data of each industry major category are randomly divided at a ratio of 6:1:1, the data drawn from all industry major categories are then merged, and large-net is trained on the merged data by cross validation.
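The per-category 6:1:1 split and merge described for large-net can be sketched as follows. The function names, the fixed random seed and the toy data are assumptions for illustration; the cross-validation loop and the network training itself are not shown.

```python
import random


def split_6_1_1(samples, rng):
    """Shuffle one category and cut it into train/validation/test at 6:1:1."""
    items = list(samples)
    rng.shuffle(items)
    n_train = len(items) * 6 // 8
    n_val = len(items) // 8
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def build_large_net_sets(data_by_class, seed=0):
    """Split every industry major category separately, then merge the parts."""
    rng = random.Random(seed)
    merged = {"train": [], "val": [], "test": []}
    for label, samples in data_by_class.items():
        parts = split_6_1_1(samples, rng)
        for name, part in zip(("train", "val", "test"), parts):
            merged[name].extend((label, s) for s in part)
    return merged


# Two made-up categories with 16 and 8 samples.
toy = {"major_01": list(range(16)), "major_02": list(range(100, 108))}
sets = build_large_net_sets(toy)
```

Splitting inside each category before merging keeps every industry major category represented in all three sets at the same ratio. The same `split_6_1_1` helper would apply unchanged to the data of a single category when training a tiny-net_i.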
S304. Generation of the industry major category anomaly index
Data in a high-dimensional space are distributed more sparsely than in a low-dimensional space; the clustering characteristic is lost and anomalous data are hard to detect. Industry major category anomaly detection therefore needs to reduce the dimension of the original data and extract the principal component spatial information.
The large-net obtained in step S303 is able to extract the principal component features of a sample. Industry major category anomaly detection mainly uses the third layer of large-net: this layer extracts the principal component features, which are then combined with the SOS anomaly detection algorithm. The flow of the SOS anomaly detection algorithm is shown in fig. 7, and the detailed detection steps are as follows:
(1) Generating a dissimilarity matrix between taxpayer characteristics
The large-net reduces the dimensionality of the taxpayer industry information data of the 25 industry major categories, and the principal component feature matrix after dimension reduction is X. The dissimilarity matrix is denoted by $D = (d_{ij})_{n \times n}$, where the matrix element $d_{ij}$ indicates the degree of difference between the i-th sample and the j-th sample.
The dissimilarity in this embodiment takes the form of the Euclidean distance:
$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$
where $x_{ik}$ is the value of the k-th feature of the i-th sample, n is the number of samples, and m is the dimension of the feature vector of each sample.
(2) Generating a matrix of degrees of association between taxpayer characteristics from dissimilarity information
The association degree between taxpayer characteristics is represented by the matrix $A = (a_{ij})_{n \times n}$. The association degree matrix is generated by constructing a nonlinear mapping that maps the dissimilarity matrix D into the association degree matrix A. The mapping takes the form:
$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^2}{2\sigma_i^2}\right), & i \neq j \\ 0, & i = j \end{cases}$$
where $\sigma_i^2$ is the variance of the principal component feature vector $X_i$, which expresses the degree of dispersion of the data corresponding to the i-th principal component feature in the taxpayer industry information, and $a_{ij}$ indicates the degree of association between sample i and sample j.
(3) Generation of association probability matrix between taxpayer characteristics through association degree information
The association probability between taxpayer characteristics is represented by the matrix $B = (b_{ij})_{n \times n}$. When the association probability matrix is calculated, each sample is abstracted as a point in a graph model, and the weight of the edge between two points represents their association probability. In this embodiment, the association probability is calculated as:
$$b_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{ik}}$$
where $b_{ij}$ is the association probability between point i and point j, which satisfies
$$\sum_{j=1}^{n} b_{ij} = 1$$
(4) Deriving the anomaly probability of each taxpayer feature sample from the association probabilities
In graph-theoretic terms, the probability that a sample is an outlier is the probability that the in-degree of its point in the graph model is zero. The anomaly probability is derived from the association probabilities between taxpayer characteristics as follows:
$$P(X_i \in \text{Outlier}) = \prod_{j \neq i} (1 - b_{ji})$$
where $P(X_i \in \text{Outlier})$ is the probability that sample $X_i$ is anomalous. After this probability is calculated, industry major category anomalies are evaluated against a given threshold.
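Steps (1) to (4) can be sketched end to end as follows. The formulas used are those of the standard Stochastic Outlier Selection algorithm, which match the definitions in the text: a Gaussian association degree with a_ii = 0, row-normalized association probabilities, and an outlier probability equal to the product over j ≠ i of (1 - b_ji). How σ_i² is obtained for each sample is left to the caller; the constant value and the sample points in the example are assumptions for illustration only.

```python
import math


def sos_outlier_probabilities(X, sigma2):
    """Return P(X_i is an outlier) for every row of the feature matrix X."""
    n = len(X)
    # (1) dissimilarity matrix D: pairwise Euclidean distances
    D = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # (2) association degree matrix A: Gaussian mapping of D, zero diagonal
    A = [[0.0 if i == j else math.exp(-D[i][j] ** 2 / (2.0 * sigma2[i]))
          for j in range(n)] for i in range(n)]
    # (3) association probability matrix B: each row of A normalized to sum 1
    B = []
    for row in A:
        s = sum(row)
        B.append([a / s if s > 0 else 0.0 for a in row])
    # (4) outlier probability: no other point "binds" to point i
    return [math.prod(1.0 - B[j][i] for j in range(n) if j != i)
            for i in range(n)]


# Four clustered points and one distant point (made-up 2-D features).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [8.0, 8.0]]
probs = sos_outlier_probabilities(points, sigma2=[1.0] * len(points))
```

In this toy run the distant point receives an outlier probability close to 1, because none of the clustered points assigns it any appreciable association probability.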
S305. Generation of industry detail abnormal indexes
The tiny-net_i of each industry major category obtained in S303 (i denotes the index of the industry major category) is able to self-encode industry detail data. tiny-net_i performs self-coding on all industry detail samples in industry major category i, and the reconstruction error of each industry detail sample is calculated as:
$$OF_i = \frac{1}{N} \sum_{j=1}^{N} (x_{ij} - o_{ij})^2$$
where N is the number of neurons in the input layer and in the output layer, $x_{ij}$ is the value of the i-th sample at the j-th neuron of the input layer, and $o_{ij}$ is the value of the i-th sample at the j-th neuron of the output layer.
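The reconstruction error of one sample can be sketched as below. It assumes the mean squared difference between the input and output vectors, OF_i = (1/N) Σ_j (x_ij - o_ij)², since the source gives the formula only as a figure; the four-dimensional vectors are made up (the embodiment uses N = 65).

```python
def reconstruction_error(x, o):
    """Mean squared difference between an input vector and its reconstruction."""
    if len(x) != len(o):
        raise ValueError("input and output vectors must have the same length")
    return sum((xj - oj) ** 2 for xj, oj in zip(x, o)) / len(x)


# Made-up input vector and network output for one sample.
x_i = [0.2, 0.4, 0.6, 0.8]
o_i = [0.2, 0.5, 0.6, 0.6]
of_i = reconstruction_error(x_i, o_i)
```

For these values the squared differences are 0, 0.01, 0 and 0.04, so the error is 0.0125. A sample that the network copies faithfully has an error near zero, while a poorly reconstructed sample has a large error.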
Step 4. Anomaly assessment
Once the TADM model has been constructed, anomaly results have to be produced. The TADM model provides specific calculation schemes for the anomaly indexes of industry major categories and industry details, but the indexes alone do not state whether a data item is anomalous, so the results require further evaluation.
In the anomaly detection and evaluation of the industry major class, an anomaly probability threshold value theta is set for the industry major class, and if the anomaly probability value P of a sample is larger than the threshold value, the sample is judged to be anomalous data of the industry major class. The anomaly probability threshold θ in this embodiment is 0.15.
In the abnormal detection evaluation of the industry detail, a reconstruction error threshold epsilon is set for the industry detail, and if the reconstruction error of the sample is larger than epsilon, the sample is judged to be abnormal data. In this embodiment, ε is set to 0.08.
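The two threshold rules can be sketched together as follows, using θ = 0.15 and ε = 0.08 from this embodiment. The function name, the dictionary layout and the sample identifiers and index values are hypothetical.

```python
THETA = 0.15    # anomaly probability threshold for industry major categories
EPSILON = 0.08  # reconstruction error threshold for industry details


def evaluate(outlier_prob, recon_error, theta=THETA, epsilon=EPSILON):
    """Return {sample_id: [kinds of anomaly]} for every flagged sample."""
    flagged = {}
    for sample_id in outlier_prob:
        kinds = []
        if outlier_prob[sample_id] > theta:
            kinds.append("major")   # industry major category anomaly
        if recon_error[sample_id] > epsilon:
            kinds.append("detail")  # industry detail anomaly
        if kinds:
            flagged[sample_id] = kinds
    return flagged


# Made-up index values for three taxpayers.
outlier_prob = {"t1": 0.02, "t2": 0.40, "t3": 0.10}
recon_error = {"t1": 0.01, "t2": 0.03, "t3": 0.20}
flagged = evaluate(outlier_prob, recon_error)
```

Both comparisons are strict, so a sample whose index equals the threshold is treated as normal; this follows the "larger than" wording of the evaluation rules.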
Step 5. Anomalous data output
The evaluation results for industry major categories and industry details from Step 4 are combined, the anomalous data in the taxpayer industry information are marked, and the identifiers of the anomalous data are output. The data marked as anomalous are deleted from the original data, which provides more accurate and reliable data for the industry classification task.
It will be understood by those skilled in the art that the foregoing is only exemplary of the method of the present invention and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (2)

1. An enterprise industry classification-oriented anomaly detection method is characterized by comprising the following steps:
firstly, extracting text information and non-text information to be mined from taxpayer industry information, and performing feature processing and encoding; secondly, constructing a deep network structure suited to the industry classification anomaly detection problem, and determining the number of neurons in the input layer and the output layer of the network according to the feature dimension of the encoded data; thirdly, based on the constructed deep network structure, training the industry major category network and the industry detail networks separately by cross validation with different training strategies; finally, carrying out anomaly detection of industry major categories by using the dimension reduction characteristic of the industry major category network combined with an SOS anomaly detection algorithm, and carrying out anomaly detection of industry details according to the reconstruction characteristic of the industry detail networks; the method specifically comprises the following implementation steps:
1) taxpayer text attribute processing
Analyzing the text information in the taxpayer industry information table, extracting representative text attributes, and performing anomaly detection by using the extracted attributes; the taxpayer text attribute processing specifically comprises the following steps:
Step 1. Preprocessing of text information
The text preprocessing standardizes the taxpayer industry information, and specifically comprises: (1) deleting garbled characters in the database; (2) deleting numbers and measure words in the text attributes; (3) deleting data whose value is the null identifier of the database;
Step 2. Word segmentation based on the Ansj word segmenter
Constructing an industry classification professional dictionary based on the national economic industry classification, constructing a stop-word dictionary based on the name lexicon of the national four-level administrative divisions (province, city, county, district), segmenting the texts with the Ansj segmenter according to the constructed dictionaries, and establishing a segmented corpus;
Step 3. Construction of word vectors
Weighting the words of all samples according to the proportion of the different classes of texts in the segmented corpus; screening out the words with larger weights, retaining the N keywords with the largest weights in each corpus, and converting the N keywords into word vectors by using the word2vec tool;
2) non-text attribute processing
The non-text attributes of the taxpayer industry information comprise two parts: numerical attributes and categorical attributes; the numerical attributes are processed by the z-score standardization method, and the categorical attributes are encoded by One-Hot encoding;
the method for processing the numerical attributes by using the z-score standardization method comprises the following specific steps:
Step 1. Calculating the mean of each attribute
Let $u = (u_1, u_2, \ldots, u_m)$ be the mean vector, where m is the number of numerical attributes and $u_i$ is the mean of the i-th numerical attribute, calculated as:
$$u_i = \frac{1}{n} \sum_{k=1}^{n} x_{ki}$$
where n is the number of taxpayer industry information samples, and $x_{ki}$ is the i-th numerical attribute value of the k-th sample;
Step 2. Calculating the variance of each attribute
Let $\sigma^2 = (\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2)$ be the variances of the numerical attributes, where m is the number of numerical attributes and $\sigma_i^2$ is the variance of the i-th numerical attribute, calculated as:
$$\sigma_i^2 = \frac{1}{n} \sum_{k=1}^{n} (x_{ki} - u_i)^2$$
the mean and the variance are basic indexes of a numerical attribute, and the numerical attribute can be standardized by means of them;
Step 3. Standardizing the data
Standardizing the sample data according to the mean and the variance of each numerical attribute calculated in the two preceding steps, in the form:
$$\tilde{X}_i = \frac{X_i - u_i}{\sigma_i}$$
where $\tilde{X}_i$ is the result after z-score processing, $X_i$ is the column vector corresponding to the i-th numerical attribute, $u_i$ is the mean of the i-th numerical attribute, and $\sigma_i$ is the standard deviation of the i-th numerical attribute, that is, the square root of its variance;
the categorical attributes are encoded by One-Hot encoding, with the following detailed steps:
Step 1. Determining the number of discrete values of the attribute; supposing the attribute has M discrete values, M states are encoded with an M-bit status register, each state having its own register bit with only one bit valid at a time;
Step 2. Setting the M-bit status register so that each state code has exactly one bit equal to 1 and all other bits equal to 0, whereby the difference between categorical values is converted into a distance in Euclidean space;
Step 3. Mapping the M state codes one by one to the M discrete values, so that each value of the attribute is an M-dimensional vector representing the One-Hot code of that attribute value;
3) anomaly indicator generation and analysis
Generating and analyzing abnormal indexes by taking a deep-learning self-coding network as a prototype, and designing a calculation method of industry classification abnormal indexes based on TADM according to the theory that industry information of different levels contains different information granularities;
4) anomaly evaluation
The business major anomaly index of the first level is obtained by a TADM network and an SOS anomaly detection algorithm, the SOS anomaly detection algorithm finally calculates the anomaly probability of each sample and provides an anomaly probability threshold, the anomaly probabilities of all taxpayer characteristic samples are compared with the threshold, and if the anomaly probability is greater than the threshold, the sample is judged to be anomalous data in the business major;
the industry detail anomaly index of the second level is obtained from the reconstruction by the TADM network: the TADM network finally calculates the reconstruction error of each sample, a reconstruction error threshold is given, the reconstruction errors of all industry detail samples are compared with the threshold, and if the reconstruction error is larger than the threshold, the sample is judged to be anomalous data within the industry details.
2. The enterprise industry classification-oriented anomaly detection method according to claim 1, wherein in step 3), the TADM-based method for calculating the industry classification anomaly indexes specifically comprises TADM network construction, generation of the industry major category anomaly index, and generation of the industry detail anomaly index, as follows:
Step 1. TADM network construction
The TADM network is a five-layer neural network; the input layer and the output layer have the same number of neurons, equal to the total dimension of the data obtained from the taxpayer information text attributes and non-text attributes after processing; the TADM network takes the taxpayer information feature vector as its input and the reconstruction result as its output;
specifically, the TADM network is a standard feedforward neural network whose weights are updated by the BP (back-propagation) algorithm; the BP algorithm is used to minimize the overall reconstruction error OF and update the network parameters, which is expressed as:
$$\min_{\Omega} OF = \min_{\Omega} \frac{1}{n} \sum_{i=1}^{n} OF_i$$
where Ω denotes the network parameters, n denotes the number of samples, and $OF_i$ denotes the reconstruction error of the i-th sample; the encoding and decoding capability of the network is determined by the network parameters Ω;
Step 2. Generation of the industry major category anomaly index
Using the dimension reduction characteristic of the intermediate layer of the TADM network, the principal component spatial features of the intermediate layer are extracted and combined with the SOS anomaly detection algorithm to detect industry major category anomalies;
the SOS anomaly detection algorithm is a similarity-based anomaly detection method that judges whether a sample is anomalous by calculating its anomaly probability, with a suitable threshold selected as the standard for evaluating anomalous data; the SOS anomaly detection algorithm comprises the following steps:
(1) Generating a dissimilarity matrix between taxpayer characteristics
For the taxpayer characteristics, the dissimilarity between each pair of taxpayer characteristics is calculated; the dissimilarity represents the degree of difference between two data points and takes the form of the Euclidean distance:
$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$
where the feature matrix of the original data is $X = (x_{ij})_{n \times m}$ and each row of X is the feature vector of one sample; n is the number of matrix rows, namely the number of data samples; m is the number of matrix columns, namely the number of attributes of each sample; the dissimilarity matrix is $D = (d_{ij})_{n \times n}$, and the element $d_{ij}$ represents the degree of difference between sample i and sample j;
(2) Generating a matrix of association degrees between taxpayer characteristics from the dissimilarity information
The relation between each taxpayer characteristic and the other taxpayer characteristics is expressed by an association degree; the association degree matrix is generated by constructing a nonlinear mapping that maps the dissimilarity matrix D into the association degree matrix A; the mapping is:
$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^2}{2\sigma_i^2}\right), & i \neq j \\ 0, & i = j \end{cases}$$
where $\sigma_i^2$ is the variance of the feature vector $X_i$ and represents the degree of dispersion of the data corresponding to the i-th principal component feature in the taxpayer industry information; $a_{ij}$ is the degree of association between sample i and sample j;
(3) Generating a matrix of association probabilities between taxpayer characteristics from the association degree information
The association probability between taxpayer characteristics is represented by the matrix $B = (b_{ij})_{n \times n}$; when the association probability matrix is calculated, each sample is abstracted as a point in a graph model, and the weight of the edge between two points represents their association probability; specifically, the association probability is calculated as:
$$b_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{ik}}$$
where $b_{ij}$ is the association probability between point i and point j, which satisfies
$$\sum_{j=1}^{n} b_{ij} = 1$$
(4) Deriving abnormal probabilities between taxpayer features by correlating probabilities
According to the graph theory, the probability that the taxpayer characteristic sample is an abnormal point is specifically expressed as follows: the probability that the click-in degree in the graph model corresponding to the taxpayer characteristics is zero; deriving abnormal probability through the associated probability among the samples, wherein the specific calculation method comprises the following steps:
Figure FDA0002865351970000055
where P(X_i ∈ Outlier) denotes the probability that sample X_i is abnormal data; the industry major-class anomaly is then evaluated against a given threshold;
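If the edges into a point are treated as independent, the probability that its in-degree is zero is the product over all other points j of (1 - b_ji). This is the form used in Stochastic Outlier Selection and is assumed here, since the patent's own formula is only a figure reference. The function names, the sample matrix, and the threshold of 0.5 are hypothetical.

```python
def outlier_probabilities(B):
    """P(X_i in Outlier) = product over j != i of (1 - b_ji).

    Assumes edges are drawn independently, so this is the probability
    that no other point links to point i (in-degree zero).
    """
    n = len(B)
    probs = []
    for i in range(n):
        p = 1.0
        for j in range(n):
            if j != i:
                p *= 1.0 - B[j][i]
        probs.append(p)
    return probs

def flag_anomalies(probs, threshold):
    """Mark samples whose anomaly probability exceeds the given threshold."""
    return [p > threshold for p in probs]

# Hypothetical association probabilities; few points link to sample 2.
B = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.5, 0.5, 0.0]]
probs = outlier_probabilities(B)
flags = flag_anomalies(probs, threshold=0.5)
```

Note that the product runs over column i of B (edges pointing into i), not over row i.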
Step3. Generation of industry detail anomaly indicators
Detecting anomalies at the industry detail level requires more feature information than at the industry major-class level. The TADM network constructed in Step1 is able to encode and decode normal data: encoding is the process of feeding data into the network and compressing it into a low-dimensional space, and decoding is the process of mapping the low-dimensional data back to a high-dimensional space for output. The industry detail anomaly is judged from the reconstruction error of the sample, which is calculated as follows:
Figure FDA0002865351970000061
where OF_i denotes the reconstruction error of the i-th sample, N denotes the number of input-layer and output-layer neurons, x_ij denotes the value of the j-th input-layer neuron for the i-th sample, and o_ij denotes the value of the j-th output-layer neuron for the i-th sample.
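The reconstruction error formula is available only as a figure reference. The sketch below assumes a mean squared error over the N neurons, OF_i = (1/N) * sum_j (x_ij - o_ij)^2; the patent's figure may use a different norm or scaling. The function name and the sample vectors are hypothetical, and the network output is given directly rather than produced by a trained TADM network.

```python
def reconstruction_error(x, o):
    """Assumed form: OF_i = (1/N) * sum_j (x_ij - o_ij)**2.

    x: input-layer values for one sample; o: output-layer values.
    """
    if len(x) != len(o):
        raise ValueError("input and output layers must have the same size N")
    n = len(x)
    return sum((xj - oj) ** 2 for xj, oj in zip(x, o)) / n

# A well-reconstructed sample versus a poorly reconstructed one.
of_normal = reconstruction_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
of_suspect = reconstruction_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A sample that the network reconstructs poorly gets a larger OF_i, which is what marks it as a candidate industry detail anomaly.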
CN201811489291.XA 2018-12-06 2018-12-06 Enterprise industry classification-oriented anomaly detection method Active CN109657947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811489291.XA CN109657947B (en) 2018-12-06 2018-12-06 Enterprise industry classification-oriented anomaly detection method

Publications (2)

Publication Number Publication Date
CN109657947A CN109657947A (en) 2019-04-19
CN109657947B true CN109657947B (en) 2021-03-16

Family

ID=66112636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811489291.XA Active CN109657947B (en) 2018-12-06 2018-12-06 Enterprise industry classification-oriented anomaly detection method

Country Status (1)

Country Link
CN (1) CN109657947B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110430224B (en) * 2019-09-12 2021-11-16 贵州电网有限责任公司 Communication network abnormal behavior detection method based on random block model
CN110705607B (en) * 2019-09-12 2022-10-25 西安交通大学 Industry multi-label noise reduction method based on cyclic re-labeling self-service method
CN110609905A (en) * 2019-09-12 2019-12-24 深圳众赢维融科技有限公司 Method and device for recognizing type of over-point and processing graph data
CN111178615B (en) * 2019-12-24 2023-10-27 成都数联铭品科技有限公司 Method and system for constructing enterprise risk identification model
CN111914090B (en) * 2020-08-18 2021-05-04 生态环境部环境规划院 Method and device for enterprise industry classification identification and characteristic pollutant identification
CN112084764B (en) * 2020-09-02 2022-06-17 北京字节跳动网络技术有限公司 Data detection method, device, storage medium and equipment
CN112215288B (en) * 2020-10-13 2024-04-30 中国光大银行股份有限公司 Method and device for determining category of target enterprise, storage medium and electronic device
CN112434518B (en) * 2020-11-30 2023-08-15 北京师范大学 Text report scoring method and system
CN112767106B (en) * 2021-01-14 2023-11-07 中国科学院上海高等研究院 Automatic auditing method, system, computer readable storage medium and auditing equipment
CN113077100A (en) * 2021-04-16 2021-07-06 西安交通大学 Online learning potential exit prediction method based on automatic coding machine
CN113283972A (en) * 2021-05-06 2021-08-20 胡立禄 System and method for constructing tax big data model
CN114330370B (en) * 2022-03-17 2022-05-20 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence
CN114997978B (en) * 2022-06-08 2023-06-30 深圳多有米网络技术有限公司 High-quality taxpayer identification method based on taxpayer operation characteristics
CN115794798B (en) * 2022-12-12 2023-09-15 江苏省工商行政管理局信息中心 Market supervision informatization standard management and dynamic maintenance system and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008420A (en) * 2014-05-26 2014-08-27 中国科学院信息工程研究所 Distributed outlier detection method and system based on automatic coding machine
CN104869126A (en) * 2015-06-19 2015-08-26 中国人民解放军61599部队计算所 Network intrusion anomaly detection method
CN107688822A (en) * 2017-07-18 2018-02-13 中国科学院计算技术研究所 Newly-increased classification recognition methods based on deep learning
CN108809974A (en) * 2018-06-07 2018-11-13 深圳先进技术研究院 A kind of Network Abnormal recognition detection method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6237699B2 (en) * 2015-05-25 2017-11-29 トヨタ自動車株式会社 Anomaly detection device
CN106294823B (en) * 2016-08-17 2019-03-22 上海云信留客信息科技有限公司 The method of abnormality detection and elimination for big data cleaning
CN107122394B (en) * 2017-03-10 2020-02-14 博彦科技股份有限公司 Abnormal data detection method and device
CN108287782A (en) * 2017-06-05 2018-07-17 中兴通讯股份有限公司 A kind of multidimensional data method for detecting abnormality and device
CN107391443B (en) * 2017-06-28 2020-12-25 北京航空航天大学 Sparse data anomaly detection method and device
CN107944480B (en) * 2017-11-16 2020-11-24 广州探迹科技有限公司 Enterprise industry classification method

Also Published As

Publication number Publication date
CN109657947A (en) 2019-04-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant