CN116257759A - Structured data intelligent classification grading system of deep neural network model - Google Patents

Structured data intelligent classification grading system of deep neural network model

Info

Publication number
CN116257759A
Authority
CN
China
Prior art keywords
data
attribute
classification
model
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310215953.9A
Other languages
Chinese (zh)
Inventor
史扬
曹凌云
刘文懋
高翔
尤扬
李一珉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Nsfocus Technologies Group Co Ltd
Original Assignee
Tongji University
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University and Nsfocus Technologies Group Co Ltd
Priority to CN202310215953.9A
Publication of CN116257759A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An intelligent classification and grading system for structured data based on a deep neural network model comprises the following modules: I, a structured data processing module; II, a data labeling module; III, an attribute-column data windowing module; IV, a self-encoding feature extractor construction and training module; V, a windowed attribute data feature transformation module; VI, a data classification neural network construction and training module; VII, a data grading multi-layer perceptron regression model construction and training module; and VIII, a data classification and grading prediction module. The invention uses an established regularization method and a keyword lexicon to label the data on which the model is trained, removing the dependence on fully manual labeling. Combining hand-crafted features with features extracted automatically by a deep neural network offers a new approach to feature extraction from structured-data attribute columns and can effectively accomplish the classification and grading of data.

Description

Structured data intelligent classification grading system of deep neural network model
Technical Field
The invention relates to data classification and grading prediction technology.
Background
China's promulgation of the Cybersecurity Law and the Data Security Law requires classified and graded management of data. Related industries should heed national regulations and strengthen standardized data management and control. Data are assets; data security concerns legal persons and individuals as well as public and national interests. With the advent of the big-data era, the management of data is becoming increasingly important, and data classification and grading is the foundation of data security control. If data are not stored by category, data from related domains become mixed together, security management of the data cannot be accomplished, and in serious cases data loss and leakage result.
Deep learning is an important research direction in machine learning and has made breakthrough progress in several areas [1-8]. A deep neural network builds a model imitating the neural connection structure of the human brain: when input signals are processed, the data characteristics are described by many neurons, yielding an interpretation of the data. Deep neural network methods therefore offer a new approach to classifying and grading data with high accuracy. Data classification methods are numerous. Rule-based classification methods [9,10] require macroscopic control of the data flow: data are screened against rules as they move, the rules being drawn mainly from the relevant standards, and, depending on the complexity of the data, several systems or methods often have to be designed to complete the screening and discrimination. Machine learning methods, by contrast, can discriminate the relevant data by learning its features; both unsupervised and supervised machine learning methods [11] can accomplish data classification to a certain extent. Compared with traditional rule-based classification and grading, machine learning requires less human participation and relies less on rules; rule-based methods scale poorly, so new datasets often require the rules to be reformulated and the discrimination system rebuilt.
Although existing data classification methods are diverse, most are still rule-based; they tend to focus on specific rules and methods, cover limited fields, and require the rules to be rebuilt when the data change. At the software level, rule changes are often realized as changes to the functions of different systems or modules, which again means that rule-based methods scale poorly. Research on data classification and grading using machine learning [12-15] is deepening, but the following shortcomings remain: (1) labeling the data still depends heavily on manual work, and processing large volumes of data is time-consuming; (2) classification and grading methods do not make full use of the attribute-column data and often attend only to the attribute names; (3) extracting identification features from column data with machine learning methods is difficult.
The closest prior art to the present application
Prior art 1 is a metadata classification and grading method based on a machine learning algorithm [13]. The method creates a frequent-item lexicon from a set of sensitive raw metadata in the financial field and uses the lexicon to convert text-like fields of the corresponding dataset into numerical features. Classification models are built to discriminate sensitive metadata, and a multi-class model then completes the subdivision of the data. The method reduces the dependence of data classification on manual work and improves classification efficiency. The flow of the scheme is shown in figure 1.
Characteristics of this scheme: (1) text-like data are vectorized with a word-set method, so the only text feature considered is frequency within the data; other features of the text-like data are ignored. (2) The feature vectors built from the frequent word set are relatively limited; to vectorize the characterized data accurately, as much data from the field as possible must be collected. (3) The word combinations of interest built by the frequent-item lexicon fall into three types, and the types of interest could appropriately be expanded.
Prior art 2 is a data grading and classification system and method based on data security and privacy protection [16], which grades and classifies data by combining rules with industry standards. The scheme comprises several subsystems: a data receiving subsystem, a data identification subsystem, a data screening subsystem, a data classification subsystem and a data grading subsystem. The data receiving subsystem receives the data; the data identification subsystem identifies industry data; the data screening subsystem screens industry data; the data classification subsystem classifies the data using industry standards; the data grading subsystem grades the data using industry standards. The scheme can classify data and refine classification and grading according to industry standards. The flow of the scheme is shown in figure 2.
Characteristics of this scheme: (1) classification and grading are implemented according to industry standards; because the rule standards are implemented mainly by the system itself, parts of the system must change as the standards change, so scalability is limited. (2) Manual dependence is high, and standard parameters must be entered manually. (3) The scene data the system attends to and processes is relatively fixed and tightly bound to industry standards.
Disclosure of Invention
Technical problem to be solved by the present application
The technical problems to be solved by the invention, summarized from the shortcomings of the prior art, are:
(1) Existing classification and grading methods suffer from manual discrimination and overlong labeling time; the discrimination process depends not only on manual judgment but also on the data users have fed into the system over a long period, so building a classification and grading framework is lengthy and time-consuming.
(2) Data are under-utilized: existing methods mainly build classification and grading models from the data attributes while ignoring the attributes' metadata, wasting data information.
(3) Methods for extracting attribute-column data features are monotonous; most models only use the data attribute names to model and extract attribute-name features.
The invention aims at these pain points in the data classification and grading workflow: it combines a deep neural network method with a data classification and grading framework so that data information is used more fully and the classification and grading of data is accomplished.
The technical scheme of the invention is summarized as follows:
the structured data intelligent classification and classification system of the deep neural network model is characterized by comprising a module: i, a structured data processing module; II, a data labeling processing module; III, an attribute column data windowing conversion module; IV, constructing and training a module by a self-coding feature extractor; v, a windowed attribute data feature transformation module; VI, constructing and training a data classification neural network model; VII, constructing a regression model of the data grading multi-layer perceptron and training the model; and VIII, a data classification hierarchical prediction module.
Summary of the main innovations of the invention
In view of the above problems, the main innovative technical points of the invention are:
First, a keyword lexicon is provided; columns with existing attribute names are discriminated by category with a regularization (regular-expression) method, and the data to be fed into the model are labeled semi-manually, solving the problem of slow, fully manual discrimination and labeling.
Second, a sliding-window method for instantiating attribute-column data is provided: a single attribute column is converted into a multi-instance representation, and a feature extraction scheme combining statistical features of the attribute-column data with a self-encoding feature extractor is constructed, making full use of the attribute-column data and mining its latent information.
Third, a data classification deep neural network and a data grading multi-layer perceptron regression model are designed and built to complete the classification and grading of data.
Beneficial effects of the application:
1) Against the manual discrimination and labeling required by existing classification and grading methods, the invention uses the established regularization method and keyword lexicon to discriminate attribute names by category and to label the data on which the model is trained, removing the dependence on manual labeling. The approach is convenient and fast, reduces the time spent on manual discrimination and labeling, and lightens the burden on technicians.
2) Against the insufficient use of data in existing machine learning methods, the invention designs a sliding-window sampling method that instantiates attribute-column data, trains a self-encoding feature extractor on these instances, vectorizes the instances with the pre-trained self-encoder, and, combined with manual statistical feature extraction on the instances, fully mines the data features within the instances.
3) Considering practical application scenarios, combining hand-crafted features with features extracted automatically by a deep neural network offers a new approach to feature extraction from structured-data attribute columns and can effectively accomplish the classification and grading of data.
Drawings
FIG. 1 is a flow chart of the metadata classification and grading method based on a machine learning algorithm of prior art 1
FIG. 2 is a flow chart of the data grading and classification system and method based on data security and privacy protection of prior art 2
FIG. 3 is a schematic diagram of the system modules of the general technical scheme of the invention
FIG. 4 is a flow chart of the general technical scheme of the invention
FIG. 5 is a schematic diagram of the specific models of the modules for labeling, instantiating and feature extraction of structured data
FIG. 6 is a schematic diagram of the specific models of the data classification and grading model construction, training and prediction modules
Detailed Description
The implementation process of the general technical scheme is described with reference to figs. 3 and 4.
An intelligent classification and grading system for structured data based on a deep neural network model, see fig. 3, comprises the modules: I, a structured data processing module; II, a data labeling module; III, an attribute-column data windowing module; IV, a self-encoding feature extractor construction and training module; V, a windowed attribute data feature transformation module; VI, a data classification neural network construction and training module; VII, a data grading multi-layer perceptron regression model construction and training module; and VIII, a data classification and grading prediction module.
The implementation of each module is described below.
I, a structured data processing module: comprises steps S0 and S1, wherein,
Step S0: form the structured data. Using the constructed domain raw dataset, integrate the scattered stored data into structured data by means of ETL, stream data processing and manual entry, completing the extraction and loading of the data.
Step S1: in processing the structured dataset, check the integrity and normalization of the structured data and handle abnormal row data.
II, a data labeling module: comprises steps S2, S3, S4, S5, S6, S7 and S8, wherein,
Step S2: extract the attribute names of the attribute-column data in the dataset;
Step S3: match the extracted attribute names with the data attribute identification rules; if a match succeeds, label the attribute column and go to step S8;
Step S4: if no rule matches, label the attribute column using the attribute keyword lexicon constructed in step S5;
Step S5: search the keyword lexicon for the extracted attribute name; if the search succeeds, label the attribute column and go to step S8;
Step S6: if the search fails, label the attribute column manually in step S7; once labeled, go to step S8 to form the labeled attribute dataset.
In the data labeling module, the labeled attribute dataset of step S8 is formed through steps S3, S4, S5, S6 and S7. The keyword lexicon of attribute names (S5) is built by expanding keywords mainly with natural language processing techniques, assisted by brainstorming: synonyms and near forms of each attribute name are completed, and the attribute names and attribute security levels of all attribute families are identified, forming the keyword lexicon. The labeling rules (S3) are built on the keyword lexicon: all attribute families in the lexicon are converted into regular expressions, and each regular expression is assigned the attribute name and attribute security level of its attribute family.
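To make the S2-S8 labeling flow concrete, a minimal sketch follows; the rule patterns, lexicon entries and tag values are invented placeholders, and the sketch only illustrates the rule-then-lexicon-then-manual cascade, not the patent's actual implementation.

```python
import re

# Hypothetical labeling rules (S3): regex pattern -> (attribute family, security level)
RULES = {
    r"phone|mobile|tel": ("contact", 3),
    r"name": ("identity", 2),
}

# Hypothetical keyword lexicon (S5): keyword -> (attribute family, security level)
LEXICON = {"email": ("contact", 3), "address": ("identity", 2)}

def label_attribute(attr_name):
    """Return (family, level), or None when manual labeling (S7) is required."""
    for pattern, tag in RULES.items():        # S3: rule matching
        if re.search(pattern, attr_name, re.IGNORECASE):
            return tag
    lowered = attr_name.lower()
    for keyword, tag in LEXICON.items():      # S5: lexicon lookup
        if keyword in lowered:
            return tag
    return None                               # S6/S7: fall back to manual labeling

print(label_attribute("customer_phone"))      # ('contact', 3)
print(label_attribute("login_count"))         # None -> label manually
```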
III, an attribute-column data windowing module: comprises steps S9, S10 and S11, wherein,
Step S9: set the window, i.e. set the window size and the window step;
Step S10: extract attribute-column data, sliding by the window step and taking the window size as the extraction length;
Step S11: combine with the column labels to obtain the labeled instance set.
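A minimal sketch of the S9-S11 windowed instantiation, assuming an attribute column is a plain Python list and w, s are the window parameters set in step S9:

```python
def windowed_instances(column, tag, w, s):
    """S10-S11: slide a window of size w with step s over a column;
    every slice becomes one labeled instance."""
    return [(column[i:i + w], tag)
            for i in range(0, len(column) - w + 1, s)]

instances = windowed_instances(list(range(20)), tag=("contact", 3), w=8, s=4)
print(len(instances))     # 4 instances
print(instances[0])       # ([0, 1, 2, 3, 4, 5, 6, 7], ('contact', 3))
```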
IV, a self-encoding feature extractor construction and training module: comprises steps S14, S15, S16 and S18, wherein,
Step S14: normalize the instances from step S10 by attribute column;
Step S15: construct the self-encoding feature extractor;
Step S16: train the self-encoding feature extractor built in step S15 on the normalized instances from step S14;
Step S18: obtain the pre-trained self-encoding feature extractor model.
The self-encoding feature extractor model built in step S15 is specified as follows:
1) The encoder: let $h_0 = I'$; then
$$h_i = \sigma_i^e\!\left(W_i^e h_{i-1} + b_i^e\right), \quad 1 \le i \le L_e \qquad \langle 1\rangle$$
and the encoder output is $E = h_{L_e}$.
2) The decoder: let $o_0 = h_{L_e}$; then
$$o_i = \sigma_i^d\!\left(W_i^d o_{i-1} + b_i^d\right), \quad 1 \le i \le L_d \qquad \langle 2\rangle$$
3) Let $\hat{I} = o_{L_d}$; the loss function is the reconstruction error plus a penalty term,
$$\mathcal{L} = \mathrm{MSE}\!\left(I', \hat{I}\right) + \zeta\,\Omega(W) \qquad \langle 3\rangle$$
(In the encoder, $I'$ denotes a normalized instance; among the network parameters, $W_i^e$ are the encoding weights, $b_i^e$ the encoder biases, $\sigma_i^e$ the encoder activation functions, and $L_e$ the number of encoder layers. In the decoder, $W_i^d$ are the decoding weights, $b_i^d$ the decoder biases, $\sigma_i^d$ the decoder activation functions, and $L_d$ the number of decoder layers. $\zeta$ is the penalty coefficient weighting the regularization term $\Omega(W)$.)
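A hedged PyTorch sketch of such a self-encoding feature extractor, consistent with the formulas above and with the tanh/MSE/SGD choices stated in the detailed implementation below; the layer widths and the use of the penalty coefficient ζ as L2 weight decay are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Stacked fully connected encoder/decoder with tanh activations, as in S15."""
    def __init__(self, w=8, hidden=16, code=4):
        super().__init__()
        self.encoder = nn.Sequential(          # h_i = tanh(W_i^e h_{i-1} + b_i^e)
            nn.Linear(w, hidden), nn.Tanh(),
            nn.Linear(hidden, code), nn.Tanh())
        self.decoder = nn.Sequential(          # o_i = tanh(W_i^d o_{i-1} + b_i^d)
            nn.Linear(code, hidden), nn.Tanh(),
            nn.Linear(hidden, w), nn.Tanh())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
zeta = 1e-4                                    # penalty coefficient, here as L2 weight decay
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=zeta)
loss_fn = nn.MSELoss()

x = torch.rand(32, 8) * 2 - 1                  # stand-in for normalized instances I' in [-1, 1]
for _ in range(10):                            # S16: reconstruction training
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)
    loss.backward()
    optimizer.step()
codes = model.encoder(x)                       # S18/S19: vectorized instances E
```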
V, a windowed attribute data feature transformation module: comprises steps S12 and S13, and steps S17, S18 and S19, wherein,
Step S12: instance statistical feature extraction; expert knowledge is used to design basic statistics and their transformations to characterize each instance.
Step S13: normalize the instance statistical features of the attribute-column data.
Step S17: before instance vectorization, normalize the instances and supply them to S18;
Step S18: apply the pre-trained self-encoding feature extractor model and proceed to S19;
Step S19: encode the labeled, normalized instances with the pre-trained self-encoding feature extractor.
In this module, steps S12, S13, S17 and S19 convert the same instance into a statistical-feature representation and a self-encoder vectorized representation. In step S12 the statistics are: arithmetic mean, median, mode, quartiles (three values), interquartile range, range, standard deviation (n degrees of freedom), skewness, kurtosis, coefficient of variation, standard deviation (n-1 degrees of freedom), outlier ratio, and midrange.
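A sketch of the S12 statistics for one windowed instance, using numpy and scipy; it computes a representative subset of the list above, and the 1.5 x IQR outlier threshold is an added assumption.

```python
import numpy as np
from scipy import stats

def instance_statistics(x):
    """A representative subset of the S12 statistics for one windowed instance."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])          # quartiles (three values)
    iqr = q3 - q1                                        # interquartile range
    outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    mean = x.mean()
    return {
        "mean": mean,
        "median": q2,
        "range": np.ptp(x),
        "std_n": x.std(ddof=0),                          # standard deviation, n dof
        "std_n_minus_1": x.std(ddof=1),                  # standard deviation, n-1 dof
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "coef_of_variation": x.std(ddof=0) / mean if mean else 0.0,
        "outlier_ratio": outliers.mean(),
    }

print(instance_statistics([1, 2, 2, 3, 4, 100]))
```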
VI, a data classification neural network construction and training module: comprises steps S20, S21 and S24, wherein,
Step S20: construct the data classification deep neural network model and pass it to step S21 for training;
Step S21: feed the instance-normalized statistical features from step S13 into the statistical feature learning layer and the labeled instance vectors from step S19 into the coded-instance learning layer; fuse the two vectors learned by these layers and feed the result into the fusion feature learning layer; with cross entropy as the loss function, adjust the model parameters using the predicted and actual instance attribute categories output by the fusion feature learning layer, completing the training of the data classification neural network model. A temperature coefficient is applied to the model during training to adjust how well the model fits different categories.
Step S24: obtain the trained data classification deep neural network model and proceed to step S26.
The S20 data classification deep neural network model is specified as follows:
1) The statistical feature learning layer learns a function (denote it $\psi$)
$$\psi : \mathbb{R}^m \to \mathbb{R}^k,$$
which learns the instance statistics and converts the m-dimensional instance statistics into a k-dimensional vector.
2) The coded-instance learning layer learns a function $\varphi$ that maps the instance encoding $E$ to a k-dimensional vector; here the model adjusts the influence of the encoding vector on the model.
3) The fusion feature learning layer learns a classification function $\kappa$ on the fused vector $[\psi(S'), \varphi(E)]$ of the statistical feature vector and the instance feature vector, realizing the learning and classification of the two converted vectors and completing the classification judgment of the data.
4) The model optimization problem is the temperature-scaled cross entropy
$$\min_{\theta_1,\theta_2,\theta_3}\; -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} y'_{n,m}\,\log\frac{\exp\!\left(y_{n,m}/T\right)}{\sum_{j=1}^{M}\exp\!\left(y_{n,j}/T\right)}$$
($y$ is the classifier's predicted class set, $y'$ the original class set; $\theta_1, \theta_2, \theta_3$ are the parameters of the corresponding sub-models; $T$ is the temperature coefficient, $M$ the number of labels, and $N$ the number of sample instances.)
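A hedged PyTorch sketch of the S20 fusion classifier: two input branches (statistical features through ψ, autoencoder codes through φ), concatenation into the fusion stack κ, and temperature-scaled softmax cross entropy as in the optimization problem above. All layer widths and sizes are invented; only the structure follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Two learning branches fused into a classifier, as in steps S20/S21."""
    def __init__(self, m=13, code=4, k=8, n_classes=5):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(m, k), nn.ReLU())     # statistical feature learning layer
        self.phi = nn.Sequential(nn.Linear(code, k), nn.ReLU())  # coded-instance learning layer
        self.kappa = nn.Sequential(                              # fusion feature learning layer
            nn.Linear(2 * k, k), nn.ReLU(),
            nn.Linear(k, n_classes))

    def forward(self, stats_vec, code_vec):
        fused = torch.cat([self.psi(stats_vec), self.phi(code_vec)], dim=1)
        return self.kappa(fused)                                 # class logits

model = FusionClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
T = 2.0                                                          # temperature coefficient

stats_vec = torch.rand(32, 13)
code_vec = torch.rand(32, 4)
labels = torch.randint(0, 5, (32,))
logits = model(stats_vec, code_vec)
loss = F.cross_entropy(logits / T, labels)                       # temperature-scaled cross entropy
loss.backward()
optimizer.step()
```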
VII, a data grading multi-layer perceptron regression model construction and training module: comprises steps S22, S23 and S25, wherein,
Step S22: construct the data grading multi-layer perceptron regression model and pass it to step S23 for training;
Step S23: splice the instance-normalized statistical features from step S13 with the labeled instance vectors from step S19 to represent the instance conversion features, and input them into the model; with squared error as the loss function, train the regression model built in step S22 using the predicted and actual attribute-instance security levels, then proceed to step S25.
Step S25: obtain the trained data grading multi-layer perceptron regression model and proceed to step S26.
Each layer of the S22 data grading multi-layer perceptron regression model is defined as:
$$O_i^{(l)} = f\!\left(\sum_{j=1}^{n^{(l-1)}} w_{j,i}^{(l)}\, O_j^{(l-1)} + w_{0,i}^{(l)}\right)$$
($f$ is the activation function, $L$ the number of layers of the neural network, $n^{(l)}$ the number of neurons in layer $l$, $O_i^{(l)}$ the output of the $i$-th neuron of layer $l$, and $w_{j,i}^{(l)}$, $w_{0,i}^{(l)}$ the weight and bias parameters of the corresponding neuron.)
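A minimal sketch of such a grading regressor in PyTorch, with relu activations and the Adam optimizer per the detailed implementation below; the widths and the input dimension (autoencoder code plus statistics, spliced) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Input: autoencoder code (4 dims) spliced with statistics (13 dims) -> 17 dims.
regressor = nn.Sequential(
    nn.Linear(17, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1))                      # predicted security level, a scalar

optimizer = torch.optim.Adam(regressor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                     # squared-error loss, per step S23

features = torch.rand(32, 17)              # spliced instance conversion features
levels = torch.randint(1, 5, (32, 1)).float()
loss = loss_fn(regressor(features), levels)
loss.backward()
optimizer.step()
```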
VIII, a data classification and grading prediction module: comprises step S26, wherein,
Step S26: take the attribute category output most often by the S24 data classification neural network model as the final attribute category; for the attribute security level, take the level output most often by the S25 data grading multi-layer perceptron regression model as the final attribute security level; output the result and end the program.
For attribute-column data to be predicted, generate attribute-column instances through steps S9 and S10, obtain normalized instances through step S14, and feed them into the S18 self-encoding feature extractor to obtain vectorized instances. For the same attribute-column instances, generate instance statistics with step S12 and normalize them with step S13. Feed the normalized instance statistics from step S13 and the vectorized instances produced by the S18 self-encoding feature extractor into the data classification neural network model to judge the instances' attribute category, and into the S25 data grading multi-layer perceptron regression model to judge the instances' attribute security level. For the attribute category, the category output most often by the S24 model is the final judgment; for the security level, the level output most often by the S25 model is the final judgment.
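The per-column voting of step S26 can be sketched in a few lines; the categories and levels below are invented examples.

```python
from collections import Counter

def majority(values):
    """Most frequent value wins, as in the S26 voting."""
    return Counter(values).most_common(1)[0][0]

# One prediction per window instance of the column:
category = majority(["contact", "contact", "identity", "contact"])
level = majority([round(v) for v in (2.6, 3.2, 3.4, 2.9)])
print(category, level)    # contact 3
```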
The technical details of the important modules in the system are further detailed below.
As shown in fig. 5, the detailed implementation of the structured data processing module and the data labeling module is as follows:
1) Integrate the scattered stored data into the S0 structured dataset SD = {(attr_0, x'_0), ..., (attr_d, x'_d)} by means of ETL, stream data processing and manual entry. In step S1, check the attribute-column data x' = {x'_0, ..., x'_d} of the structured dataset SD for integrity and normalization, and delete abnormal data rows and rows with missing values to obtain x = {x_0, ..., x_d}. The processed structured dataset is D = {(attr_0, x_0), ..., (attr_d, x_d)}.
2) Take the attribute set A = {attr_0, ..., attr_d} of the structured dataset D. Match the attributes in A in turn against the S3 labeling rules built from regular expressions; if a match succeeds, add the attribute and the tag corresponding to the rule to the attribute label set AT_part; otherwise search the S5 keyword lexicon for the attribute, and if the search succeeds add the attribute and the lexicon's tag to AT_part; otherwise finish the identification manually. Merging AT_part yields AT = {(attr_0, tag_0), ..., (attr_d, tag_d)}. Combining x = {x_0, ..., x_d} with AT gives the S8 labeled attribute dataset DTag = {(attr_0, x_0, tag_0), ..., (attr_d, x_d, tag_d)}. (Here AT_part = {(attr_i, tag_i), ..., (attr_j, tag_j)}, 0 ≤ i ≤ j ≤ m; each tag has two parts, the attribute family name corresponding to the structured attribute name and that attribute family's security level.)
3) In step S9, set the window size to w and the window step to s. In step S10, take each attribute column x_i in DTag and intercept w consecutive values of x_i as one instance; move by step s and select the next instance, repeating to the end of the column. Label the instances with the tags in DTag to obtain the S11 labeled instance set ITag = {(I_0, tag_0), ..., (I_d, tag_d)}.
As shown in fig. 5, the detailed implementation of the attribute-column data windowing module, the self-encoding feature extractor construction and training module, and the windowed attribute data feature transformation module is as follows:
1) For the instance sets I = {I_0, ..., I_d} in the labeled instance set ITag, the normalization method of steps S14 and S17 is: take the attribute-column data x = {x_0, ..., x_d} and compute the maximum absolute value of each column, giving CL = [m_0, ..., m_d]. Divide all data in the attribute-column instance set I_0 by the maximum absolute value m_0 to obtain the normalized instance set I'_0. Repeat until all attribute-column instances are normalized, giving the full set of normalized instance sets I' = {I'_0, ..., I'_d} (a normalization sketch follows this numbered list).
2) In the S15 self-encoding feature extractor model, the encoder and decoder activation function is tanh, the loss function is MSE, and the optimizer is SGD. In step S16, train the model on the training set, select the best model on the validation set, and verify the best model on the test set. A penalty coefficient ζ is introduced during training to adjust the model.
3) For any instance (I_{j,i}, tag_{j,i}), normalize I_{j,i} as described in step S17, and in step S19 encode the normalized I_{j,i} as E_{j,i} with the trained S18 self-encoding feature extractor. In step S12, extract the instance statistics of I_{j,i} with expert knowledge and convert them into S_{j,i}; set the tag of the instance vectorization feature E_{j,i} and the instance basic feature S_{j,i} to tag_{j,i}. Repeat until all labeled instances have been converted into instance statistics and instance vectorization features, recorded as ESTag = {(E_0, S_0, tag_0), ..., (E_d, S_d, tag_d)}. The expert knowledge consists of the constructed basic statistics describing the data distribution and their transformations: the skewness, kurtosis, coefficient of variation and outlier ratio are taken directly as instance conversion statistics, and the pairwise quotients of the remaining statistics are taken as further instance conversion statistics.
4) The normalization method of step S13 is: vertically stack the statistical features S_i of the attribute columns in ESTag into VS, and compute the maximum absolute value of each column of VS to obtain SL = [m_0, ..., m_v]. Divide each row of S_i by the corresponding maximum absolute value in SL to obtain the normalized statistical feature set S'_i. Repeat until all attribute columns are normalized, giving ES'Tag = {(E_0, S'_0, tag_0), ..., (E_d, S'_d, tag_d)}.
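A numpy sketch of the max-absolute-value normalization used in S14/S17 (and, column-wise over the stacked statistics, in S13); the guard for an all-zero column is an added assumption.

```python
import numpy as np

def maxabs_normalize(instances, column):
    """Divide every instance from one column by that column's max absolute value."""
    m = float(np.max(np.abs(column)))
    m = m if m != 0.0 else 1.0            # guard against an all-zero column
    return [np.asarray(inst, dtype=float) / m for inst in instances]

column = np.array([3.0, -8.0, 5.0, 2.0])
print(maxabs_normalize([column[0:2], column[1:3]], column))
```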
As shown in fig. 6, the detailed implementation of the data classification and grading model construction, training and prediction modules is as follows:
1) In the data classification neural network model constructed in step S20, the statistical feature learning layer function ψ, the coded-instance learning layer function φ and the fusion feature learning layer function κ are each stacks of fully connected neural network layers. The hidden-layer activation is relu, the classifier-layer activation is softmax, the loss function is cross entropy, and the optimizer is SGD. Combining the attribute-column feature set ES'Tag = {(E_0, S'_0, tag_0), ..., (E_d, S'_d, tag_d)} obtained from steps S13 and S19 with the S20 model, the training process of step S21 is: divide the dataset into a training set, a validation set and a test set. For the instance features (E_{i,j}, S'_{i,j}, tag_{i,j}) characterizing an attribute column, feed E_{i,j} into φ and S'_{i,j} into ψ, splice the results φ(E_{i,j}) and ψ(S'_{i,j}), and feed them into κ. At the classifier layer, the model is adjusted by dividing the logits by the temperature coefficient T before the softmax. Train the model on the training set, select the best model on the validation set, and verify it on the test set.
2) In the data grading multi-layer perceptron regression model constructed in step S22, the activation function is relu and the optimizer is Adam. Combining the attribute-column feature set ES'Tag obtained from steps S13 and S19 with the S22 model, the training process of step S23 is: divide the dataset into a training set and a test set. For the instance features (E_{i,j}, S'_{i,j}, tag_{i,j}) characterizing an attribute column, splice E_{i,j} and S'_{i,j}, input them to the model, and update the model parameters with the attribute security level corresponding to tag_{i,j}. Train the model on the training set and select the best model on the test set.
3) The model prediction process of step S26 is:
(1) For the column data to be predicted, obtain through steps S9 and S10 a set of instances I_pre characterizing the column.
(2) Using steps S12, S13, S14, S18 and S19, convert all instances I_pre into the normalized instance statistics set S'_pre and the instance vectorization set E_pre.
(3) Feed (E_pre, S'_pre) into the S24 data classification deep neural network model to obtain the prediction result C_pre; count the classes in C_pre and select the attribute corresponding to the most frequent predicted class as the final attribute judgment.
(4) Feed (E_pre, S'_pre) into the S25 data grading multi-layer perceptron regression model to obtain the prediction result Level_pre; round all values in the predicted grading result to obtain Level'_pre. Count the values in Level'_pre and select the most frequent prediction as the final security level judgment.
The key innovative technical points of the application include:
1) Prior knowledge is used to build the domain data attribute-name keywords and the regularization method; data attribute columns are labeled semi-automatically, reducing dependence on manual labeling and guiding both the classification and grading of data and the labeling of training data.
2) A conversion model from attribute-column data to instances, and from instances to vectors and basic features, provides a method for attribute-column data conversion. The conversion model comprises attribute-column data windowing, windowed-instance vectorization, and windowed-instance statistical feature extraction. The statistical feature extraction uses scene expert knowledge, so the extracted features fit the scene and characterize the data better.
3) Hand-crafted features and automated deep neural network feature extraction are combined to fuse scene knowledge with the latent information of the data.
4) A data classification model and a data grading model are constructed, making effective use of the important information features of the data to complete data classification and grading.
Prior literature useful for understanding the technology of the present invention
[1] COVINGTON P, ADAMS J, SARGIN E. Deep neural networks for YouTube recommendations. ACM Conference on Recommender Systems. New York, NY, USA: Association for Computing Machinery, 2016: 191-198.
[2] LAROCHELLE H, BENGIO Y, LOURADOUR J, et al. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 2009, 10(1): 1-40.
[3] MONTAVON G, SAMEK W, MÜLLER K-R. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2018, 73: 1-15.
[4] MONTUFAR G F, PASCANU R, CHO K, et al. On the number of linear regions of deep neural networks. International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2014: 2924-2932.
[5] MOOSAVI-DEZFOOLI S M, FAWZI A, FROSSARD P. DeepFool: a simple and accurate method to fool deep neural networks. IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE Press, 2016: 2574-2582.
[6] SZE V, CHEN Y-H, YANG T-J, et al. Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE, 2017, 105(12): 2295-2329.
[7] SZEGEDY C, TOSHEV A, ERHAN D. Deep neural networks for object detection. International Conference on Neural Information Processing Systems. Red Hook, NY, United States: Curran Associates Inc., 2013: 2553-2561.
[8] YOSINSKI J, CLUNE J, BENGIO Y, et al. How transferable are features in deep neural networks? Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2014: 3320-3328.
[9] High-level epitaxy, Zhao Zhangjie, Lin Yeli, et al. Research on data classification and grading methods based on data security. Information Security Research, 2021, 7(10): 933-940.
[10] Song Shaohong, Chen Zhang. A data classification method based on financial data security. China, CN202111539492, application date 2021.12.15.
[11] Peng Changgen, Wang Maoni, et al. Sensitive attribute identification and grading algorithm for structured datasets. Computer Application Research, 2020, 37(10): 3077-3082.
[12] Lu Hongtai. A classification and grading method for urban data based on a deep-learning clustering algorithm. Industrial Technology, 2021, 8(4): 73-78.
[13] Wu Mingguang, Guo Huiru, Liu Qiong, et al. A metadata classification and grading method based on a machine learning algorithm. China, CN202210300625, application date 2022.03.25.
[14] ZHANG Q, ZHANG C, NI J, et al. Data sensitivity measurement and classification model of power IoT based on information entropy and BP neural network. International Conference on Advanced Algorithms and Control Engineering. IOP Publishing Ltd., 2021.
[15] Yu Yihan, Wu Xiaoping. A private-data metric and grading model based on Shannon information entropy and BP neural network. Communications Theory, 2018, 39(12): 10-17.
[16] Jin Huasong, He Ying, Lai Xiaoyou, et al. A data grading and classification system and method based on data security and privacy protection. China, CN202110923721, application date 2021.8.12.
Abbreviation and key term definitions
ETL process: the whole process by which data enter the data warehouse through extraction, transformation and loading.
Deep neural network: a neural network having two or more hidden layers is generally called a deep neural network.
Stream data: data generated continuously by thousands of data sources, typically also sent simultaneously in the form of small data records.
Regular expression: a logical formula for operating on strings, composed of ordinary characters (e.g. the letters a to z) and special characters (called "metacharacters").
Natural language processing: machine learning is used to parse the structure and meaning of text. With natural language processing applications, organizations may analyze text and extract information about people, places, and events to better understand the emotion of social media content and customer conversations.
Training set: refers to a sample set for training, which is mainly used for training parameters in a neural network.
Verification set: a sample set for verifying model performance.
Test set: a sample set for testing the performance of the model.
Feature extraction: starting from an initial set of measured data, derived values, called features, are constructed that are informative and non-redundant.
Database: an organized collection of structured information or data (typically stored in electronic form in a computer system) is typically controlled by a database management system.
A classifier: the general term of the method for classifying the samples in the data mining comprises algorithms such as decision trees, logistic regression, naive Bayes, neural networks and the like.
Dense: the operation implemented by the common fully connected layer is output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function, kernel is the layer's weight matrix, and bias is its bias vector.
Activation function: the function running on the neurons of the artificial neural network is responsible for mapping the inputs of the neurons to the outputs.
Loss function: a function that maps the value of a random event or of its related random variable to a non-negative real number representing the "risk" or "loss" of that event.
Optimizer: during deep-learning back propagation, guides each parameter of the loss function (objective function) to update by an appropriate amount in the correct direction, so that each updated parameter moves the value of the loss function (objective function) ever closer to the global minimum.
Temperature coefficient: applied to the softmax activation function, it adjusts attention to difficult samples; the smaller the temperature coefficient, the more the model focuses on separating a sample from the other samples most similar to it.
Cross entropy: an important concept in Shannon information theory is mainly used for measuring the difference information between two probability distributions.
MSE: the mean square error, a measure of the degree of difference between an estimator and the estimated quantity. Let t be an estimate of the population parameter θ determined from a subsample; the expected value of (θ - t)² is called the mean square error of the estimator t. It equals σ² + b², where σ² and b are the variance and bias of t, respectively.

Claims (9)

1. An intelligent classification and grading system for structured data based on a deep neural network model, characterized by comprising the modules:
I, a structured data processing module;
II, a data labeling module;
III, an attribute-column data windowing module;
IV, a self-encoding feature extractor construction and training module;
V, a windowed attribute data feature transformation module;
VI, a data classification neural network construction and training module;
VII, a data grading multi-layer perceptron regression model construction and training module;
and VIII, a data classification and grading prediction module.
2. The intelligent classification and grading system of claim 1, wherein the structured data processing module comprises steps S0 and S1, wherein,
step S0: form the structured data; using the constructed domain raw dataset, integrate the scattered stored data into structured data by means of ETL, stream data processing and manual entry, completing the extraction and loading of the data;
step S1: in processing the structured dataset, check the integrity and normalization of the structured data and handle abnormal row data.
3. The intelligent classification and grading system of claim 1, wherein the data labeling module comprises steps S2, S3, S4, S5, S6, S7 and S8, wherein,
step S2: extract the attribute names of the attribute-column data in the dataset;
step S3: match the extracted attribute names with the data attribute identification rules; if a match succeeds, label the attribute column and go to step S8;
step S4: if no rule matches, label the attribute column using the attribute keyword lexicon constructed in step S5;
step S5: search the keyword lexicon for the extracted attribute name; if the search succeeds, label the attribute column and go to step S8;
step S6: if the search fails, label the attribute column manually in step S7; once labeled, go to step S8 to form the labeled attribute dataset;
in the data labeling module, the labeled attribute dataset of step S8 is formed through steps S3, S4, S5, S6 and S7.
4. The intelligent classification and grading system of claim 1, wherein the attribute-column data windowing module comprises steps S9, S10 and S11, wherein,
step S9: set the window, i.e. set the window size and the window step;
step S10: extract attribute-column data, sliding by the window step and taking the window size as the extraction length;
step S11: combine with the column labels to obtain the labeled instance set.
5. The intelligent classification and grading system of claim 1, wherein the self-encoding feature extractor construction and training module comprises steps S14, S15, S16 and S18, wherein,
step S14: normalize the instances from step S10 by attribute column;
step S15: construct the self-encoding feature extractor;
step S16: train the self-encoding feature extractor built in step S15 on the normalized instances from step S14;
step S18: obtain the pre-trained self-encoding feature extractor model;
the self-encoding feature extractor model built in step S15 is specified as follows:
the encoder: let $h_0 = I'$; then
$$h_i = \sigma_i^e\!\left(W_i^e h_{i-1} + b_i^e\right), \quad 1 \le i \le L_e \qquad \langle 1\rangle$$
and the encoder output is $E = h_{L_e}$;
the decoder: let $o_0 = h_{L_e}$; then
$$o_i = \sigma_i^d\!\left(W_i^d o_{i-1} + b_i^d\right), \quad 1 \le i \le L_d \qquad \langle 2\rangle$$
let $\hat{I} = o_{L_d}$; the loss function is the reconstruction error plus a penalty term,
$$\mathcal{L} = \mathrm{MSE}\!\left(I', \hat{I}\right) + \zeta\,\Omega(W) \qquad \langle 3\rangle$$
in the encoder, $I'$ denotes a normalized instance; among the network parameters, $W_i^e$ are the encoding weights, $b_i^e$ the encoder biases, $\sigma_i^e$ the encoder activation functions, and $L_e$ the number of encoder layers; in the decoder, $W_i^d$ are the decoding weights, $b_i^d$ the decoder biases, $\sigma_i^d$ the decoder activation functions, and $L_d$ the number of decoder layers; $\zeta$ is the penalty coefficient weighting the regularization term $\Omega(W)$.
6. The intelligent classification and grading system of claim 1, wherein the windowed attribute data feature transformation module comprises steps S12 and S13, and steps S17, S18 and S19, wherein,
step S12: instance statistical feature extraction; expert knowledge is used to design basic statistics and their transformations to characterize each instance;
step S13: normalize the instance statistical features of the attribute-column data;
step S17: before instance vectorization, normalize the instances and supply them to S18;
step S18: apply the pre-trained self-encoding feature extractor model and proceed to S19;
step S19: encode the labeled, normalized instances with the pre-trained self-encoding feature extractor;
in this module, steps S12, S13, S17 and S19 convert the same instance into a statistical-feature representation and a self-encoder vectorized representation.
7. The intelligent classification and grading system of claim 1, wherein the data classification neural network construction and training module comprises steps S20, S21 and S24, wherein,
step S20: construct the data classification deep neural network model and pass it to step S21 for training;
step S21: feed the instance-normalized statistical features from step S13 into the statistical feature learning layer and the labeled instance vectors from step S19 into the coded-instance learning layer; fuse the two vectors learned by these layers and feed the result into the fusion feature learning layer; with cross entropy as the loss function, adjust the model parameters using the predicted and actual instance attribute categories output by the fusion feature learning layer, completing the training of the data classification neural network model;
step S24: obtain the trained data classification deep neural network model;
the S20 data classification deep neural network model is specified as follows:
1) the statistical feature learning layer learns a function (denote it $\psi$)
$$\psi : \mathbb{R}^m \to \mathbb{R}^k,$$
which learns the instance statistics and converts the m-dimensional instance statistics into a k-dimensional vector;
2) the coded-instance learning layer learns a function $\varphi$ that maps the instance encoding to a k-dimensional vector; here the model adjusts the influence of the encoding vector on the model;
3) the fusion feature learning layer learns a classification function $\kappa$ on the fused vector of the instance feature vector and the statistical feature vector, realizing the learning and classification of the two converted vectors and completing the classification judgment of the data;
4) the model optimization problem is the temperature-scaled cross entropy
$$\min_{\theta_1,\theta_2,\theta_3}\; -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} y'_{n,m}\,\log\frac{\exp\!\left(y_{n,m}/T\right)}{\sum_{j=1}^{M}\exp\!\left(y_{n,j}/T\right)}$$
where $y$ is the classifier's predicted class set, $y'$ the original class set, $\theta_1, \theta_2, \theta_3$ the parameters of the corresponding sub-models, $T$ the temperature coefficient, $M$ the number of labels, and $N$ the number of sample instances.
8. The intelligent classification and grading system of claim 1, wherein the data grading multi-layer perceptron regression model construction module comprises steps S22, S23 and S25, wherein,
step S22: construct the data grading multi-layer perceptron regression model and pass it to step S23 for training;
step S23: splice the instance-normalized statistical features from step S13 with the labeled instance vectors from step S19 to represent the instance conversion features, and input them into the model; with squared error as the loss function, train the regression model built in step S22 using the predicted and actual attribute-instance security levels, then proceed to step S25;
step S25: obtain the trained data grading multi-layer perceptron regression model;
each layer of the S22 data grading multi-layer perceptron regression model is defined as
$$O_i^{(l)} = f\!\left(\sum_{j=1}^{n^{(l-1)}} w_{j,i}^{(l)}\, O_j^{(l-1)} + w_{0,i}^{(l)}\right)$$
where $f$ is the activation function, $L$ the number of layers of the neural network, $n^{(l)}$ the number of neurons in layer $l$, $O_i^{(l)}$ the output of the $i$-th neuron of layer $l$, and $w_{j,i}^{(l)}$, $w_{0,i}^{(l)}$ the weight and bias parameters of the corresponding neuron.
9. The intelligent classification and grading system of claim 1, wherein the data classification and grading prediction module comprises step S26, wherein,
step S26: take the attribute category output most often by the S24 data classification neural network model as the final attribute category; for the attribute security level, take the level output most often by the S25 data grading multi-layer perceptron regression model as the final attribute security level; output the result and end the program;
feed the normalized instance statistics from step S13 and the vectorized instances produced by the S18 self-encoding feature extractor into the S24 data classification neural network model to judge the instances' attribute category, and into the S25 data grading multi-layer perceptron regression model to judge the instances' attribute security level;
for the attribute category, the category output most often by the S24 model is the final judgment;
for the security level, the level output most often by the S25 model is the final judgment.
CN202310215953.9A 2023-03-08 2023-03-08 Structured data intelligent classification grading system of deep neural network model Pending CN116257759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310215953.9A CN116257759A (en) 2023-03-08 2023-03-08 Structured data intelligent classification grading system of deep neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310215953.9A CN116257759A (en) 2023-03-08 2023-03-08 Structured data intelligent classification grading system of deep neural network model

Publications (1)

Publication Number Publication Date
CN116257759A true CN116257759A (en) 2023-06-13

Family

ID=86687723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310215953.9A Pending CN116257759A (en) 2023-03-08 2023-03-08 Structured data intelligent classification grading system of deep neural network model

Country Status (1)

Country Link
CN (1) CN116257759A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117195253A (en) * 2023-08-24 2023-12-08 南京证券股份有限公司 Personal information security protection method and system
CN117539948A (en) * 2024-01-10 2024-02-09 西安羚控电子科技有限公司 Service data retrieval method and device based on deep neural network
CN117539948B (en) * 2024-01-10 2024-04-05 西安羚控电子科技有限公司 Service data retrieval method and device based on deep neural network
CN117633605A (en) * 2024-01-25 2024-03-01 浙江鹏信信息科技股份有限公司 Data security classification capability maturity assessment method, system and readable medium
CN117633605B (en) * 2024-01-25 2024-04-12 浙江鹏信信息科技股份有限公司 Data security classification capability maturity assessment method, system and readable medium

Similar Documents

Publication Publication Date Title
CN110597735B (en) Software defect prediction method for open-source software defect feature deep learning
CN110889556B (en) Enterprise operation risk characteristic data information extraction method and extraction system
CN109918671A (en) Electronic health record entity relation extraction method based on convolution loop neural network
CN113779272B (en) Knowledge graph-based data processing method, device, equipment and storage medium
CN116257759A (en) Structured data intelligent classification grading system of deep neural network model
CN110659742B (en) Method and device for acquiring sequence representation vector of user behavior sequence
CN111274817A (en) Intelligent software cost measurement method based on natural language processing technology
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN116610816A (en) Personnel portrait knowledge graph analysis method and system based on graph convolution neural network
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN114331122A (en) Key person risk level assessment method and related equipment
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
Xu et al. Data-driven causal knowledge graph construction for root cause analysis in quality problem solving
Ciaburro et al. Python Machine Learning Cookbook: Over 100 recipes to progress from smart data analytics to deep learning using real-world datasets
CN116610818A (en) Construction method and system of power transmission and transformation project knowledge base
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN113920379B (en) Zero sample image classification method based on knowledge assistance
CN114662652A (en) Expert recommendation method based on multi-mode information learning
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism
Sarang Thinking Data Science: A Data Science Practitioner’s Guide
CN117151222A (en) Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium
CN117033626A (en) Text auditing method, device, equipment and storage medium
CN115391548A (en) Retrieval knowledge graph library generation method based on combination of scene graph and concept network
CN113326371A (en) Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination