CN111931229B - Data identification method, device and storage medium - Google Patents


Info

Publication number
CN111931229B
CN111931229B (application CN202010664475.6A; also published as CN111931229A)
Authority
CN
China
Prior art keywords
column
determining
feature vector
feature
correlation number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010664475.6A
Other languages
Chinese (zh)
Other versions
CN111931229A (en)
Inventor
李可
张盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202010664475.6A
Publication of CN111931229A
Application granted
Publication of CN111931229B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/174 Form filling; Merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/186 Templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data identification method, apparatus, and storage medium. The method includes: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy, where the first table includes at least one column, the first column feature vector set includes a feature vector for each of the at least one column, and each feature vector includes character features of the corresponding column; identifying the first column feature vector set using a preset recognition model to obtain a first analysis result vector, where the recognition model includes at least one classifier model, each classifier model is used to identify a corresponding class of table, and the first analysis result vector includes the analysis result of the corresponding classifier model for each of the at least one column; and determining the similarity between the first table and each class of table in at least one class of tables according to the first analysis result vector, and determining an identification result according to the determined similarity. The identification result characterizes the table category corresponding to the first table.

Description

Data identification method, device and storage medium
Technical Field
The present invention relates to data identification technology, and in particular, to a data identification method, apparatus, and computer readable storage medium.
Background
A table can serve as a means of collecting and organizing data and can contain various kinds of data; sensitive data analysis techniques include sensitive table recognition. In the related art, identification of table data mainly relies on a keyword content matching scheme: the user must input the table files to be protected in advance, digest and keyword matching techniques are used to record specific content in the tables, and a table is then analyzed for hits on that same content. This method has high recognition accuracy, but poor detection capability when the content is changed.
Disclosure of Invention
In view of the foregoing, a primary object of the present invention is to provide a data identification method, apparatus and computer readable storage medium.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a data identification method, which comprises the following steps:
acquiring a first table;
determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
identifying the first column feature vector set by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class of table respectively; the first analysis result vector comprises the analysis result of the corresponding classifier model for each column of the at least one column;
determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the table category corresponding to the first table.
In the above scheme, the method further comprises: training at least one classifier model; training the classifier model includes:
acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
In the above solution, the performing similar column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set includes:
determining at least one column of corresponding feature vector according to the sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In the above scheme, the identifying the first column of feature vector set by using a preset identification model to obtain a first analysis result vector includes:
performing similar column combination on each column in the first column of feature vector set to obtain a second column of feature vector set;
identifying the second column of feature vector sets to obtain a first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector includes:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in each table; the fourth column correlation number characterizes the number of similar columns in the first table whose classification result is the corresponding column class;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In the above solution, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
In the above solution, the determining the identification result according to the determined similarity includes:
determining the similarity between the first table and each of the at least one type of table;
determining the categories of tables whose similarity with the first table exceeds a preset similarity threshold;
and ranking the determined categories of tables whose similarity exceeds the preset similarity threshold, and obtaining a recognition result based on the ranking result.
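As a minimal sketch of this thresholding-and-ranking step (the threshold value and the category names are illustrative assumptions, not taken from the patent):

```python
def decide(similarities, threshold=0.6):
    """similarities: mapping of table category -> similarity with the first table.
    Keep the categories whose similarity exceeds the threshold, rank them in
    descending order, and return the top category (or None if nothing passes)."""
    passing = {cat: s for cat, s in similarities.items() if s > threshold}
    ranked = sorted(passing.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0] if ranked else None

# Hypothetical similarities for one first table against three table classes.
result = decide({"financial": 0.82, "personnel": 0.65, "inventory": 0.30})
```

Here `decide` returns `"financial"`, the highest-ranked category above the threshold, which would serve as the recognition result.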
The embodiment of the invention provides a data identification device, which comprises: an acquisition unit, a processing unit and an identification unit; wherein,
the acquisition unit is used for acquiring a first table;
the processing unit is used for determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
the identification unit is used for identifying the first column of feature vector set by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column;
determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the table category corresponding to the first table.
In the above scheme, the device further includes: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically used for acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
In the above scheme, the preprocessing unit is configured to determine at least one column of corresponding feature vector according to a sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In the above scheme, the identifying unit is configured to perform similar column merging on each column in the first column feature vector set to obtain a second column feature vector set; identifying the second column of feature vector sets to obtain a first analysis result vector;
the identification unit is further used for determining a first column correlation number, a second column correlation number and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column categories in each table; the fourth column correlation number characterizes the number of similar columns in the first table, wherein the classification result of the similar columns is the corresponding column class;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In the above solution, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
In the above solution, the identifying unit is specifically configured to determine a similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
The embodiment of the invention provides a data identification device, which comprises: a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor is configured to execute the steps of any of the data identification methods described above when the computer program is run.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data identification method of any of the above.
The embodiments of the invention provide a data identification method, a data identification device, and a computer readable storage medium. The method includes: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy, where the first table includes at least one column, the first column feature vector set includes a feature vector for each of the at least one column, and each feature vector comprises character features of the corresponding column; identifying the first column feature vector set using a preset recognition model to obtain a first analysis result vector, where the recognition model includes at least one classifier model, each classifier model is used to identify a corresponding class of table, and the first analysis result vector includes the analysis result of the corresponding classifier model for each of the at least one column; and determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining the identification result according to the determined similarity, where the identification result represents the table category corresponding to the first table. In this way, recognition is based on the character features of each column in the table, so the method has good recognition capability, and good generalization and robustness for sensitive data scenarios of the same type but with different key information.
Drawings
Fig. 1 is a schematic flow chart of a data identification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training method of a classifier model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a column-wise analysis feature according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a similar column merging method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating another data identification method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a data identification method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a data identification device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of another data identification apparatus according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the embodiments of the present application, the technical solutions of the embodiments of the present application will be clearly described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments.
The terms "first", "second", "third" and the like in the description, in the claims, and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a series of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such a process, method, article, or apparatus.
The following describes the related art concerning data identification methods.
As outlined above, the keyword content matching scheme acquires the table files to be protected, records specific content in the tables using digest and keyword matching techniques, and, in the matching stage, analyzes whether a table hits the same content. This method has poor detection capability when the content is changed, and sensitive data of the same type but with different content is difficult to detect: for example, if the sensitive content is specified as "Zhang San, abc@hotmail.com", then "Li Si, efg@hotmail.com" cannot be identified as sensitive data.
Based on this, in various embodiments of the present invention, a first table is acquired; determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns; identifying the first column of feature vector sets by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column; determining the similarity between the first table and various tables in at least one type of tables according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the form category corresponding to the first form.
The present invention will be described in further detail with reference to examples.
Fig. 1 is a schematic flow chart of a data identification method according to an embodiment of the present invention; as shown in fig. 1, the data identification method is applied to a server, and the method includes:
step 101, acquiring a first table;
step 102, determining a first column feature vector set of the first table according to a preset feature acquisition strategy;
wherein the first table includes at least one column;
the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
step 103, identifying the first column of feature vector set by using a preset identification model to obtain a first analysis result vector;
wherein the recognition model comprises at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively;
the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column;
step 104, determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity;
The identification result characterizes a table category corresponding to the first table.
Here, the data recognition method is applicable to the recognition of structured data; structured data refers to data logically expressed and implemented by a two-dimensional table structure, typically represented by document tables and database tables. That is, the first table may be an office document table, a database table, etc.
In some embodiments, the recognition model includes at least one classifier model;
the classifier model is a classifier for classifying the table; different classifier models (i.e., different classifiers) are used to identify different types of tables; the method may pre-train different classifier models to identify different types of tables.
Here, the type of the table may be set based on the needs of the user. For example, if a user needs to identify financial tables, a classifier model for the corresponding financial tables may be trained. The financial table may follow a certain template (or several templates with similarity between them); that is, the user may specifically set the template of the table, including which specific columns the table may include and the category of each of those columns.
The method further comprises the steps of: training at least one classifier model;
training a classifier model for each classifier model, comprising:
acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
The classifier model (also referred to as a classifier) is a generic term of a method for classifying samples in data mining, and a classifier model is constructed by using a statistical method or a classification algorithm on the basis of existing data.
In the embodiment of the invention, corresponding classifier models are trained for different types of tables, so that when the classifier models are used, at least one classifier model obtained through training can be used for identifying different tables.
The step of performing similar column merging according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set includes:
Determining at least one column of corresponding feature vector according to the sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
Specifically, similar column merging refers to merging columns whose content is the same or highly similar, for the case where different columns in a table hold the same or highly similar content. For example, a table may contain both a "start time" column and an "end time" column; the contents of the two are often highly similar, so the two columns are merged.
Here, the clustering the at least one column of corresponding feature vectors to obtain at least one cluster includes:
determining a feature vector corresponding to each column in the at least one column;
clustering analysis is carried out using a clustering algorithm based on a similarity threshold or a clustering algorithm based on a density threshold, to obtain at least one cluster; the columns in each cluster are given the same class label, that is, columns grouped into the same cluster are treated as the same class in the subsequent classifier training stage (here, a label refers to the label of a column; for example, the start time and end time columns described above may fall into one cluster, both belonging to a time class).
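The similarity-threshold variant can be sketched as a greedy clustering over column feature vectors; the cosine measure, the 0.9 threshold, and the toy two-dimensional vectors below are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def threshold_cluster(columns, sim_threshold=0.9):
    """Greedy similarity-threshold clustering of (column_name, feature_vector)
    pairs: a column joins the first cluster whose centroid it matches above
    the threshold, otherwise it starts a new cluster."""
    clusters = []  # each: {"columns": [...names...], "centroid": [...]}
    for name, vec in columns:
        for c in clusters:
            if cosine(vec, c["centroid"]) >= sim_threshold:
                c["columns"].append(name)
                n = len(c["columns"])
                # incremental centroid update
                c["centroid"] = [(cc * (n - 1) + x) / n
                                 for cc, x in zip(c["centroid"], vec)]
                break
        else:
            clusters.append({"columns": [name], "centroid": list(vec)})
    return clusters

cols = [("start time", [1.0, 0.0]),
        ("end time", [0.99, 0.05]),
        ("email", [0.0, 1.0])]
clusters = threshold_cluster(cols)
```

In this toy run the "start time" and "end time" columns fall into one cluster (the time class), while "email" forms its own cluster, mirroring the similar column merging described above.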
In some embodiments, the identifying the first set of feature vectors to obtain a first analysis result vector using a preset identification model includes:
performing similar column combination on each column in the first column of feature vector set to obtain a second column of feature vector set;
identifying the second column of feature vector sets to obtain a first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector includes:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column categories in each table; the fourth column correlation number characterizes the number of similar columns in the first table, wherein the classification result of the similar columns is the corresponding column class;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
Specifically, suppose a first table is identified using a certain classifier model: the first table originally has A1 columns, which yield A2 column categories after similar column merging; the class table for the classifier model originally has B1 columns, which yield B2 column categories after similar column merging.
The first column correlation number is then A1, and the second column correlation number is B1; the third column correlation number, denoted C1, is the number of common columns between the A1 columns and the B1 columns (common columns are columns of the same class, such as columns that are both time-class columns, e.g. the start time and end time columns described above).
Further, suppose A1 is 10, the first table contains 5 time-class columns and 5 columns of other classes, and merging similar columns yields 6 column categories; and suppose the class table corresponding to the classifier model contains 3 time-class columns.
Then, for the time-class columns (one classification result), the fourth column correlation number is 5, and the fifth column correlation number is the number of columns included in the cluster corresponding to the time-class columns, i.e. 3.
For the columns of each other class (the other classification results), the fourth column correlation number is the number of columns with that classification result, and the fifth column correlation number is the number of columns included in the cluster corresponding to that classification result.
Data statistics are carried out over the various columns of the first table in this way, and the similarity is calculated.
The similarity can be calculated using the following formula:

θ(b, a) = SC(b, a) / max(k_b, k_a), where SC(b, a) = Σ_{h=1}^{g} min(N_b^h, N_a^h)

wherein a corresponds to a classifier model, namely a certain class of table; b represents the first table; k_b characterizes the first column correlation number and k_a characterizes the second column correlation number; SC(b, a) characterizes the third column correlation number; N_b^h characterizes the fourth column correlation number for column category h; N_a^h characterizes the fifth column correlation number for column category h; min(N_b^h, N_a^h) takes the minimum of the two; g is the total number of column categories after similar columns are merged.
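As an illustrative sketch (not the patented implementation itself), the five column correlation numbers can be combined in code; the function name, the dict-based inputs, and the use of max(k_b, k_a) as the normalizer are assumptions consistent with the variable definitions above:

```python
def table_similarity(k_b, k_a, cols_b, cluster_cols_a):
    """Sketch of the similarity between a table b and a table class a.

    k_b / k_a: first / second column correlation numbers (original column counts).
    cols_b[h]: fourth column correlation number -- columns of b whose
               classification result is column category h.
    cluster_cols_a[h]: fifth column correlation number -- columns in the
               cluster of class a corresponding to category h.
    """
    categories = set(cols_b) | set(cluster_cols_a)
    # third column correlation number: per-category minimum, accumulated
    sc = sum(min(cols_b.get(h, 0), cluster_cols_a.get(h, 0)) for h in categories)
    return sc / max(k_b, k_a)
```

With the example above (A1 = 10, five time-category columns in the first table, three in the model's time cluster), the time category contributes min(5, 3) = 3 common columns.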
In some embodiments, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
The preset feature acquisition strategy can be adopted both for the first table during identification and for the sample tables during training.
The server may use an open source tool or a tool library for extracting data to read the content of the corresponding table, determine the number of rows and columns of the corresponding table, determine the content values of different rows and columns, and determine the feature vector of at least one column according to the content values of different rows and columns.
The feature vector includes: character features of the columns; the character features specifically refer to statistical feature values of character levels in each column.
For example, the feature vector includes character-level statistical feature values of at least one of:
average length (i.e., the average of the lengths of the content values in a given column of the table), median length (the median of the content-value lengths), maximum length (the maximum of the content-value lengths), minimum length (the minimum of the content-value lengths), length variance (the variance of the content-value lengths), average Chinese character ratio (the average proportion of Chinese characters in each content value), average uppercase English character ratio (the average proportion of uppercase English characters in each content value), average lowercase English character ratio (the average proportion of lowercase English characters in each content value), average digit ratio (the average proportion of digits in each content value), average other-special-character ratio (the average proportion of other special characters in each content value), and the like. Other special characters are characters other than English characters, Chinese characters and digits; for example, special symbols.
That is, the statistical feature value at the character level described above can be determined from the content values of the respective columns.
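A minimal sketch of such character-level statistics for a single column (the function name, feature ordering, and the use of `isalnum` to approximate "other special characters" are illustrative assumptions, not from the patent):

```python
import statistics

def column_features(values):
    """Character-level statistical feature vector for one column's content values."""
    lengths = [len(v) for v in values]

    def ratio(pred):
        # average per-value proportion of characters matching pred
        return statistics.mean(
            (sum(pred(c) for c in v) / len(v)) if v else 0.0 for v in values)

    return [
        statistics.mean(lengths),       # average length
        statistics.median(lengths),     # median length
        max(lengths),                   # maximum length
        min(lengths),                   # minimum length
        statistics.pvariance(lengths),  # length variance
        ratio(lambda c: '\u4e00' <= c <= '\u9fff'),  # Chinese character ratio
        ratio(str.isupper),             # uppercase English character ratio
        ratio(str.islower),             # lowercase English character ratio
        ratio(str.isdigit),             # digit ratio
        # "other special characters": not Chinese and not alphanumeric
        ratio(lambda c: not ('\u4e00' <= c <= '\u9fff' or c.isalnum())),
    ]
```

One such vector per column, stacked over all columns, gives the table's feature matrix.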
In some embodiments, the determining the identification result according to the determined similarity includes:
determining the similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
Here, the recognition result obtained based on the sorting result may be the category of the table with the highest similarity as determined by the sorting, taken as the category to which the first table belongs.
Specifically, in some embodiments, the method may further comprise: setting a similarity threshold for each classifier model;
the similarity threshold may be set by the developer based on his experience and user requirements; the classifier model can also be detected through the positive and negative sample set (namely, positive samples and negative samples are identified by the classifier model to obtain corresponding test identification results), and the similarity threshold value is adjusted based on the detection results to obtain the similarity threshold value for each classifier model.
Here, the same or different similarity thresholds may be set for different classifier models. When the same similarity threshold is adopted, the category of the table with the highest similarity is determined as the category to which the first table belongs; when different similarity thresholds are adopted, the category whose similarity exceeds its corresponding threshold by the largest margin may be determined as the category to which the first table belongs.
FIG. 2 is a flow chart of a training method of a classifier model according to an embodiment of the present invention; as shown in fig. 2, the training method of the classifier model includes:
step 201, obtaining a sample table set;
here, the sample table set includes: at least one sample table;
the table category of the sample table set is marked as category l; category l may be a certain category required by the user, for example tables of a certain template set by the user, the template including: start time, end time, event condition, etc.
Step 202, extracting the content of a sample table;
for each sample table, an open source tool or a tool library for data extraction can be used to read the content of the corresponding sample table, determine the number of rows and columns of the sample table, and then determine the content values of the different rows and columns.
Step 203, analyzing the table features according to the columns to obtain a feature matrix of the table;
here, for each sample table, the feature vectors of the respective columns in the sample table are analyzed by column in units of columns of the sample table; and obtaining the feature matrix of the sample table according to the feature vectors of each column. The feature vector comprises character features of each column; the character features specifically refer to statistical feature values of character levels in each column.
In some embodiments, in conjunction with the illustration of fig. 3, the analyzing the table features by columns, results in a feature matrix of the table, including:
step 2031, randomly reading the content values of column j over n row records;
here, for column j of the currently read table x, n rows are selected from the whole table;
the selection policy may be any of: the first n rows of the sample table, the last n rows, or n completely randomly selected rows; the content values of column j in these n row records, i.e., the content values of n rows of the j-th column of the sample table, are then read;
here, n may be set by a developer based on user requirements; n is greater than or equal to 1;
step 2032, performing character feature statistics on the j-column content values in the n-row records;
here, character feature statistics are performed on the content values of column j over the n row records; the character-level statistical feature values include, but are not limited to: the average, median, maximum, minimum and variance of the content-value lengths, as well as the average proportions of Chinese characters, uppercase English characters, lowercase English characters, digits and other special characters in each content value. Other special characters are characters other than English characters, Chinese characters and digits; for example, special symbols.
Step 2033, generating the feature vector of column j;
the feature vector of column j of sample table x is obtained through the above dimension statistics and is denoted v_xj; that is, the feature vector of column j is obtained from the character-level statistical feature values of column j; the obtained feature vector includes the statistical feature value of each character-level dimension;
step 2034, generating the feature matrix of the sample table;
specifically, all columns of sample table x are counted according to steps 2031 to 2033 to obtain the feature vector v_xj of each column of table x, and finally the feature matrix V_x of sample table x is obtained from the per-column feature vectors: V_x = [v_x1, v_x2, …, v_xk], where k represents the original number of columns of sample table x.
Here, the server may determine a sample column feature vector set corresponding to each sample table through steps 202 and 203.
Step 204, merging similar columns;
here, consider that for each column of the table, there are cases where the contents of different columns are the same or highly similar, for example: some class tables have two columns of "start time" and "end time" at the same time, the contents of the two columns are always identical, and when the classifier model is trained by using the columns as the classes, the two columns should belong to the same column, so that the columns are subjected to merging pretreatment.
Specifically, the similar column merge includes: and carrying out similar column combination according to the sample column feature vector set corresponding to each sample table.
In some embodiments, as shown in connection with fig. 4, the similar column merge includes:
step 2041, extracting the feature vector of each column;
if the user provides only one sample table file (denoted x) in category l, the set of per-column feature vectors, i.e., the feature matrix, is V_x = [v_x1, v_x2, v_x3, …, v_xj, …, v_xk]; v_lj is the feature vector of each column in category l (v_lj = v_xj);
if there are multiple tables of the same structure, the tables in these table files are considered to belong to the same category, and the set of per-column feature vectors, i.e., the feature matrix, is V_l = [v_l1, v_l2, …, v_lj, …], where l represents the category number and i represents the sample table number within the category; the feature vector v_lj of each column is the average of the statistical feature values of the corresponding columns of the tables, i.e., v_lj = (1/m) Σ_{i=1}^{m} v_ij, where m represents the number of tables.
Step 2042, clustering with a clustering algorithm based on a similarity threshold / density threshold;
here, the set of per-column feature vectors of category l, V_l = [v_l1, v_l2, v_l3, …, v_lj, …], is taken as the clustering input, and cluster analysis is performed with a similarity-threshold-based clustering algorithm (such as agglomerative hierarchical clustering, e.g., AgglomerativeClustering) or a density-threshold-based clustering algorithm (such as DBSCAN, Density-Based Spatial Clustering of Applications with Noise). The advantage of such algorithms is that the number of clusters need not be specified in advance; clusters are divided automatically according to the metric threshold. The clustering yields the result C_l = [c_1, c_2, …, c_g]; g represents the total number of clusters, and each cluster may include the feature vectors of at least one column.
Step 2043, marking the columns in the same cluster with the same class label;
the original columns merged into the same cluster share the same class label in the subsequent classifier-model training stage. Here g ≤ k, meaning that the number of column categories after cluster merging is less than or equal to the original number of columns; g represents the number of column categories after cluster merging.
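The threshold-based merging can be sketched without a full clustering library; the greedy single-pass centroid assignment below is a simplified stand-in for AgglomerativeClustering / DBSCAN (the function name and default threshold are illustrative assumptions):

```python
import math

def merge_similar_columns(vectors, threshold=0.5):
    """Greedy threshold-based clustering sketch: a column's feature vector
    joins the first cluster whose centroid lies within `threshold`,
    otherwise it starts a new cluster; columns sharing a cluster share a label."""
    centroids, labels = [], []
    for v in vectors:
        best, best_d = None, threshold
        for idx, c in enumerate(centroids):
            d = math.dist(v, c)
            if d < best_d:
                best, best_d = idx, d
        if best is None:
            centroids.append(list(v))          # open a new cluster
            labels.append(len(centroids) - 1)
        else:
            labels.append(best)                # merge into existing cluster
    return labels
```

As with the algorithms named above, the number of clusters is not specified in advance; it follows from the metric threshold.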
Step 205, generating a training data set.
The feature matrices V_i = [v_i1, v_i2, …, v_ik] of all sample tables of category l are aggregated to form a feature vector set, denoted S_l = [V_1, V_2, …, V_m], where i represents the sample table number, k represents the original number of table columns, and m represents the number of tables in the category;
step 206, training a classifier model;
here, step 206 includes: training according to the training data set and the label corresponding to each column in the training data set to obtain a classifier model.
Specifically, S_l is taken as the feature vector set and C_l as the column labels (the preset label of each column), and both are input into a classifier algorithm (such as LightGBM) for training to obtain a multi-classification model M_l. LightGBM is a gradient boosting framework proposed by Microsoft; it is a decision-tree-based learning algorithm and can be used for classification tasks.
The classifier model obtained by training can, for an input column feature vector (denoted v_y), determine the probability distribution of the feature vector v_y over the merged clusters, P = [p_1, p_2, …, p_g], where g represents the number of column categories after cluster merging and p_h characterizes the probability that the column belongs to the corresponding cluster h.
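The per-column probability output can be illustrated with a toy stand-in for M_l: a softmax over negative distances to cluster centroids yields a distribution over the g merged clusters (the real method trains LightGBM; this pure-Python substitute, including its name, is only an illustrative assumption):

```python
import math

def column_class_probs(v, centroids):
    """Toy stand-in for a trained classifier M_l: convert distances from a
    column feature vector v to each cluster centroid into a probability
    distribution [p_1, ..., p_g] over the g merged clusters."""
    scores = [-math.dist(v, c) for c in centroids]   # nearer cluster -> higher score
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]         # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]
```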
Step 207, determining a similarity threshold using the positive and negative sample sets;
here, after the classifier model M_l is trained, positive and negative sample sets prepared in advance can, through the same feature extraction steps, be input into the model M_l for identification, and the similarity between each table and category l is judged. The similarity calculation formula is:

θ(r, l) = SC(r, l) / max(k_r, k_l), where SC(r, l) = Σ_{h=1}^{g} min(N_r^h, N_l^h)

wherein k_r represents the original number of columns of table r, k_l represents the original number of columns of the tables in category l, and SC(r, l) represents the number of common columns between sample table r and the tables of category l; N_r^h represents the number of columns of sample table r whose classification result is column category c_h; N_l^h represents the number of original columns in cluster c_h obtained by merging the tables of category l (i.e., the cluster obtained by similar-column merging of the tables of category l, which can be understood as a column category of the sample tables); min(N_r^h, N_l^h) takes the minimum of the two. The comparison is accumulated column category by column category c_h, and θ(r, l) is finally calculated; g represents the total number of column categories; h denotes the h-th column category of category l in the classification result.
All detection results on the positive and negative sample sets are evaluated comprehensively, and a reasonable similarity threshold θ_lt (θ_lt ∈ [0, 1]) corresponding to category l (i.e., to the corresponding classifier model) is set empirically (it can be set by a developer based on experience, or determined by the server based on a preset rule combined with the test results on the positive and negative sample sets) for making the similarity judgment;
for example, the largest θ satisfying a detection rate of 99% and a false positive rate below 1% is taken; when an unknown table e to be analyzed has a similarity θ(e, l) ≥ θ_lt, table e is considered highly similar to category l.
The positive and negative sample set includes: at least one test positive sample of the same category as the corresponding category table (e.g., a table of the above category l), and at least one test negative sample of a different category (e.g., a table not of category l).
Step 208, saving the classifier model;
here, the model M_l trained from at least one sample table of category l and the set similarity threshold θ_lt are persisted to disk.
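The "largest θ with detection rate ≥ 99% and false positive rate < 1%" rule of step 207 can be sketched as follows (the function name and candidate enumeration over observed scores are illustrative assumptions):

```python
def pick_threshold(pos_scores, neg_scores, det=0.99, fpr=0.01):
    """Among candidate thresholds, return the largest t whose detection rate
    on positive-sample similarities is >= det and whose false positive rate
    on negative-sample similarities is < fpr; None if no candidate qualifies."""
    candidates = sorted(set(pos_scores) | set(neg_scores) | {0.0, 1.0})
    best = None
    for t in candidates:
        detected = sum(s >= t for s in pos_scores) / len(pos_scores)
        false_pos = sum(s >= t for s in neg_scores) / len(neg_scores)
        if detected >= det and false_pos < fpr:
            best = t if best is None else max(best, t)
    return best
```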
If there are multiple sample table categories, the steps 201 to 208 are repeated to complete the storage of the classifier models and similarity thresholds of all the categories.
According to the method provided by the embodiment of the invention, a classifier model is trained with a machine learning algorithm on the per-column content statistics of each sensitive table; the table to be identified is analyzed column by column to obtain the similarity coefficient with respect to each classifier's sensitive table template, and the final judgment result is obtained by integrating the similarity results of all the sensitive table classifiers.
FIG. 5 is a flowchart illustrating a data identification method according to an embodiment of the present invention; as shown in fig. 5, the data identification method includes:
step 501, loading a plurality of pre-trained classifier models;
when the table to be analyzed is needed to be judged, firstly, a plurality of classifier models are read and loaded to obtain a sensitive table classifier model set;
step 502, extracting feature vectors of each column of a table to be analyzed to obtain a feature vector set of the table to be analyzed;
here, the content of the table to be analyzed e is read to obtain the feature vector set V_e = [v_e1, v_e2, v_e3, …, v_ej] of the table to be analyzed; v_ej characterizes the feature vector of the j-th column of the table e to be analyzed; the feature vector includes the character-level statistical feature values of each column;
specifically, the feature vector of the column may be extracted according to the method for analyzing the features by column shown in fig. 3; and will not be described in detail here.
Step 503, analyzing the feature vector set of the table to be analyzed with each classifier model;
assuming the sensitive-table classifier model set is M = [M_1, M_2, …, M_L] (the set includes L classification models, each trained on tables of a different category l to obtain the corresponding classifier model M_l), then for each classifier model M_l, each column feature vector v_ej in V_e is input into the model M_l separately for analysis, and the column category with the highest confidence is taken as the classification result, denoted r_ej^l, characterizing the column category of column j of table e under the classifier model M_l of category l; the per-column judgments are then aggregated to obtain the analysis result vector R_e^l = [r_e1^l, r_e2^l, …, r_ej^l].
Similarly, after V_e has been analyzed by all models in turn, the analysis result vector set R_e = [R_e^1, R_e^2, …, R_e^L] is obtained.
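Running every column through every classifier can be sketched as follows; `models` mapping a category id to a callable that returns a probability list over that model's clusters is an assumed interface (standing in for something like M_l.predict_proba):

```python
def analyze_table(feature_vectors, models):
    """Step 503 sketch: for each classifier model, classify every column's
    feature vector and keep the most confident cluster index, yielding one
    analysis result vector per category."""
    results = {}
    for l, model in models.items():
        # index of the highest-probability cluster for each column
        results[l] = [p.index(max(p)) for p in (model(v) for v in feature_vectors)]
    return results
```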
Step 504, calculating the similarity with each class of table according to the classification results;
for table e, the analysis result vector R_e^l of each sensitive table model must be compared with category l to compute a similarity; otherwise it cannot be known whether the table is a sensitive table, nor which sensitive category it belongs to.
The similarity calculation method is the same as the step of determining the similarity threshold using the test data set described in the method shown in FIG. 2, i.e., the formula

θ(b, a) = SC(b, a) / max(k_b, k_a), where SC(b, a) = Σ_{h=1}^{g} min(N_b^h, N_a^h)

is applied to calculate the similarity; the similarity results of all categories are aggregated to obtain Θ_e = [θ(e, 1), θ(e, 2), …, θ(e, L)].
Here, θ(b, a) represents the similarity between table b to be identified and the tables of category a; k_b represents the original number of columns of table b to be identified, and k_a represents the original number of columns of the tables of category a; SC(b, a) represents the number of common columns between the tables of category a and table b; N_b^h represents the number of columns of table b whose classification result is column category c_h; N_a^h represents the number of original columns in cluster c_h obtained after merging the tables of category a; min(N_b^h, N_a^h) takes the minimum of the two; the comparison is accumulated classification result by classification result (each corresponding to a cluster after similar-column merging), and θ(b, a) is finally calculated.
Step 505, judging whether the similarity exceeds each class threshold;
that is, judging whether there exists θ(e, l) ≥ θ_lt in Θ_e; if so, table e is considered to be a sensitive table, the file to which it belongs is a sensitive file, and the specific category is judged by the subsequent steps; otherwise, table e is considered not to belong to any sensitive table, and the judging flow ends;
step 506, selecting the most similar category as the belonging category;
here, among the similarity values in Θ_e that exceed their thresholds, the category z corresponding to the maximum similarity value θ(e, z) is selected as the sensitive category of table e, and the judging flow ends.
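Steps 505 and 506 reduce to a small decision rule; the function name and the dict representation of per-class similarities and thresholds are illustrative assumptions:

```python
def classify_table(similarities, thresholds):
    """Keep categories whose similarity reaches that category's threshold;
    the table is sensitive iff any remain, and its category is the one
    with the highest similarity (None means: not a sensitive table)."""
    passing = {l: s for l, s in similarities.items() if s >= thresholds[l]}
    if not passing:
        return None
    return max(passing, key=passing.get)
```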
The method provided by the embodiment of the invention uses a clustering algorithm for preliminary feature processing, the features being mainly character features of the content, and performs matching with a classification algorithm plus a discrete-set similarity matching method. In the classifier-model training stage, content character features of each class of pre-provided sensitive table are analyzed column by column, and a classifier is trained independently for each class of sensitive table with a classification algorithm. In the inference stage, the feature vector of each column is first extracted from the table file to be analyzed, classification results are produced by all the sensitive-table classifiers, and the similarity with the corresponding class of sensitive table is calculated; if a similarity is the highest value and is larger than the corresponding class threshold, the table is considered to belong to that sensitive class. The method has good generalization capability and does not depend on specific keyword content; besides accurately identifying sensitive tables with the same homologous table structure, it retains high recognition capability when column names are deleted, column order is exchanged, or columns are moderately added or deleted, and it has good extensibility.
Fig. 6 is a flowchart of another data identification method provided in an embodiment of the present invention, as shown in fig. 6, where the data identification method is applied to a server, and the method includes:
step 601, training a model by a sensitive form learning module;
the user needs to provide a sample file of the sensitive form for training, and according to category distinction, the sensitive form learning module counts the feature vectors of each column in various sensitive forms by reading form contents, and then trains to obtain a corresponding sensitive form classifier model.
The category may be a table of certain templates of the user's needs.
Step 602, a sensitive form identification module identifies a form;
when the security product audits to the form class file, analyzing the form content and judging whether the form content is sensitive content or not, loading a sensitive form classifier model by a sensitive form identification module, reading in a form to be identified, analyzing the content of each column of the form to be identified by utilizing each classifier model in the sensitive form classifier model, and finally summarizing the analysis results of all classifier models to obtain a final judgment result; if the non-sensitive form is considered, a release operation is performed; if the form is considered to be sensitive, an alarm is given and a file blocking strategy is performed.
Fig. 7 is a schematic structural diagram of a data identification device according to an embodiment of the present invention; as shown in fig. 7, the apparatus includes: the device comprises an acquisition unit, a processing unit and an identification unit; wherein,,
the acquisition unit is used for acquiring a first table;
the processing unit is used for determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
the identification unit is used for identifying the first column of feature vector set by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column;
determining the similarity between the first table and various tables in at least one type of tables according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the form category corresponding to the first form.
In some embodiments, the apparatus further comprises: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically used for acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
In some embodiments, the preprocessing unit is configured to determine at least one column of corresponding feature vectors according to a sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In some embodiments, the identifying unit is configured to perform similar column merging on each column in the first column feature vector set to obtain a second column feature vector set; identifying the second column of feature vector sets to obtain a first analysis result vector;
The identification unit is further used for determining a first column correlation number, a second column correlation number and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number for each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in each class of table; the fourth column correlation number characterizes the number of columns in the first table whose classification result is the corresponding column category;
determining, as a fifth column correlation number, the number of columns included in the cluster corresponding to the classification result in the corresponding class of table;
and determining the similarity between the first table and the corresponding class of table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In some embodiments, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
And obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
In some embodiments, the identifying unit is specifically configured to determine a similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
It should be noted that: in the data recognition device provided in the above embodiment, only the division of each program module is used for illustration, and in practical application, the process allocation may be performed by different program modules according to needs, that is, the internal structure of the device is divided into different program modules, so as to complete all or part of the processes described above. In addition, the data recognition device and the data recognition method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the data recognition device and the data recognition method are detailed in the method embodiments and are not repeated herein.
Fig. 8 is a schematic structural diagram of another data identification apparatus according to an embodiment of the present invention. The apparatus 80 includes: a processor 801 and a memory 802 for storing a computer program capable of running on the processor; wherein the processor 801, when executing the computer program, performs: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set includes a feature vector for each of the at least one column; the feature vector includes character features of the corresponding column; identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector includes, for each of the at least one column, the analysis result of each corresponding classifier model; determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining the recognition result according to the determined similarity; and the recognition result represents the table category corresponding to the first table.
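The per-column identification step executed by the processor 801 can be sketched as follows, with each trained classifier model reduced to a plain callable for illustration; this is a simplified sketch of the data flow, not the patented implementation.

```python
def analyze_table(column_vectors, classifiers):
    """Run each column feature vector through every per-class classifier model.

    column_vectors: one feature vector per column of the first table.
    classifiers: mapping of class-table name -> callable that returns a
        column-class label (the 'analysis result' for that classifier).
    Returns the first analysis result vector: for each column, the analysis
    result of each classifier model.
    """
    return [
        {name: clf(vec) for name, clf in classifiers.items()}
        for vec in column_vectors
    ]
```

The resulting per-column, per-classifier labels are what the later step aggregates into the column correlation numbers used for similarity.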
It should be noted that: the data recognition device and the data recognition method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein.
In practical applications, the apparatus 80 may further include: at least one network interface 803. The various components of the data recognition device 80 are coupled together by a bus system 804. It is to be appreciated that the bus system 804 is used to enable communication among these components. In addition to a data bus, the bus system 804 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 804 in Fig. 8. The number of processors 801 may be at least one. The network interface 803 is used for wired or wireless communication between the data recognition device 80 and other devices.
The memory 802 in embodiments of the present invention is used to store various types of data to support the operation of the data recognition device 80.
The method disclosed in the above embodiment of the present invention may be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware in the processor 801 or by instructions in the form of software. The processor 801 may be a general purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 801 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present invention may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, and the storage medium is located in the memory 802. The processor 801 reads the information from the memory 802 and, in combination with its hardware, performs the steps of the foregoing method.
In an exemplary embodiment, the data recognition device 80 may be implemented by one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array), general purpose processors, controllers, microcontrollers (MCU, Micro Controller Unit), microprocessors (Microprocessor), or other electronic components for performing the aforementioned methods.
The embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set includes a feature vector for each of the at least one column; the feature vector includes character features of the corresponding column; identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector includes, for each of the at least one column, the analysis result of each corresponding classifier model; determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining the recognition result according to the determined similarity; and the recognition result represents the table category corresponding to the first table.
The corresponding flow implemented by the server in each method of the embodiment of the present invention is implemented when the computer program is executed by the processor, and is not described herein for brevity.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps comprising the above method embodiments; and the aforementioned storage medium includes: a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Alternatively, the above-described integrated units of the present invention, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above description is not intended to limit the scope of the invention, but is intended to cover any modifications, equivalents, and improvements within the spirit and principles of the invention.

Claims (14)

1. A method of data identification, the method comprising:
acquiring a first table;
determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector includes, for each of the at least one column, the analysis result of the corresponding classifier model;
determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining a recognition result according to the determined similarity; and the recognition result represents the table category corresponding to the first table.
2. The method according to claim 1, wherein the method further comprises: training at least one classifier model; training the classifier model includes:
acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
3. The method according to claim 2, wherein the performing similar column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set includes:
determining at least one column of corresponding feature vector according to the sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
4. The method of claim 1, wherein the identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector comprises:
performing similar column combination on each column in the first column feature vector set to obtain a second column feature vector set;
identifying the second column feature vector set to obtain a first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector includes:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the number of columns in the first table, the second column correlation number represents the number of columns in the corresponding class table, and the third column correlation number represents the number of columns common to the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column categories in each table; the fourth column correlation number characterizes the number of similar columns in the first table, wherein the classification result of the similar columns is the corresponding column class;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
5. The method according to claim 1 or 2, wherein the preset feature acquisition strategy comprises:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
6. The method according to claim 1 or 4, wherein the determining the recognition result according to the determined similarity comprises:
determining the similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
7. A data recognition device, the device comprising: an acquisition unit, a processing unit, and an identification unit; wherein
the acquisition unit is used for acquiring a first table;
the processing unit is used for determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
the identification unit is used for identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector includes, for each of the at least one column, the analysis result of the corresponding classifier model;
determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining a recognition result according to the determined similarity; and the recognition result represents the table category corresponding to the first table.
8. The apparatus of claim 7, wherein the apparatus further comprises: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically used for acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
9. The apparatus of claim 8, wherein the preprocessing unit is configured to determine at least one column of corresponding feature vectors from a set of sample column feature vectors corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
10. The apparatus of claim 7, wherein the identifying unit is configured to perform similar column combination on each column in the first column feature vector set to obtain a second column feature vector set; identifying the second column feature vector set to obtain a first analysis result vector;
the identification unit is further used for determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the number of columns in the first table, the second column correlation number represents the number of columns in the corresponding class table, and the third column correlation number represents the number of columns common to the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column categories in each table; the fourth column correlation number characterizes the number of similar columns in the first table, wherein the classification result of the similar columns is the corresponding column class;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
11. The apparatus according to claim 7 or 8, wherein the preset feature acquisition policy comprises:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character-related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
12. The apparatus according to claim 7 or 10, wherein the identifying unit is specifically configured to determine a similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
13. A data recognition device, the device comprising: a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor being adapted to perform the steps of the method of any of claims 1 to 6 when the computer program is run.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202010664475.6A 2020-07-10 2020-07-10 Data identification method, device and storage medium Active CN111931229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664475.6A CN111931229B (en) 2020-07-10 2020-07-10 Data identification method, device and storage medium


Publications (2)

Publication Number Publication Date
CN111931229A CN111931229A (en) 2020-11-13
CN111931229B true CN111931229B (en) 2023-07-11

Family

ID=73312419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664475.6A Active CN111931229B (en) 2020-07-10 2020-07-10 Data identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111931229B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023157074A1 (en) * 2022-02-15 2023-08-24 日本電気株式会社 Teaching data generation assistance device, teaching data generation assistance system, teaching data generation method, and non-transitory computer-readable medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012006509A1 (en) * 2010-07-09 2012-01-12 Google Inc. Table search using recovered semantic information
CN108021565A (en) * 2016-11-01 2018-05-11 中国移动通信有限公司研究院 A kind of analysis method and device of the user satisfaction based on linguistic level
CN109492640A (en) * 2017-09-12 2019-03-19 中国移动通信有限公司研究院 Licence plate recognition method, device and computer readable storage medium
CN109635633A (en) * 2018-10-26 2019-04-16 平安科技(深圳)有限公司 Electronic device, bank slip recognition method and storage medium
CN109710725A (en) * 2018-12-13 2019-05-03 中国科学院信息工程研究所 A kind of Chinese table column label restoration methods and system based on text classification
CN110222171A (en) * 2019-05-08 2019-09-10 新华三大数据技术有限公司 A kind of application of disaggregated model, disaggregated model training method and device
WO2019174130A1 (en) * 2018-03-14 2019-09-19 平安科技(深圳)有限公司 Bill recognition method, server, and computer readable storage medium
CN110647795A (en) * 2019-07-30 2020-01-03 正和智能网络科技(广州)有限公司 Form recognition method
CN111144282A (en) * 2019-12-25 2020-05-12 北京同邦卓益科技有限公司 Table recognition method and device, and computer-readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11593376B2 (en) * 2015-10-09 2023-02-28 Informatica Llc Method, apparatus, and computer-readable medium to extract a referentially intact subset from a database
US10331947B2 (en) * 2017-04-26 2019-06-25 International Business Machines Corporation Automatic detection on string and column delimiters in tabular data files
US11288297B2 (en) * 2017-11-29 2022-03-29 Oracle International Corporation Explicit semantic analysis-based large-scale classification


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xie Jie; "Column Concept Determination for Chinese Web Tables via Convolutional Neural Network"; full text *

Also Published As

Publication number Publication date
CN111931229A (en) 2020-11-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant