CN111931229B - Data identification method, device and storage medium - Google Patents
- Publication number: CN111931229B (application CN202010664475.6A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F21/6227 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity; protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects such as a local or distributed file system or database, where protection concerns the structure of data, e.g. records, types, queries
- G06F16/35 — Information retrieval; database structures therefor; unstructured textual data; clustering; classification
- G06F40/174 — Handling natural language data; text processing; editing; form filling; merging
- G06F40/186 — Handling natural language data; text processing; editing; templates
Abstract
The invention discloses a data identification method, apparatus, and storage medium. The method comprises: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy, where the first table includes at least one column, the first column feature vector set includes a feature vector for each of the at least one column, and each feature vector includes character features of the corresponding column; identifying the first column feature vector set with a preset recognition model to obtain a first analysis result vector, where the recognition model includes at least one classifier model, each classifier model of the at least one classifier model is used to identify a corresponding class of table, and the first analysis result vector includes, for each of the at least one column, the analysis result of the corresponding classifier model; and determining, according to the first analysis result vector, the similarity between the first table and each class of table in at least one class of tables, and determining the identification result according to the determined similarity. The identification result characterizes the table category corresponding to the first table.
Description
Technical Field
The present invention relates to data identification technology, and in particular to a data identification method, an apparatus, and a computer-readable storage medium.
Background
A table can serve as a means of sorting and organizing data and can contain diverse data; sensitive-data analysis techniques include sensitive-table recognition. In the related art, table data identification mainly relies on a keyword content matching scheme: the user must register the table files to be protected in advance, digest and keyword-matching techniques record specific contents of the tables, and the scheme then analyzes whether a table hits the same contents. This approach has high recognition accuracy but poor detection capability when the content is changed.
Disclosure of Invention
In view of the foregoing, a primary object of the present invention is to provide a data identification method, apparatus and computer readable storage medium.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a data identification method, which comprises the following steps:
acquiring a first table;
determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
identifying the first column feature vector set using a preset recognition model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used to identify a corresponding class of table; the first analysis result vector includes, for each of the at least one column, the analysis result of the corresponding classifier model;
determining the similarity between the first table and various tables in at least one type of tables according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the form category corresponding to the first form.
In the above scheme, the method further comprises: training at least one classifier model; training the classifier model includes:
acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
In the above solution, the performing similar column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set includes:
determining at least one column of corresponding feature vector according to the sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In the above scheme, the identifying the first column of feature vector set by using a preset identification model to obtain a first analysis result vector includes:
performing similar column combination on each column in the first column of feature vector set to obtain a second column of feature vector set;
identifying the second column of feature vector sets to obtain a first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector includes:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in the corresponding class of table; the fourth column correlation number characterizes the number of similar columns in the first table whose classification result is the corresponding column category;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In the above solution, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
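To make the feature acquisition strategy concrete, the following Python sketch extracts one character-feature vector per column. The specific features chosen here (digit/letter/symbol ratios and average cell length) are illustrative assumptions; the text only requires that the vector contain character-related features of the corresponding column.

```python
from statistics import mean

def column_features(values):
    """Character-related features for one table column.

    The concrete feature set is an assumption for illustration:
    digit ratio, letter ratio, other-symbol ratio, average cell length.
    """
    cells = [str(v) for v in values if v is not None and str(v) != ""]
    if not cells:
        return [0.0, 0.0, 0.0, 0.0]
    total = sum(len(c) for c in cells)
    digits = sum(ch.isdigit() for c in cells for ch in c)
    letters = sum(ch.isalpha() for c in cells for ch in c)
    return [
        digits / total,                      # share of digit characters
        letters / total,                     # share of letter characters
        (total - digits - letters) / total,  # share of other symbols
        mean(len(c) for c in cells),         # average cell length
    ]

def table_feature_set(table):
    """Column feature vector set: one feature vector per column name."""
    return {name: column_features(col) for name, col in table.items()}
```

Such ratio features stay comparable even when the concrete cell values change, which is what gives the method its robustness against content substitution.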
In the above solution, the determining the identification result according to the determined similarity includes:
determining the similarity between the first table and each of the at least one class of tables;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
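A minimal sketch of this threshold-and-sort step follows (Python; the 0.5 threshold and returning all hits ranked best-first are illustrative assumptions):

```python
def identify(similarities, threshold=0.5):
    """Return the categories of tables whose similarity to the first
    table exceeds the threshold, sorted from most to least similar.
    An empty list means the table matches no known category."""
    hits = [(cat, s) for cat, s in similarities.items() if s > threshold]
    hits.sort(key=lambda kv: kv[1], reverse=True)
    return [cat for cat, _ in hits]
```

The first element of the returned list, if any, can serve as the recognition result.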
The embodiment of the invention provides a data identification device, which comprises: an acquisition unit, a processing unit and an identification unit; wherein,
the acquisition unit is used for acquiring a first table;
the processing unit is used for determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
the identification unit is used for identifying the first column of feature vector set by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column;
determining the similarity between the first table and each class of table in at least one class of tables according to the first analysis result vector, and determining the identification result according to the determined similarity; the identification result characterizes the table category corresponding to the first table.
In the above scheme, the device further includes: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically used for acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
In the above scheme, the preprocessing unit is configured to determine at least one column of corresponding feature vector according to a sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In the above scheme, the identifying unit is configured to perform similar column merging on each column in the first column feature vector set to obtain a second column feature vector set; identifying the second column of feature vector sets to obtain a first analysis result vector;
the identification unit is further used for determining a first column correlation number, a second column correlation number and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in the corresponding class of table; the fourth column correlation number characterizes the number of similar columns in the first table whose classification result is the corresponding column category;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In the above solution, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
In the above solution, the identifying unit is specifically configured to determine a similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
The embodiment of the invention provides a data identification device, which comprises: a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor is configured to execute the steps of any of the data identification methods described above when the computer program is run.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data identification method of any of the above.
The embodiment of the invention provides a data identification method, a data identification device and a computer-readable storage medium, the method comprising: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy, the first table including at least one column, the first column feature vector set including a feature vector for each of the at least one column, and each feature vector including character features of the corresponding column; identifying the first column feature vector set using a preset recognition model to obtain a first analysis result vector, the recognition model including at least one classifier model, each classifier model of the at least one classifier model being used to identify a corresponding class of table, and the first analysis result vector including, for each of the at least one column, the analysis result of the corresponding classifier model; and determining, according to the first analysis result vector, the similarity between the first table and each class of table in at least one class of tables, and determining the identification result according to the determined similarity, the identification result characterizing the table category corresponding to the first table. In this way, recognition is based on the character features of each column of the table, which gives good recognition capability and good generalization and robustness for sensitive-data scenarios of the same type but with different key information.
Drawings
Fig. 1 is a schematic flow chart of a data identification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training method of a classifier model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a column-wise analysis feature according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a similar column merging method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating another data identification method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a data identification method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a data identification device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of another data identification apparatus according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the embodiments of the present application, the technical solutions of the embodiments of the present application will be clearly described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments.
The terms "first", "second", "third" and the like in the description, the claims and the figures of the present application are used for distinguishing between similar objects and do not necessarily describe a particular sequence or chronological order. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion; a process, method, system, article, or apparatus that comprises a series of steps or elements is not necessarily limited to those explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The following describes a related art related to a related data identification method.
In combination with the above, the keyword-content-matching scheme acquires the table files to be protected, records specific contents of the tables using digest and keyword-matching techniques, and, in the matching stage, analyzes whether a table hits the same contents. This method has poor detection capability when the content is changed, and sensitive data of the same type but with different content is difficult to detect: for example, if the sensitive content is designated as "Zhang San, abc@hotmail.com", then "Li Si, efg@hotmail.com" cannot be identified as sensitive data.
Based on this, in various embodiments of the present invention, a first table is acquired; determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns; identifying the first column of feature vector sets by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column; determining the similarity between the first table and various tables in at least one type of tables according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the form category corresponding to the first form.
The present invention will be described in further detail with reference to examples.
Fig. 1 is a schematic flow chart of a data identification method according to an embodiment of the present invention; as shown in fig. 1, the data identification method is applied to a server, and the method includes:
101, acquiring a first table; wherein the first table includes at least one column;
102, determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first column feature vector set includes a feature vector for each of the at least one column; each feature vector includes character features of the corresponding column;
103, identifying the first column feature vector set using a preset recognition model to obtain a first analysis result vector; wherein the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used to identify a corresponding class of table; the first analysis result vector includes, for each of the at least one column, the analysis result of the corresponding classifier model;
104, determining the similarity between the first table and various tables in at least one type of tables according to the first analysis result vector, and determining an identification result according to the determined similarity;
The identification result characterizes a table category corresponding to the first table.
Here, the data identification method is applicable to the identification of structured data; structured data refers to data logically expressed and implemented by a two-dimensional table structure, most commonly represented by document-class tables and database tables. That is, the first table may be an office-document table, a database table, etc.
In some embodiments, the recognition model includes at least one classifier model;
the classifier model is a classifier for classifying the table; different classifier models (i.e., different classifiers) are used to identify different types of tables; the method may pre-train different classifier models to identify different types of tables.
Here, the type of the table may be set based on the needs of the user. For example, if a user needs to identify financial tables, a classifier model for the corresponding financial tables may be trained. A financial table may follow a certain template (or several templates with similarity between them); that is, the user may specifically set the template of the table, including which columns the table contains and the category of each column.
The method further comprises the steps of: training at least one classifier model;
training a classifier model for each classifier model, comprising:
acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
The classifier model (also referred to as a classifier) is a generic term of a method for classifying samples in data mining, and a classifier model is constructed by using a statistical method or a classification algorithm on the basis of existing data.
In the embodiment of the invention, corresponding classifier models are trained for different types of tables, so that when the classifier models are used, at least one classifier model obtained through training can be used for identifying different tables.
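Since no particular classification algorithm is fixed here, the sketch below uses a nearest-centroid classifier as an illustrative stand-in: one such model is trained per table category, on the clustered (similar-column-merged) training data set, with one centroid per column label.

```python
class CentroidClassifier:
    """One classifier per table category; nearest centroid is an
    assumed stand-in for the unspecified classification algorithm."""

    def fit(self, training_set):
        # training_set: {column_label: [feature_vector, ...]} — one
        # entry per cluster produced by similar-column merging.
        self.centroids = {
            label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
            for label, vecs in training_set.items()
        }
        return self

    def predict(self, vec):
        # Classify a column's feature vector as the nearest column label.
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(vec, c))
        return min(self.centroids, key=lambda lb: sq_dist(self.centroids[lb]))
```

In practice a decision tree or any other supervised classifier could be substituted without changing the surrounding pipeline.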
The step of performing similar column merging according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set includes:
Determining at least one column of corresponding feature vector according to the sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
Specifically, similar-column merging refers to merging columns whose content is identical or highly similar when such different columns exist in a table. For example, a table may simultaneously have the two columns "start time" and "end time", whose contents are often identical, and the two columns are then merged.
Here, the clustering the at least one column of corresponding feature vectors to obtain at least one cluster includes:
determining a feature vector corresponding to each column in the at least one column;
performing cluster analysis using a clustering algorithm based on a similarity threshold or a clustering algorithm based on a density threshold to obtain at least one cluster; the columns in each cluster are labeled with the same class label, that is, columns grouped into the same cluster are treated as the same class at the later classifier-training stage (a label here refers to the label of a column; for example, the start time and end time described above may fall into one cluster, both columns belonging to a time class).
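The similarity-threshold variant of this clustering step can be sketched as follows (Python; the cosine measure, the 0.9 threshold, and the greedy first-fit assignment are illustrative assumptions — a density-threshold algorithm would serve equally well):

```python
def merge_similar_columns(col_vectors, threshold=0.9):
    """Greedy similarity-threshold clustering of column feature
    vectors: a column joins the first cluster whose representative
    it is similar enough to, otherwise it starts a new cluster."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    clusters = []  # list of (representative_vector, [column names])
    for name, vec in col_vectors.items():
        for rep, members in clusters:
            if cosine(vec, rep) >= threshold:
                members.append(name)  # merge into the similar cluster
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]
```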
In some embodiments, the identifying the first column feature vector set using a preset recognition model to obtain a first analysis result vector includes:
performing similar column combination on each column in the first column of feature vector set to obtain a second column of feature vector set;
identifying the second column of feature vector sets to obtain a first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector includes:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column categories in each table; the fourth column correlation number characterizes the number of similar columns in the first table, wherein the classification result of the similar columns is the corresponding column class;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
And determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
Specifically, a first table is identified based on a certain classifier model; the first table originally has A1 columns, and A2 column categories are obtained after similar columns are merged; the class table corresponding to the classifier model originally has B1 columns, and B2 column categories are obtained after similar columns are merged;
wherein the first column correlation number is A1 and the second column correlation number is B1; the third column correlation number is assumed to be C1 (i.e., the number of common columns among the A1 columns and the B1 columns; common columns are columns belonging to the same class, such as columns that are both of the time class (start time, end time, etc.));
further, assuming that A1 is 10, there are 5 columns of the time class and 5 columns of other classes in the first table, and 6 column categories remain after similar columns are merged;
assume the class table corresponding to a certain classifier model has 3 columns of the time class;
then, for the columns of the time class (i.e., a sort result), the fourth column correlation number is 5, and the fifth column correlation number is the number of columns included in the cluster corresponding to the columns of the time class, i.e., 3;
For columns of other classes (i.e., other classification results), the fourth column correlation number is the number of columns for the respective classification result; the fifth column correlation number is the number of columns included for the cluster corresponding to the classification result;
and carrying out data statistics on various columns in the first table, and calculating to obtain the similarity.
The similarity can be calculated using the following formula:

θ_b^a = Σ_{h=1..g} min(n_b^h, n_a^h) / (k_b + k_a − SC(b, a))

wherein a corresponds to a classifier model, namely a table of a certain class, and b represents the first table; k_b characterizes the first column correlation number and k_a characterizes the second column correlation number; SC(b, a) characterizes the third column correlation number; n_b^h characterizes the fourth column correlation number and n_a^h characterizes the fifth column correlation number for column category h; min(n_b^h, n_a^h) takes the minimum of the two; g is the total number of column categories after similar column merging.
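A minimal sketch of one way the five column correlation numbers could combine into a similarity score: per column category, the minimum of the fourth and fifth correlation numbers is accumulated and normalised by the union of the two tables' columns. The function name, parameter names, k_a = 8, and the "other"-class counts are assumptions for illustration, not the patent's reference implementation.

```python
def table_similarity(k_b, k_a, sc, fourth, fifth):
    """Similarity of table b to a class table a (hypothetical combination).

    k_b    -- first column correlation number (number of columns of table b)
    k_a    -- second column correlation number (columns of class table a)
    sc     -- third column correlation number (common columns of b and a)
    fourth -- per column category: number of columns of b classified into it
    fifth  -- per column category: number of columns in a's matching cluster
    """
    matched = sum(min(n, fifth.get(cat, 0)) for cat, n in fourth.items())
    union = k_b + k_a - sc  # columns of b or a, counting common columns once
    return matched / union

# worked numbers from the text: the time class contributes min(5, 3) = 3;
# k_a = 8 and the "other"-class counts are illustrative assumptions
sim = table_similarity(k_b=10, k_a=8, sc=6,
                      fourth={"time": 5, "other": 5},
                      fifth={"time": 3, "other": 4})
```

With these numbers the matched-column count is 7 over a column union of 12.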
In some embodiments, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
The same preset feature acquisition strategy can be adopted both for the first table to be identified and for the sample tables during training.
The server may use an open source tool or a tool library for extracting data to read the content of the corresponding table, determine the number of rows and columns of the corresponding table, determine the content values of different rows and columns, and determine the feature vector of at least one column according to the content values of different rows and columns.
The feature vector includes: character features of the columns; the character features specifically refer to statistical feature values of character levels in each column.
For example, the feature vector includes character-level statistical feature values of at least one of:
- average length (i.e., the average of the lengths of the content values in a certain column of the table);
- median length (i.e., the middle value among the lengths of the content values);
- maximum length (i.e., the maximum among the lengths of the content values);
- minimum length (i.e., the minimum among the lengths of the content values);
- length variance (i.e., the variance of the lengths of the content values);
- average Chinese character ratio (i.e., the average proportion of Chinese characters in each content value);
- average uppercase English character ratio (i.e., the average proportion of uppercase English characters in each content value);
- average lowercase English character ratio (i.e., the average proportion of lowercase English characters in each content value);
- average digit ratio (i.e., the average proportion of digits in each content value);
- average other-special-character ratio (i.e., the average proportion of other special characters in each content value); and the like.

Other special characters are characters other than English characters, Chinese characters, and digit characters, e.g., special symbols.
That is, the statistical feature value at the character level described above can be determined from the content values of the respective columns.
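A minimal sketch of computing such character-level statistical feature values for one column; the function name and dictionary keys are hypothetical, and only a subset of the listed features is shown.

```python
import statistics

def column_features(values):
    """Character-level statistical feature values for one table column
    (a sketch covering a subset of the features listed above)."""
    lengths = [len(v) for v in values]

    def avg_ratio(pred):
        # average, over the content values, of the proportion of
        # characters in each value satisfying pred
        return statistics.mean(
            (sum(pred(ch) for ch in v) / len(v)) if v else 0.0
            for v in values)

    return {
        "avg_len": statistics.mean(lengths),
        "median_len": statistics.median(lengths),
        "max_len": max(lengths),
        "min_len": min(lengths),
        "len_var": statistics.pvariance(lengths),
        "chinese_ratio": avg_ratio(lambda c: "\u4e00" <= c <= "\u9fff"),
        "upper_ratio": avg_ratio(str.isupper),
        "lower_ratio": avg_ratio(str.islower),
        "digit_ratio": avg_ratio(str.isdigit),
    }

# e.g. a time-class column: lengths are all 10, 8 of 10 characters are digits
feats = column_features(["2020-07-13", "2020-07-14"])
```

The resulting dictionary is one column's feature vector; stacking the vectors of all columns gives the column feature vector set of the table.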
In some embodiments, the determining the identification result according to the determined similarity includes:
determining the similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
Here, the recognition result obtained based on the sorting result may be a category of a table with highest similarity determined according to the sorting result, as the category to which the first table belongs.
Specifically, in some embodiments, the method may further comprise: setting a similarity threshold for each classifier model;
The similarity threshold may be set by a developer based on experience and user requirements; alternatively, the classifier model can be tested with the positive and negative sample set (i.e., positive samples and negative samples are identified by the classifier model to obtain corresponding test identification results), and the threshold is adjusted based on the detection results to obtain the similarity threshold for each classifier model.
Here, the same or different similarity thresholds may be set for different classifier models. When the same similarity threshold is adopted, the category of the table with the highest similarity is determined as the category to which the first table belongs; when different similarity thresholds are adopted, the category whose similarity exceeds its corresponding threshold by the largest margin can be determined as the category to which the first table belongs.
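The per-model decision under differing thresholds can be sketched as follows; the function and dictionary names are hypothetical, and the margin (similarity minus threshold) reading of "most distant from the threshold" is an assumption.

```python
def decide_category(similarities, thresholds):
    """Pick the recognised category for a table.

    similarities -- dict: category -> similarity of the table to that category
    thresholds   -- dict: category -> similarity threshold of that classifier
    Returns the category whose similarity exceeds its own threshold by the
    widest margin, or None when no category passes.
    """
    passing = {c: s - thresholds[c]
               for c, s in similarities.items() if s >= thresholds[c]}
    if not passing:
        return None  # the table matches no class
    return max(passing, key=passing.get)

# "contract" passes by 0.2, "payroll" only by 0.1 -> "contract" wins
cat = decide_category({"contract": 0.7, "payroll": 0.9},
                      {"contract": 0.5, "payroll": 0.8})
```

With a single shared threshold the same function degenerates to picking the highest similarity among the passing categories.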
FIG. 2 is a flow chart of a training method of a classifier model according to an embodiment of the present invention; as shown in fig. 2, the training method of the classifier model includes:
here, the sample table set includes: at least one sample table;
the table category of the sample table set is marked as category l; category l may be a certain category required by the user, for example a table following a certain template set by the user, the template including: start time, end time, event condition, etc.
for each sample table, an open source tool or a tool library for data extraction can be specifically used for reading the content of the corresponding sample table, determining the number of rows and columns of the sample table of the corresponding table, and then determining the content values of different rows and columns.
here, for each sample table, the feature vectors of the respective columns in the sample table are analyzed by column in units of columns of the sample table; and obtaining the feature matrix of the sample table according to the feature vectors of each column. The feature vector comprises character features of each column; the character features specifically refer to statistical feature values of character levels in each column.
In some embodiments, in conjunction with the illustration of fig. 3, the analyzing the table features by columns, results in a feature matrix of the table, including:
here, assume the currently read sample table x has j columns, and n rows are selected from the whole table;
the selection policy may be any of: the first n rows, the last n rows, or n completely randomly selected rows of the sample table; the content values of the j columns in the n selected row records are read, i.e., for each of the j columns of the sample table, its n content values are read;
here, the n may be set by a developer based on user requirements; n is greater than or equal to 1;
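The three row-selection policies can be sketched as below; the function and strategy names are hypothetical.

```python
import random

def select_rows(rows, n, strategy="first"):
    """Select n rows from a table for feature statistics (n >= 1).

    strategy is one of "first", "last", "random", matching the three
    selection policies described above.
    """
    if strategy == "first":
        return rows[:n]
    if strategy == "last":
        return rows[-n:]
    if strategy == "random":
        return random.sample(rows, min(n, len(rows)))
    raise ValueError(f"unknown strategy: {strategy}")

rows = [["r%d" % i] for i in range(5)]  # a toy 5-row, 1-column table
```

The character statistics of the next step are then computed over the selected rows only.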
Here, character feature statistics are performed on the content values of the j columns in the n-row records; the character-level statistical feature values include, but are not limited to:
- average length (i.e., the average of the lengths of the content values);
- median length (i.e., the middle value among the lengths of the content values);
- maximum length (i.e., the maximum among the lengths of the content values);
- minimum length (i.e., the minimum among the lengths of the content values);
- length variance (i.e., the variance of the lengths of the content values);
- average Chinese character ratio (i.e., the average proportion of Chinese characters in each content value);
- average uppercase English character ratio (i.e., the average proportion of uppercase English characters in each content value);
- average lowercase English character ratio (i.e., the average proportion of lowercase English characters in each content value);
- average digit ratio (i.e., the average proportion of digits in each content value);
- average other-special-character ratio (i.e., the average proportion of other special characters in each content value); and the like.

Other special characters are characters other than English characters, Chinese characters, and digit characters, e.g., special symbols.
The feature vector of column j of the sample table x is obtained through the above dimension statistics and is denoted v_xj; that is, the feature vector of column j is obtained from the character-level statistical feature values of column j; the obtained feature vector includes the statistical feature value of each character level;
Specifically, all columns in the sample table x are counted according to the steps 2021 to 2023 to obtain the feature vector v_xj of each column in table x, and finally the feature matrix V_x of the sample table x is obtained from the feature vectors of the columns: V_x = [v_x1, v_x2, …, v_xk], where k represents the original number of columns of the sample table x.
Here, the server may determine a sample column feature vector set corresponding to each sample table through steps 202 and 203.
Here, it is considered that, among the columns of a table, the contents of different columns may be identical or highly similar; for example, some class tables have both a "start time" column and an "end time" column whose contents are always identical. When the classifier model is trained with columns as the classes, such columns should belong to the same class, so the columns are subjected to merging preprocessing.
Specifically, the similar column merge includes: and carrying out similar column combination according to the sample column feature vector set corresponding to each sample table.
In some embodiments, as shown in connection with fig. 4, the similar column merge includes:
2041, extracting feature vectors of each column;
If the user provides only one sample table file (denoted x) in class l, the set of feature vectors of each column, i.e., the feature matrix, is V_x = [v_x1, v_x2, v_x3, …, v_xj, …, v_xk]; v_lj is the feature vector of each column in category l (v_lj = v_xj);
If there are multiple tables of the same structure, and the tables in those table files are considered to belong to the same class, the set of feature vectors of each column, i.e., the feature matrix, is V_l = [v_l1, v_l2, …, v_lj, …], where l represents the class number and i represents the sample table number within the class; the feature vector v_lj of each column is the average of the statistical feature values of the corresponding columns of the tables, i.e., v_lj = (1/m) Σ_{i=1..m} v_ij, where m represents the number of tables.
2042, clustering: here, the set of feature vectors of each column of class l, V_l = [v_l1, v_l2, v_l3, …, v_lj, …], is used as the clustering input, and a clustering algorithm based on a similarity threshold (such as Agglomerative Clustering, a hierarchical clustering) or a clustering algorithm based on a density threshold (such as DBSCAN, Density-Based Spatial Clustering of Applications with Noise) is adopted for cluster analysis; the advantage of such algorithms is that clusters can be divided automatically according to a measurement threshold without specifying the number of clusters in advance. A clustering result C_l = [c_1, c_2, …, c_g] is obtained after clustering; g represents the total number of clusters, and each cluster may include at least one column feature vector.
2043, marking columns in the same cluster as the same class labels;
The original columns merged into the same cluster have the same class label in the subsequent classifier-model training stage; g ≤ k, i.e., the number of columns after cluster merging is less than or equal to the number of original columns, where g represents the number of columns after cluster merging.
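In practice a library clusterer such as Agglomerative Clustering or DBSCAN would be used for this step; the toy single-link merge below only illustrates the idea that column feature vectors within a distance threshold end up with the same class label, with hypothetical names and numbers.

```python
import math

def merge_similar_columns(vectors, threshold):
    """Greedy single-link merge: a column joins the cluster of the first
    earlier column within the Euclidean distance threshold, otherwise it
    opens a new cluster (toy stand-in for AgglomerativeClustering/DBSCAN)."""
    labels = [-1] * len(vectors)
    g = 0  # number of clusters so far
    for i, v in enumerate(vectors):
        for j in range(i):
            if math.dist(vectors[j], v) <= threshold:
                labels[i] = labels[j]  # same cluster -> same class label
                break
        else:
            labels[i] = g
            g += 1
    return labels, g

# two statistically indistinguishable columns (e.g. start/end time) merge
labels, g = merge_similar_columns(
    [[10.0, 0.8], [10.0, 0.8], [4.0, 0.0]], threshold=0.5)
```

Here the first two columns receive the same label and g ≤ k holds (2 clusters from 3 original columns).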
The feature matrices of all sample tables of class l are summarized to form a feature vector set, denoted S_l = [V_1, V_2, …, V_m], wherein V_i = [v_i1, v_i2, …, v_ik], i represents the sample table number, k represents the number of original table columns, and m represents the number of tables in the class;
here, the step 206 includes: training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
Specifically, here, S_l is taken as the feature vector set and C_l as the labels of the columns (the preset label of each column), and they are input into a classifier model algorithm (such as LightGBM) for training to obtain a multi-class classification model M_l; LightGBM is a gradient boosting framework proposed by Microsoft, a learning algorithm based on decision trees, and can be used for classification tasks;
The classifier model obtained by training can, for an input feature vector (denoted v_y), judge and identify the probability distribution [p_1, p_2, …, p_g] of the feature vector v_y over the merged clusters, where g represents the number of columns after cluster merging and p_h characterizes the probability value that the column belongs to the corresponding cluster h.
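The patent trains a LightGBM multi-class model; as a self-contained stand-in with the same fit / per-cluster probability interface, here is a hypothetical nearest-centroid sketch (inverse-distance weights normalised into a probability distribution) — not LightGBM itself.

```python
import math

class NearestCentroidColumns:
    """Toy stand-in for the multi-class column model M_l: fit on column
    feature vectors with their merged-cluster labels, then return a
    probability distribution over the g clusters for a new column vector."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = {
            c: [sum(col) / len(col)
                for col in zip(*(x for x, label in zip(X, y) if label == c))]
            for c in self.classes_
        }
        return self

    def predict_proba(self, v):
        # inverse-distance weights, normalised to sum to 1
        w = {c: 1.0 / (1e-9 + math.dist(v, mu))
             for c, mu in self.centroids_.items()}
        z = sum(w.values())
        return [w[c] / z for c in self.classes_]

model = NearestCentroidColumns().fit(
    [[10.0, 0.8], [10.2, 0.8], [4.0, 0.0], [3.8, 0.1]],  # column vectors
    [0, 0, 1, 1])                                        # cluster labels
probs = model.predict_proba([9.9, 0.8])  # distribution over g = 2 clusters
```

In the patent's pipeline, the argmax of such a distribution is the column's classification result with the highest confidence.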
Here, after the classifier model M_l is trained, the positive and negative sample sets prepared in advance can be input, through the same feature extraction steps, into the model M_l for identification, and the similarity between each table and category l is judged; the similarity calculation formula is as follows:
θ_r^l = Σ_{h=1..g} min(n_r^h, n_l^h) / (k_r + k_l − SC(r, l))

wherein k_r represents the number of original columns of table r and k_l represents the number of original columns of the tables in category l; SC(r, l) represents the number of common columns of sample table r and the tables of category l; n_r^h represents the number of columns in sample table r whose classification result is column category c_h; n_l^h represents the number of original columns in the cluster category c_h obtained by merging the tables of class l (i.e., the cluster obtained by merging similar columns of the tables of class l, which can be understood as a column category of the sample table); min(n_r^h, n_l^h) represents taking the minimum of the two; accumulating the comparison column category by column category c_h finally yields θ_r^l; g represents the total number of column categories of the sample table r; c_h represents the h-th column category corresponding to category l in the classification result, h indexing a certain column category in category l.
All detection results on the positive and negative sample sets are comprehensively tested, and a reasonable similarity threshold θ_lt corresponding to the sample class l (i.e., to the corresponding classifier model) is set according to experience (it may be divided by a developer based on experience, or determined by the server based on a preset division rule combined with the test results of the positive and negative sample sets), θ_lt ∈ [0, 1], for making the similarity determination;
For example, take the maximum θ satisfying a detection rate of at least 99% and a false-positive rate below 1%; when the similarity analysis value of an unknown analysis table e is greater than or equal to θ_lt, the table e is considered to be highly similar to category l.
The positive and negative sample set includes: at least one positive test sample, i.e., a table of the same class as the corresponding class (e.g., the above category l), and at least one negative test sample, i.e., a table of a class different from the corresponding class.
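Choosing θ_lt from the similarity scores of positive and negative samples under the 99% detection / 1% false-positive example criterion might look like this; the function name and the sample scores are hypothetical.

```python
def pick_threshold(pos_scores, neg_scores, min_detect=0.99, max_fp=0.01):
    """Largest threshold theta_lt whose detection rate on positive samples
    is >= min_detect and whose false-positive rate on negative samples is
    < max_fp (the 99% / 1% example criterion above)."""
    candidates = []
    for t in sorted(set(pos_scores + neg_scores + [0.0, 1.0])):
        detect = sum(s >= t for s in pos_scores) / len(pos_scores)
        fp = sum(s >= t for s in neg_scores) / len(neg_scores)
        if detect >= min_detect and fp < max_fp:
            candidates.append(t)
    return max(candidates) if candidates else None

# hypothetical similarity scores of positive / negative samples under M_l
theta = pick_threshold(pos_scores=[0.9, 0.8, 0.85],
                       neg_scores=[0.1, 0.3, 0.2])
```

When the two score populations overlap too much, no threshold satisfies both rates and the function returns None, signalling that the model or criterion needs revisiting.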
Here, the model M_l trained from the at least one sample table of category l and the set similarity threshold θ_lt are persisted to disk.
If there are multiple sample table categories, repeating the steps 301-307 to complete the storage of classifier models and similarity thresholds for all the categories.
According to the method provided by the embodiment of the invention, the classifier model is trained according to the content statistical information of each column of the sensitive table by using a machine learning algorithm, the to-be-identified table is analyzed column by column to obtain the similarity coefficient corresponding to the sensitive table template of the classifier, and the final judgment result is obtained by integrating the similarity results of all the sensitive table classifiers.
FIG. 5 is a flowchart illustrating a data identification method according to an embodiment of the present invention; as shown in fig. 5, the data identification method includes:
when the table to be analyzed is needed to be judged, firstly, a plurality of classifier models are read and loaded to obtain a sensitive table classifier model set;
Here, the content of the table e to be analyzed is read to obtain the feature vector set of the table, V_e = [v_e1, v_e2, v_e3, …, v_ej]; v_ej characterizes the feature vector of the j-th column of the table e to be analyzed; the feature vector includes the character-level statistical feature values of each column;
specifically, the feature vector of the column may be extracted according to the method for analyzing the features by column shown in fig. 3; and will not be described in detail here.
Assume the sensitive table classifier model set is {M_1, M_2, …, M_L} (the set includes L classification models, each obtained by training on tables of a different class l to get the corresponding classifier model M_l). For each classifier model M_l, each column feature vector v_ej in V_e is input into the model M_l separately for analysis, and the column category with the highest confidence is obtained as the classification result (denoted r_ej^l), characterizing the class to which column j of table e belongs under the classifier model M_l of class l; finally, the analyzed and judged results are summarized to obtain an analysis result vector R_e^l = [r_e1^l, r_e2^l, …].
For table e, the analysis result vectors R_e^l produced by all sensitive table models need to be compared for similarity with the corresponding class l; otherwise it cannot be known whether the table is a sensitive table, nor to which sensitive class it belongs.
The similarity calculation method is the same as the step of determining the similarity threshold using the test dataset described in the method shown in FIG. 2, i.e., the formula θ_b^a = Σ_{h=1..g} min(n_b^h, n_a^h) / (k_b + k_a − SC(b, a)) is applied to calculate the similarity; summarizing the similarity results of all classes gives Θ_e = [θ_e^1, θ_e^2, …, θ_e^L].
Here, θ_b^a represents the similarity of the table b to be identified and the table of category a; k_b represents the number of original columns of the table b to be identified; k_a represents the number of original columns of the table of category a; SC(b, a) represents the number of common columns of the table of category a and the table b;
wherein n_b^h represents the number of columns in the table b to be identified whose classification result is column category c_h; n_a^h represents the number of original columns in the cluster c_h of the merged tables of class a; min(n_b^h, n_a^h) takes the minimum of the two; accumulating the comparison over the classification results c_h (corresponding to the clusters after similar columns are merged) one by one finally yields θ_b^a.
That is, it is judged whether there exists θ_e^l ≥ θ_lt in Θ_e; if so, table e is considered to be a sensitive table and the file to which it belongs is a sensitive file, whose specific category is judged by the subsequent steps; otherwise, table e is considered not to belong to a sensitive table, and the judging flow ends;
Here, the maximum similarity value θ_e^z is selected from Θ_e; the corresponding category z is the sensitive category of table e, and the judging flow ends.
The method provided by the embodiment of the invention uses a clustering algorithm for feature preprocessing, the features mainly being character features of the content, and performs matching by combining a classification algorithm with a discrete-set similarity match. In the training stage of the classifier model, content character features of each class of pre-provided sensitive table are analyzed by column, and a classifier is trained independently for each class of sensitive table using a classification algorithm; in the inference stage, the feature vector of each column is first extracted from the table file to be analyzed, the classification results are produced by all sensitive table classifiers, and the similarity between the classification results and the corresponding class of sensitive table is calculated; if a similarity is the highest value and is greater than the threshold of the corresponding class, the table is considered to belong to the corresponding sensitive class. The method has good generalization capability and does not depend on specific keyword content; besides accurately recognizing sensitive tables with the same homologous table structure, it also has high recognition capability in cases such as deleted column names, exchanged column order, and moderate addition or deletion of columns, and has good extensibility.
Fig. 6 is a flowchart of another data identification method provided in an embodiment of the present invention, as shown in fig. 6, where the data identification method is applied to a server, and the method includes:
the user needs to provide a sample file of the sensitive form for training, and according to category distinction, the sensitive form learning module counts the feature vectors of each column in various sensitive forms by reading form contents, and then trains to obtain a corresponding sensitive form classifier model.
The category may be a table of certain templates of the user's needs.
when the security product audits to the form class file, analyzing the form content and judging whether the form content is sensitive content or not, loading a sensitive form classifier model by a sensitive form identification module, reading in a form to be identified, analyzing the content of each column of the form to be identified by utilizing each classifier model in the sensitive form classifier model, and finally summarizing the analysis results of all classifier models to obtain a final judgment result; if the non-sensitive form is considered, a release operation is performed; if the form is considered to be sensitive, an alarm is given and a file blocking strategy is performed.
Fig. 7 is a schematic structural diagram of a data identification device according to an embodiment of the present invention; as shown in fig. 7, the apparatus includes: an acquisition unit, a processing unit and an identification unit; wherein,
the acquisition unit is used for acquiring a first table;
the processing unit is used for determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns;
the identification unit is used for identifying the first column of feature vector set by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column;
determining the similarity between the first table and various tables in at least one type of tables according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the form category corresponding to the first form.
In some embodiments, the apparatus further comprises: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically used for acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
In some embodiments, the preprocessing unit is configured to determine at least one column of corresponding feature vectors according to a sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In some embodiments, the identifying unit is configured to perform similar column merging on each column in the first column feature vector set to obtain a second column feature vector set; identifying the second column of feature vector sets to obtain a first analysis result vector;
The identification unit is further used for determining a first column correlation number, a second column correlation number and a third column correlation number; the first column correlation number represents the column number of the first table, the second column correlation number represents the column number of the corresponding class table, and the third column correlation number represents the common column number of the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column categories in each table; the fourth column correlation number characterizes the number of similar columns in the first table, wherein the classification result of the similar columns is the corresponding column class;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In some embodiments, the preset feature acquisition policy includes:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column of feature vectors according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character related features of corresponding columns in the table;
And obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
In some embodiments, the identifying unit is specifically configured to determine a similarity between the first table and at least one type of table;
determining the category of a table with similarity with the first table exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding the preset similarity threshold value, and obtaining a recognition result based on the sorting result.
It should be noted that: in the data recognition device provided in the above embodiment, only the division of each program module is used for illustration, and in practical application, the process allocation may be performed by different program modules according to needs, that is, the internal structure of the device is divided into different program modules, so as to complete all or part of the processes described above. In addition, the data recognition device and the data recognition method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the data recognition device and the data recognition method are detailed in the method embodiments and are not repeated herein.
Fig. 8 is a schematic structural diagram of another data identification apparatus according to an embodiment of the present invention. The apparatus 80 includes: a processor 801 and a memory 802 for storing a computer program capable of running on the processor; wherein the processor 801, when executing the computer program, performs: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first set of column feature vectors includes feature vectors for each of the at least one column; the feature vector comprises character features of corresponding columns; identifying the first column of feature vector sets by using a preset identification model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used for identifying a corresponding class table respectively; the first analysis result vector comprises analysis results of corresponding classifier models in each column of at least one column; determining the similarity between the first table and various tables in at least one type of tables according to the first analysis result vector, and determining the identification result according to the determined similarity; and the identification result represents the form category corresponding to the first form.
It should be noted that: the data recognition device and the data recognition method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein.
In practical applications, the apparatus 80 may further include at least one network interface 803. The various components in the data recognition device 80 are coupled together by a bus system 804, which enables communication among these components. In addition to a data bus, the bus system 804 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 804 in Fig. 8. The number of processors 801 may be at least one. The network interface 803 is used for wired or wireless communication between the data recognition apparatus 80 and other devices.
The memory 802 in embodiments of the present invention is used to store various types of data to support the operation of the data recognition device 80.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 801. The processor 801 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 801 or by instructions in the form of software. The processor 801 may be a general purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The processor 801 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. The general purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in the embodiments of the invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, which is located in the memory 802; the processor 801 reads the information in the memory 802 and completes the steps of the foregoing method in combination with its hardware.
In an exemplary embodiment, the data recognition device 80 may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components, for performing the aforementioned methods.
The embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy, where the first table includes at least one column, the first column feature vector set includes a feature vector for each of the at least one column, and each feature vector comprises character features of the corresponding column; identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector, where the recognition model includes at least one classifier model, each classifier model is used to identify a corresponding table category, and the first analysis result vector comprises the analysis result of the corresponding classifier model for each of the at least one column; and determining the similarity between the first table and each table category in at least one category according to the first analysis result vector, and determining a recognition result according to the determined similarity, the recognition result representing the table category corresponding to the first table.
When the computer program is executed by the processor, the corresponding flow implemented by the server in each method of the embodiments of the present invention is realized; this is not repeated here for brevity.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated in one processing unit, or each unit may exist separately as one unit, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments. The foregoing storage medium includes: a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.
Alternatively, if the above-described integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The foregoing storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk, an optical disk, or other media that can store program code.
The above description is not intended to limit the scope of the invention, but is intended to cover any modifications, equivalents, and improvements within the spirit and principles of the invention.
Claims (14)
1. A method of data identification, the method comprising:
acquiring a first table;
determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set includes a feature vector for each of the at least one column; the feature vector comprises character features of the corresponding column;
identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used to identify a corresponding category of table; the first analysis result vector comprises the analysis result of the corresponding classifier model for each of the at least one column;
determining the similarity between the first table and each category of table in at least one category of tables according to the first analysis result vector, and determining a recognition result according to the determined similarity; the recognition result represents the table category corresponding to the first table.
2. The method according to claim 1, wherein the method further comprises: training at least one classifier model; training the classifier model includes:
acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
3. The method according to claim 2, wherein the performing similar column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set includes:
determining the feature vector corresponding to each of at least one column according to the sample column feature vector set corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
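The similar-column merging of claims 2 and 3 can be illustrated with a simple distance-threshold clustering; the distance metric, the threshold, and the greedy assignment below are assumptions, and a practical system might use k-means or hierarchical clustering instead.

```python
# Minimal sketch of "similar column merging": assign each column's feature
# vector to the first existing cluster whose centroid is within max_distance,
# otherwise start a new cluster. Each cluster keeps its member columns and
# their feature vectors, matching the cluster structure described in claim 3.
import math

def cluster_columns(feature_vectors, max_distance=1.0):
    """feature_vectors: one feature vector per column, in column order."""
    clusters = []  # each cluster: list of (column_index, feature_vector)
    for idx, fv in enumerate(feature_vectors):
        for cluster in clusters:
            # Centroid of the cluster, dimension by dimension.
            centroid = [sum(d) / len(cluster)
                        for d in zip(*(v for _, v in cluster))]
            if math.dist(fv, centroid) <= max_distance:
                cluster.append((idx, fv))
                break
        else:
            clusters.append([(idx, fv)])
    return clusters
```

Two near-identical columns land in one cluster while a distant column forms its own, yielding the training data set of clusters.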
4. The method of claim 1, wherein the identifying the first column feature vector set using a preset recognition model to obtain a first analysis result vector comprises:
performing similar column merging on the columns in the first column feature vector set to obtain a second column feature vector set;
identifying the second column feature vector set to obtain the first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector includes:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns common to the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in the tables; the fourth column correlation number characterizes the number of similar columns in the first table whose classification result is the corresponding column category;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as a fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
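Claim 4 enumerates five column correlation numbers but leaves their combination to the description; the function below shows one plausible aggregation (a Jaccard term over the column sets plus a content-match term, with hypothetical equal weights), not the patented formula.

```python
# Illustrative combination of the five column correlation numbers into a
# similarity score in [0, 1]. The weighting and the exact form of each term
# are assumptions made for the sake of the example.

def table_similarity(n1, n2, n_common, n_matched, n_cluster):
    """
    n1:        columns in the first table                 (first number)
    n2:        columns in the candidate class table       (second number)
    n_common:  columns common to both tables              (third number)
    n_matched: first-table columns whose classification
               result hits a column category              (fourth number)
    n_cluster: candidate-class columns in the matched
               cluster                                    (fifth number)
    """
    structural = n_common / max(n1 + n2 - n_common, 1)  # Jaccard over columns
    content = (n_matched / max(n1, 1)) * (n_cluster / max(n2, 1))
    return 0.5 * structural + 0.5 * content
```

A perfect column match yields 1.0, no overlap yields 0.0, and partial matches fall strictly in between.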
5. The method according to claim 1 or 2, wherein the preset feature acquisition strategy comprises:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column feature vector according to the content values of the at least one column corresponding to the at least one row; the feature vector comprises character-related features of the corresponding column in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
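A minimal sketch of the preset feature acquisition strategy of claim 5, reading a row-major table and emitting one feature vector per column; the concrete character features (digit ratio, letter ratio, distinct-value ratio) are assumptions, since the claim only requires character-related features.

```python
# Determine the content values of each column across the rows, then extract
# one character-feature vector per column to form the column feature vector
# set of the table. The chosen features are illustrative assumptions.

def column_feature_vectors(rows):
    """rows: list of rows, each row a list of cell strings."""
    columns = list(zip(*rows))  # transpose: content values per column
    feature_set = []
    for column in columns:
        text = "".join(column)
        n = max(len(text), 1)
        feature_set.append([
            sum(c.isdigit() for c in text) / n,       # digit ratio
            sum(c.isalpha() for c in text) / n,       # letter ratio
            len(set(column)) / max(len(column), 1),   # distinct-value ratio
        ])
    return feature_set
```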
6. The method according to claim 1 or 4, wherein the determining the recognition result according to the determined similarity comprises:
determining the similarity between the first table and at least one type of table;
determining the categories of tables whose similarity with the first table exceeds a preset similarity threshold;
and sorting the categories of the tables whose determined similarity exceeds the preset similarity threshold, and obtaining a recognition result based on the sorting result.
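The decision step of claim 6 can be sketched as follows; the threshold value and the descending sort are assumptions consistent with, but not dictated by, the claim language.

```python
# Keep every category whose similarity exceeds the threshold, sort them
# best-first, and report the ranking as the recognition result.

def decide(similarities, threshold=0.6):
    """similarities: {category: score}. Returns categories sorted best-first."""
    passing = [(cat, s) for cat, s in similarities.items() if s > threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [cat for cat, _ in passing]
```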
7. A data recognition device, the device comprising: an acquisition unit, a processing unit and an identification unit; wherein
the acquisition unit is used for acquiring a first table;
the processing unit is used for determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set includes a feature vector for each of the at least one column; the feature vector comprises character features of the corresponding column;
the identification unit is used for identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model includes at least one classifier model; each classifier model of the at least one classifier model is used to identify a corresponding category of table; the first analysis result vector comprises the analysis result of the corresponding classifier model for each of the at least one column;
and determining the similarity between the first table and each category of table in at least one category of tables according to the first analysis result vector, and determining a recognition result according to the determined similarity; the recognition result represents the table category corresponding to the first table.
8. The apparatus of claim 7, wherein the apparatus further comprises: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically used for acquiring at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column combination according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set;
training according to the training data set and the labels corresponding to each column in the training data set to obtain a classifier model.
9. The apparatus of claim 8, wherein the preprocessing unit is configured to determine at least one column of corresponding feature vectors from a set of sample column feature vectors corresponding to each sample table;
clustering the at least one column of corresponding feature vectors to obtain at least one cluster serving as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
10. The apparatus of claim 7, wherein the identifying unit is configured to perform similar column merging on the columns in the first column feature vector set to obtain a second column feature vector set, and to identify the second column feature vector set to obtain a first analysis result vector;
the identification unit is further used for determining a first column correlation number, a second column correlation number and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns common to the first table and the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in the tables; the fourth column correlation number characterizes the number of similar columns in the first table whose classification result is the corresponding column category;
determining the number of columns included in the cluster corresponding to the classification result in the corresponding class table as the fifth column correlation number;
and determining the similarity of the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
11. The apparatus according to claim 7 or 8, wherein the preset feature acquisition policy comprises:
determining a content value of at least one column corresponding to at least one row in the table;
extracting at least one column feature vector according to the content values of the at least one column corresponding to the at least one row; the feature vector comprises character-related features of the corresponding column in the table;
and obtaining a column feature vector set corresponding to the table according to the at least one column feature vector.
12. The apparatus according to claim 7 or 10, wherein the identifying unit is specifically configured to determine a similarity between the first table and at least one type of table;
determining the categories of tables whose similarity with the first table exceeds a preset similarity threshold;
and sorting the categories of the tables whose determined similarity exceeds the preset similarity threshold, and obtaining a recognition result based on the sorting result.
13. A data recognition device, the device comprising: a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor being adapted to perform the steps of the method of any of claims 1 to 6 when the computer program is run.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664475.6A CN111931229B (en) | 2020-07-10 | 2020-07-10 | Data identification method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664475.6A CN111931229B (en) | 2020-07-10 | 2020-07-10 | Data identification method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111931229A (en) | 2020-11-13 |
CN111931229B (en) | 2023-07-11 |
Family
ID=73312419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010664475.6A Active CN111931229B (en) | 2020-07-10 | 2020-07-10 | Data identification method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111931229B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023157074A1 (en) * | 2022-02-15 | 2023-08-24 | 日本電気株式会社 | Teaching data generation assistance device, teaching data generation assistance system, teaching data generation method, and non-transitory computer-readable medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012006509A1 (en) * | 2010-07-09 | 2012-01-12 | Google Inc. | Table search using recovered semantic information |
CN108021565A (en) * | 2016-11-01 | 2018-05-11 | 中国移动通信有限公司研究院 | A kind of analysis method and device of the user satisfaction based on linguistic level |
CN109492640A (en) * | 2017-09-12 | 2019-03-19 | 中国移动通信有限公司研究院 | Licence plate recognition method, device and computer readable storage medium |
CN109635633A (en) * | 2018-10-26 | 2019-04-16 | 平安科技(深圳)有限公司 | Electronic device, bank slip recognition method and storage medium |
CN109710725A (en) * | 2018-12-13 | 2019-05-03 | 中国科学院信息工程研究所 | A kind of Chinese table column label restoration methods and system based on text classification |
CN110222171A (en) * | 2019-05-08 | 2019-09-10 | 新华三大数据技术有限公司 | A kind of application of disaggregated model, disaggregated model training method and device |
WO2019174130A1 (en) * | 2018-03-14 | 2019-09-19 | 平安科技(深圳)有限公司 | Bill recognition method, server, and computer readable storage medium |
CN110647795A (en) * | 2019-07-30 | 2020-01-03 | 正和智能网络科技(广州)有限公司 | Form recognition method |
CN111144282A (en) * | 2019-12-25 | 2020-05-12 | 北京同邦卓益科技有限公司 | Table recognition method and device, and computer-readable storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11593376B2 (en) * | 2015-10-09 | 2023-02-28 | Informatica Llc | Method, apparatus, and computer-readable medium to extract a referentially intact subset from a database |
US10331947B2 (en) * | 2017-04-26 | 2019-06-25 | International Business Machines Corporation | Automatic detection on string and column delimiters in tabular data files |
US11288297B2 (en) * | 2017-11-29 | 2022-03-29 | Oracle International Corporation | Explicit semantic analysis-based large-scale classification |
- 2020-07-10: application CN202010664475.6A filed in China (CN); granted as CN111931229B, status Active
Non-Patent Citations (1)
Title |
---|
Xie Jie, "Column Concept Determination for Chinese Web Tables via Convolutional Neural Network", full text *
Also Published As
Publication number | Publication date |
---|---|
CN111931229A (en) | 2020-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
De Stefano et al. | Reliable writer identification in medieval manuscripts through page layout features: The “Avila” Bible case | |
CN106156766B (en) | Method and device for generating text line classifier | |
US8315465B1 (en) | Effective feature classification in images | |
US20070065003A1 (en) | Real-time recognition of mixed source text | |
CN103761221B (en) | System and method for identifying sensitive text messages | |
Huang et al. | Isolated Handwritten Pashto Character Recognition Using a K‐NN Classification Tool based on Zoning and HOG Feature Extraction Techniques | |
WO2020164278A1 (en) | Image processing method and device, electronic equipment and readable storage medium | |
US20220277174A1 (en) | Evaluation method, non-transitory computer-readable storage medium, and information processing device | |
US11600088B2 (en) | Utilizing machine learning and image filtering techniques to detect and analyze handwritten text | |
Al-Maadeed | Text‐Dependent Writer Identification for Arabic Handwriting | |
CN110110325B (en) | Repeated case searching method and device and computer readable storage medium | |
CN111353491A (en) | Character direction determining method, device, equipment and storage medium | |
CN109408636A (en) | File classification method and device | |
CN113486664A (en) | Text data visualization analysis method, device, equipment and storage medium | |
Sah et al. | Text and non-text recognition using modified HOG descriptor | |
Lu et al. | Retrieval of machine-printed latin documents through word shape coding | |
CN115953123A (en) | Method, device and equipment for generating robot automation flow and storage medium | |
CN111931229B (en) | Data identification method, device and storage medium | |
Barnouti et al. | An efficient character recognition technique using K-nearest neighbor classifier | |
Ghanmi et al. | A new descriptor for pattern matching: application to identity document verification | |
CN113987243A (en) | Image file gathering method, image file gathering device and computer readable storage medium | |
CN115526173A (en) | Feature word extraction method and system based on computer information technology | |
CN113065010B (en) | Label image management method, apparatus, computer device and storage medium | |
CN115842645A (en) | UMAP-RF-based network attack traffic detection method and device and readable storage medium | |
Suresan et al. | Comparison of machine learning algorithms for smart license number plate detection system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||