CN109408555B - Data type identification method and device and data storage method and device

Info

Publication number: CN109408555B
Application number: CN201811096054.7A
Authority: CN
Inventors: 王海波; 李晓宇
Original assignee: Cognitive Computing Nanjing Information Technology Co ltd
Current assignee: Cognitive Computing Nanjing Information Technology Co ltd
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2022-11-11
Anticipated expiration: 2038-09-19
Also published as: CN109408555A

Abstract

The invention discloses a data type identification method and device and a data storage method and device. The data type identification method comprises the following steps: S1, acquiring column data to be identified, wherein the column data comprises a column header and data contents; S2, extracting features of the column data to obtain a feature vector, wherein the feature vector comprises column header features and data content features; and S3, inputting the feature vector into a pre-trained classification model to classify it, thereby completing the identification of the column data. The method obtains the feature vector from the column header and the data contents of the column data and inputs it into a pre-trained classification model, which classifies the column header and the data contents to obtain their semantic attributes and thus identifies the structured data type. The method is simple, convenient and efficient, requires no manual intervention, and greatly saves manpower and material resources.

Description

Data type identification method and device and data storage method and device

Technical Field

The invention relates to the technical field of data processing, in particular to a data type identification method and device and a data storage method and device.

Background

Structured data analysis is one of the important links in data mining. Structured data stored in files with formats such as Excel and CSV is difficult to analyze directly. An analyst usually performs complex analysis operations by means of a relational database or a graph database; that is, the analyst first stores the data from the file into the relational database or the graph database and then performs the analysis with other analysis frameworks. During warehousing, the analyst needs to map each column of data in the Excel, CSV or similar file to a field in the database.

At present, the field matching problem in the data warehousing process is generally handled in one of two modes. In the first, the analyst completes the mapping manually, which requires a great deal of manual intervention and is time-consuming, labor-intensive and inefficient. In the second, automatic mapping is achieved by means of a policy, in one of the following two ways: 1. the results of previous manual mappings are recorded, and the mapping is matched quickly if the current file column (usually identified by its column header) has been processed before; 2. the mapping is completed through strategies such as hard matching or regular-expression matching between the column header and the database field. Both ways are insufficiently flexible: when a column of similar data appears that has not been processed before, human intervention is still required.
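The limitation of the two policy-based ways can be illustrated with a minimal sketch. The history table, the regular-expression rule and the function name `policy_match` are hypothetical and are not taken from the prior art being described; the sketch only shows why a header that was never processed and matches no rule falls back to manual mapping.

```python
import re

# Hypothetical history of earlier manual mappings: column header -> database field.
HISTORY = {"mobile phone number": "phone_num"}

# Hypothetical regular-expression rules tried when the history has no entry.
REGEX_RULES = [(re.compile(r"phone|mobile", re.IGNORECASE), "phone_num")]


def policy_match(header):
    """Return a database field for the header, or None if no policy applies."""
    key = header.strip().lower()
    if key in HISTORY:                      # way 1: replay an earlier manual mapping
        return HISTORY[key]
    for pattern, field in REGEX_RULES:      # way 2: hard or regular-expression matching
        if pattern.search(key):
            return field
    return None                             # nothing matched: manual mapping is needed


print(policy_match("Mobile phone number"))  # phone_num
print(policy_match("calling number"))       # None
```

Here "calling number" carries the same kind of data as "mobile phone number", yet neither policy recognizes it.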

Disclosure of Invention

The invention aims to provide a data type identification method and device and a data storage method and device, and effectively solves the technical problems that the structured data type identification is not flexible enough and the efficiency is low in the prior art.

The technical scheme provided by the invention is as follows:

a data type identification method, comprising:

S1, acquiring column data to be identified, wherein the column data comprises a column header and data contents;

S2, extracting features of the column data to obtain a feature vector, wherein the feature vector comprises column header features and data content features;

and S3, inputting the feature vector into a pre-trained classification model to classify the feature vector, thereby completing the identification of the column data.

Further preferably, in step S2, the method includes:

S21, extracting the column header in the column data to obtain column header features;

S22, extracting a first preset feature of a single piece of data in the data contents;

S23, extracting second preset features for all of the data contents;

S24, splicing the column header features, the first preset feature and the second preset features to obtain the feature vector of the column data.

Further preferably, in step S21, the word embedding model is used to convert the column headers into feature vectors of preset dimensions;

and/or, in step S22, extracting character string length, format and constituent element characteristics of a single piece of data in the data content;

and/or, in step S23, the dispersion, continuity and variance features are extracted for all data contents.

Further preferably, before step S1, a step of training the classification model is further included, including:

S01, selecting a training corpus and carrying out a preprocessing operation on the training corpus;

S02, selecting a classification model;

S03, extracting training samples from the training corpus after the preprocessing operation;

S04, labeling the extracted training samples with classification categories;

and S05, inputting the training samples labeled with classification categories into the classification model, and training the classification model.

The invention also provides a data storage method, which comprises the data type identification method and further comprises the following steps:

S4, obtaining, according to the classification category output by the classification model, the semantic attribute to which that category belongs, wherein a mapping relation is prestored between the classification categories output by the classification model and the semantic attributes to which they belong;

and S5, matching the semantic attribute obtained for the column data with the semantic attributes of the database fields to complete the warehousing operation of the column data, wherein a mapping relation is prestored between the semantic attributes to which column data belong and the semantic attributes of the database fields.

The invention also provides a data type identification device, comprising:

the data acquisition module is used for acquiring column data to be identified, wherein the column data comprises a column header and data contents;

the feature extraction module is used for extracting features of the column data acquired by the data acquisition module to obtain a feature vector, wherein the feature vector comprises column header features and data content features;

and the data classification module is used for inputting the feature vectors extracted by the feature extraction module into a pre-trained classification model to classify the feature vectors so as to complete the identification of the column data.

Further preferably, the feature extraction module comprises:

the characteristic extraction unit is used for extracting column heads in the column data to obtain column head characteristics; extracting a first preset characteristic of single data in the data content; extracting second preset characteristics aiming at all data contents;

and the characteristic splicing unit is used for splicing the column head characteristic, the first preset characteristic and the second preset characteristic to obtain a characteristic vector of the column data.

Further preferably, in the feature extraction unit, the word embedding model is used to convert the column headers into feature vectors of preset dimensions; extracting character string length, format and constituent element characteristics of single data in data content; and extracting the features of dispersion, continuity and variance for all data contents.

Further preferably, the recognition device further comprises a training module, configured to train the classification model; the training module comprises:

the corpus preprocessing unit is used for selecting the training corpus and carrying out preprocessing operation on the training corpus;

the sample extraction unit is used for extracting training samples from the training corpus after the preprocessing operation;

the labeling unit is used for labeling the classification category of the extracted training sample;

and the training unit is used for inputting the training samples labeled with classification categories into the selected classification model to train the classification model.

The invention also provides a data storage device, which comprises the data type identification device and further comprises:

and the matching module is used for obtaining the semantic attribute of the classification type according to the classification type output by the classification model, matching the identification result of the data type identification device with the semantic attribute of the database field and finishing the warehousing operation of the column data, wherein a mapping relation is prestored between the classification type output by the classification model and the semantic attribute to which the column data belongs, and mapping relations are prestored between the semantic attribute to which the column data belongs and the semantic attribute of the database field and are stored in the storage module.

According to the data type identification method and device provided by the invention, the characteristic vector is obtained according to the column head and the data content of the column data, and is input into a pre-trained classification model to classify the column head and the data content to obtain the semantic attributes of the column head and the data content, so that the identification of the structured data type is completed, the method and device are simple and convenient, the efficiency is high, manual intervention is not needed, and manpower and material resources are greatly saved; in addition, corresponding classification models can be trained aiming at different application scenes, and the application is wide. In the structured data storage process, the mapping can be quickly, flexibly and accurately recommended only by establishing the mapping between the semantic attributes of the column data and the semantic attributes of the database fields.

Drawings

The above features, technical features, advantages and implementations of the data type identification method and device and the data storage method and device will be further described in the following detailed description of preferred embodiments, in a clearly understandable manner, with reference to the accompanying drawings.

FIG. 1 is a schematic flow chart of a data type identification method according to the present invention;

FIG. 2 is a schematic diagram of a training process of a classification model according to the present invention;

FIG. 3 is a schematic diagram of a data type identification device according to the present invention.

Description of reference numerals:

100-data type identification device, 110-data acquisition module, 120-feature extraction module and 130-data classification module.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, without inventive effort, other drawings and embodiments can be derived from them.

For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".

FIG. 1 is a schematic flow chart of the data type identification method provided by the present invention. As can be seen from the figure, the identification method includes: S1, acquiring column data to be identified, wherein the column data comprises a column header and data contents; S2, extracting features of the column data to obtain a feature vector, wherein the feature vector comprises column header features and data content features; and S3, inputting the feature vector into a pre-trained classification model to classify the feature vector, thereby completing the identification of the column data.

In this method, the column data refers to one column of data in a file with a format such as Excel or CSV, and usually consists of a column header followed by data contents; the column header is the first row of data in the file and describes the content of the current column. Word embedding is a type of representation of words in which words with similar meanings have similar representations, and is a general term for methods that map words to real-valued vectors. In the data type identification process, one column of data in a file is regarded as the analysis object, and the purpose is achieved in three stages: feature extraction, classification model training and data classification.

In the process of feature extraction, the column data is divided into a column header part and a data content part, so feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is typically a feature description of the column data, so here the column header is converted into a feature vector of a specified dimension by a word embedding model (e.g., Word2Vec, CBOW, Skip-Gram, etc.). For the data content part, a data sample is first acquired by a sampling technique; features such as character string length, format and constituent elements are then extracted from each single piece of data (one row) in the sample to obtain the first preset feature, and features such as dispersion, continuity and variance are extracted from the sample as a whole to obtain the second preset features. Finally, the column header features, the first preset feature and the second preset features are spliced to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).
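The feature extraction stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the hashed pseudo-embedding in `header_features` merely stands in for a trained word embedding model, the concrete definitions of dispersion, continuity and variance are one plausible reading, and averaging the per-row features over the sample is an assumption, since the description does not say how per-row features are aggregated. All function names are hypothetical.

```python
import hashlib
import random
import statistics


def header_features(header, dim=8):
    """Stand-in for a word embedding: a deterministic pseudo-random vector per header.
    A real system would look the header up in a trained Word2Vec-style model."""
    seed = int.from_bytes(hashlib.sha256(header.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def single_value_features(value):
    """First preset feature of one cell: length, format, constituent elements."""
    n = len(value) or 1
    digits = sum(c.isdigit() for c in value)
    letters = sum(c.isalpha() for c in value)
    return [
        float(len(value)),                     # character string length
        1.0 if value.isdigit() else 0.0,       # format: purely numeric or not
        digits / n,                            # constituent elements: digit share
        letters / n,                           # constituent elements: letter share
        (len(value) - digits - letters) / n,   # constituent elements: other symbols
    ]


def column_features(values):
    """Second preset features over all sampled cells (assumed definitions)."""
    lengths = [len(v) for v in values]
    dispersion = len(set(values)) / len(values)          # share of distinct values
    variance = float(statistics.pvariance(lengths))      # variance of string lengths
    numbers = sorted(int(v) for v in values if v.isascii() and v.isdigit())
    steps = [b - a for a, b in zip(numbers, numbers[1:])]
    continuity = sum(s == 1 for s in steps) / len(steps) if steps else 0.0
    return [dispersion, continuity, variance]


def feature_vector(header, values):
    """Splice header, per-row (averaged over the sample) and column-level features."""
    per_value = [single_value_features(v) for v in values]
    averaged = [statistics.fmean(col) for col in zip(*per_value)]
    return header_features(header) + averaged + column_features(values)


vec = feature_vector("mobile phone number", ["13800000001", "13800000002", "13800000003"])
print(len(vec))  # 16 (8 header + 5 per-row + 3 column-level)
```

The sketch expects a non-empty sample; an empty column would need separate handling.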

After feature extraction is finished, the feature vector is fed into the trained classification model for classification to obtain its classification result, thereby achieving automatic identification of the data type. The specific form of the classification model is not limited herein; any classification model can be used in the present invention, such as an SVM (support vector machine), a decision tree, a random forest, a neural network (deep learning), and the like.
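Since the form of the classification model is left open, the classification step can be illustrated with a deliberately simple stand-in. The nearest-centroid classifier below is not one of the models named in the description; it is an assumption chosen because it fits in a few lines of standard-library Python. The two-dimensional feature vectors and the class numbering are likewise hypothetical.

```python
import math


class NearestCentroidClassifier:
    """Tiny stand-in for an SVM, decision tree, random forest or neural network."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[label] = counts.get(label, 0) + 1
        # One centroid (mean feature vector) per class.
        self.centroids = {
            label: [x / counts[label] for x in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, vec):
        # The class whose centroid is closest in Euclidean distance wins.
        return min(self.centroids, key=lambda label: math.dist(vec, self.centroids[label]))


# Hypothetical feature vectors: [mean string length, digit share].
train_x = [[11.0, 1.0], [11.0, 1.0], [18.0, 0.95], [18.0, 1.0]]
train_y = [0, 0, 1, 1]   # 0 = mobile phone number, 1 = identity card number

model = NearestCentroidClassifier().fit(train_x, train_y)
print(model.predict([11.0, 0.98]))  # 0
```

In practice the full spliced feature vector would be used, and features with very different scales would typically be normalized before any distance-based model is applied.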

As shown in FIG. 2, the process of training the classification model includes: S01, selecting a training corpus and carrying out a preprocessing operation on the training corpus; S02, selecting a classification model; S03, extracting training samples from the training corpus after the preprocessing operation; S04, labeling the extracted training samples with classification categories; and S05, inputting the training samples labeled with classification categories into the classification model, and training the classification model.

Specifically, the word embedding model places high demands on the corpus; when the corpus is selected, articles covering the fields related to the data to be analyzed (the data in Excel and CSV files) should be chosen as far as possible. The corpus is then preprocessed according to the specific usage scenario, for example by deleting English text, deleting special symbols and converting between simplified and traditional characters, after which a word segmentation tool such as jieba or HanLP is used to segment the corpus into words.
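The preprocessing operations listed above can be sketched as follows. The three-entry traditional-to-simplified table and the character-level `segment` function are hypothetical placeholders: a real pipeline would use a complete conversion table and a dictionary-based segmenter such as jieba or HanLP, whose APIs are deliberately not reproduced here.

```python
import re

# Tiny hypothetical traditional-to-simplified table; a real system needs a full converter.
TRAD_TO_SIMP = str.maketrans({"號": "号", "碼": "码", "機": "机"})


def preprocess(text):
    """Delete English letters and special symbols, then unify character variants."""
    text = re.sub(r"[A-Za-z]+", "", text)      # delete English
    text = re.sub(r"[^\w\s]", "", text)        # delete special symbols (punctuation)
    text = text.translate(TRAD_TO_SIMP)        # traditional -> simplified
    return re.sub(r"\s+", " ", text).strip()


def segment(text):
    """Placeholder segmenter: one token per non-space character.
    A dictionary-based segmenter would return real words instead."""
    return [ch for ch in text if not ch.isspace()]


print(segment(preprocess("手機號碼 (phone)！")))  # ['手', '机', '号', '码']
```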

Before training, the required classification categories are determined according to the business scenario, and a corresponding number of classification results are mapped according to the number of categories; for example, if there are n categories, they are mapped to 0, 1, ..., (n-1). Training samples (specifically, the feature vectors obtained by feature extraction from the selected training corpus (column data)) are then extracted from the training corpus, and each training sample is labeled with its classification category. The label is the classification result mapped from the category; if the mapping uses numbers, the label is the corresponding number. The selected training samples should cover all classification categories, and the numbers of training samples per category should not differ too much and should be as balanced as possible.
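The mapping of n categories to the results 0, 1, ..., (n-1) and the coverage and balance requirement can be sketched as below. The category names, the function `check_balance` and the threshold `max_ratio` are hypothetical; the description only asks that class sizes "should not differ too much", so the factor of 2 is an illustrative assumption.

```python
from collections import Counter

# Hypothetical classification categories for one business scenario.
CATEGORIES = ["identity card number", "mobile phone number", "name"]

# Map the n categories to the classification results 0, 1, ..., n-1.
LABEL_OF = {name: index for index, name in enumerate(CATEGORIES)}


def check_balance(labels, n_classes, max_ratio=2.0):
    """True if every class is present and class sizes differ by at most max_ratio."""
    counts = Counter(labels)
    if set(counts) != set(range(n_classes)):
        return False                      # some category has no training sample
    return max(counts.values()) <= max_ratio * min(counts.values())


sample_categories = ["name", "name", "mobile phone number", "identity card number"]
labels = [LABEL_OF[c] for c in sample_categories]
print(labels)                                   # [2, 2, 1, 0]
print(check_balance(labels, len(CATEGORIES)))   # True
```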

As for the models, Word2Vec, as a popular word embedding model, has been integrated into various open source frameworks. The invention trains a Word2Vec model on the preprocessed corpus by means of the gensim open source framework. The classification algorithm selected may be an SVM, a decision tree, a random forest, a neural network, and the like. After supervised training on the selected training samples, the classification model can be used to identify column data types.

Based on the data type identification method, the invention also provides a data storage method which, in addition to the data type identification method, comprises the following steps: S4, obtaining, according to the classification category output by the classification model, the semantic attribute to which that category belongs, wherein a mapping relation is prestored between the classification categories output by the classification model and the semantic attributes to which they belong; and S5, matching the semantic attribute obtained for the column data with the semantic attributes of the database fields to complete the warehousing operation of the column data, wherein a mapping relation is prestored between the semantic attributes to which column data belong and the semantic attributes of the database fields.

In the method, a column header is the first row of data in a file and describes the content of the current column; it is one of various possible expressions of a semantic attribute. A semantic attribute describes the characteristics of a column of data and is a high-level description built on low-level features, such as an identity card number or a mobile phone number. Generally, structured data (including column data) has corresponding semantic attributes, and the fields of a database also have corresponding semantic attributes. Since the column header of the column data and the database field are each one expression of a semantic attribute, and the same semantic attribute can be expressed in multiple ways, it is difficult to complete the mapping by directly matching the column header with the database field. For example, the database field may be phone_num while the column header is "mobile phone number", "calling number" and the like. In this method, therefore, the mapping from column data to database field is completed by matching semantic attributes.

After the classification model outputs a classification result (corresponding to a certain classification category), the semantic attribute to which the result belongs is obtained by looking up the stored mapping between classification categories and the semantic attributes to which they belong. The mapping between the semantic attributes of column data and the semantic attributes of database fields is then looked up, that is, the semantic attribute is matched with a database field in the database, and the column data is stored at the corresponding position in the database. In other embodiments, during training of the classification model, the required semantic attributes (covering the database fields) are determined from the business scenario and mapped to a corresponding number of classification results; similarly, if there are n semantic attributes, they are mapped to 0, 1, ..., (n-1). After the feature vector of the column data is input into the classification model, the semantic attribute of the column data is obtained directly from the mapping between classification results and semantic attributes and is then matched with the semantic attributes of the database fields.
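The two prestored mapping relations of steps S4 and S5 amount to two successive lookups, sketched below. The table contents, the field name `id_card`, the in-memory dictionary that plays the role of the database, and the function names are all hypothetical; only `phone_num` appears in the description.

```python
# Hypothetical prestored mapping: classification result -> semantic attribute (S4).
CLASS_TO_ATTRIBUTE = {0: "identity card number", 1: "mobile phone number"}

# Hypothetical prestored mapping: semantic attribute -> database field (S5).
ATTRIBUTE_TO_FIELD = {"identity card number": "id_card", "mobile phone number": "phone_num"}


def resolve_field(class_id):
    """S4 then S5: look up the semantic attribute, then the matching database field."""
    attribute = CLASS_TO_ATTRIBUTE.get(class_id)
    if attribute is None:
        return None
    return ATTRIBUTE_TO_FIELD.get(attribute)


def store_column(table, class_id, values):
    """Append the column's values under the matched field of an in-memory 'table'."""
    field = resolve_field(class_id)
    if field is None:
        raise LookupError(f"no database field for class {class_id}")
    table.setdefault(field, []).extend(values)
    return field


table = {}
print(store_column(table, 1, ["13800000001", "13800000002"]))  # phone_num
```

Because the header text never enters these lookups, a column headed "calling number" and one headed "mobile phone number" reach the same field as long as the classifier assigns them the same category.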

FIG. 3 is a schematic diagram of the data type identification device 100 provided by the present invention. As can be seen from the figure, the data type identification device 100 includes a data acquisition module 110, a feature extraction module 120 and a data classification module 130, wherein the feature extraction module 120 is connected to the data acquisition module 110 and to the data classification module 130. In operation, the data acquisition module 110 first acquires the column data to be identified, wherein the column data comprises a column header and data contents; the feature extraction module 120 then extracts features of the column data acquired by the data acquisition module 110 to obtain a feature vector, wherein the feature vector comprises column header features and data content features; finally, the data classification module 130 inputs the feature vector extracted by the feature extraction module 120 into a pre-trained classification model to classify it, thereby completing the identification of the column data.
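The connection of modules 110, 120 and 130 can be pictured as a simple composition of three callables. The class name, the comma-separated input format and the toy module implementations are hypothetical and serve only to show the order in which data passes through the modules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Column = Tuple[str, List[str]]   # (column header, data contents)


@dataclass
class DataTypeIdentificationDevice:
    """Wires modules 110, 120 and 130 in the order described for FIG. 3."""
    acquire: Callable[[str], Column]                    # data acquisition module 110
    extract: Callable[[str, List[str]], List[float]]    # feature extraction module 120
    classify: Callable[[List[float]], int]              # data classification module 130

    def identify(self, source: str) -> int:
        header, values = self.acquire(source)
        return self.classify(self.extract(header, values))


# Hypothetical stand-ins for the three modules.
def acquire(source):
    header, *values = source.split(",")
    return header, values


def extract(header, values):
    return [sum(len(v) for v in values) / len(values)]   # mean string length only


def classify(vec):
    return 0 if vec[0] < 15 else 1                       # toy threshold rule


device = DataTypeIdentificationDevice(acquire, extract, classify)
print(device.identify("phone,13800000001,13800000002"))  # 0
```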

Specifically, the column data refers to one column of data in a file with a format such as Excel or CSV, and usually consists of a column header followed by data contents; the column header is the first row of data in the file and describes the content of the current column. Word embedding is a type of representation of words in which words with similar meanings have similar representations, and is a general term for methods that map words to real-valued vectors. In the data type identification process, one column of data in a file is regarded as the analysis object, and the purpose is achieved in three stages: feature extraction, classification model training and data classification.

Specifically, the feature extraction module 120 includes a feature extraction unit and a feature splicing unit. In the process of feature extraction, the column data is divided into a column header part and a data content part, so feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is generally a feature description of the column data, so here the feature extraction unit converts the column header into a feature vector of a specified dimension through a word embedding model (such as Word2Vec, CBOW, Skip-Gram, etc.). For the data content part, a data sample is first acquired by a sampling technique; the feature extraction unit then extracts features such as character string length, format and constituent elements from each single piece of data (one row) in the sample to obtain the first preset feature, and extracts features such as dispersion, continuity and variance from the sample as a whole to obtain the second preset features. Finally, the feature splicing unit splices the column header features, the first preset feature and the second preset features to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).

The training module comprises a corpus preprocessing unit, a sample extraction unit, a labeling unit and a training unit, wherein the sample extraction unit is connected with the corpus preprocessing unit, the labeling unit is connected with the sample extraction unit, and the training unit is connected with the labeling unit. In the process of training the classification model, the corpus preprocessing unit selects a training corpus and preprocesses it; the sample extraction unit extracts training samples from the preprocessed training corpus; the labeling unit then labels the extracted training samples with classification categories; and finally the training unit inputs the training samples labeled with classification categories into the selected classification model for training.

Specifically, the word embedding model places high demands on the corpus; when the corpus is selected, articles covering the fields related to the data to be analyzed (the data in Excel and CSV files) should be chosen as far as possible. The corpus preprocessing unit then preprocesses the corpus according to the specific usage scenario, for example by deleting English text, deleting special symbols and converting between simplified and traditional characters, after which a word segmentation tool such as jieba or HanLP is used to segment the corpus into words.

Before training, the required classification categories are determined according to the business scenario, and a corresponding number of classification results are mapped according to the number of categories; for example, if there are n categories, they are mapped to 0, 1, ..., (n-1). The sample extraction unit then extracts training samples (specifically, the feature vectors obtained by feature extraction from the selected training corpus (column data)) from the training corpus, and the labeling unit labels each training sample with its classification category. The label is the classification result mapped from the category; if the mapping uses numbers, the label is the corresponding number. The selected training samples should cover all classification categories, and the numbers of training samples per category should not differ too much and should be as balanced as possible.

As for the models, Word2Vec, as a popular word embedding model, has been integrated into various open source frameworks. The invention trains a Word2Vec model on the preprocessed corpus by means of the gensim open source framework. The classification algorithm selected may be an SVM, a decision tree, a random forest, a neural network, and the like. After the training unit has performed supervised training of the classification model on the selected training samples, the classification model can be used to identify column data types.

Based on this, the present invention further provides a data warehousing device, which comprises the data type identification device, and further comprises: and the matching module is used for obtaining the semantic attributes of the classification categories output by the classification model according to the classification categories, matching the identification result of the data type identification device with the semantic attributes of the database fields and finishing the warehousing operation of the column data, wherein a mapping relation is pre-stored between the classification categories output by the classification model and the semantic attributes of the column data, and the mapping relation is pre-stored between the semantic attributes of the column data and the semantic attributes of the database fields and is stored in the storage module.

In the data storage device, after a classification model outputs a classification result (corresponding to a certain classification type), the semantic attribute of the classification result is obtained by searching the mapping relation between the stored classification category and the semantic attribute to which the classification category belongs; and then, further searching the mapping relation between the semantic attribute of the column data and the semantic attribute of the database field, namely matching the semantic attribute with the database field in the database, and storing the column data into the corresponding position in the database.

It should be noted that the above embodiments can be freely combined as necessary. The above is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, a plurality of modifications and embellishments can be made without departing from the principle of the present invention, and these modifications and embellishments should also be regarded as the protection scope of the present invention.

Claims

1. A data warehousing method is characterized by comprising the following steps:

S1, acquiring column data to be identified, wherein the column data comprises a column header and data contents;

S3, inputting the feature vectors into a pre-trained classification model to classify the feature vectors, and completing the identification of column data;

2. The data warehousing method of claim 1, characterized in that in step S2, it comprises:

S22, extracting a first preset feature of a single piece of data in the data contents;

S23, extracting second preset features for all of the data contents;

3. The data warehousing method of claim 2,

in step S21, a word embedding model is used to convert the column headers into feature vectors of preset dimensions;

4. A method as claimed in any one of claims 1 to 3, further comprising, before step S1, the step of training a classification model, including:

S02, selecting a classification model;

S04, labeling the extracted training samples with classification categories;

5. A data warehousing apparatus, characterized in that the data warehousing apparatus comprises:

the feature extraction module is used for extracting features of the column data acquired by the data acquisition module to obtain a feature vector, wherein the feature vector comprises column header features and data content features;

the data classification module is used for inputting the feature vectors extracted by the feature extraction module into a pre-trained classification model to classify the feature vectors so as to finish the identification of the column data;

and the matching module is used for obtaining the semantic attributes of the classification categories output by the classification model according to the classification categories, matching the identification result of the data type identification device with the semantic attributes of the database fields and finishing the warehousing operation of the column data, wherein a mapping relation is prestored between the classification categories output by the classification model and the semantic attributes of the column data, and the mapping relation is prestored between the semantic attributes of the column data and the semantic attributes of the database fields and is stored in the storage module.

6. The data warehousing device of claim 5, characterized in that the feature extraction module comprises:

the characteristic extraction unit is used for extracting a column head in the column data to obtain column head characteristics; extracting a first preset characteristic of single data in the data content; extracting a second preset characteristic aiming at all data contents;

7. The data warehousing device of claim 6,

in a feature extraction unit, converting the column headers into feature vectors with preset dimensions by using a word embedding model; extracting character string length, format and constituent element characteristics of single data in data content; and extracting the features of dispersion, continuity and variance for all data contents.

8. The data warehousing device of any of claims 5-7, wherein the recognition device further comprises a training module for training a classification model; the training module comprises: