
CN109408555A - Data type recognition method and device, data storage method and device

Info

Publication number: CN109408555A
Application number: CN201811096054.7A
Authority: CN
Inventors: 王海波; 李晓宇
Original assignee: Yunnan Smartq Beijing Mdt Infotech Ltd
Current assignee: Yunnan Smartq Beijing Mdt Infotech Ltd
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2019-03-01
Anticipated expiration: 2038-09-19
Also published as: CN109408555B

Abstract

The invention discloses a data type recognition method and device and a data storage method and device. The data type recognition method comprises: S1, obtaining column data to be identified, the column data including a column header and data content; S2, extracting features of the column data to obtain a feature vector, the feature vector including a column header feature and data content features; and S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing the identification of the column data. The method derives a feature vector from the column header and data content of the column data and inputs it into a pre-trained classification model, which classifies it to yield the semantic attribute to which the column data belongs, completing the identification of the structured data type. The method is simple, convenient, and efficient, requires no manual intervention, and greatly saves manpower and material resources.

Description

Data type recognition method and device, data storage method and device

Technical field

The present invention relates to the technical field of data processing, and in particular to a data type recognition method and device and a data storage method and device.

Background

Structured data analysis is an important part of data mining. Structured data stored in formatted files such as Excel and csv files is difficult to analyze directly. Analysts usually carry out complex analysis with the help of a relational database or a graph database: the data in the file is first stored in the relational or graph database, and the analysis is then completed with other analysis frameworks. During storage, the analyst needs to map each column of data in the Excel, csv, or similar file to a field in the database.

At present, the field-matching problem during data storage is usually handled in one of two ways. In the first, analysts complete the mapping manually, which requires a great deal of manual intervention and is time-consuming, laborious, and inefficient. In the second, automatic mapping is achieved through strategies, which can be implemented in the following two ways: 1. recording the results of earlier manual mappings, so that if a column of the current file (usually identified by its column header) has been processed before, the mapping is matched quickly; 2. completing the mapping through strategies such as hard (exact) matching or regular-expression matching between the column header and the database field. Both approaches are inflexible: when a column of similar data appears that has not been processed before, manual intervention is still required.
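The two strategy-based approaches can be illustrated with a minimal sketch. It is not taken from any prior-art system; `HISTORY`, `FIELD_PATTERNS`, and `match_by_strategy` are hypothetical names chosen for illustration, and the example headers are assumptions.

```python
import re

# Hypothetical record of earlier manual mappings: column header -> database field.
HISTORY = {"phone number": "phone_num"}

# Hypothetical regular-expression rules, one per database field.
FIELD_PATTERNS = {"phone_num": re.compile(r"phone")}


def match_by_strategy(header):
    """Return a database field for the header, or None if manual mapping is needed."""
    # Strategy 1: reuse a mapping recorded for an identical header.
    if header in HISTORY:
        return HISTORY[header]
    # Strategy 2: regular-expression matching against the header text.
    for field, pattern in FIELD_PATTERNS.items():
        if pattern.search(header):
            return field
    # Neither strategy applies: the analyst has to step in.
    return None
```

A header such as "caller number" carries the same meaning but matches neither the history nor the pattern, so the sketch returns `None`, which is the inflexibility the background describes.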

Summary of the invention

The object of the present invention is to provide a data type recognition method and device and a data storage method and device that effectively solve the technical problems of inflexible and inefficient structured data type identification in the prior art.

The technical solution provided by the invention is as follows:

A data type recognition method, comprising:

S1, obtaining column data to be identified, the column data including a column header and data content;

S2, extracting features of the column data to obtain a feature vector, the feature vector including a column header feature and data content features;

S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing the identification of the column data.

Further preferably, step S2 comprises:

S21, extracting the column header from the column data to obtain the column header feature;

S22, extracting a first preset feature from individual data items in the data content;

S23, extracting a second preset feature from all of the data content;

S24, splicing the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data.

Further preferably, in step S21, a word embedding model is used to convert the column header into a feature vector of a preset dimension;

and/or, in step S22, string length, format, and constituent element features are extracted from individual data items in the data content;

and/or, in step S23, dispersion, continuity, and variance features are extracted from all of the data content.
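As an illustration of the per-item features named for step S22, the following sketch computes string length, two format flags, and constituent-element ratios for one cell. The specific format patterns and the function name `single_value_features` are assumptions; the text does not fix which formats or elements are used.

```python
import re


def single_value_features(value):
    """First preset features for one cell: length, format flags, element ratios."""
    text = str(value)
    n = len(text)
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    others = n - digits - letters
    denom = n or 1  # avoid division by zero for empty cells
    return [
        float(n),                                                   # string length
        1.0 if re.fullmatch(r"\d+", text) else 0.0,                 # format: digits only
        1.0 if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) else 0.0,   # format: date-like
        digits / denom,                                             # share of digits
        letters / denom,                                            # share of letters
        others / denom,                                             # share of other symbols
    ]
```

Each cell yields a vector of the same length, so values from different columns can be compared on the same scale.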

Further preferably, before step S1, the method further comprises a step of training the classification model, comprising:

S01, selecting a training corpus and preprocessing it;

S02, selecting a classification model;

S03, extracting training samples from the preprocessed training corpus;

S04, labeling the extracted training samples with classification categories;

S05, inputting the labeled training samples into the classification model to train it.

The present invention also provides a data storage method, which includes the above data type recognition method and further comprises:

S4, obtaining the semantic attribute to which the column data belongs according to the classification category output by the classification model, a mapping relationship between the classification categories output by the classification model and the semantic attributes to which they belong being stored in advance;

S5, matching the obtained semantic attribute of the column data with the semantic attributes of the database fields to complete the storage operation on the column data, a mapping relationship between the semantic attribute of the column data output by the classification model and the semantic attributes of the database fields being stored in advance.

The present invention also provides a data type identification device, comprising:

a data acquisition module, configured to obtain column data to be identified, the column data including a column header and data content;

a feature extraction module, configured to extract features of the column data obtained by the data acquisition module to obtain a feature vector, the feature vector including a column header feature and data content features;

a data classification module, configured to input the feature vector extracted by the feature extraction module into a pre-trained classification model for classification, thereby completing the identification of the column data.

Further preferably, the feature extraction module includes:

a feature extraction unit, configured to extract the column header from the column data to obtain the column header feature, to extract a first preset feature from individual data items in the data content, and to extract a second preset feature from all of the data content;

a feature splicing unit, configured to splice the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data.

Further preferably, the feature extraction unit uses a word embedding model to convert the column header into a feature vector of a preset dimension, extracts string length, format, and constituent element features from individual data items in the data content, and extracts dispersion, continuity, and variance features from all of the data content.

Further preferably, the identification device further includes a training module configured to train the classification model; the training module includes:

a corpus preprocessing unit, configured to select a training corpus and preprocess it;

a sample extraction unit, configured to extract training samples from the preprocessed training corpus;

a labeling unit, configured to label the extracted training samples with classification categories;

a training unit, configured to input the labeled training samples into the selected classification model to train it.

The present invention also provides a data storage device, which includes the above data type identification device and further comprises:

a matching module, configured to obtain the semantic attribute to which the column data belongs according to the classification category output by the classification model, and to match the recognition result of the data type identification device with the semantic attributes of the database fields to complete the storage operation on the column data, wherein a mapping relationship between the classification categories output by the classification model and the semantic attributes to which they belong, and a mapping relationship between the semantic attribute of the column data and the semantic attributes of the database fields, are stored in advance in a storage module.

In the data type recognition method and device provided by the invention, a feature vector is obtained from the column header and data content of the column data and input into a pre-trained classification model for classification, yielding the semantic attribute to which the column data belongs and completing the identification of the structured data type. The approach is simple, convenient, and efficient, requires no manual intervention, and greatly saves manpower and material resources. In addition, a corresponding classification model can be trained for each application scenario, so the approach is widely applicable. When structured data is stored, it is only necessary to establish a mapping between the semantic attribute of the column data and the semantic attributes of the database fields to achieve fast, flexible, and accurate mapping recommendation.

Brief description of the drawings

The above characteristics, technical features, advantages, and implementations of the data type recognition method and device and the data storage method and device are further described below in a clear and understandable manner through preferred embodiments with reference to the drawings.

Fig. 1 is a flow diagram of the data type recognition method of the present invention;

Fig. 2 is a flow diagram of classification model training in the present invention;

Fig. 3 is a schematic diagram of the data type identification device of the present invention.

Description of reference signs:

100 - data type identification device, 110 - data acquisition module, 120 - feature extraction module, 130 - data classification module.

Detailed description

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, specific embodiments of the present invention are described below with reference to the drawings. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings, and other embodiments, from these drawings without creative effort.

For simplicity, each figure only schematically shows the parts related to the present invention; the figures do not represent the actual structure of a product. In addition, to keep the figures simple and easy to understand, where several components in a figure have the same structure or function, only one of them is schematically depicted or labeled. Herein, "one" means not only "only one" but also "more than one".

Fig. 1 is a flow diagram of the data type recognition method provided by the invention. As can be seen from the figure, the recognition method comprises: S1, obtaining column data to be identified, the column data including a column header and data content; S2, extracting features of the column data to obtain a feature vector, the feature vector including a column header feature and data content features; and S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing the identification of the column data.

In this method, column data refers to a column of data in a file in a format such as Excel or csv. Its usual data format is column header + data content, where the column header is the first row of data in the file and describes the content of the current column. Word embedding is a type of word representation in which words with similar meanings have similar representations; it is the general term for methods that map vocabulary to real-valued vectors. In the data type identification process, a column of data in the file is treated as one analysis object, and the goal is achieved in three stages: feature extraction, classification model training, and data classification.
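Step S1 for a csv file can be sketched as follows, assuming the first row holds the column headers as described above. The function name `read_columns` and the handling of short rows are assumptions made for the example, not details from the text.

```python
import csv
import io


def read_columns(csv_text):
    """Split CSV text into columns, each as (header, list of cell values)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    headers, body = rows[0], rows[1:]
    columns = []
    for index, header in enumerate(headers):
        # Assumption: rows shorter than the header row contribute an empty cell.
        values = [row[index] if index < len(row) else "" for row in body]
        columns.append((header, values))
    return columns
```

Each returned pair is one analysis object in the sense used above: a column header plus the data content beneath it.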

Because column data consists of two parts, the column header and the data content, feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is generally a description of the features of the column data, so a word embedding model (such as Word2Vec, CBoW, or the Skip-Gram model) is used here to convert the column header into a feature vector of a specified dimension. For the data content, data samples are first obtained by sampling. Features such as string length, format, and constituent elements are then extracted from each individual data item (one row) in the sample to obtain the first preset feature, and features such as dispersion, continuity, and variance are extracted from all of the sampled data to obtain the second preset feature. Finally, the column header feature, the first preset feature, and the second preset feature are spliced to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).
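The splicing of the three feature groups can be sketched as below. Several parts are assumptions and are labeled as such: `header_vector` hashes character bigrams as a dependency-free stand-in for a trained word embedding, per-row features are averaged over the sample to give a fixed-length vector, and the concrete definitions of dispersion and continuity are illustrative choices, since the text names these features without defining them.

```python
import random
import statistics
import zlib


def header_vector(header, dim=8):
    """Stand-in for a word embedding: hash character bigrams into `dim` buckets."""
    vec = [0.0] * dim
    padded = f" {header} "
    for i in range(len(padded) - 1):
        bucket = zlib.crc32(padded[i:i + 2].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def row_features(value):
    """First preset features for one cell: string length and digit ratio."""
    digits = sum(ch.isdigit() for ch in value)
    return [float(len(value)), digits / (len(value) or 1)]


def column_level_features(values):
    """Second preset features: dispersion, continuity, variance (assumed definitions)."""
    if not values:
        return [0.0, 0.0, 0.0]
    dispersion = len(set(values)) / len(values)  # share of distinct values
    numbers = sorted(int(v) for v in values if v.isascii() and v.isdigit())
    steps = [b - a for a, b in zip(numbers, numbers[1:])]
    # Continuity: share of adjacent sorted integers that differ by exactly 1.
    continuity = sum(s == 1 for s in steps) / len(steps) if steps else 0.0
    variance = float(statistics.pvariance([len(v) for v in values]))
    return [dispersion, continuity, variance]


def column_feature_vector(header, values, sample_size=100, seed=0):
    """Splice header feature, averaged row features, and column-level features."""
    if len(values) > sample_size:
        sample = random.Random(seed).sample(values, sample_size)
    else:
        sample = list(values)
    rows = [row_features(v) for v in sample]
    # Assumption: per-row features are averaged so the vector has a fixed length.
    first = [sum(col) / len(rows) for col in zip(*rows)] if rows else [0.0, 0.0]
    return header_vector(header) + first + column_level_features(sample)
```

In a real system the hashed header vector would presumably be replaced by the output of the trained word embedding model; the splicing step itself is plain list concatenation.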

After feature extraction is complete, the feature vector is fed into the trained classification model for classification to obtain the classification result to which it belongs, thereby achieving automatic identification of the data type. The specific form of the classification model is not limited here; any model that achieves the purpose of the invention falls within the scope of the present invention. For example, classification models such as an SVM (support vector machine), a decision tree, a random forest, or a neural network (deep learning) may be used.
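The classification step can be illustrated with a deliberately simple model. The sketch below uses a nearest-centroid classifier, which is a stand-in chosen for brevity and is not one of the models named in the text; the `model` dictionary of class id to centroid is a hypothetical representation of a pre-trained model.

```python
import math


def classify(model, feature_vector):
    """Return the class id whose centroid is closest to the feature vector.

    `model` maps class id -> centroid (a list of floats of the same length
    as `feature_vector`); it stands in for any pre-trained classifier.
    """
    return min(model, key=lambda class_id: math.dist(model[class_id], feature_vector))
```

Any classifier exposing the same "feature vector in, class id out" behavior, such as an SVM or random forest, could take its place.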

As shown in Fig. 2, training the classification model comprises: S01, selecting a training corpus and preprocessing it; S02, selecting a classification model; S03, extracting training samples from the preprocessed training corpus; S04, labeling the extracted training samples with classification categories; and S05, inputting the labeled training samples into the classification model to train it.

Specifically, because the word embedding model places high demands on the training corpus, the corpus should, as far as possible, be chosen from articles that cover the fields to which the data to be analyzed (the data in Excel or csv files) relates. The corpus is then preprocessed according to the specific usage scenario, for example by deleting English text, deleting special characters, and converting between simplified and traditional Chinese characters; a word segmentation tool such as jieba or HanLP is then used to segment the corpus.
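Two of the preprocessing operations mentioned above, deleting English text and deleting special characters, can be sketched with regular expressions. The function name `clean_corpus_line` and the exact character classes are assumptions; simplified/traditional conversion and word segmentation are not shown because they rely on external tools such as those named in the text.

```python
import re


def clean_corpus_line(line):
    """Remove English words and special characters from one corpus line."""
    text = re.sub(r"[A-Za-z]+", " ", line)      # delete English words
    text = re.sub(r"[^\w\s]|_", " ", text)      # delete punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse the leftover whitespace
```

The cleaned line would then be passed to a segmenter before being used to train the word embedding model.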

Before training, the required classification categories are determined according to the business scenario, and a corresponding number of classification results is mapped according to the number of categories; for example, if there are n categories, they are mapped to 0, 1, ..., (n-1). Training samples are then extracted from the training corpus (specifically, the feature vectors obtained by feature extraction from the selected training corpus (column data)), and each training sample is labeled with its classification category. The label is the classification result mapped from the category; if the mapping is to numbers, the label is the corresponding number. The selected training samples should cover all classification categories, and the numbers of training samples for the categories should not differ greatly and should be as even as possible.
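The category numbering and the coverage and balance requirements can be sketched as follows. The category names, the helper names, and the `max_ratio` threshold are all assumptions; the text only says the sample counts "should not differ greatly".

```python
from collections import Counter


def build_label_map(categories):
    """Map n category names to the class ids 0, 1, ..., n-1."""
    return {name: index for index, name in enumerate(categories)}


def label_samples(samples, label_map):
    """Turn (feature_vector, category_name) pairs into (feature_vector, class_id)."""
    return [(vector, label_map[name]) for vector, name in samples]


def is_balanced(labeled, n_classes, max_ratio=2.0):
    """Check that every class is covered and that class sizes are comparable."""
    counts = Counter(class_id for _, class_id in labeled)
    if len(counts) < n_classes:
        return False  # at least one category has no training sample
    # Hypothetical threshold: the largest class is at most max_ratio times the smallest.
    return max(counts.values()) <= max_ratio * min(counts.values())
```

A check like `is_balanced` could be run before training to flag a sample set that leaves a category out or over-represents one.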

As for the models, Word2vec is a popular word embedding model that has been integrated into a variety of open-source frameworks. The present invention trains a word2vec model on the preprocessed corpus using the open-source gensim framework. The classification algorithm selected may be an SVM, a decision tree, a random forest, a neural network, or the like. After supervised training on the selected training samples, the classification model can be used to identify column data types.
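Supervised training on labeled feature vectors can be illustrated with the simplest possible learner. The sketch below fits a nearest-centroid model; this is a stand-in for the SVM, decision tree, random forest, or neural network options named above, chosen only because it needs no external library. `train_centroids` is a hypothetical name.

```python
from collections import defaultdict


def train_centroids(labeled_samples):
    """Fit a nearest-centroid model from (feature_vector, class_id) pairs.

    Returns a dict mapping each class id to the mean of its feature vectors.
    """
    groups = defaultdict(list)
    for vector, class_id in labeled_samples:
        groups[class_id].append(vector)
    return {
        class_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for class_id, vectors in groups.items()
    }
```

The returned dictionary plays the role of the trained classification model: a new feature vector is assigned to the class whose centroid is nearest.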

Based on the above data type recognition method, the present invention also provides a data storage method. In addition to the above data type recognition method, this method further comprises: S4, obtaining the semantic attribute to which the column data belongs according to the classification category output by the classification model, a mapping relationship between the classification categories output by the classification model and the semantic attributes to which they belong being stored in advance; and S5, matching the obtained semantic attribute of the column data with the semantic attributes of the database fields to complete the storage operation on the column data, a mapping relationship between the semantic attribute of the column data output by the classification model and the semantic attributes of the database fields being stored in advance.

In this method, the column header is the first row of data in the file; it describes the content of the current column and is one of several ways of expressing a semantic attribute. A semantic attribute describes the features of a column of data; it is a high-level description built on low-level features, such as identity card number or mobile phone number. In general, all structured data (including column data) has a corresponding semantic attribute, and the database fields in a database likewise have corresponding semantic attributes. Because the column header of column data and a database field are each one expression of a semantic attribute, and the same semantic attribute can be expressed in many ways, it is difficult to complete the mapping by matching column headers directly against database fields. For example, the database field may be phone_num while the column header is phone number, cell-phone number, caller number, and so on. This method therefore completes the mapping from column data to database fields by matching semantic attributes.
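The role of the semantic attribute as the common key between a column and a database field can be shown with a small data-structure sketch. The class names, the `same_meaning` helper, and the attribute strings are hypothetical; only the field name phone_num and the header variants come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    header: str               # how the file names the column
    semantic_attribute: str   # what the column actually contains


@dataclass(frozen=True)
class DatabaseField:
    name: str                 # how the database names the field
    semantic_attribute: str   # what the field is meant to hold


def same_meaning(column, field):
    """Match on semantic attribute rather than on the literal names."""
    return column.semantic_attribute == field.semantic_attribute
```

With this structure, columns headed "phone number", "cell-phone number", and "caller number" all match the field phone_num as long as they share its semantic attribute, even though none of the header strings equals the field name.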

After the classification model outputs a classification result (corresponding to a certain classification category), the stored mapping between classification categories and semantic attributes is looked up to obtain the semantic attribute of the classification result. The mapping between the semantic attribute of the column data and the semantic attributes of the database fields is then looked up, that is, the column data is matched with a database field in the database and stored at the corresponding position in the database. In other embodiments, when the classification model is trained, a corresponding number of classification results is mapped from the semantic attributes required by the business scenario (covering the database fields); similarly, if there are n semantic attributes, they are mapped to 0, 1, ..., (n-1). After the feature vector of the column data is input into the classification model, the semantic attribute of the column data is obtained directly from the mapping between classification results and semantic attributes and is then matched with the semantic attributes of the database fields.
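The two pre-stored lookups of steps S4 and S5 can be sketched with plain dictionaries. The table contents, the function names, and the use of an in-memory dictionary as the "database" are assumptions made to keep the example self-contained.

```python
# Hypothetical pre-stored mapping: class id output by the model -> semantic attribute.
CLASS_TO_ATTRIBUTE = {0: "identity card number", 1: "mobile phone number"}

# Hypothetical pre-stored mapping: semantic attribute -> database field.
ATTRIBUTE_TO_FIELD = {"identity card number": "id_card", "mobile phone number": "phone_num"}


def target_field(class_id):
    """S4 then S5: class id -> semantic attribute -> database field (or None)."""
    attribute = CLASS_TO_ATTRIBUTE.get(class_id)
    return ATTRIBUTE_TO_FIELD.get(attribute)


def store_column(table, class_id, values):
    """Append the column's values under the matched field of an in-memory table."""
    field = target_field(class_id)
    if field is None:
        raise KeyError(f"no database field mapped for class id {class_id}")
    table.setdefault(field, []).extend(values)
    return field
```

In the alternative embodiment described above, where the classes are numbered directly from the semantic attributes, the first table reduces to a list of attribute names indexed by class id.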

Fig. 3 is a schematic diagram of the data type identification device 100 provided by the invention. As can be seen from the figure, the data type identification device 100 includes a data acquisition module 110, a feature extraction module 120, and a data classification module 130, where the feature extraction module 120 is connected to the data acquisition module 110 and to the data classification module 130. In operation, the data acquisition module 110 first obtains the column data to be identified, the column data including a column header and data content. The feature extraction module 120 then extracts features of the column data obtained by the data acquisition module 110 to obtain a feature vector, the feature vector including a column header feature and data content features. Finally, the data classification module 130 inputs the feature vector extracted by the feature extraction module 120 into the pre-trained classification model for classification, completing the identification of the column data.
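The module wiring of device 100 can be sketched as three small classes composed by a fourth. This is only a structural illustration: the class and method names are hypothetical, the feature extraction is reduced to two toy features, and the model is any callable supplied by the caller.

```python
class DataAcquisitionModule:
    """Corresponds to module 110: obtains the column header and data content."""

    def acquire(self, header, values):
        return {"header": header, "values": list(values)}


class FeatureExtractionModule:
    """Corresponds to module 120: turns column data into a feature vector."""

    def extract(self, column):
        values = column["values"]
        mean_length = sum(len(v) for v in values) / len(values) if values else 0.0
        # Toy features for illustration: header length and mean cell length.
        return [float(len(column["header"])), mean_length]


class DataClassificationModule:
    """Corresponds to module 130: applies a pre-trained model to the vector."""

    def __init__(self, model):
        self.model = model  # any callable: feature vector -> class id

    def classify(self, vector):
        return self.model(vector)


class DataTypeIdentificationDevice:
    """Corresponds to device 100: chains modules 110 -> 120 -> 130."""

    def __init__(self, model):
        self.acquisition = DataAcquisitionModule()
        self.extraction = FeatureExtractionModule()
        self.classification = DataClassificationModule(model)

    def identify(self, header, values):
        column = self.acquisition.acquire(header, values)
        vector = self.extraction.extract(column)
        return self.classification.classify(vector)
```

For example, `DataTypeIdentificationDevice(lambda v: 1 if v[1] >= 11 else 0)` uses a hand-written rule in place of a trained model and labels columns whose cells average at least 11 characters as class 1.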

Specifically, column data refers to a column of data in a file in a format such as Excel or csv. Its usual data format is column header + data content, where the column header is the first row of data in the file and describes the content of the current column. Word embedding is a type of word representation in which words with similar meanings have similar representations; it is the general term for methods that map vocabulary to real-valued vectors. In the data type identification process, a column of data in the file is treated as one analysis object, and the goal is achieved in three stages: feature extraction, classification model training, and data classification.

Specifically, the feature extraction module 120 includes a feature extraction unit and a feature splicing unit. Because column data consists of two parts, the column header and the data content, feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is generally a description of the features of the column data, so the feature extraction unit uses a word embedding model (such as Word2Vec, CBoW, or the Skip-Gram model) to convert the column header into a feature vector of a specified dimension. For the data content, data samples are first obtained by sampling. The feature extraction unit then extracts features such as string length, format, and constituent elements from each individual data item (one row) in the sample to obtain the first preset feature, and extracts features such as dispersion, continuity, and variance from all of the sampled data to obtain the second preset feature. Finally, the feature splicing unit splices the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).

The training module includes a corpus preprocessing unit, a sample extraction unit, a labeling unit, and a training unit, where the sample extraction unit is connected to the corpus preprocessing unit, the labeling unit is connected to the sample extraction unit, and the training unit is connected to the labeling unit. When the classification model is trained, the corpus preprocessing unit selects a training corpus and preprocesses it; the sample extraction unit extracts training samples from the preprocessed training corpus; the labeling unit then labels the extracted training samples with classification categories; finally, the training unit inputs the labeled training samples into the selected classification model to train it.

Specifically, because the word embedding model places high demands on the training corpus, the corpus should, as far as possible, be chosen from articles that cover the fields to which the data to be analyzed (the data in Excel or csv files) relates. The corpus preprocessing unit then preprocesses the corpus according to the specific usage scenario, for example by deleting English text, deleting special characters, and converting between simplified and traditional Chinese characters; a word segmentation tool such as jieba or HanLP is then used to segment the corpus.

Before training, the required classification categories are determined according to the business scenario, and a corresponding number of classification results is mapped according to the number of categories; for example, if there are n categories, they are mapped to 0, 1, ..., (n-1). The sample extraction unit then extracts training samples from the training corpus (specifically, the feature vectors obtained by feature extraction from the selected training corpus (column data)), and the labeling unit labels each training sample with its classification category. The label is the classification result mapped from the category; if the mapping is to numbers, the label is the corresponding number. The selected training samples should cover all classification categories, and the numbers of training samples for the categories should not differ greatly and should be as even as possible.

As for the models, Word2vec is a popular word embedding model that has been integrated into a variety of open-source frameworks. The present invention trains a word2vec model on the preprocessed corpus using the open-source gensim framework. The classification algorithm selected may be an SVM, a decision tree, a random forest, a neural network, or the like. After the training unit has performed supervised training of the classification model on the selected training samples, the model can be used to identify column data types.

On this basis, the present invention also provides a data storage device which, in addition to the above data type identification device, further includes a matching module. The matching module is configured to obtain the semantic attribute to which the column data belongs according to the classification category output by the classification model, and to match the recognition result of the data type identification device with the semantic attributes of the database fields to complete the storage operation on the column data. A mapping relationship between the classification categories output by the classification model and the semantic attributes to which they belong, and a mapping relationship between the semantic attribute of the column data and the semantic attributes of the database fields, are stored in advance in a storage module.

In the data storage device, after the classification model outputs a classification result (corresponding to a certain classification category), the stored mapping between classification categories and semantic attributes is looked up to obtain the semantic attribute of the classification result. The mapping between the semantic attribute of the column data and the semantic attributes of the database fields is then looked up, that is, the column data is matched with a database field in the database and stored at the corresponding position in the database.

It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims

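1. A data type recognition method, characterized in that the recognition method comprises:

S1, obtaining column data to be identified, the column data including a column header and data content;

S2, extracting features of the column data to obtain a feature vector, the feature vector including a column header feature and data content features;

S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing the identification of the column data.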

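2. The recognition method according to claim 1, characterized in that step S2 comprises:

S21, extracting the column header from the column data to obtain the column header feature;

S22, extracting a first preset feature from individual data items in the data content;

S23, extracting a second preset feature from all of the data content;

S24, splicing the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data.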

3. The recognition method according to claim 2, characterized in that:

in step S21, a word embedding model is used to convert the column header into a feature vector of a preset dimension;

and/or, in step S22, string length, format, and constituent element features are extracted from individual data items in the data content;

and/or, in step S23, dispersion, continuity, and variance features are extracted from all of the data content.

4. The recognition method according to any one of claims 1-3, characterized in that, before step S1, the method further comprises a step of training the classification model, comprising:

S01, selecting a training corpus and preprocessing it;

S02, selecting a classification model;

S03, extracting training samples from the preprocessed training corpus;

S04, labeling the extracted training samples with classification categories;

S05, inputting the labeled training samples into the classification model to train it.

5. A data storage method, characterized in that the data storage method includes the data type recognition method according to any one of claims 1-4 and further comprises:

S4, obtaining the semantic attribute to which the column data belongs according to the classification category output by the classification model, a mapping relationship between the classification categories output by the classification model and the semantic attributes to which they belong being stored in advance;

S5, matching the obtained semantic attribute of the column data with the semantic attributes of the database fields to complete the storage operation on the column data, a mapping relationship between the semantic attribute of the column data output by the classification model and the semantic attributes of the database fields being stored in advance.

6. A data type identification device, characterized in that the identification device comprises:

a data acquisition module, configured to obtain column data to be identified, the column data including a column header and data content;

a feature extraction module, configured to extract features of the column data obtained by the data acquisition module to obtain a feature vector, the feature vector including a column header feature and data content features;

a data classification module, configured to input the feature vector extracted by the feature extraction module into a pre-trained classification model for classification, thereby completing the identification of the column data.

7. The identification device according to claim 6, characterized in that the feature extraction module includes:

a feature extraction unit, configured to extract the column header from the column data to obtain the column header feature, to extract a first preset feature from individual data items in the data content, and to extract a second preset feature from all of the data content;

a feature splicing unit, configured to splice the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data.

8. The identification device according to claim 7, characterized in that:

the feature extraction unit uses a word embedding model to convert the column header into a feature vector of a preset dimension, extracts string length, format, and constituent element features from individual data items in the data content, and extracts dispersion, continuity, and variance features from all of the data content.

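9. The identification device according to any one of claims 6-8, characterized in that the identification device further includes a training module configured to train the classification model; the training module includes:

a corpus preprocessing unit, configured to select a training corpus and preprocess it;

a sample extraction unit, configured to extract training samples from the preprocessed training corpus;

a labeling unit, configured to label the extracted training samples with classification categories;

a training unit, configured to input the labeled training samples into the selected classification model to train it.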

10. A data storage device, characterized in that the data storage device includes the data type identification device according to any one of claims 6-8 and further comprises:

a matching module, configured to obtain the semantic attribute to which the column data belongs according to the classification category output by the classification model, and to match the recognition result of the data type identification device with the semantic attributes of the database fields to complete the storage operation on the column data, wherein a mapping relationship between the classification categories output by the classification model and the semantic attributes to which they belong, and a mapping relationship between the semantic attribute of the column data and the semantic attributes of the database fields, are stored in advance in a storage module.