CN109408555A - Data type recognition methods and device, data storage method and device - Google Patents
- Publication number
- CN109408555A, CN201811096054.7A
- Authority
- CN
- China
- Prior art keywords
- data
- feature
- column
- training
- disaggregated model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a data type identification method and device, and a data storage method and device. The data type identification method comprises: S1, obtaining column data to be identified, the column data including a column header and data content; S2, extracting features of the column data to obtain a feature vector, the feature vector including column header features and data content features; S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing identification of the column data. A feature vector is obtained from the column header and data content of the column data and input into the pre-trained classification model, which classifies it and yields the semantic attribute to which the column belongs, completing identification of the structured data type. The method is simple, convenient, and efficient, requires no manual intervention, and greatly saves manpower and material resources.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a data type identification method and device, and a data storage method and device.
Background art
Structured data analysis is an important part of data mining. Structured data stored in files of formats such as Excel and csv is difficult to analyze directly. Analysts usually carry out complex analysis operations by means of a relational database or a graph database; that is, the data in the file must first be stored in the relational or graph database, and the analysis is then completed with other analysis frameworks. During storage, the analyst needs to map each column of data in the Excel, csv, or similar file to a field in the database.
Currently, there are generally two approaches to the field matching problem during data storage. In the first, the analyst completes the mapping manually, which requires a great deal of manual intervention, is time-consuming and laborious, and is inefficient. The second achieves automatic mapping through strategies, which can be implemented in the following two ways: 1. recording the results of previous manual mappings, so that if a column of the current file (usually identified by its column header) has been processed before, the mapping is matched quickly; 2. completing the mapping through strategies such as hard (exact) matching or regular-expression matching between the column header and the database field. Both ways are inflexible: when a column of similar data appears that has not been processed before, manual intervention is still required.
Summary of the invention
The object of the present invention is to provide a data type identification method and device, and a data storage method and device, which effectively solve the technical problems that structured data type identification in the prior art is inflexible and inefficient.
The technical solution provided by the invention is as follows:
A data type identification method, comprising:
S1: obtaining column data to be identified, the column data including a column header and data content;
S2: extracting features of the column data to obtain a feature vector, the feature vector including column header features and data content features;
S3: inputting the feature vector into a pre-trained classification model for classification, thereby completing identification of the column data.
Further preferably, step S2 comprises:
S21: extracting the column header from the column data to obtain the column header feature;
S22: extracting a first preset feature from individual data items in the data content;
S23: extracting a second preset feature over all of the data content;
S24: concatenating the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data.
Further preferably, in step S21, the column header is converted into a feature vector of preset dimension using a word embedding model;
and/or, in step S22, string length, format, and constituent element features are extracted from individual data items in the data content;
and/or, in step S23, discreteness, continuity, and variance features are extracted over all of the data content.
Further preferably, before step S1, the method further includes a step of training the classification model, comprising:
S01: selecting a training corpus and preprocessing it;
S02: selecting a classification model;
S03: extracting training samples from the preprocessed training corpus;
S04: labeling the extracted training samples with classification categories;
S05: inputting the training samples labeled with classification categories into the classification model to train it.
The present invention also provides a data storage method, which includes the above data type identification method and further comprises:
S4: obtaining the semantic attribute of the column data from the classification category output by the classification model, mapping relations between the classification categories output by the classification model and the semantic attributes to which they belong being stored in advance;
S5: matching the obtained semantic attribute of the column data with the semantic attributes of database fields to complete the operation of storing the column data in the database, mapping relations between the semantic attributes of column data output via the classification model and the semantic attributes of database fields being stored in advance.
The present invention also provides a data type identification device, comprising:
a data acquisition module, configured to obtain column data to be identified, the column data including a column header and data content;
a feature extraction module, configured to extract features of the column data obtained by the data acquisition module to obtain a feature vector, the feature vector including column header features and data content features;
a data classification module, configured to input the feature vector extracted by the feature extraction module into a pre-trained classification model for classification, thereby completing identification of the column data.
Further preferably, the feature extraction module includes:
a feature extraction unit, configured to extract the column header from the column data to obtain the column header feature, to extract a first preset feature from individual data items in the data content, and to extract a second preset feature over all of the data content;
a feature concatenation unit, configured to concatenate the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data.
Further preferably, in the feature extraction unit, the column header is converted into a feature vector of preset dimension using a word embedding model; string length, format, and constituent element features are extracted from individual data items in the data content; and discreteness, continuity, and variance features are extracted over all of the data content.
Further preferably, the identification device further includes a training module configured to train the classification model; the training module includes:
a corpus preprocessing unit, configured to select a training corpus and preprocess it;
a sample extraction unit, configured to extract training samples from the preprocessed training corpus;
a labeling unit, configured to label the extracted training samples with classification categories;
a training unit, configured to input the training samples labeled with classification categories into the selected classification model to train it.
The present invention also provides a data storage device, which includes the above data type identification device and further comprises:
a matching module, configured to obtain the semantic attribute of the column data from the classification category output by the classification model, and to match the recognition result of the data type identification device with the semantic attributes of database fields to complete the operation of storing the column data in the database, wherein mapping relations between the classification categories output by the classification model and the semantic attributes to which they belong, and mapping relations between the semantic attributes of column data and the semantic attributes of database fields, are stored in advance in a storage module.
In the data type identification method and device provided by the invention, a feature vector is obtained from the column header and data content of the column data and input into a pre-trained classification model, which classifies it and yields the semantic attribute to which the column belongs, completing identification of the structured data type. This is simple, convenient, and efficient, requires no manual intervention, and greatly saves manpower and material resources. In addition, a corresponding classification model can be trained for each application scenario, so the approach is widely applicable. When storing structured data, it is only necessary to establish a mapping between the semantic attribute of the column data and the semantic attributes of the database fields in order to obtain fast, flexible, and accurate mapping recommendations.
Brief description of the drawings
Preferred embodiments are described below in a clear and easily understood manner with reference to the drawings, to further explain the above characteristics, technical features, advantages, and implementations of the data type identification method and device and the data storage method and device.
Fig. 1 is a flow diagram of the data type identification method of the present invention;
Fig. 2 is a flow diagram of classification model training in the present invention;
Fig. 3 is a schematic diagram of the data type identification device of the present invention.
Reference numerals:
100 - data type identification device, 110 - data acquisition module, 120 - feature extraction module, 130 - data classification module.
Detailed description of the embodiments
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, specific embodiments of the invention are described below with reference to the drawings. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can, without creative effort, obtain other drawings and other embodiments from them.
For simplicity, each figure only schematically shows the parts relevant to the invention; the figures do not represent the actual structure of a product. In addition, to keep the figures simple and easy to understand, where several components in a figure have the same structure or function, only one of them is schematically depicted or labeled. Herein, "a" or "one" does not only mean "only one" but can also mean "more than one".
Fig. 1 shows a flow diagram of the data type identification method provided by the invention. As can be seen from the figure, the identification method includes: S1, obtaining column data to be identified, the column data including a column header and data content; S2, extracting features of the column data to obtain a feature vector, the feature vector including column header features and data content features; S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing identification of the column data.
In this method, column data refers to a column of data in a file of a format such as Excel or csv. Its data format is normally column header + data content, where the column header is the first row of data in the file and describes the content of the current column. Word embedding is a kind of representation of words in which words with similar meanings have similar representations; it is a general term for methods that map vocabulary to real-valued vectors. During data type identification, a column of data in the file is treated as one analysis object, and the goal is achieved in three phases: feature extraction, classification model training, and data classification.
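The "column header + data content" layout described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the names `ColumnData` and `read_columns` and the sample text are assumptions made for the example and do not appear in the patent.

```python
import csv
import io
from typing import List, NamedTuple


class ColumnData(NamedTuple):
    """One analysis object: a column header plus the column's data content."""
    header: str        # first-row cell that describes the column
    values: List[str]  # remaining cells of the column


def read_columns(csv_text: str) -> List[ColumnData]:
    """Split csv text into columns, treating the first row as column headers."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header_row, body = rows[0], rows[1:]
    columns = []
    for i, header in enumerate(header_row):
        # Skip rows that are too short to contain this column.
        values = [row[i] for row in body if i < len(row)]
        columns.append(ColumnData(header.strip(), values))
    return columns


# Hypothetical sample file content.
sample = "phone number,name\n13800000000,Zhang San\n13900000001,Li Si\n"
columns = read_columns(sample)
```

Each `ColumnData` instance then corresponds to one column to be identified in step S1.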
During feature extraction, because column data is divided into a column header and data content, feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is generally a description of the features of the column data, so a word embedding model (such as Word2Vec, CBoW, or Skip-Gram Model) is used here to convert the column header into a feature vector of specified dimension. For the data content, data samples are first obtained by sampling; string length, format, constituent elements, and similar features are then extracted from each individual data item (one row) in the sample to obtain the first preset feature, and discreteness, continuity, variance, and similar features are extracted over all sampled items to obtain the second preset feature. Finally, the column header feature, the first preset feature, and the second preset feature are concatenated to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).
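The feature extraction steps above could look roughly like the following sketch. Everything concrete in it is an assumption made for illustration: the toy embedding table stands in for a trained word embedding model, per-item features are aggregated by their mean, discreteness is taken as the ratio of distinct values, and continuity as the share of unit steps between sorted numeric values. The patent does not fix any of these definitions.

```python
import re
import statistics
from typing import Dict, List

# Hypothetical stand-in for a trained word embedding model: token -> vector.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "phone": [1.0, 0.0],
    "number": [0.0, 1.0],
}
EMBED_DIM = 2


def header_features(header: str) -> List[float]:
    """Average the vectors of known header tokens (zeros if none are known)."""
    vecs = [TOY_EMBEDDINGS[t] for t in header.lower().split() if t in TOY_EMBEDDINGS]
    if not vecs:
        return [0.0] * EMBED_DIM
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(EMBED_DIM)]


def value_features(value: str) -> List[float]:
    """Per-item features: string length, a format flag, constituent elements."""
    n = len(value) or 1
    return [
        float(len(value)),
        1.0 if re.fullmatch(r"\d+", value) else 0.0,  # format: digits only
        sum(c.isdigit() for c in value) / n,           # share of digits
        sum(c.isalpha() for c in value) / n,           # share of letters
    ]


def first_preset_feature(sample: List[str]) -> List[float]:
    """Mean of the per-item features over the sampled rows (assumed aggregation)."""
    feats = [value_features(v) for v in sample]
    return [statistics.mean(col) for col in zip(*feats)]


def second_preset_feature(sample: List[str]) -> List[float]:
    """Column-level features: discreteness, continuity, variance (assumed definitions)."""
    distinct_ratio = len(set(sample)) / len(sample)
    nums = sorted(float(v) for v in sample if re.fullmatch(r"-?\d+(\.\d+)?", v))
    if len(nums) >= 2:
        steps = [b - a for a, b in zip(nums, nums[1:])]
        continuity = sum(1 for s in steps if s == 1) / len(steps)
        variance = statistics.pvariance(nums)
    else:
        continuity, variance = 0.0, 0.0
    return [distinct_ratio, continuity, variance]


def column_feature_vector(header: str, sample: List[str]) -> List[float]:
    """Concatenate header, first preset, and second preset features."""
    if not sample:
        raise ValueError("sample must not be empty")
    return header_features(header) + first_preset_feature(sample) + second_preset_feature(sample)


vec = column_feature_vector("phone number", ["101", "102", "103"])
```

Here `vec` has nine components: two for the header, four per-item features, and three column-level features.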
After feature extraction is complete, the feature vector is fed into the trained classification model for classification to obtain the classification result to which it belongs, thereby achieving automatic data type identification. The specific form of the classification model is not limited here; any model that achieves the purpose of the invention falls within the scope of the invention. For example, classification models such as svm (support vector machine), decision tree, random forest, and neural network (deep learning) can be used.
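Since the form of the classifier is left open, the train-then-classify flow can be illustrated with any supervised model. The sketch below uses a nearest-centroid classifier written in plain Python purely as a simple stand-in; it is not one of the models named in the text, and the class name and the toy feature vectors are assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


class NearestCentroidClassifier:
    """Assigns a feature vector to the class whose mean vector is closest."""

    def __init__(self) -> None:
        self.centroids: Dict[int, List[float]] = {}

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestCentroidClassifier":
        # Group the labeled training vectors by class.
        groups = defaultdict(list)
        for vec, label in zip(X, y):
            groups[label].append(vec)
        # The centroid of a class is the component-wise mean of its vectors.
        self.centroids = {
            label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()
        }
        return self

    def predict(self, vec: Sequence[float]) -> int:
        # Pick the class with the smallest Euclidean distance to the input.
        return min(self.centroids, key=lambda label: math.dist(vec, self.centroids[label]))


# Toy feature vectors (string length, digit share) with hypothetical labels.
X_train = [[11.0, 1.0], [11.0, 0.9], [3.0, 0.0], [2.0, 0.1]]
y_train = [0, 0, 1, 1]
model = NearestCentroidClassifier().fit(X_train, y_train)
predicted = model.predict([10.0, 1.0])
```

In practice the same `fit` and `predict` roles would be filled by whichever model is chosen, such as a support vector machine or random forest.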
As shown in Fig. 2, training the classification model includes: S01, selecting a training corpus and preprocessing it; S02, selecting a classification model; S03, extracting training samples from the preprocessed training corpus; S04, labeling the extracted training samples with classification categories; S05, inputting the training samples labeled with classification categories into the classification model to train it.
Specifically, because the word embedding model places high demands on the training corpus, the training corpus should be selected, as far as possible, from articles that cover the fields involved in the data to be analyzed (the data in Excel and csv files). The corpus is then preprocessed according to the specific usage scenario, for example by deleting English, deleting special characters, and converting between simplified and traditional Chinese characters, after which a word segmentation algorithm such as jieba or HanLP is selected to segment the corpus.
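The preprocessing operations listed above can be sketched with standard-library tools. In this sketch the traditional-to-simplified table is a tiny illustrative subset (a real pipeline would use a complete conversion table or library), and the tokenizer is passed in as a parameter so that a segmenter such as `jieba.lcut` could be supplied; the function names are assumptions made for the example.

```python
import re
from typing import Callable, List

# Tiny illustrative traditional -> simplified table (assumption: a real system
# would use a complete conversion table rather than these four characters).
T2S = str.maketrans({"機": "机", "號": "号", "碼": "码", "話": "话"})


def clean_text(text: str) -> str:
    """Apply the preprocessing steps named in the text to one document."""
    text = text.translate(T2S)             # traditional -> simplified conversion
    text = re.sub(r"[A-Za-z]+", "", text)  # delete English words
    text = re.sub(r"[^\w\s]", "", text)    # delete punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()


def preprocess(corpus: List[str], tokenize: Callable[[str], List[str]]) -> List[List[str]]:
    """Clean each document, then segment it with the supplied tokenizer."""
    return [tokenize(clean_text(doc)) for doc in corpus]


# str.split is only a stand-in tokenizer for this example.
tokens = preprocess(["手機號碼 phone: 138-0000!"], str.split)
```

The output is a list of token lists, which is the usual input shape for training a word embedding model.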
Before training, the required classification categories are determined according to the business scenario, and a corresponding number of classification results is mapped according to the number of classification categories; for example, assuming there are n categories, the categories are mapped to 0, 1, ..., (n-1). Training samples are then extracted from the training corpus (specifically, the feature vectors obtained by feature extraction from the selected training corpus, i.e. column data), and each training sample is labeled with a classification category. The labeled content is the classification result mapped from the category; if the mapping is numeric, the label is the corresponding number. The selected training samples should cover all classification categories, and the numbers of training samples for the different categories should not differ greatly; they should be as evenly distributed as possible.
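The category numbering and the requirement for roughly even class sizes can be expressed as a short sketch. The category names and the `max_ratio` threshold are hypothetical choices made for this example; the text only states that all categories should be covered and that sample counts should not differ greatly.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_label_map(categories: Sequence[str]) -> Dict[str, int]:
    """Map n category names to the integers 0, 1, ..., n-1."""
    return {name: idx for idx, name in enumerate(categories)}


def check_balance(labels: List[int], n_classes: int, max_ratio: float = 2.0) -> bool:
    """Return True if every class is present and sizes differ by at most max_ratio."""
    counts = Counter(labels)
    if any(counts[c] == 0 for c in range(n_classes)):
        return False  # some category has no training sample at all
    return max(counts.values()) / min(counts.values()) <= max_ratio


# Hypothetical categories for illustration.
label_map = build_label_map(["id_card_number", "mobile_number", "name"])
balanced = check_balance([0, 0, 1, 1, 2, 2], n_classes=3)
```

A check of this kind could be run before training to flag a sample set that needs resampling.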
As for the models, Word2vec, a popular word embedding model, has been integrated into a variety of open-source frameworks. The present invention trains on the preprocessed corpus with the word2vec model through the gensim open-source framework. The selected classification algorithm may be svm, decision tree, random forest, neural network, and so on. After the classification model has been trained in a supervised manner on the selected training samples, it can be used to identify the type of column data.
Based on the above data type identification method, the present invention also provides a data storage method. In addition to the above data type identification method, the method further includes: S4, obtaining the semantic attribute of the column data from the classification category output by the classification model, mapping relations between the classification categories output by the classification model and the semantic attributes to which they belong being stored in advance; S5, matching the obtained semantic attribute of the column data with the semantic attributes of database fields to complete the operation of storing the column data in the database, mapping relations between the semantic attributes of column data output via the classification model and the semantic attributes of database fields being stored in advance.
In this method, the column header is the first row of data in the file and describes the content of the current column; it is one of the different ways of expressing a semantic attribute. A semantic attribute describes the characteristics of a column of data and is a high-level description built on low-level features, for example ID card number or mobile phone number. In general, all structured data (including column data) has a corresponding semantic attribute, and each database field likewise has its corresponding semantic attribute. Because the column header of the column data and the database field are both expressions of a semantic attribute, and the same semantic attribute can be expressed in many ways, it is difficult to complete the mapping by directly matching column headers against database fields. For example, the database field may be phone_num while the column header is "mobile phone number", "cell-phone number", "caller number", and so on. This method therefore maps column data to database fields by matching semantic attributes.
After the classification model outputs a classification result (corresponding to a certain classification category), the semantic attribute of the classification result is obtained by looking up the stored mapping relations between classification categories and semantic attributes. The mapping relations between the semantic attribute of the column data and the semantic attributes of database fields are then looked up, that is, the column is matched against the fields in the database, and the column data is stored in the corresponding position in the database. In other embodiments, when training the classification model, a corresponding number of classification results is mapped from the semantic attributes required by the business scenario (covering the database fields); similarly, assuming there are n semantic attributes, they are mapped to classes 0, 1, ..., (n-1). After the feature vector of the column data is input into the classification model, the semantic attribute of the column data is obtained directly from the mapping between classification results and semantic attributes, and is then matched with the semantic attributes of the database fields.
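The two-stage lookup described above (classification result to semantic attribute, then semantic attribute to database field) can be sketched with two dictionaries. The field name `phone_num` comes from the example in the text; the other names and the class numbering are hypothetical.

```python
from typing import Dict, Optional

# Pre-stored mapping: classification result -> semantic attribute (hypothetical values).
CLASS_TO_SEMANTIC: Dict[int, str] = {
    0: "mobile_number",
    1: "id_card_number",
}

# Pre-stored mapping: semantic attribute -> database field.
SEMANTIC_TO_FIELD: Dict[str, str] = {
    "mobile_number": "phone_num",
    "id_card_number": "id_card",  # hypothetical field name
}


def map_column_to_field(class_id: int) -> Optional[str]:
    """Resolve a classifier output to a database field, or None if unmapped."""
    semantic = CLASS_TO_SEMANTIC.get(class_id)
    if semantic is None:
        return None
    return SEMANTIC_TO_FIELD.get(semantic)


field = map_column_to_field(0)
```

Because the match goes through the semantic attribute, columns headed "mobile phone number" or "caller number" would resolve to the same field as long as the classifier assigns them the same class.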
Fig. 3 shows a schematic diagram of the data type identification device 100 provided by the invention. As can be seen from the figure, the data type identification device 100 includes a data acquisition module 110, a feature extraction module 120, and a data classification module 130, where the feature extraction module 120 is connected to the data acquisition module 110 and to the data classification module 130. In operation, the data acquisition module 110 first obtains column data to be identified, the column data including a column header and data content. The feature extraction module 120 then extracts features of the column data obtained by the data acquisition module 110 to obtain a feature vector, the feature vector including column header features and data content features. Finally, the data classification module 130 inputs the feature vector extracted by the feature extraction module 120 into a pre-trained classification model for classification, completing identification of the column data.
Specifically, column data refers to a column of data in a file of a format such as Excel or csv. Its data format is normally column header + data content, where the column header is the first row of data in the file and describes the content of the current column. Word embedding is a kind of representation of words in which words with similar meanings have similar representations; it is a general term for methods that map vocabulary to real-valued vectors. During data type identification, a column of data in the file is treated as one analysis object, and the goal is achieved in three phases: feature extraction, classification model training, and data classification.
Specifically, the feature extraction module 120 includes a feature extraction unit and a feature concatenation unit. During feature extraction, because column data is divided into a column header and data content, feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is generally a description of the features of the column data, so the feature extraction unit uses a word embedding model (such as Word2Vec, CBoW, or Skip-Gram Model) to convert the column header into a feature vector of specified dimension. For the data content, data samples are first obtained by sampling; the feature extraction unit then extracts string length, format, constituent elements, and similar features from each individual data item (one row) in the sample to obtain the first preset feature, and extracts discreteness, continuity, variance, and similar features over all sampled items to obtain the second preset feature. Finally, the feature concatenation unit concatenates the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).
After feature extraction is complete, the feature vector is fed into the trained classification model for classification to obtain the classification result to which it belongs, thereby achieving automatic data type identification. The specific form of the classification model is not limited here; any model that achieves the purpose of the invention falls within the scope of the invention. For example, classification models such as svm (support vector machine), decision tree, random forest, and neural network (deep learning) can be used.
The training module includes a corpus preprocessing unit, a sample extraction unit, a labeling unit, and a training unit, where the sample extraction unit is connected to the corpus preprocessing unit, the labeling unit is connected to the sample extraction unit, and the training unit is connected to the labeling unit. When training the classification model, the corpus preprocessing unit selects a training corpus and preprocesses it; the sample extraction unit then extracts training samples from the preprocessed training corpus; next, the labeling unit labels the extracted training samples with classification categories; finally, the training unit inputs the training samples labeled with classification categories into the selected classification model to train it.
Specifically, because the word embedding model places high demands on the training corpus, the training corpus should be selected, as far as possible, from articles that cover the fields involved in the data to be analyzed (the data in Excel and csv files). The corpus preprocessing unit then preprocesses the corpus according to the specific usage scenario, for example by deleting English, deleting special characters, and converting between simplified and traditional Chinese characters, after which a word segmentation algorithm such as jieba or HanLP is selected to segment the corpus.
Before training, the required classification categories are determined according to the business scenario, and a corresponding number of classification results is mapped according to the number of classification categories; for example, assuming there are n categories, the categories are mapped to 0, 1, ..., (n-1). The sample extraction unit then extracts training samples from the training corpus (specifically, the feature vectors obtained by feature extraction from the selected training corpus, i.e. column data), and the labeling unit labels each training sample with a classification category. The labeled content is the classification result mapped from the category; if the mapping is numeric, the label is the corresponding number. The selected training samples should cover all classification categories, and the numbers of training samples for the different categories should not differ greatly; they should be as evenly distributed as possible.
As for the models, Word2vec, a popular word embedding model, has been integrated into a variety of open-source frameworks. The present invention trains on the preprocessed corpus with the word2vec model through the gensim open-source framework. The selected classification algorithm may be svm, decision tree, random forest, neural network, and so on. After the training unit has trained the classification model in a supervised manner on the selected training samples, the model can be used to identify the type of column data.
On this basis, the present invention also provides a data storage device which, in addition to the above data type identification device, further includes a matching module configured to obtain the semantic attribute of the column data from the classification category output by the classification model, and to match the recognition result of the data type identification device with the semantic attributes of database fields to complete the operation of storing the column data in the database, wherein mapping relations between the classification categories output by the classification model and the semantic attributes to which they belong, and mapping relations between the semantic attributes of column data and the semantic attributes of database fields, are stored in advance in a storage module.
In the data storage device, after the classification model outputs a classification result (corresponding to a certain classification category), the semantic attribute of the classification result is obtained by looking up the stored mapping relations between classification categories and semantic attributes. The mapping relations between the semantic attribute of the column data and the semantic attributes of database fields are then looked up, that is, the column is matched against the fields in the database, and the column data is stored in the corresponding position in the database.
It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the invention.
Claims (10)
1. A data type identification method, characterized in that the identification method comprises:
S1: obtaining column data to be identified, the column data including a column header and data content;
S2: extracting features of the column data to obtain a feature vector, the feature vector including column header features and data content features;
S3: inputting the feature vector into a pre-trained classification model for classification, thereby completing identification of the column data.
2. The identification method according to claim 1, characterized in that step S2 comprises:
S21: extracting the column header from the column data to obtain the column header feature;
S22: extracting a first preset feature from individual data items in the data content;
S23: extracting a second preset feature over all of the data content;
S24: concatenating the column header feature, the first preset feature, and the second preset feature to obtain the feature vector of the column data.
3. The identification method according to claim 2, characterized in that:
in step S21, the column header is converted into a feature vector of preset dimension using a word embedding model;
and/or, in step S22, string length, format, and constituent element features are extracted from individual data items in the data content;
and/or, in step S23, discreteness, continuity, and variance features are extracted over all of the data content.
4. The identification method according to any one of claims 1-3, characterized in that, before step S1, the method further comprises a step of training the classification model, comprising:
S01: selecting a training corpus and preprocessing it;
S02: selecting a classification model;
S03: extracting training samples from the preprocessed training corpus;
S04: labeling the extracted training samples with classification categories;
S05: inputting the labeled training samples into the classification model to train it.
5. A data storage method, characterized in that the data storage method comprises the data type identification method according to any one of claims 1-4, and further comprises:
S4: obtaining, according to the classification category output by the classification model, the semantic attribute to which it belongs, wherein a mapping between the classification categories output by the classification model and their semantic attributes is stored in advance;
S5: matching the obtained semantic attribute of the column data against the semantic attributes of the database fields to complete storage of the column data in the database, wherein a mapping between the semantic attributes of the column data output by the classification model and the semantic attributes of the database fields is stored in advance.
6. A data type identification device, characterized in that the identification device comprises:
a data acquisition module, configured to obtain column data to be identified, the column data comprising a column header and data content;
a feature extraction module, configured to extract features of the column data obtained by the data acquisition module to obtain a feature vector, the feature vector comprising a column header feature and data content features;
a data classification module, configured to input the feature vector extracted by the feature extraction module into a pre-trained classification model for classification, thereby completing identification of the column data.
7. The identification device according to claim 6, characterized in that the feature extraction module comprises:
a feature extraction unit, configured to extract the column header from the column data to obtain the column header feature, to extract first preset features of individual data items in the data content, and to extract second preset features over the data content as a whole;
a feature concatenation unit, configured to concatenate the column header feature, the first preset features and the second preset features to obtain the feature vector of the column data.
8. The identification device according to claim 7, characterized in that:
in the feature extraction unit, the column header is converted into a feature vector of a preset dimension using a word embedding model; string length, format and constituent-element features of individual data items in the data content are extracted; and dispersion, continuity and variance features are extracted over the data content as a whole.
9. The identification device according to any one of claims 6-8, characterized in that the identification device further comprises a training module configured to train the classification model, the training module comprising:
a corpus preprocessing unit, configured to select a training corpus and preprocess it;
a sample extraction unit, configured to extract training samples from the preprocessed training corpus;
a labeling unit, configured to label the extracted training samples with classification categories;
a training unit, configured to input the labeled training samples into the selected classification model to train it.
10. A data storage device, characterized in that the data storage device comprises the data type identification device according to any one of claims 6-8, and further comprises:
a matching module, configured to obtain, according to the classification category output by the classification model, the semantic attribute to which it belongs, and to match the identification result of the data type identification device against the semantic attributes of the database fields to complete storage of the column data in the database, wherein the mapping between the classification categories output by the classification model and their semantic attributes, and the mapping between the semantic attributes of the column data and the semantic attributes of the database fields, are stored in advance in a storage module.
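The feature extraction recited in the claims above (column header feature, per-item features, column-level features, and their concatenation) can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the lookup table `HEADER_EMBEDDINGS` is a hypothetical stand-in for a real word embedding model, and the concrete definitions of "format", "dispersion", "continuity" and "variance" used here (character-class shares, distinct-value ratio, share of consecutive integers, variance of string lengths) are illustrative choices that the claims do not specify.

```python
import statistics
from typing import Dict, List

# Hypothetical stand-in for a word embedding model: a tiny lookup table.
HEADER_EMBEDDINGS: Dict[str, List[float]] = {
    "phone": [1.0, 0.0],
    "name": [0.0, 1.0],
}
EMBEDDING_DIM = 2


def header_feature(header: str) -> List[float]:
    """Map the column header to a fixed-dimension vector (zeros if unknown)."""
    return HEADER_EMBEDDINGS.get(header.lower(), [0.0] * EMBEDDING_DIM)


def item_features(values: List[str]) -> List[float]:
    """Per-item features, averaged: string length, digit share, letter share."""
    lengths = [len(v) for v in values]
    digit_share = [sum(c.isdigit() for c in v) / len(v) if v else 0.0 for v in values]
    alpha_share = [sum(c.isalpha() for c in v) / len(v) if v else 0.0 for v in values]
    return [
        statistics.mean(lengths),
        statistics.mean(digit_share),
        statistics.mean(alpha_share),
    ]


def column_features(values: List[str]) -> List[float]:
    """Column-level features: dispersion, continuity, variance of string lengths."""
    # Assumed definition of dispersion: ratio of distinct values to all values.
    dispersion = len(set(values)) / len(values)
    continuity = 0.0
    if all(v.isascii() and v.isdigit() for v in values):
        numbers = sorted(int(v) for v in values)
        steps = [b - a for a, b in zip(numbers, numbers[1:])]
        # Assumed definition of continuity: share of adjacent sorted values differing by 1.
        continuity = sum(s == 1 for s in steps) / len(steps) if steps else 0.0
    variance = statistics.pvariance([len(v) for v in values])
    return [dispersion, continuity, variance]


def feature_vector(header: str, values: List[str]) -> List[float]:
    """Concatenate header, per-item and column-level features into one vector."""
    if not values:
        raise ValueError("column has no data content")
    return header_feature(header) + item_features(values) + column_features(values)
```

For example, `feature_vector("phone", ["101", "102", "103"])` yields an 8-element vector: 2 header dimensions, 3 per-item features and 3 column-level features. Such a vector would then be passed to whichever pre-trained classification model is used.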
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811096054.7A CN109408555B (en) | 2018-09-19 | 2018-09-19 | Data type identification method and device and data storage method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811096054.7A CN109408555B (en) | 2018-09-19 | 2018-09-19 | Data type identification method and device and data storage method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109408555A (en) | 2019-03-01 |
CN109408555B (en) | 2022-11-11 |
Family ID: 65465012
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201811096054.7A CN109408555B (en) | Active | 2018-09-19 | 2018-09-19 | Data type identification method and device and data storage method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408555B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970736A (en) * | 2013-01-25 | 2014-08-06 | 苏州精易会信息技术有限公司 | Method for converting Excel sheet to database table |
CN105825138A (en) * | 2015-01-04 | 2016-08-03 | 北京神州泰岳软件股份有限公司 | Sensitive data identification method and device |
CN106503222A (en) * | 2016-11-04 | 2017-03-15 | 上海轻维软件有限公司 | Batch based on Excel imports the method and device of management data base |
CN106776843A (en) * | 2016-11-28 | 2017-05-31 | 浪潮软件集团有限公司 | Method for importing excel file based on xml analysis |
CN107527070A (en) * | 2017-08-25 | 2017-12-29 | 江苏赛睿信息科技股份有限公司 | Recognition methods, storage medium and the server of dimension data and achievement data |
Non-Patent Citations (1)
Title |
---|
姚泱 (Yao Yang): "导入Excel时对字段自动匹配" [Automatic field matching when importing Excel], 《ACCESS》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109993235A (en) * | 2019-04-10 | 2019-07-09 | 苏州浪潮智能科技有限公司 | A kind of multivariate data classification method and device |
CN110134957A (en) * | 2019-05-14 | 2019-08-16 | 云南电网有限责任公司电力科学研究院 | A kind of scientific and technological achievement storage method and system based on semantic analysis |
CN110134957B (en) * | 2019-05-14 | 2023-06-13 | 云南电网有限责任公司电力科学研究院 | Scientific and technological achievement warehousing method and system based on semantic analysis |
CN110232150A (en) * | 2019-05-21 | 2019-09-13 | 平安科技(深圳)有限公司 | A kind of Users'Data Analysis method, apparatus, readable storage medium storing program for executing and terminal device |
CN110232150B (en) * | 2019-05-21 | 2023-04-14 | 平安科技(深圳)有限公司 | User data analysis method and device, readable storage medium and terminal equipment |
CN111046632A (en) * | 2019-11-29 | 2020-04-21 | 智器云南京信息科技有限公司 | Data extraction and conversion method, system, storage medium and electronic equipment |
CN111046632B (en) * | 2019-11-29 | 2023-11-10 | 智器云南京信息科技有限公司 | Data extraction and conversion method, system, storage medium and electronic equipment |
CN111104466A (en) * | 2019-12-25 | 2020-05-05 | 航天科工网络信息发展有限公司 | Method for rapidly classifying massive database tables |
CN114781471A (en) * | 2021-06-02 | 2022-07-22 | 清华大学 | Entity record matching method and system |
CN114781471B (en) * | 2021-06-02 | 2022-12-27 | 清华大学 | Entity record matching method and system |
CN113312354A (en) * | 2021-06-10 | 2021-08-27 | 北京百度网讯科技有限公司 | Data table identification method, device, equipment and storage medium |
CN113312354B (en) * | 2021-06-10 | 2023-07-28 | 北京百度网讯科技有限公司 | Data table identification method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109408555B (en) | 2022-11-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109408555A (en) | Data type recognition methods and device, data storage method and device | |
CN106156365B (en) | A kind of generation method and device of knowledge mapping | |
CN107766371B (en) | Text information classification method and device | |
KR101657495B1 (en) | Image recognition method using deep learning analysis modular systems | |
CN107943911A (en) | Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing | |
US20170337260A1 (en) | Method and device for storing data | |
CN105243055B (en) | Based on multilingual segmenting method and device | |
CN112232058A (en) | False news identification method and system based on deep learning three-layer semantic extraction framework | |
US11243971B2 (en) | System and method of database creation through form design | |
CN108399157B (en) | Dynamic extraction method of entity and attribute relationship, server and readable storage medium | |
CN110750977B (en) | Text similarity calculation method and system | |
CN110209828A (en) | Case querying method and case inquiry unit, computer equipment and storage medium | |
KR20210106372A (en) | New category tag mining method and device, electronic device and computer-readable medium | |
CN109933671A (en) | Construct method, apparatus, computer equipment and the storage medium of personal knowledge map | |
CN109033282A (en) | A kind of Web page text extracting method and device based on extraction template | |
CN108536673B (en) | News event extraction method and device | |
CN114239588A (en) | Article processing method and device, electronic equipment and medium | |
CN109635125B (en) | Vocabulary atlas building method and electronic equipment | |
CN114970514A (en) | Artificial intelligence based Chinese word segmentation method, device, computer equipment and medium | |
CN110321557A (en) | A kind of file classification method, device, electronic equipment and storage medium | |
CN115759293A (en) | Model training method, image retrieval device and electronic equipment | |
CN110197175A (en) | A kind of method and system of books title positioning and part-of-speech tagging | |
CN109522407A (en) | Business connection prediction technique, device, computer equipment and storage medium | |
CN111401047A (en) | Method and device for generating dispute focus of legal document and computer equipment | |
CN115563278A (en) | Question classification processing method and device for sentence text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | | Denomination of invention: Data type identification method and device, data entry method and device; Effective date of registration: 20231027; Granted publication date: 20221111; Pledgee: Bank of Hangzhou Limited by Share Ltd. Nanjing branch; Pledgor: COGNITIVE COMPUTING NANJING INFORMATION TECHNOLOGY Co.,Ltd.; Registration number: Y2023980062710 |