CN109408555B - Data type identification method and device and data storage method and device - Google Patents

Info

Publication number
CN109408555B
Authority
CN
China
Prior art keywords
data
column
classification
training
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811096054.7A
Other languages
Chinese (zh)
Other versions
CN109408555A (en
Inventor
王海波
李晓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cognitive Computing Nanjing Information Technology Co ltd
Original Assignee
Cognitive Computing Nanjing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cognitive Computing Nanjing Information Technology Co ltd filed Critical Cognitive Computing Nanjing Information Technology Co ltd
Priority to CN201811096054.7A priority Critical patent/CN109408555B/en
Publication of CN109408555A publication Critical patent/CN109408555A/en
Application granted granted Critical
Publication of CN109408555B publication Critical patent/CN109408555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a data type identification method and device and a data storage method and device. The data type identification method comprises the following steps: S1, acquiring column data to be identified, wherein the column data comprises a column header and data content; S2, extracting features from the column data to obtain a feature vector, wherein the feature vector comprises column header features and data content features; and S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing the identification of the column data. The method obtains the feature vector from the column header and the data content of the column data and inputs it into a pre-trained classification model, which classifies the column header and the data content to obtain the semantic attribute to which they belong, completing the identification of the structured data type. The method is simple, convenient and efficient, requires no manual intervention, and greatly saves manpower and material resources.

Description

Data type identification method and device and data storage method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a data type identification method and device and a data storage method and device.
Background
Structured data analysis is one of the important links in data mining. Structured data stored in files in formats such as Excel and CSV is difficult to analyze directly. An analyst usually performs complex analysis with the help of a relational database or a graph database; that is, the analyst first stores the data from the file in the relational database or graph database and then analyzes it with other analysis frameworks. During warehousing, the analyst needs to map each column of data in the Excel, CSV or similar file to a field in the database.
At present, the field matching problem in data warehousing is generally handled in one of two ways. In the first, the analyst completes the mapping manually, which requires a great deal of manual intervention and is time-consuming, laborious and inefficient. In the second, automatic mapping is achieved by means of a strategy, in one of the following two forms: 1. the results of previous manual mappings are recorded, and if the current file column (usually identified by its column header) has been processed before, the mapping is matched quickly; 2. the mapping is completed by strategies such as exact (hard) matching or regular-expression matching between the column header and the database field. Both forms are inflexible: human intervention is still required when a column of similar data appears that has not been processed before.
Disclosure of Invention
The invention aims to provide a data type identification method and device and a data storage method and device, and effectively solves the technical problems that the structured data type identification is not flexible enough and the efficiency is low in the prior art.
The technical scheme provided by the invention is as follows:
A data type identification method, comprising:
S1, acquiring column data to be identified, wherein the column data comprises a column header and data content;
S2, extracting features from the column data to obtain a feature vector, wherein the feature vector comprises column header features and data content features;
and S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing the identification of the column data.
Further preferably, step S2 comprises:
S21, extracting the column header from the column data to obtain column header features;
S22, extracting first preset features from individual data items in the data content;
S23, extracting second preset features from the data content as a whole;
S24, concatenating the column header features, the first preset features and the second preset features to obtain the feature vector of the column data.
Further preferably, in step S21, a word embedding model is used to convert the column header into a feature vector of a preset dimension;
and/or, in step S22, string length, format and constituent element features are extracted from individual data items in the data content;
and/or, in step S23, dispersion, continuity and variance features are extracted from the data content as a whole.
Further preferably, before step S1, the method further comprises a step of training the classification model, comprising:
S01, selecting a training corpus and preprocessing it;
S02, selecting a classification model;
S03, extracting training samples from the preprocessed training corpus;
S04, labeling the extracted training samples with classification categories;
and S05, inputting the labeled training samples into the classification model and training the classification model.
The invention also provides a data storage method, which comprises the data type identification method described above and further comprises the following steps:
S4, obtaining, from the classification category output by the classification model, the semantic attribute to which that category belongs, wherein a mapping relation between the classification categories output by the classification model and the semantic attributes to which they belong is stored in advance;
and S5, matching the semantic attribute obtained for the column data with the semantic attributes of the database fields to complete the warehousing of the column data, wherein a mapping relation between the semantic attributes of column data output by the classification model and the semantic attributes of the database fields is stored in advance.
The invention also provides a data type identification device, comprising:
a data acquisition module, used for acquiring column data to be identified, wherein the column data comprises a column header and data content;
a feature extraction module, used for extracting features from the column data acquired by the data acquisition module to obtain a feature vector, wherein the feature vector comprises column header features and data content features;
and a data classification module, used for inputting the feature vector extracted by the feature extraction module into a pre-trained classification model for classification, thereby completing the identification of the column data.
Further preferably, the feature extraction module comprises:
a feature extraction unit, used for extracting the column header from the column data to obtain column header features, extracting first preset features from individual data items in the data content, and extracting second preset features from the data content as a whole;
and a feature concatenation unit, used for concatenating the column header features, the first preset features and the second preset features to obtain the feature vector of the column data.
Further preferably, the feature extraction unit uses a word embedding model to convert the column header into a feature vector of a preset dimension; extracts string length, format and constituent element features from individual data items in the data content; and extracts dispersion, continuity and variance features from the data content as a whole.
Further preferably, the identification device further comprises a training module, used for training the classification model; the training module comprises:
a corpus preprocessing unit, used for selecting a training corpus and preprocessing it;
a sample extraction unit, used for extracting training samples from the preprocessed training corpus;
a labeling unit, used for labeling the extracted training samples with classification categories;
and a training unit, used for inputting the labeled training samples into the selected classification model and training the classification model.
The invention also provides a data storage device, which comprises the data type identification device described above and further comprises:
a matching module, used for obtaining, from the classification category output by the classification model, the semantic attribute to which that category belongs, and for matching the identification result of the data type identification device with the semantic attributes of the database fields to complete the warehousing of the column data, wherein a mapping relation between the classification categories output by the classification model and the semantic attributes to which the column data belongs, and a mapping relation between the semantic attributes to which the column data belongs and the semantic attributes of the database fields, are stored in advance in a storage module.
In the data type identification method and device provided by the invention, a feature vector is obtained from the column header and the data content of the column data and is input into a pre-trained classification model, which classifies the column header and the data content to obtain the semantic attribute to which they belong, thereby completing the identification of the structured data type. The method and device are simple, convenient and efficient, require no manual intervention, and greatly save manpower and material resources. In addition, classification models can be trained for different application scenarios, so the approach is widely applicable. During the warehousing of structured data, a mapping can be recommended quickly, flexibly and accurately once the mapping between the semantic attributes of the column data and the semantic attributes of the database fields has been established.
Drawings
The above characteristics, technical features, advantages and implementations of the data type identification method and device and the data storage method and device will be further described below in a clear and understandable manner, through preferred embodiments and with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a data type identification method according to the present invention;
FIG. 2 is a schematic diagram of a training process of a classification model according to the present invention;
FIG. 3 is a schematic diagram of a data type identification device according to the present invention.
Description of reference numerals:
100-data type identification device, 110-data acquisition module, 120-feature extraction module, 130-data classification module.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, without inventive effort, other drawings and embodiments can be derived from them.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
FIG. 1 shows a schematic flow chart of the data type identification method provided by the present invention. As can be seen from the figure, the identification method includes: S1, acquiring column data to be identified, wherein the column data comprises a column header and data content; S2, extracting features from the column data to obtain a feature vector, wherein the feature vector comprises column header features and data content features; and S3, inputting the feature vector into a pre-trained classification model for classification, thereby completing the identification of the column data.
In this method, column data refers to one column of data in a file in a format such as Excel or CSV. The data format is usually a column header followed by the data content, where the column header is the first row of data in the file and describes the content of the current column. Word embedding is a type of word representation in which words with similar meanings have similar representations; it is a general term for methods that map words to real-valued vectors. In the data type identification process, one column of data in a file is treated as the analysis object, and identification is achieved in three stages: feature extraction, classification model training, and data classification.
In feature extraction, the column data is divided into a column header part and a data content part, so feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is typically a description of the column data, so the column header is converted into a feature vector of a specified dimension by a word embedding model (e.g., Word2Vec with its CBOW or Skip-gram variants). For the data content part, a data sample is first obtained by sampling. Features such as string length, format and constituent elements are then extracted from each individual data item (one row) in the sample to obtain the first preset features, and features such as dispersion, continuity and variance are extracted from the sample as a whole to obtain the second preset features. Finally, the column header features, the first preset features and the second preset features are concatenated to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).
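The feature extraction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the toy embedding table `TOY_EMBEDDINGS`, the function names, the concrete definitions of dispersion and continuity, and the choice to average per-item features over the sample are all hypothetical choices made for the example.

```python
import re
import statistics

# Hypothetical 3-dimensional embeddings; a real system would look these up
# in a trained word embedding model.
TOY_EMBEDDINGS = {
    "phone": [0.9, 0.1, 0.0],
    "number": [0.2, 0.8, 0.1],
}
EMBED_DIM = 3


def header_features(header):
    """Average the embeddings of the header's known tokens (zeros if none)."""
    tokens = re.findall(r"[a-z]+", header.lower())
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * EMBED_DIM
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def single_value_features(value):
    """First preset features: length, format and constituents of one item."""
    n = len(value) or 1
    return [
        float(len(value)),                                  # string length
        1.0 if re.fullmatch(r"[0-9]+", value) else 0.0,     # format: digits only
        sum(c.isdigit() for c in value) / n,                # share of digits
        sum(c.isalpha() for c in value) / n,                # share of letters
    ]


def column_wide_features(values):
    """Second preset features: dispersion, continuity and variance (assumed definitions)."""
    dispersion = len(set(values)) / len(values)             # ratio of distinct items
    numeric = sorted(int(v) for v in values if re.fullmatch(r"[0-9]+", v))
    steps = [b - a for a, b in zip(numeric, numeric[1:])]
    continuity = sum(s == 1 for s in steps) / len(steps) if steps else 0.0
    variance = float(statistics.pvariance([len(v) for v in values]))
    return [dispersion, continuity, variance]


def column_feature_vector(header, values):
    """Concatenate header, averaged per-item and column-wide features."""
    if not values:
        raise ValueError("the sampled data content must not be empty")
    per_value = [single_value_features(v) for v in values]
    averaged = [sum(dim) / len(per_value) for dim in zip(*per_value)]
    return header_features(header) + averaged + column_wide_features(values)


vector = column_feature_vector(
    "Phone Number", ["13800000001", "13800000002", "13800000003"]
)
```

Here `vector` has ten entries: three from the header, four per-item averages, and three column-wide statistics. The split mirrors the header, first preset and second preset features of the text.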
After feature extraction is finished, the feature vector is fed into the trained classification model to obtain its classification result, which achieves automatic identification of the data type. The specific form of the classification model is not limited here; any classification model can be used in the present invention, such as an SVM (support vector machine), a decision tree, a random forest, or a neural network (deep learning).
As shown in FIG. 2, training the classification model includes: S01, selecting a training corpus and preprocessing it; S02, selecting a classification model; S03, extracting training samples from the preprocessed training corpus; S04, labeling the extracted training samples with classification categories; and S05, inputting the labeled training samples into the classification model and training the classification model.
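The classification step can be illustrated with a small sketch. Because the text leaves the model open, the example uses a nearest-centroid classifier purely as a dependency-free stand-in for the SVM, decision tree, random forest or neural network options; it is not one of the models named in the text. The two-dimensional feature vectors and the label meanings are hypothetical.

```python
import math


class NearestCentroidClassifier:
    """Stand-in classifier: predicts the class whose mean vector is closest."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[label] = counts.get(label, 0) + 1
        # One centroid (mean feature vector) per classification category.
        self.centroids = {
            label: [x / counts[label] for x in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, vec):
        """Return the label whose centroid is nearest in Euclidean distance."""
        return min(
            self.centroids,
            key=lambda label: math.dist(vec, self.centroids[label]),
        )


# Hypothetical features: [mean string length, share of digits].
train_vectors = [[11.0, 1.0], [11.0, 0.9], [18.0, 0.95], [18.0, 1.0]]
train_labels = [0, 0, 1, 1]  # hypothetical: 0 = phone number, 1 = ID card number

model = NearestCentroidClassifier().fit(train_vectors, train_labels)
predicted = model.predict([11.0, 0.95])
```

Any classifier exposing the same fit and predict steps could be substituted without changing the rest of the pipeline.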
Specifically, the word embedding model places high demands on the corpus, so when selecting the corpus, articles that cover the domain of the data to be analyzed (the data in the Excel and CSV files) should be chosen as far as possible. The corpus is then preprocessed according to the specific usage scenario, for example by deleting English text, deleting special symbols, and converting between simplified and traditional Chinese characters, after which a word segmentation tool such as jieba or HanLP is used to segment the corpus into words.
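A minimal sketch of this preprocessing is given below, assuming a Chinese corpus. The function name `preprocess_line` and the exact character ranges are assumptions for illustration. Simplified/traditional conversion is omitted, and the segmenter is passed in as a callable so that a tool such as jieba (for example its `lcut` function) could be plugged in; the fallback simply splits into single characters.

```python
import re


def preprocess_line(text, segment=None):
    """Clean one corpus line and split it into tokens."""
    # Delete English letters.
    text = re.sub(r"[A-Za-z]+", "", text)
    # Delete special symbols: keep CJK ideographs, ASCII digits and whitespace.
    text = re.sub(r"[^\u4e00-\u9fff0-9\s]", "", text)
    # Collapse runs of whitespace left behind by the deletions.
    text = re.sub(r"\s+", " ", text).strip()
    if segment is None:
        # Fallback segmenter (assumption): one token per non-space character.
        return [ch for ch in text if not ch.isspace()]
    return list(segment(text))


tokens = preprocess_line("手机号码 phone: 138!")
chunks = preprocess_line("手机号码 phone: 138!", segment=str.split)
```

Keeping the segmenter injectable means the cleaning rules can be tested on their own, independent of whichever segmentation library is chosen.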
Before training, the required classification categories are determined according to the business scenario, and each category is mapped to a classification result; for example, if there are n categories, they are mapped to 0, 1, …, n-1. Training samples (specifically, the feature vectors obtained by feature extraction from the selected training corpus of column data) are then extracted from the training corpus, and each training sample is labeled with its classification category. The label is the classification result mapped from the category; if the mapping uses numbers, the label is the corresponding number. The selected training samples should cover all classification categories, and the number of training samples per category should not differ too much; the categories should be as evenly represented as possible.
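The category mapping and the coverage and balance requirements can be sketched as follows. The category names, the helper names and the `max_ratio` threshold are hypothetical; the text only asks that the per-category counts should not differ too much, without fixing a number.

```python
from collections import Counter


def build_label_map(categories):
    """Map each classification category to an integer 0 .. n-1."""
    return {name: idx for idx, name in enumerate(categories)}


def check_coverage_and_balance(labels, n_classes, max_ratio=2.0):
    """Return (covers_all, balanced) for a list of integer labels.

    covers_all: every class 0 .. n-1 has at least one sample.
    balanced:   the largest class is at most max_ratio times the smallest.
    """
    counts = Counter(labels)
    covers_all = n_classes > 0 and all(counts.get(i, 0) > 0 for i in range(n_classes))
    balanced = covers_all and max(counts.values()) <= max_ratio * min(counts.values())
    return covers_all, balanced


label_map = build_label_map(["phone_number", "id_card_number", "name"])
status = check_coverage_and_balance([0, 0, 1, 1, 2], n_classes=3)
```

Running such a check before training makes a missing or badly under-represented category visible early, when it is still cheap to add samples.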
Word2Vec, a popular word embedding model, has been integrated into various open-source frameworks. The invention trains a Word2Vec model on the preprocessed corpus using the gensim open-source framework. The classification algorithm may be an SVM, a decision tree, a random forest, a neural network, and the like. After supervised training on the selected training samples, the classification model can be used to identify column data types.
Based on the data type identification method, the invention also provides a data storage method which, in addition to the data type identification method, comprises the following steps: S4, obtaining, from the classification category output by the classification model, the semantic attribute to which that category belongs, wherein a mapping relation between the classification categories output by the classification model and the semantic attributes to which they belong is stored in advance; and S5, matching the semantic attribute obtained for the column data with the semantic attributes of the database fields to complete the warehousing of the column data, wherein a mapping relation between the semantic attributes of column data output by the classification model and the semantic attributes of the database fields is stored in advance.
In this method, the column header is the first row of data in the file, describes the content of the current column, and is one of several possible expressions of a semantic attribute. A semantic attribute describes the characteristics of a column of data and is a high-level description built on low-level features, for example an ID card number or a mobile phone number. In general, structured data (including column data) has corresponding semantic attributes, and the fields in a database also have corresponding semantic attributes. Because both the column header of the column data and the database field are expressions of a semantic attribute, and the same semantic attribute can be expressed in many ways, it is difficult to complete the mapping by matching the column header directly against the database field. For example, the database field may be phone_num while the column header is "mobile phone number", "calling number" and so on. The method therefore completes the mapping from column data to database fields by matching semantic attributes.
After the classification model outputs a classification result (corresponding to a certain classification category), the semantic attribute to which the result belongs is obtained by looking up the stored mapping between classification categories and semantic attributes. The mapping between the semantic attributes of column data and the semantic attributes of database fields is then looked up; that is, the semantic attribute is matched with a field in the database, and the column data is stored in the corresponding position in the database. In other embodiments, when the classification model is trained, the required semantic attributes (covering the database fields) are determined from the business scenario and mapped directly to classification results; as before, if there are n semantic attributes, they are mapped to 0, 1, …, n-1. After the feature vector of the column data is input into the classification model, the semantic attribute of the column data is obtained directly from the mapping between classification results and semantic attributes, and is then matched with the semantic attributes of the database fields.
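The two lookups of steps S4 and S5 amount to chaining two prestored mappings. The sketch below assumes both are plain dictionaries; `phone_num` comes from the example in the text, while the other keys, the `id_card` field name and the function name are hypothetical.

```python
# S4 mapping: classification category -> semantic attribute (prestored).
CLASS_TO_ATTRIBUTE = {
    0: "phone number",
    1: "ID card number",
}

# S5 mapping: semantic attribute -> database field (prestored).
ATTRIBUTE_TO_FIELD = {
    "phone number": "phone_num",
    "ID card number": "id_card",  # hypothetical field name
}


def resolve_database_field(class_id):
    """Return the database field for a classification result, or None if unmapped."""
    attribute = CLASS_TO_ATTRIBUTE.get(class_id)      # step S4
    if attribute is None:
        return None
    return ATTRIBUTE_TO_FIELD.get(attribute)          # step S5


field = resolve_database_field(0)
```

Because headers such as "mobile phone number" and "calling number" would both be classified into the same category, they resolve to the same field without any header-to-field string matching.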
FIG. 3 shows a schematic diagram of the data type identification device 100 provided by the present invention. As can be seen from the figure, the data type identification device 100 includes a data acquisition module 110, a feature extraction module 120 and a data classification module 130, wherein the feature extraction module 120 is connected to both the data acquisition module 110 and the data classification module 130. In operation, the data acquisition module 110 first acquires the column data to be identified, wherein the column data comprises a column header and data content; the feature extraction module 120 then extracts features from the column data acquired by the data acquisition module 110 to obtain a feature vector, which comprises column header features and data content features; finally, the data classification module 130 inputs the feature vector extracted by the feature extraction module 120 into a pre-trained classification model for classification, thereby completing the identification of the column data.
Specifically, column data refers to one column of data in a file in a format such as Excel or CSV. The data format is usually a column header followed by the data content, where the column header is the first row of data in the file and describes the content of the current column. Word embedding is a type of word representation in which words with similar meanings have similar representations; it is a general term for methods that map words to real-valued vectors. In the data type identification process, one column of data in a file is treated as the analysis object, and identification is achieved in three stages: feature extraction, classification model training, and data classification.
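The connection between the three modules can be sketched as a small class hierarchy. The class and method names are hypothetical, the two placeholder features are far simpler than the features the text describes, and the model is assumed to be any callable that maps a feature vector to a classification category.

```python
import csv
import io


class DataAcquisitionModule:
    """Module 110: reads one column (header plus content) from CSV text."""

    def acquire(self, csv_text, column_index):
        rows = list(csv.reader(io.StringIO(csv_text)))
        header = rows[0][column_index]
        values = [row[column_index] for row in rows[1:]]
        return header, values


class FeatureExtractionModule:
    """Module 120: turns column data into a feature vector (placeholder features)."""

    def extract(self, header, values):
        if not values:
            raise ValueError("column has no data content")
        mean_length = sum(len(v) for v in values) / len(values)
        digit_share = sum(v.isascii() and v.isdigit() for v in values) / len(values)
        return [mean_length, digit_share]


class DataClassificationModule:
    """Module 130: applies a pre-trained model to the feature vector."""

    def __init__(self, model):
        self.model = model  # assumption: callable, feature vector -> category

    def classify(self, vector):
        return self.model(vector)


class DataTypeIdentificationDevice:
    """Device 100: acquisition -> feature extraction -> classification."""

    def __init__(self, model):
        self.acquisition = DataAcquisitionModule()
        self.extraction = FeatureExtractionModule()
        self.classification = DataClassificationModule(model)

    def identify(self, csv_text, column_index):
        header, values = self.acquisition.acquire(csv_text, column_index)
        vector = self.extraction.extract(header, values)
        return self.classification.classify(vector)


def toy_model(vector):
    """Hypothetical rule standing in for a trained classifier."""
    return 0 if vector[0] == 11 else 1


device = DataTypeIdentificationDevice(toy_model)
sample = "name,phone\nAnn,13800000001\nBob,13800000002\n"
result = device.identify(sample, column_index=1)
```

Only the feature extraction module talks to both neighbours, which matches the connection layout described for modules 110, 120 and 130.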
Specifically, the feature extraction module 120 includes a feature extraction unit and a feature concatenation unit. In feature extraction, the column data is divided into a column header part and a data content part, so feature extraction is likewise divided into column header feature extraction and data content feature extraction. The column header is generally a description of the column data, so the feature extraction unit converts the column header into a feature vector of a specified dimension by means of a word embedding model (e.g., Word2Vec with its CBOW or Skip-gram variants). For the data content part, a data sample is first obtained by sampling. The feature extraction unit then extracts features such as string length, format and constituent elements from each individual data item (one row) in the sample to obtain the first preset features, and extracts features such as dispersion, continuity and variance from the sample as a whole to obtain the second preset features. Finally, the feature concatenation unit concatenates the column header features, the first preset features and the second preset features to obtain the feature vector of the column data, which serves as the feature description of the current analysis object (the column data).
After feature extraction is finished, the feature vector is fed into the trained classification model to obtain its classification result, which achieves automatic identification of the data type. The specific form of the classification model is not limited here; any classification model can be used in the present invention, such as an SVM (support vector machine), a decision tree, a random forest, or a neural network (deep learning).
The training module comprises a corpus preprocessing unit, a sample extraction unit, a labeling unit and a training unit, wherein the sample extraction unit is connected to the corpus preprocessing unit, the labeling unit is connected to the sample extraction unit, and the training unit is connected to the labeling unit. When the classification model is trained, the corpus preprocessing unit selects a training corpus and preprocesses it; the sample extraction unit extracts training samples from the preprocessed training corpus; the labeling unit then labels the extracted training samples with classification categories; and finally the training unit inputs the labeled training samples into the selected classification model for training.
Specifically, the word embedding model places high demands on the corpus, so when selecting the corpus, articles that cover the domain of the data to be analyzed (the data in the Excel and CSV files) should be chosen as far as possible. The corpus preprocessing unit then preprocesses the corpus according to the specific usage scenario, for example by deleting English text, deleting special symbols, and converting between simplified and traditional Chinese characters, after which a word segmentation tool such as jieba or HanLP is used to segment the corpus into words.
Before training, the required classification categories are determined according to the business scenario, and each category is mapped to a classification result; for example, if there are n categories, they are mapped to 0, 1, …, n-1. The sample extraction unit then extracts training samples (specifically, the feature vectors obtained by feature extraction from the selected training corpus of column data) from the training corpus, and the labeling unit labels each training sample with its classification category. The label is the classification result mapped from the category; if the mapping uses numbers, the label is the corresponding number. The selected training samples should cover all classification categories, and the number of training samples per category should not differ too much; the categories should be as evenly represented as possible.
Word2Vec, a popular word embedding model, has been integrated into various open-source frameworks. The invention trains a Word2Vec model on the preprocessed corpus using the gensim open-source framework. The classification algorithm may be an SVM, a decision tree, a random forest, a neural network, and the like. After the training unit has performed supervised training of the classification model on the selected training samples, the classification model can be used to identify column data types.
Based on this, the present invention further provides a data warehousing device, which comprises the data type identification device described above and further comprises a matching module. The matching module is used for obtaining, from the classification category output by the classification model, the semantic attribute to which that category belongs, and for matching the identification result of the data type identification device with the semantic attributes of the database fields to complete the warehousing of the column data. A mapping relation between the classification categories output by the classification model and the semantic attributes of the column data, and a mapping relation between the semantic attributes of the column data and the semantic attributes of the database fields, are stored in advance in a storage module.
In the data warehousing device, after the classification model outputs a classification result (corresponding to a certain classification category), the semantic attribute of the classification result is obtained by looking up the stored mapping between classification categories and the semantic attributes to which they belong. The mapping between the semantic attributes of column data and the semantic attributes of database fields is then looked up; that is, the semantic attribute is matched with a field in the database, and the column data is stored in the corresponding position in the database.
It should be noted that the above embodiments can be freely combined as necessary. The above is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, a plurality of modifications and embellishments can be made without departing from the principle of the present invention, and these modifications and embellishments should also be regarded as the protection scope of the present invention.

Claims (8)

1. A data warehousing method is characterized by comprising the following steps:
S1, acquiring column data to be identified, wherein the column data comprises a column header and data contents;
S2, extracting features of the column data to obtain a feature vector, wherein the feature vector comprises column header features and data content features;
S3, inputting the feature vector into a pre-trained classification model to classify the feature vector, thereby completing the identification of the column data;
S4, obtaining, according to the classification category output by the classification model, the semantic attribute to which the classification category belongs, wherein a mapping relation is pre-stored between the classification categories output by the classification model and the semantic attributes to which the classification categories belong;
and S5, matching the obtained semantic attribute to which the column data belongs with the semantic attributes of the database fields to complete the warehousing operation of the column data, wherein a mapping relation is pre-stored between the semantic attributes to which the column data output by the classification model belongs and the semantic attributes of the database fields.
2. The data warehousing method of claim 1, characterized in that step S2 comprises:
S21, extracting the column header in the column data to obtain column header features;
S22, extracting first preset features of single pieces of data in the data contents;
S23, extracting second preset features for all the data contents;
S24, concatenating the column header features, the first preset features and the second preset features to obtain the feature vector of the column data.
3. The data warehousing method of claim 2,
in step S21, a word embedding model is used to convert the column headers into feature vectors of preset dimensions;
and/or, in step S22, extracting character string length, format and constituent element characteristics of a single piece of data in the data content;
and/or, in step S23, the dispersion, continuity and variance features are extracted for all data contents.
4. The data warehousing method of any one of claims 1 to 3, further comprising, before step S1, a step of training the classification model, which comprises:
S01, selecting a training corpus and performing a preprocessing operation on the training corpus;
S02, selecting a classification model;
S03, extracting training samples from the preprocessed training corpus;
S04, labeling the extracted training samples with classification categories;
and S05, inputting the training samples labeled with classification categories into the classification model, and training the classification model.
5. A data warehousing device, characterized in that the data warehousing device comprises:
the data acquisition module is used for acquiring column data to be identified, wherein the column data comprises a column header and data contents;
the feature extraction module is used for extracting features of the column data acquired by the data acquisition module to obtain a feature vector, wherein the feature vector comprises column header features and data content features;
the data classification module is used for inputting the feature vector extracted by the feature extraction module into a pre-trained classification model to classify the feature vector, thereby completing the identification of the column data;
and the matching module is used for obtaining, according to the classification category output by the classification model, the semantic attribute to which the classification category belongs, matching the identification result of the data type identification device with the semantic attributes of the database fields, and completing the warehousing operation of the column data, wherein a mapping relation is pre-stored between the classification categories output by the classification model and the semantic attributes of the column data, a mapping relation is pre-stored between the semantic attributes of the column data and the semantic attributes of the database fields, and both mapping relations are stored in the storage module.
6. The data warehousing device of claim 5, characterized in that the feature extraction module comprises:
the feature extraction unit is used for extracting the column header in the column data to obtain column header features; extracting first preset features of single pieces of data in the data contents; and extracting second preset features for all the data contents;
and the feature concatenation unit is used for concatenating the column header features, the first preset features and the second preset features to obtain the feature vector of the column data.
7. The data warehousing device of claim 6,
in the feature extraction unit, the column header is converted into a feature vector of preset dimensions by using a word embedding model; character string length, format and constituent element features are extracted for single pieces of data in the data contents; and dispersion, continuity and variance features are extracted for all the data contents.
8. The data warehousing device of any one of claims 5 to 7, wherein the data warehousing device further comprises a training module for training the classification model; the training module comprises:
the corpus preprocessing unit is used for selecting a training corpus and performing a preprocessing operation on the training corpus;
the sample extraction unit is used for extracting training samples from the preprocessed training corpus;
the labeling unit is used for labeling the extracted training samples with classification categories;
and the training unit is used for inputting the training samples labeled with classification categories into the selected classification model to train the classification model.
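
The feature extraction and concatenation recited in claims 2 and 3 (steps S21 to S24) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the small lookup table stands in for a trained word embedding model, the per-value features are averaged over the column, and the definitions of dispersion, continuity and variance used here are assumptions chosen for the example. All names are hypothetical.

```python
from statistics import pvariance

# Hypothetical 3-dimensional header vectors standing in for a trained
# word embedding model.
HEADER_EMBEDDINGS = {"phone": [0.9, 0.1, 0.0], "name": [0.0, 0.2, 0.8]}
EMBEDDING_DIM = 3


def header_features(header):
    """S21: map the column header to a vector of preset dimension."""
    return HEADER_EMBEDDINGS.get(header.lower(), [0.0] * EMBEDDING_DIM)


def single_value_features(values):
    """S22: length and character-composition features, averaged over values."""
    total_chars = sum(len(v) for v in values)
    mean_length = total_chars / len(values)
    digit_share = sum(c.isdigit() for v in values for c in v) / max(1, total_chars)
    all_digit_share = sum(v.isdecimal() for v in values) / len(values)
    return [mean_length, digit_share, all_digit_share]


def column_features(values):
    """S23: dispersion, continuity and variance (assumed definitions)."""
    # Dispersion: share of distinct values in the column.
    dispersion = len(set(values)) / len(values)
    # Continuity: share of neighboring sorted integers that differ by exactly 1.
    numbers = sorted(int(v) for v in values if v.isdecimal())
    steps = [b - a for a, b in zip(numbers, numbers[1:])]
    continuity = sum(step == 1 for step in steps) / len(steps) if steps else 0.0
    # Variance: population variance of the string lengths.
    variance = pvariance([len(v) for v in values])
    return [dispersion, continuity, variance]


def extract_feature_vector(header, values):
    """S24: concatenate header, single-value and whole-column features."""
    if not values:
        raise ValueError("column data must contain at least one value")
    return (
        header_features(header)
        + single_value_features(values)
        + column_features(values)
    )


vector = extract_feature_vector(
    "phone", ["13800000001", "13800000002", "13800000004"]
)
```

The resulting nine-component vector is the kind of input that would be passed to the pre-trained classification model in step S3.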
CN201811096054.7A 2018-09-19 2018-09-19 Data type identification method and device and data storage method and device Active CN109408555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811096054.7A CN109408555B (en) 2018-09-19 2018-09-19 Data type identification method and device and data storage method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811096054.7A CN109408555B (en) 2018-09-19 2018-09-19 Data type identification method and device and data storage method and device

Publications (2)

Publication Number Publication Date
CN109408555A CN109408555A (en) 2019-03-01
CN109408555B true CN109408555B (en) 2022-11-11

Family

ID=65465012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811096054.7A Active CN109408555B (en) 2018-09-19 2018-09-19 Data type identification method and device and data storage method and device

Country Status (1)

Country Link
CN (1) CN109408555B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993235A (en) * 2019-04-10 2019-07-09 苏州浪潮智能科技有限公司 A kind of multivariate data classification method and device
CN110134957B (en) * 2019-05-14 2023-06-13 云南电网有限责任公司电力科学研究院 Scientific and technological achievement warehousing method and system based on semantic analysis
CN110232150B (en) * 2019-05-21 2023-04-14 平安科技(深圳)有限公司 User data analysis method and device, readable storage medium and terminal equipment
CN111046632B (en) * 2019-11-29 2023-11-10 智器云南京信息科技有限公司 Data extraction and conversion method, system, storage medium and electronic equipment
CN111104466B (en) * 2019-12-25 2023-07-28 中国长峰机电技术研究设计院 Method for quickly classifying massive database tables
CN114781471B (en) * 2021-06-02 2022-12-27 清华大学 Entity record matching method and system
CN113312354B (en) * 2021-06-10 2023-07-28 北京百度网讯科技有限公司 Data table identification method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970736A (en) * 2013-01-25 2014-08-06 苏州精易会信息技术有限公司 Method for converting Excel sheet to database table
CN105825138A (en) * 2015-01-04 2016-08-03 北京神州泰岳软件股份有限公司 Sensitive data identification method and device
CN106503222A (en) * 2016-11-04 2017-03-15 上海轻维软件有限公司 Batch based on Excel imports the method and device of management data base
CN106776843A (en) * 2016-11-28 2017-05-31 浪潮软件集团有限公司 Method for importing excel file based on xml analysis
CN107527070A (en) * 2017-08-25 2017-12-29 江苏赛睿信息科技股份有限公司 Recognition methods, storage medium and the server of dimension data and achievement data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
导入Excel时对字段自动匹配;姚泱;《Access》;20180716;第1-3页 *

Also Published As

Publication number Publication date
CN109408555A (en) 2019-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Data type identification method and device, data entry method and device

Effective date of registration: 20231027

Granted publication date: 20221111

Pledgee: Bank of Hangzhou Limited by Share Ltd. Nanjing branch

Pledgor: COGNITIVE COMPUTING NANJING INFORMATION TECHNOLOGY Co.,Ltd.

Registration number: Y2023980062710
