CN110427992A - Data matching method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN110427992A (application CN201910664541.7A)
- Authority
- CN
- China
- Prior art keywords
- column
- data
- sample
- label
- type
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention belongs to the field of computer technology and provides a data matching method, device, computer equipment and storage medium. The method comprises: obtaining data tables; matching each data column against code-table code values; applying regular-expression recognition to each data column; determining the column type of each data column; extracting the column feature vector of each column, the column feature vector comprising statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the column; identifying the column feature vector of each column to determine the column label of each column; and matching the columns based on their labels. In the data matching method provided by the embodiments of the present invention, the data are first preprocessed by code-table code value matching and regular-expression recognition, and the column feature vector of each column is then extracted by a preset column feature vector extraction step. Compared with existing methods, the extracted column feature vector captures the features of the data in multiple dimensions with a smaller data volume, which effectively reduces the amount of computation while maintaining accuracy.
Description
Technical field
The invention belongs to the field of computer technology, and in particular relates to a data matching method, device, computer equipment and storage medium.
Background art
In the course of delivering government services, a large amount of government data is usually generated. Although these data belong to different government services, they often contain a large amount of repeated data of similar types. When processing government data, it is therefore usually necessary to integrate the similar-type data produced by different government services, using data identification to find correlated data across databases.
There are many existing methods for finding correlated data across databases, and they differ in effectiveness. For example, manual data matching is relatively accurate, but the workload grows sharply as the database grows, so it is clearly unsuitable for large databases. Program-based data matching mainly takes two forms. The first uses the field descriptions present in the database and finds similar data by fuzzy search, but the matching rate tends to be low when field descriptions are missing. The second matches on the data content in the database, which requires a different matching method for each type of data content, so the amount of computation is large and the computation is slow.
It can be seen that existing data identification techniques, particularly when matching government data with large data volumes, still suffer from the technical problems of heavy computation and inaccurate results.
Summary of the invention
An object of the embodiments of the present invention is to provide a data matching method, device, computer equipment and storage medium, aiming to solve the technical problems of heavy computation and inaccurate results in existing data identification techniques.
The embodiments of the present invention are implemented as a data matching method, the method comprising:
obtaining multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
matching the column data of each data column against code-table code values;
identifying, by means of regular expressions, the parts of each data column that satisfy preset matching rules;
determining the column type of each data column by applying preset rules to the column data of each data column, the column types including a numeric type and a text type;
extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column and the column type of each data column, wherein the column feature vector comprises statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data comprise the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data; and the basic attribute features of the data column comprise the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
identifying the column feature vector of each data column using a pre-trained data identification model corresponding to the column type of that data column, and determining the column label of each data column;
matching the data columns based on the column label of each data column.
Another object of the embodiments of the present invention is to provide a data matching device, comprising:
a data table acquiring unit, configured to obtain multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
a code-table code value matching unit, configured to match the column data of each data column against code-table code values;
a regular-expression recognition unit, configured to identify, by means of regular expressions, the parts of each data column that satisfy preset matching rules;
a column type determining unit, configured to determine the column type of each data column by applying preset rules to the column data of each data column, the column types including a numeric type and a text type;
a column feature vector extraction unit, configured to extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column and the column type of each data column, wherein the column feature vector comprises statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data comprise the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data; and the basic attribute features of the data column comprise the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
a column label determination unit, configured to identify the column feature vector of each data column using a pre-trained data identification model corresponding to the column type of that data column, and to determine the column label of each data column;
a data matching unit, configured to match the data columns based on the column label of each data column.
Another object of the embodiments of the present invention is to provide computer equipment comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the data matching method described above.
Another object of the embodiments of the present invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the data matching method described above.
In the data matching method provided by the embodiments of the present invention, after multiple data tables to be matched are obtained, code-table code value matching and regular-expression recognition are applied to each data column, and the column type of each data column is then determined from its column data. According to the column name information and/or column annotation information, the column data and the column type of each data column, a preset feature extraction model extracts the column feature vector of each data column. The column feature vector comprises statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column. The statistical features comprise the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data; the basic attribute features comprise the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules. The column feature vector of each data column is then identified by a pre-trained data identification model corresponding to its column type to determine the column label of each data column, and the data columns are finally matched based on their column labels.
The method makes full use of the column name information and/or column annotation information of each column and of the statistical information of the column data, and combines them with conventional basic attributes of the column, such as its frequency of use and its importance marked in advance, to determine the column feature vector. Compared with existing data matching methods, a feature vector extracted from this information captures the features of each column in multiple dimensions with a smaller data volume, so the amount of computation is greatly reduced. The column feature vector of each column is subsequently identified by the pre-trained data identification model corresponding to the data type of the column data. Because the data identification model is trained on a large number of data samples, the labels finally determined are more accurate. Compared with existing data identification methods, the amount of computation is greatly reduced while accuracy is maintained, and the efficiency of data matching is greatly improved, especially for government data with large data volumes.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of a data matching method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the steps of another data matching method provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the steps of yet another data matching method provided by an embodiment of the present invention;
Fig. 4 is a flow chart of the steps of a method, provided by an embodiment of the present invention, for identifying the column feature vector of each column based on the data type of the column data;
Fig. 5 is a flow chart of the steps of a method, provided by an embodiment of the present invention, for training the numeric data identification model based on the random forest algorithm;
Fig. 6 is a schematic structural diagram of a data matching device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of another data matching device provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of yet another data matching device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
As shown in Fig. 1, in one embodiment a data matching method is proposed, which specifically includes the following steps:
Step S102: obtain multiple data tables to be matched.
In the embodiments of the present invention, the data tables to be matched may come from different databases, such as the common Oracle, SQL, Alibaba Cloud and Hadoop. Data can be acquired by entering the data path, and the formats of the data obtained from the different databases are unified.
In the embodiments of the present invention, each data table contains the column name information and/or column annotation information of multiple data columns and the column data of each data column.
Step S104: match the column data of each data column against code-table code values.
In the embodiments of the present invention, certain special symbols present in a data table, such as currency symbols, can be matched using the code-value code table, so that the business type of the data column can be determined quickly and conveniently.
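The code-table matching of step S104 can be sketched as follows. This is only a minimal illustration: the patent does not disclose the contents of its code tables or a coverage threshold, so `CODE_TABLES`, `match_code_table` and `min_ratio` are hypothetical names and values.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical code tables: each maps a business type to the code values
# that identify it. The patent does not list concrete tables.
CODE_TABLES: Dict[str, Set[str]] = {
    "currency": {"¥", "$", "€", "CNY", "USD"},
    "gender": {"M", "F", "男", "女"},
}


def match_code_table(values: Iterable[str], min_ratio: float = 0.8) -> Optional[str]:
    """Return the business type whose code values cover at least
    min_ratio of the non-empty cells, or None if no table qualifies."""
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return None
    best_name, best_ratio = None, 0.0
    for name, codes in CODE_TABLES.items():
        ratio = sum(1 for c in cells if c in codes) / len(cells)
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= min_ratio else None
```

A column whose cells are mostly currency symbols would be assigned the hypothetical business type "currency"; a free-text column would return None and fall through to the later steps.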
Step S106: identify, by means of regular expressions, the parts of each data column that satisfy preset matching rules.
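A minimal sketch of the regular-expression recognition in step S106 is given below. The patent does not list its preset matching rules, so the patterns in `PRESET_RULES` and the function name `recognise_by_regex` are assumptions chosen for illustration.

```python
import re
from typing import Dict, List

# Hypothetical preset matching rules; the patent does not give concrete patterns.
PRESET_RULES: Dict[str, "re.Pattern"] = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "mobile_phone": re.compile(r"1\d{10}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
}


def recognise_by_regex(values: List[str]) -> Dict[str, float]:
    """For each rule, return the share of non-empty cells that fully match it."""
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return {name: 0.0 for name in PRESET_RULES}
    return {
        name: sum(1 for c in cells if pattern.fullmatch(c)) / len(cells)
        for name, pattern in PRESET_RULES.items()
    }
```

Returning a share per rule, rather than a single yes/no answer, is a design choice of this sketch; it lets a later step decide how much of a column must satisfy a rule before the column is treated as, say, a date column.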
Step S108: determine the column type of each data column by applying preset rules to the column data of each data column.
In the embodiments of the present invention, the column data type of each column can be identified from the column data of each data column; the column data types include a text type and a numeric type.
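One possible preset rule for step S108 is sketched below, under the assumption that a column counts as numeric when most of its non-empty cells parse as numbers. The threshold and the names `is_number` and `determine_column_type` are hypothetical; the patent does not specify the rule.

```python
from typing import Iterable


def is_number(cell: str) -> bool:
    """True if the cell parses as a float once thousands separators are removed."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def determine_column_type(values: Iterable[str], numeric_threshold: float = 0.9) -> str:
    """Return "numeric" if at least numeric_threshold of the non-empty cells
    are numbers, otherwise "text". The threshold is an assumed value."""
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return "text"
    share = sum(is_number(c) for c in cells) / len(cells)
    return "numeric" if share >= numeric_threshold else "text"
```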
Step S110: extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column and the column type of each data column.
In the embodiments of the present invention, the column feature vector comprises statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column. The statistical features of the column data comprise the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data. The basic attribute features of the data column comprise the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules.
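The statistical part of the column feature vector can be computed with the Python standard library, as in the sketch below. The patent names the statistics but gives no formulas, so the choices here are assumptions: population (not sample) moments, excess kurtosis, quartiles as the quantiles, and base-2 entropy over the distinct values of the column. The function name `statistical_features` is hypothetical.

```python
import math
from collections import Counter
from statistics import fmean, pvariance, quantiles
from typing import Dict, Sequence


def statistical_features(xs: Sequence[float]) -> Dict[str, float]:
    """Statistical features of a numeric column with at least two values."""
    n = len(xs)
    mean = fmean(xs)
    var = pvariance(xs, mu=mean)          # population variance
    std = math.sqrt(var)
    q1, median, q3 = quantiles(xs, n=4)   # quartiles
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    counts = Counter(xs)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "min": min(xs),
        "max": max(xs),
        "mean": mean,
        "variance": var,
        "q1": q1,
        "median": median,
        "q3": q3,
        "coefficient_of_variation": std / mean if mean else 0.0,
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / var ** 2 - 3 if var else 0.0,  # excess kurtosis
        "entropy": entropy,
    }
```

The resulting dictionary has a fixed number of entries regardless of how many rows the column holds, which is the sense in which a feature vector describes a column with a smaller data volume than the column itself.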
Step S112: identify the column feature vector of each data column using a pre-trained data identification model corresponding to the column type of that data column, and determine the column label of each data column.
In the embodiments of the present invention, the column labels that the data identification model determines by identifying the column feature vector of each column are set in advance according to actual needs; for example, a label may be population, area, GDP and so on.
Step S114: match the data columns based on the column label of each data column.
In the embodiments of the present invention, columns with the same label describe content that can be matched. For example, if column A is labelled population and column B is also labelled population, columns A and B are likely to be population data for different regions, and column A and column B can be combined.
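The label-based matching of step S114 can be sketched as a grouping of columns by label. The input layout (table name, column name, label) and the restriction to pairs from different tables are assumptions of this sketch, and the table and column names in the usage lines are made up.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, List, Tuple

Column = Tuple[str, str]  # (table name, column name)


def match_columns(labelled: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, Column, Column]]:
    """Pair up columns from different tables that share the same column label."""
    by_label = defaultdict(list)
    for table, column, label in labelled:
        by_label[label].append((table, column))
    matches = []
    for label, cols in by_label.items():
        for a, b in combinations(cols, 2):
            if a[0] != b[0]:  # only match columns across different tables
                matches.append((label, a, b))
    return matches


# Hypothetical labelled columns, as they might come out of step S112.
pairs = match_columns([
    ("t_census_a", "pop", "population"),
    ("t_census_b", "rk", "population"),
    ("t_census_b", "mj", "area"),
])
```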
In summary, the data matching method of this embodiment obtains multiple data tables to be matched, applies code-table code value matching and regular-expression recognition to each data column, determines the column type of each data column from its column data, extracts the column feature vector of each data column with the preset feature extraction model, identifies that vector with the pre-trained data identification model corresponding to the column type to determine the column label, and finally matches the data columns based on their column labels. The column feature vector is composed of the statistical features, descriptive features and basic attribute features described for step S110.
The method makes full use of the column name information and/or column annotation information of each column and of the statistical information of the column data, and combines them with conventional basic attributes of the column, such as its frequency of use and its importance marked in advance. Compared with existing data matching methods, the resulting feature vector captures the features of each column in multiple dimensions with a smaller data volume, so the amount of computation is greatly reduced. Because the data identification model is trained on a large number of data samples, the labels finally determined are more accurate. Compared with existing data identification methods, the amount of computation is greatly reduced while accuracy is maintained, and the efficiency of data matching is greatly improved, especially for government data with large data volumes.
As shown in Fig. 2, in one embodiment another data matching method is proposed. It differs from the data matching method shown in Fig. 1 in that, before step S110, it further includes:
Step S202: preprocess the column data of each data column based on a preset data preprocessing model.
In the embodiments of the present invention, the preprocessing includes completing missing data and extracting important data.
In the embodiments of the present invention, data may be missing or wrong when a database is built, which in serious cases affects the final matching accuracy. A preset data preprocessing model can therefore clean the data in terms of both data quality and content, for example by completing missing data, correcting outlying values, deleting wrongly formatted data, smoothing the column data and extracting important data. This reduces the effect of poor data quality on the recognition result and improves the overall recognition accuracy.
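Two of the cleaning operations named above, completing missing data and correcting outlying values, can be sketched as follows. The patent does not say how either is done; filling with the median and clipping to the 1.5 × IQR fences are techniques substituted here for illustration, and `preprocess_numeric` is a hypothetical name.

```python
from statistics import median, quantiles
from typing import List, Optional, Sequence


def preprocess_numeric(values: Sequence[Optional[float]]) -> List[float]:
    """Fill missing cells (None) with the median of the present values and
    clip outliers to the 1.5 * IQR fences. Both rules are assumptions."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        # Too few values to estimate quartiles: only fill the gaps.
        fill = present[0] if present else 0.0
        return [fill if v is None else v for v in values]
    fill = median(present)
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(fill if v is None else v, low), high) for v in values]


cleaned = preprocess_numeric([10, 12, None, 11, 13, 12, 11, 10, 13, 1000])
```

In the usage line, the missing cell is replaced by the median and the value 1000 is pulled back to the upper fence, so neither distorts the statistics later extracted in step S110.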
Compared with the data matching method of Fig. 1, this data matching method preprocesses the column data of each column with a preset data preprocessing model before the feature vectors are extracted. This effectively improves data quality, reduces the effect of poor data quality on the recognition result, and improves the accuracy of data matching.
As shown in Fig. 3, in one embodiment yet another data matching method is proposed. It differs from the data matching method shown in Fig. 1 in that, before step S114, it further includes:
Step S302: extract, according to preset rules, at least one column label determined by the data identification model.
In the embodiments of the present invention, the label results are displayed using visualization techniques, which makes it convenient for business personnel to check the determined column labels.
Step S304: judge whether the column label determined by the identification is accurate. When the column label is judged to be inaccurate, execute step S306; when the column label is judged to be accurate, execute the other steps.
In the embodiments of the present invention, when the column label determined by the data identification model is accurate, the classification accuracy of the data identification model is high, and the columns can be matched directly based on their labels. The "other steps" are generally the matching of the columns based on the label of each column.
Step S306: modify the column label, and optimize the data identification model according to the modified column label.
In the embodiments of the present invention, when the column label determined by the data identification model is inaccurate, the data identification model has not yet been fully optimized and still has some error, so it needs to be optimized. By revising the column label to the correct label, the data identification model can be optimized in reverse, which further improves its accuracy.
Compared with the data matching method of Fig. 1, this data matching method displays the labels using visualization techniques after label identification, to help business personnel check the label results, and further judges whether the labels determined by the identification are accurate. When a label is judged to be inaccurate, the data identification model can be optimized in reverse, which further improves its accuracy.
As shown in Fig. 4, in one embodiment a method is proposed for identifying the column feature vector of each column based on the data type of the column data, which specifically includes the following steps:
Step S402: use a numeric data identification model, trained in advance based on the random forest algorithm, to identify the column feature vectors of the columns whose data type is numeric, and determine their column labels.
In the embodiments of the present invention, the random forest algorithm is a method that uses multiple decision trees to train on and predict samples. Each decision tree contains multiple split nodes and can predict and output a label from the feature vector of a sample; the label predicted by the largest number of decision trees in the random forest is taken as the final label.
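The voting rule described above can be sketched in a few lines. The three "trees" below are hand-written, hypothetical stand-ins with made-up thresholds, not trained decision trees; the sketch only shows how the most frequent prediction becomes the final label.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Tree = Callable[[Dict[str, float]], str]

# Hypothetical stand-ins for trained decision trees: each inspects the
# column feature vector and outputs a label.
TREES: Sequence[Tree] = (
    lambda f: "population" if f["mean"] > 10_000 else "area",
    lambda f: "population" if f["min"] >= 0 and f["entropy"] > 3 else "GDP",
    lambda f: "area" if f["max"] < 50_000 else "population",
)


def forest_predict(trees: Sequence[Tree], features: Dict[str, float]) -> str:
    """Return the label predicted by the largest number of trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


label = forest_predict(
    TREES, {"mean": 250_000.0, "min": 1_200.0, "max": 900_000.0, "entropy": 5.1}
)
```

In practice the trees would be learned from labelled sample columns, for example with a library implementation of random forests, rather than written by hand.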
In the embodiments of the present invention, for the steps of training the numeric data identification model based on the random forest algorithm, refer to Fig. 5 and its description.
Step S404: use a text data identification model, trained in advance based on the naive Bayes algorithm, to identify the column feature vectors of the columns whose data type is text, and determine their column labels.
In the embodiments of the present invention, the naive Bayes algorithm is an algorithm based on Bayes' theorem. It classifies well when the features of the data are relatively independent of one another, and is commonly used for classifying text data.
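As an illustration of naive Bayes classification on text, the sketch below implements a small multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is an assumption that tokens from column names serve as the input; the patent applies the model to column feature vectors and does not describe the encoding. The class name `NaiveBayesText` and the training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence


class NaiveBayesText:
    """Multinomial naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs: Sequence[Sequence[str]], labels: Sequence[str]) -> "NaiveBayesText":
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens: Sequence[str]) -> str:
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods (independence assumption)
            score = math.log(n / total) + sum(
                math.log((counts[t] + 1) / denom) for t in tokens
            )
            if score > best_score:
                best, best_score = label, score
        return best


docs: List[List[str]] = [
    ["resident", "population"], ["population", "count"],
    ["land", "area"], ["area", "km2"],
]
model = NaiveBayesText().fit(docs, ["population", "population", "area", "area"])
```

The per-token product in `predict` is exactly where the independence assumption mentioned above enters: each token contributes to the score without regard to the others.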
As shown in Fig. 5, in one embodiment a method is proposed for training the numeric data identification model based on the random forest algorithm, which specifically includes the following steps:
Step S502: obtain multiple data sample tables.
In the embodiments of the present invention, each data sample table contains the sample column name information and/or sample column annotation information of multiple sample columns and the sample column data of each sample column.
Step S504: obtain the target label of each sample column.
In the embodiments of the present invention, the target labels of the sample columns are known in advance.
Step S506: determine the sample column data type of each sample column by applying regular-expression recognition to the sample column data of each sample column.
In the embodiments of the present invention, the process of training with data samples must be exactly the same as the process of identifying data. The regular-expression recognition used in step S506 is therefore the same as that used in step S106 above.
Step S508: extract the sample column feature vector of each sample column using the preset feature extraction model, according to the sample column name information and/or sample column annotation information of each sample column, the sample column data of each sample column and the sample column data type of each sample column.
In the embodiments of the present invention, the sample column feature vector comprises statistical features of the sample column data, descriptive features of the sample column name and/or sample column annotation information, and basic attribute features of the sample column. The statistical features of the sample column data comprise the value range, mean and variance of the sample column data. The basic attribute features of the sample column comprise the frequency of use of the sample column data, the data type of the sample column data, and the importance of the sample column data determined in advance according to preset rules.
In the embodiments of the present invention, likewise, the process of training with data samples must be exactly the same as the process of identifying data. The feature extraction model used in step S508 is therefore the same as that used in step S110 above.
Step S510: establish, based on the random forest algorithm, an initialized numeric data identification model containing variable parameters.
Step S512: determine the response label of each sample column according to the sample column feature vector of each sample column and the numeric data identification model.
In the embodiments of the present invention, the numeric data identification model can be understood as a functional relationship between an independent variable, the column feature vector, and a dependent variable, the label: feeding the column feature vector into the function determines the label.
Step S514: judge whether the response labels and the target labels satisfy a preset training success condition. When the response labels and the target labels do not satisfy the preset training success condition, execute step S516; when they do, execute step S518.
In the embodiments of the present invention, the preset training success condition may be that the number of training rounds reaches a preset value, or that the difference between the response labels and the target labels is smaller than a given threshold.
Step S516: adjust the variable parameters in the numeric data identification model, and return to step S510.
Step S518: take the current numeric data identification model as the numeric data identification model trained based on the random forest algorithm.
In the embodiments of the present invention, when the response labels and the target labels are judged to satisfy the preset training success condition, the numeric data identification model can be regarded as preliminarily complete and can output highly accurate column labels from column feature vectors.
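The control flow of steps S510 to S518 can be sketched as the loop below. The model is deliberately a placeholder: a single threshold rule whose one variable parameter is redrawn at random each round, not a random forest. The names `build_stump` and `train`, the accuracy threshold and the round limit are all assumptions; the sketch only illustrates the two success conditions, a round limit and a small enough gap between response labels and target labels.

```python
import random
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], str]


def build_stump(rng: random.Random) -> Model:
    """Placeholder model with one variable parameter (a threshold)."""
    threshold = rng.uniform(0, 10)
    return lambda x: "large" if x[0] > threshold else "small"


def train(samples: List[Sequence[float]], targets: List[str],
          build_model: Callable[[random.Random], Model],
          max_rounds: int = 50, min_accuracy: float = 0.95,
          seed: int = 0) -> Tuple[Model, float, int]:
    rng = random.Random(seed)
    best_model, best_acc, rounds = None, -1.0, 0
    for rounds in range(1, max_rounds + 1):
        model = build_model(rng)                      # S510 / S516: (re)set parameters
        responses = [model(x) for x in samples]       # S512: response labels
        acc = sum(r == t for r, t in zip(responses, targets)) / len(targets)
        if acc > best_acc:
            best_model, best_acc = model, acc
        if acc >= min_accuracy:                       # S514: success condition met
            break
    return best_model, best_acc, rounds               # S518: keep the current model


samples = [[1.0], [2.0], [8.0], [9.0]]
targets = ["small", "small", "large", "large"]
model, accuracy, rounds = train(samples, targets, build_stump)
```

A real implementation would replace `build_stump` with the construction of a random forest whose variable parameters (for example the number of trees or the tree depth) are adjusted between rounds.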
As shown in fig. 6, in one embodiment it is proposed that a kind of data matching device, details are as follows.
In embodiments of the present invention, the data matching device includes:
Tables of data acquiring unit 610, for obtaining multiple tables of data to be matched.
In embodiments of the present invention, acquisition tables of data to be matched can derive from different databases, such as often
Data acquisition can be realized by input data path in Oracle, SQL, A Liyun, the Hadoop etc. seen, and will be from different numbers
Unification is carried out according to the format of the data obtained in library.
In embodiments of the present invention, the column name information and/or column annotation information comprising multiple data column in the tables of data
And the column data of each data column.
Code table code value matching unit 620, configured to match the column data of each data column against code table code values.
In this embodiment of the present invention, some special symbols present in the data tables, such as currency symbols, can be matched using the code table of code values, which makes it convenient to determine the business type of the data column.
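A minimal sketch of code table matching is given below. The code table contents, the business type names, and the function name `match_code_values` are assumptions for illustration; the patent does not specify the structure of the code table.

```python
# Hypothetical code table: special symbol -> business type.
CODE_TABLE = {
    "¥": "currency",
    "$": "currency",
    "€": "currency",
    "%": "percentage",
}


def match_code_values(column_values, code_table=CODE_TABLE):
    """Return the business type whose symbols occur in the most cells, or None."""
    counts = {}
    for value in column_values:
        text = str(value)
        for symbol, business_type in code_table.items():
            if symbol in text:
                counts[business_type] = counts.get(business_type, 0) + 1
                break  # count each cell at most once
    if not counts:
        return None
    return max(counts, key=counts.get)
```

For example, a column holding values such as "¥120" and "¥35.5" would be assigned the assumed business type "currency", while a column without any listed symbol yields None and is left to the later steps.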
Regular expression recognition unit 630, configured to use regular expressions to identify the parts of each data column that conform to preset matching rules.
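As a hedged illustration of regular expression recognition, the sketch below reports what fraction of a column's cells fully match each preset pattern. The two patterns (an ISO-style date and an 11-digit mobile number) and the name `regex_match_ratios` are assumptions; the patent does not list its matching rules.

```python
import re

# Hypothetical preset matching rules.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "mobile_phone": re.compile(r"1\d{10}"),
}


def regex_match_ratios(values, patterns=PATTERNS):
    """Return, per pattern, the fraction of cells that fully match it."""
    cells = [str(v).strip() for v in values]
    total = len(cells)
    ratios = {}
    for name, pattern in patterns.items():
        hits = sum(1 for cell in cells if pattern.fullmatch(cell))
        ratios[name] = hits / total if total else 0.0
    return ratios
```

A column whose ratio for one pattern is close to 1.0 can then be treated as conforming to that rule.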
Column type determining unit 640, configured to identify and determine the column type of each data column from its column data using preset rules.
In this embodiment of the present invention, the column data type of each column can be identified from the column data of that data column; the column data types include text type and numeric type.
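One possible preset rule for separating numeric columns from text columns is sketched below: a column counts as numeric when most of its non-empty cells parse as numbers. The 0.9 threshold, the handling of thousands separators, and the name `determine_column_type` are assumptions rather than details from the patent.

```python
def determine_column_type(values, numeric_threshold=0.9):
    """Classify a column as 'numeric' or 'text' from its cell values."""
    non_empty = [str(v).strip() for v in values
                 if v is not None and str(v).strip() != ""]
    if not non_empty:
        return "text"  # assumed default for empty columns
    numeric = 0
    for cell in non_empty:
        try:
            float(cell.replace(",", ""))  # tolerate "1,200"
            numeric += 1
        except ValueError:
            pass
    ratio = numeric / len(non_empty)
    return "numeric" if ratio >= numeric_threshold else "text"
```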
Column feature vector extraction unit 650, configured to extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column.
In this embodiment of the present invention, the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column. The statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data. The basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules.
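The statistical features listed above can be computed for a numeric column roughly as follows. This is a sketch under assumptions: population (not sample) moments are used, kurtosis is reported as excess kurtosis, entropy is taken over the empirical distribution of distinct values, and the function name and dictionary keys are invented for the example.

```python
import math
import statistics
from collections import Counter


def statistical_features(values):
    """Compute the statistical part of a column feature vector (needs >= 2 values)."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)  # population variance
    std = math.sqrt(variance)
    q1, median, q3 = statistics.quantiles(values, n=4)
    if std > 0:
        skewness = sum((x - mean) ** 3 for x in values) / (n * std ** 3)
        kurtosis = sum((x - mean) ** 4 for x in values) / (n * std ** 4) - 3.0
    else:
        skewness = kurtosis = 0.0  # constant column
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "variance": variance,
        "q1": q1,
        "median": median,
        "q3": q3,
        "coefficient_of_variation": std / mean if mean != 0 else 0.0,
        "kurtosis": kurtosis,
        "skewness": skewness,
        "entropy": entropy,
    }
```

Because the result has a fixed, small number of entries regardless of how many rows the column holds, later identification steps work on this summary rather than on the raw column data.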
Column label determination unit 660, configured to identify, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and to determine the column label of each data column.
In this embodiment of the present invention, the column labels that the data identification model assigns after identifying the column feature vector of each column are set in advance according to actual needs; for example, a label may be population, area, GDP, and so on.
Data matching unit 670, configured to match the data columns based on the column label of each data column.
In this embodiment of the present invention, columns with the same label describe content that can be matched. For example, if the label of column A is population and the label of column B is also population, then columns A and B are likely to be population data for different regions, and column A and column B can be merged.
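Label-based matching can be sketched as grouping columns by their label and keeping the labels shared by at least two columns. The input layout, a mapping from (table name, column name) to label, and the name `match_columns_by_label` are assumptions for illustration.

```python
from collections import defaultdict


def match_columns_by_label(column_labels):
    """Group columns by label and keep the groups containing two or more columns.

    column_labels maps (table_name, column_name) -> column label.
    """
    groups = defaultdict(list)
    for column, label in column_labels.items():
        groups[label].append(column)
    return {label: cols for label, cols in groups.items() if len(cols) >= 2}
```

Each returned group is a set of candidate columns for merging; whether they are actually merged is a separate decision.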
In the data matching device provided by this embodiment of the present invention, after multiple data tables to be matched are obtained, code table code value matching and regular expression recognition are performed on each data column, and the column type of each data column is then determined from its column data. According to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, a preset feature extraction model extracts the column feature vector of each data column, which includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column. The statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data; the basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules. Then, based on the column type of each data column, the pre-trained data identification model corresponding to that column type identifies the column feature vector of each data column and determines its column label, and finally the data columns are matched based on their column labels. The data matching method provided by this embodiment of the present invention makes full use of the column name information and/or column annotation information of each column and the statistical information of the column data, and combines them with the conventional basic attribute features of the column, such as usage frequency and the importance marked in advance, to determine the column feature vector. Compared with existing data matching methods, the feature vector extracted from this information captures the features of each column in multiple dimensions with a smaller data volume, so the amount of computation is greatly reduced. Subsequently, based on the data type of the column data, the pre-trained data identification model corresponding to that data type identifies the column feature vector of each column. Because the data identification model is trained on a large number of data samples, the labels finally determined are more accurate. Compared with existing data identification methods, the amount of computation is greatly reduced while accuracy is maintained; for government data in particular, where data volumes are large, the efficiency of data matching is greatly improved.
As shown in Fig. 7, in one embodiment, another data matching device is provided. It differs from the data matching device shown in Fig. 6 in that it further includes:
Data preprocessing unit 710, configured to preprocess the column data of each data column based on a preset data preprocessing model.
In this embodiment of the present invention, the preprocessing includes completing missing data and extracting important data.
In this embodiment of the present invention, data may be missing or erroneous when a database is built, which in serious cases affects the accuracy of the final matching. Therefore, a preset data preprocessing model can be used to clean the data in terms of both data quality and content, for example by completing missing data, correcting outliers, deleting data with format errors, smoothing the column data, and extracting important data. This reduces the impact of poor data quality on the recognition results and improves the overall recognition accuracy.
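Two of the cleaning operations named above, completing missing values and correcting outliers, might look like the sketch below for a numeric column. Filling with the median and clipping at three standard deviations are assumed choices; the patent does not state which completion or correction rules its preprocessing model uses, and `preprocess_numeric_column` is a hypothetical name.

```python
import statistics


def preprocess_numeric_column(values, z_limit=3.0):
    """Fill missing cells (None) with the median, then clip outliers."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)
    if std == 0:
        return filled
    low, high = mean - z_limit * std, mean + z_limit * std
    # Pull values outside [low, high] back to the nearest bound.
    return [min(max(v, low), high) for v in filled]
```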
Compared with the data matching device provided in Fig. 6, this other data matching device provided by the embodiment of the present invention preprocesses the column data of each column with a preset data preprocessing model before the feature vectors are extracted. This effectively improves data quality, reduces the impact of poor data quality on the recognition results, and improves the accuracy of data matching.
As shown in Fig. 8, in one embodiment, another data matching device is provided. It differs from the data matching device shown in Fig. 6 in that it further includes:
Column label extraction unit 810, configured to extract, according to preset rules, at least one column label determined by the data identification model.
In this embodiment of the present invention, the label results are displayed using visualization technology, which makes it convenient for business personnel to check the determined column labels.
Column label judging unit 820, configured to judge whether the identified column label is accurate.
In this embodiment of the present invention, when the column labels determined by the data identification model are accurate, the classification accuracy of the data identification model is high, and the data columns can be matched directly based on their labels; the other step to be executed is generally the matching of the data columns based on their labels.
Data identification model optimization unit 830, configured to modify the column label when the identified column label is judged to be inaccurate, and to optimize the data identification model according to the modified column label.
In this embodiment of the present invention, when the column labels determined by the data identification model are inaccurate, the data identification model has not been fully optimized and still contains some error, so it needs to be optimized. By changing the column label to the correct label, the data identification model can be optimized in reverse, which further improves its accuracy.
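The review-and-correct step can be sketched as follows: reviewers supply corrected labels for the sampled columns, the corrections override the model output, and each corrected example is queued for retraining. The data structures and the name `apply_review` are assumptions; how the patent's model is actually re-optimized from the corrected labels is not specified here.

```python
def apply_review(predicted, corrections, retraining_queue):
    """Override inaccurate labels and queue the corrected examples for retraining.

    predicted:    {column: label from the data identification model}
    corrections:  {column: label confirmed by a reviewer}
    """
    final = dict(predicted)
    for column, correct_label in corrections.items():
        if final.get(column) != correct_label:
            final[column] = correct_label
            retraining_queue.append((column, correct_label))
    return final
```

Columns whose reviewed label agrees with the prediction are left untouched and add nothing to the queue.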
Compared with the data matching device provided in Fig. 6, this other data matching device provided by the embodiment of the present invention displays the labels using visualization technology after label identification, to help business personnel check the label results, and further judges whether the identified labels are accurate. When a result is judged to be inaccurate, the data identification model can be optimized in reverse, which further improves the accuracy of the data identification model.
In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor. When executing the computer program, the processor performs the following steps:
Obtaining multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
Matching the column data of each data column against code table code values;
Using regular expressions to identify the parts of each data column that conform to preset matching rules;
Identifying and determining the column type of each data column from its column data using preset rules, the column types including numeric type and text type;
Extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data; and the basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
Identifying, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column;
Matching the data columns based on the column label of each data column.
In one embodiment, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor performs the following steps:
Obtaining multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
Matching the column data of each data column against code table code values;
Using regular expressions to identify the parts of each data column that conform to preset matching rules;
Identifying and determining the column type of each data column from its column data using preset rules, the column types including numeric type and text type;
Extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data; and the basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
Identifying, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column;
Matching the data columns based on the column label of each data column.
It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, these steps are not necessarily executed in that sequence. Unless expressly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the patent scope of the present invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A data matching method, characterized in that the method comprises:
Obtaining multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
Matching the column data of each data column against code table code values;
Using regular expressions to identify the parts of each data column that conform to preset matching rules;
Identifying and determining the column type of each data column from its column data using preset rules, the column types including numeric type and text type;
Extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data; and the basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
Identifying, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column;
Matching the data columns based on the column label of each data column.
2. The data matching method according to claim 1, characterized in that, before the step of extracting the column feature vector of each data column using a preset feature extraction model according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, the method further comprises:
Preprocessing the column data of each data column based on a preset data preprocessing model, the preprocessing including completing missing data and extracting important data.
3. The data matching method according to claim 1, characterized in that, before the step of matching the data columns based on the column label of each data column, the method further comprises:
Extracting, according to preset rules, at least one column label determined by the data identification model;
Judging whether the identified column label is accurate;
When the identified column label is judged to be inaccurate, modifying the column label and optimizing the data identification model according to the modified column label.
4. The data matching method according to claim 1, characterized in that the step of identifying, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column, specifically comprises:
Identifying the column feature vectors of columns whose data type is numeric using a numeric data identification model generated in advance by training based on the random forest algorithm, and determining the column labels;
Identifying the column feature vectors of columns whose data type is text using a text data identification model generated in advance by training based on the naive Bayes algorithm, and determining the column labels.
5. The data matching method according to claim 4, characterized in that the step of generating the numeric data identification model by training based on the random forest algorithm specifically comprises:
Obtaining multiple data sample tables, each data sample table containing multiple items of sample column name information and/or sample column annotation information and the sample column data of each sample column;
Obtaining the target label of each sample column;
Identifying and determining the sample column data type of each sample column using regular expressions, based on the sample column data of each sample column;
Extracting the sample column feature vector of each sample column using a preset feature extraction model, according to the sample column name information and/or sample column annotation information of each sample column, the sample column data of each sample column, and the sample column data type of each sample column, wherein the sample column feature vector includes statistical features of the sample column data, descriptive features of the sample column name and/or sample column annotation information, and basic attribute features of the sample column; the statistical features of the sample column data include the value range, mean, and variance of the sample column data; and the basic attribute features of the sample column include the usage frequency of the sample column data, the data type of the sample column data, and the importance of the sample column data determined in advance according to preset rules;
Establishing, based on the random forest algorithm, an initialized numeric data identification model containing variable parameters;
Determining the response label of each sample column according to the sample column feature vector of each sample column and the numeric data identification model;
Judging whether the response labels and the target labels satisfy the preset training success condition;
When the response labels and the target labels are judged not to satisfy the preset training success condition, adjusting the variable parameters in the numeric data identification model, and returning to the step of determining the response label of each sample column according to the sample column feature vector of each sample column and the numeric data identification model;
When the response labels and the target labels are judged to satisfy the preset training success condition, determining the current numeric data identification model as the numeric data identification model generated by training based on the random forest algorithm.
6. A data matching device, characterized by comprising:
A data table acquiring unit, configured to obtain multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
A code table code value matching unit, configured to match the column data of each data column against code table code values;
A regular expression recognition unit, configured to use regular expressions to identify the parts of each data column that conform to preset matching rules;
A column type determining unit, configured to identify and determine the column type of each data column from its column data using preset rules, the column types including numeric type and text type;
A column feature vector extraction unit, configured to extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data; and the basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
A column label determination unit, configured to identify, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and to determine the column label of each data column;
A data matching unit, configured to match the data columns based on the column label of each data column.
7. The data matching device according to claim 6, characterized by further comprising:
A data preprocessing unit, configured to preprocess the column data of each data column based on a preset data preprocessing model, the preprocessing including completing missing data and extracting important data.
8. The data matching device according to claim 6, characterized by further comprising:
A column label extraction unit, configured to extract, according to preset rules, at least one column label determined by the data identification model;
A column label judging unit, configured to judge whether the identified column label is accurate;
A data identification model optimization unit, configured to modify the column label when the identified column label is judged to be inaccurate, and to optimize the data identification model according to the modified column label.
9. A computer device, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor performs the steps of the data matching method according to any one of claims 1 to 5.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor performs the steps of the data matching method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910664541.7A CN110427992A (en) | 2019-07-23 | 2019-07-23 | Data matching method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910664541.7A CN110427992A (en) | 2019-07-23 | 2019-07-23 | Data matching method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110427992A true CN110427992A (en) | 2019-11-08 |
Family
ID=68411857
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910664541.7A Pending CN110427992A (en) | 2019-07-23 | 2019-07-23 | Data matching method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110427992A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106407357A (en) * | 2016-09-07 | 2017-02-15 | 深圳市中易科技有限责任公司 | Engineering method for developing text data rule model |
CN107851233A (en) * | 2015-06-19 | 2018-03-27 | 阿普泰克科技公司 | Local analytics at assets |
CN108537207A (en) * | 2018-04-24 | 2018-09-14 | Oppo广东移动通信有限公司 | Lip reading recognition methods, device, storage medium and mobile terminal |
CN109299094A (en) * | 2018-09-18 | 2019-02-01 | 深圳壹账通智能科技有限公司 | Tables of data processing method, device, computer equipment and storage medium |
CN109597892A (en) * | 2018-12-25 | 2019-04-09 | 杭州数梦工场科技有限公司 | Classification method, device, equipment and the storage medium of data in a kind of database |
CN109635118A (en) * | 2019-01-10 | 2019-04-16 | 博拉网络股份有限公司 | A kind of user's searching and matching method based on big data |
CN109887285A (en) * | 2019-03-15 | 2019-06-14 | 北京经纬恒润科技有限公司 | A kind of determination method and device for reason of stopping |
-
2019
- 2019-07-23 CN CN201910664541.7A patent/CN110427992A/en active Pending
Non-Patent Citations (1)
Title |
---|
吴家碚等: "《C语言程序设计与应用(高职)》", 31 January 2015 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929285A (en) * | 2019-12-10 | 2020-03-27 | 支付宝(杭州)信息技术有限公司 | Method and device for processing private data |
CN110929285B (en) * | 2019-12-10 | 2022-01-25 | 支付宝(杭州)信息技术有限公司 | Method and device for processing private data |
CN111104466A (en) * | 2019-12-25 | 2020-05-05 | 航天科工网络信息发展有限公司 | Method for rapidly classifying massive database tables |
CN113127509B (en) * | 2019-12-31 | 2023-08-15 | 中国移动通信集团重庆有限公司 | Method and device for adapting SQL execution engine in PaaS platform |
CN113127509A (en) * | 2019-12-31 | 2021-07-16 | 中国移动通信集团重庆有限公司 | Method and device for adapting SQL execution engine in PaaS platform |
CN112162978A (en) * | 2020-10-30 | 2021-01-01 | 杭州安恒信息安全技术有限公司 | Data lineage detection method and device, electronic equipment and readable storage medium |
WO2022123370A1 (en) * | 2020-12-11 | 2022-06-16 | International Business Machines Corporation | Finding locations of tabular data across systems |
US11500886B2 (en) | 2020-12-11 | 2022-11-15 | International Business Machines Corporation | Finding locations of tabular data across systems |
GB2616577A (en) * | 2020-12-11 | 2023-09-13 | Ibm | Finding locations of tabular data across systems |
CN113157788A (en) * | 2021-04-13 | 2021-07-23 | 福州外语外贸学院 | Big data mining method and system |
CN113157788B (en) * | 2021-04-13 | 2024-02-13 | 福州外语外贸学院 | Big data mining method and system |
CN113076379A (en) * | 2021-04-27 | 2021-07-06 | 上海德衡数据科技有限公司 | Method and system for distinguishing element number areas based on digital ICD |
CN113312354B (en) * | 2021-06-10 | 2023-07-28 | 北京百度网讯科技有限公司 | Data table identification method, device, equipment and storage medium |
CN113312354A (en) * | 2021-06-10 | 2021-08-27 | 北京百度网讯科技有限公司 | Data table identification method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110427992A (en) | Data matching method, device, computer equipment and storage medium | |
CN110704633B (en) | Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium | |
CN109635117B (en) | Method and device for recognizing user intention based on knowledge graph | |
CN111444723B (en) | Information extraction method, computer device, and storage medium | |
CN111368049B (en) | Information acquisition method, information acquisition device, electronic equipment and computer readable storage medium | |
CN108959242B (en) | Target entity identification method and device based on part-of-speech characteristics of Chinese characters | |
CN111160017A (en) | Keyword extraction method, phonetics scoring method and phonetics recommendation method | |
CN110795919A (en) | Method, device, equipment and medium for extracting table in PDF document | |
CN109858010A (en) | Field new word identification method, device, computer equipment and storage medium | |
US20140351228A1 (en) | Dialog system, redundant message removal method and redundant message removal program | |
CN111814482B (en) | Text key data extraction method and system and computer equipment | |
CN108573707B (en) | Method, device, equipment and medium for processing voice recognition result | |
CN110427612B (en) | Entity disambiguation method, device, equipment and storage medium based on multiple languages | |
CN112287095A (en) | Method and device for determining answers to questions, computer equipment and storage medium | |
JP2019503541A (en) | An annotation system for extracting attributes from electronic data structures | |
CN112699923A (en) | Document classification prediction method and device, computer equipment and storage medium | |
CN110795942B (en) | Keyword determination method and device based on semantic recognition and storage medium | |
CN112347997A (en) | Test question detection and identification method and device, electronic equipment and medium | |
CN110532229B (en) | Evidence file retrieval method, device, computer equipment and storage medium | |
KR102185733B1 (en) | Server and method for automatically generating profile | |
CN111581346A (en) | Event extraction method and device | |
CN110727743A (en) | Data identification method and device, computer equipment and storage medium | |
CN113420116B (en) | Medical document analysis method, device, equipment and medium | |
CN110750626B (en) | Scene-based task-driven multi-turn dialogue method and system | |
CN113849644A (en) | Text classification model configuration method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20191108 |