CN101576906B

CN101576906B - Database schema reconstructing system and method

Info

Publication number: CN101576906B
Application number: CN200910078789.1A
Authority: CN
Inventors: 何军; 杜小勇; 刘红岩; 胡泊
Original assignee: 杜小勇
Priority date: 2009-03-03
Filing date: 2009-03-03
Publication date: 2011-03-30
Anticipated expiration: 2029-03-03
Also published as: CN101576906A

Abstract

The invention relates to a database schema reconstructing system and method. The system comprises a multi-relational database storing a plurality of multi-relational data tables, a data preprocessor, and classification application equipment. The method comprises the following steps: A. constructing the relation between the attributes and the class in each table; B. calculating the correlation value between each attribute and the class in a single table so as to select an attribute subset of that table; C. calculating the correlation value between the attribute subset of each table and the class; D. sorting the tables in descending order of the correlation value between each table and the class; and E. recalling any attribute left unselected in step B whose correlation value with the class is larger than the minimum of the correlation values between the tables' attribute subsets and the class.

Description

A database schema reconstructing system and method

Technical field

The present invention relates to the field of computer databases and data mining, and in particular to a database schema reconstructing system and method that improves multi-relational classification.

Background technology

In today's information age, much of the data on the Internet is growing at a surprising rate. Data mining relies on methods such as classification and clustering to extract useful patterns from data. Many machine learning and data mining methods face the same problem: the curse of dimensionality. On high-dimensional data sets most methods become inefficient: accuracy drops and processing time increases. As early as 1970, computer scientists proposed attribute selection as a simple and efficient way to address the curse of dimensionality. Attribute selection is in effect a preprocessing step for a learning method: it removes irrelevant and redundant information as far as possible and thereby improves the quality of the data set. A large body of experimental results shows that applying attribute selection to the training set benefits both the efficiency and the accuracy of the learning algorithm.

However, two problems remain:

(1) For multi-relational databases, no attribute selection method fits the multi-relational structure. Because of this mismatch, applying a single-table attribute selection method directly to a multi-relational database is neither successful nor reasonable. Yet multi-relational databases are ubiquitous in daily life and are gradually becoming the prevailing format for storing data, so the lack of an attribute selection method for such databases reduces the efficiency of applications such as classification and clustering.

(2) Even if a multi-table database is processed with a conventional single-table attribute selection method, efficiency remains low and processing time long. The reason is that some tables in a multi-relational database are unnecessary: they have little relation to the class and only waste effort during the classifier's search. Therefore, not only the attributes within each table but also the tables themselves must be selected. Only then can the efficiency of the classification application be improved.

Summary of the invention

The present invention has been made in view of the above technical problems. An object of the present invention is to propose a database schema reconstructing system and method that improves multi-relational classification.

In one aspect, the database schema reconstructing system according to the present invention comprises: a multi-relational database for storing a number of multi-relational data tables; a data preprocessor for performing attribute selection and table selection on the multi-relational data in those tables so as to reconstruct the database; and classification application equipment for training on the reconstructed multi-relational database and predicting new data with the rules produced.

In this aspect, the data preprocessor further comprises: a construction module for constructing the relation between the attributes of each table and the class; an attribute selection module for calculating the correlation value between each attribute in a single table and the class so as to select the attribute subset of that table; a relation calculation module for calculating the correlation value between the attribute subset of each table and the class; a sorting module for sorting the tables in descending order of the correlation value between each table and the class; and a recall module for recalling any attribute not selected by the attribute selection module whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.

In one aspect, the database schema reconstructing method according to the present invention comprises the steps of: A. constructing the relation between the attributes and the class in each table; B. calculating the correlation value between each attribute in a single table and the class so as to select the attribute subset of that table; C. calculating the correlation value between the attribute subset of each table and the class; D. sorting the tables in descending order of the correlation value between each table and the class; and E. recalling any attribute not selected in step B whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.

In this aspect, in step B the correlation value between each attribute and the class is calculated by the following formula, where InformationGain is the information gain between attributes X and Y, and H (X) is the entropy of attribute X.

SU (X, Y) = 2 [\frac{InformationGain (X | Y)}{H (X) + H (Y)}]

In this aspect, in step C the correlation value between the attribute subset of each table and the class is calculated by dividing the mean correlation between the attributes in the attribute subset and the class by the mean correlation between the attributes.

With the present invention, the structure of the multi-relational database is transformed so that the tables are arranged linearly according to the strength of their relation to the class. The classification application therefore finds the tables and attributes relevant to the class more quickly and the search space is reduced, which shortens classification time.

Description of drawings

The above and other objects, features and advantages of the present invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a block diagram of the database schema reconstructing system according to the present invention;

Fig. 2 is a more detailed block diagram of the data preprocessor according to the present invention;

Fig. 3 is a flow chart of the database schema reconstructing method according to the present invention;

Fig. 4 shows an example of a multi-relational database according to the present invention;

Fig. 5 shows an example of a multi-relational database according to the present invention;

Fig. 6 shows an example of the database after reconstruction according to the present invention.

Embodiment

As discussed in detail below, the disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. In a preferred embodiment, the disclosure is implemented in software, which includes but is not limited to firmware, resident software, microcode and the like.

For a more complete understanding of the present invention and its advantages, the invention is described in further detail below with reference to the drawings and specific embodiments.

First, the database schema reconstructing system according to the present invention is described with reference to Fig. 1.

As shown in Fig. 1, the database schema reconstructing system comprises a multi-relational database, a data preprocessor and classification application equipment.

The multi-relational database is used to store a number of multi-relational data tables. Fig. 4 shows an example of a multi-relational database. In this example the multi-relational database is a financial database with 8 tables, linked to one another by primary and foreign keys. As shown in Fig. 5, the tables loan and account are linked through the attribute account_id, which serves as the primary/foreign key. The target table is the loan table and the target attribute is status, which has two values: yes means the loan was repaid on schedule, and no means it was not.

The data preprocessor performs attribute selection and table selection on the multi-relational data in the multi-relational data tables so as to reconstruct the database. The data preprocessor is described in more detail below with reference to Fig. 2.

The classification application equipment is a multi-table classifier. In a multi-relational environment it trains, from the original data, a classifier that can automatically predict the class of new data. Without the database schema reconstructing method, the classification application equipment can still classify the existing database, but its performance falls short. For example, in Fig. 4 the classification application equipment first processes the table loan, then the four tables trans, account, disposition and order, and so on. The performance shortfall shows as a long training time and poor prediction accuracy on new records after training. The reasons are: first, under the original database schema the classification application equipment processes more than one table at a time, which increases the time needed to train the rules; second, the order in which it processes the tables is not optimal, so the rules obtained are not optimal, which lowers the accuracy of predicting the class of new records. After the database schema is reconstructed, both defects are mitigated: first, the reconstructed schema is a linked-list structure, so the classification application equipment processes only one table at a time and training time is shortened; second, the reconstructed database is ordered by correlation with the class, so the tables most relevant to the class are processed first and better rules are obtained than before. For example, in the original schema the table district is far from the table loan, so the following rule may be missed: district.avg_salary < 10000 => label = no, that is, an account in a region whose per capita income is below 10,000 yuan will not repay the loan on schedule. This rule is in fact very important and helps to improve prediction accuracy.

Next, the data preprocessor according to the present invention is described with reference to Fig. 2.

As shown in Fig. 2, the data preprocessor comprises a construction module, an attribute selection module, a relation calculation module, a sorting module and a recall module.

The construction module is used to construct the relation between the attributes of each table and the class. Specifically, every record in every table of the database is labelled with a class. For example, in the original database only the records in the target table loan contain a class, while the records in the other 7 tables are not labelled. If the current table has no class attribute, the corresponding class value is passed from the target table into the current table along the primary/foreign key links. As shown in Fig. 6, the tables loan and account are linked through the attribute Account ID, which serves as the primary/foreign key; along this link, the Loan ID in the target table loan is passed to the account table, and the corresponding class value is passed to the account table as well.

The attribute selection module is used to select the attribute subset of a single table. Specifically, the correlation value between each attribute in a single table and the class is calculated according to the following formulas.

H (X) = - \sum_{i} P (x_{i}) \log_{2} (P (x_{i}))

Formula (1)

This formula calculates the entropy of attribute X, where P (x_{i}) is the probability that attribute X takes the value x_{i};

H (X | Y) = - \sum_{j} P (y_{j}) \sum_{i} P (x_{i} | y_{j}) \log_{2} (P (x_{i} | y_{j}))

Formula (2)

This formula calculates the conditional entropy of attribute X given attribute Y, where P (x_{i} | y_{j}) is the probability that X takes the value x_{i} when Y takes the value y_{j};

InformationGain (X | Y) = H (X) - H (X | Y)

Formula (3)

This formula calculates the information gain of attribute X once attribute Y is known.

SU (X, Y) = 2 [\frac{InformationGain (X | Y)}{H (X) + H (Y)}]

Formula (4)

This formula calculates the correlation value between an attribute and the class, where InformationGain is the information gain between attributes X and Y, and H (X) and H (Y) are the entropies of X and Y.

The attributes are sorted by correlation value; the larger the value, the more relevant the attribute is to the class. The attribute subset most relevant to the class is then selected from this ranking.
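The calculation in formulas (1) to (4) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the column-as-list data layout and the handling of a zero denominator are assumptions made for the example. The quantity in formula (4) has the form commonly known as symmetric uncertainty, which is why the function carries that name.

```python
import math
from collections import Counter

def entropy(values):
    """Formula (1): H(X) = -sum_i P(x_i) * log2 P(x_i)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """Formula (2): H(X|Y) = sum_j P(y_j) * H(X | Y = y_j)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    """Formula (3): InformationGain(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def symmetric_uncertainty(xs, ys):
    """Formula (4): SU(X, Y) = 2 * InformationGain(X|Y) / (H(X) + H(Y))."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0:
        return 0.0  # assumption: two constant columns are treated as uncorrelated
    return 2.0 * information_gain(xs, ys) / denom

def rank_attributes(table, class_column):
    """Return (attribute, SU) pairs sorted by decreasing correlation with the class.

    `table` is assumed to be a dict mapping column name -> list of values.
    """
    labels = table[class_column]
    scores = [
        (name, symmetric_uncertainty(labels, column))
        for name, column in table.items()
        if name != class_column
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With a table held as a dictionary of columns, `rank_attributes(table, "status")` returns the attributes ordered from most to least relevant to the class. How many of the top-ranked attributes to keep is not fixed by the description above and is left to the caller.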

The relation calculation module is used to calculate the correlation value between the attribute subset of each table and the class. Note that this value is also referred to below as the correlation value between the table and the class. Specifically, the following formula measures the relation between the attribute subset selected in each table, taken as a whole, and the class: the mean correlation between the attributes in the subset and the class is divided by the mean correlation between the attributes.

TSU = \frac{n \overline{SU_{cf}}}{\sqrt{n + n (n - 1) \overline{SU_{ff}}}}

Formula (5)

Here n is the number of attributes, \overline{SU_{cf}} is the mean of the correlation values between each attribute and the class, and \overline{SU_{ff}} is the mean of the correlation values between attributes; both kinds of correlation value are calculated by formula (4). The result of this formula is the correlation between the table and the class: the larger the value, the more relevant the table is to the class. In other words, the formula calculates the correlation value between the attribute set and the class.
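Formula (5) can be evaluated directly once the pairwise values from formula (4) are available. The sketch below assumes those values are already computed and passed in as plain lists; the function name `table_merit` and the convention of returning 0 for an empty subset are illustrative assumptions, not part of the source.

```python
import math

def table_merit(su_class, su_pairs):
    """Formula (5): TSU = n * mean(SU_cf) / sqrt(n + n*(n-1) * mean(SU_ff)).

    su_class: SU value between each selected attribute and the class (length n).
    su_pairs: SU values between each pair of selected attributes (may be empty).
    """
    n = len(su_class)
    if n == 0:
        return 0.0  # assumption: a table with no selected attribute scores 0
    mean_cf = sum(su_class) / n
    # With a single attribute there are no pairs; the n*(n-1) factor is 0 anyway.
    mean_ff = sum(su_pairs) / len(su_pairs) if su_pairs else 0.0
    return n * mean_cf / math.sqrt(n + n * (n - 1) * mean_ff)

# Hypothetical SU values for a table with three selected attributes.
score = table_merit([0.5, 0.3, 0.4], [0.1, 0.2, 0.3])
print(round(score, 3))
```

Passing precomputed lists keeps the function independent of how the SU values were obtained, so the same routine can score every table in the database.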

The sorting module is used to sort the tables in descending order of the correlation value between each table and the class. As shown in Fig. 6, the calculation in the preceding steps finds that the table trans is the most relevant to the class, so it is placed closest to the loan table, followed by the order table, and so on. This changes the original spatial structure of the database: the original primary/foreign key link structure becomes a linked-list structure with a definite order, and the database is thereby reconstructed.

The recall module is used to recall any attribute that was not selected by the attribute selection module and whose correlation value with the class is greater than the minimum of the correlation values between the tables and the class.

Next, the database schema reconstructing method according to the present invention is described with reference to Fig. 3.

As shown in Fig. 3, the database schema reconstructing method comprises the following steps:

A. Construct the relation between the attributes and the class in each table.

Specifically, every record in every table of the database is labelled with a class. For example, in the original database only the records in the target table loan contain a class, while the records in the other 7 tables are not labelled. If the current table has no class attribute, the corresponding class value is passed from the target table into the current table along the primary/foreign key links. For example, in Fig. 5 the account table has no class, so through primary/foreign key propagation the account table obtains a class column as its last column, in which the symbol "+" represents yes and the symbol "-" represents no. Any record that does not obtain a class value is deleted. As shown in Fig. 6, the row with AccountID=67 obtains no class value from the loan table, so it is considered to carry no class information and is deleted. In addition, as shown in Fig. 5, the tables loan and account are connected through the attribute Account ID, which serves as the primary/foreign key; along this link, the Loan ID in the target table loan is passed to the account table together with the corresponding class value. This operation uses a virtual join rather than a physical join, which saves time and space and reduces cost.
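The label propagation of step A can be illustrated with a small Python sketch. The function name, the list-of-dicts row layout and the sample rows are assumptions made for the example. The sketch also materializes the labelled rows for clarity; the text describes the operation as a virtual rather than physical join, which this example does not attempt to reproduce.

```python
from collections import defaultdict

def propagate_class(target_rows, rows, key, id_attr, class_attr):
    """Attach the target table's id and class value to rows of a linked table.

    Rows whose key value has no match in the target table receive no class
    value and are dropped, as described for step A.
    """
    matches = defaultdict(list)
    for t in target_rows:
        matches[t[key]].append((t[id_attr], t[class_attr]))

    labelled = []
    for row in rows:
        for target_id, label in matches.get(row[key], []):
            labelled.append({**row, id_attr: target_id, class_attr: label})
    return labelled

# Hypothetical rows; only the column names loan_id, account_id and status
# and the unmatched account 67 echo the example in the text.
loan = [
    {"loan_id": 1, "account_id": 124, "status": "+"},
    {"loan_id": 2, "account_id": 108, "status": "-"},
]
account = [
    {"account_id": 124, "frequency": "monthly"},
    {"account_id": 108, "frequency": "weekly"},
    {"account_id": 67, "frequency": "monthly"},
]
labelled_account = propagate_class(loan, account, "account_id", "loan_id", "status")
```

After the call, `labelled_account` holds the two account rows that matched a loan, each extended with `loan_id` and `status`, and the row for account 67 is absent.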

B. Select the attributes of each single table.

Specifically, attribute selection is an existing technique that relies mainly on the concept of information entropy, which in information theory is a measure of the amount of information. An attribute subset is selected from a single table such that every attribute in the subset is relevant to the class and the redundancy among the attributes is minimal. To this end, the correlation value between each attribute in the single table and the class is calculated according to the following formulas.

H (X) = - \sum_{i} P (x_{i}) \log_{2} (P (x_{i}))

Formula (1)

H (X | Y) = - \sum_{j} P (y_{j}) \sum_{i} P (x_{i} | y_{j}) \log_{2} (P (x_{i} | y_{j}))

Formula (2)

InformationGain (X | Y) = H (X) - H (X | Y)

Formula (3)

SU (X, Y) = 2 [\frac{InformationGain (X | Y)}{H (X) + H (Y)}]

Formula (4)

C. Calculate the correlation value between the attribute subset of each table and the class. The following formula measures the relation between the attribute subset selected in each table, taken as a whole, and the class: the mean of the correlation values between the attributes in the subset and the class is divided by the mean of the correlation values between the attributes.

TSU = \frac{n \overline{SU_{cf}}}{\sqrt{n + n (n - 1) \overline{SU_{ff}}}}

Formula (5)

Here n is the number of attributes, \overline{SU_{cf}} is the mean of the correlation values between each attribute and the class, and \overline{SU_{ff}} is the mean of the correlation values between attributes; both kinds of correlation value are calculated by formula (4). The result of this formula is the correlation between the table and the class: the larger the value, the more relevant the table is to the class. In other words, the formula calculates the correlation value between the attribute subset and the class.
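As an illustration of how formula (5) behaves, the following worked instance uses assumed values (n = 4, a mean attribute-class value of 0.3 and a mean attribute-attribute value of 0.25); none of these numbers come from the source. The two limiting cases follow algebraically from the formula itself.

```latex
% Worked instance of formula (5) with assumed values
% n = 4, \overline{SU_{cf}} = 0.3, \overline{SU_{ff}} = 0.25
TSU = \frac{4 \times 0.3}{\sqrt{4 + 4 \times 3 \times 0.25}}
    = \frac{1.2}{\sqrt{7}} \approx 0.454

% Limiting cases
\overline{SU_{ff}} = 0 \;\Rightarrow\; TSU = \sqrt{n}\,\overline{SU_{cf}}
\overline{SU_{ff}} = 1 \;\Rightarrow\; TSU = \frac{n\,\overline{SU_{cf}}}{\sqrt{n^{2}}} = \overline{SU_{cf}}
```

When the selected attributes are mutually uncorrelated, each additional attribute raises the table's score; when they are fully redundant, the score collapses to that of a single attribute, so redundant attributes add nothing.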

D. Sort the tables in descending order of the correlation value between each table and the class. As shown in Fig. 6, the calculation in the preceding steps finds that the table trans is the most relevant to the class, so it is placed closest to the loan table, followed by the order table, and so on. This changes the original spatial structure of the database: the original primary/foreign key link structure becomes a linked-list structure with a definite order, and the database is thereby reconstructed. The benefit is that the tables most relevant to the class lie closest to the target table, so the classifier processes them as early as possible, which improves classification efficiency. Fig. 6 shows the database after reconstruction.
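The ordering in step D amounts to a descending sort on the per-table scores. A minimal Python sketch follows; the function name and the score values are hypothetical, and the reconstructed schema is represented simply as a list with the target table first.

```python
def order_tables(target_table, table_scores):
    """Return the reconstructed processing order: the target table first, then
    the remaining tables by decreasing table-class correlation (step D)."""
    ranked = sorted(table_scores, key=table_scores.get, reverse=True)
    return [target_table] + ranked

# Hypothetical scores; only the relative order (trans first, then order,
# account lowest) follows the example in the text.
scores = {"account": 0.05, "order": 0.20, "trans": 0.30, "district": 0.10}
chain = order_tables("loan", scores)
```

A classifier that walks `chain` from left to right visits one table at a time, starting with those most relevant to the class.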

E. Recall some of the removed attributes. Some attributes were removed in step B; if the correlation value between such an attribute and the class is greater than the minimum of the correlation values between the tables and the class, the attribute is recalled. For example, suppose the table trans has an attribute A that was not selected during single-table attribute selection. In this step, if the correlation value between attribute A and the class is greater than the correlation value between the account table and the class (the account table being the table with the smallest correlation value in this database structure), attribute A is recalled.
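The recall rule of step E can be sketched as a threshold test against the lowest table score. In the Python example below, the function name, the dictionary layout and all numeric values are assumptions; the attribute names `amount` and `k_symbol` are hypothetical, and `A` stands for the unselected attribute in the example above.

```python
def recall_attributes(attr_scores, selected, table_scores):
    """Return, per table, the unselected attributes whose correlation with the
    class exceeds the smallest table-class correlation value (step E)."""
    threshold = min(table_scores.values())
    recalled = {}
    for table, scores in attr_scores.items():
        kept = selected.get(table, set())
        back = [a for a, su in scores.items() if a not in kept and su > threshold]
        if back:
            recalled[table] = back
    return recalled

# Hypothetical values: account has the lowest table score (0.05), so any
# unselected attribute scoring above 0.05 is recalled.
table_scores = {"trans": 0.30, "order": 0.20, "account": 0.05}
attr_scores = {"trans": {"amount": 0.35, "A": 0.08, "k_symbol": 0.01}}
selected = {"trans": {"amount"}}
recalled = recall_attributes(attr_scores, selected, table_scores)
```

Here attribute `A` is recalled because 0.08 exceeds the threshold of 0.05, while `k_symbol` stays removed.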

As the above description shows, the method according to the present invention is applicable to multi-relational databases. Multi-relational databases are the richest and most common data storage format today. Yet almost no method exists for optimizing a multi-relational database through attribute selection. The most direct approach is to apply a method for single-relation databases to a multi-relational database, but the formats do not match and a format conversion is also required; the present method fills this gap. In addition, the present invention optimizes the multi-relational database so that the efficiency of the classification application improves. The new method transforms the structure of the multi-relational database so that the tables are arranged linearly according to the strength of their relation to the class. The benefit of this arrangement is that the classification application finds the tables and attributes relevant to the class more quickly and the search space is reduced, which shortens classification time. The method also solves another problem: if a table in the database lies far from the target table, a classification application that starts its search from the target table may stop before reaching that distant table, which also greatly affects classification accuracy.

Other advantages and modifications will be readily apparent to those of ordinary skill in the art. Therefore, the invention in its broader aspects is not limited to the specific details and exemplary embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept as defined by the appended claims and their equivalents.

Claims

1. A database schema reconstructing system, comprising:

a multi-relational database for storing a number of multi-relational data tables;

a data preprocessor for performing attribute selection and table selection on the multi-relational data in the multi-relational data tables so as to reconstruct the database; and

classification application equipment for training on the reconstructed multi-relational database and predicting new data with the rules produced;

wherein the data preprocessor further comprises:

a construction module for constructing the relation between the attributes of each table and the class;

an attribute selection module for calculating the correlation value between each attribute in a single table and the class so as to select the attribute subset of that table, the correlation value between each attribute and the class being calculated by the formula

SU (X, Y) = 2 [\frac{InformationGain (X | Y)}{H (X) + H (Y)}]

wherein SU (X, Y) is a function measuring the degree of correlation between an arbitrary attribute Y and the target attribute X, InformationGain is the information gain between attributes X and Y, and H (X) is the entropy of attribute X;

a relation calculation module for calculating the correlation value between the attribute subset of each table and the class;

a sorting module for sorting the tables in descending order of the correlation value between each table and the class; and

a recall module for recalling any attribute not selected by the attribute selection module whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.

2. A method for a database schema reconstructing system, wherein the system comprises a multi-relational database storing a number of multi-relational data tables, a data preprocessor and classification application equipment, the method comprising:

A. constructing the relation between the attributes and the class in each table;

B. calculating the correlation value between each attribute in a single table and the class so as to select the attribute subset of that table, by the formula

SU (X, Y) = 2 [\frac{InformationGain (X | Y)}{H (X) + H (Y)}]

C. calculating the correlation value between the attribute subset of each table and the class;

D. sorting the tables in descending order of the correlation value between each table and the class; and

E. recalling any attribute not selected in step B whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.

3. The method according to claim 2, wherein in step C the correlation value between the attribute subset of each table and the class is calculated by dividing the mean correlation between the attributes in the attribute subset and the class by the mean correlation between the attributes.