CN101576906B - Database schema reconstructing system and method - Google Patents

Info

Publication number
CN101576906B
Authority
CN
China
Prior art keywords
attribute
classification
relevance values
many
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200910078789.1A
Other languages
Chinese (zh)
Other versions
CN101576906A (en)
Inventor
何军
杜小勇
刘红岩
胡泊
Original Assignee
杜小勇
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杜小勇
Priority to CN200910078789.1A
Publication of CN101576906A
Application granted
Publication of CN101576906B
Expired - Fee Related (current legal status)
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a database schema reconstructing system and method. The system comprises a multi-relational database storing a plurality of multi-relational data tables, a data pre-processor, and classification application equipment. The method comprises the following steps: A. constructing the relationship between the attributes and the class in each table; B. calculating the correlation value between each attribute and the class in a single table so as to select an attribute subset of that table; C. calculating the correlation value between the attribute subset of each table and the class; D. sorting the tables in descending order of their correlation values with the class; and E. recalling any attribute left unselected in step B whose correlation value with the class is larger than the minimum of the correlation values between the tables' attribute subsets and the class.

Description

A database schema reconstructing system and method
Technical field
The present invention relates to the field of computer databases and data mining, and in particular to a database schema reconstructing system and method that improves multi-relational classification.
Background art
In today's information age, much of the data on the Internet is growing at an astonishing rate. To mine such data, methods such as classification and clustering are used to extract useful patterns. Many machine learning and data mining methods face the same problem: the curse of dimensionality. On high-dimensional data sets, most methods become inefficient, their accuracy drops, and their processing time grows. As early as 1970, computer scientists proposed attribute selection as a simple and efficient way to address the curse of dimensionality. Attribute selection is in effect a preprocessing step for a learning method: it removes as much irrelevant and redundant information as possible and thereby improves the quality of the data set. Extensive experiments have shown that performing attribute selection on the training set benefits both the efficiency and the accuracy of the learning algorithm.
However, two problems remain:
(1) For multi-relational databases, no attribute selection method fits the multi-relational structure. Because of this mismatch, applying a single-table attribute selection method directly to a multi-relational database is neither successful nor reasonable. Multi-relational databases are ubiquitous in everyday use and are becoming the prevailing format for storing data, so the lack of an attribute selection method for such databases reduces the efficiency of applications such as classification and clustering.
(2) Even if a multi-table database is processed with a conventional single-table attribute selection method, efficiency remains low and processing time long. The reason is that some tables in a multi-relational database are unnecessary: they have little relation to the class and waste effort during the classifier's search. Selection must therefore be applied not only to the attributes within each table but also to the tables themselves. Only then can the efficiency of the classification application be improved.
Summary of the invention
The present invention has been made in view of the above technical problems. An object of the present invention is to propose a database schema reconstructing system and method that improves multi-relational classification.
In one aspect, the database schema reconstructing system according to the present invention comprises: a multi-relational database for storing a number of multi-relational data tables; a data pre-processor for performing attribute selection and table selection on the multi-relational data in those tables so as to reconstruct the database; and classification application equipment for training on the reconstructed multi-relational database and predicting new data with the rules produced.
In this aspect, the data pre-processor further comprises: a construction module for constructing the relationship between the attributes of each table and the class; an attribute selection module for calculating the correlation value between each attribute in a single table and the class so as to select an attribute subset of that table; a relation computing module for calculating the correlation value between the attribute subset of each table and the class; a sorting module for sorting the tables in descending order of their correlation values with the class; and a recall module for recalling any attribute not selected by the attribute selection module whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.
In one aspect, the database schema reconstructing method according to the present invention comprises the steps of: A. constructing the relationship between the attributes and the class in each table; B. calculating the correlation value between each attribute in a single table and the class so as to select an attribute subset of that table; C. calculating the correlation value between the attribute subset of each table and the class; D. sorting the tables in descending order of their correlation values with the class; E. recalling any attribute not selected in step B whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.
In this aspect, in step B the correlation value between each attribute and the class is calculated by the following formula, where InformationGain is the information gain between attributes X and Y, and H(X) is the entropy of attribute X.
$$ SU(X, Y) = 2\left[\frac{\mathrm{InformationGain}(X \mid Y)}{H(X) + H(Y)}\right] $$
In this aspect, in step C the correlation value between the attribute subset of each table and the class is calculated by dividing the mean correlation between the attributes in the subset and the class by the mean correlation among the attributes.
With the present invention, the structure of the multi-relational database is transformed so that its tables are arranged linearly in order of the strength of their relation to the class. The classification application can thus find the tables and attributes relevant to the class more quickly and searches a smaller space, which shortens classification time.
Description of drawings
The above and other objects, features, and advantages of the present invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of the database schema reconstructing system according to the present invention;
Fig. 2 is a more detailed block diagram of the data pre-processor according to the present invention;
Fig. 3 is a flowchart of the database schema reconstructing method according to the present invention;
Fig. 4 shows an example of a multi-relational database according to the present invention;
Fig. 5 shows an example of a multi-relational database according to the present invention;
Fig. 6 shows an example of the database after reconstruction according to the present invention.
Embodiment
As discussed in detail below, the disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. In a preferred embodiment, the disclosure is implemented in software, which includes but is not limited to firmware, resident software, microcode, and the like.
For a more complete understanding of the present invention and its advantages, the invention is described in further detail below with reference to the drawings and specific embodiments.
First, the database schema reconstructing system according to the present invention is described with reference to Fig. 1.
As shown in Fig. 1, the database schema reconstructing system comprises a multi-relational database, a data pre-processor, and classification application equipment.
The multi-relational database stores a number of multi-relational data tables. Fig. 4 gives an example of a multi-relational database. In this example the multi-relational database is a financial database with 8 tables that are linked to one another by primary keys and foreign keys. As shown in Fig. 5, the table loan and the table account are linked through the attribute account_id, which serves as the primary/foreign key. The target table is the loan table, and the target attribute is status, which takes two values: yes means that the loan was repaid on schedule, and no means that it was not.
The data pre-processor performs attribute selection and table selection on the multi-relational data in the multi-relational data tables so as to reconstruct the database. The data pre-processor is described in more detail below with reference to Fig. 2.
The classification application equipment is a multi-table classifier. It trains on the original data in a multi-relational environment to obtain a classifier that can automatically predict the class of new data. Without the database schema reconstructing method described here, the classification application equipment can still classify an existing database, but its performance falls short. In Fig. 4, for example, the classification application equipment first processes the table loan, then the four tables trans, account, disposition, and order, and so on. The shortfall shows up as long training time and poor accuracy when predicting new records after training. There are two reasons: first, under the original database schema the classification application equipment processes more than one table at a time, which lengthens the time needed to train rules; second, the order in which the classification application equipment processes the tables is not optimal, so the resulting rules are not optimal and the accuracy of predicting the class of new records decreases. After the database schema is reconstructed, both defects are mitigated: first, the reconstructed schema is a linked-list structure, so the classification application equipment processes only one table at a time and training time is shortened; second, the reconstructed database is ordered by relevance to the class, so the tables most relevant to the class are processed first and the resulting rules are better than before. For example, under the original schema the table district is far from the table loan, so the following rule may be missed: district.avg_salary < 10000 => label = no, that is, accounts in districts whose average income per capita is below 10,000 yuan do not repay their loans on schedule. This rule is in fact very important and helps to improve prediction accuracy.
Next, the data pre-processor according to the present invention is described with reference to Fig. 2.
As shown in Fig. 2, the data pre-processor comprises a construction module, an attribute selection module, a relation computing module, a sorting module, and a recall module.
The construction module constructs the relationship between the attributes of each table and the class. Specifically, every record in every table of the database is labelled with its class. For example, in the original database only the records in the target table loan carry a class, while the records in the other 7 tables are unlabelled. If the current table has no class attribute, the corresponding class values are passed from the target table into the current table along the primary/foreign key links. As shown in Fig. 6, the table loan and the table account are linked through the attribute Account ID, which serves as the primary/foreign key; following this link, the Loan ID in the target table loan is passed to the account table together with the corresponding class value.
The attribute selection module selects the attribute subset of a single table. Specifically, the correlation value between each attribute in a single table and the class is calculated according to the following formulas.
$$ H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) $$ Formula (1)
This formula computes the entropy of attribute X, where P(x_i) is the probability that attribute X takes the value x_i;
$$ H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) $$ Formula (2)
This formula computes the conditional entropy of attribute X given the value of attribute Y, where P(x_i | y_j) is the probability that X takes the value x_i when Y takes the value y_j;
$$ \mathrm{InformationGain}(X \mid Y) = H(X) - H(X \mid Y) $$ Formula (3)
This formula computes the information gain of attribute X once attribute Y is known.
$$ SU(X, Y) = 2\left[\frac{\mathrm{InformationGain}(X \mid Y)}{H(X) + H(Y)}\right] $$ Formula (4)
This formula computes the correlation value between an attribute and the class, where InformationGain is the information gain between attributes X and Y, and H(X) is the entropy of attribute X.
The attributes are then ranked by correlation value; the larger the value, the more relevant the attribute is to the class. The attribute subset most relevant to the class is selected from this ranking.
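Formulas (1) to (4) can be sketched in Python as follows. This is a minimal illustration, assuming each attribute is given as a list of discrete values with one entry per record; the function names are chosen for this sketch and do not come from the patent.

```python
from collections import Counter
from math import log2


def entropy(values):
    """H(X) as in Formula (1): entropy of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(xs, ys):
    """H(X|Y) as in Formula (2): entropy of X within each group of equal Y."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    # Weight the entropy of each group by P(y_j).
    return sum((len(g) / n) * entropy(g) for g in groups.values())


def information_gain(xs, ys):
    """InformationGain(X|Y) as in Formula (3)."""
    return entropy(xs) - conditional_entropy(xs, ys)


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) as in Formula (4); returns 0.0 when both entropies are zero."""
    denominator = entropy(xs) + entropy(ys)
    if denominator == 0:
        return 0.0
    return 2.0 * information_gain(xs, ys) / denominator


# Hypothetical data: the class column and one attribute that determines it.
status = ["yes", "yes", "no", "no"]
print(symmetrical_uncertainty(status, ["a", "a", "b", "b"]))  # 1.0
```

The guard for a zero denominator is a choice made in this sketch for constant columns; the source does not say how that case is handled.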
The relation computing module calculates the correlation value between the attribute subset of each table and the class. Note that this value is also referred to below as the correlation value between the table and the class. Specifically, the following formula measures the relation between the attribute subset selected in each table, taken as a whole, and the class: the mean correlation between the attributes in the subset and the class is divided by the mean correlation among the attributes.
$$ TSU = \frac{n\,\overline{SU_{cf}}}{n + n(n-1)\,\overline{SU_{ff}}} $$ Formula (5)
Here n is the number of attributes, $\overline{SU_{cf}}$ is the mean correlation value between each attribute and the class, and $\overline{SU_{ff}}$ is the mean correlation value among the attributes; both kinds of correlation value are calculated by Formula (4). The result of this formula is the correlation between the table and the class; as before, the larger the value, the more relevant the table is to the class. In other words, the formula yields the correlation value between the attribute set and the class.
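A minimal Python sketch of Formula (5), assuming the SU values have already been computed with Formula (4). The function and parameter names are illustrative. The `sqrt_denominator` switch is an addition of this sketch: the correlation-based merit commonly used in feature selection takes the square root of the denominator, while the formula as extracted here shows none, so the default follows the text as printed.

```python
from math import sqrt


def table_relevance(su_with_class, su_between_attributes, sqrt_denominator=False):
    """TSU of one table.

    su_with_class: SU between each selected attribute and the class.
    su_between_attributes: SU for each pair of selected attributes.
    sqrt_denominator: assumption of this sketch, off by default.
    """
    n = len(su_with_class)
    if n == 0:
        return 0.0
    mean_cf = sum(su_with_class) / n
    # With a single attribute there are no pairs, so the mean is taken as 0.
    mean_ff = (sum(su_between_attributes) / len(su_between_attributes)
               if su_between_attributes else 0.0)
    denominator = n + n * (n - 1) * mean_ff
    if sqrt_denominator:
        denominator = sqrt(denominator)
    return n * mean_cf / denominator


# Hypothetical values: two attributes, one attribute pair.
tsu = table_relevance([0.4, 0.2], [0.1])
```

A higher mean correlation among the attributes enlarges the denominator, so a redundant attribute subset lowers the table's score.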
The sorting module sorts the tables in descending order of their correlation values with the class. As shown in Fig. 6, the calculations above find that the table trans is the most relevant to the class, so it is placed next to the loan table, followed by the order table, and so on. This changes the original layout of the database: the original primary/foreign key link structure is turned into an ordered linked-list structure, and the database is thereby reconstructed.
The recall module recalls any attribute that was not selected by the attribute selection module but whose correlation value with the class is greater than the minimum of the correlation values between the tables and the class.
Next, the database schema reconstructing method according to the present invention is described with reference to Fig. 3.
As shown in Fig. 3, the database schema reconstructing method comprises the following steps:
A. Construct the relationship between the attributes and the class in each table.
Specifically, every record in every table of the database is labelled with its class. For example, in the original database only the records in the target table loan carry a class, while the records in the other 7 tables are unlabelled. If the current table has no class attribute, the corresponding class values are passed from the target table into the current table along the primary/foreign key links. In Fig. 5, for example, the account table has no class, so the class is passed along the primary/foreign key and the account table gains a class column at the end, in which the symbol "+" stands for yes and the symbol "-" stands for no. Records that obtain no class value are deleted. As shown in Fig. 6, the row with AccountID=67 obtains no class value from the loan table; it is therefore regarded as carrying no class information and is deleted. In addition, as shown in Fig. 5, the table loan and the table account are linked through the attribute Account ID, which serves as the primary/foreign key; following this link, the Loan ID in the target table loan is passed to the account table together with the corresponding class value. This operation uses a virtual join rather than a physical join, which saves time and space and reduces cost.
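Step A can be sketched in Python as follows. This is an illustration only: tables are modelled as lists of dictionaries, the column names `loan_id`, `account_id`, and `status` follow the example in the text, and the row values (apart from account 67 having no loan) are invented for the sketch.

```python
from collections import defaultdict


def propagate_labels(loan_rows, account_rows):
    """Pass loan_id and class from the target table to account rows.

    Account rows with no matching loan receive no class and are dropped.
    """
    loans_by_account = defaultdict(list)
    for loan in loan_rows:
        loans_by_account[loan["account_id"]].append(loan)

    labelled = []
    for account in account_rows:
        for loan in loans_by_account.get(account["account_id"], []):
            labelled.append({
                **account,
                "loan_id": loan["loan_id"],
                "class": "+" if loan["status"] == "yes" else "-",
            })
    return labelled


# Hypothetical rows; account 67 has no loan and therefore no class.
loans = [
    {"loan_id": 1, "account_id": 124, "status": "yes"},
    {"loan_id": 2, "account_id": 108, "status": "no"},
]
accounts = [
    {"account_id": 124, "frequency": "monthly"},
    {"account_id": 108, "frequency": "weekly"},
    {"account_id": 67, "frequency": "monthly"},
]
labelled_accounts = propagate_labels(loans, accounts)
```

The sketch builds new labelled rows in memory for clarity; the virtual join described in the text would avoid materialising them, and how that is done is not specified in the source.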
B. Select the attributes of a single table.
Specifically, attribute selection is an existing technique that relies mainly on the concept of information entropy, which in information theory measures the amount of information. The goal is to select from a single table an attribute subset such that every attribute in the subset is relevant to the class and the redundancy among the attributes is minimal. To this end, the correlation value between each attribute in the single table and the class is calculated according to the following formulas.
$$ H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) $$ Formula (1)
This formula computes the entropy of attribute X, where P(x_i) is the probability that attribute X takes the value x_i;
$$ H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) $$ Formula (2)
This formula computes the conditional entropy of attribute X given the value of attribute Y, where P(x_i | y_j) is the probability that X takes the value x_i when Y takes the value y_j;
$$ \mathrm{InformationGain}(X \mid Y) = H(X) - H(X \mid Y) $$ Formula (3)
This formula computes the information gain of attribute X once attribute Y is known.
$$ SU(X, Y) = 2\left[\frac{\mathrm{InformationGain}(X \mid Y)}{H(X) + H(Y)}\right] $$ Formula (4)
This formula computes the correlation value between an attribute and the class, where InformationGain is the information gain between attributes X and Y, and H(X) is the entropy of attribute X.
The attributes are then ranked by correlation value; the larger the value, the more relevant the attribute is to the class. The attribute subset most relevant to the class is selected from this ranking.
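As a small worked illustration of Formulas (1) to (4), assume a hypothetical table of four records in which the class X takes the values yes, yes, no, no and an attribute Y takes the values a, a, b, b. These values are invented for the example and do not come from the financial database; the usual convention 0 log 0 = 0 is assumed.

```latex
\begin{align*}
H(X) &= -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1,
  \qquad H(Y) = 1, \\
H(X \mid Y) &= -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) = 0, \\
\mathrm{InformationGain}(X \mid Y) &= H(X) - H(X \mid Y) = 1, \\
SU(X, Y) &= 2\left[\frac{1}{1 + 1}\right] = 1 .
\end{align*}
```

Here Y determines the class completely, so SU reaches its maximum of 1. If Y instead took the values a, b, a, b, each value of Y would leave the class evenly split, giving H(X | Y) = 1, an information gain of 0, and SU(X, Y) = 0.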
C. Calculate the correlation value between the attribute subset of each table and the class. The following formula measures the relation between the attribute subset selected in each table, taken as a whole, and the class: the mean correlation value between the attributes in the subset and the class is divided by the mean correlation value among the attributes.
$$ TSU = \frac{n\,\overline{SU_{cf}}}{n + n(n-1)\,\overline{SU_{ff}}} $$ Formula (5)
Here n is the number of attributes, $\overline{SU_{cf}}$ is the mean correlation value between each attribute and the class, and $\overline{SU_{ff}}$ is the mean correlation value among the attributes; both kinds of correlation value are calculated by Formula (4). The result of this formula is the correlation between the table and the class; as before, the larger the value, the more relevant the table is to the class. In other words, the formula yields the correlation value between the attribute subset and the class.
D. Sort the tables in descending order of their correlation values with the class. As shown in Fig. 6, the calculations above find that the table trans is the most relevant to the class, so it is placed next to the loan table, followed by the order table, and so on. This changes the original layout of the database: the original primary/foreign key link structure is turned into an ordered linked-list structure, and the database is thereby reconstructed. The benefit is that the tables most relevant to the class sit closest to the target table, so the classifier processes them as early as possible and classification efficiency improves. Fig. 6 shows the database after reconstruction.
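The reordering in step D amounts to a descending sort with the target table fixed at the head of the chain. A minimal Python sketch, assuming the table-class correlation values are already available; the numeric scores below are hypothetical and chosen only so that the order agrees with the example in the text (trans first, then order, with account lowest).

```python
def reorder_tables(target_table, relevance_by_table):
    """Return the table chain: target first, then descending relevance."""
    others = [name for name in relevance_by_table if name != target_table]
    others.sort(key=lambda name: relevance_by_table[name], reverse=True)
    return [target_table] + others


# Hypothetical table-class correlation values.
scores = {"trans": 0.31, "order": 0.22, "district": 0.12, "account": 0.05}
chain = reorder_tables("loan", scores)
print(chain)  # ['loan', 'trans', 'order', 'district', 'account']
```

A classifier that walks this list from the front reaches the most relevant table first, which is the effect the reconstruction is meant to achieve.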
E. Recall some of the removed attributes: an attribute removed in step B is recalled if its correlation value with the class is greater than the minimum of the correlation values between the tables and the class. For example, suppose the table trans has an attribute A that was not selected during single-table attribute selection. In this step, if the correlation value between attribute A and the class is greater than the correlation value between the account table and the class (the account table being the table with the smallest correlation value in this database structure), attribute A is recalled.
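The recall rule of step E can be sketched as a simple threshold test. In this Python illustration the attribute names and all numeric values are hypothetical, and the function name is chosen for the sketch.

```python
def recall_attributes(su_of_rejected, relevance_by_table):
    """Return rejected attributes whose SU with the class exceeds the
    smallest table-class correlation value in the database."""
    threshold = min(relevance_by_table.values())
    return sorted(name for name, su in su_of_rejected.items() if su > threshold)


# Hypothetical values: account is the least relevant table (0.05).
table_scores = {"trans": 0.31, "order": 0.22, "account": 0.05}
rejected = {"trans.A": 0.08, "trans.B": 0.02}
recalled = recall_attributes(rejected, table_scores)
print(recalled)  # ['trans.A']
```

The comparison is strict, matching the wording "greater than" in the text, so an attribute whose value equals the threshold is not recalled.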
As the above description shows, the method according to the present invention is applicable to multi-relational databases, which are the richest and most common data storage format today. Yet there are almost no methods that optimize attribute selection for multi-relational databases. The most direct approach is to apply a method designed for single-relation databases to a multi-relational database, but the formats do not match and a format conversion is then required; the present method fills this gap. In addition, the present invention optimizes the multi-relational database so that the efficiency of the classification application improves. The new method transforms the structure of the multi-relational database so that its tables are arranged linearly in order of the strength of their relation to the class. The benefit of this arrangement is that the classification application finds the tables and attributes relevant to the class more quickly and searches a smaller space, which shortens classification time. The method also solves a further problem: if a table in the database lies far from the target table, a classification application that starts its search from the target table may stop before reaching that distant table, which can greatly affect classification accuracy.
Other advantages and modifications will be readily apparent to those of ordinary skill in the art. The invention in its broader aspects is therefore not limited to the specific details and exemplary embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept as defined by the appended claims and their equivalents.

Claims (3)

1. A database schema reconstructing system, comprising:
A multi-relational database for storing a number of multi-relational data tables;
A data pre-processor for performing attribute selection and table selection on the multi-relational data in the multi-relational data tables so as to reconstruct the database; and
Classification application equipment for training on the reconstructed multi-relational database and predicting new data with the rules produced;
Wherein the data pre-processor further comprises:
A construction module for constructing the relationship between the attributes of each table and the class;
An attribute selection module for calculating the correlation value between each attribute in a single table and the class so as to select an attribute subset of that table, the correlation value between each attribute and the class being calculated by the formula
$$ SU(X, Y) = 2\left[\frac{\mathrm{InformationGain}(X \mid Y)}{H(X) + H(Y)}\right] $$
where SU(X, Y) is a function measuring the degree of correlation between any attribute Y and the target attribute X, InformationGain is the information gain between attributes X and Y, and H(X) is the entropy of attribute X;
A relation computing module for calculating the correlation value between the attribute subset of each table and the class;
A sorting module for sorting the tables in descending order of their correlation values with the class;
A recall module for recalling any attribute not selected by the attribute selection module whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.
2. A method for a database schema reconstructing system, wherein the system comprises a multi-relational database storing a number of multi-relational data tables, a data pre-processor, and classification application equipment, the method comprising:
A. constructing the relationship between the attributes and the class in each table;
B. calculating the correlation value between each attribute in a single table and the class so as to select an attribute subset of that table, the correlation value between each attribute and the class being calculated by the formula
$$ SU(X, Y) = 2\left[\frac{\mathrm{InformationGain}(X \mid Y)}{H(X) + H(Y)}\right] $$
where SU(X, Y) is a function measuring the degree of correlation between any attribute Y and the target attribute X, InformationGain is the information gain between attributes X and Y, and H(X) is the entropy of attribute X;
C. calculating the correlation value between the attribute subset of each table and the class;
D. sorting the tables in descending order of their correlation values with the class;
E. recalling any attribute not selected in step B whose correlation value with the class is greater than the minimum of the correlation values between the tables' attribute subsets and the class.
3. The method according to claim 2, wherein in step C the correlation value between the attribute subset of each table and the class is calculated by dividing the mean correlation between the attributes in the subset and the class by the mean correlation among the attributes.
CN200910078789.1A 2009-03-03 2009-03-03 Database schema reconstructing system and method Expired - Fee Related CN101576906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910078789.1A CN101576906B (en) 2009-03-03 2009-03-03 Database schema reconstructing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910078789.1A CN101576906B (en) 2009-03-03 2009-03-03 Database schema reconstructing system and method

Publications (2)

Publication Number Publication Date
CN101576906A CN101576906A (en) 2009-11-11
CN101576906B true CN101576906B (en) 2011-03-30

Family

ID=41271839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910078789.1A Expired - Fee Related CN101576906B (en) 2009-03-03 2009-03-03 Database schema reconstructing system and method

Country Status (1)

Country Link
CN (1) CN101576906B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800422A (en) * 2018-12-20 2019-05-24 北京明略软件系统有限公司 Method, system, terminal and the storage medium that a kind of pair of tables of data is classified
CN110082116B (en) * 2019-03-18 2022-04-19 深圳市元征科技股份有限公司 Evaluation method and evaluation device for vehicle four-wheel positioning data and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308496A (en) * 2008-07-04 2008-11-19 沈阳格微软件有限责任公司 Large scale text data external clustering method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
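Shu Hongping et al. Decision attribute classification mining algorithm based on information entropy and its application. Computer Engineering and Applications (《计算机工程与应用》). 2004, (No. 1), 186-188. *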

Also Published As

Publication number Publication date
CN101576906A (en) 2009-11-11

Similar Documents

Publication Publication Date Title
CN102760138B (en) Classification method and device for user network behaviors and search method and device for user network behaviors
CN109446344B (en) Intelligent analysis report automatic generation system based on big data
CN101404015B (en) Automatically generating a hierarchy of terms
CN111475509A (en) Big data-based user portrait and multidimensional analysis system
CN102306176B (en) On-line analytical processing (OLAP) keyword query method based on intrinsic characteristic of data warehouse
CN107016501A (en) A kind of efficient industrial big data multidimensional analysis method
CN103744928A (en) Network video classification method based on historical access records
CN111489201A (en) Method, device and storage medium for analyzing customer value
Tsytsarau et al. Efficient sentiment correlation for large-scale demographics
CN107895033A (en) A kind of method for early warning of student's academic warning system based on machine learning
CN103425740A (en) IOT (Internet Of Things) faced material information retrieval method based on semantic clustering
CN100401301C (en) Body learning based intelligent subject-type network reptile system configuration method
CN110795613A (en) Commodity searching method, device and system and electronic equipment
CN115309749A (en) Big data experiment system for scientific and technological service
CN118132732A (en) Enhanced search user question and answer method, device, computer equipment and storage medium
CN101576906B (en) Database schema reconstructing system and method
CN112199488A (en) Incremental knowledge graph entity extraction method and system for power customer service question answering
CN117131383A (en) Method for improving search precision drainage performance of double-tower model
CN116186041A (en) Data lake index creation method and device, electronic equipment and computer storage medium
WO2020106950A1 (en) User-experience development system
CN116401338A (en) Design feature extraction and attention mechanism based on data asset intelligent retrieval input and output requirements and method thereof
CN112800219B (en) Method and system for feeding back customer service log to return database
CN115730053A (en) Wind turbine generator operation and maintenance auxiliary intelligent question and answer method and device
CN113342844A (en) Industrial intelligent search system
CN112000389A (en) Configuration recommendation method, system, device and computer storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110330

Termination date: 20130303