CN105205172A - Database retrieval method

Info

Publication number: CN105205172A
Application number: CN201510660869.3A
Authority: CN
Inventors: 许驰
Original assignee: CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Current assignee: CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Priority date: 2015-10-14
Filing date: 2015-10-14
Publication date: 2015-12-30

Abstract

The invention provides a database retrieval method. The method comprises: generating a decision tree from sample data; partitioning the data in a distributed manner into multiple blocks; and, when a query is issued, sending the retrieval request to the corresponding data blocks and computing similar records according to the decision tree. The method performs distributed management and retrieval of data with associations, improves the optimization of the computation, reduces the complexity of the method, and saves computation cost.

Description

A database retrieval method

Technical field

The present invention relates to data storage, and in particular to a database retrieval method.

Background art

With the rapid development of intelligent healthcare and the emergence of massive medical data, correspondingly large databases are needed as carriers to store these data, and retrieval over such massive data has become a major problem. The volume of literature retrieval in the medical community is also growing exponentially along with Internet resources. Similarity retrieval has a wide range of application scenarios, such as content-based retrieval, duplicate record identification, and information retrieval optimization. In general, similarity retrieval means the following: given a data set D, a query object q, and a similarity measure s, either the user sets a value k and the k objects most similar to q are returned, or the user specifies a threshold t and all objects whose similarity to q is greater than t are returned. With the development and widespread adoption of cloud computing, the amount of data to be managed is growing rapidly, and efficient similarity retrieval is increasingly important.
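The two forms of similarity retrieval defined above (top-k and threshold) can be illustrated with a minimal Python sketch. The text does not fix the similarity measure s; Jaccard similarity over token sets is assumed here purely for illustration, and the function names and sample data are hypothetical.

```python
from typing import Callable, List, Sequence, Set

Similarity = Callable[[Set[str], Set[str]], float]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (an assumed choice for s)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_k(D: Sequence[Set[str]], q: Set[str], s: Similarity, k: int) -> List[Set[str]]:
    """Return the k objects of D most similar to q (ties broken by position)."""
    order = sorted(range(len(D)), key=lambda i: (-s(q, D[i]), i))
    return [D[i] for i in order[:k]]


def above_threshold(D: Sequence[Set[str]], q: Set[str], s: Similarity, t: float) -> List[Set[str]]:
    """Return every object of D whose similarity to q is greater than t."""
    return [d for d in D if s(q, d) > t]


# Hypothetical data set D and query object q.
D = [{"fever", "cough"}, {"fever", "rash"}, {"fracture"}]
q = {"fever", "cough", "headache"}
print(top_k(D, q, jaccard, k=2))
print(above_threshold(D, q, jaccard, t=0.5))
```

Both functions scan the whole data set, which is exactly the cost that the partitioning and decision tree steps of the method aim to avoid.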

However, existing similarity retrieval methods are ill-suited to data with associations. Traditional indexing methods perform poorly when the number of attribute dimensions to be considered is high, and because they do not take into account the importance of attributes at different levels of a relation, their accuracy is poor.

Summary of the invention

To solve the problems of the prior art described above, the present invention proposes a database retrieval method, comprising:

generating a decision tree from sample data;

performing distributed partitioning of the data, dividing the data into multiple blocks;

when a query is issued, dispatching the retrieval request to the corresponding data blocks and computing similar records according to the decision tree.

Preferably, generating the decision tree from sample data further comprises:

The sample data are analyzed and processed in advance. When the decision tree is computed, two thresholds are set to prevent overfitting: a false positive threshold FP and a false negative threshold FN. When a node of the decision tree is computed and the proportion of the first class is greater than or equal to 1-FP, the computation terminates and the node is labeled as the first class; likewise, when the proportion of the second class is greater than or equal to 1-FN, the computation terminates and the node is labeled as the second class. The corresponding decision tree is computed from the attribute list of the sample data.

Preferably, performing distributed partitioning of the data further comprises: the attributes of each object contain multiple tokens or fall within a certain numerical range; hash-based partitioning is adopted, in which objects are hashed to different data blocks according to their tokens and numerical ranges, so that objects whose attributes contain the same token, or whose attribute values fall within the same value range, are hashed into the same data block; consistent hashing is applied in the distributed environment; when an object needs to be written to multiple data blocks, only the object ID is copied, and the storage and mirroring of the data are maintained by the database management system.

Preferably, dispatching the retrieval request to the corresponding data blocks further comprises:

After the decision tree has been computed and the data have been partitioned, the retrieval request is processed according to the following steps:

Step 1: The tokens of the character-type attributes and the value ranges of the numeric attributes of the retrieval request are hashed, using the same hash function as was used for data partitioning. By means of consistent hashing, the query is sent to different computing nodes according to each hash value; each data node maintains the data blocks with the same hash value and retrieves within them.

Step 2: After a data node receives a query, it compares the query, according to its hash value, with the data objects in the corresponding data block. When the similarity between the query and an object in the data block is computed, the attributes are compared one by one following the attribute nodes of the decision tree; the left or right child node is selected for further comparison according to the computed similarity value, until a leaf node is reached, and the final similarity result is determined by the class of the leaf node.

Compared with the prior art, the present invention has the following advantages:

Distributed management and retrieval are performed on data with associations, the optimization of the computation is improved, the complexity of the method is reduced, and computation cost is saved.

Brief description of the drawings

Fig. 1 is a flowchart of the database retrieval method according to an embodiment of the present invention.

Detailed description of the embodiments

A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these details.

To address the above problems, the present invention proposes a decision tree computation method for large-scale data in a distributed environment. The importance of different attributes, and of the attributes of associated objects at different levels, is analyzed on the basis of the decision tree, so that during candidate set verification the computation can be terminated early through accuracy analysis without comparing all attributes. The approximation accuracy of the result is guaranteed by probability theory, recursive computation is avoided, and computational complexity and storage cost are reduced.

One aspect of the present invention provides a database retrieval method. Fig. 1 is a flowchart of the database retrieval method according to an embodiment of the present invention. As shown in Fig. 1, the invention is implemented in the following steps:

The method first divides the data into multiple blocks, placing objects that may be similar in the same block; the ID of a data object may be sent to multiple blocks. Then, when a query is issued, the query is dispatched to the corresponding data blocks, and similar records are computed according to the decision tree. The whole computation consists of four parts: steps (1) and (2) are offline processing, and steps (3) and (4) are online processing.

(1) Generating a decision tree from sample data

A traditional decision tree determines the class to which an object belongs as quickly as possible according to the importance of the attributes. The present invention uses the decision tree method to reflect the importance of the attributes and thereby quickly judge whether objects are similar.

To compute the decision tree in the offline phase, the sample data must be analyzed and processed in advance. When the decision tree is computed, two thresholds are set to prevent overfitting: a false positive threshold FP and a false negative threshold FN. When a node of the decision tree is computed and the proportion of class Y is greater than or equal to 1-FP, the computation terminates and the node is labeled Y. Likewise, when the proportion of class N is greater than or equal to 1-FN, the computation terminates and the node is labeled N.
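The stopping rule with the two thresholds can be sketched as follows. This is a minimal illustration, assuming that the samples reaching a node carry the labels "Y" and "N"; the function name and its return convention are hypothetical and not taken from the source.

```python
from collections import Counter
from typing import Optional, Sequence


def early_stop_label(labels: Sequence[str], fp: float, fn: float) -> Optional[str]:
    """Return "Y" or "N" if the node can be closed as a leaf, else None.

    fp and fn play the role of the thresholds FP and FN in the text.
    """
    if not labels:
        return None
    counts = Counter(labels)
    n = len(labels)
    if counts["Y"] / n >= 1 - fp:   # proportion of class Y reaches 1-FP
        return "Y"
    if counts["N"] / n >= 1 - fn:   # proportion of class N reaches 1-FN
        return "N"
    return None                     # node is still mixed: keep splitting


# 19 of 20 samples are Y, so with FP = 0.1 the node becomes a Y leaf.
print(early_stop_label(["Y"] * 19 + ["N"], fp=0.1, fn=0.1))
```

Larger thresholds close nodes earlier and give a shallower tree, at the price of more misclassified samples in each leaf.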

The corresponding decision tree can be computed from the attribute list of the sample data. When the data volume is large, the amount of computation for the decision tree is large; in particular, pre-sorting for decision tree computation in a distributed environment consumes a large amount of communication bandwidth. The present invention therefore proposes an optimized decision tree computation method.

(2) Distributed partitioning of the data

There are two common ways to partition data in a distributed manner. The first is index-based partitioning, in which different data types use different indexes. The second is hash-based partitioning: the attributes of each object may contain multiple tokens or fall within a certain numerical range, and objects are hashed to different data blocks according to their tokens and numerical ranges. Current distributed database management systems place more emphasis on scalability and fault tolerance, whereas the first method suffers from a bottleneck problem and requires a global index to be maintained; the present invention therefore uses hash-based partitioning. Objects whose attributes contain the same token, or whose attribute values fall within the same value range, are hashed into the same data block, and consistent hashing is used in the distributed environment. When an object needs to be written to multiple data blocks, only its ID is copied. The storage and mirroring of the data are maintained transparently by the database management system.
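A minimal sketch of hash-based partitioning is given below. The number of blocks, the width of a numeric value range, the key format and all function names are assumptions made for illustration. For brevity, plain modulo hashing stands in for the consistent hashing mentioned in the text.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

NUM_BLOCKS = 8        # assumed number of data blocks
BUCKET_WIDTH = 10.0   # assumed width of one numeric value range


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def block_keys(tokens: Iterable[str], numbers: Dict[str, float]) -> Set[str]:
    """One key per token and one key per numeric attribute value range."""
    keys = {f"tok:{t}" for t in tokens}
    keys |= {f"num:{name}:{int(value // BUCKET_WIDTH)}" for name, value in numbers.items()}
    return keys


def blocks_for(tokens: Iterable[str], numbers: Dict[str, float]) -> Set[int]:
    """Blocks an object belongs to; an object may land in several blocks."""
    return {stable_hash(k) % NUM_BLOCKS for k in block_keys(tokens, numbers)}


def partition(objects: Dict[str, Tuple[Set[str], Dict[str, float]]]) -> Dict[int, List[str]]:
    """Map block number -> object IDs. Only the ID is copied into each block."""
    blocks: Dict[int, List[str]] = defaultdict(list)
    for obj_id, (tokens, numbers) in objects.items():
        for b in sorted(blocks_for(tokens, numbers)):
            blocks[b].append(obj_id)
    return dict(blocks)


# Hypothetical objects: (token set, numeric attributes).
objects = {
    "a": ({"fever", "cough"}, {"age": 34.0}),
    "b": ({"fever"}, {"age": 37.5}),
    "c": ({"fracture"}, {"age": 71.0}),
}
print(partition(objects))
```

Objects "a" and "b" share the token "fever" and the same age range, so they are guaranteed to meet in at least one block, while the full records stay wherever the database management system stores them.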

(3) Dispatching the query to the corresponding data blocks

This is essentially the same as the method in step (2), except that multiple copies of the query must be made and sent to the computing nodes so that the retrieval result is not incomplete.

(4) In-block similarity retrieval according to the decision tree

After a query has been dispatched to the corresponding data blocks, an exact computation within the block is required. Computing by means of the decision tree allows the computation to terminate early, which reduces the computation cost. The efficiency and accuracy of decision tree computation therefore affect overall performance.

The optimized methods for computing decision tree split points proposed by the present invention comprise an exact method and an approximate method, which are introduced in turn below.

Exact method for decision tree split points

First, a split point is found recursively using the Gini index, so that data set A is split into data sets D and E and the gain

DG = GN(A) - p_d \times GN(D) - p_e \times GN(E)

is maximized, where p_d and p_e are the proportions of A taken up by D and E respectively. The Gini index has its standard form

GN(A) = 1 - \sum_{k=1}^{l} p_k^2

where p_k is the proportion of class k in set A and l is the number of classes.
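The search for a single split point that maximizes DG can be sketched for one numeric attribute as follows. This is a single-machine illustration using the standard Gini index; the function names are hypothetical, and the quadratic scan is chosen for readability rather than efficiency.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Standard Gini index GN = 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values: Sequence[float], labels: Sequence[str]) -> Tuple[Optional[float], float]:
    """Return (threshold, DG) maximizing DG = GN(A) - p_d*GN(D) - p_e*GN(E).

    D holds the records below the threshold and E the rest. The threshold
    is None when all values are equal and no split is possible.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([y for _, y in pairs])
    best_thr: Optional[float] = None
    best_gain = -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_gain = gain
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_thr, max(best_gain, 0.0)


values = [1, 2, 3, 10, 11, 12]
labels = ["Y", "Y", "Y", "N", "N", "N"]
print(best_split(values, labels))
```

Applying the same search recursively to D and E, together with a stopping rule, yields the full tree.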

A method for computing the decision tree in three cycles in a distributed environment is proposed below.

Cycle 1: For load balancing, the quantiles are estimated. If r is the number of nodes that compute split points, the quantiles i/r must be computed for i ∈ {1, …, r-1}. Each worker node sends its local histogram data to the master node. At a bounded communication cost, a result with a standard deviation of εZ can be computed, where m is the number of worker nodes and Z is the total number of records in the whole data set.

Cycle 2: According to the quantiles computed in cycle 1, each worker node sends the data falling in each quantile range to the corresponding one of the r master nodes. Each master node computes a local split point. The communication cost of this stage is O(Z).

Cycle 3: The split points computed in cycle 2 are collected, and the best split point is selected as the final result. The communication cost of this stage is O(r).

Because m/ε < Z and r < Z, the overall communication cost is O(Z).
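Cycle 1 can be illustrated with a rough sketch in which every worker node builds a histogram over shared bin edges and the master node merges the histograms and interpolates the i/r quantiles. The shared bin edges, the linear interpolation inside a bin and all function names are assumptions; the source only states that local histogram data are sent to the master node.

```python
from bisect import bisect_right
from typing import List, Sequence


def local_histogram(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count local values per bin; edges are assumed to be shared by all nodes."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        b = bisect_right(edges, v) - 1
        b = max(0, min(b, len(counts) - 1))  # clamp values lying on or outside the outer edges
        counts[b] += 1
    return counts


def merge(histograms: Sequence[Sequence[int]]) -> List[int]:
    """Master node: add up the histograms received from the worker nodes."""
    return [sum(col) for col in zip(*histograms)]


def estimate_quantiles(counts: Sequence[int], edges: Sequence[float], r: int) -> List[float]:
    """Approximate the i/r quantiles, i = 1..r-1, by interpolating inside bins."""
    total = sum(counts)
    cuts: List[float] = []
    cum = 0   # number of records in the bins before bin b
    b = 0
    for i in range(1, r):
        target = total * i / r
        while b < len(counts) - 1 and cum + counts[b] < target:
            cum += counts[b]
            b += 1
        frac = (target - cum) / counts[b] if counts[b] else 0.0
        cuts.append(edges[b] + frac * (edges[b + 1] - edges[b]))
    return cuts


edges = [0, 10, 20, 30, 40]
node_1 = local_histogram([1, 2, 3, 11, 12], edges)
node_2 = local_histogram([21, 22, 31, 32, 33], edges)
print(estimate_quantiles(merge([node_1, node_2]), edges, r=2))
```

The resulting cut points define the r value ranges whose data are shipped to the r master nodes in cycle 2; only bin counts, not records, travel over the network in this cycle.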

The approximate method for decision tree split points is described below.

To further improve the efficiency of computing split points when a moderate loss of accuracy is acceptable, the present invention proposes an approximate method with lower computation and communication cost. The approximate method needs only one cycle; because it only transmits data, a single node acting as master is sufficient in the general case.

The basic principle of the approximate method is that each worker node sends the split points that are locally more likely to be chosen to the master node with a higher probability, and the master node determines the final split point by constructing an unbiased estimate of the class counts.

Specifically, in the first stage, each worker node sets up a send set S. The worker node sorts its local data and first writes the boundary data objects into S; it then recursively computes the split points of each level of the local decision tree and sorts them into a list L in descending order of Gini index value. The data objects l_i in the list are taken out in turn and written into S with the following probability:

p = \min\left(1, d_{ij} \frac{\sqrt{m}}{\epsilon Z}\right)

Here d_ij is the number of records, in the sorted data set, between l_i and the data object l_j in the set S, and m is the number of worker nodes. A data object sent with p = 1 is called a deterministic object and is written into S in the form (l_i, f_li), where f_li is the frequency of each class of data in the portion preceding l_i when l_i is used as the split point. A data object sent with p < 1 is called a probabilistic object and is written into S in the form (l_i, l_i → l_j).
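The send probability and the distinction between deterministic and probabilistic objects can be sketched as below. The function names, the "skipped" outcome for objects that are not drawn, and the sample parameter values are assumptions made for illustration.

```python
import math
import random


def send_probability(d_ij: int, m: int, eps: float, Z: int) -> float:
    """p = min(1, d_ij * sqrt(m) / (eps * Z)) as given in the text."""
    return min(1.0, d_ij * math.sqrt(m) / (eps * Z))


def choose_kind(d_ij: int, m: int, eps: float, Z: int, rng: random.Random) -> str:
    """Decide how a candidate l_i enters the send set S.

    "deterministic": p = 1, sent together with its class frequencies.
    "probabilistic": p < 1 and drawn, sent as a link (l_i, l_i -> l_j).
    "skipped":       p < 1 and not drawn (assumed behaviour).
    """
    p = send_probability(d_ij, m, eps, Z)
    if p >= 1.0:
        return "deterministic"
    return "probabilistic" if rng.random() < p else "skipped"


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
print(send_probability(50, m=4, eps=0.1, Z=10_000))
print(choose_kind(1_000, m=4, eps=0.1, Z=10_000, rng=rng))
```

A candidate far from the objects already in S (large d_ij) is therefore always sent, while candidates close to an existing object are sent only occasionally, which keeps the size of S small.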

In the second stage, an unbiased estimate f' of the class frequency f is constructed. Define f_i as the portion of the frequency f contributed by the i-th worker node. That is, for a candidate split point v, to estimate its corresponding f_i(v), first obtain the data object t_n nearest to v in the set S sent by the i-th worker node. If t_n is a probabilistic object, a path must be found from t_n to a deterministic object t_d; the vector length of this path is defined as R, namely the number of steps to the right minus the number of steps to the left, so R may be negative, and the f of the deterministic object t_d obtained is denoted f_d. If t_n is a deterministic object, then R = 0. Where the context is clear, f_i(v) is abbreviated as f_i. The estimated frequency is given by:

f'(v) = \sum_{i=1}^{m} f_i'(v)

Once the decision tree has been computed and the data have been partitioned, a retrieval request is processed in the following basic steps:

Step 1: The tokens of the character-type attributes and the value ranges of the numeric attributes of the retrieval request are hashed. The hash function must be the same as the one used for data partitioning. Because a query may have multiple character-type tokens and numeric attributes, multiple hash values can be produced. By means of consistent hashing, the query is sent to different computing nodes according to each hash value; each data node maintains the data blocks with the same hash value and retrieves within them. If the average number of hash values per query is h and the average length of a data object is l, the communication cost of this process is hl.
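Step 1 can be sketched with a small consistent-hash ring. The ring with virtual nodes, the key format, the value-range width and the node names are assumptions chosen for illustration; the source only requires that the hash function match the one used for partitioning and that consistent hashing be used.

```python
import bisect
import hashlib
from typing import Dict, Iterable, Set


def stable_hash(key: str) -> int:
    """Deterministic hash shared by partitioning and query routing."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (non-empty node list)."""

    def __init__(self, nodes: Iterable[str], replicas: int = 64) -> None:
        self._ring = sorted(
            (stable_hash(f"{n}#{i}"), n) for n in nodes for i in range(replicas)
        )
        self._points = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        """First ring point clockwise from the key's hash owns the key."""
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def query_keys(tokens: Iterable[str], numbers: Dict[str, float], width: float = 10.0) -> Set[str]:
    """One key per token and one key per numeric value range of the query."""
    keys = {f"tok:{t}" for t in tokens}
    keys |= {f"num:{name}:{int(value // width)}" for name, value in numbers.items()}
    return keys


def route(ring: HashRing, tokens: Iterable[str], numbers: Dict[str, float]) -> Dict[str, Set[str]]:
    """Map each target node to the keys (data blocks) it should search."""
    targets: Dict[str, Set[str]] = {}
    for key in query_keys(tokens, numbers):
        targets.setdefault(ring.node_for(key), set()).add(key)
    return targets


ring = HashRing(["node-a", "node-b", "node-c"])
print(route(ring, {"fever", "cough"}, {"age": 34.0}))
```

The query in the example produces three keys, so at most three copies of it are sent. With consistent hashing, removing a node moves only the keys that node owned, which is why it suits a distributed environment where nodes join and leave.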

Step 2: After a data node receives a query, it compares the query, according to its hash value, with the data objects in the corresponding data block. When the similarity between the query and an object in the data block is computed, not all attributes are compared; instead, the attributes are compared one by one following the attribute nodes of the decision tree, and the left or right child node is selected for further comparison according to the computed similarity value, until a leaf node is reached. The final similarity result is determined by the class of the leaf node.
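The traversal in step 2 can be sketched as follows. The tree layout, the per-attribute similarity functions, the thresholds and the sample records are all hypothetical; the sketch only shows how walking the tree lets the comparison stop before every attribute has been examined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union


@dataclass
class Leaf:
    label: str  # "Y" = similar, "N" = not similar


@dataclass
class Node:
    attribute: str              # attribute compared at this node
    threshold: float            # similarity value that selects the branch
    left: Union["Node", Leaf]   # taken when similarity < threshold
    right: Union["Node", Leaf]  # taken when similarity >= threshold


def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (assumed measure)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def numeric_similarity(a: float, b: float) -> float:
    """Similarity that decays with the absolute difference (assumed measure)."""
    return 1.0 / (1.0 + abs(a - b))


def is_similar(tree: Union[Node, Leaf], query: dict, obj: dict,
               sims: Dict[str, Callable]) -> Tuple[bool, int]:
    """Walk the tree; return (similar?, number of attributes compared)."""
    node, compared = tree, 0
    while isinstance(node, Node):
        s = sims[node.attribute](query[node.attribute], obj[node.attribute])
        compared += 1
        node = node.right if s >= node.threshold else node.left
    return node.label == "Y", compared


sims = {"diagnosis": token_similarity, "age": numeric_similarity}
tree = Node("diagnosis", 0.5, Leaf("N"), Node("age", 0.2, Leaf("N"), Leaf("Y")))
query = {"diagnosis": "acute viral fever", "age": 34}
obj_1 = {"diagnosis": "viral fever", "age": 36}
obj_2 = {"diagnosis": "hip fracture", "age": 34}
print(is_similar(tree, query, obj_1, sims))
print(is_similar(tree, query, obj_2, sims))
```

For the second object the walk ends after a single comparison, because the most important attribute already rules it out.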

The communication cost of this process is (1-α) B (1-β). Here B is the number of data objects in the data block; α is the data locality ratio, which arises because the data block maintains only IDs while the actual storage is handled by the underlying system; p_i is the probability that a data object is evaluated down to the i-th level of relations in the decision tree; r is the average number of relations of a data object; and β is the cache rate of related objects: when lower-level relations are requested, closely related objects are to some extent cached locally, so no further communication cost is incurred when the same data object is requested again.

In summary, the present invention proposes a method for the distributed management and retrieval of data with associations, which improves the optimization of the computation, reduces the complexity of the method, and saves computation cost.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented with program code executable by a computing system, so that they can be stored in a database management system and executed by the computing system. The present invention is thus not limited to any specific combination of hardware and software.

It should be understood that the above embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement or improvement made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims

1. A database retrieval method, characterized by comprising:

generating a decision tree from sample data;

2. The method according to claim 1, characterized in that generating the decision tree from sample data further comprises:

3. The method according to claim 2, characterized in that performing distributed partitioning of the data further comprises: the attributes of each object contain multiple tokens or fall within a certain numerical range; hash-based partitioning is adopted, in which objects are hashed to different data blocks according to their tokens and numerical ranges, so that objects whose attributes contain the same token, or whose attribute values fall within the same value range, are hashed into the same data block; consistent hashing is applied in the distributed environment; when an object needs to be written to multiple data blocks, only the object ID is copied, and the storage and mirroring of the data are maintained by the database management system.

4. The method according to claim 3, characterized in that dispatching the retrieval request to the corresponding data blocks further comprises: