CN105205172A - Database retrieval method - Google Patents

Info

Publication number
CN105205172A
Authority
CN
China
Prior art keywords
data
decision tree
retrieval
node
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510660869.3A
Other languages
Chinese (zh)
Inventor
许驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Original Assignee
CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Priority to CN201510660869.3A
Publication of CN105205172A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation

Abstract

The invention provides a database retrieval method. The method comprises: generating a decision tree from sample data; partitioning the data in a distributed manner into multiple blocks; and, when a query is issued, sending the retrieval request to the corresponding data blocks and computing similar records according to the decision tree. The method performs distributed management and retrieval over associated data, improves the optimization of the computation, reduces the complexity of the method, and saves computation cost.

Description

A database retrieval method
Technical field
The present invention relates to data storage, and in particular to a database retrieval method.
Background
With the rapid development of intelligent healthcare and the emergence of massive medical data, large databases are needed as carriers to store these data, but retrieving such massive data has become a major problem. The volume of literature retrieval in the medical community has also grown exponentially along with Internet resources. Similarity retrieval has a wide range of application scenarios, such as content-based retrieval, duplicate record identification, and information retrieval optimization. In general, similarity retrieval means the following: given a dataset D, a query object q, and a similarity measure s, either the user sets a number k and the k objects most similar to q are returned, or the user specifies a threshold t and all objects whose similarity to q is greater than t are returned. With the development and widespread use of the cloud, the amount of data to be managed is growing rapidly, and efficient similarity retrieval is increasingly important.
However, existing similarity retrieval methods are ill-suited to data that carry associations between objects. Traditional indexing methods perform poorly when the number of object attributes to be considered is high, and they do not take into account the importance of attributes at the different levels of a relation, so their accuracy is poor.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a database retrieval method, comprising:
generating a decision tree from sample data;
partitioning the data in a distributed manner into multiple blocks;
when a query is issued, dispatching the retrieval request to the corresponding data blocks and computing similar records according to the decision tree.
Preferably, generating the decision tree from sample data further comprises:
analyzing and processing the sample data in advance; when computing the decision tree, setting two thresholds to prevent overfitting, namely a false positive threshold FP and a false negative threshold FN; when computing a node of the decision tree, if the proportion of the first class is greater than or equal to 1-FP, ending the computation and labeling the node with the first class; likewise, if the proportion of the second class is greater than or equal to 1-FN, ending the computation and labeling the node with the second class; and computing the corresponding decision tree from the attribute list of the sample data.
Preferably, partitioning the data in a distributed manner further comprises: the attributes of each object contain multiple tokens or fall within a certain numerical range; hash-based partitioning is used, hashing objects to different data blocks according to their tokens and numerical ranges, so that objects whose attributes contain the same token or whose attribute values fall within the same range are hashed into the same data block; consistent hashing is applied in the distributed environment; when an object needs to be written to multiple data blocks, only the object ID is copied; and the storage and mirroring of the data are maintained by the database management system (DBMS).
Preferably, dispatching the retrieval request to the corresponding data blocks further comprises:
once the decision tree has been computed and the data have been partitioned, processing the retrieval request according to the following steps:
Step 1: hash the retrieval request according to the tokens of its character-type attributes and the value ranges of its numeric attributes, using the same hash function as was used for data partitioning; by consistent hashing, the request is sent to different computing nodes according to each hash value, and each data node maintains and searches the data blocks with the same hash value;
Step 2: after a data node receives a query, it compares the query, according to its hash value, with the data objects in the corresponding data block; when computing the similarity between the query and an object in the data block, the attributes are compared in turn following the attribute nodes of the decision tree, and the left or right child node is chosen for further comparison according to the computed similarity value, until a leaf node is reached; the final similarity result is determined by the class of the leaf node.
Compared with the prior art, the present invention has the following advantages:
it performs distributed management and retrieval over associated data, improves the optimization of the computation, reduces the complexity of the method, and saves computation cost.
Brief description of the drawings
Fig. 1 is a flowchart of the database retrieval method according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the present invention is provided below, together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is defined only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
To address the above problems, the present invention proposes a decision tree computation method for large-scale data in a distributed environment. Based on the decision tree, it analyzes the importance of different attributes and of the attributes of associated objects at different levels, so that during candidate-set verification the computation can be terminated early through accuracy analysis without comparing all attributes. The approximation quality of the result is guaranteed by probability theory, recursive computation is avoided, and computational complexity and storage cost are reduced.
One aspect of the present invention provides a database retrieval method. Fig. 1 is a flowchart of the database retrieval method according to an embodiment of the present invention. As shown in Fig. 1, the invention is implemented in the following steps:
The method first partitions the data into multiple blocks, placing objects that may be similar in the same block; the ID of a data object may be sent to multiple blocks. Then, when a query is issued, it is dispatched to the corresponding data blocks, and similar records are computed according to the decision tree. The whole process has four parts: steps (1) and (2) are offline processing, and steps (3) and (4) are online processing.
(1) Generate a decision tree from sample data
A traditional decision tree determines the class an object belongs to as quickly as possible according to the importance of its attributes. The present invention uses the decision tree to reflect attribute importance, and thereby quickly judges whether objects are similar.
To compute the decision tree in the offline phase, the sample data must be analyzed and processed in advance. To prevent overfitting when computing the decision tree, two thresholds are set: a false positive threshold FP and a false negative threshold FN. When computing a node of the decision tree, if the proportion of class Y is greater than or equal to 1-FP, the computation ends and the node is labeled Y. Likewise, if the proportion of class N is greater than or equal to 1-FN, the computation ends and the node is labeled N.
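The stopping rule above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the function name `leaf_label`, the string labels "Y" and "N", and the default threshold values are assumptions made for the example.

```python
from collections import Counter


def leaf_label(labels, fp=0.05, fn=0.05):
    """Return 'Y' or 'N' if splitting can stop at this node, else None.

    labels: class labels ('Y' or 'N') of the samples that reached the node.
    fp, fn: false positive / false negative thresholds (illustrative defaults).
    """
    if not labels:
        return None
    counts = Counter(labels)
    total = len(labels)
    # Stop and mark the node Y when the Y proportion reaches 1 - FP.
    if counts["Y"] / total >= 1 - fp:
        return "Y"
    # Stop and mark the node N when the N proportion reaches 1 - FN.
    if counts["N"] / total >= 1 - fn:
        return "N"
    return None  # keep splitting
```

A tree builder would call this check before searching for a split at each node, and turn the node into a leaf whenever the result is not None.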
The corresponding decision tree can be computed from the attribute list of the sample data. When the data volume is large, computing the decision tree is expensive; in particular, the pre-sorting step of decision tree computation consumes a large amount of communication bandwidth in a distributed environment. The present invention therefore proposes an optimized decision tree computation method.
(2) Partition the data in a distributed manner
There are two common ways to partition data in a distributed manner. One is index-based partitioning, in which different data types can use different indexes. The other is hash-based partitioning: the attributes of each object may contain multiple tokens or fall within a certain numerical range, and objects are hashed to different data blocks according to their tokens and numerical ranges. Current distributed database management systems place more emphasis on scalability and fault tolerance, whereas the first method suffers from a bottleneck problem and requires a global index to be maintained; the present invention therefore uses hash-based partitioning. Objects whose attributes contain the same token, or whose attribute values fall within the same range, are hashed into the same data block, and consistent hashing is used in the distributed environment. When an object needs to be written to multiple data blocks, only its ID is copied. The storage and mirroring of the data are maintained transparently by the database management system (DBMS).
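A minimal sketch of this kind of hash-based partitioning follows. Everything named here is hypothetical and chosen for illustration: the `HashRing` class, the key format, the fixed `bucket_width` for numeric ranges, and the use of MD5 as a stable hash are assumptions, not details given in the text.

```python
import hashlib
from bisect import bisect_right
from collections import defaultdict


def stable_hash(key):
    """Hash that is stable across processes (an assumed choice of MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (stable_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def partition_keys(attrs, bucket_width=10):
    """One key per token of a string attribute, one per numeric range bucket."""
    keys = set()
    for attr, value in attrs.items():
        if isinstance(value, str):
            keys.update(f"{attr}:tok:{t}" for t in value.lower().split())
        else:
            keys.add(f"{attr}:range:{int(value // bucket_width)}")
    return keys


def build_blocks(objects, ring):
    """Map node -> key -> set of object IDs; only IDs are copied into blocks."""
    blocks = defaultdict(lambda: defaultdict(set))
    for obj in objects:
        for key in partition_keys(obj["attrs"]):
            blocks[ring.node_for(key)][key].add(obj["id"])
    return blocks
```

With this layout, two records that share the token "fever" or fall in the same age bucket end up in the same block, and a record with several tokens is referenced from several blocks by ID only.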
(3) Dispatch the query to the corresponding data blocks
This is essentially the same as the method in step (2), except that the query must be replicated into multiple copies and sent to the computing nodes, so that the retrieval result is not incomplete.
(4) Perform similarity retrieval within a block according to the decision tree
After the query has been dispatched to the corresponding data blocks, an exact computation is needed within each block. Computing with the decision tree allows the computation to terminate early, which reduces the computation cost. The efficiency and accuracy of the decision tree computation therefore affect overall performance.
The optimized methods proposed by the present invention for computing decision tree split points comprise an exact method and an approximate method, which are introduced in turn below.
Exact method for decision tree split points
First, split points are found recursively using the Gini index: a split point divides dataset A into datasets D and E such that $DG = GN(A) - p_d \times GN(D) - p_e \times GN(E)$ is maximized, where $p_d$ and $p_e$ are the proportions of A taken up by D and E respectively. The Gini index is $GN(A) = 1 - \sum_{k=1}^{l} p_k^2$, where $p_k$ is the proportion of class k in set A and l is the number of classes.
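The split criterion can be illustrated on a single numeric attribute. The sketch below assumes the standard Gini index, one minus the sum of squared class proportions, and uses hypothetical function names; it recomputes label counts for every candidate, which is simple but not efficient.

```python
from collections import Counter


def gini(labels):
    """Gini index GN: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(parent, left, right):
    """DG = GN(A) - p_d * GN(D) - p_e * GN(E) for a split of A into D and E."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


def best_split(records):
    """records: list of (value, label). Return (threshold, gain) or None."""
    records = sorted(records, key=lambda r: r[0])
    labels = [label for _, label in records]
    best = None
    for i in range(1, len(records)):
        if records[i - 1][0] == records[i][0]:
            continue  # no threshold can separate equal values
        gain = gini_gain(labels, labels[:i], labels[i:])
        if best is None or gain > best[1]:
            threshold = (records[i - 1][0] + records[i][0]) / 2
            best = (threshold, gain)
    return best
```

For the four records (1, Y), (2, Y), (3, N), (4, N), the parent Gini index is 0.5 and the split at 2.5 produces two pure halves, so the gain equals the full 0.5.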
A method for computing the decision tree in three cycles in a distributed environment is proposed below.
Cycle 1: for load balancing, estimate the quantiles. If r is the number of nodes that compute split points, the quantiles i/r must be computed for i ∈ {1, ..., r-1}. Each worker node sends its local histogram data to the master node. With a communication cost of O(m/ε), a result with standard deviation εZ can be computed, where m is the number of worker nodes and Z is the total amount of data in the whole dataset.
Cycle 2: according to the quantiles computed in cycle 1, each worker node sends the data in each quantile range to the corresponding one of the r master nodes. Each master node computes its local split point. The communication cost of this stage is O(Z).
Cycle 3: the split points computed in cycle 2 are collected, and the best split point is selected as the final result. The communication cost of this stage is O(r).
Because m/ε < Z and r < Z, the overall communication cost is O(Z).
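Written out term by term, the total looks as follows. The cycle 1 term $O(m/\varepsilon)$ is an assumption inferred from the condition $m/\varepsilon < Z$ stated above; the other two terms are the costs given for cycles 2 and 3.

```latex
\underbrace{O\!\left(\frac{m}{\varepsilon}\right)}_{\text{cycle 1 (assumed)}}
+ \underbrace{O(Z)}_{\text{cycle 2}}
+ \underbrace{O(r)}_{\text{cycle 3}}
= O(Z),
\qquad \text{since } \frac{m}{\varepsilon} < Z \text{ and } r < Z .
```

The data redistribution in cycle 2 therefore dominates, which is the cost the approximate method is meant to avoid.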
Approximate method for decision tree split points
To further improve the efficiency of computing split points, where a moderate loss of accuracy is acceptable, the present invention proposes an approximate method with lower computation and communication cost. The approximate method needs only one cycle; because only part of the data is transmitted, a single master node is generally sufficient.
The basic principle of the approximate method is that each worker node sends its locally more promising split points to the master node with higher probability, and the master node determines the final split point by making an unbiased estimate of the class counts.
Specifically, in the first stage, each worker node sets up a send set S. The worker node sorts its local data and first writes the boundary data objects into S; it then recursively computes the split points of each level of the local decision tree and sorts them into a list L in descending order of Gini index value. The data objects $l_i$ in the list are taken out in turn, and $l_i$ is written to S with the following probability:
$$p = \min\left(1, \frac{d_{ij}\, m}{\varepsilon Z}\right)$$
Here, $d_{ij}$ is the number of records, in the sorted dataset, between data object $l_i$ and data object $l_j$ in the set S, and m is the number of worker nodes. A data object sent with p = 1 is called a deterministic object and is written to S in the form $(l_i, f_{l_i})$, where $f_{l_i}$ is the frequency of each class in the part preceding $l_i$ when $l_i$ is used as the split point. A data object sent with p < 1 is called a probabilistic object and is written to S in the form $(l_i, l_i \to l_j)$.
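The selection step on a worker node can be sketched as below. This is a sketch under a stated assumption: it reads the sending probability as p = min(1, d_ij * m / (eps * Z)), and the function names and the (value, d_ij) input format are hypothetical. The payloads attached to each object (class frequencies or the pointer to l_j) are omitted.

```python
import random


def send_probability(d_ij, m, eps, Z):
    """Assumed form of the sending probability: min(1, d_ij * m / (eps * Z))."""
    return min(1.0, d_ij * m / (eps * Z))


def select_candidates(candidates, m, eps, Z, rng=None):
    """Decide which local split candidates go into the send set S.

    candidates: list of (split_value, d_ij), already sorted by descending
    Gini index value. Returns a list of (split_value, kind) where kind is
    'deterministic' (p = 1) or 'probabilistic' (sent with p < 1).
    """
    rng = rng or random.Random()
    sent = []
    for value, d_ij in candidates:
        p = send_probability(d_ij, m, eps, Z)
        if p >= 1.0:
            sent.append((value, "deterministic"))
        elif rng.random() < p:
            sent.append((value, "probabilistic"))
    return sent
```

Under this reading, a candidate separated from its neighbor in S by at least eps * Z / m records is always sent, while closely spaced candidates are thinned out at random.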
In the second stage, an unbiased estimate f′ of the class frequency f is constructed. Define $f_i$ as the portion of the frequency f that comes from the i-th worker node. For a candidate split point v, to estimate the corresponding $f_i(v)$, first obtain the data object $t_n$ nearest to v in the set S sent by the i-th worker node. If $t_n$ is a probabilistic object, a path must be found from $t_n$ to a deterministic object $t_d$; the signed length of this path is defined as R, that is, the number of steps to the right minus the number of steps to the left, so R may be negative. The f of the deterministic object $t_d$ obtained in this way is denoted $f_d$. If $t_n$ is a deterministic object, then R = 0. Where the context is clear, $f_i(v)$ is abbreviated to $f_i$. The estimated frequency is given by:
$$f'(v) = \sum_{i=1}^{m} f'_i(v)$$
Once the decision tree has been computed and the data have been partitioned, a retrieval request is processed with the following basic steps:
Step 1: hash the retrieval request according to the tokens of its character-type attributes and the value ranges of its numeric attributes. The hash function must be the same as the one used for data partitioning. Because a query may have multiple character-type tokens and numeric attributes, it can produce multiple hash values. By consistent hashing, the query is sent to different computing nodes according to each hash value, and each data node maintains and searches the data blocks with the same hash value. If the average number of hash values per query is h and the average length of a data object is l, the communication cost produced in this process is hl.
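Step 1 can be sketched as follows. The key format, the `bucket_width` parameter, the MD5-based hash, and the function names are assumptions made for the example; for brevity the node lookup uses a plain modulo over the node list as a stand-in for a consistent-hash ring, and a real deployment would reuse the same lookup as the partitioning stage.

```python
import hashlib
from collections import defaultdict


def stable_hash(key):
    """Hash that is stable across processes (an assumed choice of MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def query_keys(query, bucket_width=10):
    """Same keying scheme as partitioning: tokens and numeric range buckets."""
    keys = set()
    for attr, value in query.items():
        if isinstance(value, str):
            keys.update(f"{attr}:tok:{t}" for t in value.lower().split())
        else:
            keys.add(f"{attr}:range:{int(value // bucket_width)}")
    return keys


def route_query(query, nodes):
    """Return {node: [keys]}; one copy of the query goes to each listed node.

    Modulo placement is a simplification standing in for consistent hashing.
    """
    plan = defaultdict(list)
    for key in sorted(query_keys(query)):
        plan[nodes[stable_hash(key) % len(nodes)]].append(key)
    return dict(plan)


def communication_cost(query, avg_object_length):
    """Estimate h * l, with h taken as this query's number of hash keys."""
    return len(query_keys(query)) * avg_object_length
```

For example, the query `{"diagnosis": "viral fever", "age": 34}` yields three keys, so with an average object length of 200 the estimated cost is 600 units.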
Step 2: after a data node receives a query, it compares the query, according to its hash value, with the data objects in the corresponding data block. When computing the similarity between the query and an object in the data block, not all attributes are compared; instead, the attributes are compared in turn following the attribute nodes of the decision tree, and the left or right child node is chosen for further comparison according to the computed similarity value, until a leaf node is reached. The final similarity result is determined by the class of the leaf node.
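The traversal in step 2 can be sketched as follows. The `Node` layout, the per-node similarity threshold, the rule that the right child is taken when the similarity reaches the threshold, and the two similarity measures (token Jaccard for strings, relative difference for numbers) are all assumptions for illustration; the text does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    attr: Optional[str] = None      # attribute compared at an internal node
    threshold: float = 0.0          # similarity needed to take the right child
    left: Optional["Node"] = None   # taken when similarity < threshold
    right: Optional["Node"] = None  # taken when similarity >= threshold
    label: Optional[str] = None     # 'Y' or 'N' at a leaf, None otherwise


def similarity(a, b):
    """Assumed measures: token Jaccard for strings, relative gap for numbers."""
    if isinstance(a, str) and isinstance(b, str):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        union = ta | tb
        return len(ta & tb) / len(union) if union else 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1)


def is_similar(tree, query, obj):
    """Walk the tree; return (similar?, number of attributes compared)."""
    node, compared = tree, 0
    while node.label is None:
        s = similarity(query[node.attr], obj[node.attr])
        compared += 1
        node = node.right if s >= node.threshold else node.left
    return node.label == "Y", compared
```

Because the most important attribute sits at the root, a clearly dissimilar object is rejected after a single comparison, which is the early termination the method relies on.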
The communication cost produced in this process is (1-α) B (1-β). Here, B is the number of data objects in a data block; α is the proportion of data that is local, since what the data block maintains is IDs and the actual storage is handled by the underlying system; $p_i$ is the probability that a data object is evaluated down to the i-th level of relations in the decision tree; R is the average number of relations of a data object; and β is the cache hit rate for related objects: closely related objects are cached locally to some extent when lower-level relations are requested, so no further communication cost is incurred when the same data object is requested again.
In summary, the present invention proposes a method for distributed management and retrieval of associated data, which improves the optimization of the computation, reduces the complexity of the method, and saves computation cost.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system. The modules or steps can be concentrated on a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a database management system (DBMS) and executed by a computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are only used to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims (4)

1. A database retrieval method, characterized by comprising:
generating a decision tree from sample data;
partitioning the data in a distributed manner into multiple blocks;
when a query is issued, dispatching the retrieval request to the corresponding data blocks and computing similar records according to the decision tree.
2. The method according to claim 1, characterized in that generating the decision tree from sample data further comprises:
analyzing and processing the sample data in advance; when computing the decision tree, setting two thresholds to prevent overfitting, namely a false positive threshold FP and a false negative threshold FN; when computing a node of the decision tree, if the proportion of the first class is greater than or equal to 1-FP, ending the computation and labeling the node with the first class; when computing a node of the decision tree, if the proportion of the second class is greater than or equal to 1-FN, ending the computation and labeling the node with the second class; and computing the corresponding decision tree from the attribute list of the sample data.
3. The method according to claim 2, characterized in that partitioning the data in a distributed manner further comprises: the attributes of each object contain multiple tokens or fall within a certain numerical range; hash-based partitioning is used, hashing objects to different data blocks according to their tokens and numerical ranges, so that objects whose attributes contain the same token or whose attribute values fall within the same range are hashed into the same data block; consistent hashing is applied in the distributed environment; when an object needs to be written to multiple data blocks, only the object ID is copied; and the storage and mirroring of the data are maintained by the database management system (DBMS).
4. The method according to claim 3, characterized in that dispatching the retrieval request to the corresponding data blocks further comprises:
once the decision tree has been computed and the data have been partitioned, processing the retrieval request according to the following steps:
Step 1: hash the retrieval request according to the tokens of its character-type attributes and the value ranges of its numeric attributes, using the same hash function as was used for data partitioning; by consistent hashing, the request is sent to different computing nodes according to each hash value, and each data node maintains and searches the data blocks with the same hash value;
Step 2: after a data node receives a query, it compares the query, according to its hash value, with the data objects in the corresponding data block; when computing the similarity between the query and an object in the data block, the attributes are compared in turn following the attribute nodes of the decision tree, and the left or right child node is chosen for further comparison according to the computed similarity value, until a leaf node is reached; the final similarity result is determined by the class of the leaf node.
CN201510660869.3A 2015-10-14 2015-10-14 Database retrieval method Pending CN105205172A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510660869.3A CN105205172A (en) 2015-10-14 2015-10-14 Database retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510660869.3A CN105205172A (en) 2015-10-14 2015-10-14 Database retrieval method

Publications (1)

Publication Number Publication Date
CN105205172A true CN105205172A (en) 2015-12-30

Family

ID=54952855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510660869.3A Pending CN105205172A (en) 2015-10-14 2015-10-14 Database retrieval method

Country Status (1)

Country Link
CN (1) CN105205172A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912877A (en) * 2016-05-12 2016-08-31 成都鼎智汇科技有限公司 Data processing method of medicine product

Similar Documents

Publication Publication Date Title
Yagoubi et al. Dpisax: Massively distributed partitioned isax
CN102915347B (en) A kind of distributed traffic clustering method and system
CN105760443A (en) Project recommending system, device and method
Youn et al. Efficient data stream clustering with sliding windows based on locality-sensitive hashing
CN103761286B (en) A kind of Service Source search method based on user interest
Apiletti et al. Pampa-HD: A parallel MapReduce-based frequent pattern miner for high-dimensional data
CN108595624A (en) A kind of large-scale distributed functional dependence discovery method
CN108280176A (en) Data mining optimization method based on MapReduce
CN116680090A (en) Edge computing network management method and platform based on big data
CN108256083A (en) Content recommendation method based on deep learning
CN108256086A (en) Data characteristics statistical analysis technique
Kumar Efficient k-mean clustering algorithm for large datasets using data mining standard score normalization
CN105205172A (en) Database retrieval method
Ramzan et al. A comprehensive review on data stream mining techniques for data classification; and future trends
Guo et al. K-loop free assignment in conference review systems
CN109800231A (en) A kind of real-time track co-movement motion pattern detection method based on Flink
CN111107493B (en) Method and system for predicting position of mobile user
Zheng et al. User preference-based data partitioning top-k skyline query processing algorithm
CN105354243B (en) The frequent probability subgraph search method of parallelization based on merger cluster
Levchenko et al. Spark-parsketch: a massively distributed indexing of time series datasets
Bai et al. An efficient skyline query algorithm in the distributed environment
Lakshmi et al. Compact Tree for Associative Classification of Data Stream Mining
Challa et al. AnySC: Anytime Set-wise Classification of Variable Speed Data Streams
Xu et al. Explore maximal frequent itemsets for big data pre-processing based on small sample in cloud computing
CN105389337A (en) Method for searching big data space for statistical significance mode

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151230