CN105117442B

CN105117442B - A probability-based big data query method

Info

Publication number: CN105117442B
Application number: CN201510492377.8A
Authority: CN
Inventors: 宋杰; 伍晋博; 张川; 张一川; 张莉
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2015-08-12
Filing date: 2015-08-12
Publication date: 2018-05-04
Anticipated expiration: 2035-08-12
Also published as: CN105117442A

Abstract

The present invention discloses a probability-based big data query method in the field of database technology. The method comprises: a step of partitioning a data set with multiple attributes according to a data model; a step of loading the partitioned data set according to a probabilistic data placement model; and a step of performing a probabilistic query on the data set. The method is an approximately complete query method that improves query performance by giving up an appropriate amount of query completeness. A probability-based data placement model provides probabilistic placement of data and a way to solve for the existence probability of data in each storage file. A heuristic data query method lets the database system query data according to a recall probability, that is, the probability that a query retrieves all data satisfying the query condition. The query error of a probabilistic query is guaranteed through probability calculation.

Description

A probability-based big data query method

Technical field

The invention belongs to the field of database technology, and in particular relates to a probability-based big data query method.

Background technology

The deep fusion of the ternary world of people, machines, and things has triggered explosive growth in data scale and a high degree of complexity in data patterns, and the world has entered the networked big data era. The arrival of this era poses great challenges to traditional data management systems. NoSQL (Not only SQL) databases have won wide favor in academia and industry thanks to their high scalability, high availability, and flexible data models. Data query is one of the core technologies of a database system; with the development of cloud computing and NoSQL database technology, NoSQL-based data query technology has attracted much attention and has been studied extensively in industry.

It is well known that current mainstream NoSQL databases manage big data mainly with technologies such as the MapReduce programming model and distributed file systems: the distributed file system is mainly used to store big data, and the MapReduce programming model is used to process it. The query performance of a NoSQL database is closely tied to data storage and index design, MapReduce-based query processing, query optimization, and related problems. Current research on big data query technology concentrates on optimizing the performance of these key technologies; these problems have been studied broadly and deeply, and many excellent solutions exist. The paper "A Survey of Query Techniques in Cloud Data Management Systems" summarizes and analyzes research on query technology in cloud data management systems from several aspects, including index management, query processing, query optimization, and online aggregation. However, as far as the query mode is concerned, both traditional relational databases and newer NoSQL databases use complete queries: for a given query condition, however the matching algorithm is defined (exact or approximate) and however the result set is sorted, the query always returns all matching data. For example, suppose a user information table contains fields such as ID card number, name, and age. For any given query condition, such as all users older than 30 or all users named Zhang San, the query always returns all data satisfying the condition.

In a big data environment, the large data scale and complex data structures make complete queries expensive in time. Many practical applications show that users do not need a definite, complete query result, nor an exactly ordered result (for example, in Top-k queries); they only need a partial result that meets a certain completeness requirement, or they can give up some query completeness to meet performance requirements. For example, people searching at an airport for hotels that meet certain conditions do not need the full result set; they care more about response time. The complete query mode used by current database systems cannot meet this demand, and an approximately complete query technique is urgently needed to fill the gap. An approximately complete query differs from a traditional complete query in that its approximation lies in the likelihood of retrieving all data, that is, the probability of retrieving every record that satisfies the query condition. This probability is referred to here as the recall probability; it describes how likely the query result set is to be the complete data set.

Summary of the invention

In view of the deficiencies of the prior art, the object of the present invention is to provide a probability-based big data query method that meets the need for approximately complete queries in a big data environment.

The technical solution of the present invention is as follows:

A probability-based big data query method comprises the following steps:

Step 1: Partition the data set, which has multiple attributes;

Step 1.1: Select one or more attributes of the data set as its query attributes, and specify an equal-width partition granularity for the value domain of each query attribute;

Step 1.2: Fill in missing query attribute values in the data set; typically, a missing value is set to the minimum value, the maximum value, or the null value of that query attribute's value domain;

Step 1.3: Determine the data type of the query attribute values, which is either numeric or text; if numeric, perform step 1.4; if text, perform step 1.5;

Step 1.4: Sort by the magnitude of the query attribute values, partition the query attribute into equal-width intervals according to its partition granularity, and continue with step 1.6;

Step 1.5: Sort by the lexicographic order of the first letter of the query attribute values, partition the query attribute into equal-width intervals according to its partition granularity, and continue with step 1.6;

Step 1.6: Store the dimension information of each dimension in the distributed file system; the dimension information mainly includes the dimension name, the data type of the dimension values, and the partition granularity of the dimension.

Step 2: Load the partitioned data set;

Step 2.1: Group all data blocks obtained by partitioning the data set;

Treat each query attribute as one dimension of a multidimensional data space, so that the data in the data set is distributed in a multidimensional data space. Partitioning the value domain of a query attribute into equal widths is in fact partitioning the value space of the corresponding dimension into equal widths. Based on the partitioning of every dimension, the data distributed in the multidimensional data space is divided into many small data blocks, each of which is referred to here as a block;

Number the blocks in the multidimensional data space using a multidimensional space linearization technique, and divide the blocks into one or more block groups in order of their numbers;

Step 2.2: Create the storage directory of the data set in the distributed file system;

Step 2.2.1: Check whether the root directory in which the database system stores data exists; if it does not, perform step 2.2.2; if it does, perform step 2.2.3;

Step 2.2.2: Create the root directory in which the database system stores data, then perform step 2.2.3;

Step 2.2.3: Under the root directory, create the table directory that stores this data set; the directory is named with the name specified for the data set;

Step 2.2.4: For each block group, create m bucket subdirectories to store data; each of these m subdirectories is named by the block group number together with the bucket subdirectory number;

Step 2.3: Place the data of each block in each block group into the m different bucket subdirectories under the table directory with m different placement probabilities; the data is stored in the trunk files of the bucket subdirectories;

Any record in a block may be stored in a different trunk file in any of the m bucket subdirectories; these m trunk files are referred to here as a trunk group. For the data of any block placed into a trunk group, the placement count of that block's data in the trunk group must be recorded;

If any trunk file in the trunk group reaches the specified size, perform step 2.4; otherwise, continue with step 2.3;

If all data in the data set has been placed, perform step 2.5;

Step 2.4: Create a new trunk file in each of the m bucket subdirectories to store data, then perform step 2.3;

Step 2.5: Store, in the distributed file system, the placement counts of each block of each block group in all trunk groups.

Step 3: Perform a probabilistic query on the data set;

Step 3.1: The user sets the query condition by entering a query statement;

The query statement contains four clauses: select, from, where, and recall. The select clause specifies the attributes to query and the type of aggregation operation, which is one of avg, min, max, sum, and count; the from clause specifies the target data set; the where clause specifies the query attributes and their values; the recall clause specifies the recall probability, whose value is denoted p_r. The recall probability is a number greater than 0 and less than or equal to 1 that expresses how likely the query is to retrieve all data satisfying the query condition;

Step 3.2: Check whether the query condition set in step 3.1 satisfies the following constraints:

Constraint 1: The target data set must exist in the database system;

Constraint 2: The queried attributes must be among the specified query attributes and form a non-empty subset of the query attribute set;

Constraint 3: The aggregation operation must be one of the specified aggregation operations;

Constraint 4: The recall probability must be a decimal greater than 0 and less than or equal to 1;

If constraints 1 to 3 are satisfied but the recall probability is unspecified or does not satisfy constraint 4, perform step 3.3; if all four constraints are satisfied, perform step 3.4; if any of constraints 1 to 3 is not satisfied, the query fails and the procedure ends;

Step 3.3: Set the recall probability to 1, then perform step 3.4;

Step 3.4: From the data table and query attributes specified in the query statement, determine the block and block group to which the queried data belongs;

Step 3.5: Read the placement counts, in each trunk group, of the data of the blocks in the block group;

Step 3.6: Solve for the existence probability of the queried data in each trunk file;

The existence probability is solved by a formula (the formula image is not reproduced in this text), where p_i denotes the placement probability of the block's data in the i-th bucket subdirectory and w_k denotes the placement count of the block's data in the k-th trunk group;

Step 3.7: Based on the existence probability of the data in each trunk file, heuristically select trunk files such that the selected trunk files satisfy the following two constraints;

Constraint 5: The recall probability of the queried data on the selected trunk files is greater than or equal to the specified recall probability p_r;

Constraint 6: For the same query condition, the trunk files selected by each query are not exactly the same, so that each query result has a degree of randomness and all data satisfying the query condition has a chance of being retrieved;

The specific steps of the heuristic trunk file selection method are as follows:

Step 3.7.1: Normalize the existence probabilities of all trunk files that may store the queried data;

Step 3.7.2: Select the trunk files whose non-existence probability 1-p_e is less than or equal to the recall probability p_r and add them to the MapSelect＜trunk, p_e＞ set; add the other trunk files to the MapNonSelect＜trunk, p_e＞ set;

Step 3.7.3: Randomly select two trunk files from the MapNonSelect＜trunk, p_e＞ set; let the non-existence probabilities of the queried data in these two trunk files be p₁ and p₂, and compute their product p;

Step 3.7.4: If the product p of the non-existence probabilities is greater than the recall probability p_r, perform step 3.7.5;

If the product p of the non-existence probabilities is less than the recall probability p_r, perform step 3.7.6;

If the product p of the non-existence probabilities is equal to the recall probability p_r, perform step 3.7.7;

If no trunk file remains to be selected in the MapNonSelect＜trunk, p_e＞ set, perform step 3.8;

Step 3.7.5: Delete these two elements from the MapNonSelect＜trunk, p_e＞ set, then randomly select from the MapNonSelect＜trunk, p_e＞ set a trunk file whose non-existence probability is greater than p, let p = p(1-p_e), where p_e is the existence probability of the selected trunk file, and perform step 3.7.4;

Step 3.7.6: Add to the MapSelect＜trunk, p_e＞ set all trunk files in the MapNonSelect＜trunk, p_e＞ set whose non-existence probability satisfies 1-p_e ≤ min{p₁, p₂}, and delete these trunk files from the MapNonSelect＜trunk, p_e＞ set; from the trunk files remaining in the MapNonSelect＜trunk, p_e＞ set, continue by selecting a trunk file whose non-existence probability is greater than min{p₁, p₂}, update p (the update formula is not reproduced in this text), and perform step 3.7.4;

Step 3.7.7: If unselected trunk files remain in the MapNonSelect＜trunk, p_e＞ set, add all of them to the MapSelect＜trunk, p_e＞ set, then perform step 3.8;

Step 3.8: Compute the query error by a formula (the formula image is not reproduced in this text), where trunk_ik denotes the trunk file in the i-th bucket subdirectory of the k-th trunk group, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, w_k denotes the placement count of the block's data in the k-th trunk group, and s denotes the total number of trunk groups;

Step 3.9: Based on the MapReduce programming model, process the trunk files in the MapSelect＜trunk, p_e＞ set in parallel and retrieve the data that satisfies the query condition.

Beneficial effects of the present invention: the probability-based big data query method of the present invention has the following advantages:

1. The present invention proposes an approximately complete query method that improves query performance by giving up an appropriate amount of query completeness.

2. The present invention designs a probability-based data placement model that provides probabilistic placement of data and the solution of the existence probability of data in each storage file.

3. The present invention designs a heuristic data query method that lets the database system query data according to a recall probability.

4. The present invention guarantees the query error of probabilistic queries through probability calculation.

Brief description of the drawings

Fig. 1 is a flow chart of the probability-based big data query method of the specific embodiment of the invention;

Fig. 2 is a diagram of the data model of the specific embodiment of the invention, wherein:

Fig. 2(a) is a schematic diagram of the logical model structure of the data model of the specific embodiment of the invention;

Fig. 2(b) is an enlarged view of part of Fig. 2(a);

Fig. 2(c) is a schematic diagram of the physical model structure of the data model of the specific embodiment of the invention;

Fig. 3 is a schematic diagram of the physical data storage format of the specific embodiment of the invention;

Fig. 4 is a diagram of the probabilistic placement model of the specific embodiment of the invention;

Fig. 5 is a diagram of the probability distribution of data in the file system at a given moment in the specific embodiment of the invention;

Fig. 6 is a flow chart of the heuristic query algorithm of the specific embodiment of the invention;

Fig. 7 is a diagram of the heuristic selection strategy of the specific embodiment of the invention, wherein:

Fig. 7(a) shows the case, in the heuristic trunk selection process of the specific embodiment of the invention, in which the product of the non-existence probabilities of the queried data in the selected trunk files is greater than the recall probability;

Fig. 7(b) shows the case, in the heuristic trunk selection process of the specific embodiment of the invention, in which the product of the non-existence probabilities of the queried data in the selected trunk files is less than the recall probability;

Fig. 8 is an experimental result chart of the actual query time and the optimal query time under different recall probabilities in the specific embodiment of the invention;

Fig. 9 is an experimental result chart comparing the performance of the specific embodiment of the invention with other databases when the recall probability is 1 and 0.5, respectively.

Embodiment

The present invention proposes a probability-based big data query method. By specifying a recall probability during the query process, it gives up some completeness of the query result in exchange for better query performance. It is a new data query technique with good generality, applicability, and scalability. The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment. As shown in Fig. 1, the probability-based big data query method of this embodiment includes: a step of partitioning a data set with multiple attributes according to the data model; a step of loading the data set according to the probabilistic data placement model; and a step of performing a probabilistic query on the data set.

The data model defines how data is organized in the database system and consists mainly of two parts, the logical model and the physical model. Fig. 2 shows the data model of this embodiment: Fig. 2(a) shows the logical model of the data and Fig. 2(c) shows the physical model. In a specific implementation, a data set with multiple attributes can be partitioned according to the logical model, and the organization and storage format of the data set in the distributed file system are defined by the physical model.

The logical model describes the logical organization of data. Any data set stored in the database system usually contains multiple attributes. Based on prior knowledge or expert knowledge (prior knowledge is the accumulated historical experience of operating on the data set; expert knowledge is a domain expert's understanding of the data set), one or more attributes of the data set are selected as its query attributes, that is, the attributes that users rely on when querying the data. For example, a student information table contains fields such as student name, student number, gender, age, and credits; based on past experience or on advice from the school's student information administrator, frequently queried attributes such as student number and name can be selected as the query attributes of the data set. Then an equal-width partition granularity is specified for the value domain of each query attribute, and each query attribute is treated as one dimension of a multidimensional data space, so that the data in the data set is distributed in a multidimensional data space according to these query attributes.

For each dimension of the data space, the values may be numeric data or text data. The value domain of each dimension is partitioned into equal-width intervals according to the given partition granularity, as follows: if the dimension's values are numeric, they are sorted by magnitude and the value domain is then partitioned into equal widths according to the given granularity; if the dimension's values are text, they are sorted by the lexicographic order of their first letter and the value domain is then partitioned into equal widths according to the given granularity.
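The equal-width partitioning just described can be illustrated with a minimal Python sketch. The function names, the closed domain `[lo, hi]`, and the restriction of text values to the letters a to z are assumptions made for illustration; the source does not specify them.

```python
import math

def numeric_bin(value, lo, hi, width):
    """Index of the equal-width interval of size `width` that holds `value`."""
    if not lo <= value <= hi:
        raise ValueError("value outside the attribute's value domain")
    n_bins = math.ceil((hi - lo) / width)
    # The upper bound of the domain is assigned to the last interval.
    return min(int((value - lo) // width), n_bins - 1)

def text_bin(value, letters_per_bin):
    """Interval index of a text value, based on its first letter (a-z only)."""
    first = value.strip().lower()[0]
    if not "a" <= first <= "z":
        raise ValueError("this sketch only handles values starting with a-z")
    return (ord(first) - ord("a")) // letters_per_bin
```

For instance, with an age domain of 0 to 100 and a granularity of 10, an age of 35 falls in interval 3; with 13 letters per interval, names beginning with "z" fall in interval 1.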

Based on the partitioning of each dimension, the multidimensional data space is divided into many small data blocks called blocks. Because data structures are diverse and complex in a big data environment, a record in the data set may have no value for some query attribute; in that case its value in that dimension is set to the minimum value, the maximum value, or the null value of that dimension's value domain. In this way every record in the data set is assigned to exactly one block. Fig. 2(a) shows a data set with three query attributes, whose data is distributed in a three-dimensional data space; after equal-width partitioning of the three dimensions d₁, d₂, and d₃, every record in the data set falls in exactly one block.

The physical model describes the physical organization of data, that is, how data is organized and stored in the distributed file system. Physically, the present invention organizes the data of a data set into table, bucket, and trunk. A table is the storage directory of a data set and corresponds to a multidimensional data space in the logical model. A bucket is a subdirectory of the table directory and is the unit into which the data of a block is probabilistically placed; the data of one block may be probabilistically placed into multiple bucket directories. A trunk is the basic unit of data storage for the data set; trunk files are contained in bucket directories, and each trunk file may store data from multiple blocks. In a big data environment, because data is diverse, records do not all share the same structure, so data is physically stored in the format defined by SequenceFile. A SequenceFile is a binary file that records a series of serialized key-value pairs. For every record in the database system, the key-value pairs of the query attributes are placed first, followed by the other key-value pairs in turn; the storage format is shown in Fig. 3.
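The record layout described above (query attributes first, remaining key-value pairs afterwards) can be sketched in plain Python. This is only an illustration of the ordering rule; it does not use or imitate the Hadoop SequenceFile API, and the function name and the use of `None` for a missing query attribute are assumptions.

```python
def order_record(record, query_attributes):
    """Return the record's key-value pairs with query attributes first."""
    # Missing query attributes are emitted as None (an assumed null value).
    head = [(key, record.get(key)) for key in query_attributes]
    tail = [(key, value) for key, value in record.items()
            if key not in query_attributes]
    return head + tail
```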

As described above, the data model logically organizes data in a multidimensional data space and assigns each record to exactly one block; physically, the data of each block is probabilistically placed into multiple bucket subdirectories of the corresponding table directory. How the data of a block is placed in each bucket subdirectory is crucial; it is determined by the probabilistic placement model, that is, the data of the data set is loaded into the database system according to the probabilistic placement model. Fig. 4 shows the probabilistic data placement model of this embodiment.

Assume that the data of one block in the data set is probabilistically placed into m bucket directories, bucket_1 to bucket_m, with placement probabilities p_1 to p_m on the respective buckets. During one round of placement, the data of the block may be placed into m trunk files in the m bucket directories; these m trunk files are called a trunk group, and Fig. 3 has s trunk groups in total, G_1 to G_s. During placement, when any file in a trunk group reaches the specified file size, the placement count of the block's data in that trunk group is recorded and a new trunk group is created to store data. Fig. 5 shows the probability distribution of data in the distributed file system at a given moment.
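A minimal in-memory sketch of this placement process follows. It assumes that trunk files are Python lists and that the "specified file size" is a record count (`trunk_capacity`); both are simplifications for illustration, as is the function name.

```python
import random

def place_block(records, probs, trunk_capacity, rng=None):
    """Place one block's records into m buckets according to `probs`.

    Returns (trunk_groups, counts), where trunk_groups[k][i] is the trunk
    file of bucket i in trunk group k and counts[k] is the block's
    placement count w_k in trunk group k.
    """
    rng = rng or random.Random()
    m = len(probs)
    trunk_groups = [[[] for _ in range(m)]]
    counts = [0]
    for rec in records:
        # Start a new trunk group once any trunk file in the current one is full.
        if any(len(f) >= trunk_capacity for f in trunk_groups[-1]):
            trunk_groups.append([[] for _ in range(m)])
            counts.append(0)
        i = rng.choices(range(m), weights=probs, k=1)[0]
        trunk_groups[-1][i].append(rec)
        counts[-1] += 1
    return trunk_groups, counts
```

The returned `counts` list plays the role of the placement counts w_k that the method stores in the distributed file system for later use by the query step.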

In the probabilistic placement model of Fig. 4, each bucket directory may receive, probabilistically, the data of multiple blocks. The blocks of the multidimensional data space are therefore numbered by a linearization method, and the blocks in the data space are divided into one or more block groups in order of their numbers, with every block group containing the same number of blocks. The data of the different blocks in one block group is probabilistically placed into the same m bucket directories.
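The source does not say which linearization method is used. As one possible illustration, the sketch below assumes a simple row-major ordering of the per-dimension interval indices; the function names are hypothetical.

```python
def block_number(coords, dims):
    """Row-major linear number of the block at interval indices `coords`.

    `dims` gives the number of intervals in each dimension.
    """
    number = 0
    for c, size in zip(coords, dims):
        if not 0 <= c < size:
            raise ValueError("interval index out of range")
        number = number * size + c
    return number

def block_group(number, blocks_per_group):
    """Group index of a block when groups hold consecutive block numbers."""
    return number // blocks_per_group
```

With three dimensions of four intervals each (64 blocks) and 8 blocks per group, the block at indices (1, 2, 3) gets number 27 and belongs to group 3.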

The data of the blocks in each block group is probabilistically placed into m bucket directories, and the placement probabilities over the m bucket directories are solved on the basis of Amdahl's law. Amdahl's law describes the relationship between the speedup of a computing task and the number of parallel processing nodes once the task is parallelized; its expression is given as formula (1), where n is the number of computing nodes and p is the fraction of the computing task that can be parallelized. The value of p is determined from the speedup speedup_m measured on n_s computing nodes, as given in formula (2). Based on Amdahl's law, the placement probabilities of the data over the m bucket directories can be obtained; the solution is given as formula (3).
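For reference, the textbook form of Amdahl's law is speedup = 1 / ((1 - p) + p / n), and solving it for p gives p = (1 - 1/speedup) / (1 - 1/n). The sketch below implements only these two standard relations, using the symbols n, p, speedup_m, and n_s from the paragraph above. It is assumed, not confirmed, that formulas (1) and (2) take this form, and the placement probability of formula (3) is not reproduced here.

```python
def amdahl_speedup(p, n):
    """Textbook Amdahl's law: speedup on n nodes with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(speedup_m, n_s):
    """Parallel fraction p implied by a speedup measured on n_s > 1 nodes."""
    return (1.0 - 1.0 / speedup_m) / (1.0 - 1.0 / n_s)
```

A quick consistency check is that `parallel_fraction(amdahl_speedup(p, n), n)` recovers `p` for any `n` greater than 1.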

Because the data of every block in a block group is probabilistically placed into the same m bucket directories, hot-spot reads and hot-spot writes could occur during probabilistic placement or querying. To prevent this, the probabilistic placement model requires that the blocks in a block group do not all have the same placement probability on the same bucket. For each block in a block group, it is therefore enough to ensure that the order of its placement probabilities over the m bucket directories is not identical to that of the other blocks. The simplest approach is as follows: if the block number is odd, the block's data is probabilistically placed into the m bucket directories in ascending order of placement probability; if the block number is even, the block's data is probabilistically placed into the m bucket directories in descending order of placement probability.
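The odd/even rule described above can be sketched as a small helper that returns, for a given block number, the placement probability assigned to each of the m buckets in directory order. The function name is an assumption for illustration.

```python
def bucket_probabilities(block_no, probs):
    """Placement probabilities over buckets 1..m for the given block.

    Odd-numbered blocks use ascending order, even-numbered blocks descending.
    """
    ordered = sorted(probs)
    return ordered if block_no % 2 == 1 else ordered[::-1]
```

Two neighbouring blocks therefore favour opposite ends of the bucket list, which spreads reads and writes across the buckets.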

The core of the present invention is a probability-based big data query method in which a given recall probability lowers the likelihood of retrieving all data and thereby improves query performance. During a query, the user enters a query statement to set the query condition; the format of the query statement is shown in Table 1. The select clause specifies the attributes to query and the type of aggregation operation, which is one of avg, min, max, sum, and count; the from clause specifies the target data set; the where clause specifies the query attributes and their values; the recall clause specifies the recall probability, which is a number greater than 0 and less than or equal to 1 that expresses how likely the query is to retrieve all data satisfying the query condition.

Table 1. Query statement

Clearly, when a query condition is given by the query statement shown in Table 1, the specified target data set name and the query attributes and their values make it easy to determine, through the data model, the block containing the data and the block group containing that block, and hence the bucket directories containing the queried data. The key to probability-based big data query is to select, within these bucket directories, trunk files that satisfy the recall probability and to query the data in them.
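Before this lookup, the query condition is checked against constraints 1 to 4 of step 3.2, with the recall probability falling back to 1 as in step 3.3. A minimal sketch of that check is shown below; the `catalog` dictionary (data set name mapped to its set of query attributes) and the function signature are hypothetical stand-ins for the database system's metadata.

```python
AGGREGATES = {"avg", "min", "max", "sum", "count"}

def check_query(dataset, attrs, aggregate, recall, catalog):
    """Return the effective recall probability, or raise if the query fails."""
    if dataset not in catalog:                              # constraint 1
        raise ValueError("target data set does not exist")
    if not attrs or not set(attrs) <= catalog[dataset]:     # constraint 2
        raise ValueError("attributes are not query attributes")
    if aggregate not in AGGREGATES:                         # constraint 3
        raise ValueError("unsupported aggregation operation")
    # Constraint 4: a missing or out-of-range recall probability becomes 1,
    # which corresponds to a complete query.
    if recall is None or not 0 < recall <= 1:
        return 1.0
    return float(recall)
```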

To select trunk files that satisfy the recall probability, the existence probability of the queried data in each trunk file must first be solved. The solution is given as formula (4), where p_e denotes the existence probability of the data, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, and w_k denotes the placement count of the block's data in the k-th trunk group. After the existence probabilities of the queried data in all trunk files have been solved, they must be normalized; the normalization is given as formula (5), where n denotes the total number of trunk files and p_ej denotes the existence probability in the j-th trunk file.

After the existence probability of the queried data in each trunk file has been solved, trunk files are selected heuristically according to these existence probabilities such that the selected trunk files satisfy the following two constraints: 1. the recall probability of the queried data on the selected trunk files is greater than or equal to the specified recall probability p_r, which ensures that the query result satisfies the query condition; 2. for the same query condition, the trunk files selected by each query are not exactly the same, so that each query result has a degree of randomness and all data satisfying the query condition has a chance of being retrieved.

Fig. 6 is the flow chart of the heuristic query algorithm of this embodiment, and the pseudocode of the algorithm is given in Table 2. First, the existence probabilities of the queried data in the trunk files are normalized by the normalization formula. Second, the trunk files whose non-existence probability 1-p_e is less than or equal to the recall probability p_r are selected and added to the MapSelect＜trunk, p_e＞ set, and the other, unselected trunk files are added to the MapNonSelect＜trunk, p_e＞ set. Finally, the heuristicTrunkSelect() function selects trunk files from the MapNonSelect＜trunk, p_e＞ set and adds them to the MapSelect＜trunk, p_e＞ set so that the trunk files in MapSelect＜trunk, p_e＞ meet the recall probability requirement.

Table 2. Heuristic query algorithm

The heuristicTrunkSelect() function selects trunk files using a heuristic selection strategy. During selection, the two cases shown in Fig. 7(a) and Fig. 7(b) may occur. First, two trunk files are randomly selected from the MapNonSelect＜trunk, p_e＞ set. If the product of their non-existence probabilities satisfies p₁·p₂ ＞ p_r, the two trunk files are removed from the MapNonSelect＜trunk, p_e＞ set and selection continues with trunk files for which 1-p_e ＞ p₁·p₂, as shown in Fig. 7(a). If the product satisfies p₁·p₂ ＜ p_r, all trunk files with 1-p_e ≤ min{p₁, p₂} are removed and added to the MapSelect＜trunk, p_e＞ set, and selection continues with trunk files that have a larger 1-p_e, as shown in Fig. 7(b). Selection stops when no selectable trunk file remains or when the product of the non-existence probabilities equals p_r; the unselected trunk files in the MapNonSelect＜trunk, p_e＞ set are then added to MapSelect＜trunk, p_e＞, so that the trunk files in MapSelect＜trunk, p_e＞ are all the trunk files that satisfy the recall probability p_r.
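The first two stages of the algorithm (normalization and the initial split into the MapSelect and MapNonSelect sets) can be sketched as follows. The sketch assumes that normalization means dividing each existence probability by the sum over all trunk files, since formula (5) is not reproduced in this text; the heuristicTrunkSelect() stage itself is not implemented here, and the function names are illustrative.

```python
def normalize(exist_probs):
    """Scale existence probabilities so that they sum to 1 (assumed form)."""
    total = sum(exist_probs.values())
    return {trunk: p / total for trunk, p in exist_probs.items()}

def initial_partition(exist_probs, p_r):
    """Split trunk files into (map_select, map_non_select) by 1 - p_e <= p_r."""
    map_select, map_non_select = {}, {}
    for trunk, p_e in normalize(exist_probs).items():
        target = map_select if 1.0 - p_e <= p_r else map_non_select
        target[trunk] = p_e
    return map_select, map_non_select
```

Both returned dictionaries map a trunk file name to its normalized existence probability, mirroring the ＜trunk, p_e＞ pairs of the two sets.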

The principle of the heuristic trunk file selection method is as follows. Because the data of a block is stored in these n trunk files, the recall probability of the queried data over these n trunk files is 1, that is, p_e(p_e1, p_e2, p_e3, ..., p_en) = 1. Let the non-existence probabilities of the queried data in these n trunk files be p_e1', p_e2', p_e3', ..., p_en'. Then the probability that the data of the block is stored in trunk_1 to trunk_(n-1) is p_en', that is, the recall probability is p_en'; the probability that the queried data is stored in trunk_1 to trunk_(n-2) is p_en'·p_e(n-1)', that is, the recall probability is p_en'·p_e(n-1)'; and so on, so that the probability that the data is stored in trunk_1 to trunk_(n-m) is $\prod_{j=n-m+1}^{n} p_{ej}'$, and the recall probability is likewise $\prod_{j=n-m+1}^{n} p_{ej}'$. It follows that the recall probability of a block's data over any number of trunk files is the product of the non-existence probabilities of the block's data in the remaining trunk files.
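The stated principle translates directly into a small helper: given each trunk file's existence probability, the recall probability of querying only a subset is the product of the non-existence probabilities (1 - p_e) of the files left out. The function name and the dictionary representation are assumptions for illustration.

```python
import math

def recall_probability(exist_probs, queried):
    """Recall probability of querying only the trunk files in `queried`.

    Follows the principle stated above: the product of the non-existence
    probabilities of the trunk files that are not queried.
    """
    return math.prod(1.0 - p for trunk, p in exist_probs.items()
                     if trunk not in queried)
```

Querying every trunk file leaves an empty product, which evaluates to 1 and matches the statement that the recall probability over all n trunk files is 1.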

The query error can be estimated by computing the recall of the queried data in each trunk file. The query error caused by querying data in the trunk files selected in the set MapSelect<trunk, p_e> can be solved by formula (6), where trunk_ik denotes the trunk file of the k-th trunk group in the i-th bucket subdirectory, p_i denotes the placement probability of the block data in the i-th bucket subdirectory, w_k denotes the number of times the block data was placed in the k-th trunk group, and S denotes the total number of trunk groups.

Finally, the selected trunk files are processed in parallel using the MapReduce programming model to select the data that satisfies the query conditions. If an aggregation operation is required on certain attributes, it is performed on those attributes using the MapReduce programming model, and the query result is returned.
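The filter-then-aggregate step can be sketched in plain Python as a map phase per trunk file and a reduce phase that merges partial results. This is only an illustration of the programming model: it runs sequentially on in-memory lists rather than on Hadoop, and the names `map_trunk`, `combine` and `aggregate`, as well as the record layout, are assumptions.

```python
from functools import reduce

EMPTY = (0, 0.0, float("inf"), float("-inf"))  # (count, sum, min, max) of no records


def map_trunk(records, predicate, attribute):
    """Map phase: scan one trunk file and emit a partial aggregate."""
    values = [r[attribute] for r in records if predicate(r)]
    if not values:
        return EMPTY
    return (len(values), float(sum(values)), min(values), max(values))


def combine(a, b):
    """Reduce phase: merge two partial aggregates."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


def aggregate(trunks, predicate, attribute):
    partials = (map_trunk(t, predicate, attribute) for t in trunks)
    count, total, low, high = reduce(combine, partials, EMPTY)
    if count == 0:
        return {"count": 0, "sum": 0.0, "avg": None, "min": None, "max": None}
    return {"count": count, "sum": total, "avg": total / count, "min": low, "max": high}


# Two in-memory lists stand in for the selected trunk files (hypothetical data).
trunks = [
    [{"region": 1, "value": 10}, {"region": 2, "value": 20}],
    [{"region": 1, "value": 30}],
]
result = aggregate(trunks, lambda r: r["region"] == 1, "value")
print(result)
```

Because each partial aggregate depends on a single trunk file, the map calls are independent of one another, which is what allows them to run in parallel in a real MapReduce job.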

The embodiment of the present invention is implemented on Hadoop-2.6.0. The invention was verified experimentally in a heterogeneous Hadoop cluster environment consisting of 37 PCs, each with 8 GB of memory: 12 machines use Intel i3 processors, 12 machines use Intel i5 processors, and 13 machines use Intel i7 processors. One machine with an i7 processor serves as the NameNode, and the remaining 36 machines serve as DataNodes.

Two data sets, S₁ and S₂, are used in the experiments. Both are randomly generated, with data volumes of 40 GB and 50 GB respectively. Three attributes of each of S₁ and S₂ are chosen as the query attributes of the data. Based on these query attributes, the data of each data set is distributed in a 3-dimensional data space. The data space is divided by equal width into 64 blocks, every 8 blocks form one block group, and the data of each block is placed probabilistically into 44 bucket directories, with placement probabilities of 0.001 to 0.0043, and 0.054.
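The mapping from a record to its block and block group can be sketched as follows. Several details are assumptions rather than facts from the experiment: the attribute domains, the split of 64 blocks into 4 intervals per dimension, and the use of row-major order as the linearization of the multi-dimensional space.

```python
def block_number(point, lows, highs, parts):
    """Equal-width cell of each coordinate, combined into one row-major block number."""
    number = 0
    for value, low, high, n in zip(point, lows, highs, parts):
        width = (high - low) / n
        # A value equal to the upper bound falls into the last cell.
        cell = min(int((value - low) / width), n - 1)
        number = number * n + cell
    return number


def block_group(number, blocks_per_group=8):
    """Consecutive block numbers are grouped, 8 blocks per block group."""
    return number // blocks_per_group


# Assumed value domains [0, 100] and 4 x 4 x 4 = 64 blocks.
lows, highs, parts = (0, 0, 0), (100, 100, 100), (4, 4, 4)
n = block_number((30, 60, 99), lows, highs, parts)
print(n, block_group(n))
```

The sketch assumes every coordinate lies inside its domain; values below the lower bound are not handled.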

This embodiment verifies the invention through the two groups of experiments shown in Fig. 8 and Fig. 9. Fig. 8 uses S₂ as the experimental data set, shows the query performance under different recall probabilities, and compares it with the ideal fastest query time. The fastest query time is obtained by eliminating the random selection factor for trunk files: trunk files are selected in descending order of non-existence probability, so the number of rejected trunk files reaches its maximum and query performance is optimal. The experimental results in Fig. 8 show that, whether or not the randomization of trunk file selection is removed, query performance is in a linearly decreasing relationship with the recall probability.
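The non-random baseline can be sketched as a greedy pass over the trunk files in descending order of non-existence probability, rejecting files while the product of the rejected files' non-existence probabilities stays at or above p_r. The function name `fastest_selection` and the sample probabilities are assumptions for illustration.

```python
def fastest_selection(exist_prob, p_r):
    """Reject as many trunk files as possible while keeping recall probability >= p_r.

    exist_prob maps a trunk name to the existence probability p_e of the queried data.
    Returns (files to read, files rejected, recall probability reached).
    """
    # Highest non-existence probability first; ties keep their original order.
    order = sorted(exist_prob, key=lambda t: 1.0 - exist_prob[t], reverse=True)
    rejected, product = [], 1.0
    for trunk in order:
        q = 1.0 - exist_prob[trunk]
        if product * q < p_r:
            break  # every later file has a smaller q, so it would fail as well
        rejected.append(trunk)
        product *= q
    selected = [t for t in order if t not in rejected]
    return selected, rejected, product


probs = {"t1": 0.5, "t2": 0.1, "t3": 0.1, "t4": 0.05}  # hypothetical values
selected, rejected, recall = fastest_selection(probs, p_r=0.8)
print(selected, rejected, round(recall, 4))
```

Sorting first is what removes the randomness: the same query always rejects the same files, so this variant gives a lower bound on query time rather than the randomized behavior of the method itself.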

Fig. 9 uses S₁ and S₂ respectively as the experimental data sets, sets the recall probability of the data to 1 and 0.5 (denoted in the figure as "probabilistic query -1" and "probabilistic query -0.5"), and compares the query performance with that of the HBase, Cassandra, Hive and MongoDB databases. The experimental results in Fig. 9 show that when the recall probability is 1, query performance is close to that of the other databases except HBase; when the recall probability is 0.5, query performance is clearly better than that of all the databases tested.

Claims

1. A probability-based big data query method, characterized in that it comprises the following steps:

Step 1: dividing a data set having multiple attributes;

Step 2: loading the divided data set;

Step 3: performing a probabilistic query on the data set;

Step 1 comprises the following steps:

Step 1.1: select one or more attributes of the data set as the query attributes of the data set, and specify the equal-width division granularity for the value domain of each query attribute;

Step 1.2: fill in the data whose query attribute values are missing in the data set; typically, the values of these query attributes are set to the minimum value, the maximum value, or a null value of the query attribute within its value domain;

Step 1.3: determine the data type of the query attribute values, of which there are two, numeric and text; if the type is numeric, perform step 1.4; if the type is text, perform step 1.5;

Step 1.4: sort by the magnitude of the query attribute values, divide the query attribute by equal width according to its division granularity, and continue with step 1.6;

Step 1.5: sort by the lexicographic order of the initial letters of the query attribute values, divide the query attribute by equal width according to its division granularity, and continue with step 1.6;

Step 1.6: store the dimension information of each dimension in the distributed file system; the dimension information mainly includes the dimension name, the dimension value type, and the division granularity of the dimension.
2. The probability-based big data query method according to claim 1, characterized in that step 2 comprises the following steps:

Step 2.1: group all the data blocks obtained by dividing the data set;

Taking each query attribute as one dimension of a multi-dimensional data space, the data in the data set is distributed in a multi-dimensional data space; dividing the value domain of a query attribute by equal width is in fact dividing the value space of each dimension by equal width; based on the division of each dimension, the data distributed in the multi-dimensional data space is divided into multiple small data blocks, and each small data block so obtained is referred to herein as a block;

The blocks in the multi-dimensional data space are numbered using a multi-dimensional space linearization technique, and are divided into one or more block groups in order of their numbers;

Step 2.2: create the storage directories of the data set in the distributed file system;

Step 2.2.1: determine whether the root directory in which the database system stores data exists; if it does not exist, perform step 2.2.2; if it exists, perform step 2.2.3;

Step 2.2.2: create the root directory in which the database system stores data, and perform step 2.2.3;

Step 2.2.3: create, under the root directory, the table directory that stores the data; this directory is named with the name specified for the data set;

Step 2.2.4: create m bucket subdirectories for each block group to store data; these m subdirectories are named by the block group number and the bucket subdirectory number; a bucket is a subdirectory of the table directory and is the unit into which the data in a block is placed probabilistically; the data in one block may be placed probabilistically into multiple bucket directories;

Step 2.3: place the data in each block of each block group, with m different placement probabilities, into the m different bucket subdirectories of the table directory; the data is stored in the trunk files of the bucket subdirectories; a trunk is the basic unit of data storage for the data set and is contained in a bucket directory, and each trunk file may store data from multiple blocks;

For any piece of data in a block, the data may be stored in different trunk files in the m bucket subdirectories; these m trunk files are referred to herein as a trunk group; for the data of any block placed into the trunk group, the number of times that block's data is placed into the trunk group must be recorded;

If any trunk file in the trunk group has reached the specified size, perform step 2.4; otherwise, continue with step 2.3;

If the placement of all data in the data set is complete, perform step 2.5;

Step 2.4: create a new trunk file in each of the m bucket subdirectories to store data, and perform step 2.3;

Step 2.5: store, in the distributed file system, the placement counts of each block of each block group in all trunk groups.
3. The probability-based big data query method according to claim 2, characterized in that step 3 comprises the following steps:

Step 3.1: the user sets the query conditions by entering a query statement;

Step 3.2: determine whether the query conditions set in step 3.1 satisfy the following constraints:

Constraint 1: the target data set must exist in the database system;

Constraint 2: the query attributes are specified query attributes and form a non-empty subset of the query attribute set;

Constraint 3: the aggregation mode is one of the specified aggregation methods;

Constraint 4: the recall probability must be a decimal greater than 0 and less than or equal to 1;

If constraints 1 to 3 are satisfied but the recall probability is not specified or does not satisfy constraint 4, perform step 3.3; if all four constraints are satisfied, perform step 3.4; if any one of constraints 1 to 3 is not satisfied, the query fails and the procedure ends;

Step 3.3: set the recall probability to 1, and perform step 3.4;

Step 3.4: according to the data table and query attributes specified in the query statement, determine the block and block group to which the queried data belongs;

Step 3.5: read the placement counts, in each trunk group, of the data of the blocks in the block group;

Step 3.6: solve the existence probability of the queried data in each trunk file;

Step 3.7: according to the existence probability of the data in each trunk file, heuristically select trunk files such that the selected trunk files satisfy the following two constraints;

Constraint 5: the recall probability of the queried data on the selected trunk files is greater than or equal to the recall probability p_r;

Constraint 6: for the same query conditions, the trunk files selected for each query are not exactly the same, so that each query result has a certain randomness, ensuring that all data satisfying the query conditions has a chance of being retrieved;

The specific steps of the heuristic trunk file selection method are as follows:

Step 3.7.1: normalize the existence probabilities of all trunk files that may store the queried data;

Step 3.7.2: select the trunk files whose non-existence probability 1-p_e is less than or equal to the recall probability p_r and add them to the set MapSelect<trunk, p_e>; add the other trunk files to the set MapNonSelect<trunk, p_e>;

Step 3.7.3: randomly select two trunk files from the set MapNonSelect<trunk, p_e>; let the non-existence probabilities of the queried data in these two trunk files be p₁ and p₂ respectively, and compute the product p of p₁ and p₂;

Step 3.7.4: if the product p of the non-existence probabilities is greater than the recall probability p_r, perform step 3.7.5;

If the product p of the non-existence probabilities is less than the recall probability p_r, perform step 3.7.6;

If the product p of the non-existence probabilities is equal to the recall probability p_r, perform step 3.7.7;

If the set MapNonSelect<trunk, p_e> has no trunk file left to select, perform step 3.8;

Step 3.7.5: delete these two elements from the set MapNonSelect<trunk, p_e>; continue by randomly selecting from the set MapNonSelect<trunk, p_e> one trunk file whose non-existence probability is greater than p, let p = p(1-p_e), where p_e is the existence probability of the selected trunk file, and perform step 3.7.4;

Step 3.7.6: add all trunk files in the set MapNonSelect<trunk, p_e> whose non-existence probability satisfies 1-p_e ≤ min{p₁, p₂} to the set MapSelect<trunk, p_e>, and delete these trunk files from the set MapNonSelect<trunk, p_e>; from the trunk files remaining in the set MapNonSelect<trunk, p_e>, continue by selecting a trunk file whose non-existence probability is larger than min{p₁, p₂}, and let [formula not reproduced in this text]; perform step 3.7.4;

Step 3.7.7: if there are still unselected trunk files in the set MapNonSelect<trunk, p_e>, add them all to the set MapSelect<trunk, p_e>, and perform step 3.8;

Step 3.8: compute the query error by the formula [formula not reproduced in this text], where trunk_ik denotes the trunk file of the k-th trunk group in the i-th bucket subdirectory, p_i denotes the placement probability of the block data in the i-th bucket subdirectory, w_k denotes the number of times the block data was placed in the k-th trunk group, and S denotes the total number of trunk groups;

Step 3.9: process the trunk files in the set MapSelect<trunk, p_e> in parallel based on the MapReduce programming model, and query the data that satisfies the query attributes.
4. The probability-based big data query method according to claim 3, characterized in that the query statement in step 3.1 comprises four clauses, select, from, where and recall, wherein the select clause specifies the attributes to be queried and the type of aggregation operation, including avg, min, max, sum and count; the from clause specifies the target data set to be queried; the where clause specifies the query attributes and their values; the recall clause specifies the recall probability, with p_r denoting its value; the recall probability is a number greater than 0 and less than or equal to 1, and represents the likelihood of retrieving all data that satisfies the query conditions.
5. The probability-based big data query method according to claim 3, characterized in that the formula for solving the existence probability of the queried data in each trunk file in step 3.6 is [formula not reproduced in this text], where p_i denotes the placement probability of the block data in the i-th bucket subdirectory, and w_k denotes the number of times the block data was placed in the k-th trunk group.