CN105117442B - A probability-based big data query method - Google Patents

A probability-based big data query method

Info

Publication number
CN105117442B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510492377.8A
Other languages
Chinese (zh)
Other versions
CN105117442A (en
Inventor
宋杰
伍晋博
张川
张一川
张莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201510492377.8A priority Critical patent/CN105117442B/en
Publication of CN105117442A publication Critical patent/CN105117442A/en
Application granted granted Critical
Publication of CN105117442B publication Critical patent/CN105117442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a probability-based big data query method in the field of database technology. The method comprises: a step of partitioning a data set with multiple attributes according to a data model; a step of loading the partitioned data set according to a probabilistic data placement model; and a step of running probabilistic queries against the data set. The method is an approximately complete query method: it improves query performance by giving up an appropriate amount of query completeness. A probability-based data placement model provides probabilistic placement of data and a way to solve for the probability that data exists in each storage file. A heuristic data query method lets the database system query data according to a recall probability (the probability that the result set is complete), and probability calculation bounds the query error of a probabilistic query.

Description

A probability-based big data query method
Technical field
The invention belongs to the field of database technology, and in particular relates to a probability-based big data query method.
Background
The deep fusion of the ternary world of people, machines, and things has triggered explosive growth in data scale and a high degree of complexity in data patterns; the world has entered the networked big data era. The arrival of the big data era poses a great challenge to traditional data management systems. NoSQL (Not only SQL) databases, with their high scalability, high availability, and flexible data models, have won wide favor in academia and industry. Data query is one of the core technologies of a database system, and with the development of cloud computing and NoSQL database technology, NoSQL-based data query techniques have attracted much attention and have been widely studied in industry.
It is well known that today's mainstream NoSQL databases manage big data mainly with technologies such as the MapReduce programming model and distributed file systems: the distributed file system stores the data, and the MapReduce programming model processes it. The query performance of a NoSQL database is closely tied to data storage and index design, MapReduce-based query processing, and query optimization. Current research on big data query techniques concentrates on optimizing the performance of these key technologies; these problems have been studied broadly and in depth, and many good solutions exist. The paper "A Review of Query Techniques in Cloud Data Management Systems" summarizes and analyzes research on query techniques in cloud data management systems from several angles, including index management, query processing, query optimization, and online aggregation. As for the query mode itself, however, both traditional relational databases and the newer NoSQL databases use complete queries: for a given query condition, however the matching algorithm is defined (exact or approximate) and however the result set is sorted, the query always returns all matching data. For example, suppose a user information table contains fields such as ID card number, name, and age. For any given query condition, such as all users older than 30 or all users named Zhang San, the query always returns all data that satisfies the condition.
In a big data environment, the scale of the data and the complexity of its structure make a complete query expensive in time. Many practical applications show that people do not need a definitely complete result, nor an exactly sorted one (as in Top-k queries); they need only a partial result that meets a certain completeness requirement, or they can give up some query completeness to meet a performance requirement. For example, people at an airport who search for hotels that meet certain conditions do not need the full result set; they care more about response time. The complete query mode used by current database systems cannot meet this need, so an approximately complete query technique is urgently needed to fill the gap. An approximately complete query differs from a traditional complete query in that its approximation lies in the likelihood of retrieving everything, that is, the probability that the query finds all data satisfying the query condition. This probability is called the recall probability here; it describes how likely the result set is to be the complete data set.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a probability-based big data query method that meets the need for approximately complete queries in a big data environment.
The technical solution of the present invention is as follows:
A probability-based big data query method, comprising the following steps:
Step 1: Partition the data set with multiple attributes;
Step 1.1: Select one or more attributes of the data set as its query attributes, and specify an equal-width partition granularity for the value domain of each query attribute;
Step 1.2: Fill in missing query attribute values in the data set; typically a missing value is set to the minimum value, the maximum value, or a null value of that query attribute's value domain;
Step 1.3: Determine the data type of the query attribute values, which is either numeric or text; if numeric, perform step 1.4; if text, perform step 1.5;
Step 1.4: Sort by the magnitude of the query attribute values, partition the query attribute into equal-width intervals according to its partition granularity, and continue with step 1.6;
Step 1.5: Sort by the lexicographic order of the first letter of the query attribute values, partition the query attribute into equal-width intervals according to its partition granularity, and continue with step 1.6;
Step 1.6: Store the dimension information of each dimension in the distributed file system; it mainly includes the dimension name, the value type of the dimension, and the partition granularity of the dimension.
Step 2: Load the partitioned data set;
Step 2.1: Group all data blocks obtained by partitioning the data set;
Treating each query attribute as one dimension of a multidimensional data space, the data in the data set is distributed in that space. Equal-width partitioning of the value domain of a query attribute is in effect equal-width partitioning of the value space of the corresponding dimension. The partitions of all dimensions together divide the data in the multidimensional space into many small data blocks, each of which is called a block here;
Number the blocks in the multidimensional data space using a multidimensional space linearization method, and divide the blocks into one or more block groups in order of their numbers;
Step 2.2: Create the storage directory of the data set in the distributed file system;
Step 2.2.1: Check whether the root directory in which the database system stores data exists; if not, perform step 2.2.2; if it does, perform step 2.2.3;
Step 2.2.2: Create the root directory in which the database system stores data, then perform step 2.2.3;
Step 2.2.3: Under the root directory, create the table directory that stores this data set; the directory is named after the data set;
Step 2.2.4: For each block group, create m bucket subdirectories to store data; the m subdirectories are named by the rule "block group number followed by bucket subdirectory number";
Step 2.3: Place the data of each block in each block group into the m different bucket subdirectories of the table directory with m different placement probabilities; the data is stored in trunk files within the bucket subdirectories;
Any one record of a block may be stored in any of m different trunk files, one in each of the m bucket subdirectories; these m trunk files are called a trunk group. For the data of every block placed into a trunk group, the number of records of that block placed in the trunk group (its placement count) must be recorded;
If any trunk file in the trunk group reaches the specified size, perform step 2.4; otherwise, continue with step 2.3;
If all data in the data set has been placed, perform step 2.5;
Step 2.4: Create a new trunk file in each of the m bucket subdirectories to store data, then perform step 2.3;
Step 2.5: Store the placement counts of each block of each block group in all trunk groups in the distributed file system.
Step 3: Run a probabilistic query against the data set;
Step 3.1: The user sets the query condition by entering a query statement;
The query statement has four clauses: select, from, where, and recall. The select clause gives the attribute to query and the type of aggregate operation, one of avg, min, max, sum, and count; the from clause gives the target data set; the where clause gives the query attributes and their values; the recall clause gives the recall probability, whose value is denoted p_r. The recall probability is a number greater than 0 and less than or equal to 1 that expresses how likely the query is to find all data satisfying the query condition;
Step 3.2: Check whether the query condition set in step 3.1 satisfies the following constraints:
Constraint 1: The target data set must exist in the database system;
Constraint 2: The queried attributes must be designated query attributes, forming a non-empty subset of the query attribute set;
Constraint 3: The aggregation mode must be one of the specified aggregation methods;
Constraint 4: The recall probability must be a decimal greater than 0 and less than or equal to 1;
If constraints 1 to 3 are satisfied but the recall probability is unspecified or violates constraint 4, perform step 3.3; if all four constraints are satisfied, perform step 3.4; if any of constraints 1 to 3 is violated, the query fails and terminates;
Step 3.3: Set the recall probability to 1, then perform step 3.4;
Step 3.4: From the data table and query attributes specified in the query statement, determine the block and block group to which the queried data belongs;
Step 3.5: Read the placement counts, in each trunk group, of the data of that block in the block group;
Step 3.6: Solve for the existence probability of the queried data in each trunk file;
The existence probability is given by the formula [formula not preserved in this text], where p_i denotes the placement probability of the block's data in the i-th bucket subdirectory and w_k denotes the placement count of the block's data in the k-th trunk group;
Step 3.7: Based on the existence probability of the data in each trunk file, heuristically select trunk files so that the selected files satisfy the following two constraints;
Constraint 5: The recall probability of the queried data over the selected trunk files is greater than or equal to the specified recall probability p_r;
Constraint 6: For the same query condition, the trunk files selected by different queries are not exactly the same, so that each query result has some randomness and every record satisfying the query condition has a chance of being retrieved;
The heuristic trunk file selection proceeds as follows:
Step 3.7.1: Normalize the existence probabilities of all trunk files that may store the queried data;
Step 3.7.2: Select the trunk files whose non-existence probability 1 - p_e is less than or equal to the recall probability p_r and add them to the set MapSelect<trunk, p_e>; add the other trunk files to the set MapNonSelect<trunk, p_e>;
Step 3.7.3: Randomly select two trunk files from MapNonSelect<trunk, p_e>; letting p_1 and p_2 be the non-existence probabilities of the queried data in these two trunk files, compute their product p;
Step 3.7.4: If the product p of non-existence probabilities is greater than the recall probability p_r, perform step 3.7.5;
If the product p of non-existence probabilities is less than the recall probability p_r, perform step 3.7.6;
If the product p of non-existence probabilities is equal to the recall probability p_r, perform step 3.7.7;
If MapNonSelect<trunk, p_e> has no trunk file left to select, perform step 3.8;
Step 3.7.5: Delete these two elements from MapNonSelect<trunk, p_e>; then randomly select from MapNonSelect<trunk, p_e> one trunk file whose non-existence probability is greater than p, set p = p·(1 - p_e), where p_e is the existence probability of the selected trunk file, and perform step 3.7.4;
Step 3.7.6: Add all trunk files in MapNonSelect<trunk, p_e> whose non-existence probability satisfies 1 - p_e ≤ min{p_1, p_2} to MapSelect<trunk, p_e>, and delete them from MapNonSelect<trunk, p_e>; from the trunk files remaining in MapNonSelect<trunk, p_e>, continue by selecting a trunk file whose non-existence probability is greater than min{p_1, p_2}, update p [update formula not preserved in this text], and perform step 3.7.4;
Step 3.7.7: If unselected trunk files remain in MapNonSelect<trunk, p_e>, add all of them to MapSelect<trunk, p_e>; perform step 3.8;
Step 3.8: Compute the query error by the formula [formula not preserved in this text], where trunk_ik denotes the trunk file in the i-th bucket subdirectory of the k-th trunk group, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, w_k denotes the placement count of the block's data in the k-th trunk group, and s denotes the total number of trunk groups;
Step 3.9: Based on the MapReduce programming model, process the trunk files in the set MapSelect<trunk, p_e> in parallel and retrieve the data that satisfies the query attributes.
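Step 3.9 can be illustrated with a small sketch. It is only an analogy: a thread pool stands in for MapReduce, each trunk file is modelled as an in-memory list of records, and the names `scan`, `merge`, and `aggregate` are hypothetical, not taken from the patent. The map phase computes a partial aggregate per selected trunk file and the reduce phase merges the partials into the five aggregates the select clause allows.

```python
from concurrent.futures import ThreadPoolExecutor


def scan(trunk, predicate, attr):
    """Map phase: partial (count, sum, min, max) of attr over matching records."""
    vals = [rec[attr] for rec in trunk if predicate(rec)]
    if not vals:
        return (0, 0, None, None)
    return (len(vals), sum(vals), min(vals), max(vals))


def merge(a, b):
    """Reduce phase: combine two partial aggregates."""
    lows = [x for x in (a[2], b[2]) if x is not None]
    highs = [x for x in (a[3], b[3]) if x is not None]
    return (a[0] + b[0], a[1] + b[1],
            min(lows) if lows else None,
            max(highs) if highs else None)


def aggregate(trunks, predicate, attr):
    """Scan the selected trunk files in parallel and merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda t: scan(t, predicate, attr), trunks))
    total = (0, 0, None, None)
    for part in partials:
        total = merge(total, part)
    count, s, lo, hi = total
    return {"count": count, "sum": s, "min": lo, "max": hi,
            "avg": s / count if count else None}
```

Because only the selected trunk files are scanned, the aggregates describe the retrieved subset; with a recall probability below 1 they may differ from the values over the full data set.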
Beneficial effects of the present invention: the probability-based big data query method of the present invention has the following advantages:
1. The invention proposes an approximately complete query method that improves query performance by giving up an appropriate amount of query completeness.
2. The invention designs a probability-based data placement model that provides probabilistic placement of data and a way to solve for the existence probability of data in each storage file.
3. The invention designs a heuristic data query method that lets the database system query data according to a recall probability.
4. The invention bounds the query error of a probabilistic query through probability calculation.
Brief description of the drawings
Fig. 1 is a flow chart of the probability-based big data query method of the embodiment of the invention;
Fig. 2 shows the data model of the embodiment, in which:
Fig. 2(a) is a schematic diagram of the logical model of the data model;
Fig. 2(b) is an enlarged view of part of Fig. 2(a);
Fig. 2(c) is a schematic diagram of the physical model of the data model;
Fig. 3 is a schematic diagram of the physical storage format of the data in the embodiment;
Fig. 4 shows the probabilistic placement model of the embodiment;
Fig. 5 shows the probability distribution of data in the file system at a given moment in the embodiment;
Fig. 6 is a flow chart of the heuristic query algorithm of the embodiment;
Fig. 7 shows the heuristic selection strategy of the embodiment, in which:
Fig. 7(a) shows the case, during heuristic trunk selection, in which the product of the non-existence probabilities of the queried data in the selected trunk files is greater than the recall probability;
Fig. 7(b) shows the case, during heuristic trunk selection, in which the product of the non-existence probabilities of the queried data in the selected trunk files is less than the recall probability;
Fig. 8 shows experimental results for the actual query time and the best query time performance of the embodiment under different recall probabilities;
Fig. 9 shows experimental results comparing the performance of the embodiment with other databases when the recall probability is 1 and 0.5, respectively.
Detailed description of the embodiment
The present invention proposes a probability-based big data query method. By specifying a recall probability in the query, it gives up some completeness of the queried data to improve query performance. It is a new data query technique with good generality, applicability, and scalability. The invention is described in further detail below with reference to the drawings and the embodiment. As shown in Fig. 1, the probability-based big data query method of this embodiment includes: a step of partitioning a data set with multiple attributes according to a data model; a step of loading the data set according to a probabilistic data placement model; and a step of running probabilistic queries against the data set.
The data model defines how data is organized in the database system and consists of two parts, the logical model and the physical model of the data. Fig. 2 shows the data model of this embodiment: Fig. 2(a) represents the logical model and Fig. 2(c) the physical model. In a specific implementation, the data set with multiple attributes is partitioned according to the logical model, and the organization and storage format of the data set in the distributed file system are defined according to the physical model.
The logical model describes how data is organized logically. A data set to be stored in the database system usually has multiple attributes. Based on prior knowledge or expert knowledge (prior knowledge is accumulated historical experience of operating on the data set; expert knowledge is a domain expert's understanding of the data set), one or more attributes of the data set are selected as its query attributes (the attributes that users rely on when querying the data). For example, a student information table has fields such as student name, student number, gender, age, and credits; from past experience, or with the help of the school's student information administrator, the frequently queried attributes such as student number and name can be chosen as the query attributes of the data set. An equal-width partition granularity is then specified for the value domain of each query attribute, and each query attribute is treated as one dimension of a multidimensional data space, so that the data in the data set is distributed in a multidimensional data space spanned by these query attributes.
For each dimension of the data space, the values may be numeric data or text data. The value domain of each dimension is partitioned into equal-width intervals according to the given partition granularity, as follows: if the values of the dimension are numeric, they are sorted by magnitude and the value domain is then partitioned into equal-width intervals according to the given granularity; if the values are text, they are sorted by the lexicographic order of their first letter and the value domain is then partitioned into equal-width intervals according to the given granularity.
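A minimal sketch of equal-width partitioning for both value types follows. It rests on assumptions the text does not spell out: the granularity of a numeric dimension is taken to be the interval width, the granularity of a text dimension is taken to be the number of consecutive letters per interval, and text values are assumed to start with an ASCII letter. The function names are hypothetical.

```python
import math


def numeric_bin(value, lo, width):
    """Index of the equal-width interval [lo + i*width, lo + (i+1)*width) holding value."""
    return math.floor((value - lo) / width)


def text_bin(value, letters_per_bin):
    """Index of the interval for a text value, based on its first letter.

    Assumes the first character is an ASCII letter; letters are grouped
    letters_per_bin at a time in alphabetical order (a..e, f..j, ...).
    """
    first = value[0].lower()
    return (ord(first) - ord("a")) // letters_per_bin
```

For instance, with an age dimension starting at 0 and a width of 10, an age of 30 falls into interval 3; with five letters per interval, a name starting with "Z" falls into interval 5.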
The partitions of all dimensions divide the multidimensional data space into many small data blocks, called blocks. In a big data environment, data structures are diverse and complex, so a record in the data set may have no value for some query attribute; in that case its value in that dimension is set to the minimum value, the maximum value, or a null value of the dimension. In this way every record in the data set is assigned to one definite block. Fig. 2(a) shows a data set with three query attributes, whose data is distributed in a three-dimensional data space; after equal-width partitioning along the three dimensions d_1, d_2, and d_3, every record in the data set falls into one definite block.
The physical model describes how data is organized physically, that is, how it is organized and stored in the distributed file system. Physically, the invention organizes the data of a data set into table, bucket, and trunk. A table is the storage directory of one data set and corresponds to one multidimensional data space in the logical model. A bucket is a subdirectory of the table directory and is the unit of probabilistic placement for the data of a block; the data of one block may be placed probabilistically into several bucket directories. A trunk is the basic unit in which the data set is stored; trunk files reside in bucket directories, and each trunk file may store data from several blocks. In a big data environment the data is diverse and records do not all have the same structure, so the data is physically stored in the format defined by SequenceFile. A SequenceFile is a binary file that records a series of serialized key-value pairs. For every record in the database system, the key-value pairs of the query attributes are placed first, followed by the other key-value pairs in turn; the storage format is shown in Fig. 3.
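The table/bucket/trunk layout and the record ordering can be sketched as follows. The directory name pattern `group{g}_bucket{i}` is an assumption made for illustration; the text only says that bucket subdirectories are named from the block group number and the bucket number. `bucket_dirs` and `ordered_pairs` are hypothetical helper names, and the sketch builds paths in memory without touching any file system.

```python
from pathlib import PurePosixPath


def bucket_dirs(root, table, group, m):
    """Paths of the m bucket subdirectories of one block group under root/table."""
    base = PurePosixPath(root) / table
    return [base / f"group{group}_bucket{i}" for i in range(1, m + 1)]


def ordered_pairs(record, query_attrs):
    """Key-value pairs of a record with the query attributes placed first."""
    head = [(k, record[k]) for k in query_attrs if k in record]
    tail = [(k, v) for k, v in record.items() if k not in query_attrs]
    return head + tail
```

Placing the query attributes first means a reader can decide whether a record matches the where clause before deserializing the remaining pairs.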
As described above, the data model logically organizes the data in a multidimensional data space and assigns each record to one definite block; physically, the data of each block is placed probabilistically into multiple bucket subdirectories of the corresponding table directory. How the data of a block is placed into the bucket subdirectories is the most important point. It is determined by the probabilistic placement model; that is, the data of the data set is loaded into the database system according to the probabilistic placement model. Fig. 4 shows the probabilistic placement model of this embodiment.
Assume the data of one block of the data set is placed probabilistically into the m bucket directories bucket_1 to bucket_m, with placement probabilities p_1 to p_m respectively. During one round of placement, the data of the block may then be placed into m trunk files, one in each of the m bucket directories; these m trunk files are called a trunk group. In Fig. 3 there are s trunk groups in total, G_1 to G_s. During placement, once any file in a trunk group reaches the specified file size, the placement count of the block's data in that trunk group is recorded and a new trunk group is created to store data. Fig. 5 shows the probability distribution of the data in the distributed file system at a given moment.
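The placement process can be sketched as below, under simplifying assumptions: file size is measured in records rather than bytes, one `Placer` instance handles a single block group, and a bucket is drawn for each record with `random.choices` weighted by the placement probabilities. The class and attribute names are hypothetical.

```python
import random


class Placer:
    """Probabilistic placement of records into m buckets, one trunk group at a time."""

    def __init__(self, probs, max_records, seed=None):
        self.probs = probs                      # placement probabilities p_1..p_m
        self.max_records = max_records          # assumed size limit of a trunk file
        self.rng = random.Random(seed)
        self.group_sizes = [[0] * len(probs)]   # records per trunk file, per trunk group
        self.counts = {}                        # block -> placement count per trunk group

    def place(self, block):
        """Place one record of the given block; return (trunk group index, bucket index)."""
        sizes = self.group_sizes[-1]
        if max(sizes) >= self.max_records:      # a trunk file is full: open a new group
            sizes = [0] * len(self.probs)
            self.group_sizes.append(sizes)
        bucket = self.rng.choices(range(len(self.probs)), weights=self.probs)[0]
        sizes[bucket] += 1
        per_group = self.counts.setdefault(block, [])
        per_group.extend([0] * (len(self.group_sizes) - len(per_group)))
        per_group[-1] += 1                      # the placement count w_k of this block
        return len(self.group_sizes) - 1, bucket
```

After loading, `counts[block][k]` plays the role of the placement count w_k that step 2.5 stores in the distributed file system.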
In the probabilistic placement model shown in Fig. 4, each bucket directory may hold probabilistically placed data from several blocks. The blocks of the multidimensional data space are therefore numbered using a linearization method and, in order of their numbers, divided into one or more block groups, each containing the same number of blocks. The data of the different blocks in one block group is placed probabilistically into the same m bucket directories.
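The numbering and grouping can be sketched as follows. The text does not say which linearization is used, so row-major order is assumed here purely for illustration (a space-filling curve such as Z-order would be another option); `linearize` and `block_group` are hypothetical names.

```python
def linearize(coords, bins_per_dim):
    """Row-major number of the block at the given per-dimension interval indices."""
    number = 0
    for c, n in zip(coords, bins_per_dim):
        if not 0 <= c < n:
            raise ValueError("coordinate out of range")
        number = number * n + c
    return number


def block_group(block_number, blocks_per_group):
    """Block group that holds a block when groups are filled in numbering order."""
    return block_number // blocks_per_group
```

With 4, 5, and 6 intervals along three dimensions, the block at (1, 2, 3) receives number 45 and, with 10 blocks per group, belongs to block group 4.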
The data of the blocks in each block group is placed probabilistically into m bucket directories, and the placement probabilities over the m bucket directories are solved on the basis of Amdahl's law. Amdahl's law describes the relation between the speedup of a computing task and the number of parallel processing nodes after the task has been parallelized. Its functional form is given by formula (1), where n is the number of computing nodes and p is the fraction of the computing task that can be processed in parallel; p is determined from the speedup speedup_m measured on n_s computing nodes, as given by formula (2). Based on Amdahl's law, the placement probabilities of the data over the m bucket directories can then be obtained, as given by formula (3).
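The formulas referred to here did not survive in this text. For reference, the standard statement of Amdahl's law and its inversion for p are reproduced below; they presumably correspond to formulas (1) and (2), although the patent's own notation cannot be confirmed from this text. Formula (3), the placement probabilities themselves, cannot be recovered from the surrounding description and is not reconstructed.

```latex
% Amdahl's law: speedup on n nodes when a fraction p of the task is parallelizable
\mathrm{speedup}(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}

% Solving the same relation for p, given a speedup measured on n_s nodes
p = \frac{1 - \dfrac{1}{\mathrm{speedup}_m}}{1 - \dfrac{1}{n_s}}
```

The second line follows from the first by writing 1/speedup_m = 1 - p(1 - 1/n_s) and isolating p.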
Because the data of every block in a block group is placed probabilistically into the same m bucket directories, hot spots for reads and writes could arise during placement or querying. To prevent this, the probabilistic placement model requires that the blocks of a block group do not all have the same placement probability in the same bucket. For the blocks of a block group it is therefore enough to ensure that the order of their placement probabilities over the m bucket directories is not identical. The simplest way is this: if the block number is odd, the data of the block is placed into the m bucket directories with the placement probabilities in ascending order; if the block number is even, the data of the block is placed into the m bucket directories with the placement probabilities in descending order.
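The odd/even rule described above can be written in a few lines. In this sketch the returned list is read positionally: element i is the placement probability the block uses for the i-th bucket directory. The function name is hypothetical.

```python
def placement_order(block_number, probs):
    """Placement probabilities per bucket for one block.

    Odd-numbered blocks use the probabilities in ascending order,
    even-numbered blocks in descending order.
    """
    ordered = sorted(probs)
    return ordered if block_number % 2 == 1 else ordered[::-1]
```

Two neighbouring blocks of a group thus put their largest share of data into opposite ends of the bucket list, which spreads reads and writes.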
The core of the present invention is a probability-based big data query method: by specifying a recall probability at query time, it lowers the likelihood of retrieving all the data in exchange for better query performance. To query data, the user enters a query statement that sets the query condition. The form of the query statement is shown in Table 1. The select clause gives the attribute to query and the type of aggregate operation, one of avg, min, max, sum, and count; the from clause gives the target data set; the where clause gives the query attributes and their values; the recall clause gives the recall probability, a number greater than 0 and less than or equal to 1 that expresses how likely the query is to find all data satisfying the query condition.
Table 1 Query statement
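The body of Table 1 is not preserved in this text. As an illustration of the four clauses and of the checks in steps 3.2 and 3.3, the sketch below represents an already parsed statement as a small data class and validates it. The `Query` class, the `validate` function, and the shape of the `tables` catalogue (data set name mapped to its set of query attributes) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

AGGREGATES = {"avg", "min", "max", "sum", "count"}


@dataclass
class Query:
    aggregate: str                   # select: aggregate operation
    attribute: str                   # select: attribute to aggregate
    target: str                      # from: target data set
    where: dict                      # where: query attribute -> value
    recall: Optional[float] = None   # recall: recall probability p_r


def validate(q, tables):
    """Check constraints 1 to 3 and return the effective recall probability.

    Raises ValueError when the query must fail; falls back to a recall
    probability of 1 when it is missing or outside (0, 1].
    """
    if q.target not in tables:                                # constraint 1
        raise ValueError("unknown data set")
    if not q.where or not set(q.where) <= tables[q.target]:   # constraint 2
        raise ValueError("invalid query attributes")
    if q.aggregate not in AGGREGATES:                         # constraint 3
        raise ValueError("unsupported aggregate")
    if q.recall is None or not 0 < q.recall <= 1:             # constraint 4
        return 1.0
    return q.recall
```

A missing or out-of-range recall probability does not fail the query; it degrades to a complete query with a recall probability of 1.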
Clearly, when the query condition is given by the query statement shown in Table 1, the specified target data set name and the query attributes with their values make it easy to determine, through the data model, the block containing the data and the block group containing that block, and hence the bucket directories that hold the queried data. The key to probability-based big data query is to select, within these bucket directories, trunk files that satisfy the recall probability and to run the query on them.
To select trunk files that satisfy the recall probability, the existence probability of the queried data in each trunk file must first be solved. It is given by formula (4), where p_e denotes the existence probability of the data, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, and w_k denotes the placement count of the block's data in the k-th trunk group. After the existence probabilities of the queried data over all trunk files have been solved, they must be normalized. The normalization is given by formula (5), where n denotes the total number of trunk files and p_ej denotes the existence probability in the j-th trunk file.
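Formulas (4) and (5) are not preserved in this text, so the sketch below rests on an explicit assumption rather than on the patent's own formula: the existence probability of a block's data in the trunk file of bucket i and trunk group k is taken as p_i multiplied by the share of the block's records that went to trunk group k, that is p_i · w_k / Σ w. The normalization simply divides each value by the sum of all values. Function names are hypothetical.

```python
def existence_probabilities(probs, counts):
    """Assumed existence probability per trunk file, keyed by (bucket i, trunk group k).

    Assumption (not from the source): p_e(i, k) = p_i * w_k / sum(w).
    """
    total = sum(counts)
    return {(i, k): p * w / total
            for i, p in enumerate(probs)
            for k, w in enumerate(counts)}


def normalize(pe):
    """Scale existence probabilities so that they sum to 1 over all trunk files."""
    s = sum(pe.values())
    return {key: value / s for key, value in pe.items()}
```

Under this assumption the values already sum to 1 when the placement probabilities do, so the normalization only matters when some trunk files are excluded or the inputs are unnormalized.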
Once the existence probability of the queried data in each trunk file has been solved, trunk files are selected heuristically according to these probabilities, such that the selected trunk files satisfy the following two constraints: 1. the recall probability of the queried data on the selected trunk files is greater than or equal to the specified recall probability p_r, which ensures that the query result meets the query conditions; 2. for identical query conditions, the trunk files selected by each query are not exactly the same, so that each query result has a degree of randomness and all data satisfying the query conditions have a chance of being returned.
Fig. 6 is the flow chart of the heuristic search algorithm of this embodiment, and its pseudocode is given in Table 2. First, the existence probabilities of the queried data in the trunk files are normalized by the normalization formula. Second, the trunk files whose non-existence probability 1-p_e is less than or equal to the recall probability p_r are added to the set MapSelect<trunk, p_e>, and the remaining trunk files are added to the set MapNonSelect<trunk, p_e>. Finally, the heuristicTrunkSelect() function selects trunk files from MapNonSelect<trunk, p_e> and adds them to MapSelect<trunk, p_e>, so that the trunk files in MapSelect<trunk, p_e> satisfy the required recall probability.
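The initial split into the two sets can be sketched as below. This is only an illustration of the rule stated in the text (non-existence probability 1-p_e compared against p_r); the function name `initial_partition` and the sample probabilities are hypothetical, and plain dictionaries stand in for the MapSelect and MapNonSelect sets.

```python
from typing import Dict, Tuple


def initial_partition(
    existence: Dict[str, float], recall: float
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split trunk files into (map_select, map_non_select) by comparing 1 - p_e with p_r."""
    if not 0 < recall <= 1:
        raise ValueError("recall probability must be in (0, 1]")
    map_select: Dict[str, float] = {}
    map_non_select: Dict[str, float] = {}
    for trunk, p_e in existence.items():
        # A file whose non-existence probability is at most p_r is selected up front.
        target = map_select if (1 - p_e) <= recall else map_non_select
        target[trunk] = p_e
    return map_select, map_non_select


sel, non = initial_partition({"t1": 0.7, "t2": 0.2, "t3": 0.1}, recall=0.5)
```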
Table 2. Heuristic search algorithm
The heuristicTrunkSelect() function selects trunk files using a heuristic strategy. During selection, the two situations shown in Fig. 7(a) and Fig. 7(b) may arise. First, two trunk files are selected at random from MapNonSelect<trunk, p_e>. If the product of their non-existence probabilities satisfies p_1·p_2 > p_r, the two trunk files are removed from MapNonSelect<trunk, p_e>, and selection continues among the trunk files with 1-p_e > p_1·p_2, as shown in Fig. 7(a). If the product satisfies p_1·p_2 < p_r, all trunk files with 1-p_e ≤ min{p_1, p_2} are removed from MapNonSelect<trunk, p_e> and added to MapSelect<trunk, p_e>, and selection continues with trunk files that have a larger 1-p_e, as shown in Fig. 7(b). Selection stops when no selectable trunk file remains or when the product of the non-existence probabilities equals p_r. The trunk files still unselected in MapNonSelect<trunk, p_e> are then added to MapSelect<trunk, p_e>, and the trunk files in MapSelect<trunk, p_e> are all the trunk files that satisfy the recall probability p_r.
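A much simplified stand-in for this selection is sketched below. It is not the heuristicTrunkSelect() procedure itself: instead of the pairwise comparison of Fig. 7, it visits the trunk files in random order and skips a file only while the product of the non-existence probabilities of all skipped files stays at or above p_r. It therefore keeps the two stated constraints (recall target met, selection varies between runs) but not the exact steps. The name `select_trunks` and the sample probabilities are hypothetical.

```python
import random
from typing import Dict, List, Optional


def select_trunks(
    existence: Dict[str, float],
    recall: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomized greedy selection (a simplified variant, not the patent's algorithm)."""
    if not 0 < recall <= 1:
        raise ValueError("recall probability must be in (0, 1]")
    rng = rng or random.Random()
    candidates = list(existence)
    rng.shuffle(candidates)  # a different visiting order gives a different selection
    product = 1.0  # product of non-existence probabilities of skipped files
    selected = []
    for trunk in candidates:
        absent = 1.0 - existence[trunk]
        if product * absent >= recall:
            product *= absent  # skipping this file still meets the recall target
        else:
            selected.append(trunk)  # the file must be read
    return sorted(selected)


probs = {"t1": 0.4, "t2": 0.3, "t3": 0.2, "t4": 0.1}
chosen = select_trunks(probs, 0.5, random.Random(7))
```

With a recall probability of 1, no file with a positive existence probability can be skipped, so every trunk file is read.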
The principle of the heuristic trunk file selection method is as follows. Because the data in a block are stored in these n trunk files, the recall probability of the queried data over all n trunk files is 1, i.e. p_e(p_e1, p_e2, p_e3, …, p_en) = 1. Let the non-existence probabilities of the queried data in these n trunk files be p_e1', p_e2', p_e3', …, p_en'. Then the probability that the data in the block are stored in trunk_1 to trunk_(n-1) is p_en', i.e. the recall probability is p_en'. The probability that the queried data are stored in trunk_1 to trunk_(n-2) is p_en'·p_e(n-1)', i.e. the recall probability is p_en'·p_e(n-1)'. Continuing in the same way, the probability that the data are stored in trunk_1 to trunk_(n-m) is p_en'·p_e(n-1)'·…·p_e(n-m+1)', and this product is also the recall probability. It follows that the recall probability of the block's data over any set of trunk files is the product of the non-existence probabilities of the block's data in the remaining trunk files.
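The product rule above lends itself to a small worked example. The sketch below computes the recall probability of a chosen set of trunk files as the product of the non-existence probabilities of the files left out; the function name `recall_of_selection` and the numbers are hypothetical.

```python
import math
from typing import Dict, Iterable


def recall_of_selection(non_existence: Dict[str, float], selected: Iterable[str]) -> float:
    """Product of the non-existence probabilities of the trunk files that are not selected."""
    chosen = set(selected)
    remaining = (p for trunk, p in non_existence.items() if trunk not in chosen)
    # start=1.0 makes the result 1.0 when every trunk file is selected.
    return math.prod(remaining, start=1.0)


# Hypothetical non-existence probabilities p_e1', p_e2', p_e3'.
absent = {"trunk_1": 0.5, "trunk_2": 0.8, "trunk_3": 0.9}
r_one = recall_of_selection(absent, ["trunk_1"])          # 0.8 * 0.9
r_all = recall_of_selection(absent, list(absent))         # nothing left out
```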
The query error can be estimated from the recall probability of the queried data on each trunk file. The query error incurred by querying the trunk files selected in MapSelect<trunk, p_e> is solved by formula (6), where trunk_ik denotes the trunk file of the k-th trunk group in the i-th bucket subdirectory, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, w_k denotes the number of placements of the block's data in the k-th trunk group, and s denotes the total number of trunk groups.
Finally, the selected trunk files are processed in parallel using the MapReduce programming model to select the data that satisfy the query conditions. If an aggregation operation is required on certain attributes, it is performed on those attributes using the MapReduce programming model, and the query result is returned.
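The map and reduce phases of such an aggregation can be illustrated in plain Python. This is a single-process sketch of the programming model, not Hadoop code: each trunk file is assumed to be a list of records (dictionaries), the map step filters one trunk file and emits a partial (sum, count), and the reduce step merges the partials into an average. All names are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, float]
Partial = Tuple[float, int]


def map_trunk(records: Iterable[Record], predicate: Callable[[Record], bool], attribute: str) -> Partial:
    """Map phase: filter one trunk file and emit a partial (sum, count)."""
    values = [r[attribute] for r in records if predicate(r)]
    return (sum(values), len(values))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce phase: merge two partial results."""
    return (a[0] + b[0], a[1] + b[1])


def query_avg(trunks: List[List[Record]], predicate: Callable[[Record], bool], attribute: str) -> Optional[float]:
    """Average of `attribute` over matching records in the selected trunk files."""
    partials = (map_trunk(t, predicate, attribute) for t in trunks)
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count if count else None


selected_trunks = [
    [{"x": 1.0, "y": 10.0}, {"x": 5.0, "y": 20.0}],
    [{"x": 2.0, "y": 30.0}],
]
result = query_avg(selected_trunks, lambda r: r["x"] < 3, "y")
print(result)  # 20.0
```

Carrying (sum, count) pairs rather than per-file averages keeps the merge step associative, which is what lets the partial results be combined in any order.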
This embodiment of the invention is implemented on Hadoop 2.6.0. The invention was verified experimentally on a heterogeneous Hadoop cluster of 37 PCs, each with 8 GB of memory: 12 machines use Intel i3 processors, 12 use Intel i5 processors, and 13 use Intel i7 processors. One i7 machine serves as the NameNode and the remaining 36 machines serve as DataNodes.
Two data sets, S_1 and S_2, are used in the experiments. Both are randomly generated, with data volumes of 40 GB and 50 GB respectively. For each of S_1 and S_2, 3 attributes are chosen as the query attributes of the data. Based on these query attributes, the data of each data set are distributed in a 3-dimensional data space, which is partitioned with equal width into 64 blocks; every 8 blocks form one block group. The data of each block are placed probabilistically into 44 bucket directories, with placement probabilities of 0.001 to 0.0043, and 0.054.
This embodiment verifies the invention with the two groups of experiments shown in Fig. 8 and Fig. 9. Fig. 8 uses S_2 as the experimental data set, shows the query performance under different recall probabilities, and compares it with the ideal fastest query time. The fastest query time is obtained by eliminating the random selection factor in choosing trunk files: trunk files are taken in descending order of non-existence probability, so the number of trunk files excluded reaches its maximum and query performance reaches its optimum. The results in Fig. 8 show that, whether or not the random selection of trunk files is removed, query performance decreases linearly as the recall probability increases.
Fig. 9 uses S_1 and S_2 as the experimental data sets, with the recall probability set to 1 and to 0.5 (labelled "probabilistic query-1" and "probabilistic query-0.5" in the figure), and compares query performance with that of the HBase, Cassandra, Hive and MongoDB databases. The results in Fig. 9 show that when the recall probability is 1, query performance is close to that of the other databases except HBase; when the recall probability is 0.5, query performance is clearly better than that of all the databases tested.

Claims (5)

  1. A probability-based big data query method, characterized by comprising the following steps:
    Step 1: Partition a data set having multiple attributes;
    Step 2: Load the partitioned data set;
    Step 3: Perform a probabilistic query on the data set;
    Step 1 comprises the following steps:
    Step 1.1: Select one or more attributes of the data set as the query attributes of the data set, and specify the equal-width partition granularity for the value domain of each query attribute;
    Step 1.2: Fill in the missing query attribute values in the data set; typically, a missing value is set to the minimum value, the maximum value or a null value of the query attribute within its value domain;
    Step 1.3: Determine the data type of the query attribute values, which is either numeric or text; if numeric, perform step 1.4; if text, perform step 1.5;
    Step 1.4: Sort by the magnitude of the query attribute values, partition the query attribute with equal width according to its partition granularity, and continue with step 1.6;
    Step 1.5: Sort by the lexicographic order of the initial letters of the query attribute values, partition the query attribute with equal width according to its partition granularity, and continue with step 1.6;
    Step 1.6: Store the dimension information of each dimension in the distributed file system; the dimension information mainly comprises the dimension name, the value type of the dimension and the partition granularity of the dimension.
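Steps 1.2 to 1.5 above can be sketched as a pair of small functions. This is an illustration under stated assumptions, not the claimed procedure: a missing numeric value is filled with the domain minimum (one of the options named in step 1.2), text values are assumed to start with an ASCII letter a to z, and the names `equal_width_cell` and `initial_letter_cell` are hypothetical.

```python
from typing import Optional


def equal_width_cell(value: Optional[float], low: float, high: float, cells: int) -> int:
    """Index of the equal-width cell of [low, high] that contains value."""
    if cells <= 0 or high <= low:
        raise ValueError("need cells > 0 and high > low")
    if value is None:
        value = low  # fill a missing value with the domain minimum
    if not low <= value <= high:
        raise ValueError("value outside the attribute's value domain")
    width = (high - low) / cells
    # The upper bound belongs to the last cell.
    return min(int((value - low) / width), cells - 1)


def initial_letter_cell(text: str, cells: int) -> int:
    """Text attribute: order by initial letter, then partition with equal width."""
    first = text[:1].lower()
    if not "a" <= first <= "z":
        raise ValueError("expected a value starting with an ASCII letter")
    return equal_width_cell(float(ord(first) - ord("a")), 0.0, 25.0, cells)


numeric_cell = equal_width_cell(25.0, 0.0, 100.0, 4)
missing_cell = equal_width_cell(None, 0.0, 100.0, 4)
text_cell = initial_letter_cell("Zebra", 2)
```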
  2. The probability-based big data query method according to claim 1, characterized in that step 2 comprises the following steps:
    Step 2.1: Group all the data blocks obtained by partitioning the data set;
    Taking each query attribute as one dimension of a multi-dimensional data space, the data of the data set are distributed in that space. Equal-width partitioning of the value domain of a query attribute is in fact equal-width partitioning of the value space of the corresponding dimension. Based on the partitioning of each dimension, the data distributed in the multi-dimensional data space are divided into many small data blocks, and each such small data block is called a block;
    The blocks in the multi-dimensional data space are numbered using a multi-dimensional space linearization technique, and are divided into one or more block groups in order of their numbers;
    Step 2.2: Create the storage directories of the data set in the distributed file system;
    Step 2.2.1: Determine whether the root directory in which the database system stores data exists; if it does not exist, perform step 2.2.2; if it exists, perform step 2.2.3;
    Step 2.2.2: Create the root directory in which the database system stores data, and perform step 2.2.3;
    Step 2.2.3: Create, under the root directory, the table directory that stores this data set; the directory is named with the name specified for the data set;
    Step 2.2.4: Create m bucket subdirectories for each block group to store data; these m subdirectories are named by the rule "block group number, bucket subdirectory number". A bucket is a subdirectory of the table directory and is the unit of probabilistic placement for the data in a block; the data in one block may be placed probabilistically into multiple bucket directories;
    Step 2.3: Place the data of each block in each block group, with m different placement probabilities, into the m different bucket subdirectories under the table directory; the data are stored in the trunk files of the bucket subdirectories. A trunk file is the basic unit of data storage for the data set, is contained in a bucket directory, and may store data from multiple blocks;
    Any record in a block may be stored in different trunk files across the m bucket subdirectories; these m trunk files are called a trunk group. For every block whose data are placed into the trunk group, the number of placements of that block's data in the trunk group must be recorded;
    If any trunk file in the trunk group has reached the specified size, perform step 2.4; otherwise, continue with step 2.3;
    If all data in the data set have been placed, perform step 2.5;
    Step 2.4: Create a new trunk file in each of the m bucket subdirectories to store data, and perform step 2.3;
    Step 2.5: Store, in the distributed file system, the number of placements of each block of each block group in all trunk groups.
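The placement loop of steps 2.3 to 2.5 can be sketched for a single block as follows. Several simplifications are assumed and are not from the source: the trunk file size limit is expressed as a record count (`trunk_capacity`), only one block is placed (real trunk files may hold data of several blocks), and nothing is written to a file system. The function name `place_block` is hypothetical. The returned list holds the placement count w_k of the block for each trunk group k.

```python
import random
from typing import List, Sequence


def place_block(
    num_records: int,
    placement_probs: Sequence[float],
    trunk_capacity: int,
    rng: random.Random,
) -> List[int]:
    """Place one block's records into m buckets at random; return counts per trunk group."""
    if abs(sum(placement_probs) - 1.0) > 1e-9:
        raise ValueError("placement probabilities must sum to 1")
    m = len(placement_probs)
    fill = [0] * m   # records held by each trunk file of the current trunk group
    counts = [0]     # counts[k] = placements of this block in trunk group k
    for _ in range(num_records):
        if max(fill) >= trunk_capacity:
            # Step 2.4 analogue: a trunk file is full, so open a new trunk group.
            fill = [0] * m
            counts.append(0)
        bucket = rng.choices(range(m), weights=placement_probs)[0]
        fill[bucket] += 1
        counts[-1] += 1
    return counts


w = place_block(100, [0.5, 0.3, 0.2], trunk_capacity=20, rng=random.Random(1))
```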
  3. The probability-based big data query method according to claim 2, characterized in that step 3 comprises the following steps:
    Step 3.1: The user sets the query conditions by entering a query statement;
    Step 3.2: Determine whether the query conditions set in step 3.1 satisfy the following constraints:
    Constraint 1: The target data set must exist in the database system;
    Constraint 2: The query attributes are among the specified query attributes and form a non-empty subset of the query attribute set;
    Constraint 3: The aggregation type is one of the specified aggregation operations;
    Constraint 4: The recall probability must be a decimal greater than 0 and less than or equal to 1;
    If constraints 1 to 3 are satisfied but the recall probability is not specified or does not satisfy constraint 4, perform step 3.3; if all four constraints are satisfied, perform step 3.4; if any of constraints 1 to 3 is not satisfied, the query fails and the method ends;
    Step 3.3: Set the recall probability to 1, and perform step 3.4;
    Step 3.4: Determine the block and block group to which the queried data belong from the data table and query attributes specified in the query statement;
    Step 3.5: Read the number of placements, in each trunk group, of the data of the blocks in the block group;
    Step 3.6: Solve the existence probability of the queried data in each trunk file;
    Step 3.7: According to the existence probability of the data in each trunk file, select trunk files heuristically so that the selected trunk files satisfy the following two constraints;
    Constraint 5: The recall probability of the queried data on the selected trunk files is greater than or equal to the specified recall probability p_r;
    Constraint 6: For identical query conditions, the trunk files selected by each query are not exactly the same, so that each query result has a degree of randomness and all data satisfying the query conditions have a chance of being returned;
    The specific steps of the heuristic trunk file selection method are as follows:
    Step 3.7.1: Normalize the existence probabilities of all trunk files that may store the queried data;
    Step 3.7.2: Add the trunk files whose non-existence probability 1-p_e is less than or equal to the recall probability p_r to the set MapSelect<trunk, p_e>, and add the other trunk files to the set MapNonSelect<trunk, p_e>;
    Step 3.7.3: Select two trunk files at random from MapNonSelect<trunk, p_e>; letting the non-existence probabilities of the queried data in these two trunk files be p_1 and p_2 respectively, compute the product p of p_1 and p_2;
    Step 3.7.4: If the product p of the non-existence probabilities is greater than the recall probability p_r, perform step 3.7.5;
    If the product p of the non-existence probabilities is less than the recall probability p_r, perform step 3.7.6;
    If the product p of the non-existence probabilities is equal to the recall probability p_r, perform step 3.7.7;
    If no trunk file can be selected from MapNonSelect<trunk, p_e>, perform step 3.8;
    Step 3.7.5: Delete these two elements from MapNonSelect<trunk, p_e>, then randomly select from MapNonSelect<trunk, p_e> one trunk file whose non-existence probability is greater than p, let p = p·(1-p_e), where p_e is the existence probability of the selected trunk file, and perform step 3.7.4;
    Step 3.7.6: Add all trunk files in MapNonSelect<trunk, p_e> whose non-existence probability satisfies 1-p_e ≤ min{p_1, p_2} to MapSelect<trunk, p_e>, and delete these trunk files from MapNonSelect<trunk, p_e>; from the trunk files remaining in MapNonSelect<trunk, p_e>, continue by selecting a trunk file whose non-existence probability is greater than min{p_1, p_2}, update the product p with the newly selected file, and perform step 3.7.4;
    Step 3.7.7: If unselected trunk files remain in MapNonSelect<trunk, p_e>, add them all to MapSelect<trunk, p_e>, and perform step 3.8;
    Step 3.8: Calculate the query error by the query error formula (formula (6) of the description), where trunk_ik denotes the trunk file of the k-th trunk group in the i-th bucket subdirectory, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, w_k denotes the number of placements of the block's data in the k-th trunk group, and s denotes the total number of trunk groups;
    Step 3.9: Process the trunk files in MapSelect<trunk, p_e> in parallel based on the MapReduce programming model, and query the data that satisfy the query attributes.
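The constraint check of steps 3.2 and 3.3 above can be sketched as follows. The function name `resolve_recall` and its parameters are hypothetical; the sketch raises an error where the claim says the query fails, and falls back to a recall probability of 1 where the claim says step 3.3 applies.

```python
from typing import Optional, Set

AGGREGATIONS = {"avg", "min", "max", "sum", "count"}


def resolve_recall(
    dataset: str,
    known_datasets: Set[str],
    attributes: Set[str],
    query_attributes: Set[str],
    aggregation: str,
    recall: Optional[float],
) -> float:
    """Check constraints 1 to 4 and return the recall probability to use."""
    if dataset not in known_datasets:                        # constraint 1
        raise ValueError("target data set does not exist")
    if not attributes or not attributes <= query_attributes:  # constraint 2
        raise ValueError("query attributes must be a non-empty subset")
    if aggregation not in AGGREGATIONS:                      # constraint 3
        raise ValueError("unsupported aggregation type")
    if recall is None or not 0 < recall <= 1:                # constraint 4 fails
        return 1.0                                           # step 3.3
    return float(recall)


default_recall = resolve_recall("S1", {"S1", "S2"}, {"a"}, {"a", "b", "c"}, "avg", None)
given_recall = resolve_recall("S1", {"S1", "S2"}, {"a", "b"}, {"a", "b", "c"}, "count", 0.5)
```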
  4. The probability-based big data query method according to claim 3, characterized in that the query statement in step 3.1 comprises four clauses, select, from, where and recall, wherein the select clause specifies the attributes to be queried and the type of aggregation operation, which is one of avg, min, max, sum and count; the from clause specifies the target data set to be queried; the where clause specifies the query attributes and their values; and the recall clause specifies the recall probability, whose value is denoted p_r. The recall probability is a number greater than 0 and less than or equal to 1 that expresses how likely the query is to return all data satisfying the query conditions.
  5. The probability-based big data query method according to claim 3, characterized in that in step 3.6 the existence probability of the queried data in each trunk file is solved by the existence probability formula (formula (4) of the description), where p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, and w_k denotes the number of placements of the block's data in the k-th trunk group.
CN201510492377.8A 2015-08-12 2015-08-12 A kind of big data querying method based on probability Active CN105117442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510492377.8A CN105117442B (en) 2015-08-12 2015-08-12 A kind of big data querying method based on probability

Publications (2)

Publication Number Publication Date
CN105117442A CN105117442A (en) 2015-12-02
CN105117442B true CN105117442B (en) 2018-05-04

Family

ID=54665432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510492377.8A Active CN105117442B (en) 2015-08-12 2015-08-12 A kind of big data querying method based on probability

Country Status (1)

Country Link
CN (1) CN105117442B (en)

Also Published As

Publication number Publication date
CN105117442A (en) 2015-12-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant