CN105117442B - A probability-based big data query method - Google Patents
- Publication number
- CN105117442B (application number CN201510492377.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- trunk
- probability
- files
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24532—Query optimisation of parallel queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/278—Data partitioning, e.g. horizontal or vertical partitioning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a probability-based big data query method and belongs to the field of database technology. The method comprises: a step of partitioning a data set with multiple attributes according to a data model; a step of loading the partitioned data set according to a probabilistic data placement model; and a step of performing a probabilistic query on the data set. The method is an approximately complete query method: it improves query performance by giving up an appropriate amount of query completeness. A probability-based data placement model places data probabilistically and allows the existence probability of the data in each storage file to be solved. A heuristic data query method lets the database system query data according to a recall probability, that is, the probability that a query returns all data satisfying the query condition. The query error of a probabilistic query is guaranteed through probability calculation.
Description
Technical field
The invention belongs to the field of database technology, and in particular relates to a probability-based big data query method.
Background
The deep integration of the ternary world of people, machines, and things has triggered explosive growth in data scale and a high degree of complexity in data patterns, and the world has entered the networked big data era. The arrival of this era poses great challenges to traditional data management systems. NoSQL (Not only SQL) databases have won wide favor in academia and industry for their high scalability, high availability, and flexible data models. Data query is one of the core technologies of a database system; with the development of cloud computing and NoSQL database technology, NoSQL-based data query techniques have attracted much attention and have been studied extensively in industry.
It is well known that current mainstream NoSQL databases manage big data mainly on the basis of the MapReduce programming model and distributed file systems: the distributed file system stores the data, and the MapReduce programming model processes it. The query performance of a NoSQL database is closely related to data storage and index design, MapReduce-based query processing, and query optimization. Current research on big data query technology concentrates on optimizing the performance of these key technologies, which have been studied broadly and in depth and for which many excellent solutions exist. The paper "Survey of query techniques in cloud data management systems" summarizes and analyzes research on query technology in cloud data management systems from the perspectives of index management, query processing, query optimization, and online aggregation. However, in terms of query mode, both traditional relational databases and newer NoSQL databases use complete queries: for a given query condition, however the matching algorithm is defined (exact or approximate) and however the query results are sorted, the query always returns all matching data. For example, suppose a user information table contains fields such as ID card number, name, and age. For any given query condition, such as all users older than 30 or all users named Zhang San, the query always returns all data that satisfy the condition.
In a big data environment, the large data scale and the complexity of data structures make complete queries time-consuming. Many practical applications show that users do not need a definite, complete query result, nor an exact ordering of the results (as in Top-k queries). They need only a partial result that meets a certain completeness requirement, or can accept an appropriate loss of query completeness in exchange for performance. For example, people at an airport searching for hotels that meet certain conditions do not need the full result set; they care more about response time. The complete query mode used by current database systems cannot meet this demand, and an approximately complete query technique is urgently needed to fill the gap. An approximately complete query differs from a traditional complete query in that its approximation lies in the likelihood of retrieving everything, that is, the probability that the query returns all data satisfying the query condition. This probability is referred to here as the recall probability; it describes the likelihood that the query result set is the complete data set.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a probability-based big data query method that meets the need for approximately complete queries in a big data environment.
The technical solution of the present invention is as follows:
A probability-based big data query method comprises the following steps:
Step 1: Partition the data set, which has multiple attributes;
Step 1.1: Select one or more attributes of the data set as its query attributes, and specify the equal-width partition granularity of the value domain of each query attribute;
Step 1.2: Fill in the missing query attribute values in the data set; typically, a missing value is set to the minimum value, the maximum value, or a null value of the query attribute's value domain;
Step 1.3: Determine the data type of the query attribute values, which is either numeric or text; if numeric, perform step 1.4; if text, perform step 1.5;
Step 1.4: Sort by the magnitude of the query attribute values, partition the query attribute into equal-width intervals according to its partition granularity, and continue with step 1.6;
Step 1.5: Sort by the lexicographic order of the first letter of the query attribute values, partition the query attribute into equal-width intervals according to its partition granularity, and continue with step 1.6;
Step 1.6: Store the information of each dimension in the distributed file system; the dimension information mainly includes the dimension name, the data type of the dimension values, and the partition granularity of the dimension.
Step 2: Load the partitioned data set;
Step 2.1: Group all the data blocks obtained by partitioning the data set;
Each query attribute is taken as one dimension of a multidimensional data space, so the data in the data set are distributed in a multidimensional data space. Equal-width partitioning of the value domain of a query attribute is in effect equal-width partitioning of the value space of the corresponding dimension. Based on the partitioning of each dimension, the data distributed in the multidimensional data space are divided into many small data blocks, each of which is called a block;
The blocks in the multidimensional data space are numbered using a multidimensional space linearization technique, and the blocks are divided, in order of their numbers, into one or more block groups;
Step 2.2: Create the storage directories of the data set in the distributed file system;
Step 2.2.1: Determine whether the root directory in which the database system stores data exists; if it does not exist, perform step 2.2.2; if it exists, perform step 2.2.3;
Step 2.2.2: Create the root directory in which the database system stores data, then perform step 2.2.3;
Step 2.2.3: Create under the root directory the table directory that stores this data set; the directory is named with the name specified for the data set;
Step 2.2.4: Create m bucket subdirectories for each block group to store data; each of these m subdirectories is named by combining the block group number with the bucket subdirectory number;
Step 2.3: Place the data of each block in each block group into the m different bucket subdirectories under the table directory with m different placement probabilities; the data are stored in trunk files in the bucket subdirectories;
Any piece of data in a block may be stored in any of the trunk files in the m bucket subdirectories; these m trunk files are referred to here as a trunk group. For the data of any block placed into a trunk group, the placement count of that block's data in the trunk group must be recorded;
If any trunk file in the trunk group has reached the specified size, perform step 2.4; otherwise, continue with step 2.3;
If all data in the data set have been placed, perform step 2.5;
Step 2.4: Create a new trunk file in each of the m bucket subdirectories to store data, then perform step 2.3;
Step 2.5: Store in the distributed file system the placement counts of each block of each block group in all trunk groups.
Step 3: Perform a probabilistic query on the data set;
Step 3.1: The user sets the query condition by entering a query statement;
The query statement contains four clauses: select, from, where, and recall. The select clause specifies the attributes to query and the type of aggregation operation, which is one of avg, min, max, sum, and count; the from clause specifies the target data set; the where clause specifies the query attributes and their values; the recall clause specifies the recall probability p_r, a number greater than 0 and less than or equal to 1 that expresses the likelihood of retrieving all data satisfying the query condition;
Step 3.2: Determine whether the query condition set in step 3.1 satisfies the following constraints:
Constraint 1: The target data set must exist in the database system;
Constraint 2: The query attributes are among the specified query attributes and form a non-empty subset of the query attribute set;
Constraint 3: The aggregation operation is one of the specified aggregation operations;
Constraint 4: The recall probability must be a decimal greater than 0 and less than or equal to 1;
If constraints 1 to 3 are satisfied but the recall probability is not specified or does not satisfy constraint 4, perform step 3.3; if all four constraints are satisfied, perform step 3.4; if any of constraints 1 to 3 is not satisfied, the query fails and the procedure ends;
Step 3.3: Set the recall probability to 1, then perform step 3.4;
Step 3.4: Determine, from the data table and query attributes specified in the query statement, the block and the block group to which the queried data belong;
Step 3.5: Read the placement counts, in each trunk group, of the data of the blocks in the block group;
Step 3.6: Solve the existence probability of the queried data in each trunk file;
The existence probability p_e is solved by formula (4) of the detailed description, in which p_i denotes the placement probability of the block's data in the i-th bucket subdirectory and w_k denotes the placement count of the block's data in the k-th trunk group;
Step 3.7: According to the existence probability of the data in each trunk file, heuristically select trunk files such that the selected trunk files satisfy the following two constraints;
Constraint 5: The recall probability of the queried data over the selected trunk files is greater than or equal to the specified recall probability p_r;
Constraint 6: For the same query condition, the trunk files selected are not exactly the same from one query to the next, so that each query result has a degree of randomness and every piece of data satisfying the query condition has a chance of being retrieved;
The heuristic selection of trunk files proceeds as follows:
Step 3.7.1: Normalize the existence probabilities of all trunk files that may store the queried data;
Step 3.7.2: Select the trunk files whose non-existence probability 1 - p_e is less than or equal to the recall probability p_r and add them to the set MapSelect<trunk, p_e>; add the other trunk files to the set MapNonSelect<trunk, p_e>;
Step 3.7.3: Randomly select two trunk files from MapNonSelect<trunk, p_e>; let the non-existence probabilities of the queried data in these two trunk files be p_1 and p_2, and compute their product p;
Step 3.7.4: If the product p of the non-existence probabilities is greater than the recall probability p_r, perform step 3.7.5;
If the product p of the non-existence probabilities is less than the recall probability p_r, perform step 3.7.6;
If the product p of the non-existence probabilities is equal to the recall probability p_r, perform step 3.7.7;
If MapNonSelect<trunk, p_e> has no trunk file left to select, perform step 3.8;
Step 3.7.5: Delete the two elements from MapNonSelect<trunk, p_e>; then randomly select from MapNonSelect<trunk, p_e> a trunk file whose non-existence probability is greater than p, let p = p(1 - p_e), where p_e is the existence probability of the selected trunk file, and perform step 3.7.4;
Step 3.7.6: Add all trunk files in MapNonSelect<trunk, p_e> whose non-existence probability satisfies 1 - p_e ≤ min{p_1, p_2} to MapSelect<trunk, p_e>, and delete them from MapNonSelect<trunk, p_e>; from the remaining trunk files in MapNonSelect<trunk, p_e>, select a trunk file whose non-existence probability is greater than min{p_1, p_2}, update p [formula not reproduced], and perform step 3.7.4;
Step 3.7.7: If unselected trunk files remain in MapNonSelect<trunk, p_e>, add them all to MapSelect<trunk, p_e>, then perform step 3.8;
Step 3.8: Calculate the query error by the formula [formula not reproduced], in which trunk_ik denotes the trunk file in the i-th bucket subdirectory of the k-th trunk group, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, w_k denotes the placement count of the block's data in the k-th trunk group, and s denotes the total number of trunk groups;
Step 3.9: Process the trunk files in MapSelect<trunk, p_e> in parallel based on the MapReduce programming model, and retrieve the data that satisfy the query attributes.
Beneficial effects of the present invention: the probability-based big data query method of the present invention has the following advantages:
1. The present invention proposes an approximately complete query method that improves data query performance by giving up an appropriate amount of query completeness.
2. The present invention designs a probability-based data placement model that realizes the probabilistic placement of data and the solution of the existence probability of the data in each storage file.
3. The present invention designs a heuristic data query method that lets the database system query data according to a recall probability.
4. The present invention guarantees the query error of probabilistic queries through probability calculation.
Brief description of the drawings
Fig. 1 is the big data querying method flow chart based on probability of the specific embodiment of the invention;
Fig. 2 is the data model figure of the specific embodiment of the invention, wherein:
Fig. 2 (a) figures are the logical model structure diagram of the data model of the specific embodiment of the invention;
Fig. 2 (b) figures are the part amplification displaying figure of Fig. 2 (a) figures;
Fig. 2 (c) figures are the physical model structure schematic diagram of the data model of the specific embodiment of the invention;
Fig. 3 is the Data Physical storage format schematic diagram of the specific embodiment of the invention;
Fig. 4 is that the probability of the specific embodiment of the invention places illustraton of model;
Fig. 5 is probability distribution graph of the specific embodiment of the invention time data in file system;
Fig. 6 is the heuristic search algorithm flow chart of the specific embodiment of the invention;
Fig. 7 is the heuristic selection strategy figure of the specific embodiment of the invention, wherein:
Fig. 7 (a) figures are in the heuristic trunk selection courses of the specific embodiment of the invention, and inquiry data are selected
Product in trunk files there is no probability is more than the situation figure for looking into full probability;
Fig. 7 (b) figures be invention embodiment in heuristic trunk selection courses, inquiry data in selected trunk
Product in file there is no probability is less than the situation figure for looking into full probability;
Fig. 8 is the actual queries time in the case where difference looks into full probability and the best queries timeliness of the specific embodiment of the invention
The experimental result picture of energy;
Fig. 9 be the specific embodiment of the invention when it is respectively 1 and 0.5 to look into full probability, the performance with other databases
Contrast and experiment figure.
Detailed description
The present invention proposes a probability-based big data query method in which a recall probability is specified during the query and the completeness of the query result is traded for improved query performance. It is a new data query technique with good generality, applicability, and scalability. The invention is described in further detail below with reference to the drawings and a specific embodiment. The probability-based big data query method of this embodiment, as shown in Fig. 1, comprises: a step of partitioning a data set with multiple attributes according to a data model; a step of loading the data set according to a probabilistic data placement model; and a step of performing a probabilistic query on the data set.
The data model defines how data are organized in the database system and consists mainly of a logical model and a physical model. Fig. 2 shows the data model of this embodiment: Fig. 2(a) shows the logical model and Fig. 2(c) shows the physical model. In a specific implementation, the data set with multiple attributes is partitioned according to the logical model, and the organization and storage format of the data set in the distributed file system are defined according to the physical model.
The logical model describes the logical organization of data. Any data set stored in the database system usually has multiple attributes. Based on prior knowledge or expert knowledge (prior knowledge is the accumulated historical experience of operating on the data set; expert knowledge is a domain expert's understanding of the data set), one or more attributes of the data set are selected as its query attributes (the attributes on which users rely when querying the data). For example, for a student information table with fields such as student name, student number, gender, age, and credits, frequently queried attributes such as student number and name can be selected as the query attributes of the data set, based on past experience or on the advice of the school's student information administrator. Then the equal-width partition granularity of the value domain of each query attribute is specified, and each query attribute is taken as one dimension of a multidimensional data space, so that the data in the data set are distributed in a multidimensional data space according to these query attributes.
For each dimension of the data space, the values may be numeric data or text data. The value domain of each dimension is partitioned into equal-width intervals according to the given partition granularity, as follows: if the values of the dimension are numeric, they are sorted by magnitude and the value domain is then partitioned into equal-width intervals according to the given granularity; if the values of the dimension are text, they are sorted by the lexicographic order of their first letters and the value domain is then partitioned into equal-width intervals according to the given granularity.
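The per-dimension partitioning described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names, the use of a domain minimum plus an interval width as the "granularity", and the restriction of text values to those starting with an ASCII letter are all assumptions made for the example.

```python
import math

def numeric_interval(value, domain_min, width):
    """Index of the equal-width interval that holds a numeric value.

    `domain_min` and `width` are hypothetical parameters standing in for
    the partition granularity of one dimension.
    """
    return math.floor((value - domain_min) / width)

def text_interval(value, letters_per_interval):
    """Interval index derived from the first letter, in alphabetical order.

    Assumes the value starts with an ASCII letter (an assumption of this sketch).
    """
    first = value[0].lower()
    return (ord(first) - ord("a")) // letters_per_interval

# Ages partitioned into intervals of width 10, names into groups of 5 letters.
age_bins = [numeric_interval(a, 0, 10) for a in [31, 18, 64]]
name_bins = [text_interval(n, 5) for n in ["alice", "zhang", "Li"]]
```

Sorting is implicit here: because the interval index is computed directly from the value, values that sort next to each other land in the same or adjacent intervals.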
Based on the partitioning of each dimension, the multidimensional data space is divided into many small data blocks called blocks. Because of the diversity and complexity of data structures in a big data environment, a piece of data may have no value for some query attribute; in that case its value in that dimension is set to the minimum value, the maximum value, or a null value of that dimension. In this way every piece of data in the data set is assigned to exactly one block. Fig. 2(a) shows a data set with three query attributes whose data are distributed in a three-dimensional data space; after equal-width partitioning of the three dimensions d_1, d_2, and d_3, every piece of data in the data set falls in exactly one block.
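As a rough sketch of how a record could be mapped to its block in a three-dimensional space like that of Fig. 2(a), the example below fills a missing query attribute with a per-dimension default and computes one interval index per dimension. The `dims` structure, the choice of the domain minimum as fill value, and the restriction to numeric dimensions are assumptions of the example.

```python
def block_of(record, dims):
    """Return the block coordinates of a record.

    `dims` maps each query attribute to (domain_min, width, fill_value);
    this layout is hypothetical and only covers numeric dimensions.
    """
    coords = []
    for name, (domain_min, width, fill_value) in dims.items():
        value = record.get(name)
        if value is None:          # the record has no value in this dimension
            value = fill_value     # e.g. the minimum of the value domain
        coords.append(int((value - domain_min) // width))
    return tuple(coords)

dims = {"d1": (0, 10, 0), "d2": (0, 5, 0), "d3": (100, 50, 100)}
full = block_of({"d1": 31, "d2": 7, "d3": 260}, dims)
partial = block_of({"d1": 31, "d3": 260}, dims)   # d2 is missing
```

Because the fill value is fixed per dimension, a record with missing attributes still lands in exactly one block, which is the property the paragraph above relies on.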
The physical model describes the physical organization of data, that is, how data are organized and stored in the distributed file system. The present invention physically organizes a data set as table, bucket, and trunk. A table is the storage directory of a data set and corresponds to a multidimensional data space in the logical model. A bucket is a subdirectory of the table directory and is the unit of probabilistic placement for the data in a block; the data in one block may be placed probabilistically into multiple bucket directories. A trunk is the basic unit in which the data of the data set are stored; trunk files are contained in bucket directories, and each trunk file may store data from multiple blocks. In a big data environment the data are diverse and records do not share the same structure, so data are physically stored in the format defined by SequenceFile. A SequenceFile is a binary file that records a sequence of serialized key-value pairs. For each piece of data in the database system, the key-value pairs of the query attributes are placed first, followed in order by the other key-value pairs; the storage format is shown in Fig. 3.
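The ordering rule for key-value pairs can be illustrated with a short sketch. It only shows the ordering (query attributes first, then the remaining attributes) and does not reproduce the binary SequenceFile encoding; the function name and the sample record are hypothetical.

```python
def order_pairs(record, query_attributes):
    """Return the record's key-value pairs with query attributes first.

    The remaining pairs keep their original order. Serialization to an
    actual SequenceFile is outside the scope of this sketch.
    """
    head = [(k, record[k]) for k in query_attributes if k in record]
    tail = [(k, v) for k, v in record.items() if k not in query_attributes]
    return head + tail

record = {"gender": "F", "name": "Li", "credit": 12, "student_no": "S01"}
pairs = order_pairs(record, ["student_no", "name"])
```

Placing the query attributes at the front means a reader can decide whether a record matches a query after decoding only the leading pairs.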
As described above, the data model logically organizes data in a multidimensional data space and assigns each piece of data to exactly one block, and physically places the data in each block probabilistically into multiple bucket subdirectories of the corresponding table directory. How the data in a block are placed into the bucket subdirectories is crucial; it is determined by the probabilistic placement model, that is, the data in the data set are loaded into the database system according to the probabilistic placement model. Fig. 4 illustrates the probabilistic placement model of this embodiment.
Assume that the data in one block of the data set are placed probabilistically into m bucket directories bucket_1 to bucket_m, with placement probabilities p_1 to p_m on the respective buckets. In one round of placement, the data in the block may then be placed into m trunk files, one in each of the m bucket directories; these m trunk files are called a trunk group. Fig. 3 has s trunk groups G_1 to G_s in total. During placement, when any file in a trunk group reaches the specified file size, the placement count of the block's data in that trunk group is recorded and a new trunk group is created to store data. Fig. 5 shows the probability distribution of the data in the distributed file system at a given moment.
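The placement loop can be sketched in memory as follows. This is an illustration under stated assumptions, not the patent's loader: trunk files are modeled as Python lists, the "specified file size" is modeled as a maximum record count, a single probability vector is shared by all blocks, and the names `place`, `trunks`, and `counts` are invented for the example.

```python
import random
from collections import defaultdict

def place(records, probs, max_trunk_records, rng):
    """Place (block_id, record) pairs into m buckets by probability.

    trunks[(k, i)] models the trunk file of bucket i in trunk group k.
    counts[(block_id, k)] is the placement count of a block in trunk group k.
    """
    m = len(probs)
    trunks = defaultdict(list)
    counts = defaultdict(int)
    group = 0
    for block_id, record in records:
        # Draw the target bucket according to the placement probabilities.
        bucket = rng.choices(range(m), weights=probs, k=1)[0]
        trunks[(group, bucket)].append(record)
        counts[(block_id, group)] += 1
        # Once any trunk file of the group is full, open a new trunk group.
        if len(trunks[(group, bucket)]) >= max_trunk_records:
            group += 1
    return trunks, counts

rng = random.Random(0)
records = [(7, i) for i in range(100)]
trunks, counts = place(records, [0.5, 0.3, 0.2], 10, rng)
```

The recorded counts per block and trunk group are exactly the quantities w_k that the query step later reads back to solve existence probabilities.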
In the probabilistic placement model shown in Fig. 4, each bucket directory may hold probabilistically placed data from multiple blocks. The blocks of the multidimensional data space are therefore numbered by a linearization method and divided, in order of their numbers, into one or more block groups, each containing the same number of blocks. The data of the different blocks in one block group are placed probabilistically into the same m bucket directories.
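The text does not name a specific linearization method, so the sketch below assumes plain row-major ordering as one possible choice; a space-filling curve would serve equally well. The function names and the grid shape are hypothetical.

```python
def block_number(coords, shape):
    """Row-major linear number of a block in a grid of the given shape.

    Row-major order is an assumption of this sketch; the source only
    requires some linearization of the multidimensional space.
    """
    number = 0
    for c, size in zip(coords, shape):
        number = number * size + c
    return number

def block_group(number, blocks_per_group):
    """Group index when consecutive block numbers share a group."""
    return number // blocks_per_group

shape = (4, 3, 2)                       # intervals per dimension
n = block_number((3, 1, 0), shape)
g = block_group(n, 6)
```

Grouping by consecutive numbers gives every block group the same number of blocks, as the paragraph above requires, provided the total block count is a multiple of the group size.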
The data of the blocks in each block group are placed probabilistically into m bucket directories, and the placement probabilities of the data on the m bucket directories are solved on the basis of Amdahl's law. Amdahl's law describes the relationship between the speedup of a computing task after parallelization and the number of parallel processing nodes. Its expression is given by formula (1), where n is the number of compute nodes and p is the fraction of the computing task that can be parallelized:
speedup = 1 / ((1 - p) + p / n)    (1)
p is determined from the speedup speedup_m measured on n_s compute nodes, as given by formula (2), which follows from solving formula (1) for p:
p = (1 / speedup_m - 1) / (1 / n_s - 1)    (2)
Based on Amdahl's law, the placement probabilities of the data on the m bucket directories can then be obtained; the solution is given by formula (3).
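For reference, Amdahl's law in its standard form, speedup = 1 / ((1 - p) + p / n), and its inversion for p can be checked numerically with the short sketch below. The function names are invented for the example, and the sketch deliberately stops short of formula (3): how the placement probabilities are derived from these quantities is not reproduced here.

```python
def amdahl_speedup(p, n):
    """Speedup of a task whose parallelizable fraction is p on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(speedup_m, n_s):
    """Parallelizable fraction p estimated from a speedup measured on n_s nodes.

    Obtained by solving Amdahl's law for p; requires n_s > 1.
    """
    return (1.0 / speedup_m - 1.0) / (1.0 / n_s - 1.0)

measured = amdahl_speedup(0.9, 8)          # pretend this was measured on 8 nodes
estimated_p = parallel_fraction(measured, 8)
```

The round trip recovers the fraction that generated the measurement, which confirms that the two expressions are inverses of each other.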
Because the data of every block in a block group are placed probabilistically into the same m bucket directories, hot-spot reads and hot-spot writes could occur during probabilistic placement or querying. To prevent this, the probabilistic placement model requires that the placement probabilities of the different blocks of a block group on the same bucket are not all identical. For each block in a block group it therefore suffices to ensure that the order of its placement probabilities over the m bucket directories is not exactly the same. The simplest way is as follows: if the block number is odd, the data in the block are placed probabilistically into the m bucket directories in ascending order of placement probability; if the block number is even, the data in the block are placed probabilistically into the m bucket directories in descending order of placement probability.
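One reading of the odd/even rule is that the same set of m placement probabilities is assigned to bucket_1 through bucket_m in ascending order for odd-numbered blocks and in descending order for even-numbered blocks. The sketch below implements that reading; the interpretation and the function name are assumptions.

```python
def placement_order(block_number, probs):
    """Placement probabilities for buckets 1..m of one block.

    Odd block numbers get the probabilities in ascending order,
    even block numbers in descending order.
    """
    ordered = sorted(probs)
    if block_number % 2 == 0:
        ordered.reverse()
    return ordered

odd_block = placement_order(3, [0.5, 0.2, 0.3])
even_block = placement_order(4, [0.5, 0.2, 0.3])
```

With this assignment, the bucket that is most likely to receive data from an odd-numbered block is the least likely to receive data from an even-numbered one, which spreads reads and writes across buckets.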
The core of the present invention is a probability-based big data query method in which a given recall probability lowers the likelihood of retrieving all the data in exchange for better query performance. To query data, the user enters a query statement to set the query condition. The format of the query statement is shown in Table 1: the select clause specifies the attributes to query and the type of aggregation operation, which is one of avg, min, max, sum, and count; the from clause specifies the target data set; the where clause specifies the query attributes and their values; the recall clause specifies the recall probability, a number greater than 0 and less than or equal to 1 that expresses the likelihood of retrieving all data satisfying the query condition.
Table 1. Query statement
Clearly, when a query condition is given by the query statement shown in Table 1, the specified target data set name and the query attributes and their values make it easy, using the data model, to determine the block containing the data and the block group containing that block, and thus the bucket directories containing the queried data. The key to probability-based big data querying is to select, within these bucket directories, trunk files that satisfy the recall probability and to run the query on them.
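A query statement with the four clauses described above, together with the checks of constraints 1 to 4 from step 3.2, could be represented roughly as follows. The `Query` class, the `catalog` mapping from data set name to query attributes, and the error messages are all hypothetical; the concrete statement syntax of Table 1 is not reproduced.

```python
from dataclasses import dataclass
from typing import Dict, Optional

AGGREGATES = {"avg", "min", "max", "sum", "count"}

@dataclass
class Query:
    aggregate: str                    # select clause: aggregation type
    attribute: str                    # select clause: attribute to aggregate
    dataset: str                      # from clause
    where: Dict[str, object]          # where clause: query attribute -> value
    recall: Optional[float] = None    # recall clause: recall probability

def validate(query, catalog):
    """Check constraints 1 to 3 and default the recall probability to 1."""
    if query.dataset not in catalog:                          # constraint 1
        raise ValueError("unknown data set")
    if not query.where or not set(query.where) <= catalog[query.dataset]:
        raise ValueError("invalid query attributes")          # constraint 2
    if query.aggregate not in AGGREGATES:                     # constraint 3
        raise ValueError("unsupported aggregation")
    if query.recall is None or not (0 < query.recall <= 1):   # constraint 4
        query.recall = 1.0                                    # step 3.3
    return query

catalog = {"student": {"student_no", "name"}}
q = validate(Query("count", "credit", "student", {"name": "Li"}), catalog)
```

Defaulting to a recall probability of 1 makes an unqualified query behave like a traditional complete query.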
To select trunk files that satisfy the recall probability, the existence probability of the queried data in each trunk file must first be solved, as given by formula (4), where p_e denotes the existence probability of the data, p_i denotes the placement probability of the block's data in the i-th bucket subdirectory, and w_k denotes the placement count of the block's data in the k-th trunk group. After the existence probabilities of the queried data in all trunk files have been solved, they must be normalized; the normalization is given by formula (5), where n denotes the total number of trunk files and p_ej denotes the existence probability of the j-th trunk file.
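The normalization step can be sketched under the assumption that it divides each existence probability p_ej by the sum of the existence probabilities over all n trunk files, so that the normalized values sum to 1. The source formula is not reproduced here, so this form is an assumption, as are the function name and the sample values.

```python
def normalize(existence):
    """Scale existence probabilities so that they sum to 1.

    `existence` maps a trunk file identifier to its existence probability.
    Dividing by the total is the assumed form of the normalization.
    """
    total = sum(existence.values())
    if total == 0:
        raise ValueError("no trunk file can hold the queried data")
    return {trunk: p / total for trunk, p in existence.items()}

normalized = normalize({"t1": 0.2, "t2": 0.6})
```

After normalization the values can be read as a distribution of the queried data over the candidate trunk files.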
Once the existence probabilities of the queried data in each trunk file have been solved, trunk files are selected heuristically according to these probabilities so that the selected trunk files satisfy two constraints: (1) the recall probability of the queried data over the selected trunk files is greater than or equal to the specified recall probability p_r, which ensures that the query result meets the query condition; (2) for the same query condition, the trunk files selected are not exactly the same from one query to the next, so that each query result has a degree of randomness and every piece of data satisfying the query condition has a chance of being retrieved.
Fig. 6 is the flow chart of the heuristic search algorithm of this embodiment, and its pseudo-code is given in Table 2. First, the existence probabilities of the queried data in the trunk files are normalized with the normalization formula. Second, trunk files whose non-existence probability $1-p_e$ is less than or equal to the recall probability $p_r$ are added to the set MapSelect<trunk, $p_e$>, and all other trunk files are added to the set MapNonSelect<trunk, $p_e$>. Finally, the heuristicTrunkSelect() function selects trunk files from MapNonSelect<trunk, $p_e$> and adds them to MapSelect<trunk, $p_e$>, so that the trunk files in MapSelect<trunk, $p_e$> meet the recall probability requirement.
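The initial split into the two sets can be sketched as follows. This covers only the partition step described above, not the pseudo-code of Table 2; the function name `partition_trunks`, the dictionary representation of the two sets, and the sample values are assumptions made for illustration.

```python
def partition_trunks(existence, p_r):
    """Split trunk files into (map_select, map_non_select).

    existence: dict mapping trunk name -> normalized existence probability p_e.
    A trunk file goes to map_select when its non-existence probability
    1 - p_e is at most the recall probability p_r.
    """
    map_select, map_non_select = {}, {}
    for trunk, p_e in existence.items():
        target = map_select if (1.0 - p_e) <= p_r else map_non_select
        target[trunk] = p_e
    return map_select, map_non_select


# Hypothetical normalized existence probabilities.
existence = {"trunk_a": 0.5, "trunk_b": 0.3, "trunk_c": 0.2}
selected, pending = partition_trunks(existence, p_r=0.6)
print(sorted(selected), sorted(pending))
```

Trunk files that land in `map_select` here are the ones whose omission alone would already push the recall below $p_r$, so they must always be queried.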
Table 2: Heuristic search algorithm
The heuristicTrunkSelect() function selects trunk files with a heuristic selection strategy. During selection, the two situations shown in Fig. 7(a) and Fig. 7(b) can arise. First, two trunk files are chosen at random from MapNonSelect<trunk, $p_e$>. If the product of their non-existence probabilities satisfies $p_1 \cdot p_2 > p_r$, the two trunk files are removed from MapNonSelect<trunk, $p_e$> and selection continues with trunk files for which $1-p_e > p_1 \cdot p_2$, as shown in Fig. 7(a). If the product satisfies $p_1 \cdot p_2 < p_r$, all trunk files with $1-p_e \le \min\{p_1, p_2\}$ are removed and added to MapSelect<trunk, $p_e$>, and selection continues with trunk files whose $1-p_e$ is larger, as shown in Fig. 7(b). Selection stops when no selectable trunk file remains or when the product of non-existence probabilities equals $p_r$; the trunk files left unselected in MapNonSelect<trunk, $p_e$> are then added to MapSelect<trunk, $p_e$>, and the trunk files in MapSelect<trunk, $p_e$> are all the trunk files that satisfy the recall probability $p_r$.
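The pairwise procedure cannot be fully recovered from the text, so the sketch below is a deliberately simpler stand-in and not the patent's heuristicTrunkSelect(). It visits the pending trunk files in random order and rejects a file only while the product of the non-existence probabilities of all rejected files stays at or above $p_r$; every other file is selected. It still honours the two constraints (recall at least $p_r$, selection varying between queries). The function name `random_reject_select` and the sample probabilities are hypothetical.

```python
import random


def random_reject_select(map_select, map_non_select, p_r, rng=None):
    """Simplified stand-in for heuristicTrunkSelect() (not the patent's procedure).

    Both maps are dicts: trunk name -> existence probability p_e.
    Returns (selected, rejected, recall), where recall is the product of the
    non-existence probabilities 1 - p_e of the rejected trunk files.
    """
    if rng is None:
        rng = random.Random()
    selected = dict(map_select)
    rejected = {}
    recall = 1.0  # empty product: nothing has been rejected yet
    candidates = list(map_non_select.items())
    rng.shuffle(candidates)  # random order varies the selection between queries
    for trunk, p_e in candidates:
        trial = recall * (1.0 - p_e)
        if trial >= p_r:
            rejected[trunk] = p_e  # skipping this file keeps recall >= p_r
            recall = trial
        else:
            selected[trunk] = p_e  # skipping it would drop recall below p_r
    return selected, rejected, recall


# Hypothetical sets, consistent with the partition rule for p_r = 0.8.
map_select = {"t0": 0.53}
map_non_select = {"t1": 0.10, "t2": 0.05, "t3": 0.30, "t4": 0.02}
selected, rejected, recall = random_reject_select(
    map_select, map_non_select, p_r=0.8, rng=random.Random(7)
)
print(sorted(selected), sorted(rejected), recall)
```

Passing a seeded `random.Random` makes a run reproducible; leaving `rng` unset gives a different selection on each call.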
The principle behind the heuristic trunk file selection method is as follows. Because the data of a block is stored in these $n$ trunk files, the recall probability of the queried data over all $n$ trunk files is 1, i.e. $p_e(p_{e1}, p_{e2}, p_{e3}, \ldots, p_{en}) = 1$. Let the non-existence probabilities of the queried data in the $n$ trunk files be $p_{e1}', p_{e2}', p_{e3}', \ldots, p_{en}'$. Then the probability that the block's data is stored on $trunk_{1 \sim n-1}$ is $p_{en}'$, i.e. the recall probability is $p_{en}'$; the probability that the queried data is stored on $trunk_{1 \sim n-2}$ is $p_{en}' \cdot p_{e(n-1)}'$, i.e. the recall probability is $p_{en}' \cdot p_{e(n-1)}'$; and so on, so that the probability that the data is stored on $trunk_{1 \sim n-m}$ is $\prod_{j=n-m+1}^{n} p_{ej}'$, which is also the recall probability. It follows that the recall probability of a block's data over any number of trunk files is the product of the non-existence probabilities of that data in the remaining trunk files.
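The product rule can be checked numerically with a few lines of code. In this sketch the non-existence probability of a trunk file is taken as $1-p_e$, as in the selection step; the function name `recall_of_selection` and the four sample probabilities are hypothetical.

```python
import math


def recall_of_selection(existence, selected):
    """Recall probability when only the `selected` trunk files are queried.

    existence: dict mapping trunk name -> existence probability p_e.
    The result is the product of the non-existence probabilities 1 - p_e
    of the trunk files that are NOT selected.
    """
    return math.prod(
        1.0 - p_e for trunk, p_e in existence.items() if trunk not in selected
    )


# Hypothetical normalized existence probabilities for n = 4 trunk files.
existence = {"trunk1": 0.65, "trunk2": 0.20, "trunk3": 0.10, "trunk4": 0.05}

# Querying trunk1 and trunk2 only: recall = (1 - 0.10) * (1 - 0.05).
recall = recall_of_selection(existence, {"trunk1", "trunk2"})
print(recall)
```

Selecting every trunk file leaves an empty product, which `math.prod` evaluates to 1, matching the statement that the recall probability over all $n$ trunk files is 1.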
The query error can be estimated by computing the recall ratio of the queried data in each trunk file. The query error incurred by querying the trunk files selected in MapSelect<trunk, $p_e$> is given by formula (6), where $trunk_{ik}$ denotes the trunk file of the $k$-th trunk group in the $i$-th bucket subdirectory, $p_i$ denotes the placement probability of the block's data in the $i$-th bucket subdirectory, $w_k$ denotes the number of placements of the block's data in the $k$-th trunk group, and $S$ denotes the total number of trunk groups.
Finally, the selected trunk files are processed in parallel with the MapReduce programming model to pick out the data that satisfies the query condition. If an aggregation operation is required on certain attributes, the MapReduce programming model is used to perform the aggregation on those attributes, and the query result is returned.
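The filter-then-aggregate flow can be imitated in plain Python. This is only an illustration of the map and reduce roles, not Hadoop's MapReduce API: trunk files are modelled as lists of dictionaries, a thread pool stands in for parallel map tasks, and the names `scan_trunk`, `query` and `AGGREGATES` are hypothetical. The five aggregation names are the ones listed for the select clause in the claims.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_trunk(records, predicate, attribute):
    """Map step: keep the attribute value of every matching record."""
    return [rec[attribute] for rec in records if predicate(rec)]


# Reduce step: one function per supported aggregation operation.
AGGREGATES = {
    "min": min,
    "max": max,
    "sum": sum,
    "count": len,
    "avg": lambda values: sum(values) / len(values),
}


def query(trunks, predicate, attribute, aggregate):
    """Scan the selected trunk files in parallel, then aggregate the matches."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda recs: scan_trunk(recs, predicate, attribute), trunks)
        values = [value for part in parts for value in part]
    if not values:
        return None  # nothing matched the query condition
    return AGGREGATES[aggregate](values)


# Two hypothetical trunk files holding records with attributes x and y.
trunks = [
    [{"x": 1, "y": 10}, {"x": 5, "y": 20}],
    [{"x": 7, "y": 30}],
]
print(query(trunks, lambda rec: rec["x"] >= 5, "y", "sum"))
```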
The embodiment of the present invention is implemented on Hadoop-2.6.0. The invention was verified experimentally in a heterogeneous Hadoop cluster of 37 PCs, each with 8 GB of memory: 12 machines use Intel i3 processors, 12 use Intel i5 processors, and 13 use Intel i7 processors. One machine with an i7 processor serves as the NameNode, and the remaining 36 machines serve as DataNodes.
Two data sets, $S_1$ and $S_2$, are used in the experiments. Both are randomly generated, with data volumes of 40 GB and 50 GB respectively. Three attributes of each data set are chosen as its query attributes, so that the data of each data set is distributed in a 3-dimensional data space. The data space is partitioned at equal width into 64 blocks, every 8 blocks form a block group, and the data of each block is placed probabilistically in 44 bucket directories, with placement probabilities of 0.001 to 0.0043, and 0.054.
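Probabilistic placement of a block's records can be sketched as a weighted random choice over the bucket subdirectories. The four placement probabilities below are hypothetical and much coarser than the 44-bucket setup of the experiment; the function name `place_record` is also an assumption.

```python
import random
from collections import Counter


def place_record(placement_probs, rng):
    """Pick the index of the bucket subdirectory that receives one record."""
    indices = list(range(len(placement_probs)))
    return rng.choices(indices, weights=placement_probs, k=1)[0]


# Hypothetical placement probabilities for m = 4 buckets (not the experiment's values).
placement_probs = [0.1, 0.2, 0.3, 0.4]
rng = random.Random(42)

# Place 10,000 records of one block and count how many land in each bucket.
counts = Counter(place_record(placement_probs, rng) for _ in range(10_000))
print(sorted(counts.items()))
```

The per-bucket counts play the role of the recorded placement numbers: buckets with a small placement probability hold few records of the block, which is what later allows their trunk files to be skipped at a small cost in recall.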
This embodiment verifies the invention with the two groups of experiments shown in Fig. 8 and Fig. 9. Fig. 8 uses $S_2$ as the experimental data set and shows the query performance under different recall probabilities, compared with the ideal fastest query time. The fastest query time removes the influence of the random selection of trunk files: trunk files are selected in descending order of non-existence probability, so the number of rejected trunk files reaches its maximum and query performance is optimal. The results in Fig. 8 show that, whether or not the randomized selection of trunk files is removed, query performance decreases linearly with the recall probability.
Fig. 9 uses $S_1$ and $S_2$ as experimental data sets, with the recall probability set to 1 and 0.5 (labelled "probabilistic query -1" and "probabilistic query -0.5" in the figure), and compares query performance with that of the HBase, Cassandra, Hive and MongoDB databases. The results in Fig. 9 show that, when the recall probability is 1, query performance is close to that of the other databases except HBase; when the recall probability is 0.5, query performance is clearly better than that of all the databases tested.
Claims (5)
- 1. A probability-based big data query method, characterized in that it comprises the following steps:
  Step 1: partition a data set that has multiple attributes;
  Step 2: load the partitioned data set;
  Step 3: perform a probabilistic query on the data set;
  Step 1 comprises the following steps:
  Step 1.1: select one or more attributes of the data set as its query attributes, and specify for each query attribute the equal-width partition granularity of its value range;
  Step 1.2: fill in the data items whose query attribute values are missing; typically, the missing value is set to the minimum value, the maximum value, or the null value of the query attribute within its value range;
  Step 1.3: determine the data type of the query attribute values, which is either numeric or text; if numeric, perform step 1.4; if text, perform step 1.5;
  Step 1.4: sort by the magnitude of the query attribute values, partition the query attribute at equal width according to its partition granularity, and continue with step 1.6;
  Step 1.5: sort by the lexicographic order of the first letter of the query attribute values, partition the query attribute at equal width according to its partition granularity, and continue with step 1.6;
  Step 1.6: store the dimension information of each dimension in the distributed file system; the dimension information mainly comprises the dimension name, the data type of the dimension values, and the partition granularity of the dimension.
- 2. The probability-based big data query method according to claim 1, characterized in that step 2 comprises the following steps:
  Step 2.1: group all the data blocks obtained by partitioning the data set; with each query attribute taken as one dimension of a multidimensional data space, the data of the data set is distributed in that space; partitioning the value range of a query attribute at equal width is in effect partitioning the value space of each dimension at equal width; based on the partition of each dimension, the data distributed in the multidimensional data space is divided into multiple small data blocks, each of which is called a block; the blocks in the multidimensional data space are numbered using a multidimensional space linearization technique, and are divided into one or more block groups in order of their numbers;
  Step 2.2: create the storage directories of the data set in the distributed file system;
  Step 2.2.1: determine whether the root directory in which the database system stores data exists; if it does not exist, perform step 2.2.2; if it exists, perform step 2.2.3;
  Step 2.2.2: create the root directory in which the database system stores data, and perform step 2.2.3;
  Step 2.2.3: under the root directory, create the table directory that stores the data set, named with the name specified for the data set;
  Step 2.2.4: create m bucket subdirectories for each block group to store data; the naming rule for these m subdirectories is "block group number, bucket subdirectory number"; a bucket is a subdirectory of the table directory and is the unit in which the data of a block is placed probabilistically; the data of one block may be placed probabilistically into multiple bucket directories;
  Step 2.3: place the data of each block in each block group into the m different bucket subdirectories of the table directory with m different placement probabilities respectively; the data is stored in the trunk files of the bucket subdirectories; a trunk is the basic unit of storage for the data of the data set and is contained in a bucket directory, and each trunk file may store the data of multiple blocks; any data item of a block may be stored in different trunk files across the m bucket subdirectories, and these m trunk files are called a trunk group; for the data of any block placed into the trunk group, the number of placements of that block's data in the trunk group must be recorded; if any trunk file in the trunk group has reached the specified size, perform step 2.4; otherwise, continue with step 2.3; if all data of the data set has been placed, perform step 2.5;
  Step 2.4: create a new trunk file in each of the m bucket subdirectories to store data, and perform step 2.3;
  Step 2.5: store, in the distributed file system, the number of placements of each block of each block group in all trunk groups.
- 3. The probability-based big data query method according to claim 2, characterized in that step 3 comprises the following steps:
  Step 3.1: the user sets the query condition by entering a query statement;
  Step 3.2: determine whether the query condition set in step 3.1 satisfies the following constraints:
  Constraint 1: the target data set must exist in the database system;
  Constraint 2: the query attributes are among the specified query attributes and form a non-empty subset of the query attribute set;
  Constraint 3: the aggregation mode is one of the specified aggregation methods;
  Constraint 4: the recall probability must be a decimal greater than 0 and less than or equal to 1;
  if constraints 1 to 3 are satisfied and constraint 4 is unspecified or not satisfied, perform step 3.3; if all four constraints are satisfied, perform step 3.4; if any of constraints 1 to 3 is not satisfied, the query fails and the method ends;
  Step 3.3: set the recall probability to 1 and perform step 3.4;
  Step 3.4: from the data table and query attributes specified in the query statement, determine the block and block group to which the queried data belongs;
  Step 3.5: read the number of placements of the data of the blocks of the block group in each trunk group;
  Step 3.6: compute the existence probability of the queried data in each trunk file;
  Step 3.7: according to the existence probability of the data in each trunk file, select trunk files heuristically so that the selected trunk files satisfy the following two constraints:
  Constraint 5: the recall probability of the queried data over the selected trunk files is greater than or equal to the recall probability $p_r$;
  Constraint 6: for the same query condition, the trunk files selected are not exactly the same from one query to the next, so that each query result has a degree of randomness and every data item that satisfies the query condition has a chance of being returned;
  the heuristic trunk file selection method comprises the following steps:
  Step 3.7.1: normalize the existence probabilities of all trunk files that may store the queried data;
  Step 3.7.2: add trunk files whose non-existence probability $1-p_e$ is less than or equal to the recall probability $p_r$ to the set MapSelect<trunk, $p_e$>, and add the other trunk files to the set MapNonSelect<trunk, $p_e$>;
  Step 3.7.3: choose two trunk files at random from MapNonSelect<trunk, $p_e$>; let the non-existence probabilities of the queried data in these two trunk files be $p_1$ and $p_2$, and compute their product $p$;
  Step 3.7.4: if the product $p$ of the non-existence probabilities is greater than the recall probability $p_r$, perform step 3.7.5; if $p$ is less than $p_r$, perform step 3.7.6; if $p$ is equal to $p_r$, perform step 3.7.7; if MapNonSelect<trunk, $p_e$> contains no selectable trunk file, perform step 3.8;
  Step 3.7.5: delete these two elements from MapNonSelect<trunk, $p_e$>, then choose at random from MapNonSelect<trunk, $p_e$> one trunk file whose non-existence probability is greater than $p$, let $p = p \cdot (1-p_e)$, where $p_e$ is the existence probability of the chosen trunk file, and perform step 3.7.4;
  Step 3.7.6: add all trunk files in MapNonSelect<trunk, $p_e$> whose non-existence probability satisfies $1-p_e \le \min\{p_1, p_2\}$ to MapSelect<trunk, $p_e$>, and delete them from MapNonSelect<trunk, $p_e$>; from the trunk files remaining in MapNonSelect<trunk, $p_e$>, continue by choosing a trunk file whose non-existence probability is larger than $\min\{p_1, p_2\}$, let [formula], and perform step 3.7.4;
  Step 3.7.7: add all trunk files not yet selected in MapNonSelect<trunk, $p_e$> to MapSelect<trunk, $p_e$>, and perform step 3.8;
  Step 3.8: compute the query error by the formula [formula], where $trunk_{ik}$ denotes the trunk file of the $k$-th trunk group in the $i$-th bucket subdirectory, $p_i$ denotes the placement probability of the block's data in the $i$-th bucket subdirectory, $w_k$ denotes the number of placements of the block's data in the $k$-th trunk group, and $S$ denotes the total number of trunk groups;
  Step 3.9: process the trunk files in MapSelect<trunk, $p_e$> in parallel with the MapReduce programming model, and query the data that satisfies the query attributes.
- 4. The probability-based big data query method according to claim 3, characterized in that the query statement in step 3.1 comprises four clauses, select, from, where and recall, wherein: the select clause specifies the attributes to be queried and the type of aggregation operation, which is one of avg, min, max, sum and count; the from clause specifies the target data set to be queried; the where clause specifies the query attributes and their values; and the recall clause specifies the recall probability, where $p_r$ denotes the size of the recall probability; the recall probability is a number greater than 0 and less than or equal to 1 that expresses how likely the query is to return all data satisfying the query condition.
- 5. The probability-based big data query method according to claim 3, characterized in that the formula used in step 3.6 to compute the existence probability of the queried data in each trunk file is [formula], where $p_i$ denotes the placement probability of the block's data in the $i$-th bucket subdirectory and $w_k$ denotes the number of placements of the block's data in the $k$-th trunk group.
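The constraint check of steps 3.2 and 3.3 can be sketched as a small validation function. The function name `resolve_recall`, its parameters, and the choice to treat a missing aggregation as "no aggregation requested" are assumptions for illustration; the five aggregation names come from the select clause of claim 4.

```python
AGGREGATES = {"avg", "min", "max", "sum", "count"}


def resolve_recall(dataset, attributes, aggregate, recall,
                   known_datasets, query_attributes):
    """Check constraints 1 to 4 and return the recall probability to use.

    Raises ValueError when any of constraints 1 to 3 fails (query failure).
    Falls back to 1.0 when the recall probability is missing or outside (0, 1].
    """
    if dataset not in known_datasets:
        raise ValueError("constraint 1: target data set does not exist")
    if not attributes or not set(attributes) <= set(query_attributes):
        raise ValueError("constraint 2: invalid query attributes")
    # Assumption: aggregate=None means that no aggregation was requested.
    if aggregate is not None and aggregate not in AGGREGATES:
        raise ValueError("constraint 3: unsupported aggregation")
    if recall is None or not (0 < recall <= 1):
        return 1.0  # step 3.3: default to complete recall
    return float(recall)


# Hypothetical catalog: one data set with three query attributes.
known = {"sales"}
attrs = {"region", "year", "price"}
print(resolve_recall("sales", ["year"], "avg", 0.5, known, attrs))
print(resolve_recall("sales", ["year"], "avg", None, known, attrs))
```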
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510492377.8A CN105117442B (en) | 2015-08-12 | 2015-08-12 | A kind of big data querying method based on probability |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105117442A CN105117442A (en) | 2015-12-02 |
CN105117442B true CN105117442B (en) | 2018-05-04 |
Family
ID=54665432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510492377.8A Active CN105117442B (en) | 2015-08-12 | 2015-08-12 | A kind of big data querying method based on probability |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105117442B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677840B (en) * | 2016-01-06 | 2019-02-05 | 东北大学 | A kind of data query method based on the cumulative data model of multidimensional |
CN106021488A (en) * | 2016-05-19 | 2016-10-12 | 乐视控股(北京)有限公司 | Key value database management method and apparatus |
CN106294665A (en) * | 2016-08-05 | 2017-01-04 | 浪潮软件股份有限公司 | Method and device for storing student status data |
CN107798019A (en) * | 2016-09-07 | 2018-03-13 | 阿里巴巴集团控股有限公司 | A kind of method and apparatus for being used to provide the node serve data for accelerating service node |
CN107480220B (en) * | 2017-08-01 | 2021-01-12 | 浙江大学 | Rapid text query method based on online aggregation |
CN110489449B (en) * | 2019-07-30 | 2022-02-22 | 北京百分点科技集团股份有限公司 | Chart recommendation method and device and electronic equipment |
CN111931200B (en) * | 2020-07-13 | 2024-02-23 | 车智互联(北京)科技有限公司 | Data serialization method, mobile terminal and readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073718A (en) * | 2011-01-10 | 2011-05-25 | 清华大学 | System and method for explaining, erasing and modifying search result in probabilistic database |
CN103442331A (en) * | 2013-08-07 | 2013-12-11 | 华为技术有限公司 | Terminal equipment position determining method and terminal equipment |
Non-Patent Citations (1)
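Title |
---|
"一种深层网络不确定性概率模型研究" [Research on an uncertainty probability model for the deep web]; 王鹏鸣 (Wang Pengming); 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology series]; 2013-02-15; abstract * |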
Also Published As
Publication number | Publication date |
---|---|
CN105117442A (en) | 2015-12-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105117442B (en) | A kind of big data querying method based on probability | |
CN103177061B (en) | Unique value estimation in partition table | |
CN105488043B (en) | Data query method and system based on Key-Value data block | |
US8484252B2 (en) | Generation of a multidimensional dataset from an associative database | |
CN102722531B (en) | Query method based on regional bitmap indexes in cloud environment | |
US20230139783A1 (en) | Schema-adaptable data enrichment and retrieval | |
CN109952569A (en) | Technology for connection and polymerization based on dictionary | |
CN104281701B (en) | Multiscale Distributed Spatial data query method and system | |
CN105404634B (en) | Data managing method and system based on Key-Value data block | |
CN109062936B (en) | Data query method, computer readable storage medium and terminal equipment | |
Ignatov et al. | Can triconcepts become triclusters? | |
CN103562905B (en) | Improved data visualization configuration system and method | |
CN112527783A (en) | Data quality probing system based on Hadoop | |
Kuzochkina et al. | Analyzing and Comparison of NoSQL DBMS | |
CN108874873B (en) | Data query method, device, storage medium and processor | |
US10521455B2 (en) | System and method for a neural metadata framework | |
CN106845787A (en) | A kind of data method for automatically exchanging and device | |
Pedersen | Managing complex multidimensional data | |
Khalil et al. | New approach for implementing big datamart using NoSQL key-value stores | |
Wang et al. | A resume recommendation model for online recruitment | |
Girsang et al. | Decision support system using data warehouse for hotel reservation system | |
Mazurova et al. | Research of ACID transaction implementation methods for distributed databases using replication technology | |
CN115481026A (en) | Test case generation method and device, computer equipment and storage medium | |
CN111026759B (en) | Report generation method and device based on Hbase | |
Bicevska et al. | NoSQL-based data warehouse solutions: sense, benefits and prerequisites |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |