CN107818181A - Indexing method and system based on the Plcient interactive engine - Google Patents
Indexing method and system based on the Plcient interactive engine
- Publication number
- CN107818181A CN107818181A CN201711203695.3A CN201711203695A CN107818181A CN 107818181 A CN107818181 A CN 107818181A CN 201711203695 A CN201711203695 A CN 201711203695A CN 107818181 A CN107818181 A CN 107818181A
- Authority
- CN
- China
- Prior art keywords
- task
- hiveql
- query
- sentences
- plcient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to an indexing method and system based on the Plcient interactive engine. The method includes: obtaining a HiveQL statement; compiling the HiveQL statement with Plcient to obtain an execution task; submitting the execution task to a control node; the control node handing the execution task to an execution process engine, obtaining metadata information; submitting the metadata information to a task tracker or resource manager for execution; and reading the corresponding files in HDFS, obtaining and returning the execution result. The invention defines grammar rules with the open-source Antlr tool, which greatly simplifies the lexical and syntactic compilation and parsing process, while the staged design keeps the whole compiler code easy to maintain. Each logical operator completes only a single function, simplifying the whole MapReduce program. The result is enhanced timeliness of big-data retrieval, more flexible query modes, and higher execution efficiency.
Description
Technical field
The present invention relates to indexing methods, and more specifically to an indexing method and system based on the Plcient interactive engine.
Background technology
At present, data need the assistance of an index during query and processing, but the indexes of traditional relational databases suffer from the following problems. First, the index is stored on local hard disks, which makes it hard to manage; disaster recovery and high availability are costly to implement, and the migration cost of the index together with the capacity of a single machine's disks constrains the index's scale and size. If data disaster recovery is achieved through redundancy ("master/slave" or "dual write"), designing for data consistency is difficult; if a "bad block" problem occurs, a segment of data read at some moment contains abnormal byte values that the business system cannot detect in time, which may even corrupt the pointers of the whole index and make query results inaccurate. Second, the management of tables, indexes, and scheduling used to be mixed together: the scheduling system manages too many things — it must manage indexes, manage heartbeats, and also maintain disaster recovery — so the machine scale of the scheduling system cannot keep up, and assigning the same computing resources only to fixed index data wastes a great deal of computing power. Third, the hardware requirements are too high: data must stay resident in memory, otherwise they cannot be loaded and queried quickly. This typically demands large memory (128 GB or more) and SSD disks; data at the ten-billion-row scale may even need hundreds of machines to support fast queries, and for trillion-scale data the cost is prohibitive. Fourth, when computing with Spark, long-running jobs often fail because of code-quality problems; architecturally, because massive data are cached in RAM, slow Java garbage collection becomes a serious issue and makes Spark's performance unstable. Spark cannot handle data too large for a single machine: when the data go wrong or intermediate results exceed the size of RAM, insufficient memory or missing results are common; complex SQL statistics are not supported, and the degree of SQL support Spark currently integrates cannot be applied to complex data analysis. In terms of manageability, Spark's integration with YARN is imperfect, which plants hidden worries for production use, where various problems easily arise. Fifth, when using Hive, because the Hive architecture is built on the MapReduce framework, the flexibility of execution plans is poor and the optimizer has few choices it can make.
Therefore, it is necessary to design an indexing method based on the Plcient interactive engine that enhances the timeliness of big-data retrieval, makes query modes more flexible, and raises execution efficiency.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and to provide an indexing method and system based on the Plcient interactive engine.
To achieve the above object, the present invention adopts the following technical scheme: an indexing method based on the Plcient interactive engine, the method comprising:
obtaining a HiveQL statement;
compiling the HiveQL statement with Plcient to obtain an execution task;
submitting the execution task to a control node;
the control node handing the execution task to an execution process engine, obtaining metadata information;
submitting the metadata information to a task tracker or resource manager for execution;
reading the corresponding files in HDFS and operating on them, obtaining and returning the execution result.
In a further technical scheme, the step of obtaining a HiveQL statement comprises the following steps:
submitting a query task to the control node;
obtaining the query task;
obtaining the corresponding Hive metadata information from the metadata repository according to the query task, forming the HiveQL statement.
In a further technical scheme, the step of compiling the HiveQL statement with Plcient to obtain an execution task comprises the following steps:
converting the HiveQL statement into an abstract syntax tree;
converting the abstract syntax tree into query blocks;
converting the query blocks into a logical query plan and rewriting the logical query plan;
converting the logical plan into a physical plan, forming the execution task.
In a further technical scheme, the step of converting the HiveQL statement into an abstract syntax tree comprises the following steps:
defining the grammar rules of HiveQL statements with Antlr;
performing lexical and syntactic parsing of the HiveQL statement according to the grammar rules, forming the abstract syntax tree.
In a further technical scheme, the step of converting the abstract syntax tree into query blocks comprises the following steps:
traversing the abstract syntax tree in order;
obtaining the nodes with different token names and saving them into the corresponding attributes, forming the outer query block and the subquery blocks.
In a further technical scheme, the step of converting the query blocks into the logical query plan and rewriting the logical query plan comprises the following steps:
traversing the query blocks and translating them into an operator tree;
transforming the operator tree and merging operators;
traversing the operator tree and translating it into MapReduce tasks, forming the logical query plan and rewriting the logical query plan.
The present invention also provides an indexing system based on the Plcient interactive engine, comprising a statement acquiring unit, a compiling unit, a submitting unit, a handover unit, an execution unit, and an operation reading unit;
the statement acquiring unit, for obtaining a HiveQL statement;
the compiling unit, for compiling the HiveQL statement with Plcient to obtain an execution task;
the submitting unit, for submitting the execution task to a control node;
the handover unit, for the control node to hand the execution task to the execution process engine for execution, obtaining metadata information;
the execution unit, for submitting the metadata information to a task tracker or resource manager for execution;
the operation reading unit, for reading the corresponding files in HDFS and operating on them, obtaining and returning the execution result.
In a further technical scheme, the statement acquiring unit comprises a task submitting module, a task acquiring module, and an information acquiring module;
the task submitting module, for submitting a query task to the control node;
the task acquiring module, for obtaining the query task;
the information acquiring module, for obtaining the corresponding Hive metadata information from the metadata repository according to the query task, forming the HiveQL statement.
In a further technical scheme, the compiling unit comprises a statement converting module, an abstract-syntax converting module, a query-block converting module, and a physical converting module;
the statement converting module, for converting the HiveQL statement into an abstract syntax tree;
the abstract-syntax converting module, for converting the abstract syntax tree into query blocks;
the query-block converting module, for converting the query blocks into a logical query plan and rewriting the logical query plan;
the physical converting module, for converting the logical plan into a physical plan, forming the execution task.
In a further technical scheme, the statement converting module comprises a defining submodule and a parsing submodule;
the defining submodule, for defining the grammar rules of HiveQL statements with Antlr;
the parsing submodule, for performing lexical and syntactic parsing of HiveQL statements according to the grammar rules, forming the abstract syntax tree.
Compared with the prior art, the invention has the following advantages: the indexing method based on the Plcient interactive engine of the present invention obtains a HiveQL statement, compiles the statement with Plcient to form an execution task, executes the task, and obtains the execution result. Defining the grammar rules with the open-source Antlr tool greatly simplifies the lexical and syntactic compilation and parsing process; the staged design keeps the whole compiler code easy to maintain; each logical operator completes only a single function, simplifying the whole MapReduce program. The result is enhanced timeliness of big-data retrieval, more flexible query modes, and higher execution efficiency.
The invention will be further described below in conjunction with the accompanying drawings and specific embodiments.
Brief description of the drawings
Fig. 1 is a flow chart of the indexing method based on the Plcient interactive engine provided by a specific embodiment of the invention;
Fig. 2 is a flow chart of obtaining a HiveQL statement provided by a specific embodiment of the invention;
Fig. 3 is a flow chart of compiling a HiveQL statement with Plcient to obtain an execution task, provided by a specific embodiment of the invention;
Fig. 4 is a structural block diagram of the indexing system based on the Plcient interactive engine provided by a specific embodiment of the invention.
Embodiment
In order to better understand the technical content of the present invention, the technical scheme is further introduced and explained below with reference to specific embodiments, but the invention is not limited to them.
As shown in Figs. 1-4, the indexing method based on the Plcient interactive engine provided by this embodiment, when used in the process of data indexing, enhances the timeliness of big-data retrieval, makes query modes more flexible, and raises execution efficiency.
Plclient is analysis software developed by Prolong Cloud to meet users' needs for heuristic, ad-hoc analysis of big data. Plclient applies traditional database indexing technology to big data technology, breaking the deadlock of current big-data computing and evolving big-data retrieval toward stronger timeliness, more flexible query modes, and higher execution efficiency. Plclient is written in Java, is practical, and offers SQL interfaces, so users find it easy to pick up; at the same time it can meet the needs of high-end users with daily increments of tens of billions of rows and total data volumes in the trillions. Plclient's main technical advantage lies in massive indexing: the massive index accelerates retrieval, reduces the time spent on grouping, statistics, and sorting in queries, and saves resources by improving system performance and response time. The capability of the massive-indexing technology lets Plclient keep query response times within a few seconds and data-import latency within a few minutes even at such large data scales. Plclient's full name is Prolong Cloud Plclient; it is a real-time, multidimensional, interactive query, statistics, and analysis engine based on the Hadoop distributed architecture, delivering second-level performance at trillion-row scales with enterprise-grade reliability and stability. Plclient is a fine-grained index: data are imported immediately, the index is generated in real time, and the index efficiently locates the relevant data. Plclient integrates deeply with Spark, letting Spark compute directly on Plclient retrieval result sets, which can accelerate Spark performance a hundredfold in the same scenario. Its applicable objects are as follows:
First, users of traditional relational databases that can no longer hold more data and whose retrieval efficiency suffers badly;
Second, users currently doing full-text search with SOLR or ES who find the analytic functions provided by SOLR and ES too limited to implement complex business logic, or for whom SOLR and ES become unstable as data volume grows — shard loss and rebalancing enter a continuous vicious cycle, service cannot recover automatically, and operations staff frequently have to get up at midnight to restart the cluster;
Third, users whose business depends on the analysis of massive data but whose existing offline computing platforms cannot meet business needs in speed and response time;
Fourth, users who need multidimensional targeted analysis of user-profile and behavior data;
Fifth, users who need to retrieve from large amounts of UGC (User Generated Content) data;
Sixth, when fast, interactive queries over large data sets are needed;
Seventh, when data analysis beyond simple key-value storage is needed;
Eighth, when data must be analyzed in real time as they are produced.
As shown in Fig. 1, this embodiment provides an indexing method based on the Plcient interactive engine. The method comprises:
S1, obtaining a HiveQL statement;
S2, compiling the HiveQL statement with Plcient to obtain an execution task;
S3, submitting the execution task to a control node;
S4, the control node handing the execution task to the execution process engine, obtaining metadata information;
S5, submitting the metadata information to a task tracker or resource manager for execution;
S6, reading the corresponding files in HDFS and operating on them, obtaining and returning the execution result.
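Steps S1-S6 can be sketched end to end as a small pipeline. This is a hypothetical illustration only: all function names, the mock metastore, and the mock HDFS dictionary are invented for the sketch and are not the actual Plcient API.

```python
# Hypothetical sketch of the S1-S6 pipeline; names are illustrative.

def obtain_statement(query_task, metastore):
    """S1: build a HiveQL statement from the task and Hive metadata (S13)."""
    meta = metastore[query_task["table"]]
    return f"SELECT {', '.join(meta['columns'])} FROM {query_task['table']}"

def compile_statement(stmt):
    """S2: stand-in for the Plcient compilation, yielding an execution task."""
    return {"plan": "mapreduce", "sql": stmt}

def run_pipeline(query_task, metastore, hdfs):
    """S3-S6: submit to the control node, execute, read HDFS, return result."""
    stmt = obtain_statement(query_task, metastore)
    task = compile_statement(stmt)
    # S4/S5: the control node hands the task to the execution engine, which
    # here simply scans the table's file in the mock HDFS (S6).
    path = metastore[query_task["table"]]["hdfs_path"]
    return {"sql": task["sql"], "rows": hdfs[path]}

metastore = {"orders": {"columns": ["dealid", "uid"],
                        "hdfs_path": "/warehouse/orders"}}
hdfs = {"/warehouse/orders": [(1, "u1"), (2, "u2")]}
result = run_pipeline({"table": "orders"}, metastore, hdfs)
```

The sketch only shows the data flow between the six steps; the real compilation (S2) is the multi-stage process detailed below.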
In certain embodiments, the above step S1, obtaining a HiveQL statement, comprises the following steps:
S11, submitting a query task to the control node;
S12, obtaining the query task;
S13, obtaining the corresponding Hive metadata information from the metadata repository according to the query task, forming the HiveQL statement.
Specifically, the user submits a query or similar task to the control node; after the compiler obtains the user's query task, it fetches the needed Hive metadata information from the metadata repository according to the user's task.
Further, in certain embodiments, the above step S2, compiling the HiveQL statement with Plcient to obtain an execution task, comprises the following steps:
S21, converting the HiveQL statement into an abstract syntax tree;
S22, converting the abstract syntax tree into query blocks;
S23, converting the query blocks into a logical query plan and rewriting the logical query plan;
S24, converting the logical plan into a physical plan, forming the execution task.
Specifically, the control node first receives an SQL string, which the parser turns into an abstract syntax tree. This is accomplished with Antlr: Antlr converts the SQL into an abstract syntax tree according to the grammar file, and the abstract syntax tree then becomes query blocks. A query block is the simplest query unit; generally, one From clause generates one query block. Generating query blocks is a recursive process. The generated query blocks pass through the logical-query-plan stage and become an execution graph — a directed acyclic graph. This operator DAG passes through the logical optimizer, which adjusts the edges and nodes of the graph and revises their order, producing an optimized directed acyclic graph. These optimizations may include predicate pushdown, partition pruning, join reordering, and so on. Even after logical optimization, this directed acyclic graph still cannot be executed directly; hence the process of generating the physical execution plan. Hive's practice is to cut the plan wherever distribution is needed, generating one MapReduce job per cut — for example at the Group By part, the Join part, the Distribute By part, and the Distinct part. After so many cuts, the earlier logical execution plan — that logical directed acyclic graph — has been diced into many subgraphs, each forming a node. These nodes are linked into an execution plan graph, the Task Tree. The Task Tree is further optimized and adjusted, for example by choosing execution paths based on the input or adding backup jobs; this optimization is completed by the physical-plan transformation. After the physical-plan transformation, each node is a MapReduce job or a local job, and the plan can be executed.
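The cutting step described above — one MapReduce job per distribution point — can be shown with a toy sketch. This is not Hive's actual code; the operator names and the linear plan are simplified stand-ins for the DAG.

```python
# Illustrative sketch: cut a linear operator plan into MapReduce jobs at the
# operators that require data redistribution (Group By, Join, etc.).

DISTRIBUTION_OPS = {"GROUPBY", "JOIN", "DISTRIBUTEBY", "DISTINCT"}

def cut_into_jobs(operators):
    """Each cut before a distribution operator starts a new job."""
    jobs, current = [], []
    for op in operators:
        if op in DISTRIBUTION_OPS and current:
            jobs.append(current)      # close the previous job at the cut
            current = []
        current.append(op)
    if current:
        jobs.append(current)
    return jobs

plan = ["TABLESCAN", "FILTER", "JOIN", "SELECT", "GROUPBY", "SELECT"]
jobs = cut_into_jobs(plan)
# -> [["TABLESCAN", "FILTER"], ["JOIN", "SELECT"], ["GROUPBY", "SELECT"]]
```

Each resulting sublist corresponds to one node of the Task Tree, i.e. one MapReduce job.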
The Join referred to above tags the data of different tables in the map output value, and the reduce phase judges the data source by the tag. Group By combines the GroupBy fields as the map output key, uses the MapReduce sort, and in the reduce phase saves the last key seen in order to distinguish different keys. When there is only one distinct field — leaving aside the map-phase hash GroupBy — it suffices to output the combination of the GroupBy fields and the Distinct field as the map key, use the MapReduce sort, take the GroupBy fields as the reduce key, and save the last key in the reduce phase to complete deduplication. When there are multiple Distinct fields, for example the following SQL: select dealid, count(distinct uid), count(distinct date) from order group by dealid; deduplication is carried out in one of the following two ways:
First, still following the single-Distinct-field method above: this implementation cannot sort by uid and date separately, so it cannot deduplicate by saving the last key, and must instead deduplicate by hash values in memory during the reduce phase;
Second, numbering all the Distinct fields and generating n rows of data for each input row; rows with the same field then sort together, and the reduce phase only needs to record the last key to deduplicate. This implementation makes good use of the MapReduce sort and saves the memory otherwise consumed by reduce-phase deduplication.
It should be noted that when generating the reduce value, only the row of the first Distinct field needs to keep the value fields; the value fields of the remaining Distinct rows can be empty.
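The second scheme — numbering the distinct fields and emitting n rows per input row — can be sketched as follows. This is an illustrative simulation of the MapReduce sort-and-reduce behavior, not Hive's generated code; function names are invented.

```python
# Sketch of the row-expansion deduplication for multiple Distinct fields:
# key = (group fields, distinct-field number, distinct value), so the sort
# groups duplicates together and the reducer only remembers the last key.

def map_expand(rows, group_field, distinct_fields):
    """Map side: emit one record per (row, distinct field)."""
    out = []
    for row in rows:
        for i, field in enumerate(distinct_fields):
            out.append((row[group_field], i, row[field]))
    return out

def reduce_count_distinct(records, n_fields):
    """Reduce side: after sorting, a change of key marks a new distinct value."""
    counts, last = {}, None
    for key in sorted(records):
        group, field_no, _ = key
        if key != last:                       # duplicate keys sort adjacently
            counts.setdefault(group, [0] * n_fields)[field_no] += 1
        last = key
    return counts

rows = [
    {"dealid": 1, "uid": "a", "date": "d1"},
    {"dealid": 1, "uid": "a", "date": "d2"},
    {"dealid": 1, "uid": "b", "date": "d1"},
]
# count(distinct uid), count(distinct date) grouped by dealid
result = reduce_count_distinct(map_expand(rows, "dealid", ["uid", "date"]), 2)
# -> {1: [2, 2]}
```

No per-group hash table is needed in the reducer, which is exactly the memory saving the text describes.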
Further, in certain embodiments, the above step S21, converting the HiveQL statement into an abstract syntax tree, comprises the following steps:
S211, defining the grammar rules of HiveQL statements with Antlr;
S212, performing lexical and syntactic parsing of the HiveQL statement according to the grammar rules, forming the abstract syntax tree.
For the above step S211: before Hive version 0.10, the grammar rules were defined in a single file, Hive.g. As the grammar rules grew more and more complex, the Java parser classes generated from them could exceed the maximum size of a Java class file, so version 0.11 split Hive.g into five files: the lexical rules HiveLexer.g and the grammar rules SelectClauseParser.g, FromClauseParser.g, IdentifiersParser.g, and HiveParser.g.
The following grammar fragment is the rule for SelectStatement in Hive SQL, from which it can be seen that a SelectStatement contains the select, from, where, groupby, having, and orderby clauses. The code by which Antlr parses Hive SQL is as follows:
HiveLexerX and HiveParser are the lexical-analysis and syntax-analysis classes automatically generated when Antlr compiles the grammar file Hive.g; the complex parsing is carried out in these two classes. An inner subquery also generates a TOK_DESTINATION node. This node is added specially during grammar rewriting, because in Hive all query data are stored in temporary HDFS files: whether for an intermediate subquery or the final query result, an Insert statement eventually writes the data into the HDFS directory where the table resides.
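The Antlr-generated HiveLexerX/HiveParser classes are not reproduced in this text. The following toy stand-in (invented for illustration — it is not Antlr and its node layout is not Hive's) shows the same idea: turning a SELECT ... FROM ... WHERE ... string into a small abstract syntax tree whose node labels follow the Hive token-name style.

```python
# Toy parser producing a Hive-style AST; illustrative only.
import re

def parse_select(sql):
    m = re.match(r"select\s+(?P<sel>.+?)\s+from\s+(?P<tab>\w+)"
                 r"(?:\s+where\s+(?P<whr>.+))?$", sql.strip(), re.I)
    if not m:
        raise ValueError("not a recognized SELECT statement")
    ast = ("TOK_QUERY",
           ("TOK_FROM", ("TOK_TABREF", m.group("tab"))),
           ("TOK_INSERT",
            ("TOK_SELECT", tuple(c.strip() for c in m.group("sel").split(",")))))
    if m.group("whr"):
        ast += (("TOK_WHERE", m.group("whr")),)  # where clause becomes a child
    return ast

tree = parse_select("SELECT dealid, uid FROM order WHERE dealid > 0")
```

A real Antlr grammar would generate this tree from declarative rules rather than a regular expression; the sketch only shows the shape of the output.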
For the above step S212, lexical analysis is carried out based on the SQL lexical analyzer, specifically as follows:
The source program text is input. In many cases, to identify word symbols better, the input string is preprocessed: preprocessing mainly filters whitespace and skips comments, newlines, and the like. During lexical analysis, determining the word class sometimes requires scanning several characters ahead. In a formula-oriented language such as FORTRAN, keywords are not reserved words and may be used as identifiers, and spaces carry no meaning; to determine the word class, several characters must be pre-scanned. Consider the following FORTRAN statements:
1 DO99K=1,10;
2 IF (5.EQ.M) I=10;
3 DO99K=1.10;
4 IF (5)=55;
Statements 1 and 2 are a DO statement and an IF statement respectively, while statements 3 and 4 are assignment statements. To distinguish statements 1 and 3, and 2 and 4, correctly, several characters must be pre-scanned. The difference between statements 1 and 3 lies in the first delimiter after the equal sign: one is a comma, the other the end-of-statement symbol. The main difference between statements 2 and 4 is the first character after the right parenthesis: one is a letter, the other an equal sign. To recognize the keywords in 1 and 2, multiple characters must be pre-scanned — ahead until the point where the word class can be affirmed. To distinguish 1 from 3, pre-scanning must reach the first delimiter after the equal sign; for statements 2 and 4, pre-scanning must reach the first character after the right parenthesis matching the left parenthesis that follows IF.
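The pre-scanning (lookahead) idea for the DO case can be sketched as follows. This is a deliberately simplified classifier invented for illustration — real FORTRAN lexers handle far more cases.

```python
# Sketch of lookahead: "DO99K=1,10" is a DO loop, "DO99K=1.10" an assignment.
# The scanner cannot decide at "DO"; it must look ahead to the first
# delimiter after the "=".

def classify(stmt):
    """Return 'DO' for a DO-loop header, 'ASSIGN' otherwise (simplified;
    the IF case described in the text is analogous)."""
    s = stmt.replace(" ", "")             # FORTRAN spaces are insignificant
    if s.upper().startswith("DO") and "=" in s:
        after_eq = s.split("=", 1)[1]
        # lookahead: a comma after the "=" means a real DO loop
        return "DO" if "," in after_eq else "ASSIGN"
    return "ASSIGN"
```

`classify("DO99K=1,10")` yields `"DO"` while `classify("DO99K=1.10")` yields `"ASSIGN"`, exactly the distinction statements 1 and 3 require.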
A state transition diagram is used to recognize word symbols; a state transition diagram is a finite directed graph with one initial state and at least one final state. Here state 0 is the initial state and state 2 is a final state. The process by which this diagram recognizes (accepts) an identifier is: starting from initial state 0, if the input character in state 0 is a letter, it is read in and the diagram moves to state 1. In state 1, if the next input character is a letter or digit, it is read in and state 1 is re-entered. This process repeats until state 1 finds that the input character is no longer a letter or digit (that character has by then already been read in), whereupon state 2 is entered. State 2 is a final state, which means that an identifier has been recognized and the recognition process ends. The asterisk on the final state means that one character too many has been read — a character that does not belong to the identifier — and it should be returned to the input. If the input character in state 0 is not a letter, no identifier is recognized; in other words, the recognition fails.
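The transition diagram just described can be transcribed directly into code. The sketch below is illustrative (function name and return convention are invented); it implements exactly the 0 → 1 → 2 automaton, including the over-read character that is "returned to the input" by reporting the position it should be re-read from.

```python
# Direct transcription of the identifier transition diagram: state 0 initial,
# a letter moves to state 1, letters/digits loop in state 1, anything else
# enters final state 2 with one character over-read.

def recognize_identifier(text, pos=0):
    """Return (identifier, next_pos) on success, or (None, pos) on failure."""
    state, start = 0, pos
    while True:
        ch = text[pos] if pos < len(text) else ""
        if state == 0:
            if ch.isalpha():
                state, pos = 1, pos + 1       # 0 --letter--> 1
            else:
                return None, start            # no identifier recognized
        elif state == 1:
            if ch.isalnum():
                pos += 1                      # 1 --letter/digit--> 1
            else:
                # enter final state 2; ch is the over-read character,
                # "returned" by leaving next_pos pointing at it
                return text[start:pos], pos
```

For example, `recognize_identifier("abc1 = 5")` recognizes `"abc1"` and leaves the scanner positioned at the space that was over-read.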
A regular expression is an important notation for describing words and a tool for defining regular sets. In lexical analysis, regular expressions are used to describe the forms that identifiers may take. Definition (regular expressions and the regular sets they denote): let the alphabet be Σ. ε and ∅ are regular expressions over Σ, and the regular sets they denote are {ε} and {} respectively; for any a ∈ Σ, a is a regular expression over Σ, and the regular set it denotes is {a}; supposing U and V are regular expressions over Σ denoting the regular sets L(U) and L(V) respectively, then (U), U|V, UV, and U* are also regular expressions, and the regular sets they denote are L(U), L(U)∪L(V), L(U)L(V), and (L(U))* respectively. Only the expressions obtained by finitely many applications of the above three rules are regular expressions over Σ, and only the word sets denoted by these regular expressions are regular sets over Σ. The operator "|" of a regular expression is read "or"; "·" is read "concatenation"; "*" is read "closure" (that is, any finite number of self-concatenations). Where no confusion arises, parentheses may be omitted, with operator precedence ordered as "(", ")", "*", "·", "|". The concatenation operator "·" is usually omitted. "*", "·", and "|" are all left-associative.
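The three operators defined above — union "|", concatenation, and closure "*" — can be demonstrated with Python's `re` module, whose syntax for these three operators matches the definition:

```python
# Union, concatenation, and closure, as defined above, via Python's re module.
import re

assert re.fullmatch(r"ab|cd", "cd")          # L(U|V) = L(U) ∪ L(V)
assert re.fullmatch(r"(ab)(cd)", "abcd")     # L(UV)  = L(U)L(V)
assert re.fullmatch(r"(ab)*", "ababab")      # L(U*)  = (L(U))*
assert re.fullmatch(r"(ab)*", "")            # zero repetitions: the empty word

# Identifiers as in the transition diagram: letter (letter | digit)*
ident = re.compile(r"[A-Za-z][A-Za-z0-9]*")
assert ident.fullmatch("K99")
assert not ident.fullmatch("9K")
```

Note how the identifier pattern `letter (letter|digit)*` is the regular expression corresponding exactly to the 0 → 1 → 2 state transition diagram.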
Further, in certain embodiments, the above step S22, converting the abstract syntax tree into query blocks, comprises the following steps:
S221, traversing the abstract syntax tree in order;
S222, obtaining the nodes with different token names and saving them into the corresponding attributes, forming the outer query block and the subquery blocks.
A query block is the most basic component unit of an SQL query, consisting of three parts: input source, computation process, and output. Simply put, a query block is one subquery.
QB#aliasToSubq (the aliasToSubq attribute of the QB class) stores the query-block objects of subqueries; the keys of aliasToSubq are the subquery aliases.
QB#qbp, i.e. QBParseInfo, stores the abstract-syntax-tree structure of each operation part within one basic SQL unit. The HashMap QBParseInfo#nameToDest stores the outputs of the query unit; its keys have the form inclause-i (since Hive supports Multi Insert statements, there may be multiple outputs), and the values are the corresponding ASTNode nodes, i.e. TOK_DESTINATION nodes. The remaining HashMap attributes of the QBParseInfo class store the correspondence between each output and the ASTNode nodes of each operation.
QBParseInfo#joinExpr stores the TOK_JOIN node.
QB#qbJoinTree is the structured form of the Join syntax tree.
QB#qbm stores the meta-information of each input table, such as the table's path on HDFS and the storage format of the table data.
The QBExpr object represents a Union operation.
The process of generating query blocks from the abstract syntax tree is recursive: the abstract syntax tree is traversed in preorder, and nodes with different token names are saved into the corresponding attributes. It mainly comprises the following steps:
TOK_QUERY => create the query-block object and recurse into the child nodes;
TOK_FROM => save the table-name syntax part into the query-block object;
TOK_INSERT => recurse into the child nodes;
TOK_DESTINATION => save the syntax part of the output target into the nameToDest attribute of the QBParseInfo object;
TOK_SELECT => save the syntax parts of the query expressions into destToAggregationExprs and related attributes;
TOK_WHERE => save the syntax of the Where part into the destToWhereExpr attribute of the QBParseInfo object.
Through this process, the final sample SQL generates two query-block objects.
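The recursive preorder dispatch just listed can be sketched with a toy walker. The node layout (nested tuples) and the QB dictionary are invented for illustration; only the token names follow the text.

```python
# Sketch of the preorder AST walk: token names such as TOK_FROM and
# TOK_WHERE are dispatched into the attributes of a toy query-block object.

def gen_query_block(node, qb=None):
    qb = qb if qb is not None else {"tables": [], "dests": [], "where": None}
    name, children = node[0], node[1:]
    if name == "TOK_FROM":
        qb["tables"].append(children[0])      # table-name syntax part
    elif name == "TOK_DESTINATION":
        qb["dests"].append(children[0])       # output target -> nameToDest
    elif name == "TOK_WHERE":
        qb["where"] = children[0]             # -> destToWhereExpr
    for child in children:                    # recurse into child nodes
        if isinstance(child, tuple):
            gen_query_block(child, qb)
    return qb

ast = ("TOK_QUERY",
       ("TOK_FROM", "orders"),
       ("TOK_INSERT",
        ("TOK_DESTINATION", "inclause-0"),
        ("TOK_WHERE", "dealid > 0")))
qb = gen_query_block(ast)
```

A subquery under TOK_QUERY would simply trigger the same function recursively, producing a nested query-block object, mirroring QB#aliasToSubq.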
Further, the above step S23, converting the query blocks into the logical query plan and rewriting the logical query plan, comprises the following steps:
S231, traversing the query blocks and translating them into an operator tree;
S232, transforming the operator tree and merging operators;
S233, traversing the operator tree and translating it into MapReduce tasks, forming the logical query plan and rewriting the logical query plan.
In the above steps S231 to S233, the operator tree is composed of a Map phase and a Reduce phase. Several logical operators are arranged in the operator tree, each completing one single, specific operation in the Map phase or the Reduce phase. The basic operators include TableScanOperator, SelectOperator, FilterOperator, JoinOperator, GroupByOperator, and ReduceSinkOperator. TableScanOperator reads the table's data from the Map interface of the MapReduce framework, controls the number of rows scanned, and marks the data as coming from the source table; JoinOperator completes Join operations; FilterOperator completes filtering; ReduceSinkOperator serializes the sorted field combination at the Map end into the key and partition values — it can only appear in the Map phase, and it also marks the end of the Map phase in the MapReduce program generated by Hive.
The data transfer of the logical operators between the Map and Reduce phases is a streaming process: after each logical operator finishes its operation on a row of data, it passes the data on to its child logical operators for further computation.
The basic attributes and methods of a logical operator are as follows:
rowSchema describes the output fields of the operator;
inputObjInspector and outputObjInspector parse the input and output fields;
processOp receives a row from the parent logical operator, and forward passes the processed row on to the child logical operators;
Hive may renumber the fields of each row after a logical operator has processed it; colExprMap records, for each expression, the correspondence between its names before and after the current logical operator, which is used to trace field names back during the subsequent logical optimization phase.
Because a Hive MapReduce program is a generic, data-driven program (it is not known in advance whether a given MapReduce task will perform a Join or a GroupBy), the parameters required by every operation are kept in the logical operator itself; the logical operators are serialized to HDFS when the task is submitted, and are read back from HDFS and deserialized before the MapReduce task executes. The operator tree of the Map stage is stored on HDFS at Job.getConf("hive.exec.plan") + "/map.xml".
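The streaming operator protocol described above can be sketched in Python (a minimal illustration only; Hive's operators are Java classes, and the class and method names here merely mirror the attributes named in the text):

```python
class Operator:
    def __init__(self, row_schema, children=()):
        self.row_schema = row_schema      # output fields of this operator
        self.children = list(children)    # child logical operators

    def process_op(self, row):
        raise NotImplementedError

    def forward(self, row):
        # pass the processed row to every child operator (streaming)
        for child in self.children:
            child.process_op(row)

class TableScanOperator(Operator):
    def process_op(self, row):
        self.forward(row)                 # rows enter the tree unchanged

class FilterOperator(Operator):
    def __init__(self, row_schema, predicate, children=()):
        super().__init__(row_schema, children)
        self.predicate = predicate

    def process_op(self, row):
        if self.predicate(row):           # drop rows failing the predicate
            self.forward(row)

class CollectOperator(Operator):
    def __init__(self):
        super().__init__(row_schema=None)
        self.rows = []

    def process_op(self, row):
        self.rows.append(row)

sink = CollectOperator()
scan = TableScanOperator(["uid", "age"],
                         [FilterOperator(["uid", "age"],
                                         lambda r: r["age"] > 30, [sink])])
for r in [{"uid": 1, "age": 25}, {"uid": 2, "age": 40}]:
    scan.process_op(r)
print(sink.rows)  # rows surviving the filter
```

Each row flows through the tree one at a time, which is why no operator ever needs to hold the whole input in memory.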
In the above step S23, the attributes in which the generated query block and the QBParseInfo object store the grammar are traversed one by one, comprising the following mappings:
QB#aliasToSubq=>there is a subquery; recurse;
QB#aliasToTabs=>TableScanOperator;
QBParseInfo#joinExpr=>QBJoinTree=>ReduceSinkOperator+JoinOperator;
QBParseInfo#destToWhereExpr=>FilterOperator;
QBParseInfo#destToGroupby=>ReduceSinkOperator+GroupByOperator;
QBParseInfo#destToOrderby=>ReduceSinkOperator+ExtractOperator.
Since Join, GroupBy and OrderBy all have to be completed in the Reduce stage, a ReduceSinkOperator is always generated before the logical operator of the corresponding operation, combining and serializing the fields into the sort key or partition key.
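The attribute-to-operator mapping above can be written as a small dispatch table (an illustration only; the attribute names come from the text, the table itself is not Hive's API):

```python
# which logical operators each saved query-block attribute gives rise to
QB_ATTR_TO_OPERATORS = {
    "QB#aliasToSubq":              ["<recurse into the subquery>"],
    "QB#aliasToTabs":              ["TableScanOperator"],
    "QBParseInfo#joinExpr":        ["ReduceSinkOperator", "JoinOperator"],
    "QBParseInfo#destToWhereExpr": ["FilterOperator"],
    "QBParseInfo#destToGroupby":   ["ReduceSinkOperator", "GroupByOperator"],
    "QBParseInfo#destToOrderby":   ["ReduceSinkOperator", "ExtractOperator"],
}

def operators_for(attr):
    # every Reduce-side operation (Join, GroupBy, OrderBy) is preceded by a
    # ReduceSinkOperator, which serializes the fields into sort/partition keys
    return QB_ATTR_TO_OPERATORS.get(attr, [])
```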
Specifically, for the above step S231, the subquery block is processed first: QBJoinTree is traversed in preorder; the class QBJoinTree keeps the ASTNode of the left and right tables and the alias of this query. Preorder traversal of detail.usersequence_client generates the Join operator tree between the intermediate table and dim.user; a FilterOperator is generated from query block 2, at which point the traversal of query block 2 is complete. In some scenarios, the SelectOperator decides from certain conditions whether the fields need to be parsed. A ReduceSinkOperator+GroupByOperator is generated from the QBParseInfo#destToGroupby of query block 1, and after parsing is finished a FileSinkOperator is generated, which writes the data to HDFS.
In the above step S232, the operator tree is transformed by the logical-layer optimizer. By traversing the operator tree, the sort keys and partition keys output by two successive ReduceSinkOperators (RS) can be found. The optimizer checks that: 1. the sort keys output by the parent RS completely contain the sort keys of the child RS, in a consistent order; and 2. the partition keys of the parent RS completely contain the partition keys of the child RS. If these optimization conditions are met, the execution plan can be optimized: the child RS, and the logical operators between the parent RS and the child RS, are deleted; the retained RS uses the sort keys as its key and value fields and the partition keys as its key fields.
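The two merge conditions above can be sketched as a single predicate (a simplified illustration under the stated conditions; the parameter names are hypothetical, not Hive's API):

```python
def can_merge(parent_sort_keys, child_sort_keys,
              parent_part_keys, child_part_keys):
    # condition 1: the parent RS's sort keys completely contain the child's,
    # in a consistent order, i.e. the child's keys form a prefix
    prefix_ok = parent_sort_keys[:len(child_sort_keys)] == child_sort_keys
    # condition 2: the parent RS's partition keys completely contain the child's
    partition_ok = set(child_part_keys) <= set(parent_part_keys)
    return prefix_ok and partition_ok
```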
For the above step S233, the following steps are specifically included:
S2331, generating the output table, and moving the finally generated HDFS temporary files under the target table directory;
S2332, depth-first traversal downwards from one of the root nodes of the operator tree;
S2333, marking the Map/Reduce boundaries, i.e. the boundaries between tasks;
S2334, traversing the other root nodes, and merging MapReduce tasks when a Join logical operator is encountered;
S2335, generating a signal task to update the metadata;
S2336, cutting the links between the logical operators of the Map and Reduce stages.
For the above step S2332, specifically, all root nodes of the operator tree are stored in a toWalk array, and the elements of the array are taken out in a loop. The last element TS[p] is taken out and pushed onto the stack opStack {TS[p]}. When the elements in the stack satisfy a configured rule, such as "".join([t+"%" for t in opStack]) == "TS%", a MapReduceTask [Stage-1] object is generated. The traversal continues with the child logical operators of TS[p], which are pushed onto opStack. After the first RS is pushed, i.e. when the stack is opStack = {TS[p], FIL[18], RS[4]}, a rule such as "".join([t+"%" for t in opStack]) == "TS%.*RS%" is satisfied, and this is saved in the task resolution attribute. The traversal continues with the child logical operators of JOIN[5], which are pushed onto opStack. When the second RS is pushed, i.e. when the stack satisfies "".join([t+"%" for t in opStack]) == "RS%.*RS%" while looping over each suffix of opStack, a new JOIN[5] is created, JOIN[5] generates a child logical operator RS[6], and the TS[20] referenced by the MapReduceTask [Stage-2] object is generated. The traversal continues with the child operators of RS[6], which are pushed onto opStack. Finally, after all child logical operators have been pushed, when the rule "".join([t+"%" for t in opStack]) == "FS%" is satisfied, MapReduceTask [Stage-3] is linked up and a merge stage is generated; the opStack stack is emptied and the second element of toWalk is pushed onto the stack. When opStack = {TS[du], RS[7]}, rule R2 "TS%.*RS%" is satisfied, and MapReduceTask [Stage-2] is then found in the Map<logical operator, MapReduceWork> object of MapReduceTask [Stage-5].
For the above step S2333, the operator trees of the Map tasks and the Reduce tasks are cut apart at the RS boundaries, and the operator tree generates the overall picture of the MapReduce tasks; in this embodiment, three MapReduce tasks are generated in total.
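The rule tests quoted in step S2332 ("".join([t+"%" for t in opStack]) == "TS%" and so on) amount to matching a "%"-joined signature of the operator stack against a regular expression. A minimal sketch (simplified and hypothetical helper names; real Hive also matches rules against stack suffixes):

```python
import re

def stack_signature(op_stack):
    # operator names on the walk stack are joined with "%" to form a signature
    return "".join(t + "%" for t in op_stack)

def matches(rule, op_stack):
    # a rule such as "TS%.*RS%" fires when the signature matches it from the start
    return re.match(rule, stack_signature(op_stack)) is not None
```

For example, pushing TS, then FIL, then RS produces the signature "TS%FIL%RS%", which satisfies the rule "TS%.*RS%" and triggers the creation of a MapReduceTask object.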
Further, in the above step S24, the logical plan is converted into a physical plan by the physical-layer optimizer to form the execution task; in short, the small table is read into memory in the Map stage, and the big table is scanned sequentially to complete the Join.
The conversion of the logical plan by the physical-layer optimizer is divided into two stages. First, a local MapReduce task reads the small table into memory and generates a hash table, which is uploaded to the distributed cache; this process needs to compress the hash table. Then the MapReduce task reads the hash table from the distributed cache in the Map stage, scans the big table sequentially, joins the rows directly in memory in the Map stage, and passes the data on to the next MapReduce task.
If one of the two tables of the Join is a temporary table, a ConditionalTask is generated, and whether to use MapJoin is decided at runtime; in that case the optimizer needs to convert the common Join into a MapJoin. The conversion process is as follows:
traverse the task tree depth-first;
find the Join logical operators and judge the data volumes of the left and right tables;
for small table + big table=>MapJoinTask; for small/big table + intermediate table=>ConditionalTask. The MapReduce tasks generated in the previous stage are traversed; one table in JOIN[8] is found to be a temporary table, so Stage-2 is first deep-copied (because the original execution plan needs to be retained as a backup task, the execution plan is copied); a MapJoin logical operator is generated to replace the Join logical operator, and then a local MapReduce task is generated to read the small table, generate the hash table and upload it to the distributed cache.
During this process the optimizer also needs to traverse the task tree and split every local MapReduce task into two tasks.
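The decision between MapJoinTask, ConditionalTask and a common join can be sketched as follows (an illustration under the assumptions stated in the text; the function and constant names are hypothetical, not Hive's API):

```python
def plan_join(left_size, right_size, threshold, is_temp):
    """Pick the join strategy for one Join operator."""
    if is_temp:
        # an intermediate (temporary) table has no size known at compile
        # time, so the decision is deferred to runtime via a ConditionalTask
        return "ConditionalTask"
    small = min(left_size, right_size)
    if small <= threshold:
        return "MapJoinTask"      # small table fits memory: broadcast it
    return "CommonJoinTask"       # fall back to a shuffle (reduce-side) join
```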
In the above step S2, the syntax rules are defined with the Antlr open-source software, which greatly simplifies the compilation and parsing of the lexicon and grammar: only one grammar file needs to be maintained. The staged design keeps the code of the whole compilation process easy to maintain, so that subsequent optimizers can be switched in a pluggable manner; for example, the newest Hive 0.13 features, vectorization and support for the Tez engine, are both pluggable. Each logical operator completes only a single function, which simplifies the whole MapReduce program.
With the above indexing method based on the Plcient interactive engine, a HiveQL statement is obtained, the statement is compiled by Plcient to form an execution task, the task is executed, and the execution result is obtained. The syntax rules are defined with the Antlr open-source software, which greatly simplifies the compilation and parsing of the lexicon and grammar; the staged design keeps the code of the whole compilation process easy to maintain; each logical operator completes only a single function, simplifying the whole MapReduce program. This enhances the timeliness of big-data retrieval, makes the query mode more flexible and the execution more efficient.
As shown in Figure 4, this embodiment further provides an indexing system based on the Plcient interactive engine, which comprises a statement acquiring unit 1, a compilation unit 2, a submission unit 3, a handover unit 4, an execution unit 5 and an operation reading unit 6.
The statement acquiring unit 1 is used to obtain HiveQL statements.
The compilation unit 2 is used to carry out Plcient compilation on the HiveQL statements and obtain the execution task.
The submission unit 3 is used to submit the execution task to the control node.
The handover unit 4 is used by the control node to hand the execution task over to the executive process engine for execution and obtain the metadata information.
The execution unit 5 is used to submit the metadata information to the task tracker or resource manager for execution.
The operation reading unit 6 is used to read the files in HDFS, perform the corresponding operations, and obtain and return the execution result.
In certain embodiments, the above statement acquiring unit 1 comprises a task submission module, a task acquisition module and an information acquisition module.
The task submission module is used to submit query tasks to the control node.
The task acquisition module is used to obtain the query tasks.
The information acquisition module is used to obtain the corresponding Hive metadata information from the metadata repository according to the query task and form the HiveQL statement.
Specifically, the user submits a task such as a query to the control node; after the compiler obtains the user's query task, it fetches the required Hive metadata information from the metadata repository according to the user task.
Further, in certain embodiments, the above compilation unit 2 comprises a statement converter module, an abstract conversion module, a query block conversion module and a physical conversion module.
The statement converter module is used to convert the HiveQL statements into an abstract syntax tree.
The abstract conversion module is used to convert the abstract syntax tree into query blocks.
The query block conversion module is used to convert the query blocks into a logical query plan and rewrite the logical query plan.
The physical conversion module is used to convert the logical plan into a physical plan and form the execution task.
Specifically, the control node first receives an SQL character string, which is turned into an abstract syntax tree by the parser. This is done by Antlr: Antlr turns the SQL into an abstract syntax tree according to the grammar file, and the abstract syntax tree is in turn turned into query blocks. In the simplest case, one From clause generally generates one query block. Generating query blocks is a recursive procedure; the generated query blocks pass through the logical-query-plan process and become an execution graph, a directed acyclic graph of operators. This operator DAG passes through the logical optimizer, which adjusts the edges and nodes of the graph and revises their order, yielding an optimized directed acyclic graph. These optimizations may include predicate pushdown, partition pruning, join reordering and so on. After the logical optimization, this directed acyclic graph still has to be made executable, hence the process of generating the physical execution plan. The practice in Hive is that wherever a distribution (shuffle) is needed, a cut is made, producing one MapReduce job: for example at the GroupBy part, the Join part, the Distribute By part and the Distinct part. After so many cuts, the logical execution plan, that logical directed acyclic graph, has been diced into many subgraphs, each of which forms a node. These nodes are linked again into an execution plan graph, i.e. the task tree. The task tree is further optimized and adjusted, for example by selecting the execution path according to the input or adding backup jobs; this optimization is completed by the physical plan conversion. After the physical plan conversion, each node is a MapReduce job or a local job, and can be executed.
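The pipeline described in the paragraph above can be summarized as a function chain. A hedged sketch: every stage below is an illustrative stub that merely tags its input so the flow is visible; none of these names is Hive's real API.

```python
# hypothetical stage stubs: each one tags its input with the stage it produced
def parse(sql):                return ("AST", sql)        # Antlr: SQL -> AST
def gen_query_block(ast):      return ("QB", ast)         # AST -> query blocks
def gen_logical_plan(qb):      return ("OP_DAG", qb)      # blocks -> operator DAG
def optimize_logical(dag):     return ("OPT_DAG", dag)    # pushdown, pruning, ...
def gen_physical_plan(dag):    return ("TASK_TREE", dag)  # cut at shuffles
def optimize_physical(tasks):  return ("OPT_TASKS", tasks)  # map join, backups

def compile_hiveql(sql):
    # parser -> query block -> logical plan -> logical optimizer
    # -> physical plan -> physical optimizer, as described above
    result = parse(sql)
    for stage in (gen_query_block, gen_logical_plan, optimize_logical,
                  gen_physical_plan, optimize_physical):
        result = stage(result)
    return result
```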
The Join referred to above means that the data of the different tables are tagged in the map output value, and the data source is judged from the tag in the reduce stage. For GroupBy, the combination of the GroupBy fields is used as the map output key, the MapReduce sort is exploited, and the reduce stage keeps the list of keys and distinguishes the different keys. When there is only one Distinct field, and disregarding the hash GroupBy of the Map stage, it suffices to output the combination of the GroupBy field and the Distinct field as the map key, exploit the MapReduce sort, and use the GroupBy field as the reduce key; de-duplication can then be completed by keeping the list of keys in the reduce stage. If there are multiple Distinct fields, as in the following SQL: select dealid, count(distinct uid), count(distinct date) from order group by dealid; then de-duplication is carried out in one of the following two ways:
First, still following the method for a single Distinct field above: this implementation cannot sort by uid and date separately, and thus cannot de-duplicate through the list of keys; in the reduce stage it still needs to de-duplicate through hash values in memory.
Second, the Distinct fields can be numbered, and each row of data generates n rows of data; rows with the same field are then sorted together, and only the list of keys needs to be recorded in the reduce stage to de-duplicate. This implementation makes good use of the MapReduce sort and saves the memory consumed by de-duplication in the reduce stage.
It should be noted that when the reduce values are generated, only the row of the first Distinct field needs to retain the key value; the value fields of the remaining Distinct data rows can be empty.
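The second strategy (numbering the Distinct fields and expanding each input row into n rows) can be sketched as follows. The field names reuse the sample SQL above; the in-memory sort stands in for the MapReduce shuffle, so this is an illustration of the idea rather than a distributed implementation:

```python
from itertools import groupby

rows = [
    {"dealid": 1, "uid": "u1", "date": "d1"},
    {"dealid": 1, "uid": "u1", "date": "d2"},
    {"dealid": 1, "uid": "u2", "date": "d2"},
]
distinct_fields = ["uid", "date"]

# map phase: each input row emits one record per Distinct field,
# keyed by (group-by value, field number, field value) for the sort
expanded = sorted(
    (row["dealid"], i, row[f])
    for row in rows
    for i, f in enumerate(distinct_fields)
)

# reduce phase: records arrive sorted, so consecutive duplicates collapse
# into one group and no in-memory hash table is needed for de-duplication
counts = {}
for key, _ in groupby(expanded):
    dealid, i, _value = key
    field = distinct_fields[i]
    counts[(dealid, field)] = counts.get((dealid, field), 0) + 1

print(counts)  # distinct counts per (dealid, field)
```

Because the sort already brings equal (dealid, field, value) records together, the reduce side only has to compare each record with the previous one, which is exactly the memory saving claimed in the text.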
Further, in certain embodiments, the above statement converter module comprises a definition submodule and a parsing submodule.
The definition submodule is used to define the syntax rules of the HiveQL statements using Antlr.
The parsing submodule is used to carry out lexical and syntax parsing on the HiveQL statements according to the syntax rules and form the abstract syntax tree.
In addition, in certain embodiments, the above abstract conversion module further comprises a first traversal module and a node acquisition module.
The first traversal module is used to traverse the abstract syntax tree in order.
The node acquisition module is used to obtain the different token nodes, save them into the corresponding attributes, and form the outer query block and the subquery blocks.
A query block is the most basic component unit of an SQL statement and comprises three parts: the input source, the computation process and the output. Simply put, a query block is a subquery.
QB#aliasToSubq (denoting the aliasToSubq attribute of the query block class) keeps the query block objects of the subqueries; the key of aliasToSubq is the alias of the subquery.
QB#qbp, i.e. QBParseInfo, keeps the abstract-syntax-tree structure of one operation part in a basic SQL unit. The HashMap QBParseInfo#nameToDest keeps the output of the query unit; the form of the key is insclause-i (because Hive supports Multi Insert statements, there may be multiple outputs), and the value is the corresponding ASTNode node, i.e. the TOK_DESTINATION node. The remaining HashMap attributes of the class QBParseInfo respectively keep the correspondence between the output and the ASTNode nodes of each operation.
QBParseInfo#joinExpr is used to keep the TOK_JOIN node.
QB#qbJoinTree is the structured form of the Join syntax tree.
QB#qbm keeps the meta-information of each input table, such as the path of the table on HDFS and the storage format of the table data.
The QBExpr object represents the Union operation.
The process of generating query blocks from the abstract syntax tree is a recursive process: the abstract syntax tree is traversed in preorder, and when the different token nodes are encountered they are saved into the corresponding attributes. It mainly comprises the following process:
TOK_QUERY=>create a query block object and recurse over the child nodes in a loop;
TOK_FROM=>save the table-name syntax subtree into the query block object;
TOK_INSERT=>recurse over the child nodes in a loop;
TOK_DESTINATION=>the syntax subtree of the output target is stored in the nameToDest attribute of the QBParseInfo object;
TOK_SELECT=>the syntax subtrees of the query expressions are stored in destToAggregationExprs and related attributes;
TOK_WHERE=>the syntax subtree of the Where clause is stored in the destToWhereExpr attribute of the QBParseInfo object.
Through the above process, the sample SQL finally generates two query block objects.
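The preorder traversal above can be sketched as a recursive dispatch on the token type (a minimal illustration with a dictionary standing in for the QB/QBParseInfo objects; node layout and attribute names are simplified assumptions):

```python
def process_node(node, qb):
    """Preorder walk: handle this node's token type, then recurse."""
    t = node["type"]
    if t == "TOK_FROM":
        qb["from"] = node["text"]                      # table-name subtree
    elif t == "TOK_DESTINATION":
        qb.setdefault("nameToDest", {})["insclause-0"] = node["text"]
    elif t == "TOK_SELECT":
        qb.setdefault("destToSelExpr", []).append(node["text"])
    elif t == "TOK_WHERE":
        qb["destToWhereExpr"] = node["text"]
    # TOK_QUERY and TOK_INSERT have no attribute of their own: just recurse
    for child in node.get("children", []):
        process_node(child, qb)

ast = {"type": "TOK_QUERY", "children": [
    {"type": "TOK_FROM", "text": "src"},
    {"type": "TOK_INSERT", "children": [
        {"type": "TOK_SELECT", "text": "col"},
        {"type": "TOK_WHERE", "text": "col > 0"},
    ]},
]}
qb = {}
process_node(ast, qb)
```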
Further, in certain embodiments, the above query block conversion module comprises a second traversal module, a transformation module and a third traversal module.
The second traversal module is used to traverse the query blocks and translate them into an operator tree.
The transformation module is used to transform the operator tree and merge operators.
The third traversal module is used to traverse the operator tree, translate it into MapReduce tasks, form the logical query plan and rewrite the logical query plan.
The operator tree is composed of a Map stage and a Reduce stage, and several logical operators are arranged in the operator tree; each operator completes a single specific operation in the Map stage or the Reduce stage. The basic operators include TableScanOperator, SelectOperator, FilterOperator, JoinOperator, GroupByOperator and ReduceSinkOperator. TableScanOperator reads the table data from the Map interface of the MapReduce framework, controls the number of rows scanned, and marks the data as coming from the source table; JoinOperator performs the Join operation; FilterOperator performs filtering; ReduceSinkOperator serializes the combined fields on the Map side into the sort key and partition key, can only appear in the Map stage, and also marks the end of the Map stage in the MapReduce program generated by Hive.
The transfer of data between logical operators in the Map and Reduce stages is a streaming process: after processing a row of data, each logical operator passes the row on to its child logical operators for further computation.
The basic attributes and methods of a logical operator are as follows:
rowSchema describes the output fields of the operator;
inputObjInspector and outputObjInspector parse the input and output fields;
processOp receives a row from the parent logical operator, and forward passes the processed row on to the child logical operators;
Hive may renumber the fields of each row after a logical operator has processed it; colExprMap records, for each expression, the correspondence between its names before and after the current logical operator, which is used to trace field names back during the subsequent logical optimization phase.
Because a Hive MapReduce program is a generic, data-driven program (it is not known in advance whether a given MapReduce task will perform a Join or a GroupBy), the parameters required by every operation are kept in the logical operator itself; the logical operators are serialized to HDFS when the task is submitted, and are read back from HDFS and deserialized before the MapReduce task executes. The operator tree of the Map stage is stored on HDFS at Job.getConf("hive.exec.plan") + "/map.xml".
The above physical conversion module reads the small table into memory in the Map stage and scans the big table sequentially to complete the Join.
The conversion of the logical plan by the physical-layer optimizer is divided into two stages. First, a local MapReduce task reads the small table into memory and generates a hash table, which is uploaded to the distributed cache; this process needs to compress the hash table. Then the MapReduce task reads the hash table from the distributed cache in the Map stage, scans the big table sequentially, joins the rows directly in memory in the Map stage, and passes the data on to the next MapReduce task.
If one of the two tables of the Join is a temporary table, a ConditionalTask is generated, and whether to use MapJoin is decided at runtime; in that case the optimizer needs to convert the common Join into a MapJoin. The conversion process is as follows:
traverse the task tree depth-first;
find the Join logical operators and judge the data volumes of the left and right tables;
for small table + big table=>MapJoinTask; for small/big table + intermediate table=>ConditionalTask. The MapReduce tasks generated in the previous stage are traversed; one table in JOIN[8] is found to be a temporary table, so Stage-2 is first deep-copied (because the original execution plan needs to be retained as a backup task, the execution plan is copied); a MapJoin logical operator is generated to replace the Join logical operator, and then a local MapReduce task is generated to read the small table, generate the hash table and upload it to the distributed cache.
During this process the optimizer also needs to traverse the task tree and split every local MapReduce task into two tasks.
In summary, for the compilation unit 2, the syntax rules are defined with the Antlr open-source software, which greatly simplifies the compilation and parsing of the lexicon and grammar: only one grammar file needs to be maintained. The staged design keeps the code of the whole compilation process easy to maintain, so that subsequent optimizers can be switched in a pluggable manner; for example, the newest Hive 0.13 features, vectorization and support for the Tez engine, are both pluggable. Each logical operator completes only a single function, which simplifies the whole MapReduce program.
With the above indexing system based on the Plcient interactive engine, a HiveQL statement is obtained, the statement is compiled by Plcient to form an execution task, the task is executed, and the execution result is obtained. The syntax rules are defined with the Antlr open-source software, which greatly simplifies the compilation and parsing of the lexicon and grammar; the staged design keeps the code of the whole compilation process easy to maintain; each logical operator completes only a single function, simplifying the whole MapReduce program. This enhances the timeliness of big-data retrieval, makes the query mode more flexible and the execution more efficient.
The above merely further illustrates the technical content of the present invention with embodiments, so that readers can understand it more easily, but it does not mean that the embodiments of the present invention are limited thereto; any technical extension or re-creation made according to the present invention falls under the protection of the present invention. The protection scope of the present invention is defined by the claims.
Claims (10)
1. An indexing method based on a Plcient interactive engine, characterized in that the method comprises:
obtaining a HiveQL statement;
carrying out Plcient compilation on the HiveQL statement to obtain an execution task;
submitting the execution task to a control node;
handing, by the control node, the execution task over to an executive process engine for execution to obtain metadata information;
submitting the metadata information to a task tracker or resource manager for execution;
reading files in HDFS to perform the corresponding operations, and obtaining and returning the execution result.
2. The indexing method based on a Plcient interactive engine according to claim 1, characterized in that the step of obtaining a HiveQL statement comprises the following specific steps:
submitting a query task to the control node;
obtaining the query task;
obtaining the corresponding Hive metadata information from the metadata repository according to the query task, and forming the HiveQL statement.
3. The indexing method based on a Plcient interactive engine according to claim 1 or 2, characterized in that the step of carrying out Plcient compilation on the HiveQL statement to obtain an execution task comprises the following specific steps:
converting the HiveQL statement into an abstract syntax tree;
converting the abstract syntax tree into query blocks;
converting the query blocks into a logical query plan and rewriting the logical query plan;
converting the logical plan into a physical plan to form the execution task.
4. The indexing method based on a Plcient interactive engine according to claim 3, characterized in that the step of converting the HiveQL statement into an abstract syntax tree comprises the following specific steps:
defining the syntax rules of the HiveQL statement using Antlr;
carrying out lexical and syntax parsing on the HiveQL statement according to the syntax rules to form the abstract syntax tree.
5. The indexing method based on a Plcient interactive engine according to claim 4, characterized in that the step of converting the abstract syntax tree into query blocks comprises the following specific steps:
traversing the abstract syntax tree in order;
obtaining the different token nodes, saving them into the corresponding attributes, and forming the outer query block and the subquery blocks.
6. The indexing method based on a Plcient interactive engine according to claim 5, characterized in that the step of converting the query blocks into a logical query plan and rewriting the logical query plan comprises the following specific steps:
traversing the query blocks and translating them into an operator tree;
transforming the operator tree and merging operators;
traversing the operator tree, translating it into MapReduce tasks, forming the logical query plan and rewriting the logical query plan.
7. An indexing system based on a Plcient interactive engine, characterized by comprising a statement acquiring unit, a compilation unit, a submission unit, a handover unit, an execution unit and an operation reading unit;
the statement acquiring unit is used for obtaining HiveQL statements;
the compilation unit is used for carrying out Plcient compilation on the HiveQL statements to obtain an execution task;
the submission unit is used for submitting the execution task to a control node;
the handover unit is used by the control node to hand the execution task over to an executive process engine for execution and obtain metadata information;
the execution unit is used for submitting the metadata information to a task tracker or resource manager for execution;
the operation reading unit is used for reading files in HDFS to perform the corresponding operations, and obtaining and returning the execution result.
8. The indexing system based on a Plcient interactive engine according to claim 7, characterized in that the statement acquiring unit comprises a task submission module, a task acquisition module and an information acquisition module;
the task submission module is used for submitting query tasks to the control node;
the task acquisition module is used for obtaining the query tasks;
the information acquisition module is used for obtaining the corresponding Hive metadata information from the metadata repository according to the query task and forming the HiveQL statement.
9. The indexing system based on a Plcient interactive engine according to claim 8, characterized in that the compilation unit comprises a statement converter module, an abstract conversion module, a query block conversion module and a physical conversion module;
the statement converter module is used for converting the HiveQL statements into an abstract syntax tree;
the abstract conversion module is used for converting the abstract syntax tree into query blocks;
the query block conversion module is used for converting the query blocks into a logical query plan and rewriting the logical query plan;
the physical conversion module is used for converting the logical plan into a physical plan to form the execution task.
10. The indexing system based on a Plcient interactive engine according to claim 9, characterized in that the statement converter module comprises a definition submodule and a parsing submodule;
the definition submodule is used for defining the syntax rules of the HiveQL statements using Antlr;
the parsing submodule is used for carrying out lexical and syntax parsing on the HiveQL statements according to the syntax rules to form the abstract syntax tree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711203695.3A CN107818181A (en) | 2017-11-27 | 2017-11-27 | Indexing means and its system based on Plcient interactive mode engines |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711203695.3A CN107818181A (en) | 2017-11-27 | 2017-11-27 | Indexing means and its system based on Plcient interactive mode engines |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107818181A true CN107818181A (en) | 2018-03-20 |
Family
ID=61610223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711203695.3A Pending CN107818181A (en) | 2017-11-27 | 2017-11-27 | Indexing means and its system based on Plcient interactive mode engines |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107818181A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851452A (en) * | 2020-01-16 | 2020-02-28 | 医渡云(北京)技术有限公司 | Data table connection processing method and device, electronic equipment and storage medium |
CN112181704A (en) * | 2020-09-28 | 2021-01-05 | 京东数字科技控股股份有限公司 | Big data task processing method and device, electronic equipment and storage medium |
CN113438275A (en) * | 2021-05-27 | 2021-09-24 | 众安在线财产保险股份有限公司 | Data migration method and device, storage medium and data migration equipment |
CN113887251A (en) * | 2021-09-29 | 2022-01-04 | 内蒙古工业大学 | Mongolian Chinese machine translation method combining Meta-KD framework and fine-grained compression |
CN117648341A (en) * | 2023-11-21 | 2024-03-05 | 上海金仕达卫宁软件科技有限公司 | Method and system for quickly assembling data based on disk memory in limited resources |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761080A (en) * | 2013-12-25 | 2014-04-30 | 中国农业大学 | Structured query language (SQL) based MapReduce operation generating method and system |
US20140280030A1 (en) * | 2013-03-12 | 2014-09-18 | Microsoft Corporation | Method of converting query plans to native code |
CN104298771A (en) * | 2014-10-30 | 2015-01-21 | 南京信息工程大学 | Massive web log data query and analysis method |
US20150379426A1 (en) * | 2014-06-30 | 2015-12-31 | Amazon Technologies, Inc. | Optimized decision tree based models |
CN105279286A (en) * | 2015-11-27 | 2016-01-27 | 陕西艾特信息化工程咨询有限责任公司 | Interactive large data analysis query processing method |
CN107122443A (en) * | 2017-04-24 | 2017-09-01 | 中国科学院软件研究所 | A kind of distributed full-text search system and method based on Spark SQL |
2017
- 2017-11-27 CN CN201711203695.3A patent/CN107818181A/en active Pending
Non-Patent Citations (4)
Title |
---|
DIZAOXN729021: "The compilation process of Hive SQL", 《HTTPS://BLOG.CSDN.NET/DIZAOXN729021/ARTICLE/DETAILS/102452617》 * |
WEIXIN_37242857: "Yanyun YDB fundamentals", 《HTTPS://BLOG.CSDN.NET/WEIXIN_37242857/ARTICLE/DETAILS/57123190》 * |
扫大街的程序员: "In-depth analysis of HiveSQL execution plans", 《HTTPS://BLOG.CSDN.NET/MOON_YANG_BJ/ARTICLE/DETAILS/31744381》 * |
杨卓荦: "[Big Data Micro-course Review] Yang Zhuoluo: Hive principles and query optimization", 《HTTP://WWW.360DOC.COM/CONTENT/16/0803/14/29157075_580488232.SHTML》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10521427B2 (en) | Managing data queries | |
CN107122443B (en) | Distributed full-text retrieval system and method based on Spark SQL | |
US8332389B2 (en) | Join order for a database query | |
US9195712B2 (en) | Method of converting query plans to native code | |
US9152697B2 (en) | Real-time search of vertically partitioned, inverted indexes | |
US10762087B2 (en) | Database search | |
US20130006968A1 (en) | Data integration system | |
EP3671526B1 (en) | Dependency graph based natural language processing | |
CN110019314B (en) | Dynamic data packaging method based on data item analysis, client and server | |
CN108009270A (en) | Text search method based on distributed in-memory computing | |
CN114265945A (en) | Blood relationship extraction method and device and electronic equipment | |
CN114461603A (en) | Multi-source heterogeneous data fusion method and device | |
CN107818181A (en) | Indexing means and its system based on Plcient interactive mode engines | |
KR101955376B1 (en) | Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method | |
CN117421302A (en) | Data processing method and related equipment | |
CN110008448B (en) | Method and device for automatically converting SQL code into Java code | |
US20090132473A1 (en) | Apparatus, method, and computer program product for processing databases | |
CN115857918A (en) | Data processing method and device, electronic equipment and storage medium | |
Solodovnikova et al. | Handling evolution in big data architectures | |
Fan et al. | TwigStack-MR: An approach to distributed XML twig query using MapReduce | |
KR102599008B1 (en) | Method for processing multi-queries based on multi-query scheduler and data processing system providing the method | |
Tian | Accelerating data preparation for big data analytics | |
KR102605929B1 (en) | Method for processing structured data and unstructured data by allocating different processor resource and data processing system providing the method | |
JP2000163446A (en) | Extendable inquiry processor | |
Schäfer | On Enabling Efficient and Scalable Processing of Semi-Structured Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180320 |