CN109063201A

CN109063201A - An Impala online interactive query method based on a hybrid storage scheme

Info

Publication number: CN109063201A
Application number: CN201811058357.XA
Authority: CN
Inventors: 李开; 邹复好; 訚实松; 刘鹏坤; 孙斌
Original assignee: Wuhan Charm Pupil Technology Co Ltd
Current assignee: Wuhan Charm Pupil Technology Co Ltd
Priority date: 2018-09-11
Filing date: 2018-09-11
Publication date: 2018-12-21
Anticipated expiration: 2038-09-11
Also published as: CN109063201B

Abstract

An embodiment of the invention provides an Impala online interactive query method based on a hybrid storage scheme, comprising: creating an HBase table with a hadoop command, and creating a table on HDFS with Impala; creating an external table in Hive for association, and checking whether the external table is present in Impala; if the external table is present in Impala, creating a script that imports the current day's data into HDFS; and, when a user issues a query request, querying HDFS and HBase separately and presenting the combined query results to the user. The method makes full use of the characteristics of HBase and HDFS to store incremental data in a hybrid manner, which improves the speed of Impala interactive queries.

Description

An Impala online interactive query method based on a hybrid storage scheme

Technical field

Embodiments of the present invention relate to the field of big data processing, and in particular to an Impala online interactive query method based on a hybrid storage scheme.

Background

In recent years, as computer storage capacity has grown and information technology has developed, data volumes have increased exponentially. The big data trend has accelerated scientific and technological progress, big data technology has risen, and business models have changed disruptively.

Big data refers not only to massive volumes of data but also to the technology for storing and processing them. Big data now pervades every part of the economy and society, and how to extract valuable information from massive data is a problem that urgently needs solving. Big data processing differs from traditional processing in that it relies mainly on the parallel computing power of many machines. After years of development, a variety of big data processing platforms have emerged, such as hadoop, spark and Storm, each generally targeting one class of big data problem. Big data processing problems are usually divided into three classes: real-time stream processing, offline batch processing, and large-scale interactive queries. Impala is a member of the hadoop ecosystem aimed mainly at the third class: it can run interactive queries, using SQL-like statements, over data stored in the hadoop database HBase and in the distributed file system HDFS.

The problem we encounter in practice is that a large number of messages are stored in the database every day, and users need to run interactive queries over all of the data, with responses that are as fast as possible and within the time a user will tolerate. The traditional approach is to store the data in HBase or on HDFS and to query HBase or HDFS with Impala. With HBase-based storage, once the data volume grows to the order of millions of records, Impala's query time on HBase rises sharply to tens of seconds, which does not meet the requirement. With HDFS-based storage, queries are much faster than on HBase, and a time-based partitioned storage strategy can be used: each message is written to its own file, so that for a selected query period the query scale depends only on that period, which shortens query time noticeably. However, when every message is a separate file, the namenode keeps metadata for each file in memory, and a large number of files consumes a large amount of namenode memory, causing serious problems for the scalability and performance of hadoop.

How to combine the characteristics of HDFS and HBase to solve these problems is therefore a current research focus.

Summary of the invention

To address the above problems, embodiments of the present invention provide an Impala online interactive query method based on a hybrid storage scheme that overcomes, or at least partially solves, those problems, comprising:

creating an HBase table with a hadoop command, and creating a table on HDFS with Impala;

creating an external table in Hive for association, and checking whether the external table is present in Impala;

if the external table is present in Impala, creating a script that imports the current day's data into HDFS;

when a user issues a query request, querying HDFS and HBase separately, and presenting the combined query results to the user.

Wherein creating the table on HDFS with Impala comprises:

creating the query fields corresponding to the query conditions, and partitioning the data by day.

Wherein querying HDFS and HBase separately when a user issues a query request, and presenting the combined query results to the user, comprises:

detecting whether the query conditions include a time condition, and, if they do and all the data the query needs has been copied to the HDFS table, querying only HDFS;

if the query conditions include a time condition and none of the data the query needs has been copied to HDFS, querying only HBase;

if the query conditions include no time condition, or the data covered by the time condition is stored partly in HDFS and partly in HBase, running a joint query over HBase and HDFS.

Wherein, after querying HDFS and HBase separately when a user issues a query request and presenting the combined query results to the user, the method further comprises:

counting the number of records that match the user's query request, obtaining the number of matching records from HDFS alone, from HBase alone, and from HDFS and HBase jointly.

Wherein, after counting the number of records that match the user's query request and obtaining the number of matching records from HDFS alone, from HBase alone, and from HDFS and HBase jointly, the method further comprises:

setting the step size and start offset of the query in Impala, to provide page-turning and page-switching operations.

Wherein the method further comprises:

if the external table is not present in Impala, refreshing the metadata with INVALIDATE METADATA.

Wherein the method further comprises:

scheduling the script as a timed task.

The Impala online interactive query method based on a hybrid storage scheme provided by embodiments of the present invention makes full use of the characteristics of HBase and HDFS to store incremental data in a hybrid manner, which improves the speed of Impala interactive queries.

Brief description of the drawings

To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of an Impala online interactive query method based on a hybrid storage scheme provided by an embodiment of the present invention;

Fig. 2 is a flow diagram of the implementation steps of the hybrid storage scheme proposed in an embodiment of the present invention;

Fig. 3 is a flow diagram of the hybrid query proposed in an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions are described clearly below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments that those of ordinary skill in the art obtain from these embodiments without creative effort fall within the protection scope of the invention.

At present, with HBase-based storage, once the data volume grows to the order of millions of records, Impala's query time on HBase rises sharply to tens of seconds, which does not meet the requirement. With HDFS-based storage, queries are much faster than on HBase, and a time-based partitioned storage strategy can be used: each message is written to its own file, so that for a selected query period the query scale depends only on that period, which shortens query time noticeably. However, when every message is a separate file, the namenode keeps metadata for each file in memory, and a large number of files consumes a large amount of namenode memory, causing serious problems for the scalability and performance of hadoop.

To address these problems in the prior art, Fig. 1 is a flow diagram of an Impala online interactive query method based on a hybrid storage scheme provided by an embodiment of the present invention, Fig. 2 is a flow diagram of the implementation steps of the proposed hybrid storage scheme, and Fig. 3 is a flow diagram of the proposed hybrid query. With reference to Fig. 1, Fig. 2 and Fig. 3, the Impala online interactive query method based on a hybrid storage scheme provided by an embodiment of the present invention comprises:

Note that the overall idea of the embodiment is to use a hybrid HBase and HDFS storage scheme to improve the response speed of Impala online interactive queries. Data storage combines HDFS and HBase: HBase only holds the current day's data temporarily, and a scheduled script automatically imports the previous day's data into HDFS as one large file on the following day. To query all the data, HBase and HDFS are queried separately and the results are then merged.

First, a table is created in HBase according to the predesigned schema, with all columns placed in the same column family. A TTL of 90000 seconds (25 hours) is set on the HBase table, meaning that data older than 25 hours is deleted. Because that data has already been imported into HDFS by then, deleting it from HBase is safe; the extra hour is kept so that the day's data can be exported completely. After the HBase table is created, it must be associated with Impala. Once the HBase table is mapped in Impala, a table is also created on HDFS: an Impala table with the same fields, partitioned by date with a partitioned by clause and stored in Parquet format with a stored as parquet clause. Once Impala tables exist for both HBase and HDFS, a scheduled job must be set up to copy the data in HBase to HDFS. On Linux, system-level scheduled tasks are configured in the /etc/crontab file.

For an interactive query, Impala queries HDFS first and then queries the current day's data in HBase; concatenating the two result sets gives the query result. Although Impala queries HBase relatively slowly, HBase holds only the current day's newest data, so the data volume is much smaller and the overall speed is high. Because each day of historical data is stored on HDFS as one Parquet file, rather than one file per record, the HDFS small-files problem is avoided and the overall performance of hadoop improves.

For ease of understanding, the key terms used in the embodiments are explained below.

HBase: a column-oriented distributed database built on the Hadoop file system.

HDFS: the Hadoop Distributed File System.

Impala: a query system developed mainly by Cloudera. It provides SQL semantics and can query petabyte-scale data stored in Hadoop's HDFS and HBase.

Parquet: a columnar file storage format.

Hive: a data warehouse tool based on Hadoop that can map structured data files to database tables and provides simple SQL query functionality.

The method provided by the embodiment is run on an Ubuntu 14.04 system.

Specifically, the scheme first requires an HBase table. The command, which creates a table named hadoop, is: create 'hadoop', {NAME => 'info', TTL => 90000}.

A table is then created on HDFS with Impala. In our application scenario the fields include name, fromip, fromport, toip, toport, time, ciphertext and plaintext; their meanings are shown in Table 1:

Table 1: Meaning of each field in the HBase table

Column name	Meaning
name	Type
fromip	Source IP
fromport	Source port
toip	Destination IP
toport	Destination port
time	Time
ciphertext	Ciphertext
plaintext	Plaintext

An external table is then created in Hive for association. The table is named info_today, indicating that it holds today's data. The command is as follows:

Next, check in Impala whether the table info_today exists. If it does, create a script that imports each day's data into HDFS in the early morning. The script file is named copy.sh and is made executable with sudo chmod +x copy.sh. Its contents are as follows:

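#!/bin/sh
# Copy the previous day's rows from the HBase-backed table info_today into the
# partitioned Parquet table on HDFS; refresh statistics only if the copy succeeds.
impala-shell -q "insert into table sho.hadoop_hdfs partition(year, month, day)
select name, fromip, fromport, toip, toport, length, time,
ciphertext, plaintext, id, year(time), month(time), day(time)
from sho.info_today
where time < to_date(now()) and time > to_date(adddate(now(), -1))" &&
impala-shell -q "compute stats sho.hadoop_hdfs"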

This implements the hybrid storage scheme. The newest data is written to HBase, but in the early morning of the following day it is imported into HDFS. In other words, HDFS stores all data from before today, with each day's data as one file, and HBase stores today's newest data.
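The daily copy can be illustrated with a minimal Python sketch that computes the time window for the previous calendar day and the (year, month, day) partition values a row would receive. The function names `previous_day_window` and `partition_values` are hypothetical and do not appear in the source. The script's filter uses strict comparisons at both ends, so a row stamped exactly at midnight appears to fall outside every daily window; the sketch only computes the window boundaries and does not change that behavior.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple


def previous_day_window(today: date) -> Tuple[datetime, datetime]:
    """Return (start, end): midnight yesterday and midnight today.

    Hypothetical helper mirroring the bounds used in copy.sh:
    to_date(adddate(now(), -1)) and to_date(now()).
    """
    end = datetime.combine(today, time.min)
    start = end - timedelta(days=1)
    return start, end


def partition_values(ts: datetime) -> Tuple[int, int, int]:
    """Return the (year, month, day) partition a row with timestamp ts lands in."""
    return ts.year, ts.month, ts.day


start, end = previous_day_window(date(2018, 9, 11))
print(start, end)
print(partition_values(start))
```

Because every row copied by one run falls between the two bounds, all of them share a single (year, month, day) partition, which is what keeps each day's history together on HDFS.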

Hybrid storage is used so that hybrid queries can be run and query speed improved. In our application scenario, users need to run combined queries over ranges of fields such as fromip, toip, time and name. When a user runs a query, the backend therefore queries HDFS and HBase separately and then combines the results for display.

On the basis of the above embodiments, creating the table on HDFS with Impala comprises:

As Table 1 shows, the embodiment places the query fields corresponding to the query conditions in a new column family, called cf. Some of these fields are made optional conditions, so that arbitrary combined queries can be run later.

Note that the embodiment partitions the data by day. Concretely, a partitioned by clause is appended to the create table statement, and the storage format is set to the compressed Parquet format by appending stored as parquet to the statement. The command is as follows:

create table sho.hadoop_hdfs(
  name string,
  fromip string,
  fromport int,
  toip string,
  toport int,
  length int,
  time timestamp,
  ciphertext string,
  plaintext string,
  id string
)
partitioned by (year int, month int, day int) stored as parquet;

As a result, the data occupies less space when Impala queries it, and network transfer is faster.
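To show how the day-level partitions can narrow a query, here is a small Python sketch that builds a filter on the year, month and day partition columns for a single date. The helper name `day_partition_predicate` is hypothetical, and the assumption that the query layer filters on the partition columns (rather than only on the time column) is ours; the source does not show its query statements.

```python
from datetime import date


def day_partition_predicate(d: date) -> str:
    """Build a WHERE fragment that restricts a query to one day's partition.

    Hypothetical helper; the column names year, month and day come from the
    partitioned by clause of the create table statement.
    """
    return f"year = {d.year} and month = {d.month} and day = {d.day}"


sql = "select count(*) from sho.hadoop_hdfs where " + day_partition_predicate(date(2018, 9, 10))
print(sql)
```

A filter of this shape lets the engine read only the files belonging to the selected day, so the query scale depends on the selected period rather than on the whole history.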

On the basis of the above embodiments, querying HDFS and HBase separately when a user issues a query request, and presenting the combined query results to the user, comprises:

The embodiment checks the time first: it detects whether the query conditions include a time, and then sets each of the other unspecified conditions in turn before querying. A structure query_info is constructed to store each query condition obtained from the page. When the query conditions include a time, there are three cases. If the query range lies before today, all the data the query needs has already been copied to the HDFS table, so only HDFS is queried. If only today's data is queried, all the data the query needs is stored in HBase and none of it is on HDFS, so only HBase is queried. Otherwise, HBase and HDFS must each be queried.
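The routing decision can be sketched in Python as follows. The `QueryInfo` dataclass is a hypothetical stand-in for the query_info structure named in the text, reduced to its time range, and `choose_backends` is an illustrative name. The sketch assumes that every day before today has already been copied to HDFS, which holds only once the scheduled copy has finished.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class QueryInfo:
    """Hypothetical, reduced form of query_info: only the time range is kept."""
    start_day: Optional[date] = None  # None means the query has no time condition
    end_day: Optional[date] = None


def choose_backends(q: QueryInfo, today: date) -> List[str]:
    """Decide which stores a query must touch, following the three cases in the text."""
    if q.start_day is None or q.end_day is None:
        return ["hdfs", "hbase"]   # no time condition: query both stores
    if q.end_day < today:
        return ["hdfs"]            # range ends before today: history only
    if q.start_day >= today:
        return ["hbase"]           # range starts today: current-day data only
    return ["hdfs", "hbase"]       # range spans history and today


today = date(2018, 9, 11)
print(choose_backends(QueryInfo(date(2018, 9, 1), date(2018, 9, 10)), today))
print(choose_backends(QueryInfo(date(2018, 9, 11), date(2018, 9, 11)), today))
print(choose_backends(QueryInfo(), today))
```

Checking the time condition first means the common cases, history only or today only, touch a single store and avoid the slower HBase scan whenever possible.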

On the basis of the above embodiments, after querying HDFS and HBase separately when a user issues a query request and presenting the combined query results to the user, the method further comprises:

The embodiment also provides a statistics function during querying: it counts how many records match the query conditions, obtaining the number of matching records from HDFS alone, from HBase alone, and from HDFS and HBase jointly.

The queries described above are run in parallel to improve query speed.
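A minimal Python sketch of running the two queries in parallel and combining rows and counts is shown below. The functions `fake_hdfs` and `fake_hbase` are in-memory stubs invented for illustration; a real system would issue Impala queries instead, and the source does not say which concurrency mechanism it uses, so the thread pool here is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, str]
QueryFn = Callable[[str], List[Row]]


def query_both(condition: str, hdfs_fn: QueryFn, hbase_fn: QueryFn) -> dict:
    """Run both backend queries concurrently, then merge rows and counts."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        hdfs_future = pool.submit(hdfs_fn, condition)
        hbase_future = pool.submit(hbase_fn, condition)
        hdfs_rows = hdfs_future.result()
        hbase_rows = hbase_future.result()
    return {
        "rows": hdfs_rows + hbase_rows,   # history first, then today's data
        "hdfs_count": len(hdfs_rows),
        "hbase_count": len(hbase_rows),
        "total": len(hdfs_rows) + len(hbase_rows),
    }


def fake_hdfs(condition: str) -> List[Row]:
    # Stub standing in for an Impala query against the HDFS table.
    return [{"name": "a", "fromip": condition}, {"name": "b", "fromip": condition}]


def fake_hbase(condition: str) -> List[Row]:
    # Stub standing in for an Impala query against the HBase-backed table.
    return [{"name": "c", "fromip": condition}]


result = query_both("10.0.0.1", fake_hdfs, fake_hbase)
print(result["hdfs_count"], result["hbase_count"], result["total"])
```

With both queries in flight at once, the total latency is roughly that of the slower store rather than the sum of the two.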

On the basis of the above embodiments, after counting the number of records that match the user's query request and obtaining the number of matching records from HDFS alone, from HBase alone, and from HDFS and HBase jointly, the method further comprises:

Likewise, once the total number of matching records is known, the embodiment provides page-turning and page-switching operations. Specifically, only part of the data is fetched from the database each time, and every page turn or page switch is a new query. In Impala the step size and start offset of the query are set with limit and offset and passed in as query parameters; the rows returned for one step are displayed as one page. When the user jumps to page n, offset equals limit*n, meaning that the limit rows following the first limit*n rows are displayed.
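The paging arithmetic can be sketched in Python as below. The helper names `page_clause` and `paged_query` are hypothetical. The sketch follows the formula in the text, offset = limit*n, which implies that pages are numbered from 0. It also adds an order by clause, on the understanding that Impala expects one whenever offset is used; the sort column is an assumption, since the source does not name one.

```python
def page_clause(limit: int, page: int) -> str:
    """Return the limit/offset clause for a zero-based page index."""
    if limit <= 0 or page < 0:
        raise ValueError("limit must be positive and page must be non-negative")
    offset = limit * page  # formula from the text: offset = limit * n
    return f"limit {limit} offset {offset}"


def paged_query(base_sql: str, order_by: str, limit: int, page: int) -> str:
    """Append ordering and paging to a base query (order_by is an assumed column)."""
    return f"{base_sql} order by {order_by} {page_clause(limit, page)}"


print(page_clause(20, 0))
print(paged_query("select * from sho.hadoop_hdfs", "time", 20, 3))
```

A stable sort order matters here because each page is a separate query; without it, consecutive pages could repeat or skip rows.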

On the basis of the above embodiments, the method further comprises:

scheduling the script as a timed task.

Specifically: 00 0 * * * root /usr/bin/copy.sh

In summary, compared with the prior art, the Impala online interactive query method based on a hybrid storage scheme provided by embodiments of the present invention has the following advantages:

1. The scheme, based on hybrid HBase and HDFS storage, improves Impala interactive query speed: it exploits the fact that Impala queries HDFS faster than HBase, and it avoids the problem of large numbers of small files on HDFS.

2. The scheme compresses the data in Parquet format, which both reduces storage space and shortens network transfer time during queries.

3. HDFS stores the data in daily partitions, so when the query conditions specify a period, only the data for that period is queried, which improves query speed.

From the above description, those skilled in the art will clearly understand that each embodiment can be implemented in software together with a necessary general-purpose hardware platform, or alternatively in hardware. On this understanding, the technical solution, or the part of it that contributes over the prior art, can be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in parts of the embodiments.

Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced with equivalents, and that such modifications or replacements do not take the essence of the corresponding technical solutions outside the spirit and scope of the technical solutions of the embodiments of the invention.

Claims

1. An Impala online interactive query method based on a hybrid storage scheme, characterized by comprising:

when a user issues a query request, querying HDFS and HBase separately, and presenting the combined query results to the user.

2. The method according to claim 1, characterized in that creating the table on HDFS with Impala comprises:

3. The method according to claim 2, characterized in that querying HDFS and HBase separately when a user issues a query request, and presenting the combined query results to the user, comprises:

detecting whether the query conditions include a time condition, and, if they do and all the data the query needs has been copied to the HDFS table, querying only HDFS;

if the query conditions include a time condition and none of the data the query needs has been copied to HDFS, querying only HBase;

if the query conditions include no time condition, or the data covered by the time condition is stored partly in HDFS and partly in HBase, running a joint query over HBase and HDFS.

4. The method according to claim 1, characterized in that, after querying HDFS and HBase separately when a user issues a query request and presenting the combined query results to the user, the method further comprises:

5. The method according to claim 4, characterized in that, after counting the number of records that match the user's query request and obtaining the number of matching records from HDFS alone, from HBase alone, and from HDFS and HBase jointly, the method further comprises:

6. The method according to claim 1, characterized in that the method further comprises:

7. The method according to claim 1, characterized in that the method further comprises:

scheduling the script as a timed task.