CN105447172A

CN105447172A - Data processing method and system under Hadoop platform

Info

Publication number: CN105447172A
Application number: CN201510892226.1A
Authority: CN
Inventors: 朱大勇; 完献忠; 滕一勤
Original assignee: BEIJING ADVANCED DIGITAL TECHNOLOGY Co Ltd
Current assignee: BEIJING ADVANCED DIGITAL TECHNOLOGY Co Ltd
Priority date: 2015-12-07
Filing date: 2015-12-07
Publication date: 2016-03-30

Abstract

The present application provides a data processing method under a Hadoop platform, and belongs to the field of data processing. The method comprises: acquiring Hive table structure information of inventory data of a Hadoop platform, comparing structure information of to-be-stored data with the Hive table structure information, and obtaining data structure change information; updating a Hive table structure according to the obtained data structure change information and the acquired Hive table structure information of the inventory data; and formatting the to-be-stored data according to the updated Hive table structure, and storing the formatted to-be-stored data. With adoption of the method disclosed by the present application, compatibility of the data structure of the stored data is effectively ensured, and when the archived and stored data needs to be analyzed and queried, the data format does not need to be counted and converted, so that not only are computing resources saved, but also analysis and query results can be rapidly fed back, and data query and analysis efficiency is improved.

Description

Data processing method and system under a Hadoop platform

Technical field

The present application relates to the field of data processing, and in particular to a data processing method and system under a Hadoop platform.

Background art

With the development of computer technology, the volume of data that must be stored and processed keeps growing, and data produced at different times, by different terminals, or by different services may also differ in structure.

For example, when a Hadoop cluster is used for data management, the historical data of a business system needs to be archived in an archiving system. In the prior art, the mass data from the business system is stored in Hive tables, which makes the data easier to manage and query. However, for reasons such as changing business requirements, the table structure of some business-system tables changes, so the data formats of the source data archived in different periods inevitably fail to match.

When Hive tables are used for archiving operations such as data storage and management, the mismatch between archived data files and the Hive table schema caused by data structure changes must be taken into account: the archived data should remain efficient and easy to use, and it should also be possible to analyze historically archived data with the data structure of any historical time point.

In the prior art, the common practice is to store archived data in its own format. When the archived data needs to be analyzed, the data structures of all archived data are obtained, a common data structure is determined from them, the historical archived data is modified according to the common data structure, and all data is then analyzed using that common data structure.

The drawback of this practice is that whenever archived data is queried or analyzed, format statistics and conversion must be performed on the data already archived; the amount of computation is very large, and results are not fed back in time.

Summary of the invention

The technical problem to be solved by the present application is to provide a data processing method and system under a Hadoop platform, so as to solve the problems of heavy computation and delayed result feedback when archived data is queried and analyzed.

To solve the above problem, the present application provides a data processing method under a Hadoop platform, comprising:

acquiring Hive table structure information of inventory data of the Hadoop platform, and comparing structure information of to-be-stored data with the Hive table structure information to obtain data structure change information;

updating the Hive table structure according to the obtained data structure change information and the acquired Hive table structure information of the inventory data; and

formatting the to-be-stored data according to the updated Hive table structure, and storing the formatted to-be-stored data.

The data structure change information comprises: unchanged; or any one or a combination of the following flags: column addition, column deletion, and column position adjustment.

In one embodiment of the present application, updating the Hive table structure according to the obtained data structure change information and the acquired Hive table structure information of the inventory data comprises:

when the data structure change information is unchanged, the updated Hive table structure is identical to the acquired Hive table structure of the inventory data;

when the data structure change information comprises column deletion, the updated Hive table structure is identical to the acquired Hive table structure of the inventory data;

when the data structure change information comprises column addition, the corresponding data columns are appended after the acquired Hive table structure of the inventory data to update the Hive table structure;

when the data structure change information comprises column position adjustment, the updated Hive table structure is identical to the acquired Hive table structure of the inventory data.

Further, formatting the to-be-stored data according to the updated Hive table structure comprises:

when the data structure change information comprises column deletion, adding the corresponding data columns to the to-be-stored data and setting their values to null;

when the data structure change information comprises column addition, moving the newly added data columns in the to-be-stored data to after the other data columns;

when the data structure change information comprises column position adjustment, adjusting the positions of the corresponding data columns of the to-be-stored data according to the position of each data column in the Hive table.

In another embodiment of the present application, the data processing method under the Hadoop platform further comprises:

determining the Hive table structure of the stored data corresponding to the time point specified in a data query instruction;

reading the stored data and, according to the determined Hive table structure, appending null-valued data columns to the end of the stored data or deleting surplus data columns, to obtain the query data.

In yet another embodiment of the present application, the data processing method under the Hadoop platform further comprises: re-storing the stored data according to the updated Hive table structure.

To solve the above problem, the present application also discloses a data processing system under a Hadoop platform, comprising:

a data structure extraction module, configured to acquire Hive table structure information of inventory data of the Hadoop platform, and to compare structure information of to-be-stored data with the Hive table structure information to obtain data structure change information;

a Hive table update module, configured to update the Hive table structure according to the obtained data structure change information and the acquired Hive table structure information of the inventory data;

a data storage module, configured to format the to-be-stored data according to the updated Hive table structure and to store the formatted to-be-stored data.

The data structure change information comprises: unchanged; or any one or a combination of the following flags: column addition, column deletion, and column position adjustment.

The Hive table update module further comprises:

a column deletion submodule, configured so that, when the data structure change information is judged to comprise column deletion, the updated Hive table structure is identical to the acquired Hive table structure of the inventory data;

a column addition submodule, configured to, when the data structure change information is judged to comprise column addition, append the corresponding data columns after the acquired Hive table structure of the inventory data to update the Hive table structure;

a column position adjustment submodule, configured so that, when the data structure change information is judged to comprise column position adjustment, the updated Hive table structure is identical to the acquired structure of the inventory data;

a no-change submodule, configured so that, when the data structure change information is judged to be unchanged, the updated Hive table structure is identical to the acquired Hive table structure of the inventory data.

Further, the data storage module further comprises a data column adjustment submodule, configured to:

Compared with the prior art, the present application has the following advantages:

By acquiring the structure information of the inventory data and comparing the structure information of the to-be-stored data with it, data structure change information is obtained; the data structure of the Hive table is updated according to the obtained data structure change information and the acquired structure information of the inventory data; the to-be-stored data is then stored according to the data structure of the updated Hive table. This effectively ensures the compatibility of the data structure of the stored data. When archived data needs to be analyzed and queried, no statistics or conversion of data formats is required, which not only saves computing resources but also allows analysis and query results to be fed back quickly, improving data query and analysis efficiency.

Brief description of the drawings

Fig. 1 is a flowchart of the data processing method of Embodiment one of the present application;

Fig. 2 is a flowchart of the data processing method of Embodiment three of the present application;

Fig. 3 is a schematic diagram of the inventory data structure of Embodiment four of the present application;

Fig. 4 is a schematic diagram of the to-be-stored data structure of Embodiment four of the present application;

Fig. 5 is a schematic diagram of the adjusted Hive table data structure of Embodiment four of the present application;

Fig. 6 is a schematic structural diagram of the data processing system of Embodiment five of the present application.

Detailed description

To make the above objects, features, and advantages of the present application more apparent, the application is described in further detail below with reference to the drawings and specific embodiments.

Embodiment one:

Referring to Fig. 1, a flowchart of a data processing method under a Hadoop platform according to the present application is shown. The method applies to scenarios in which data needs to be archived and stored and then queried and analyzed, and comprises the following steps.

Step 100: acquiring Hive table structure information of inventory data of the Hadoop platform, and comparing structure information of to-be-stored data with the Hive table structure information to obtain data structure change information.

The Hive table structure information of the data comprises at least the attributes of the data columns and the positions of the data columns. The data structure change information comprises: unchanged; or any one or a combination of the following flags: column addition, column deletion, and column position adjustment.

When the data column attributes in the data structure information of the to-be-stored data are identical to the attributes of the data columns in the Hive table structure information of the inventory data, and the positions of the data columns with the corresponding attributes are also identical, the data structure of the to-be-stored data is determined to be the same as that of the inventory data, and the data structure change information is marked as "unchanged". When the to-be-stored data lacks data column attributes that are present in the Hive table structure information of the inventory data, the data structure change information is marked as comprising "column deletion". When the to-be-stored data contains data column attributes that are absent from the Hive table structure information of the inventory data, the data structure change information is marked as comprising "column addition". When the position of a data column with a given attribute in the to-be-stored data differs from the position of the data column with the corresponding attribute in the Hive table structure information of the inventory data, the data structure change information is marked as comprising "column position adjustment".
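The comparison rules above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name, the flag strings, and the representation of a structure as an ordered list of column names are all assumptions. It also assumes that "position" means the relative order of the columns the two structures share, which is the reading consistent with the example in Embodiment four.

```python
from typing import List, Set


def compare_structure(hive_cols: List[str], incoming_cols: List[str]) -> Set[str]:
    """Return the set of change flags for to-be-stored data vs. the Hive table.

    Flag names are hypothetical labels for the four cases in the text.
    """
    flags: Set[str] = set()
    hive_set, incoming_set = set(hive_cols), set(incoming_cols)

    # Columns of the Hive table that the to-be-stored data no longer has.
    if hive_set - incoming_set:
        flags.add("column_deletion")
    # Columns of the to-be-stored data that the Hive table does not have yet.
    if incoming_set - hive_set:
        flags.add("column_addition")

    # Assumption: a position adjustment means the shared columns appear
    # in a different relative order in the two structures.
    shared_in_hive = [c for c in hive_cols if c in incoming_set]
    shared_in_incoming = [c for c in incoming_cols if c in hive_set]
    if shared_in_hive != shared_in_incoming:
        flags.add("column_position_adjustment")

    return flags or {"unchanged"}
```

For instance, `compare_structure(["a", "b"], ["a", "b"])` yields only the "unchanged" flag, while a structure that drops one column, adds another, and swaps two shared columns yields all three change flags.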

In a specific implementation, compared with the Hive table structure of the inventory data, the data structure of the to-be-stored data may have had one or more data columns deleted and one or more data columns added, and data columns with the same attribute may be in different positions. That is, the obtained data structure change information comprises the column addition, column deletion, and column position adjustment flags.

Step 120: updating the Hive table structure according to the obtained data structure change information and the acquired Hive table structure information of the inventory data.

Before the to-be-stored data is stored, the Hive table structure must be updated according to the data structure change information. When the data structure change information is "unchanged", the updated Hive table structure is identical to the acquired Hive table structure of the inventory data. When the data structure change information comprises "column addition", the corresponding data columns are appended after the acquired Hive table structure of the inventory data to update the Hive table structure. When the data structure change information comprises "column deletion", the updated Hive table structure is identical to the acquired Hive table structure of the inventory data. When the data structure change information comprises "column position adjustment", the updated Hive table structure is identical to the acquired structure of the inventory data.
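The four update rules reduce to a single operation: existing columns are never removed or reordered, and only columns that are new to the table are appended at the end. A minimal Python sketch, assuming a structure is an ordered list of column names (the function name is hypothetical):

```python
from typing import List


def update_hive_columns(hive_cols: List[str], incoming_cols: List[str]) -> List[str]:
    """Return the updated column list of the Hive table.

    Deleted or reordered columns leave the table untouched; columns new to
    the table are appended in the order they occur in the to-be-stored data.
    """
    known = set(hive_cols)
    new_cols = [c for c in incoming_cols if c not in known]
    return list(hive_cols) + new_cols
```

Because the update is append-only, every earlier version of the table structure is a prefix of every later one, which is what makes the end-padding and end-truncation used at query time sufficient.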

Step 140: formatting the to-be-stored data according to the updated Hive table structure, and storing the formatted to-be-stored data.

The data structure of the to-be-stored data may differ from the updated Hive table structure. Therefore, when the to-be-stored data is stored, its format must be converted so that it is stored according to the updated Hive table structure. The formatting method is as follows: when the data structure change information is "unchanged", the format of the to-be-stored data is left as is; when the data structure change information comprises "column deletion", the corresponding data columns are added to the to-be-stored data and their values are set to null; when the data structure change information comprises "column addition", the newly added data columns in the to-be-stored data are moved to after the other data columns; when the data structure change information comprises "column position adjustment", the positions of the corresponding data columns of the to-be-stored data are adjusted according to the position of each data column in the Hive table.
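All three formatting cases can be handled by re-projecting each record onto the updated column list. The sketch below is illustrative only; the function name and the list-based record representation are assumptions, and Python's `None` stands in for the null value.

```python
from typing import Any, List, Optional


def format_record(incoming_cols: List[str],
                  values: List[Any],
                  updated_cols: List[str]) -> List[Optional[Any]]:
    """Lay out one to-be-stored record in the order of the updated Hive table.

    - Columns missing from the record (column deletion) become None.
    - New columns sit at the end because updated_cols lists them last.
    - Reordered columns are moved to the table's positions.
    """
    by_name = dict(zip(incoming_cols, values))
    return [by_name.get(col) for col in updated_cols]
```

For example, a record with columns `["c", "x", "a"]` and values `[3, 9, 1]`, formatted against the updated structure `["a", "b", "c", "x"]`, becomes `[1, None, 3, 9]`.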

In this embodiment, the structure information of the inventory data is acquired and compared with the structure information of the to-be-stored data to obtain data structure change information; the Hive table structure is updated according to the obtained data structure change information and the acquired Hive table structure information of the inventory data; the to-be-stored data is then stored according to the updated Hive table structure. This effectively ensures the compatibility of the data structure of the stored data. When archived data needs to be analyzed and queried, no statistics or conversion of data formats is required, which not only saves computing resources but also allows analysis and query results to be fed back quickly, improving data query and analysis efficiency.

Embodiment two:

As shown in Fig. 2, building on the data processing method of Embodiment one, another embodiment of the present application further comprises:

Step 160: re-storing the stored data according to the updated Hive table structure.

When the data format of the to-be-stored data differs from that of the stored data, that is, when the obtained data structure change information comprises the "column addition" flag, the Hive table structure is updated by appending the corresponding data columns after the acquired structure of the inventory data. In other words, when data is stored using the method disclosed in Embodiment one, the number of data columns in the Hive table may grow as the volume of stored data increases and time passes. To manage the stored data in a compatible manner, the data structures of all stored data are preferably adjusted to the updated Hive table structure; that is, step 160 is added after step 120 of Embodiment one, and the corresponding data columns are appended after the last data column of the stored data.

In a specific implementation of Embodiment two, the execution order of step 160 and step 140 is not limited.

By adjusting the data structure of the stored data according to the updated Hive table structure, the data can be effectively managed for compatibility.
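Step 160 amounts to padding each previously stored record at its end so that it has as many columns as the updated table. A minimal Python sketch, assuming records are lists of values and `None` represents null (the function name is hypothetical):

```python
from typing import Any, List, Optional


def pad_stored_record(record: List[Any], updated_width: int) -> List[Optional[Any]]:
    """Append null columns to a stored record until it matches the updated width."""
    # A non-positive difference produces an empty padding list, so records
    # that already have the full width are returned unchanged.
    return list(record) + [None] * (updated_width - len(record))
```

Padding at the end is sufficient here only because the update rule in step 120 appends new columns and never reorders or removes existing ones.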

Embodiment three:

Building on the data processing method under the Hadoop platform of Embodiment one, yet another embodiment of the present application, as shown in Fig. 2, further comprises:

Step 180: determining the Hive table structure of the stored data corresponding to the time point specified in a data query instruction;

Step 200: reading the stored data and, according to the determined Hive table structure, appending null-valued data columns to the end of the stored data or deleting surplus data columns, to obtain the query data.

During archival storage, when the data format of the to-be-stored data differs from that of the stored data, that is, when the obtained data structure change information comprises the "column addition" flag, the Hive table structure is updated by appending the corresponding data columns after the acquired structure of the inventory data; the number of data columns in the Hive table may therefore grow as the volume of stored data increases. Consequently, during data query and analysis, once the Hive table structure corresponding to the query time point specified in the data query instruction has been determined, the data structure of the stored data that is read is compared with that Hive table structure to determine how the data structure has changed. The stored data may have fewer data columns than the Hive table, more data columns, or the same data columns; in these cases the missing data columns are those at the end of the Hive table, and the surplus data columns lie at the end of the stored data. When the stored data has fewer data columns than the Hive table, the corresponding data columns are appended to the end of the stored data that is read and set to null. When the stored data has more data columns than the Hive table, the surplus data columns at the end of the stored data that is read are deleted. After this simple processing of the stored data that is read, all stored data conforming to the determined Hive table structure is obtained.
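Steps 180 and 200 can be sketched as a lookup followed by an end-of-record adjustment. The text does not say how the table structure for a time point is recorded, so the date-sorted `history` list below is purely an assumption for illustration, as are the function names.

```python
from bisect import bisect_right
from typing import Any, List, Optional, Tuple


def structure_at(history: List[Tuple[str, List[str]]], when: str) -> List[str]:
    """Return the column list in effect at `when`.

    `history` is a hypothetical list of (ISO date, columns) pairs sorted by
    date; ISO date strings compare correctly as plain strings.
    """
    dates = [d for d, _ in history]
    idx = bisect_right(dates, when) - 1
    if idx < 0:
        raise ValueError("no table structure recorded at or before the requested time")
    return history[idx][1]


def conform_record(record: List[Any], columns: List[str]) -> List[Optional[Any]]:
    """Pad with nulls or drop trailing columns so the record fits `columns`."""
    width = len(columns)
    record = list(record)
    if len(record) < width:
        return record + [None] * (width - len(record))
    return record[:width]
```

A record stored with three columns and queried against an earlier two-column structure loses its last column; a two-column record queried against a later three-column structure gains one trailing null.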

In this embodiment, during data query or analysis, appending the corresponding null data columns to the end of the stored data that is read effectively ensures the consistency of the data structure and facilitates data analysis; no format statistics or conversion is needed for the archived data, so data processing is efficient.

Embodiment four:

The data processing method of the present application is further illustrated below using an application in which a Hive system stores data in a Hadoop cluster.

This embodiment is described in detail using two sets of data as an example: inventory data comprising 8 fields and to-be-stored data comprising 9 fields.

Fig. 3 is a schematic diagram of a data structure of the inventory data. As shown in Fig. 3, the inventory data is taken to comprise 8 data columns whose attributes are C1, C2, C3, C4, C5, C6, C7, and C8; the Hive table storing this data needs 8 columns to hold the data shown in Fig. 3. Fig. 4 is a schematic diagram of a data structure of the to-be-stored data. As shown in Fig. 4, the to-be-stored data is taken to comprise 9 data columns whose attributes are C1', C2', C3, C6, C7', C8', C7, C9, and C10. The same symbol in Fig. 3 and Fig. 4 denotes a data column with the same attribute, that is, the same field.

The data processing method of the present embodiment comprises the following steps.

Step 510: acquiring the data column attributes of the Hive table, and comparing the data column attributes of the to-be-stored data with those of the Hive table to obtain data structure change information.

Hive is a data warehouse analysis system built on Hadoop. It offers a rich SQL query interface for analyzing data stored in the Hadoop distributed file system: structured data files can be mapped to database tables, full SQL query functionality is provided, and SQL statements are converted into MapReduce jobs for execution, so that users can query and analyze the content they need with their own SQL. When the Hive system stores and manages data, the number of data columns of a Hive table and their attributes can be obtained by invoking DDL statements. The data structure change information is then obtained by comparing the data column attributes of the to-be-stored data with the acquired data column attributes of the Hive table.
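The text does not name the DDL statements involved. As one plausible illustration, HiveQL provides `DESCRIBE <table>` to list a table's columns and `ALTER TABLE <table> ADD COLUMNS (...)`, which appends columns after the existing ones (and before any partition columns). The Python sketch below only builds the statement strings; the helper names, the table name `archive_orders`, the column names, and the default `STRING` type are all hypothetical, and executing the statements against a Hive server is left out.

```python
from typing import List, Optional


def describe_ddl(table: str) -> str:
    """Build the HiveQL statement that lists a table's columns."""
    return f"DESCRIBE {table}"


def add_columns_ddl(table: str,
                    new_cols: List[str],
                    col_type: str = "STRING") -> Optional[str]:
    """Build the HiveQL statement that appends new columns to a table.

    Returns None when there is nothing to add. The column type is an
    assumption; a real archive would carry the source column types over.
    """
    if not new_cols:
        return None
    cols = ", ".join(f"`{c}` {col_type}" for c in new_cols)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


print(add_columns_ddl("archive_orders", ["c9", "c10"]))
```

Since `ADD COLUMNS` can only append, it matches the rule that newly added columns always go to the end of the Hive table.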

The data structure change information comprises: unchanged; or any one or a combination of the following flags: column addition, column deletion, and column position adjustment.

For the data shown in Fig. 3 and Fig. 4, comparing the data column attributes in the data structure information of the to-be-stored data with those in the data structure information of the inventory data shows that C1, C2, C4, C5, and C8 are missing, so the data structure change information is marked as comprising "column deletion". The same comparison shows that C1', C2', C7', C8', C9, and C10 have been added, so the data structure change information is marked as comprising "column addition". The data structure change information obtained from the comparison therefore comprises column addition and column deletion. The added and deleted data columns are recorded separately: the added data columns are C1', C2', C7', C8', C9, and C10; the deleted data columns are C1, C2, C4, C5, and C8.

Step 520: if the data structure change information comprises "column deletion", the column is kept unchanged in the Hive table according to the data column attributes of the Hive table, and the to-be-stored data is formatted: a data column is inserted into the to-be-stored data at the position corresponding to the original position of that column, and its value is set to null as a placeholder. Taking the data shown in Fig. 3 and Fig. 4 as an example, for the missing C1, C2, C4, C5, and C8, the values of the corresponding data columns after formatting are set to null.

Step 530: if the data structure change information comprises "column addition", data columns are appended at the end of the Hive table, and the newly added data columns in the to-be-stored data are moved to after the other data columns. Taking the data shown in Fig. 3 and Fig. 4 as an example, 6 columns are appended to the end of the original Hive table to store, in order, the added data columns C1', C2', C7', C8', C9, and C10.

Step 540: if the data structure change information comprises "column position adjustment", the corresponding data columns of the to-be-stored data are adjusted according to the data column attributes of the Hive table, so that their positions match the positions of the corresponding data columns in the Hive table.

If the data structure change information comprises the column addition, column deletion, or column position adjustment flags, the Hive table and the to-be-stored data do not correspond one to one in number of data columns, in data column attributes, or in order. The text or the Hive table must then be adjusted according to the case at hand, so that the data format of the Hive table is consistent with that of the to-be-stored data.

In this embodiment, compared with the data already stored in the Hive table, the to-be-stored data has both newly added and deleted data columns. To facilitate subsequent data management, analysis, and querying, the data format must be unified: the Hive table structure is updated so that it remains compatible with the Hive table structure of the stored data, and the to-be-stored data is formatted with the updated Hive table structure. Specifically, for the missing data columns, corresponding data columns are added to the to-be-stored data and their contents are set to null; for the newly added data columns, the corresponding data columns are appended to the end of the Hive table in the column order of the to-be-stored data. Taking the data shown in Fig. 3 and Fig. 4 as an example, 6 columns are appended to the end of the original Hive table to store C1', C2', C7', C8', C9, and C10 in order; for the missing C1, C2, C4, C5, and C8, the values of the corresponding data columns are set to null. This yields the data structure corresponding to the current time point, that is, the Hive table structure used to store the to-be-stored data, which comprises 14 data columns in total, as shown in Fig. 5.
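The example of Fig. 3 to Fig. 5 can be reproduced with a few lines of Python. The column names come from the text; the sample record values 1 to 9 are made up for illustration, and `None` stands in for the null value.

```python
# Column attributes of the inventory data (Fig. 3) and to-be-stored data (Fig. 4).
inventory = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
incoming = ["C1'", "C2'", "C3", "C6", "C7'", "C8'", "C7", "C9", "C10"]

# Columns the to-be-stored data lacks, and columns new to the Hive table.
deleted = [c for c in inventory if c not in incoming]
added = [c for c in incoming if c not in inventory]

# Updated Hive table structure: original columns first, new ones appended.
updated = inventory + added

# Format one hypothetical record (values 1..9 in to-be-stored column order).
record = dict(zip(incoming, range(1, 10)))
formatted = [record.get(col) for col in updated]

print(deleted)       # columns filled with null in the formatted record
print(added)         # columns appended to the Hive table
print(len(updated))  # width of the updated Hive table
print(formatted)
```

The deleted list is C1, C2, C4, C5, C8; the added list is C1', C2', C7', C8', C9, C10; and the updated structure has 14 columns, matching the counts stated for Fig. 5.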

It is worth noting that if the data structure change information comprises the column position adjustment flag, for example the position number of C6 in the to-be-stored data is 3 and that of C3 is 4, meaning the positions of columns C3 and C6 in the to-be-stored data have been adjusted, the data structure of the to-be-stored data must be rearranged according to the structure of the Hive table. When the data structure change information is "unchanged", the data columns of the Hive table and of the to-be-stored data correspond one to one in order; the to-be-stored data is then stored directly without processing.

Step 550: storing the to-be-stored data according to the data column attributes of the updated Hive table.

With the method disclosed in this embodiment, the structure of the stored data does not need to be adjusted, and the compatibility of the data structure of the stored data is still effectively ensured. When archived data needs to be analyzed and queried, no statistics or conversion of data formats is required, which not only saves computing resources but also allows analysis and query results to be fed back quickly, improving data query and analysis efficiency.

Embodiment five:

Corresponding to the data processing method disclosed in the foregoing embodiments, an embodiment of the present application also discloses a data processing system under a Hadoop platform which, as shown in Fig. 6, comprises:

a data structure extraction module 600, configured to acquire Hive table structure information of inventory data of the Hadoop platform, and to compare structure information of to-be-stored data with the Hive table structure information to obtain data structure change information;

a Hive table update module 610, configured to update the Hive table structure according to the obtained data structure change information and the acquired Hive table structure information of the inventory data;

a data storage module 620, configured to format the to-be-stored data according to the updated Hive table structure and to store the formatted to-be-stored data.

In a specific implementation, the Hive table update module 610 further comprises:

a column addition submodule, configured to, when the data structure change information is judged to comprise column addition, move the newly added data columns in the to-be-stored data to after the other data columns.

The data storage module 620 further comprises a data column adjustment submodule, configured to:

when the data structure change information comprises column deletion, add the corresponding data columns to the to-be-stored data and set their values to null; when the data structure change information comprises column addition, move the newly added data columns in the to-be-stored data to after the other data columns; when the data structure change information comprises column position adjustment, adjust the positions of the corresponding data columns of the to-be-stored data according to the position of each data column in the Hive table.

Embodiment six:

In another embodiment of the present application, the data processing system under the Hadoop platform further comprises a data reading module, configured to determine the Hive table structure of the stored data corresponding to the time point specified in a data query instruction; and to read the stored data and, according to the determined Hive table structure, append null-valued data columns to the end of the stored data or delete surplus data columns, to obtain the query data.

Building on Embodiment five of the present application, in a specific embodiment of another data processing system under a Hadoop platform, the system further comprises a stored data update module, configured to re-store the stored data according to the Hive table structure updated by the Hive table update module 610. By adjusting the data structure of the stored data according to the updated Hive table structure, the data can be effectively managed for compatibility.

The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for the parts that are identical or similar the embodiments may be referred to one another. Because the system embodiments are basically similar to the method embodiments, they are described more briefly; for the relevant parts, refer to the description of the method embodiments.

The system embodiments described above are merely illustrative. Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the description of the above embodiments, a person skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the part of the above technical solution that in essence contributes over the prior art may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and comprises instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.

The data processing method and system under a Hadoop platform provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is intended only to help in understanding the method of the application and its core idea. A person of ordinary skill in the art may, following the idea of the application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A data processing method under a Hadoop platform, characterized by comprising:

2. The method of claim 1, characterized in that the data structure change information comprises: unchanged; or any one or a combination of the following flags: column addition, column deletion, and column position adjustment.

3. The method of claim 2, characterized in that updating the Hive table structure according to the obtained data structure change information and the acquired Hive table structure information of the inventory data comprises:

4. The method of claim 3, characterized in that formatting the to-be-stored data according to the updated Hive table structure comprises:

5. The method of any one of claims 1 to 4, characterized by further comprising:

6. The method of claim 1, characterized by further comprising: re-storing the stored data according to the updated Hive table structure.

7. A data processing system under a Hadoop platform, characterized by comprising:

8. The system of claim 7, characterized in that the data structure change information comprises: unchanged; or any one or a combination of the following flags: column addition, column deletion, and column position adjustment.

9. The system of claim 8, characterized in that the Hive table update module further comprises:

10. The system of claim 9, characterized in that the data storage module further comprises a data column adjustment submodule, configured to: