CN106649355A - Method and device for processing data

Info

Publication number: CN106649355A
Application number: CN201510728875.8A
Authority: CN
Inventors: 赖华贵
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2015-10-30
Filing date: 2015-10-30
Publication date: 2017-05-10

Abstract

The invention discloses a data processing method and device, relating to the field of computers, which address the problem of low data processing efficiency. In the specific scheme, first data stored in a relational database is obtained; the first data is stored in a first data table, which is a temporary data table under a specified HDFS directory; the first data in the first data table is inserted into a second data table, which is a formal data table under an HDFS directory different from the specified HDFS directory; the data in the second data table is then organized to generate the second data table containing the complete data. The method and device are used in data processing.

Description

Data processing method and device

Technical field

The present invention relates to the field of computers, and in particular to a data processing method and device.

Background art

With the development of information technology, the volume of data has grown sharply and data analysis has become increasingly important. As a result, the components of the Hadoop distributed computing ecosystem (such as Hive and Impala) occupy an increasingly important position in the field of data analysis.

When website logs need to be stored and analyzed with Hadoop, a developer manually imports the website logs into the Hadoop cluster, parses them with MapReduce or a similar computing framework in the cluster, and then stores the result in Hive or Impala for data analysis.

It can be seen that, when data needs to be analyzed by a Hadoop cluster, the raw data has to be imported again and parsed a second time, which leads to low processing efficiency.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a data processing method that overcomes the above problems or at least partially solves them.

By means of the above technical solution, in one aspect the present invention provides a data processing method, including:

obtaining first data, the first data being data stored in a relational database;

storing the first data in a first data table, the first data table being a temporary data table under a specified Hadoop Distributed File System (HDFS) directory;

inserting the first data table into a second data table, the second data table being a formal data table under a directory different from the specified HDFS directory;

organizing the data in the second data table to generate a third data table.

In another aspect, the present invention further provides a data processing device, including:

an acquisition unit, configured to obtain first data and provide the first data to a storage unit, the first data being data stored in a relational database;

the storage unit, configured to store the first data in a first data table and provide the first data table to an insertion unit, the first data table being a temporary data table under a specified Hadoop Distributed File System (HDFS) directory;

the insertion unit, configured to insert the first data table into a second data table and provide the second data table, into which the first data table has been inserted, to a first generating unit, the second data table being a formal data table under a directory different from the specified HDFS directory;

According to the data processing method and device provided by the embodiments of the present invention, first data stored in a relational database is obtained and stored in a temporary data table under a specified HDFS directory; the first data in the first data table is inserted into a second data table under another HDFS directory, the second data table being a formal data table under a directory different from the specified HDFS directory; the data in the second data table is then organized to generate the second data table containing the complete data. Through these steps, already-parsed data stored in the relational database can be stored directly under an HDFS directory. This facilitates subsequent data analysis and removes the step of parsing the raw data again, which improves data processing efficiency, avoids wasting resources, and reduces cost.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 shows a flow chart of a data processing method provided by an embodiment of the present invention;

Fig. 2 shows a flow chart of another data processing method provided by an embodiment of the present invention;

Fig. 3 shows a flow chart of yet another data processing method provided by an embodiment of the present invention;

Fig. 4 shows a flow chart of a further data processing method provided by an embodiment of the present invention;

Fig. 5 shows a flow chart of a further data processing method provided by an embodiment of the present invention;

Fig. 6 shows a flow chart of a further data processing method provided by an embodiment of the present invention;

Fig. 7 shows a schematic structural diagram of a data processing device provided by an embodiment of the present invention;

Fig. 8 shows a schematic structural diagram of another data processing device provided by an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

Hadoop is an open-source distributed parallel computing platform. Its Map/Reduce computing capability is widely used in data analysis and processing, and Hadoop has developed into an excellent approach to big data analysis.

Hadoop is a complete open framework for big data analysis. It includes HDFS (Hadoop Distributed File System), a parallel processing framework (Apache Hadoop MapReduce), and a variety of other components that support functions such as data collection, workflow coordination, task management, and cluster monitoring.

In order to process data that has already been parsed and stored in a relational database with the data processing components of a Hadoop cluster, the present application provides a data processing method. As shown in Fig. 1, the method includes:

S101: obtain first data, the first data being data stored in a relational database.

It should be noted that the first data may be in file form, although embodiments of the present invention are not limited to this. When the first data is in file form, obtaining the first data may be implemented, without limitation, by exporting the data in the relational database as a batch of files; the file format may be, but is not limited to, CSV or TXT.

A relational database is a database built on the relational model, which processes the data in the database by means of mathematical concepts and methods such as set algebra.

It will be appreciated that the first data in this step is data obtained by parsing the raw data.

S102: store the first data in a first data table, the first data table being a temporary data table under a specified Hadoop Distributed File System (HDFS) directory.

The specified HDFS directory is a designated directory corresponding to the first data table; it can be regarded as the directory in which the first data is stored.

S103: insert the first data table into a second data table, the second data table being a formal data table under a directory different from the specified HDFS directory.

S104: organize the data in the second data table to generate a third data table.

It will be understood that the third data table is a formal data table under a directory different from the specified HDFS directory, and that it differs from the second data table in that it also includes the first data.

According to the data processing method provided by the embodiment of the present invention, first data stored in a relational database is obtained and stored in a temporary data table under a specified HDFS directory; the first data in the first data table is inserted into a second data table under another HDFS directory, the second data table being a formal data table under a directory different from the specified HDFS directory; the data in the second data table is then organized to generate the second data table containing the complete data. Through these steps, already-parsed data stored in the relational database can be stored directly under an HDFS directory. This facilitates subsequent data analysis and removes the step of parsing the raw data again, which improves data processing efficiency, avoids wasting resources, and reduces cost.

In one embodiment of the present invention, after S101, that is, after the first data is obtained, the method may further include:

generating at least one data file from the first data in a certain order, where the table contained in each data file has the same table structure as the corresponding table in the relational database, and the number of data files equals the number of tables to which the first data corresponds.

Optionally, the data file in the present invention may be a CSV (Comma-Separated Values) file.

Optionally, the first data may be exported in column order to obtain at least one data file. Correspondingly, the set of columns in each data file is the same as the set of columns of the corresponding table in the relational database, and each table in the relational database corresponds to one data file (CSV file).
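The export described above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: SQLite stands in for the relational database, and the function name `export_tables_to_csv` and the one-file-per-table naming scheme are assumptions made for the example.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_csv(conn: sqlite3.Connection, out_dir: Path) -> dict[str, Path]:
    """Write one CSV file per table, keeping each table's column order.

    Hypothetical helper: SQLite is used here only as a stand-in for
    whatever relational database actually holds the parsed data.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    written: dict[str, Path] = {}
    for table in tables:
        # SELECT * returns columns in the order of the table definition,
        # so the file's columns match the source table's columns.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        path = out_dir / f"{table}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            # No header row is written, so every line of the file is a data row.
            csv.writer(handle).writerows(cursor)
        written[table] = path
    return written
```

Omitting the header row is a design choice of this sketch: a text-backed table that reads the file directly would otherwise see the header as a data row.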

In one embodiment of the present invention, before S102, that is, before the first data is stored in the first data table, the method further includes:

in a data processing component of the Hadoop cluster, creating the first data table with the same table structure as the table contained in the relational database, the first data table being a data table that includes all the data under the specified HDFS directory.

The data processing component of the Hadoop cluster may be Hive, Impala, or the like. Hive is a Hadoop-based data warehouse tool that can map structured data files to database tables and provides simple SQL (Structured Query Language) query functionality by converting SQL statements into MapReduce jobs. Impala is an MPP (massively parallel processing) SQL engine for big data analysis. The invention is of course not limited to these. The data processing components of a Hadoop cluster are prior art and are not described one by one here.

In addition, it should be noted that the first data table is an external temporary data table; therefore, the first data table can include all the data under the specified HDFS directory.

Optionally, the invention does not limit the number of first data tables: in HDFS, one directory is created for each data table in the relational database, and in the data processing component of the Hadoop cluster an external temporary data table (that is, a first data table) is created for each data table on its corresponding HDFS directory.

Taking Hive as an example, an external Hive table is created under the specified HDFS directory to serve as a temporary data table. The table structure of this external temporary data table is consistent with that of the table in the relational database, which guarantees data integrity when the first data from the relational database is stored in the external temporary data table and reduces the processing required on the first data.
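As an illustration of the Hive example above, the following sketch builds the DDL for an external table whose location is the specified HDFS directory. The function name, the table and column names, and the choice of comma-delimited text storage are assumptions for the example; the patent does not give the actual statement.

```python
def external_temp_table_ddl(
    table: str, columns: list[tuple[str, str]], hdfs_dir: str
) -> str:
    """Build a CREATE EXTERNAL TABLE statement over a staging directory.

    `columns` is a list of (name, hive_type) pairs given in the same order
    as the columns of the source relational table (assumed by this sketch).
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )


if __name__ == "__main__":
    # Hypothetical table and directory names, for illustration only.
    print(
        external_temp_table_ddl(
            "tmp_visits", [("id", "BIGINT"), ("url", "STRING")], "/staging/visits"
        )
    )
```

Because the table is external, every file placed in the directory becomes visible through the table, which matches the statement that the first data table includes all the data under the specified HDFS directory.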

In one embodiment of the present invention, as shown in Fig. 2, S102 (storing the first data in the first data table) may be refined to include:

S1021: copy the at least one data file into the first data table.

S1022: run the Refresh command in the data analysis tool of the Hadoop cluster.

S1023: refresh the first data table.

At this point the first data table contains the first data; refreshing the first data table makes it possible to confirm whether the first data is in the first data table.
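Steps S1021 to S1023 can be sketched as a pair of shell commands assembled in Python. The sketch assumes that Impala is the data analysis tool (its `REFRESH` statement is run through `impala-shell -q`) and that the file is copied with `hdfs dfs -put`; the function names, paths, and table name are hypothetical.

```python
import subprocess


def load_commands(local_file: str, hdfs_dir: str, temp_table: str) -> list[list[str]]:
    """Return the commands that stage one data file and refresh the temp table."""
    return [
        # S1021: copy the exported file into the directory backing the temp table.
        ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir],
        # S1022/S1023: ask Impala to reload the table's file metadata.
        ["impala-shell", "-q", f"REFRESH {temp_table}"],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Separating command construction from execution keeps the sketch easy to inspect on a machine without a Hadoop client; `run_all(load_commands(...))` would perform the actual load on a configured cluster node.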

In one implementation of the present invention, as shown in Fig. 3, S103 (inserting the first data table into the second data table) may be refined to include:

S1031: run the Insert command in the data analysis tool of the Hadoop cluster.

S1032: insert the first data in the first data table into the second data table.
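The Insert step can be illustrated with an `INSERT INTO ... SELECT` statement, which both Hive and Impala accept. This is a hedged sketch: the patent does not show the statement it runs, and the function and table names here are hypothetical.

```python
def insert_statement(temp_table: str, formal_table: str) -> str:
    """Build the statement that appends all staged rows to the formal table.

    Assumes both tables have identical column lists, as the description
    requires of the temporary table and the source relational table.
    """
    return f"INSERT INTO TABLE {formal_table} SELECT * FROM {temp_table}"


if __name__ == "__main__":
    print(insert_statement("tmp_visits", "visits"))
```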

It is worth noting that Fig. 3 does not include the refinement steps of S102; however, Fig. 3 shows only one implementation. Optionally, the present invention may also include another implementation, as shown in Fig. 4. Because the step content is the same, it is not described again here.

In one implementation of the present invention, as shown in Fig. 5, after S104, that is, after the data in the second data table is organized and the third data table is generated, the method further includes:

S105: empty the specified HDFS directory.

Specifically, after it is determined that all the first data has been imported into the second data table, the HDFS directory corresponding to each first data table is emptied, and the Refresh command is run to refresh the first data table.
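The clean-up in S105 can be sketched as below. The check that "all the first data has been imported" is modelled here as a simple row-count comparison, which is an assumption of this sketch rather than something the patent specifies; the function name and the use of `hdfs dfs -rm` and `impala-shell` are likewise illustrative.

```python
def cleanup_commands(
    hdfs_dir: str, temp_table: str, staged_rows: int, inserted_rows: int
) -> list[list[str]]:
    """Return the commands that empty the staging directory and refresh the table.

    Raises ValueError when the row counts disagree, so that staged files are
    never deleted before the import is known to be complete.
    """
    if staged_rows != inserted_rows:
        raise ValueError(
            f"import incomplete: staged {staged_rows} rows, inserted {inserted_rows}"
        )
    return [
        # Remove the files inside the directory but keep the directory itself;
        # the glob is expanded by the HDFS shell, not by the local shell.
        ["hdfs", "dfs", "-rm", "-r", "-f", f"{hdfs_dir.rstrip('/')}/*"],
        # Refresh so that the temporary table no longer lists the removed files.
        ["impala-shell", "-q", f"REFRESH {temp_table}"],
    ]
```

How the two row counts are obtained (for example with `SELECT COUNT(*)` before and after the insert) is left open here.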

The present invention is mainly directed at how to store first data, which has already been parsed and stored in a relational database, under a data processing component of the Hadoop cluster without parsing it again, so that the data processing component of the Hadoop cluster can process the first data.

However, for ease of operation and without additional cost, the solution provided by the present invention is also applicable to data that is not stored in a relational database. As shown in Fig. 6, the method further includes:

S601: receive second data.

The second data is data not stored in the relational database.

S602: store the second data in the first data table.

S603: insert the first data table into the second data table, the second data table being a formal data table under a directory different from the specified HDFS directory.

S604: organize the data in the second data table to generate a third data table.
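One way to picture S601 and S602 is to write the second data into the same column layout as the first data table, so that it can pass through the same temporary table. The sketch below assumes the second data arrives as dictionary-like records; that input shape, the function name, and the handling of missing fields are assumptions, not details from the patent.

```python
import csv
from pathlib import Path
from typing import Iterable, Mapping, Sequence


def write_second_data(
    records: Iterable[Mapping[str, object]], columns: Sequence[str], path: Path
) -> int:
    """Write records as CSV rows in the temp table's column order.

    Fields missing from a record are written as empty values. Returns the
    number of rows written.
    """
    count = 0
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for record in records:
            writer.writerow([record.get(col, "") for col in columns])
            count += 1
    return count
```

The resulting file could then be staged, inserted, and cleaned up with the same steps as a file exported from the relational database.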

In order to store already-parsed data from a relational database directly under an HDFS directory, thereby facilitating subsequent data analysis, removing the step of parsing the raw data again, improving data processing efficiency, avoiding waste of resources, and reducing cost, the present invention further provides a data processing device 70. As shown in Fig. 7, the device includes an acquisition unit 701, a storage unit 702, an insertion unit 703, and a first generating unit 704.

The acquisition unit 701 is configured to obtain first data and provide the first data to the storage unit 702, the first data being data stored in a relational database;

the storage unit 702 is configured to store the first data in a first data table and provide the first data table to the insertion unit 703, the first data table being a temporary data table under a specified Hadoop Distributed File System (HDFS) directory;

the insertion unit 703 is configured to insert the first data table into a second data table and provide the second data table, into which the first data table has been inserted, to the first generating unit 704, the second data table being a formal data table under a directory different from the specified HDFS directory;

the first generating unit 704 is configured to organize the data in the second data table to generate a third data table.

Optionally, the present invention further provides a data processing device 80. As shown in Fig. 8, the device 80 further includes a second generating unit 705, a clearing unit 706, and a receiving unit 707; the storage unit 702 includes a copying module 7021, a first running module 7022, and a refresh module 7023; and the insertion unit 703 includes a second running module 7031 and an insertion module 7032.

The second generating unit 705 is configured to generate at least one data file from the first data in a certain order, where the table contained in each data file has the same table structure as the table contained in the relational database, and the number of data files equals the number of tables to which the first data corresponds; and

is further configured to create, in a data processing component of the Hadoop cluster, the first data table with the same table structure as the table contained in the relational database, the first data table being a data table that includes all the data under the specified HDFS directory.

Further, the copying module 7021 in the storage unit 702 is configured to copy the at least one data file into the first data table.

The first running module 7022 is configured to run the Refresh command in the data analysis tool of the Hadoop cluster.

The refresh module 7023 is configured to refresh the first data table.

Further, the second running module 7031 in the insertion unit 703 is configured to run the Insert command in the data analysis tool of the Hadoop cluster.

The insertion module 7032 is configured to insert the first data in the first data table into the second data table.

It should further be noted that, in this embodiment, the clearing unit 706 is configured to empty the specified HDFS directory.

The receiving unit 707 is configured to receive second data and provide the second data to the storage unit 702, the second data being data not stored in the relational database.

The storage unit 702 is further configured to store the second data in the first data table.

With the embodiments of the present invention, already-parsed data stored in a relational database can be stored directly under an HDFS directory. This facilitates subsequent data analysis and removes the step of parsing the raw data again, which improves data processing efficiency, avoids wasting resources, and reduces cost.

The data processing device includes a processor and a memory. The acquisition unit, storage unit, insertion unit, first generating unit, second generating unit, clearing unit, receiving unit, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided; by adjusting kernel parameters, data processing efficiency is improved, waste of resources is avoided, and cost is reduced.

The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to perform: obtaining first data, the first data being data stored in a relational database; storing the first data in a first data table, the first data table being a temporary data table under a specified Hadoop Distributed File System (HDFS) directory; inserting the first data table into a second data table, the second data table being a formal data table under a directory different from the specified HDFS directory; and organizing the data in the second data table to generate a third data table.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The application is described with reference to flow charts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The above are only embodiments of the present application and are not intended to limit the application. Those skilled in the art may make various modifications and variations to the application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within the scope of the claims of the application.

Claims

1. A data processing method, characterized by comprising:

2. The data processing method according to claim 1, characterized in that, after the obtaining of the first data, the method further comprises:

generating at least one data file from the first data in a certain order, where the table contained in each data file has the same table structure as the table contained in the relational database, and the number of data files equals the number of tables to which the first data corresponds.

3. The data processing method according to claim 1, characterized in that, before the storing of the first data in the first data table, the method further comprises:

in a data processing component of a Hadoop distributed computing cluster, creating the first data table with the same table structure as the table contained in the relational database, the first data table being a data table that includes all the data under the specified HDFS directory.

4. The data processing method according to claim 2, characterized in that the storing of the first data in the first data table comprises:

copying the at least one data file into the first data table;

running the Refresh command in the data analysis tool of the Hadoop cluster;

refreshing the first data table;

and the inserting of the first data table into the second data table comprises:

running the Insert command in the data analysis tool of the Hadoop cluster;

inserting the first data in the first data table into the second data table.

5. The data processing method according to claim 4, characterized in that, after the organizing of the data in the second data table to generate the third data table, the method further comprises:

emptying the specified HDFS directory.

6. The data processing method according to any one of claims 1 to 5, characterized in that, before the inserting of the first data table into the second data table under the specified HDFS directory, the second data table being a formal data table under the specified HDFS directory, the method further comprises:

receiving second data, the second data being data not stored in the relational database;

storing the second data in the first data table.

7. A data processing device, characterized by comprising:

a first generating unit, configured to organize the data in the second data table to generate a third data table.

8. The data processing device according to claim 7, characterized in that the device further comprises:

a second generating unit, configured to generate at least one data file from the first data in a certain order, where the table contained in each data file has the same table structure as the table contained in the relational database, and the number of data files equals the number of tables to which the first data corresponds;

the second generating unit being further configured to create, in a data processing component of a Hadoop distributed computing cluster, the first data table with the same table structure as the table contained in the relational database, the first data table being a data table that includes all the data under the specified HDFS directory.

9. The data processing device according to claim 8, characterized in that the storage unit comprises:

a copying module, configured to copy the at least one data file into the first data table;

a first running module, configured to run the Refresh command in the data analysis tool of the Hadoop cluster;

a refresh module, configured to refresh the first data table;

and the insertion unit comprises:

a second running module, configured to run the Insert command in the data analysis tool of the Hadoop cluster;

an insertion module, configured to insert the first data in the first data table into the second data table.

10. The data processing device according to any one of claims 7 to 9, characterized in that the device further comprises a clearing unit and a receiving unit;

the clearing unit being configured to empty the specified HDFS directory;

the receiving unit being configured to receive second data and provide the second data to the storage unit, the second data being data not stored in the relational database;

the storage unit being further configured to store the second data in the first data table.