CN108133050A - Data extraction method, system, and device

Info

Publication number: CN108133050A
Application number: CN201810043493.5A
Authority: CN
Inventors: 王飞
Original assignee: Beijing Net Letter Cloud Suit Mdt Infotech Ltd
Current assignee: Beijing Net Letter Cloud Suit Mdt Infotech Ltd
Priority date: 2018-01-17
Filing date: 2018-01-17
Publication date: 2018-06-08

Abstract

The invention discloses a data extraction method, comprising: when an extraction request for target data is received, obtaining a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name; according to the data source set, searching a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set; performing distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data; and extracting the target data. With this method, extraction code no longer needs to be written under the map-reduce framework, which avoids the prior-art problems of high extraction difficulty and low extraction efficiency.

Description

Data extraction method, system, and device

Technical field

The present invention relates to the field of big data, and in particular to a data extraction method, system, and device.

Background

With the development of information technology, every industry generates large amounts of data all the time, and the mass data generated is generally stored in a pre-established data warehouse. As the information systems of these industries have developed, a demand has arisen to extract target data from the big data stored in such databases. In the prior art, the extraction requirement must first be analyzed; corresponding code is then written under the map-reduce framework according to that requirement, and the target data is extracted from the big data by running that code.

Through research on existing data extraction processes, the inventor found that writing extraction code under the map-reduce framework requires a degree of professional expertise that ordinary users do not easily acquire, which makes extraction difficult and inefficient.

Summary of the invention

In view of this, the present invention provides a data extraction method to solve the prior-art problem that writing extraction code under the map-reduce framework requires professional expertise that ordinary users do not easily acquire, which makes extraction difficult and inefficient. The specific scheme is as follows:

A data extraction method, applied to a big data cluster, comprising:

when an extraction request for target data is received, obtaining a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name;

according to the data source set, searching a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set;

performing distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data;

extracting the target data.

In the above method, preferably, obtaining the data source set corresponding to the target data when an extraction request for the target data is received comprises:

searching a request task queue for the extraction code corresponding to the extraction request;

uploading the extraction code to the big data cluster;

when the upload finishes, parsing the extraction code to obtain each group of data source information contained in the extraction code;

storing each group of data source information in the data source set.

In the above method, preferably, searching the preset data warehouse for the target data set that matches the data source set according to the data source set comprises:

parsing the data source set to obtain each data table name contained in the data source set and the storage path of each data table name;

for each storage path and its corresponding data table name, determining whether the preset data warehouse contains a data path identical to the storage path; if so, checking whether that data path contains a data source with the same name as the data table name; and if so, taking the data source with the same name as the data table name as target data.

In the above method, preferably, performing distributed computation on the first target data in the target data set according to the preset computation rule to obtain the target data comprises:

obtaining each target keyword corresponding to the first target data;

for each target keyword, searching the target data set for the sub-data source that matches the target keyword, and storing the sub-data source in a sub-data source set;

when a lookup-complete instruction is received, integrating the data sources in the sub-data source set to obtain the target data.

In the above method, preferably, the method further comprises:

storing the target data in a preset to-be-extracted area in the big data cluster.

A data extraction system, applied to a big data cluster, comprising:

an acquisition module, configured to obtain, when an extraction request for target data is received, a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name;

a searching module, configured to search, according to the data source set, a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set;

a computing module, configured to perform distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data;

an extraction module, configured to extract the target data.

In the above system, preferably, the acquisition module comprises:

a first searching unit, configured to search a request task queue for the extraction code corresponding to the extraction request;

an uploading unit, configured to upload the extraction code to the big data cluster;

a first acquisition unit, configured to parse the extraction code when the upload finishes and obtain each group of data source information contained in the extraction code;

a storage unit, configured to store each group of data source information in the data source set.

In the above system, preferably, the searching module comprises:

a second acquisition unit, configured to parse the data source set and obtain each data table name contained in the data source set and the storage path of each data table name;

a judging unit, configured to determine, for each storage path and its corresponding data table name, whether the preset data warehouse contains a data path identical to the storage path; if so, to check whether that data path contains a data source with the same name as the data table name; and if so, to take the data source with the same name as the data table name as target data.

In the above system, preferably, the computing module comprises:

a third acquisition unit, configured to obtain each target keyword corresponding to the first target data;

a second searching unit, configured to search, for each target keyword, the target data set for the sub-data source that matches the target keyword, and to store the sub-data source in a sub-data source set;

an integration unit, configured to integrate the data sources in the sub-data source set when a lookup-complete instruction is received, to obtain the target data.

A data extraction device, applied to a Web server and a big data cluster, comprising an input device, an issuing device, and a processor, wherein:

the input device runs on the Web server and is configured to provide an authoring tool for writing extraction code and to store the extraction code in a request task queue;

the issuing device runs on a submission machine of the big data cluster and is configured to obtain the extraction code from the request task queue and submit the extraction code to the processor;

the processor runs on the submission machine of the big data cluster and is configured to: when an extraction request for target data is received, obtain a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name; according to the data source set, search a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set; perform distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data; and extract the target data.
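The division of labor among these three components can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names `input_unit`, `issuing_unit`, and `processor`, and the use of an in-process `queue.Queue` as the request task queue, are assumptions made for illustration only. In a real deployment the queue would sit between a Web server and a cluster submission machine.

```python
import queue


def input_unit(task_queue: queue.Queue, extraction_code: str) -> None:
    """Web-server side: store written extraction code in the request task queue."""
    task_queue.put(extraction_code)


def processor(extraction_code: str) -> str:
    """Placeholder processor; a real one would run the four extraction steps."""
    return f"processed: {extraction_code}"


def issuing_unit(task_queue: queue.Queue) -> list[str]:
    """Submission-machine side: drain the queue and submit each item to the processor."""
    results = []
    while True:
        try:
            code = task_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to submit
        results.append(processor(code))
    return results
```

The queue decouples the component where users write code from the component that runs on the cluster, which appears to be the point of separating the two in the text.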

Compared with the prior art, the present invention has the following advantages:

The invention discloses a data extraction method, comprising: when an extraction request for target data is received, obtaining a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name; according to the data source set, searching a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set; performing distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data; and extracting the target data. With this method, extraction code no longer needs to be written under the map-reduce framework, which avoids the prior-art problems of high extraction difficulty and low extraction efficiency.

Of course, a product implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a data extraction method disclosed in an embodiment of the present application;

Fig. 2 is another flowchart of a data extraction method disclosed in an embodiment of the present application;

Fig. 3 is a further flowchart of a data extraction method disclosed in an embodiment of the present application;

Fig. 4 is a structural block diagram of a data extraction system disclosed in an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The present invention provides a data extraction method and system. The extraction method is applied to the process of extracting data in a big data cluster; the big data cluster may be Hadoop, Spark, or the like. The method is executed by a data extraction tool based on data warehouse ETL technology, by a processor, or the like. ETL describes the process of moving data from a source to a destination through extraction (extract), transformation (transform), and loading (load). The flow of the data extraction method is shown in Fig. 1 and includes the following steps:

S101: when an extraction request for target data is received, obtain a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name;

In the embodiment of the present invention, the extraction request is extraction code entered through a unified input mode; preferably, the extraction code is written in the SQL language. In the embodiment of the present invention, bql (baize-query-language) is preferably used. The extraction code contains at least one group of data source information, each group consisting of a data table name and the storage path of that data table name. The data source set may be a database, a stored list, or another preferred form.

S102: according to the data source set, search a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set;

In the embodiment of the present invention, the preset data warehouse is a big data platform that uses distributed storage; the stored data formats may be parquet, CSV, hbase, json, and so on. Each group of data source information corresponds to one piece of target data, and that target data may contain multiple data items.

S103: perform distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data;

In the embodiment of the present invention, the preset computation rule is a distributed computation rule based on the map-reduce algorithm.
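The text does not spell out the computation rule, so the following Python sketch only illustrates the general map-reduce pattern such a rule would build on. Everything in it is an assumption for illustration: the record layout (dictionaries with an `id` field), the counting logic, and the function names. Partitions are processed in a single process here, whereas a cluster would distribute them across nodes.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit one (key, 1) pair per record in a partition."""
    return [(record["id"], 1) for record in partition]


def shuffle(mapped_partitions):
    """Shuffle: group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_map_reduce(partitions):
    """Run map, shuffle, and reduce over a list of partitions."""
    return reduce_phase(shuffle(map_phase(p) for p in partitions))
```

For example, `run_map_reduce([[{"id": 1}, {"id": 2}], [{"id": 1}]])` counts how many records share each `id` across both partitions.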

S104: extract the target data.

In the embodiment of the present invention, preferably, after the target data is obtained, it is stored in a preset to-be-extracted area in the big data cluster, so that it can be reused by other data extractions.

In the embodiment of the present invention, the extraction code may take the following form:

    SELECT * FROM csv_tb1 AS tb1
    LEFT JOIN hbase_tb2 AS tb2
    ON tb1.id = tb2.id
    INNER JOIN tb3 AS tb3
    ON tb1.id = tb3.id
    WHERE tb1.id = xxx

In the embodiment of the present invention, the data source information is as shown in Table 1:

Data table name	Path
tb3	parquet
csv_tb1	csv
hbase_tb2	hbase

Table 1
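The relationship between the extraction code and Table 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's parser: the `DATA_SOURCES` catalog, the regular expression, and the function name `extract_data_source_set` are all hypothetical, and the regular expression only handles simple `FROM` and `JOIN` clauses like those in the example query (written here with standard spacing and keywords).

```python
import re

# Hypothetical catalog mirroring Table 1: data table name -> storage path.
DATA_SOURCES = {
    "tb3": "parquet",
    "csv_tb1": "csv",
    "hbase_tb2": "hbase",
}

# Matches the identifier that follows FROM or JOIN (case-insensitive).
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def extract_data_source_set(extraction_code: str) -> list[tuple[str, str]]:
    """Return (data table name, storage path) pairs referenced by the code."""
    pairs = []
    for name in TABLE_REF.findall(extraction_code):
        if name not in DATA_SOURCES:
            raise KeyError(f"unknown data table name: {name}")
        pair = (name, DATA_SOURCES[name])
        if pair not in pairs:  # keep each group of data source information once
            pairs.append(pair)
    return pairs


EXAMPLE_CODE = """
SELECT * FROM csv_tb1 AS tb1
LEFT JOIN hbase_tb2 AS tb2 ON tb1.id = tb2.id
INNER JOIN tb3 AS tb3 ON tb1.id = tb3.id
WHERE tb1.id = xxx
"""
```

Calling `extract_data_source_set(EXAMPLE_CODE)` yields one pair per table in the order the tables appear in the query.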

In the embodiment of the present invention, the flow of obtaining the data source set corresponding to the target data when an extraction request for the target data is received is shown in Fig. 2 and includes the following steps:

S201: search the request task queue for the extraction code corresponding to the extraction request;

In the embodiment of the present invention, the extraction code corresponding to the extraction request is stored in the request task queue, which may hold multiple pieces of extraction code. The lookup may be performed by extraction keyword or by extraction identifier; the extraction identifier may be a server serial number, a preset number, or the like. The extraction code is written on the principle that it is determined according to the keyword corresponding to the extraction request.

S202: upload the extraction code to the big data cluster;

S203: when the upload finishes, parse the extraction code to obtain each group of data source information contained in the extraction code;

In the embodiment of the present invention, the extraction code is written in the unified input mode and contains at least one group of data source information. Each group of data source information is written in a preset manner; the extraction code is traversed to obtain all the data source information it contains. The data source set may contain one group of data source information or multiple groups.

S204: store each group of data source information in the data source set.
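Steps S201 to S204 can be sketched as a small Python pipeline. The sketch is illustrative only: `ExtractionTask`, `Cluster`, `find_task`, and `acquire_data_source_set` are hypothetical names, the cluster is a stand-in that merely records uploads, and the parser is passed in as a function because the text does not define the parsing rules.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionTask:
    task_id: str  # hypothetical extraction identifier
    code: str     # SQL-like extraction code


@dataclass
class Cluster:
    """Stand-in for the big data cluster; it only records what was uploaded."""
    uploaded: dict = field(default_factory=dict)

    def upload(self, task: ExtractionTask) -> None:
        self.uploaded[task.task_id] = task.code


def find_task(task_queue, task_id):
    """S201: look up the extraction code by its extraction identifier."""
    for task in task_queue:
        if task.task_id == task_id:
            return task
    raise LookupError(f"no extraction code for request {task_id}")


def acquire_data_source_set(task_queue, task_id, cluster, parse):
    """Run S201 to S204 and return the data source set as a list."""
    task = find_task(task_queue, task_id)           # S201
    cluster.upload(task)                            # S202
    groups = parse(cluster.uploaded[task.task_id])  # S203, after the upload ends
    data_source_set = []
    for group in groups:                            # S204
        if group not in data_source_set:
            data_source_set.append(group)
    return data_source_set
```

Here `parse` is any callable that turns extraction code into (data table name, storage path) pairs.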

In the embodiment of the present invention, the process of searching the preset data warehouse for the target data set that matches the data source set is as follows: parse the data source set to obtain each data table name contained in the data source set and the storage path of each data table name; for each storage path and its corresponding data table name, determine whether the preset data warehouse contains a data path identical to the storage path; if so, check whether that data path contains a data source with the same name as the data table name; and if so, take the data source with the same name as the data table name as first target data.

The preset data warehouse uses distributed storage: data or files of each type are stored under their own storage path, and each path may hold multiple data table names.
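The two-level check described above (first the path, then the table name under that path) can be sketched in Python. The warehouse is modeled as a nested dictionary purely for illustration; the function name `find_first_target_data` and this layout are assumptions, not part of the source.

```python
def find_first_target_data(data_source_set, warehouse):
    """Return {(storage path, table name): data} for each group found in the warehouse.

    data_source_set: iterable of (data table name, storage path) pairs.
    warehouse: dict mapping a data path to a dict of table name -> data.
    """
    target_data_set = {}
    for table_name, store_path in data_source_set:
        tables = warehouse.get(store_path)
        if tables is None:
            continue  # no data path identical to the storage path
        if table_name in tables:
            # A data source with the same name exists under the path.
            target_data_set[(store_path, table_name)] = tables[table_name]
    return target_data_set
```

Groups whose path or table name is absent are simply skipped in this sketch; the text does not say how such misses should be reported.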

In the embodiment of the present invention, the flow of performing distributed computation on the first target data in the target data set according to the preset computation rule to obtain the target data is shown in Fig. 3 and includes the following steps:

S301: obtain each target keyword corresponding to the first target data;

S302: for each target keyword, search the target data set for the sub-data source that matches the target keyword, and store the sub-data source in a sub-data source set.

In the embodiment of the present invention, only part of the target data matches a given keyword; for each keyword, the sub-data source corresponding to that keyword is determined.

S303: when a lookup-complete instruction is received, integrate the data sources in the sub-data source set to obtain the target data.

In the embodiment of the present invention, when the lookup-complete instruction is received, the sub-data sources in the sub-data source set are integrated. The integration principle is to take the intersection of the sub-data sources in the sub-data source set; that intersection is the target data.
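Steps S301 to S303 amount to one filter per keyword followed by an intersection. A minimal Python sketch, assuming records are dictionaries with a unique `id` field and that a record matches a keyword when any of its field values contains it (both are assumptions; the text does not define the matching rule):

```python
def match_keyword(rows, keyword):
    """S302: ids of the rows in which some field value contains the keyword."""
    return {
        row["id"]
        for row in rows
        if any(keyword in str(value) for value in row.values())
    }


def integrate(rows, keywords):
    """S303: intersect the sub-data sources found for each keyword."""
    sub_data_sources = [match_keyword(rows, kw) for kw in keywords]
    if not sub_data_sources:
        return set()  # no keywords, so nothing to intersect
    return set.intersection(*sub_data_sources)


ROWS = [
    {"id": 1, "city": "Beijing", "status": "active"},
    {"id": 2, "city": "Beijing", "status": "closed"},
    {"id": 3, "city": "Shanghai", "status": "active"},
]
```

With the keywords `"Beijing"` and `"active"`, the two sub-data sources are the id sets {1, 2} and {1, 3}, and their intersection contains only the record with id 1.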

In the embodiment of the present invention, corresponding to the above data extraction method, the present invention also provides a data extraction system. The structural block diagram of the extraction system is shown in Fig. 4, and the system comprises:

an acquisition module 401, a searching module 402, a computing module 403, and an extraction module 404.

Wherein,

the acquisition module 401 is configured to obtain, when an extraction request for target data is received, a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name;

the searching module 402 is configured to search, according to the data source set, a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set;

the computing module 403 is configured to perform distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data;

the extraction module 404 is configured to extract the target data.

The invention discloses a data extraction system which: when an extraction request for target data is received, obtains a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name; according to the data source set, searches a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set; performs distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data; and extracts the target data. With this system, extraction code no longer needs to be written under the map-reduce framework, which avoids the prior-art problems of high extraction difficulty and low extraction efficiency.

In the embodiment of the present invention, the acquisition module 401 comprises:

a first searching unit 405, an uploading unit 406, a first acquisition unit 407, and a storage unit 408.

Wherein,

the first searching unit 405 is configured to search the request task queue for the extraction code corresponding to the extraction request;

the uploading unit 406 is configured to upload the extraction code to the big data cluster;

the first acquisition unit 407 is configured to parse the extraction code when the upload finishes and obtain each group of data source information contained in the extraction code;

the storage unit 408 is configured to store each group of data source information in the data source set.

In the embodiment of the present invention, the searching module 402 comprises:

a second acquisition unit 409 and a judging unit 410.

Wherein,

the second acquisition unit 409 is configured to parse the data source set and obtain each data table name contained in the data source set and the storage path of each data table name;

the judging unit 410 is configured to determine, for each storage path and its corresponding data table name, whether the preset data warehouse contains a data path identical to the storage path; if so, to check whether that data path contains a data source with the same name as the data table name; and if so, to take the data source with the same name as the data table name as target data.

In the embodiment of the present invention, the computing module 403 comprises:

a third acquisition unit 411, a second searching unit 412, and an integration unit 413.

Wherein,

the third acquisition unit 411 is configured to obtain each target keyword corresponding to the first target data;

the second searching unit 412 is configured to search, for each target keyword, the target data set for the sub-data source that matches the target keyword, and to store the sub-data source in the sub-data source set;

the integration unit 413 is configured to integrate the data sources in the sub-data source set when a lookup-complete instruction is received, to obtain the target data.

In the embodiment of the present invention, based on the above data extraction method and system, the present invention also provides a data extraction device, applied to a Web server and a big data cluster, comprising an input device, an issuing device, and a processor, wherein:

It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise" and "include", and any other variants of them, are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to that process, method, article, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.

For convenience of description, the above device is described with its functions divided into various units. Of course, when the present invention is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.

The data extraction method and system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A data extraction method, characterized in that it is applied to a big data cluster and comprises:

when an extraction request for target data is received, obtaining a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name;

according to the data source set, searching a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set;

performing distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data;

extracting the target data.

2. The method according to claim 1, characterized in that obtaining the data source set corresponding to the target data when an extraction request for the target data is received comprises:

uploading the extraction code to the big data cluster;

3. The method according to claim 1, characterized in that searching the preset data warehouse for the target data set that matches the data source set according to the data source set comprises:

parsing the data source set to obtain each data table name contained in the data source set and the storage path of each data table name;

for each storage path and its corresponding data table name, determining whether the preset data warehouse contains a data path identical to the storage path; if so, checking whether that data path contains a data source with the same name as the data table name; and if so, taking the data source with the same name as the data table name as target data.

4. The method according to claim 1, characterized in that performing distributed computation on the first target data in the target data set according to the preset computation rule to obtain the target data comprises:

obtaining each target keyword corresponding to the first target data;

for each target keyword, searching the target data set for the sub-data source that matches the target keyword, and storing the sub-data source in a sub-data source set;

5. The method according to claim 1, characterized in that it further comprises:

storing the target data in a preset to-be-extracted area in the big data cluster.

6. A data extraction system, characterized in that it is applied to a big data cluster and comprises:

an acquisition module, configured to obtain, when an extraction request for target data is received, a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name;

a searching module, configured to search, according to the data source set, a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set;

a computing module, configured to perform distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data;

an extraction module, configured to extract the target data.

7. The system according to claim 6, characterized in that the acquisition module comprises:

an uploading unit, configured to upload the extraction code to the big data cluster;

a first acquisition unit, configured to parse the extraction code when the upload finishes and obtain each group of data source information contained in the extraction code;

8. The system according to claim 6, characterized in that the searching module comprises:

a second acquisition unit, configured to parse the data source set and obtain each data table name contained in the data source set and the storage path of each data table name;

a judging unit, configured to determine, for each storage path and its corresponding data table name, whether the preset data warehouse contains a data path identical to the storage path; if so, to check whether that data path contains a data source with the same name as the data table name; and if so, to take the data source with the same name as the data table name as target data.

9. The system according to claim 6, characterized in that the computing module comprises:

a second searching unit, configured to search, for each target keyword, the target data set for the sub-data source that matches the target keyword, and to store the sub-data source in a sub-data source set;

an integration unit, configured to integrate the data sources in the sub-data source set when a lookup-complete instruction is received, to obtain the target data.

10. A data extraction device, characterized in that it is applied to a Web server and a big data cluster and comprises an input device, an issuing device, and a processor, wherein:

the input device runs on the Web server and is configured to provide an authoring tool for writing extraction code and to store the extraction code in a request task queue;

the issuing device runs on a submission machine of the big data cluster and is configured to obtain the extraction code from the request task queue and submit the extraction code to the processor;

the processor runs on the submission machine of the big data cluster and is configured to: when an extraction request for target data is received, obtain a data source set corresponding to the target data, the data source set containing at least one group of data source information, each group consisting of a data table name and the storage path of that data table name; according to the data source set, search a preset data warehouse for a target data set that matches the data source set, the target data set containing at least one piece of first target data, wherein the data source information in the data source set corresponds to the first target data in the target data set; perform distributed computation on the first target data in the target data set according to a preset computation rule to obtain the target data; and extract the target data.