CN104182502B

CN104182502B - Data extraction method and device

Info

Publication number: CN104182502B
Application number: CN201410406481.6A
Authority: CN
Inventors: 曹连超; 辛国茂; 亓开元; 刘伟; 李占强; 卢军佐
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-08-18
Filing date: 2014-08-18
Publication date: 2017-10-27
Anticipated expiration: 2034-08-18
Also published as: CN104182502A

Abstract

The present invention provides a data extraction method applied to a relational database. The method includes: dividing a selected data table into M data partitions according to the value-range distribution of a field in the table, where the field is of a numeric type or its values can be converted to numeric values; calculating a weight for each data partition from the number of data rows in that partition; allocating a number of threads to each data partition according to its weight, where the thread counts allocated to all data partitions sum to a preset total thread count N and M ≤ N; and starting the N threads and extracting data from each data partition with the number of threads allocated to it. By dividing the data table into several data partitions and dynamically allocating the thread count of each partition, the invention solves the problem of uneven data distribution among threads and improves the efficiency of data extraction from relational databases.

Description

Data extraction method and device

Technical field

The present invention relates to the field of data extraction, and in particular to a data extraction method and device for relational databases.

Background

Data integration logically or physically consolidates data of different sources, formats, and characteristics in order to provide comprehensive data sharing; it is an important component of enterprise business intelligence and data warehousing. ETL is the primary solution for enterprise data integration. The three letters of ETL stand for Extract, Transform, and Load. Data extraction is the process of extracting data from a data source. In practice, the data source is most often a relational database.

Data can be extracted from a relational database either by directly exporting backup data or by reading data through an interface such as ODBC or JDBC. Reading through an interface such as ODBC or JDBC is the more flexible approach: it supports both full extraction and incremental extraction. However, unless multiple threads run in parallel, extraction through such interfaces is relatively inefficient, particularly in the big-data era, when database tables with hundreds of millions of rows often have to be extracted. Parallel multi-threaded extraction requires the data in the data source to be split in advance. If the data entries are unevenly distributed among the threads, the efficiency of multithreading is greatly reduced; but making the distribution highly uniform requires computing the detailed distribution of the data in the table, which means performing a large number of database operations before extraction and therefore reduces extraction efficiency. This patent proposes the concept of data pre-partitioning: a simple database pre-operation obtains the number of data entries in each data partition, and extraction threads are dynamically allocated to each partition according to that number, which effectively solves the above problem.

Summary of the invention

The technical problem to be solved by the present invention is to provide a data extraction method for relational databases that improves the efficiency of data extraction.

To solve the above technical problem, the present invention provides a data extraction method applied to a relational database, the method including:

Dividing a selected data table into M data partitions according to the value-range distribution of a field in the table, where the field is of a numeric type or its values can be converted to numeric values;

Calculating the weight of each data partition according to the number of data rows in each data partition;

Allocating a thread count to each data partition according to the weight of each data partition, where the thread counts allocated to the data partitions sum to a preset total thread count N and M ≤ N;

Starting the N threads and, according to the allocated thread counts, extracting data from each data partition with the corresponding number of threads.

Preferably,

Calculating the weight of each data partition according to the number of data rows in each data partition includes:

Obtaining the number of data rows $C_m$ of each data partition, $1 \le m \le M$;

The weight of the m-th data partition is $w_m = C_m / C$, where $C = C_1 + \dots + C_m + \dots + C_M$; the weights of all data partitions sum to 1;

Allocating a thread count to each data partition according to the weight of each data partition includes:

Allocating $\mathrm{INT}(w_m N)$ threads to the m-th data partition, where INT denotes rounding down;

Allocating the remaining $N_o$ unallocated threads to $N_o$ of the data partitions, where $N_o = N - \sum_{m=1}^{M} \mathrm{INT}(w_m N)$.
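As an illustrative worked example of the two allocation steps above (the row counts and the thread total are made-up values, not taken from the source), take M = 4 partitions with row counts 470, 280, 190 and 60 and a preset total of N = 10 threads:

```latex
\begin{aligned}
C &= 470 + 280 + 190 + 60 = 1000, \\
(w_1, w_2, w_3, w_4) &= (0.47,\ 0.28,\ 0.19,\ 0.06), \\
(w_1 N, \dots, w_4 N) &= (4.7,\ 2.8,\ 1.9,\ 0.6), \\
\bigl(\mathrm{INT}(w_1 N), \dots, \mathrm{INT}(w_4 N)\bigr) &= (4,\ 2,\ 1,\ 0), \\
N_o &= 10 - (4 + 2 + 1 + 0) = 3 .
\end{aligned}
```

If the three leftover threads go to the partitions with the largest fractional parts (0.9, 0.8 and 0.7), which is the rule Embodiment 1 describes, the final allocation is (5, 3, 2, 0). The fourth partition then holds rows but no thread, which is the situation the merging step addresses.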

Preferably,

After allocating the thread counts to the data partitions according to their weights and before starting the N threads, the method further includes:

If the number of threads allocated to a data partition is greater than or equal to 2, dividing that data partition into data sub-partitions, where the number of data sub-partitions of the data partition equals the number of threads allocated to it, and each data sub-partition corresponds to one thread.

Preferably,

Merging the i-th data partition with the j-th data partition, where the number of threads allocated to the i-th data partition is 0, the number of threads allocated to the j-th data partition is not 0, 1 ≤ i ≤ M, 1 ≤ j ≤ M, and i ≠ j.

Preferably,

Extracting data from each data partition with the corresponding number of threads according to the allocated thread counts includes:

According to the thread count assigned to each data sub-partition of each data partition, extracting data from each data sub-partition with the corresponding number of threads.

The present invention also provides a data extraction device applied to a relational database. The device includes a division module, an allocation module, and an extraction module, wherein:

The division module is configured to divide a selected data table into M data partitions according to the value-range distribution of a field in the table, where the field is of a numeric type or its values can be converted to numeric values;

The allocation module further includes a weight calculation unit and a thread allocation unit;

The weight calculation unit is configured to calculate the weight of each data partition according to the number of data rows in each data partition;

The thread allocation unit is configured to allocate a thread count to each data partition according to the weight of each data partition, where the thread counts allocated to the data partitions sum to a preset total thread count N and M ≤ N;

The extraction module is configured to start the N threads and, according to the allocated thread counts, extract data from each data partition with the corresponding number of threads.

Preferably,

The weight calculation unit calculating the weight of each data partition according to the number of data rows in each data partition means:

Obtaining the number of data rows $C_m$ of each data partition, $1 \le m \le M$;

The thread allocation unit allocating a thread count to each data partition according to the weight of each data partition means:

Preferably,

The device further includes a sub-partition module,

The sub-partition module is configured to, when the thread allocation unit allocates 2 or more threads to a data partition, divide that data partition into data sub-partitions, where the number of data sub-partitions of the data partition equals the number of threads allocated to it, and each data sub-partition corresponds to one thread.

Preferably,

The device further includes a merging module,

The merging module is configured to merge the i-th data partition with the j-th data partition, where the number of threads allocated to the i-th data partition is 0, the number of threads allocated to the j-th data partition is not 0, 1 ≤ i ≤ M, 1 ≤ j ≤ M, and i ≠ j.

Preferably,

The extraction module extracting data from each data partition with the corresponding number of threads according to the allocated thread counts means:

By dividing the data table into several data partitions and dynamically allocating the thread count of each data partition, the above scheme solves the problem of uneven data distribution among threads and improves the efficiency of data extraction from relational databases.

Brief description of the drawings

Fig. 1 is a flow chart of the data extraction method in Embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of the data partitions in the data extraction method in Embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of the data partitions in the data extraction method in Embodiment 1 of the present invention;

Fig. 4 is a structural diagram of the data extraction device in Embodiment 1 of the present invention.

Detailed description

To make the purpose, technical scheme, and advantages of the application clearer, embodiments of the application are described in detail below with reference to the drawings. It should be noted that, provided they do not conflict, the embodiments in the application and the features in the embodiments may be combined with one another.

To effectively avoid the inefficiency caused by uneven distribution of data among threads during multi-threaded data extraction, the present invention proposes partitioning the value range of the data to be extracted, then calculating a weight for each partition and dynamically allocating extraction threads to each partition accordingly. The user can set the number of partitions and the number of threads according to actual conditions. Data partitioning turns a global problem into a series of local problems, and provides a basis for allocating thread resources in line with the distribution of the data. The implementation steps of the invention are described in detail below with reference to the drawings.

Embodiment 1

As shown in Fig. 1, the data extraction method of the present invention, applied to a relational database, includes:

S101: Dividing a selected data table into M data partitions according to the value-range distribution of a field in the table, where the field is of a numeric type or its values can be converted to numeric values;

The user can preset the number M of data partitions and the total number N of threads to be allocated.

Specifically, after a field id is selected, the minimum value Min(id) and the maximum value Max(id) of the field id in the database are queried by executing the following SQL statement in the relational database through an ODBC or JDBC interface:

SELECT MAX(id), MIN(id) FROM [table name]

The value range [Min(id), Max(id)] of the field id is divided evenly into M data partitions. As shown in Fig. 2, the interval between the minimum value Min(id) and the maximum value Max(id) of the field id is split evenly among the M data partitions, which are numbered 1 to M.

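Let the interval of the m-th data partition be RG(m), with left and right boundaries $R_{left}(m)$ and $R_{right}(m)$. The interval of the m-th data partition is then:

$$R_{left}(m) = Min(id) + (m-1)\,\frac{Max(id) - Min(id)}{M}, \qquad R_{right}(m) = Min(id) + m\,\frac{Max(id) - Min(id)}{M}$$

$$RG(m) = [R_{left}(m),\ R_{right}(m)) \ \text{for}\ 1 \le m < M, \qquad RG(M) = [R_{left}(M),\ Max(id)]$$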
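The even split of step S101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `partition_bounds` is hypothetical, and the half-open/closed treatment of the boundaries is inferred from the count queries used in step S102.

```python
def partition_bounds(min_id, max_id, m_parts):
    """Split [min_id, max_id] into m_parts equal-width intervals.

    Returns a list of (left, right) pairs. Every interval is meant to be
    read as half-open, [left, right), except the last one, which is
    closed at max_id so that the maximum value is not lost.
    """
    if m_parts < 1 or max_id < min_id:
        raise ValueError("need m_parts >= 1 and max_id >= min_id")
    width = (max_id - min_id) / m_parts
    bounds = []
    for m in range(1, m_parts + 1):
        left = min_id + (m - 1) * width
        # Use max_id itself for the last right edge to avoid rounding drift.
        right = max_id if m == m_parts else min_id + m * width
        bounds.append((left, right))
    return bounds


bounds = partition_bounds(0, 100, 4)
```

With Min(id) = 0, Max(id) = 100 and M = 4, the partitions are [0, 25), [25, 50), [50, 75) and [75, 100].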

S102: Calculating the weight of each data partition according to the number of data rows in each data partition;

First, the number of data rows $C_m$ of each data partition must be obtained, $1 \le m \le M$;

The weight of the m-th data partition is $w_m = C_m / C$, where $C = C_1 + \dots + C_m + \dots + C_M$; the weights of all data partitions sum to 1.

In practice, the row counts of the M data partitions can be obtained in parallel by executing SQL queries through a database interface such as ODBC or JDBC. For a partition numbered m (1 ≤ m < M), the corresponding thread executes the following SQL query in the relational database through the ODBC or JDBC interface:

SELECT COUNT(*) FROM [table name] WHERE id >= R_left(m) AND id < R_right(m)

For the partition numbered m = M, the corresponding thread executes the following SQL query in the relational database through the ODBC or JDBC interface:

SELECT COUNT(*) FROM [table name] WHERE id >= R_left(m) AND id <= Max(id)

Let the row count obtained for the m-th partition be $C_m$. The total row count C of the data table to be extracted is then:

$$C = C_1 + \dots + C_m + \dots + C_M, \quad 1 \le m \le M$$

The weight $w_m$ of the m-th data partition can be set according to the following formula:

$$w_m = \frac{C_m}{C}, \qquad \sum_{m=1}^{M} w_m = 1$$

In this embodiment, according to the above formula, the more data rows a data partition has, the larger its weight.

In other embodiments, the weight of each data partition can also be set according to other rules.
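The per-partition count queries can be sketched as below. This is only an illustration under stated assumptions: it uses Python's standard `sqlite3` module in place of ODBC/JDBC, runs the queries one after another rather than in parallel, and the table `t`, its columns, and the helper `count_rows` are hypothetical names.

```python
import sqlite3


def count_rows(conn, table, bounds):
    """Return [C_1, ..., C_M] for the given partition bounds.

    `bounds` is a list of (left, right) pairs; all intervals are
    half-open except the last, which also includes its right end.
    `table` is interpolated into the SQL text, so it must be trusted.
    """
    counts = []
    last = len(bounds) - 1
    for i, (left, right) in enumerate(bounds):
        op = "<=" if i == last else "<"
        sql = f"SELECT COUNT(*) FROM {table} WHERE id >= ? AND id {op} ?"
        counts.append(conn.execute(sql, (left, right)).fetchone()[0])
    return counts


# Small in-memory demo table with unevenly distributed ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row{i}") for i in (1, 2, 3, 10, 50, 51, 99, 100)])
lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM t").fetchone()
width = (hi - lo) / 4
bounds = [(lo + k * width, hi if k == 3 else lo + (k + 1) * width)
          for k in range(4)]
counts = count_rows(conn, "t", bounds)
total = sum(counts)  # C = C_1 + ... + C_M
```

Closing the last interval on the right is what keeps the row with id = Max(id) from being dropped, so the counts always add up to the table's total row count.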
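The weight formula of this embodiment is small enough to sketch directly. The function name `partition_weights` and the sample row counts are illustrative assumptions; exact fractions are used so that the weights sum to exactly 1.

```python
from fractions import Fraction


def partition_weights(counts):
    """Weight of each partition: w_m = C_m / C, with C the total row count."""
    total = sum(counts)
    if total == 0:
        raise ValueError("table is empty; weights are undefined")
    return [Fraction(c, total) for c in counts]


weights = partition_weights([470, 280, 190, 60])
```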

S103: Allocating a thread count to each data partition according to the weight of each data partition, where the thread counts allocated to the data partitions sum to a preset total thread count N and M ≤ N;

The number of threads used to extract data from each partition is allocated dynamically according to the weight of each data partition.

Ideally, the m-th data partition is allocated $n_{int}(m) = \mathrm{INT}(w_m N)$ threads, where INT denotes rounding down;

Since $w_m N$ may be fractional, let $n_{dec}(m) = w_m N - \mathrm{INT}(w_m N)$; the number of threads still unallocated is $N_o = N - \sum_{m=1}^{M} n_{int}(m)$.

The elements of the set $\{n_{dec}(1), \dots, n_{dec}(m), \dots, n_{dec}(M)\}$ ($1 \le m \le M$) are traversed, and the partition numbers m of the $N_o$ largest element values, taken in descending order, form a new set K. For each $k_x \in K$, the thread count allocated to the data partition numbered $k_x$ is increased by 1, i.e. the number of threads extracting data from the data partition numbered $k_x$ is $n_{int}(k_x) + 1$.

At this point, all N threads have been allocated.
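The allocation rule of step S103 (round down, then give the leftover threads to the largest fractional parts) can be sketched as follows. The function name `allocate_threads` is hypothetical, the tie-breaking order is an assumption because the source does not specify one, and integer arithmetic replaces the floating-point product w_m·N so that the fractional parts are compared exactly.

```python
def allocate_threads(counts, n_threads):
    """Distribute n_threads over partitions in proportion to row counts.

    Each partition first gets INT(w_m * N); the N_o leftover threads go,
    one each, to the partitions with the largest fractional parts.
    Since w_m * N = C_m * N / C, integer division gives INT(w_m * N)
    and the remainder is the fractional part scaled by C.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no rows to extract")
    if len(counts) > n_threads:
        raise ValueError("the method assumes M <= N")
    base = [c * n_threads // total for c in counts]  # n_int(m)
    frac = [c * n_threads % total for c in counts]   # n_dec(m) * C
    leftover = n_threads - sum(base)                 # N_o
    # Partition indices by descending fractional part; ties are broken by
    # lower index, which is an assumption, not something the source states.
    order = sorted(range(len(counts)), key=lambda m: (-frac[m], m))
    for m in order[:leftover]:
        base[m] += 1
    return base


allocation = allocate_threads([470, 280, 190, 60], 10)
```

For row counts (470, 280, 190, 60) and N = 10 this yields (5, 3, 2, 0): the rounded-down shares are (4, 2, 1, 0) and the three leftover threads go to the fractional parts 0.9, 0.8 and 0.7.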

S104: Starting the N threads and, according to the allocated thread counts, extracting data from each data partition with the corresponding number of threads.

Specifically, according to the thread count assigned to each data sub-partition of each data partition, data is extracted from each data sub-partition with the corresponding number of threads.

In operation, let the left and right boundary values of the data sub-interval from which each thread extracts data be $r_{left}(x)$ and $r_{right}(x)$. When 1 ≤ x < N, the following SQL query is executed in the relational database through the ODBC or JDBC interface:

SELECT [field 1], [field 2], ... FROM [table name] WHERE id >= r_left(x) AND id < r_right(x)

When x = N, the following SQL statement is executed in the relational database through the ODBC or JDBC interface:

SELECT [field 1], [field 2], ... FROM [table name] WHERE id >= r_left(x) AND id <= r_right(x).

Preferably,

After step S103 and before S104, the method may further include:

S1031: If the number of threads allocated to a data partition is greater than or equal to 2, dividing that data partition into data sub-partitions, where the number of data sub-partitions of the data partition equals the number of threads allocated to it, and each data sub-partition corresponds to one thread.

In operation, let the number of threads allocated to the partition numbered m be $n_c(m)$, and let the left and right boundary values of the range from which each thread extracts data within a single partition be $r_{left}(x)$ and $r_{right}(x)$, where x is the thread number ($1 \le x \le n_c(m)$).

If $n_c(m)$ is not equal to 0, the sub-interval $rg_m(x)$ from which the x-th thread extracts data within the partition numbered m is expressed as:
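One way to realize this split is sketched below, assuming equal-width sub-intervals by analogy with step S101. Both the equal-width choice and the function name `sub_intervals` are assumptions made for illustration, not details confirmed by the text.

```python
def sub_intervals(left, right, n_c):
    """Split a partition [left, right) into n_c equal-width sub-intervals.

    Returns a list of (r_left, r_right) pairs, one per thread x = 1..n_c.
    The equal-width split is an assumption made for this sketch.
    """
    if n_c < 1:
        raise ValueError("partition has no threads; it must be merged instead")
    step = (right - left) / n_c
    return [(left + (x - 1) * step,
             right if x == n_c else left + x * step)
            for x in range(1, n_c + 1)]


subs = sub_intervals(0, 30, 3)
```

Under this assumption a partition [0, 30) with three threads is read as [0, 10), [10, 20) and [20, 30).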

Preferably,

After step S103 and before S104, the method may further include:

S1032: Merging the i-th data partition with the j-th data partition, where the number of threads allocated to the i-th data partition is 0, the number of threads allocated to the j-th data partition is not 0, 1 ≤ i ≤ M, 1 ≤ j ≤ M, and i ≠ j.

This step merges the interval of a data partition allocated 0 threads into an adjacent sub-interval of a neighbouring data partition whose allocated thread count is non-zero. Some data partitions may be allocated 0 extraction threads yet still contain data, so their intervals must be merged into an adjacent sub-interval of a neighbouring partition allocated more than 0 threads. By default, a data partition allocated 0 threads is merged into the adjacent sub-interval of the partition on its right; if the data partition allocated 0 threads lies at the end of the whole data range, it is merged into the adjacent sub-interval of the data partition on its left.

In operation, the merging can proceed as follows:

1) If the m-th partition is allocated 0 threads and its right-hand neighbour is allocated more than 0 threads, i.e. $n_c(m) = 0$ and $n_c(m+1) > 0$, then, as shown in Fig. 3, the interval RG(m) of the data partition numbered m is by default merged with the first data sub-interval $rg_{m+1}(1)$ inside the neighbouring data partition numbered m+1 on its right, i.e. $rg_{m+1}(1) = rg_{m+1}(1) \cup RG(m)$.

2) If the M-th partition is allocated 0 threads and its left-hand neighbour is allocated more than 0 threads ($n_c(M) = 0$ and $n_c(M-1) > 0$), the interval RG(M) of the data partition numbered M is by default merged with the $n_c(M-1)$-th data sub-interval $rg_{M-1}(n_c(M-1))$ (the rightmost data sub-interval inside that partition) of the neighbouring partition numbered M−1 on its left, i.e. $rg_{M-1}(n_c(M-1)) = rg_{M-1}(n_c(M-1)) \cup RG(M)$.

3) If several consecutive data partitions are allocated 0 threads, these data partitions are merged first, and then 1) or 2) is performed.

The merged data sub-intervals give the boundary values from which each thread extracts data.
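The three merging rules can be combined with sub-partitioning into one pass over the partitions, as in the sketch below. The function name `plan_ranges` is hypothetical, and the equal-width split inside each partition is an assumption; only the merge direction (right neighbour by default, left neighbour for a trailing run) follows the text.

```python
def plan_ranges(bounds, threads):
    """Build the final per-thread ranges after sub-partitioning and merging.

    bounds:  list of (left, right) per partition, in ascending order.
    threads: list of thread counts n_c(m), same length, not all zero.
    Returns a flat list of (left, right) ranges, one per thread.
    """
    if len(bounds) != len(threads) or sum(threads) == 0:
        raise ValueError("need one count per partition and at least one thread")
    ranges = []
    pending_left = None  # left edge of a run of zero-thread partitions
    for (left, right), n_c in zip(bounds, threads):
        if n_c == 0:
            # Rule 3: consecutive zero-thread partitions form a single run.
            if pending_left is None:
                pending_left = left
            continue
        step = (right - left) / n_c  # equal-width split (an assumption)
        subs = [(left + x * step,
                 right if x == n_c - 1 else left + (x + 1) * step)
                for x in range(n_c)]
        if pending_left is not None:
            # Rule 1: the run is absorbed by the first sub-interval on its right.
            subs[0] = (pending_left, subs[0][1])
            pending_left = None
        ranges.extend(subs)
    if pending_left is not None:
        # Rule 2: a trailing run extends the last sub-interval on its left.
        ranges[-1] = (ranges[-1][0], bounds[-1][1])
    return ranges


merged = plan_ranges([(0, 10), (10, 20), (20, 30), (30, 40)], [0, 2, 1, 0])
```

Here the first partition has no thread, so its interval is absorbed by the first sub-interval of the second partition, and the last partition is absorbed by the only sub-interval of the third. The resulting ranges remain contiguous and still cover the whole value range.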

As shown in Fig. 4, Embodiment 1 also provides a data extraction device, including a division module 11, an allocation module 12, and an extraction module 13, wherein:

The division module 11 is configured to divide a selected data table into M data partitions according to the value-range distribution of a field in the table, where the field is of a numeric type or its values can be converted to numeric values;

The allocation module 12 further includes a weight calculation unit 121 and a thread allocation unit 122;

The weight calculation unit 121 is configured to calculate the weight of each data partition according to the number of data rows in each data partition;

The thread allocation unit 122 is configured to allocate a thread count to each data partition according to the weight of each data partition, where the thread counts allocated to the data partitions sum to a preset total thread count N and M ≤ N;

The extraction module 13 is configured to start the N threads and, according to the allocated thread counts, extract data from each data partition with the corresponding number of threads.

Preferably,

The weight calculation unit 121 calculating the weight of each data partition according to the number of data rows in each data partition means:

Obtaining the number of data rows $C_m$ of each data partition, $1 \le m \le M$;

The thread allocation unit 122 allocating a thread count to each data partition according to the weight of each data partition means:

Preferably, the device further includes a sub-partition module 14,

The sub-partition module 14 is configured to, when the thread allocation unit allocates 2 or more threads to a data partition, divide that data partition into data sub-partitions, where the number of data sub-partitions of the data partition equals the number of threads allocated to it, and each data sub-partition corresponds to one thread.

Preferably, the device further includes a merging module 15,

The merging module 15 is configured to merge the i-th data partition with the j-th data partition, where the number of threads allocated to the i-th data partition is 0, the number of threads allocated to the j-th data partition is not 0, 1 ≤ i ≤ M, 1 ≤ j ≤ M, and i ≠ j.

Preferably,

The extraction module 13 extracting data from each data partition with the corresponding number of threads according to the allocated thread counts means:

A person of ordinary skill in the art will appreciate that all or some of the steps of the above method can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Alternatively, all or some of the steps of the above embodiments can be implemented with one or more integrated circuits. Accordingly, each module/unit in the above embodiments can be implemented in hardware or as a software function module. The application is not limited to any particular combination of hardware and software.

The foregoing describes only preferred embodiments of the application and is not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the application shall fall within its scope of protection.

Claims

1. A data extraction method applied to a relational database, characterized in that the method includes:

Dividing a selected data table into M data partitions according to the value-range distribution of a field in the table, where the field is of a numeric type or its values can be converted to numeric values;

Allocating a thread count to each data partition according to the weight of each data partition, where the thread counts allocated to the data partitions sum to a preset total thread count N and M ≤ N;

Starting the N threads and, according to the allocated thread counts, extracting data from each data partition with the corresponding number of threads;

After allocating the thread counts to the data partitions according to their weights and before starting the N threads, the method further includes:

Merging the i-th data partition with the j-th data partition, where the number of threads allocated to the i-th data partition is 0, the number of threads allocated to the j-th data partition is not 0, 1 ≤ i ≤ M, 1 ≤ j ≤ M, and i ≠ j.

2. The method of claim 1, characterized in that:

The number of data rows $C_m$ of each data partition is obtained, $1 \le m \le M$;

The weight of the m-th data partition is $w_m = C_m / C$, where $C = C_1 + \dots + C_m + \dots + C_M$; the weights of all data partitions sum to 1;

3. The method of claim 2, characterized in that:

If the number of threads allocated to a data partition is greater than or equal to 2, that data partition is divided into data sub-partitions, where the number of data sub-partitions of the data partition equals the number of threads allocated to it, and each data sub-partition corresponds to one thread.

4. The method of claim 3, characterized in that:

Extracting data from each data partition with the corresponding number of threads according to the allocated thread counts includes:

According to the thread count assigned to each data sub-partition of each data partition, extracting data from each data sub-partition with the corresponding number of threads.

5. A data extraction device applied to a relational database, characterized in that the device includes a division module, an allocation module, and an extraction module, wherein:

The division module is configured to divide a selected data table into M data partitions according to the value-range distribution of a field in the table, where the field is of a numeric type or its values can be converted to numeric values;

The weight calculation unit is configured to calculate the weight of each data partition according to the number of data rows in each data partition;

The thread allocation unit is configured to allocate a thread count to each data partition according to the weight of each data partition, where the thread counts allocated to the data partitions sum to a preset total thread count N and M ≤ N;

The extraction module is configured to start the N threads and, according to the allocated thread counts, extract data from each data partition with the corresponding number of threads;

The device further includes a merging module,

The merging module is configured to merge the i-th data partition with the j-th data partition, where the number of threads allocated to the i-th data partition is 0, the number of threads allocated to the j-th data partition is not 0, 1 ≤ i ≤ M, 1 ≤ j ≤ M, and i ≠ j.

6. The device of claim 5, characterized in that:

The weight calculation unit calculating the weight of each data partition according to the number of data rows in each data partition means:

Obtaining the number of data rows $C_m$ of each data partition, $1 \le m \le M$;

The thread allocation unit allocating a thread count to each data partition according to the weight of each data partition means:

7. The device of claim 6, characterized in that the device further includes a sub-partition module,

The sub-partition module is configured to, when the thread allocation unit allocates 2 or more threads to a data partition, divide that data partition into data sub-partitions, where the number of data sub-partitions of the data partition equals the number of threads allocated to it, and each data sub-partition corresponds to one thread.

8. The device of claim 7, characterized in that:

The extraction module extracting data from each data partition with the corresponding number of threads according to the allocated thread counts means: