CN104933176A - Big data address hierarchical scheduling method based on MapReduce technology

Info

Publication number: CN104933176A
Application number: CN201510374579.2A
Authority: CN
Inventors: 胡自权; 徐勇; 尹德辉; 龙汉安; 夏纪毅; 王柯
Original assignee: Sichuan Medical University
Current assignee: Sichuan Medical University
Priority date: 2015-06-30
Filing date: 2015-06-30
Publication date: 2015-09-23
Anticipated expiration: 2035-06-30
Also published as: CN104933176B

Abstract

The invention discloses a big data address hierarchical scheduling method based on MapReduce technology. The method comprises the following steps: building a contact-address-oriented scheduling table; determining the territorial scope of the business; generating the Key and Value in the Map stage; performing scheduling analysis in the Reduce stage; and scheduling downward layer by layer. The method realizes contact-address-oriented scheduling; the contact address can be extended upward to national or even continental-level addresses and downward to more precise positions; and hierarchical scheduling by addresses of different granularities is supported.

Description

Big data address hierarchical scheduling method based on MapReduce technology

Technical field

The present invention relates to a data processing method, and in particular to a big data address hierarchical scheduling method based on MapReduce technology.

Background art

An address consists of a country, province (or autonomous region, municipality directly under the Central Government, or special administrative region), city, district (county), town, and street number (village group). Address structure is hierarchical, and an address can be represented as a character string, as with a mailing address, home address, company address, or work-unit address. Existing address-based algorithms include disk scheduling, IP scheduling, and GPS scheduling.

For disk scheduling (the first-come, first-served algorithm, the shortest-seek-time-first algorithm, the scan algorithm, and the circular scan algorithm), the physical block address of a disk consists of a cylinder number, a head number, and a sector number. The time to access a physical block of the disk comprises seek time, rotational latency, and transfer time, and the goal of disk scheduling is to make seek time as short as possible and throughput as large as possible. The address described in this patent differs from the physical block address of a disk (cylinder, head, sector), so disk scheduling is not suitable for the big data address hierarchical scheduling algorithm described in this patent.

For IP address scheduling (IP datagram routing), IP address ranges are allocated according to different addresses. Using the routing table stored in a router, an IP datagram is forwarded along the path (port) to a particular network address. An IP address only identifies a computer connected to the Internet and differs from the contact address (country, province, city, district, town, street number) described in this patent, so IP address scheduling algorithms are not suitable for the big data address hierarchical scheduling algorithm described in this patent.

For GPS scheduling, the terminal receives satellite signals through a satellite antenna and locates itself automatically; the terminal sends the address information to the control center through a GPRS module; and the control center uses the Internet or a private network to extract the positioning address and displays it on an electronic map. A GPS positioning address is essentially the same as the address described in this patent, but a GPS positioning address must be transmitted to the control center in real time. Because this real-time requirement is difficult to relax (let alone disregard), positioning data cannot be accumulated to generate big data.

Summary of the invention

The present invention aims to provide a big data address hierarchical scheduling method based on MapReduce technology that realizes scheduling oriented to contact addresses. The contact address can be extended upward to national or even continental-level addresses and downward to more precise positions, and hierarchical scheduling by addresses of different granularities is supported.

To achieve the above object, the present invention adopts the following technical solution:

The big data address hierarchical scheduling method based on MapReduce technology disclosed by the invention comprises the following steps:

Step 1, build a scheduling table oriented to contact addresses: the column families of the scheduling table comprise a basic information column family of the problem domain and a scheduling column family; the basic information column family contains the content to be processed in the Reduce stage and the associated contact address column of the big data; the scheduling column family contains the coarse address and fine address columns into which the contact address is divided; a field that can distinguish big data records is chosen as the row key of the scheduling table, and the row key is placed in the basic information column family;

Step 2, determine the territorial scope of the business and initialize the coarse address and fine address: determine the territorial scope of the business according to the problem domain, and write the coarse address and fine address of the contact address into the coarse address and fine address columns of the scheduling column family of the scheduling table;

Step 3, generate the Key and Value in the Map stage: assign the coarse address of the big data contact address to the Key, and assign the row key + contact address + content to be processed to the Value;

Step 4, perform scheduling analysis in the Reduce stage: according to the Key and the contact address in the Value, output the coarse address and fine address of the next-level address division;

Step 5, schedule downward layer by layer: initialize the Job, establish the connection to the scheduling table database, initialize both the source table and the target table to the scheduling table, and schedule downward layer by layer by the associated contact address of the big data until the bottom-level contact address is reached; otherwise, repeat Step 3 to Step 5.

Preferably, the coarse address comprises the country, the province or autonomous region or municipality directly under the Central Government or special administrative region, and the city or county; the fine address comprises the district or town, the street, and the community or house number.

Preferably, in Step 3, the row key is an order number or a customer ID number.

Preferably, the scheduling table is an HBase scheduling table.

Further, in Step 1, for existing big data, an ETL tool is used to import the big data into the scheduling table.

Preferably, when the business serves the whole world, the country-level address of the contact address is chosen as the coarse address, and the remainder of the contact address is the fine address.

Preferably, when the business is domestic, the province or autonomous region or municipality directly under the Central Government or special administrative region of the contact address is chosen as the coarse address, and the remainder of the contact address is the fine address.

Further, the fine address comprises a storage location number, and the content to be processed comprises the stock quantity.

The big data address hierarchical scheduling method based on MapReduce technology disclosed by the invention has the following features:

First, it supports extending the contact address upward and downward. Upward extension supports scheduling over a wider range; downward extension supports more precise address arrangement, making the method suitable for address-oriented scheduling in different fields (settings).

Second, as Steps 3 to 5 of the algorithm are repeated, scheduling advances downward through the contact address one level at a time, realizing scheduling based on different address division granularities.

Third, the content to be scheduled (such as the storage location number of the contact address and its stock quantity) is placed in the content to be processed of the basic information column family of the scheduling table, reducing some manual work (such as tallying stock quantities to generate current order quantities, and assigning goods to warehouses and their storage location numbers).
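As an illustration of the third feature, the following is a minimal sketch of tallying stock quantity per coarse address in plain Python. The record layout (coarse address, storage location number, stock quantity), the "/"-separated address format, the sample data, and the function name `stock_by_address` are all assumptions made for this example; the patent does not specify them.

```python
# Sketch: total stock quantity per coarse address, as a Reduce step might compute it.
# Assumption: each record is (coarse_address, storage_location_number, stock_quantity).
from collections import defaultdict

def stock_by_address(records):
    """Sum the stock quantity of all storage locations under each coarse address."""
    totals = defaultdict(int)
    for coarse, _location, quantity in records:
        totals[coarse] += quantity
    return dict(totals)

# Illustrative sample data only.
records = [
    ("China/Sichuan", "A-01", 40),
    ("China/Sichuan", "A-02", 25),
    ("China/Yunnan", "B-07", 10),
]
totals = stock_by_address(records)
```

Keeping the storage location number and stock quantity inside the content to be processed means a tally of this kind can be produced in the same pass that schedules the address.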

The beneficial effects of the present invention are as follows:

(1) By dividing the contact address into two levels, scheduling oriented to contact addresses is realized.

(2) By running the method, scheduling at different division granularities of the contact address is realized.

(3) The contact address can be extended upward to national or even continental-level addresses.

(4) The contact address can be extended downward to more precise positions, such as a storage location number in a warehouse.

(5) The scheduling content (such as storage location number and stock quantity) is placed in the content to be processed.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawing.

As shown in Fig. 1, the big data address hierarchical scheduling method based on MapReduce technology disclosed by the invention comprises the following steps:

Step 1, build an HBase scheduling table (hereinafter the scheduling table) oriented to contact addresses

Build a scheduling table oriented to addresses, whose column families comprise the basic information column family of the problem domain and the scheduling column family. The basic information column family contains the content to be processed in the Reduce stage and the associated contact address column of the big data. The scheduling column family contains the coarse address and fine address columns into which the contact address is divided. A field that can distinguish big data records, such as an order number or a customer ID, is chosen as the row key of the scheduling table, and the row key is placed in the basic information column family. For existing big data, a tool (ETL, etc.) can be used to import the big data into the scheduling table; where no data exists yet, the tables in the back-end database can be designed as scheduling tables according to the above requirements, so that the method disclosed in this patent can be used.
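The layout described in Step 1 can be pictured with a minimal sketch that models one table row as nested Python dictionaries instead of a real HBase table. The family names `info` and `sched`, the column names, the "/"-separated address format, the sample values, and the helper `make_row` are hypothetical; the patent does not name them.

```python
# Hypothetical in-memory model of one scheduling-table row.
# Family and column names are illustrative assumptions, not taken from the patent.

def make_row(row_key, contact_address, content):
    """Build one row with the two column families described in Step 1."""
    return {
        "info": {                      # basic information column family
            "row_key": row_key,        # e.g. an order number or customer ID
            "contact_address": contact_address,
            "content": content,        # content to be processed in the Reduce stage
        },
        "sched": {                     # scheduling column family
            "coarse": "",              # coarse address, filled in by Step 2
            "fine": "",                # fine address, filled in by Step 2
        },
    }

# The table itself is keyed by the row key.
table = {}
row = make_row("order-0001", "China/Sichuan/Luzhou/Jiangyang/Street 1", {"stock": 12})
table[row["info"]["row_key"]] = row
```

In an actual deployment the two dictionaries would correspond to two HBase column families; the sketch only shows which fields belong together.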

Step 2, determine the territorial scope of the business and initialize the coarse address and fine address

Determine the territorial scope of the business according to the problem domain. For example, if the business serves the whole world, the country-level address of the contact address is chosen as the coarse address, and the remainder of the address is the fine address; if the business is domestic, the country and province of the contact address are chosen as the coarse address, and the remainder is the fine address. The coarse address and fine address of the contact address are written into the coarse address and fine address columns of the scheduling column family of the scheduling table.
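The initial split in Step 2 might be sketched as follows. This assumes the contact address is a "/"-separated string ordered from the country downward, and that a global business starts with one coarse level (country) while a domestic business starts with two (country and province). The scope labels and the function names `split_address` and `initial_coarse_levels` are hypothetical.

```python
# Sketch of Step 2: split a contact address into coarse and fine parts.
# Assumption: the address is a "/"-separated string ordered from country downward.

SEP = "/"

def split_address(contact_address, coarse_levels):
    """Return (coarse, fine), where coarse holds the first `coarse_levels` levels."""
    parts = contact_address.split(SEP)
    coarse = SEP.join(parts[:coarse_levels])
    fine = SEP.join(parts[coarse_levels:])
    return coarse, fine

def initial_coarse_levels(scope):
    """Global business starts at country level; domestic business at province level."""
    return {"global": 1, "domestic": 2}[scope]

# Illustrative sample address only.
address = "China/Sichuan/Luzhou/Jiangyang/Street 1"
global_split = split_address(address, initial_coarse_levels("global"))
domestic_split = split_address(address, initial_coarse_levels("domestic"))
```

The two results would then be written into the coarse and fine address columns of the scheduling column family.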

Step 3, generate the Key and Value in the Map stage

First determine the Key and Value for MapReduce programming: the coarse address of the big data contact address is the Key, and the row key (the field that distinguishes big data records) + contact address (scheduling address) + content to be processed is the Value.

Read the associated contact address, coarse address, row key, and content to be processed of the big data from the scheduling table. Key ← coarse address. Value ← row key + contact address + content to be processed.
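The Key and Value assignment above can be sketched in plain Python, without Hadoop or HBase. The row field names, the sample rows, and the function name `map_row` are assumptions for illustration; the grouping loop stands in for the shuffle that a MapReduce framework performs between Map and Reduce.

```python
# Sketch of the Map stage in Step 3, in plain Python (no Hadoop or HBase).
# Assumption: each input row is a dict with the hypothetical fields shown below.

def map_row(row):
    """Emit (Key, Value): Key is the coarse address;
    Value is row key + contact address + content to be processed."""
    key = row["coarse"]
    value = (row["row_key"], row["contact_address"], row["content"])
    return key, value

# Illustrative sample rows only.
rows = [
    {"row_key": "order-1", "contact_address": "China/Sichuan/Luzhou",
     "coarse": "China/Sichuan", "content": 5},
    {"row_key": "order-2", "contact_address": "China/Sichuan/Chengdu",
     "coarse": "China/Sichuan", "content": 3},
    {"row_key": "order-3", "contact_address": "China/Yunnan/Kunming",
     "coarse": "China/Yunnan", "content": 7},
]

# Group mapped pairs by key, as the shuffle phase would do before Reduce.
grouped = {}
for row in rows:
    key, value = map_row(row)
    grouped.setdefault(key, []).append(value)
```

Because the coarse address is the Key, all records that share it arrive at the same Reduce call.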

Step 4, perform scheduling analysis in the Reduce stage

According to the Key and the contact address in the Value, output the coarse address and fine address of the next-level address division: if the coarse address in this pass extends to the province, the coarse address at the next level extends to the city, and the remainder of the contact address is the fine address. According to the Key and the row key + content to be processed in the Value, analyze the big data further and output the content to be processed in the next scheduling pass.
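A minimal sketch of the next-level split performed in the Reduce stage, in plain Python. It assumes "/"-separated addresses and that the next coarse address is the current Key extended by one more level of the contact address. The function name `reduce_group` and the sample values are hypothetical, and any further analysis of the content to be processed is left out.

```python
# Sketch of the Reduce stage in Step 4, in plain Python.
# Assumption: addresses are "/"-separated, and the next coarse address is the
# current Key extended by one more level of the contact address.

SEP = "/"

def reduce_group(key, values):
    """For one coarse address (Key), output the next-level coarse/fine split per record."""
    depth = len(key.split(SEP)) + 1          # one level deeper than the current Key
    out = []
    for row_key, contact_address, content in values:
        parts = contact_address.split(SEP)
        next_coarse = SEP.join(parts[:depth])
        next_fine = SEP.join(parts[depth:])
        out.append((row_key, next_coarse, next_fine, content))
    return out

# Illustrative sample values only: the Key reaches the province level.
values = [
    ("order-1", "China/Sichuan/Luzhou/Jiangyang", 5),
    ("order-2", "China/Sichuan/Chengdu/Wuhou", 3),
]
result = reduce_group("China/Sichuan", values)
```

Here the Key reaches the province, so each output coarse address reaches the city and the remainder becomes the fine address.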

Step 5, initialize the Job and establish the connection to the HBase database; initialize both the source table and the target table to the scheduling table.

Schedule layer by layer by the associated contact address of the big data. When scheduling of the current level of the contact address is complete, the coarse address is the scheduled address from the country down to this level, and the remaining part of the address is the fine address; the next-level address is then scheduled in turn until arrangement (or delivery) is complete; otherwise, repeat Steps 3 to 5.
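The repetition of Steps 3 to 5 can be sketched as a simple driver loop, simulated in memory. This is an assumption-laden simplification: one dictionary plays the role of the scheduling table (both source and target of every pass), each pass collapses Map and Reduce into a per-row update, addresses are "/"-separated, and the names `run_pass` and `schedule` are hypothetical.

```python
# Sketch of the Step 5 driver loop, simulated in memory.
# Assumptions: "/"-separated addresses; one dict plays the role of the scheduling
# table, serving as both source and target of every pass.

SEP = "/"

def run_pass(table):
    """One pass: move every unfinished row one address level down.
    Returns True if any row was updated."""
    progressed = False
    for row in table.values():
        if not row["fine"]:                   # already at the bottom-level address
            continue
        head, _, rest = row["fine"].partition(SEP)
        row["coarse"] = row["coarse"] + SEP + head
        row["fine"] = rest
        progressed = True
    return progressed

def schedule(table):
    """Repeat passes until every row has reached its bottom-level address."""
    passes = 0
    while run_pass(table):
        passes += 1
    return passes

# Illustrative sample table only.
table = {
    "order-1": {"coarse": "China", "fine": "Sichuan/Luzhou/Jiangyang"},
    "order-2": {"coarse": "China", "fine": "Yunnan/Kunming"},
}
passes = schedule(table)
```

Rows with shorter addresses finish earlier and are skipped in later passes, so the number of passes is set by the deepest address in the table.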

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims

1. A big data address hierarchical scheduling method based on MapReduce technology, characterized in that it comprises the following steps:

Step 1, build a scheduling table oriented to contact addresses: the column families of the scheduling table comprise a basic information column family of the problem domain and a scheduling column family; the basic information column family contains the content to be processed in the Reduce stage and the associated contact address column of the big data; the scheduling column family contains the coarse address and fine address into which the contact address is divided; a field that can distinguish big data records is chosen as the row key of the scheduling table, and the row key is placed in the basic information column family;

Step 3, generate the Key and Value in the Map stage: assign the coarse address of the big data contact address to the Key, and assign the row key + contact address + content to be processed to the Value;

2. The big data address hierarchical scheduling method based on MapReduce technology according to claim 1, characterized in that: the coarse address comprises the country, the province or autonomous region or municipality directly under the Central Government or special administrative region, and the city or county; the fine address comprises the district or town, the street, and the community or house number.

3. The big data address hierarchical scheduling method based on MapReduce technology according to claim 1, characterized in that: in Step 3, the row key is an order number or a customer ID number.

4. The big data address hierarchical scheduling method based on MapReduce technology according to claim 1, characterized in that: the scheduling table is an HBase scheduling table.

5. The big data address hierarchical scheduling method based on MapReduce technology according to claim 1, characterized in that: in Step 1, for existing big data, an ETL tool is used to import the big data into the scheduling table.

6. The big data address hierarchical scheduling method based on MapReduce technology according to claim 1, characterized in that: when the business serves the whole world, the country-level address of the contact address is chosen as the coarse address, and the remainder of the contact address is the fine address.

7. The big data address hierarchical scheduling method based on MapReduce technology according to claim 1, characterized in that: when the business is domestic, the province or autonomous region or municipality directly under the Central Government or special administrative region of the contact address is chosen as the coarse address, and the remainder of the contact address is the fine address.

8. The big data address hierarchical scheduling method based on MapReduce technology according to claim 1, characterized in that: the fine address comprises a storage location number, and the content to be processed comprises the stock quantity.