WO2016101798A1 - Method and apparatus for processing big data - Google Patents

Method and apparatus for processing big data

Info

Publication number
WO2016101798A1
Authority
WO
WIPO (PCT)
Prior art keywords
key
key value
value pairs
value pair
reduce processing
Prior art date
Application number
PCT/CN2015/097179
Other languages
English (en)
French (fr)
Inventor
王晓丽
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP15871867.6A (patent EP3193264B1)
Publication of WO2016101798A1
Priority to US15/481,606 (patent US10691669B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24542 Plan optimisation
    • G06F16/24545 Selectivity estimation or determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects

Definitions

  • The present invention relates to the field of data processing, and in particular, to a method and apparatus for processing big data.
  • Big data is a data set that contains a large amount of data, and each piece of that data can be called a sub-data of the big data. In big data, only a small amount of sub-data has high value to users. To make big data easier for the user to browse, the big data can currently be processed so that the high-value sub-data it contains is provided to the user preferentially.
  • For example, a search engine finds a plurality of search results according to keywords input by a user, and these search results constitute big data; the search engine processes the big data so that the search results of high value to the user are presented to the user first.
  • Currently, the big data can be processed by the following process: a mapping (English: Map) module performs Map processing on the big data to be processed and outputs at least one key-value pair corresponding to each sub-data in the big data.
  • The key in the key-value pair corresponding to a sub-data is that sub-data, and the value is another sub-data in the big data that has a preset relationship with the key.
  • All key-value pairs containing the same key are assigned to one Reduce processing module in a set of Reduce processing modules; the Reduce processing module processes the values of those key-value pairs and outputs the value of the key.
  • Because the key is a sub-data, this yields the value of that sub-data.
  • In this way, the value of each sub-data in the big data can be obtained. A sub-data with a higher value is more valuable to the user, and the sub-data included in the big data is shown to the user according to that value.
  • However, key-value pairs are assigned to the Reduce processing modules according to their keys alone, which may leave the load of the Reduce processing modules unbalanced.
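  • The imbalance described above can be illustrated with a minimal sketch. The function name `assign_by_key`, the use of CRC32 as the hash, and the skewed sample input are all assumptions made for illustration; the source does not specify how conventional key-based assignment is implemented.

```python
import zlib
from collections import Counter


def assign_by_key(pairs, num_reducers):
    """Conventional assignment: all pairs sharing a key go to the same reducer.

    Returns a Counter mapping reducer index -> number of pairs received.
    CRC32 is only a stand-in for whatever hash a real system would use.
    """
    load = Counter()
    for key, _value in pairs:
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        load[reducer] += 1
    return load


# Hypothetical skewed input: one "hot" key owns 90 of the 100 pairs.
pairs = [("hot", i) for i in range(90)] + [(f"k{i}", i) for i in range(10)]
load = assign_by_key(pairs, num_reducers=4)
print(dict(load))
```

  • Whichever reducer receives the hot key ends up with at least 90 of the 100 pairs, regardless of how lightly loaded the other reducers are.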
  • an embodiment of the present invention provides a method for processing big data, where the method includes:
  • acquiring a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two sub-data in the big data to be processed, a preset data relationship exists between the two sub-data, and the modulo remainders of all key-value pairs in a key-value pair set are the same;
  • the assigned set of key value pairs are processed by each of the Reduce processing modules.
  • the step of acquiring a plurality of key-value pair sets includes:
  • a plurality of the key value pairs having the same modulo remainder are assigned to a set of key value pairs to form the plurality of key value pair sets.
  • the modulo coefficient = the number of Reduce processing modules × the modulo factor, where the modulo factor is predetermined.
  • the step of allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module includes:
  • allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
  • Each of the Reduce processing modules is allocated a corresponding key-value pair set according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.
  • an embodiment of the present invention provides an apparatus for processing big data, where the apparatus includes:
  • a first obtaining module configured to acquire a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two sub-data in the big data to be processed, a preset data relationship exists between the two sub-data, and the modulo remainders of all key-value pairs in a key-value pair set are the same;
  • a calculation module configured to separately calculate a sum of values included in each set of key-value pairs, to obtain a total value of values included in each set of key-value pairs;
  • a second acquiring module configured to acquire a load condition of each Reduce processing module in the Reduce processing module set
  • An allocating module configured to allocate a corresponding set of key value pairs for each of the Reduce processing modules according to a total value of the values included in the set of each key value pair and a load condition of each of the Reduce processing modules;
  • a processing module configured to separately process the allocated set of key value pairs by each of the Reduce processing modules.
  • the first acquiring module is configured to:
  • a plurality of the key value pairs having the same modulo remainder are assigned to a set of key value pairs to form the plurality of key value pair sets.
  • the modulo coefficient = the number of Reduce processing modules × the modulo factor, and the modulo factor is predetermined.
  • the allocating module is configured to:
  • the allocating module is further configured to:
  • Each of the Reduce processing modules is allocated a corresponding key-value pair set according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.
  • The method and apparatus for processing big data allocate the key-value pair sets in each partition to the Reduce processing modules according to the total value of each key-value pair set and the load condition of each Reduce processing module.
  • Compared with the usual approach of assigning key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.
  • FIG. 1 is a schematic structural diagram of an implementation environment involved in a method for processing big data according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of another implementation environment involved in a method for processing big data according to an embodiment of the present invention
  • FIG. 3 is a flowchart of a method for processing big data according to Embodiment 1 of the present invention.
  • FIG. 4 is a flowchart of a method for processing big data according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic structural diagram of an apparatus for processing big data according to Embodiment 3 of the present invention.
  • FIG. 6 is a schematic structural diagram of an apparatus for processing big data according to Embodiment 4 of the present invention.
  • FIG. 1 is a schematic structural diagram of an implementation environment involved in a method for processing big data according to an embodiment of the present invention.
  • the system includes a job server 10 and a task server 20 that performs data interaction with the job server 10.
  • the job server 10 is provided with a job tracker 11, which obtains the total value of the values included in each key-value pair set transmitted by the task server 20 and obtains the load of each Reduce processing module from the Reduce processing module set.
  • According to these, the job tracker 11 allocates a corresponding key-value pair set to each Reduce processing module; allocating here means establishing the correspondence between each key-value pair set and each Reduce processing module. The job tracker 11 then feeds the established correspondences back to the task server 20.
  • the task server 20 is provided with a splitter 21, a map processing module 22, a Reduce processing module 23, a partitioner 24, and a task tracker 25; wherein the splitter 21 is configured to divide the big data into a plurality of data fragments.
  • the map processing module 22 is configured to process the plurality of data fragments obtained from the big data to obtain a plurality of key-value pairs, and to send the obtained key-value pairs to the partitioner 24.
  • the partitioner 24 is configured to perform a modulo operation on the obtained key-value pairs, assign key-value pairs having the same modulo remainder to one key-value pair set, calculate the total value of each key-value pair set, and transmit the obtained total values to the job tracker 11; the task tracker 25 is configured to acquire the correspondence, fed back by the job tracker 11, between each key-value pair set and each Reduce processing module.
  • the total value of the key-value pair set is obtained by accumulating the values of the key-value pairs in the set of key-value pairs.
  • FIG. 2 is a schematic structural diagram of another implementation environment involved in a method for processing big data according to an embodiment of the present invention.
  • the job server 10 is set in the task server 20 in the form of a functional module.
  • an embodiment of the present invention provides a method for processing big data, where the process includes:
  • Step 100 The partitioner of the task server acquires a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two sub-data in the big data to be processed, a preset data relationship exists between the two sub-data, and the modulo remainders of all key-value pairs in a key-value pair set are the same.
  • Step 101 The partitioner of the task server separately calculates the sum of the values included in each set of key-value pairs, and obtains the total value of the values included in each set of key-value pairs.
  • Step 102 The job tracker of the job server acquires the load condition of each Reduce processing module in the Reduce processing module set.
  • Step 103 The job tracker of the job server allocates a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module.
  • Step 104 Processing, by each Reduce processing module of the task server, a set of key value pairs corresponding to each Reduce processing module.
  • The method for processing big data provided by this embodiment allocates the key-value pair sets in each partition to the Reduce processing modules according to the total value of each key-value pair set and the load condition of each Reduce processing module. Compared with the usual approach of assigning key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.
  • an embodiment of the present invention provides a method for processing big data, where the method includes:
  • Step 200 The partitioner of the task server acquires a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two sub-data in the big data to be processed, a preset data relationship exists between the two sub-data, and the modulo remainders of all key-value pairs in a key-value pair set are the same.
  • The preset data relationship between the key and the value is the correspondence between a search condition applied to the big data by the Map processing module and the search result obtained once the search of the big data is completed: the search condition is the key and the search result is the value, and each such correspondence forms a key-value pair.
  • After the Map processing module searches the big data by the search condition and obtains the corresponding search result, it outputs the key-value pair in the form of (search condition, search result).
  • the Map processing module will be based on this "Hua” word.
  • For example, given a search condition of hyperlinks containing the word "Chinese", the Map processing module searches the web page collection for web pages that have a hyperlink containing the word "Chinese", and after the search is completed it outputs key-value pairs in the form of (hyperlink containing "Chinese", web page having that hyperlink).
  • step 200 is specifically described by step 2001 to step 2003:
  • Step 2001 The big data is processed by the Map processing module of the task server to obtain a plurality of key value pairs.
  • The big data processed by the Map processing module consists of the plurality of data blocks into which the splitter of the task server divides the big data to be processed. After receiving the data blocks transmitted by the splitter, the Map processing module searches each received data block according to a preset key and outputs the search results in the form of key-value pairs.
  • For example, when the user wants to find web pages having hyperlinks that contain the word "Chinese" among the many web pages of a network, 100 such web pages are found after the search. Suppose the hyperlinks containing "Chinese" in these 100 web pages are A and B, the web pages with hyperlink A are E, F, and G, and the web pages with hyperlink B are F, G, and H; then the key-value pairs obtained after processing by the Map processing module are (A, E), (A, F), (A, G), (B, F), (B, G), and (B, H).
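  • The hyperlink example can be reproduced with a small sketch of the Map step. The function `map_hyperlinks`, the page dictionary layout, and the link texts are hypothetical; the source only states that the output has the form (hyperlink, web page).

```python
def map_hyperlinks(pages, keyword):
    """Emit one (hyperlink, page) pair for every hyperlink whose text contains keyword.

    pages maps a page name to a list of (link_id, link_text) tuples.
    The result is sorted only to make the output order predictable.
    """
    pairs = []
    for page, links in pages.items():
        for link_id, link_text in links:
            if keyword in link_text:
                pairs.append((link_id, page))
    return sorted(pairs)


# Hypothetical pages E..H; hyperlinks A and B contain the keyword, X does not.
pages = {
    "E": [("A", "Chinese food")],
    "F": [("A", "Chinese food"), ("B", "Chinese art")],
    "G": [("A", "Chinese food"), ("B", "Chinese art"), ("X", "weather")],
    "H": [("B", "Chinese art")],
}
pairs = map_hyperlinks(pages, "Chinese")
print(pairs)
```

  • With this input the sketch yields the same six pairs as the example: (A, E), (A, F), (A, G), (B, F), (B, G), and (B, H).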
  • The splitter splits the big data according to the specific content of the big data so that the big data can be operated on.
  • For example, if the big data to be processed is an e-book, the e-book can be divided into multiple data blocks with one paragraph per data block, or with one sentence per data block. If the big data is the home page collection of a website, the collection can be divided into multiple data blocks with one web page per data block.
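  • A splitter of this kind might be sketched as follows for the e-book case. The function names, the blank-line paragraph convention, and the punctuation-based sentence rule are assumptions for illustration and are not taken from the source.

```python
import re


def split_paragraphs(text):
    """One data block per paragraph, assuming paragraphs are separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def split_sentences(text):
    """One data block per sentence, splitting on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


book = "First sentence. Second sentence.\n\nNew paragraph here."
print(split_paragraphs(book))
print(split_sentences(book))
```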
  • the Map processing module sends the processed key value pair to the partitioner.
  • Step 2002 The partitioner of the task server performs a modulo operation on the obtained keys of the plurality of key value pairs according to the preset modulo coefficients, and obtains the modulo remainder of each key value pair.
  • The partitioner of the task server receives the key-value pairs transmitted by the Map processing module and performs a hash operation on the key in each key-value pair to obtain a numeric string corresponding to the key; then, according to the preset modulo coefficient, the partitioner performs a modulo operation on the numeric string of the key of each key-value pair to obtain the modulo remainder of each key-value pair.
  • the modulo coefficient = the number of Reduce processing modules × the modulo factor, and the modulo factor is predetermined and stored in the partitioner of the task server.
  • Step 2003 The partitioner of the task server allocates a plurality of key-value pairs having the same modulo remainder to one key-value pair set, to form a plurality of key-value pair sets.
  • Each of the key-value pair sets respectively has a key-value pair set identifier.
  • In step 2001 to step 2003, before the key-value pair sets are formed, a modulo operation is performed on the obtained key-value pairs according to the preset modulo coefficient. Because the modulo coefficient equals the number of Reduce tasks × the modulo factor, the key-value pairs can be distributed over more key-value pair sets, which makes the distribution of key-value pairs more uniform, so that the key-value pairs handled by each Reduce processing module in the Reduce processing module set are more balanced when the key-value pair sets are processed.
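  • Steps 2002 and 2003 can be sketched as below. CRC32 stands in for the unspecified hash operation, and the names `modulo_remainder` and `build_sets` are hypothetical; only the relation modulo coefficient = number of Reduce processing modules × modulo factor comes from the description.

```python
import zlib
from collections import defaultdict


def modulo_remainder(key, num_reducers, modulo_factor):
    """Hash the key to a number, then take it modulo the modulo coefficient."""
    modulo_coefficient = num_reducers * modulo_factor
    digest = zlib.crc32(key.encode("utf-8"))  # stand-in for the hash operation
    return digest % modulo_coefficient


def build_sets(pairs, num_reducers, modulo_factor):
    """Group key-value pairs that share a modulo remainder into one set.

    The remainder doubles as the key-value pair set identifier here,
    which is an assumption of this sketch.
    """
    sets = defaultdict(list)
    for key, value in pairs:
        sets[modulo_remainder(key, num_reducers, modulo_factor)].append((key, value))
    return dict(sets)


sets = build_sets([("A", "E"), ("A", "F"), ("B", "F")], num_reducers=3, modulo_factor=4)
print(sets)
```

  • With 3 Reduce processing modules and a modulo factor of 4, up to 12 sets can be formed, which gives the later allocation step finer-grained units to balance than 3 key-based partitions would.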
  • Step 201 The partitioner of the task server separately calculates the sum of the values included in each set of key-value pairs, and obtains the total value of the values included in each set of key-value pairs.
  • After obtaining the total value of each key-value pair set, the partitioner generates a correspondence between the identifier of each key-value pair set and its total value, records the correspondences in a preset relationship list, and then sends the relationship list to the job tracker of the job server.
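  • Building the relationship list might look like the sketch below. It assumes that each value is numeric so that the values in a set can be summed; the source does not say how a value that is itself a sub-data is turned into a number, and the name `build_relationship_list` is hypothetical.

```python
def build_relationship_list(sets):
    """Return [(set_id, total_value), ...], one entry per key-value pair set.

    sets maps a set identifier to a list of (key, numeric_value) pairs.
    Numeric values are an assumption of this sketch.
    """
    return sorted((set_id, sum(value for _key, value in pairs))
                  for set_id, pairs in sets.items())


example_sets = {0: [("A", 3), ("B", 4)], 5: [("C", 10)]}
relationship_list = build_relationship_list(example_sets)
print(relationship_list)
```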
  • Step 202 The job tracker of the job server acquires a load condition of each Reduce processing module in the Reduce processing module set.
  • the job tracker of the job server obtains the load condition of each Reduce processing module from the task tracker of the task server.
  • The task tracker of the task server is preset with a load list for recording the load condition of each Reduce processing module in the Reduce processing module set, and the task tracker of the task server periodically acquires the load status of each Reduce processing module to update the load list.
  • the load list records the correspondence between the identifier of each Reduce processing module and the total value of the key-value pair set that has not been processed.
  • Step 203 The job tracker of the job server allocates a corresponding key-value pair set to each Reduce processing module according to the total value of each key-value pair set and the load condition of each Reduce processing module.
  • step 203 is specifically described by step 2031 to step 2035:
  • Step 2031 The job tracker of the job server determines the current allocation number of the key-value pair sets.
  • The allocation numbers of the key-value pair sets are recorded in advance in an allocation list of the job tracker of the job server, together with the allocation ratio of key-value pair sets corresponding to each allocation number.
  • For example, the allocation list records: allocation number 1, allocation ratio 20%; allocation number 2, allocation ratio 40%; allocation number 3, allocation ratio 60%; allocation number 4, allocation ratio 80%; and allocation number 5, allocation ratio 100%.
  • the above allocation list only records a method of allocating a set of key-value pairs, and may also allocate a set of key-value pairs by using other allocation times and corresponding allocation ratios, and details are not described herein again.
  • Step 2032 The job tracker of the job server finds, according to the determined current allocation number, the allocation ratio of key-value pair sets corresponding to that allocation number; the correspondence between allocation numbers and allocation ratios is preset.
  • The job tracker of the job server looks up the allocation ratio corresponding to the determined current allocation number in the pre-stored allocation list.
  • Step 2033 The job tracker of the job server acquires a corresponding number of key-value pair sets from the plurality of key-value pair sets according to the obtained allocation ratio.
  • The job tracker of the job server acquires, from the received relationship list, the corresponding proportion of the correspondences between key-value pair set identifiers and total values.
  • For example, with an allocation ratio of 40%, the job tracker takes 40% of the correspondences between key-value pair set identifiers and total values recorded in the current relationship list for allocation. If the relationship list records 1000 such correspondences, 400 of them are allocated.
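  • The lookup in steps 2032 and 2033 can be sketched as follows, using the allocation list from the example. Taking the first entries of the relationship list is an assumption of this sketch; the source does not say which entries are chosen, and the names `ALLOCATION_RATIO_PERCENT` and `select_for_allocation` are hypothetical.

```python
# Allocation number -> allocation ratio in percent, as in the example allocation list.
ALLOCATION_RATIO_PERCENT = {1: 20, 2: 40, 3: 60, 4: 80, 5: 100}


def select_for_allocation(relationship_list, allocation_number):
    """Return the share of relationship-list entries given by the allocation ratio.

    Integer arithmetic avoids floating-point rounding; which entries are taken
    (here simply the first ones) is an assumption.
    """
    percent = ALLOCATION_RATIO_PERCENT[allocation_number]
    count = len(relationship_list) * percent // 100
    return relationship_list[:count]


# 1000 hypothetical (set_id, total_value) entries, allocation number 2 (40%).
entries = [(set_id, 1) for set_id in range(1000)]
selected = select_for_allocation(entries, allocation_number=2)
print(len(selected))
```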
  • Step 2034 The job tracker of the job server allocates a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
  • The job tracker of the job server obtains the identifiers of the Reduce processing modules from the load list and the identifiers of the key-value pair sets from the relationship list. According to the load condition corresponding to each Reduce processing module identifier and the total value corresponding to each key-value pair set identifier, it allocates the key-value pair sets under the rule that a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.
  • Allocating a key-value pair set to a Reduce processing module means associating the identifier of the Reduce processing module with the identifier of the key-value pair set, which establishes the correspondence between the key-value pair set and the Reduce processing module. The job tracker then feeds the generated correspondences back to the task tracker of the task server.
  • The task tracker controls each Reduce processing module, according to the correspondence between key-value pair sets and Reduce processing modules, to obtain its corresponding key-value pair set from the partition.
  • For example, suppose there are three Reduce processing modules A, B, and C: the load of A is 10, the load of B is 20, and the load of C is 30. There are three key-value pair sets to be allocated, a, b, and c: the total value of a is 30, the total value of b is 40, and the total value of c is 50. According to the allocation rule, key-value pair set a is allocated to Reduce processing module C, key-value pair set b is allocated to Reduce processing module B, and key-value pair set c is allocated to Reduce processing module A.
  • After the allocation, the load of each of the Reduce processing modules A, B, and C is 60, so the loads of the Reduce processing modules A, B, and C are balanced.
  • Allocating the key-value pair sets to the Reduce processing modules under the rule that a key-value pair set with a larger total value goes to a Reduce processing module with a lighter load keeps the load of the Reduce processing modules balanced, so that each Reduce processing module can complete its assigned tasks as far as possible.
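  • One possible reading of this allocation rule is a greedy loop: take the sets in descending order of total value and give each one to the module whose load is currently lightest. The function `allocate` and the choice to update loads after every assignment are assumptions of this sketch; the source states the rule but not a procedure.

```python
def allocate(loads, totals):
    """Greedy allocation: largest total value goes to the currently lightest module.

    loads  maps a Reduce module identifier to its outstanding load.
    totals maps a key-value pair set identifier to its total value.
    Returns (assignment, new_loads) without mutating the inputs.
    """
    new_loads = dict(loads)
    assignment = {}
    for set_id, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        module = min(new_loads, key=new_loads.get)  # lightest module right now
        assignment[set_id] = module
        new_loads[module] += total
    return assignment, new_loads


# The example from the description: modules A, B, C and sets a, b, c.
assignment, new_loads = allocate({"A": 10, "B": 20, "C": 30}, {"a": 30, "b": 40, "c": 50})
print(assignment)
print(new_loads)
```

  • On the example input this assigns c to A, b to B, and a to C, leaving each module with a load of 60, which matches the figures in the description.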
  • Step 204 Process each of the assigned key value pair sets by each Reduce processing module in the task server.
  • The method for processing big data provided by this embodiment allocates the key-value pair sets in each partition to the Reduce processing modules according to the total value of each key-value pair set and the load condition of each Reduce processing module. Compared with the usual approach of assigning key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.
  • an embodiment of the present invention provides an apparatus for processing big data, where the apparatus includes:
  • the first obtaining module 300 is configured to obtain a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two sub-data in the big data to be processed, a preset data relationship exists between the two sub-data, and the modulo remainders of all key-value pairs in a key-value pair set are the same;
  • the calculating module 301 is connected to the first obtaining module 300 and is configured to separately calculate the sum of the values included in each key-value pair set, to obtain the total value of the values included in each key-value pair set;
  • the second obtaining module 302 is connected to the calculating module 301 and is configured to acquire the load condition of each Reduce processing module in the Reduce processing module set;
  • the allocating module 303 is connected to the second obtaining module 302 and is configured to allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module;
  • the processing module 304 is connected to the allocating module 303 and is configured to process the allocated key-value pair sets by means of each Reduce processing module.
  • The apparatus for processing big data provided by this embodiment allocates the key-value pair sets in each partition to the Reduce processing modules according to the total value of each key-value pair set and the load condition of each Reduce processing module. Compared with the usual approach of assigning key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.
  • an embodiment of the present invention provides an apparatus for processing big data, where the apparatus includes:
  • a first obtaining module 400, a calculating module 401, a second obtaining module 402, an allocating module 403, and a processing module 404;
  • the first obtaining module 400 is configured to obtain a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two sub-data in the big data to be processed, a preset data relationship exists between the two sub-data, and the modulo remainders of all key-value pairs in a key-value pair set are the same;
  • the calculating module 401 is connected to the first obtaining module 400 and is configured to separately calculate the sum of the values included in each key-value pair set, to obtain the total value of the values included in each key-value pair set;
  • the second obtaining module 402 is connected to the calculating module 401 and is configured to acquire the load condition of each Reduce processing module in the Reduce processing module set;
  • the allocating module 403 is connected to the second obtaining module 402 and is configured to allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module;
  • the processing module 404 is connected to the allocating module 403 and is configured to separately process the allocated key-value pair sets by means of each Reduce processing module.
  • the first obtaining module 400 is configured to:
  • the big data is processed by the Map processing module to obtain a plurality of key value pairs;
  • a plurality of key-value pairs having the same modulo remainder are assigned to a set of key-value pairs to form a plurality of sets of key-value pairs.
  • the modulo coefficient = the number of Reduce processing modules × the modulo factor, and the modulo factor is predetermined.
  • the allocating module 403 is configured to:
  • allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
  • the allocating module 403 is further configured to:
  • allocate a corresponding key-value pair set to each of the Reduce processing modules according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.
  • the apparatus for processing big data provided by this embodiment allocates the key-value pairs in each partition to the Reduce processing modules according to the total value of each key-value pair set in each partition and the load condition of each Reduce processing module; compared with the usual approach of allocating key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.
  • it should be noted that, when the apparatus for processing big data provided by the foregoing embodiment processes big data, the division into the foregoing functional modules is used only as an example. In actual applications, the foregoing functions may be assigned to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
  • the apparatus for processing big data provided by the foregoing embodiment and the method embodiment for processing big data belong to the same concept. For the specific implementation process, see the method embodiment; details are not described herein again.
  • a person of ordinary skill in the art may understand that all or part of the steps of the foregoing embodiments may be completed by hardware, or by a program instructing related hardware, and the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

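A method and an apparatus for processing big data, belonging to the field of data processing. The method includes: obtaining a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder; calculating, for each key-value pair set, the sum of the values it includes, to obtain the total value of the values included in each key-value pair set; obtaining the load condition of each Reduce processing module in a Reduce processing module set; allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module; and processing, by each Reduce processing module, the key-value pair set allocated to it.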

Description

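Method and apparatus for processing big data
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for processing big data.
Background
Big data is a data set that contains a large amount of data. These data may be referred to as sub-data of the big data, and only a small portion of the sub-data in big data is of high value to a user. To make it easier for users to browse big data, big data can currently be processed so that the higher-value sub-data it contains is provided to the user first. For example, in the search engine field, a search engine finds numerous search results according to a keyword entered by a user, and these search results together form big data; by processing this big data, the search engine obtains the search results of higher value to the user and provides them to the user first.
Currently, big data may be processed as follows: a map (Map) module performs Map processing on the big data to be processed and outputs at least one key-value pair corresponding to each piece of sub-data in the big data, where the key of a key-value pair corresponding to a piece of sub-data is that sub-data, and the value is another piece of sub-data in the big data that has a preset relationship with the key. The key-value pairs that have the same key are then allocated to one Reduce processing module in a reduce (Reduce) processing module set. That Reduce processing module processes the values in these key-value pairs and outputs the value degree of the key; because the key is a piece of sub-data, this gives the value degree of that sub-data. In this way, the value degree of each piece of sub-data in the big data can be obtained. The higher the value degree of a piece of sub-data, the more valuable it is to the user, and the sub-data included in the big data is displayed to the user according to value degree.
In the process of implementing the present invention, the inventor found that the prior art has at least the following problem:
Currently, key-value pairs are allocated to the Reduce processing modules according to their keys, which may cause the load of the Reduce processing modules to be unbalanced.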
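The imbalance described above can be illustrated with a minimal Python sketch. It assumes a conventional partitioner that sends each pair to the Reduce module numbered hash(key) mod the number of modules; the function names, the use of MD5 as the hash, and the skewed sample data are all hypothetical and are not taken from the patent.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic integer hash of a key (MD5 is an illustrative assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def key_based_loads(pairs, num_reducers):
    """Send every pair to reducer hash(key) % num_reducers; count pairs per reducer."""
    loads = {r: 0 for r in range(num_reducers)}
    for key, _value in pairs:
        loads[stable_hash(key) % num_reducers] += 1
    return loads

# Hypothetical skewed input: key "A" is far more frequent than "B" and "C".
pairs = [("A", 1)] * 90 + [("B", 1)] * 5 + [("C", 1)] * 5
loads = key_based_loads(pairs, num_reducers=3)
print(loads)  # one reducer holds at least 90 of the 100 pairs
```

Because all pairs with the same key land on one module, a single frequent key is enough to overload that module regardless of which hash function is used.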
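Summary
To solve the problem in the prior art, embodiments of the present invention provide a method and an apparatus for processing big data. The technical solutions are as follows:
According to a first aspect, an embodiment of the present invention provides a method for processing big data, where the method includes:
obtaining a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder;
calculating, for each key-value pair set, the sum of the values it includes, to obtain the total value of the values included in each key-value pair set;
obtaining the load condition of each Reduce processing module in a Reduce processing module set;
allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module; and
processing, by each Reduce processing module, the key-value pair set allocated to it.
In a first possible implementation of the first aspect, the step of obtaining a plurality of key-value pair sets includes:
processing the big data by using a Map processing module to obtain a plurality of key-value pairs;
performing, according to a preset modulo coefficient, a modulo operation on the key of each of the obtained key-value pairs, to obtain the modulo remainder of each key-value pair; and
assigning the key-value pairs that have the same modulo remainder to one key-value pair set, to form the plurality of key-value pair sets.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the modulo coefficient = the number of Reduce processing modules × a modulo factor, where the modulo factor is predetermined.
In a third possible implementation of the first aspect, the step of allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module includes:
determining the current allocation count of the key-value pair sets;
finding, according to the determined current allocation count, the allocation ratio of key-value pair sets that corresponds to the allocation count, where the correspondence between allocation counts and allocation ratios is preset;
obtaining, according to the allocation ratio, a corresponding number of key-value pair sets from the plurality of key-value pair sets; and
allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the step of allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module includes:
allocating a corresponding key-value pair set to each Reduce processing module according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.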
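According to a second aspect, an embodiment of the present invention provides an apparatus for processing big data, where the apparatus includes:
a first obtaining module, configured to obtain a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder;
a calculating module, configured to calculate, for each key-value pair set, the sum of the values it includes, to obtain the total value of the values included in each key-value pair set;
a second obtaining module, configured to obtain the load condition of each Reduce processing module in a Reduce processing module set;
an allocating module, configured to allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module; and
a processing module, configured to process, by using each Reduce processing module, the key-value pair set allocated to it.
In a first possible implementation of the second aspect, the first obtaining module is configured to:
process the big data by using a Map processing module to obtain a plurality of key-value pairs;
perform, according to a preset modulo coefficient, a modulo operation on the key of each of the obtained key-value pairs, to obtain the modulo remainder of each key-value pair; and
assign the key-value pairs that have the same modulo remainder to one key-value pair set, to form the plurality of key-value pair sets.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the modulo coefficient = the number of Reduce processing modules × a modulo factor, where the modulo factor is predetermined.
In a third possible implementation of the second aspect, the allocating module is configured to:
determine the current allocation count of the key-value pair sets;
find, according to the determined current allocation count, the allocation ratio of key-value pair sets that corresponds to the allocation count, where the correspondence between allocation counts and allocation ratios is preset;
obtain, according to the allocation ratio, a corresponding number of key-value pair sets from the plurality of key-value pair sets; and
allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the allocating module is further configured to:
allocate a corresponding key-value pair set to each Reduce processing module according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows:
The method and apparatus for processing big data provided by the embodiments of the present invention allocate the key-value pairs in each partition to the Reduce processing modules according to the total value of each key-value pair set in each partition and the load condition of each Reduce processing module. Compared with the usual approach of allocating key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.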
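Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings needed for describing the embodiments. Apparently, the drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these drawings without creative efforts.
FIG. 1 is a schematic structural diagram of an implementation environment involved in the method for processing big data according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of another implementation environment involved in the method for processing big data according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for processing big data according to Embodiment 1 of the present invention;
FIG. 4 is a flowchart of a method for processing big data according to Embodiment 2 of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for processing big data according to Embodiment 3 of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for processing big data according to Embodiment 4 of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
Unless otherwise defined, the technical or scientific terms used herein shall have the ordinary meaning understood by a person of ordinary skill in the field to which the present invention belongs. Words such as "first" and "second" used in the specification and claims of this patent application do not indicate any order, quantity, or importance, but are used only to distinguish different components. Likewise, words such as "a" or "an" do not indicate a limitation on quantity, but indicate that at least one exists. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Referring to FIG. 1, which shows a schematic structural diagram of an implementation environment involved in the method for processing big data according to an embodiment of the present invention, the system includes a job server 10 and a task server 20 that exchanges data with the job server 10.
The job server 10 is provided with a job tracker 11. The job tracker 11 allocates a corresponding key-value pair set to each Reduce processing module according to a preset allocation rule, based on the total value of each key-value pair set transmitted by the task server 20 and the load condition of each Reduce processing module obtained directly from the Reduce processing module set. Allocating a corresponding key-value pair set to each Reduce processing module means establishing a correspondence between the key-value pair sets and the Reduce processing modules; the job tracker 11 then feeds the established correspondence back to the task server 20.
The task server 20 is provided with a splitter 21, a Map processing module 22, a Reduce processing module 23, a partitioner 24, and a task tracker 25. The splitter 21 is configured to divide the big data into a plurality of data fragments so that the Map processing module 22 can process the big data. The Map processing module 22 is configured to perform Map processing on the data fragments obtained from the big data to obtain a plurality of key-value pairs, and to send the obtained key-value pairs to the partitioner 24. The partitioner 24 is configured to perform a modulo operation on the obtained key-value pairs, assign key-value pairs with the same modulo remainder to one key-value pair set, then calculate the total value of each key-value pair set, and transmit the total values of the key-value pair sets to the job tracker 11. The task tracker 25 is configured to obtain the correspondence between the key-value pair sets and the Reduce processing modules fed back by the job tracker 11, determine the current allocation count of the key-value pair sets, find the allocation ratio corresponding to the determined allocation count, and then, according to the allocation ratio, control each Reduce processing module 23 to obtain from the partitioner 24 the number of key-value pair sets corresponding to the allocation ratio. The Reduce processing module 23 is configured to process the obtained key-value pair sets and output the processing result.
The total value of a key-value pair set is obtained by accumulating the values of the key-value pairs in the set.
Optionally, referring to FIG. 2, which shows a schematic structural diagram of another implementation environment involved in the method for processing big data according to an embodiment of the present invention, in this implementation scenario the job server 10 is disposed in the task server 20 in the form of a functional module.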
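Embodiment 1
Referring to FIG. 3, an embodiment of the present invention provides a method for processing big data. The procedure of the method includes:
Step 100: The partitioner of the task server obtains a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder.
Step 101: The partitioner of the task server calculates, for each key-value pair set, the sum of the values it includes, to obtain the total value of the values included in each key-value pair set.
Step 102: The job tracker of the job server obtains the load condition of each Reduce processing module in the Reduce processing module set.
Step 103: The job tracker of the job server allocates a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module.
Step 104: Each Reduce processing module of the task server processes the key-value pair set corresponding to that Reduce processing module.
The method for processing big data provided by this embodiment allocates the key-value pairs in each partition to the Reduce processing modules according to the total value of each key-value pair set in each partition and the load condition of each Reduce processing module. Compared with the usual approach of allocating key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.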
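Embodiment 2
Referring to FIG. 4, an embodiment of the present invention provides a method for processing big data. The procedure of the method includes:
Step 200: The partitioner of the task server obtains a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder.
The preset data relationship between a key and a value refers to the correspondence between a search condition for the big data obtained by the Map processing module and the search result obtained after the search of the big data is completed: the search condition is the key, and the search result is the value. This correspondence forms a key-value pair. When the Map processing module searches the big data using a search condition and obtains the corresponding search result, it outputs a key-value pair in the form (search condition, search result).
For example, to find the number of occurrences of the character "华" in a book, the character "华" is the search condition and the number of occurrences of "华" is the search result; the Map processing module searches the book for "华" and, after the search is completed, outputs a key-value pair in the form ("华", number of occurrences of "华"). To find, in a set of web pages, the web pages that have a hyperlink containing the two characters "中华", the hyperlink containing "中华" is the search condition and the web pages having such a hyperlink are the search result; the Map processing module searches the set of web pages for pages that have a hyperlink containing "中华" and, after the search is completed, outputs key-value pairs in the form (hyperlink containing "中华", web page having a hyperlink containing "中华").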
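The character-counting example above can be sketched in a few lines of Python. The function name `map_count` and the one-line sample "book" are hypothetical; the sketch only shows the (search condition, search result) shape of the output pair.

```python
def map_count(condition: str, text: str):
    """Return one (search condition, search result) pair:
    the condition and the number of times it occurs in the text."""
    return (condition, text.count(condition))

# Hypothetical one-line "book" used only for illustration.
book = "华为在中华大地,华灯初上。"
pair = map_count("华", book)
print(pair)
```

Here the key is the character being sought and the value is a number, which is the kind of value that can later be summed into a total value per key-value pair set.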
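Specifically, step 200 is described in detail by steps 2001 to 2003:
Step 2001: The Map processing module of the task server processes the big data to obtain a plurality of key-value pairs.
Specifically, the big data processed by the Map processing module is the plurality of data blocks into which the splitter of the task server divides the big data to be processed. After receiving the data blocks transmitted by the splitter, the Map processing module searches each received data block for the content of a preset key and obtains the search results in the form of key-value pairs.
For example, in the search engine field, the search results contained in the big data are taken as input. When a user wants to find, among the numerous web pages on the network, the hyperlinks containing the two characters "中华", the search finds 100 web pages that have a hyperlink containing "中华". Among these 100 web pages, the hyperlinks containing "中华" are A and B; the web pages that have hyperlink A are E, F, and G, and the web pages that have hyperlink B are F, G, and H. The key-value pairs obtained after processing by the Map processing module are then (A, E), (A, F), (A, G), (B, F), (B, G), and (B, H).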
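The hyperlink example above can be reproduced with a short Python sketch. The function name `map_links` and the dictionary representation of "which pages contain which hyperlink" are assumptions made for illustration.

```python
def map_links(link_to_pages):
    """Emit one (hyperlink, page) pair for every page that contains the hyperlink."""
    return [(link, page) for link, pages in link_to_pages.items() for page in pages]

# Hyperlinks A and B and the pages that contain them, as in the example.
link_to_pages = {"A": ["E", "F", "G"], "B": ["F", "G", "H"]}
pairs = map_links(link_to_pages)
print(pairs)
```

The output lists the six pairs in the same order as the text: three for hyperlink A followed by three for hyperlink B.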
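The splitter splits the big data according to the specific content of the big data. For example, if the big data to be processed is an e-book, the e-book may be divided into a plurality of data blocks with each paragraph as one data block, or with each sentence as one data block. If the big data is a set of home pages of a website, the web pages in the set may be divided into a plurality of data blocks with each web page as one data block.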
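The two e-book splitting strategies can be sketched as follows. This is a minimal illustration only: the function names, the blank-line paragraph separator, and the set of sentence-ending punctuation marks are assumptions, since the patent does not specify how paragraph or sentence boundaries are detected.

```python
import re

def split_paragraphs(text: str):
    """Treat each non-empty paragraph (separated by blank lines) as one data block."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text: str):
    """Treat each sentence as one data block.
    The terminator set (. ! ? and their full-width forms) is an assumption."""
    return [s.strip() for s in re.findall(r"[^.!?。!?]+[.!?。!?]?", text) if s.strip()]

print(split_paragraphs("P1 a. P1 b.\n\nP2 a."))
print(split_sentences("First. Second! Third?"))
```

Finer blocks (sentences) give the Map processing module more, smaller units of work; coarser blocks (paragraphs) give fewer, larger ones.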
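The Map processing module sends the key-value pairs formed by the processing to the partitioner.
Step 2002: The partitioner of the task server performs, according to a preset modulo coefficient, a modulo operation on the key of each of the obtained key-value pairs, to obtain the modulo remainder of each key-value pair.
Specifically, the partitioner of the task server receives the key-value pairs transmitted by the Map processing module and performs a hash operation on the key of each key-value pair to obtain a digit string corresponding to the key. According to the preset modulo coefficient, the partitioner then performs a modulo operation on the digit string of the key of each key-value pair, to obtain the modulo remainder of each key-value pair.
The modulo coefficient = the number of Reduce processing modules × the modulo factor. The modulo factor is predetermined and stored in the partitioner of the task server.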
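Step 2002 can be sketched in Python as a hash followed by a modulo operation. The patent does not name a hash function; MD5 is used here purely as a deterministic, illustrative choice, and the function names and sample parameters are hypothetical.

```python
import hashlib

def key_to_number(key: str) -> int:
    """Hash the key into an integer (MD5 is an illustrative, deterministic choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def modulo_remainder(key: str, num_reducers: int, modulo_factor: int) -> int:
    """Remainder of the hashed key modulo
    (number of Reduce processing modules x modulo factor)."""
    modulo_coefficient = num_reducers * modulo_factor
    return key_to_number(key) % modulo_coefficient

# Hypothetical setting: 3 Reduce processing modules and a modulo factor of 4,
# so remainders fall in the range 0..11.
r = modulo_remainder("A", num_reducers=3, modulo_factor=4)
print(r)
```

Because the modulo coefficient is a multiple of the number of Reduce processing modules, reducing the remainder again modulo that number gives the same result as the conventional hash(key) mod number-of-modules scheme, so each conventional partition is simply subdivided into `modulo_factor` finer sets.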
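Step 2003: The partitioner of the task server assigns the key-value pairs that have the same modulo remainder to one key-value pair set, to form a plurality of key-value pair sets.
Each key-value pair set has its own key-value pair set identifier.
As described in steps 2001 to 2003, before the key-value pair sets are formed, a modulo operation is performed on the obtained key-value pairs according to the preset modulo coefficient. Because the modulo coefficient equals the number of Reduce processing modules × the modulo factor, the key-value pairs can be distributed into more key-value pair sets than with the usual modulo approach, so that the key-value pairs are distributed more evenly and the time that the Reduce processing modules in the Reduce processing module set spend processing the key-value pairs in the key-value pair sets is more balanced.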
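The effect of the modulo factor on the number of sets can be seen in a small Python sketch. To keep the result predictable, the keys here are assumed to be already hashed to the integers 0 to 11, and the modulo remainder is used as the set identifier; both choices, like the function name, are assumptions for illustration.

```python
from collections import defaultdict

def group_by_remainder(hashed_pairs, num_reducers, modulo_factor=1):
    """Group (hashed_key, value) pairs into sets identified by their modulo remainder."""
    modulo_coefficient = num_reducers * modulo_factor
    sets = defaultdict(list)
    for hashed_key, value in hashed_pairs:
        sets[hashed_key % modulo_coefficient].append((hashed_key, value))
    return dict(sets)

# Twelve hypothetical pairs whose keys are already hashed to 0..11.
hashed_pairs = [(h, 1) for h in range(12)]
coarse = group_by_remainder(hashed_pairs, num_reducers=3)                  # usual approach
fine = group_by_remainder(hashed_pairs, num_reducers=3, modulo_factor=4)  # with a modulo factor
print(len(coarse), len(fine))
```

With a modulo factor of 1 there are only as many sets as Reduce processing modules, so there is nothing to rebalance; a larger factor yields more, smaller sets that can be matched to module loads.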
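Step 201: The partitioner of the task server calculates, for each key-value pair set, the sum of the values it includes, to obtain the total value of the values included in each key-value pair set.
The larger the total value of a key-value pair set, the longer it takes the Reduce processing module that obtains the set to process it.
After obtaining the total value of each key-value pair set, the partitioner generates the correspondence between the identifier and the total value of each key-value pair set, records it in a preset relation list, and then sends the resulting relation list to the job tracker of the job server.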
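A minimal Python sketch of step 201 follows. It assumes that the values are numeric (for example, counts) so that they can be summed, and it models the relation list as a dictionary from set identifier to total value; the function name and sample data are hypothetical.

```python
def build_relation_list(sets):
    """Map each key-value pair set identifier to the sum of the values in that set."""
    return {set_id: sum(value for _key, value in pairs) for set_id, pairs in sets.items()}

# Hypothetical key-value pair sets, keyed by set identifier.
sets = {
    0: [("A", 3), ("D", 2)],
    1: [("B", 7)],
    2: [("C", 1), ("E", 1), ("F", 1)],
}
relation_list = build_relation_list(sets)
print(relation_list)
```

Only this compact identifier-to-total mapping needs to be sent to the job tracker; the key-value pairs themselves stay with the partitioner until a Reduce processing module fetches them.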
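Step 202: The job tracker of the job server obtains the load condition of each Reduce processing module in the Reduce processing module set.
The job tracker of the job server obtains the load condition of each Reduce processing module from the task tracker of the task server.
A load list that records the load condition of each Reduce processing module in the Reduce processing module set is preset in the task tracker of the task server. The task tracker periodically obtains the load condition of each Reduce processing module to update the load conditions recorded in the load list.
The load list records the correspondence between the identifier of each Reduce processing module and the total value of the key-value pair sets that it has not yet processed.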
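The load list can be sketched in Python as a mapping from module identifier to the summed total values of its unprocessed sets. The periodic polling itself is not modeled; the function name, the identifiers "R1" to "R3", and the sample totals are hypothetical.

```python
def refresh_load_list(pending_totals_by_reducer):
    """Return reducer identifier -> sum of the total values of the
    key-value pair sets that reducer has not yet processed."""
    return {rid: sum(totals) for rid, totals in pending_totals_by_reducer.items()}

# Hypothetical snapshot: total values of the sets still queued at each module.
pending = {"R1": [10], "R2": [5, 15], "R3": []}
load_list = refresh_load_list(pending)
print(load_list)
```

A task tracker would call something like this on each polling cycle so that the job tracker always sees a recent view of the loads.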
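Step 203: The job tracker of the job server allocates a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module.
Specifically, step 203 is described in detail by steps 2031 to 2034:
Step 2031: The job tracker of the job server determines the current allocation count of the key-value pair sets.
The allocation count of the key-value pair sets is recorded in advance in an allocation list of the job tracker of the job server; the allocation list also records the allocation ratio of key-value pair sets corresponding to each allocation count.
For example, the allocation list records: allocation count 1, allocation ratio 20%; allocation count 2, allocation ratio 40%; allocation count 3, allocation ratio 60%; allocation count 4, allocation ratio 80%; allocation count 5, allocation ratio 100%.
The allocation list above records only one way of allocating the key-value pair sets; other allocation counts and corresponding allocation ratios may also be used, and they are not described one by one here.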
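The example allocation list can be written as a simple lookup table. In this Python sketch the ratios are stored as integer percentages to avoid floating-point rounding; the names `ALLOCATION_LIST` and `allocation_ratio` are hypothetical, and whether each ratio is cumulative or per round is not stated in the text and is not modeled here.

```python
# Allocation count -> allocation ratio in percent, as in the example list.
ALLOCATION_LIST = {1: 20, 2: 40, 3: 60, 4: 80, 5: 100}

def allocation_ratio(allocation_count, allocation_list=ALLOCATION_LIST):
    """Look up the preset allocation ratio (in percent) for an allocation count."""
    if allocation_count not in allocation_list:
        raise KeyError(f"no ratio configured for allocation count {allocation_count}")
    return allocation_list[allocation_count]

print([allocation_ratio(n) for n in range(1, 6)])
```

A different deployment could pass its own table through the `allocation_list` parameter, in line with the statement that other counts and ratios may be used.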
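Step 2032: According to the determined current allocation count of the key-value pair sets, the job tracker of the job server finds the allocation ratio of key-value pair sets that corresponds to the allocation count, where the correspondence between allocation counts and allocation ratios is preset.
According to the determined current allocation count, the job tracker of the job server finds the corresponding allocation ratio in the pre-stored allocation list.
Step 2033: According to the allocation ratio, the job tracker of the job server obtains a corresponding number of key-value pair sets from the plurality of key-value pair sets.
Specifically, according to the allocation ratio, the job tracker of the job server obtains, from the received relation list, the number of correspondences between key-value pair set identifiers and total values that corresponds to the allocation ratio.
For example, suppose the allocation list records a current allocation count of 1 with a corresponding allocation ratio of 40%. The job tracker then takes 40% of the correspondences between key-value pair set identifiers and total values recorded in the current relation list for allocation. If the relation list records 1000 such correspondences, 400 of them are allocated.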
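The arithmetic of this example (40% of 1000 entries gives 400) can be sketched in Python. The text does not say which entries are chosen, so taking the first entries in list order is an assumption, as are the function name and the placeholder total values.

```python
def select_for_round(relation_entries, ratio_percent):
    """Take the share of (set identifier, total value) entries given by the ratio.
    Choosing the first entries in list order is an assumption."""
    count = len(relation_entries) * ratio_percent // 100
    return relation_entries[:count]

# Hypothetical relation list with 1000 entries; each total value is set to 1 for brevity.
relation_entries = [(set_id, 1) for set_id in range(1000)]
selected = select_for_round(relation_entries, ratio_percent=40)
print(len(selected))
```

Integer arithmetic is used so that the number of selected entries is exact for whole-number percentages.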
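Step 2034: According to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module, the job tracker of the job server allocates a corresponding key-value pair set to each Reduce processing module.
Specifically, the job tracker of the job server obtains the identifiers of the Reduce processing modules from the load list and the identifiers of the key-value pair sets from the relation list. Based on the load condition corresponding to each Reduce processing module identifier and the total value corresponding to each key-value pair set identifier, it allocates a corresponding key-value pair set to each Reduce processing module according to the allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load. That is, it associates the obtained Reduce processing module identifiers with the key-value pair set identifiers to establish the correspondence between key-value pair sets and Reduce processing modules, and then feeds the generated correspondence back to the task tracker of the task server. According to this correspondence, the task tracker controls the Reduce processing modules to obtain the corresponding key-value pair sets from the partitioner.
For example, there are three Reduce processing modules A, B, and C, where the load of A is 10, the load of B is 20, and the load of C is 30, and there are three key-value pair sets a, b, and c to be allocated, where the total value of a is 30, the total value of b is 40, and the total value of c is 50. According to the allocation rule, key-value pair set a is allocated to Reduce processing module C, key-value pair set b is allocated to Reduce processing module B, and key-value pair set c is allocated to Reduce processing module A. After the allocation, the loads of Reduce processing modules A, B, and C are all 60, so the loads of A, B, and C are balanced.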
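The worked example above can be checked with a minimal Python sketch of the allocation rule. It assumes, as in the example, that there are exactly as many key-value pair sets as Reduce processing modules; how the rule extends to other cases is not specified in the text. The function name is hypothetical.

```python
def assign_sets(load_list, relation_list):
    """Pair the lightest-loaded module with the largest-total set, the next
    lightest with the next largest, and so on. Returns set id -> module id."""
    reducers = sorted(load_list, key=load_list.get)                       # ascending load
    set_ids = sorted(relation_list, key=relation_list.get, reverse=True)  # descending total
    return dict(zip(set_ids, reducers))

load_list = {"A": 10, "B": 20, "C": 30}        # module identifier -> current load
relation_list = {"a": 30, "b": 40, "c": 50}    # set identifier -> total value
assignment = assign_sets(load_list, relation_list)

# Loads after each module takes on its allocated set.
new_loads = dict(load_list)
for set_id, reducer in assignment.items():
    new_loads[reducer] += relation_list[set_id]
print(assignment, new_loads)
```

The resulting mapping is the correspondence that the job tracker would feed back to the task tracker; with the example figures every module ends at a load of 60.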
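As described in step 2034, allocating key-value pair sets to the Reduce processing modules under the rule that a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load keeps the load of the Reduce processing modules balanced, so that the Reduce processing modules can complete their allocated tasks at as nearly the same time as possible.
Step 204: Each Reduce processing module in the task server processes the key-value pair set allocated to it.
The method for processing big data provided by this embodiment allocates the key-value pairs in each partition to the Reduce processing modules according to the total value of each key-value pair set in each partition and the load condition of each Reduce processing module. Compared with the usual approach of allocating key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.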
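Embodiment 3
Referring to FIG. 5, an embodiment of the present invention provides an apparatus for processing big data, where the apparatus includes:
a first obtaining module 300, a calculating module 301, a second obtaining module 302, an allocating module 303, and a processing module 304.
Specifically, the first obtaining module 300 is configured to obtain a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder. The calculating module 301 is connected to the first obtaining module 300 and is configured to calculate, for each key-value pair set, the sum of the values it includes, to obtain the total value of the values included in each key-value pair set. The second obtaining module 302 is connected to the calculating module 301 and is configured to obtain the load condition of each Reduce processing module in the Reduce processing module set. The allocating module 303 is connected to the second obtaining module 302 and is configured to allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module. The processing module 304 is connected to the allocating module 303 and is configured to process, by using each Reduce processing module, the key-value pair set allocated to it.
The apparatus for processing big data provided by this embodiment allocates the key-value pairs in each partition to the Reduce processing modules according to the total value of each key-value pair set in each partition and the load condition of each Reduce processing module. Compared with the usual approach of allocating key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.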
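Embodiment 4
Referring to FIG. 6, an embodiment of the present invention provides an apparatus for processing big data, where the apparatus includes:
a first obtaining module 400, a calculating module 401, a second obtaining module 402, an allocating module 403, and a processing module 404.
Specifically, the first obtaining module 400 is configured to obtain a plurality of key-value pair sets, where each key-value pair set includes at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder. The calculating module 401 is connected to the first obtaining module 400 and is configured to calculate, for each key-value pair set, the sum of the values it includes, to obtain the total value of the values included in each key-value pair set. The second obtaining module 402 is connected to the calculating module 401 and is configured to obtain the load condition of each Reduce processing module in the Reduce processing module set. The allocating module 403 is connected to the second obtaining module 402 and is configured to allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set and the load condition of each Reduce processing module. The processing module 404 is connected to the allocating module 403 and is configured to process, by using each Reduce processing module, the key-value pair set allocated to it.
Specifically, the first obtaining module 400 is configured to:
process the big data by using a Map processing module to obtain a plurality of key-value pairs;
perform, according to a preset modulo coefficient, a modulo operation on the key of each of the obtained key-value pairs, to obtain the modulo remainder of each key-value pair; and
assign the key-value pairs that have the same modulo remainder to one key-value pair set, to form a plurality of key-value pair sets.
Further, the modulo coefficient = the number of Reduce processing modules × the modulo factor, where the modulo factor is predetermined.
Specifically, the allocating module 403 is configured to:
determine the current allocation count of the key-value pair sets;
find, according to the determined current allocation count, the allocation ratio of key-value pair sets that corresponds to the allocation count, where the correspondence between allocation counts and allocation ratios is preset;
obtain, according to the allocation ratio, a corresponding number of key-value pair sets from the plurality of key-value pair sets; and
allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values included in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
Further, the allocating module 403 is further configured to:
allocate a corresponding key-value pair set to each Reduce processing module according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.
The apparatus for processing big data provided by this embodiment allocates the key-value pairs in each partition to the Reduce processing modules according to the total value of each key-value pair set in each partition and the load condition of each Reduce processing module. Compared with the usual approach of allocating key-value pairs to Reduce tasks according to their keys, this makes the load of the Reduce processing modules more balanced.
It should be noted that, when the apparatus for processing big data provided by the foregoing embodiments processes big data, the division into the foregoing functional modules is used only as an example. In actual applications, the foregoing functions may be assigned to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for processing big data provided by the foregoing embodiments and the method embodiments for processing big data belong to the same concept. For the specific implementation process, see the method embodiments; details are not described herein again.
The sequence numbers of the foregoing embodiments of the present invention are merely for description and do not indicate the superiority or inferiority of the embodiments.
A person of ordinary skill in the art may understand that all or part of the steps of the foregoing embodiments may be completed by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Finally, it should be noted that the foregoing embodiments are merely intended to describe the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some or all of the technical features thereof; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.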

Claims (10)

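  1. A method for processing big data, characterized in that the method comprises:
    obtaining a plurality of key-value pair sets, wherein each key-value pair set comprises at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder;
    calculating, for each key-value pair set, the sum of the values it comprises, to obtain the total value of the values comprised in each key-value pair set;
    obtaining the load condition of each Reduce processing module in a Reduce processing module set;
    allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values comprised in each key-value pair set and the load condition of each Reduce processing module; and
    processing, by each Reduce processing module, the key-value pair set allocated to it.
  2. The method for processing big data according to claim 1, characterized in that the step of obtaining a plurality of key-value pair sets comprises:
    processing the big data by using a Map processing module to obtain a plurality of key-value pairs;
    performing, according to a preset modulo coefficient, a modulo operation on the key of each of the obtained key-value pairs, to obtain the modulo remainder of each key-value pair; and
    assigning the key-value pairs that have the same modulo remainder to one key-value pair set, to form the plurality of key-value pair sets.
  3. The method for processing big data according to claim 2, characterized in that the modulo coefficient = the number of Reduce processing modules × a modulo factor, wherein the modulo factor is predetermined.
  4. The method for processing big data according to claim 1, characterized in that the step of allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values comprised in each key-value pair set and the load condition of each Reduce processing module comprises:
    determining the current allocation count of the key-value pair sets;
    finding, according to the determined current allocation count, the allocation ratio of key-value pair sets that corresponds to the allocation count, wherein the correspondence between allocation counts and allocation ratios is preset;
    obtaining, according to the allocation ratio, a corresponding number of key-value pair sets from the plurality of key-value pair sets; and
    allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values comprised in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
  5. The method for processing big data according to claim 4, characterized in that the step of allocating a corresponding key-value pair set to each Reduce processing module according to the total value of the values comprised in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module comprises:
    allocating a corresponding key-value pair set to each Reduce processing module according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.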
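  6. An apparatus for processing big data, characterized in that the apparatus comprises:
    a first obtaining module, configured to obtain a plurality of key-value pair sets, wherein each key-value pair set comprises at least one key-value pair, the key and the value in a key-value pair are two pieces of sub-data in the big data to be processed, a preset data relationship exists between the two pieces of sub-data, and all key-value pairs in a key-value pair set have the same modulo remainder;
    a calculating module, configured to calculate, for each key-value pair set, the sum of the values it comprises, to obtain the total value of the values comprised in each key-value pair set;
    a second obtaining module, configured to obtain the load condition of each Reduce processing module in a Reduce processing module set;
    an allocating module, configured to allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values comprised in each key-value pair set and the load condition of each Reduce processing module; and
    a processing module, configured to process, by using each Reduce processing module, the key-value pair set allocated to it.
  7. The apparatus for processing big data according to claim 6, characterized in that the first obtaining module is configured to:
    process the big data by using a Map processing module to obtain a plurality of key-value pairs;
    perform, according to a preset modulo coefficient, a modulo operation on the key of each of the obtained key-value pairs, to obtain the modulo remainder of each key-value pair; and
    assign the key-value pairs that have the same modulo remainder to one key-value pair set, to form the plurality of key-value pair sets.
  8. The apparatus for processing big data according to claim 7, characterized in that the modulo coefficient = the number of Reduce processing modules × a modulo factor, wherein the modulo factor is predetermined.
  9. The apparatus for processing big data according to claim 6, characterized in that the allocating module is configured to:
    determine the current allocation count of the key-value pair sets;
    find, according to the determined current allocation count, the allocation ratio of key-value pair sets that corresponds to the allocation count, wherein the correspondence between allocation counts and allocation ratios is preset;
    obtain, according to the allocation ratio, a corresponding number of key-value pair sets from the plurality of key-value pair sets; and
    allocate a corresponding key-value pair set to each Reduce processing module according to the total value of the values comprised in each key-value pair set obtained according to the allocation ratio and the load condition of each Reduce processing module.
  10. The apparatus for processing big data according to claim 9, characterized in that the allocating module is further configured to:
    allocate a corresponding key-value pair set to each Reduce processing module according to an allocation rule under which a key-value pair set with a larger total value is allocated to a Reduce processing module with a lighter load.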
PCT/CN2015/097179 2014-12-26 2015-12-11 Method and apparatus for processing big data WO2016101798A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP15871867.6A EP3193264B1 (en) 2014-12-26 2015-12-11 Method and apparatus for processing big data
US15/481,606 US10691669B2 (en) 2014-12-26 2017-04-07 Big-data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410836226.5 2014-12-26
CN201410836226.5A CN105786938A (zh) 2014-12-26 2014-12-26 Method and apparatus for processing big data

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/481,606 Continuation US10691669B2 (en) 2014-12-26 2017-04-07 Big-data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2016101798A1 true WO2016101798A1 (zh) 2016-06-30

Family

ID=56149230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/097179 WO2016101798A1 (zh) 2014-12-26 2015-12-11 Method and apparatus for processing big data

Country Status (4)

Country Link
US (1) US10691669B2 (zh)
EP (1) EP3193264B1 (zh)
CN (1) CN105786938A (zh)
WO (1) WO2016101798A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
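CN106528712A (zh) * 2016-10-26 2017-03-22 国云科技股份有限公司 Method for comparing whether one group of big data exists in another group of big data
CN109144690B (zh) * 2018-07-06 2021-06-22 麒麟合盛网络技术股份有限公司 Task processing method and apparatus
CN110275703B (zh) * 2019-06-27 2023-06-06 浙江大搜车软件技术有限公司 Value assignment method and apparatus for key-value pair data, computer device, and storage medium
CN113094262B (zh) * 2021-03-29 2022-10-18 四川新网银行股份有限公司 Method for testing with production data based on database and table sharding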

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
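CN103678609A (zh) * 2013-12-16 2014-03-26 中国科学院计算机网络信息中心 Big data query method based on distributed relational-object mapping processing
CN103853727A (zh) * 2012-11-29 2014-06-11 深圳中兴力维技术有限公司 Method and system for improving query performance for large data volumes
CN103970604A (zh) * 2013-01-31 2014-08-06 国际商业机器公司 Method and apparatus for implementing graph processing based on the MapReduce architecture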

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5070456A (en) * 1989-12-27 1991-12-03 International Business Machines Corporation Method for facilitating the sorting of national language keys in a data processing system
US8726290B2 (en) * 2008-06-12 2014-05-13 Yahoo! Inc. System and/or method for balancing allocation of data among reduce processes by reallocation
CN102662639A (zh) * 2012-04-10 2012-09-12 南京航空航天大学 Mapreduce-based multi-GPU collaborative computing method
US20130290972A1 (en) * 2012-04-27 2013-10-31 Ludmila Cherkasova Workload manager for mapreduce environments
US20130332608A1 (en) * 2012-06-06 2013-12-12 Hitachi, Ltd. Load balancing for distributed key-value store
CN102799486B (zh) * 2012-06-18 2014-11-26 北京大学 Data sampling and partitioning method in a MapReduce system
CN102799628B (zh) * 2012-06-21 2015-10-07 新浪网技术(中国)有限公司 Method and apparatus for data partitioning in a key-value database
US9032416B2 (en) * 2012-07-30 2015-05-12 Oracle International Corporation Load balancing using progressive sampling based on load balancing quality targets
US9176712B2 (en) * 2013-03-14 2015-11-03 Oracle International Corporation Node Grouped Data Marshalling
US9336334B2 (en) * 2013-05-17 2016-05-10 Bigobject, Inc. Key-value pairs data processing apparatus and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
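CN103853727A (zh) * 2012-11-29 2014-06-11 深圳中兴力维技术有限公司 Method and system for improving query performance for large data volumes
CN103970604A (zh) * 2013-01-31 2014-08-06 国际商业机器公司 Method and apparatus for implementing graph processing based on the MapReduce architecture
CN103678609A (zh) * 2013-12-16 2014-03-26 中国科学院计算机网络信息中心 Big data query method based on distributed relational-object mapping processing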

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3193264A4 *

Also Published As

Publication number Publication date
EP3193264A4 (en) 2017-10-11
EP3193264B1 (en) 2019-06-26
EP3193264A1 (en) 2017-07-19
US10691669B2 (en) 2020-06-23
CN105786938A (zh) 2016-07-20
US20170212923A1 (en) 2017-07-27

Similar Documents

Publication Publication Date Title
JP6542909B2 (ja) File operation method and apparatus
US10360199B2 (en) Partitioning and rebalancing data storage
CN107391629B (zh) Inter-cluster data migration method, system, server, and computer storage medium
US8577892B2 (en) Utilizing affinity groups to allocate data items and computing resources
US9372880B2 (en) Reclamation of empty pages in database tables
US20190196875A1 (en) Method, system and computer program product for processing computing task
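WO2016101798A1 (zh) Method and apparatus for processing big data
WO2022111313A1 (zh) Request processing method and microservice system
JP2020531949A (ja) Lazy update of database hash codes in a blockchain
US9110917B2 (en) Creating a file descriptor independent of an open operation
TW201702870A (zh) Resource allocation method and apparatus
CN108140049B (zh) Parallel batch processing of tree-based data structures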
CN108140049B (zh) 基于树的数据结构的并行批量处理
US20110179041A1 (en) Matching service entities with candidate resources
JPWO2012063301A1 (ja) Computer system, multi-tenant control method, and multi-tenant control program
US10083121B2 (en) Storage system and storage method
US10042957B2 (en) Devices and methods for implementing dynamic collaborative workflow systems
CN111522788B (zh) Method and device for processing access requests and updating a storage system
US20190370259A1 (en) Devices and methods for implementing dynamic collaborative workflow systems
US20140365542A1 (en) Data processing system and method
US20210117483A1 (en) Document flagging based on multi-generational complemental secondary data
CN116166438A (zh) Cluster scale-out method and apparatus, electronic device, and storage medium
WO2018067420A1 (en) Perform graph traversal with graph query language

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15871867

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2015871867

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2015871867

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE