CN105653534B - Data processing method and device

Info

Publication number: CN105653534B
Application number: CN201410640319.0A
Authority: CN
Inventors: 陈维锋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2014-11-13
Filing date: 2014-11-13
Publication date: 2020-03-17
Anticipated expiration: 2034-11-13
Also published as: CN105653534A

Abstract

The invention relates to a data processing method comprising the following steps: receiving a map/reduce calculation request and acquiring data to be processed based on key-value pairs, wherein the data type of the keys in the data to be processed is an integer; calling a mapper to map the data to be processed to obtain intermediate result data; calling the corresponding reducers, according to the key of each key-value pair in the intermediate result data and the preset number of reducers, to perform reduction processing on the intermediate result data to obtain final result data; and outputting the final result data. The invention also provides a data processing device. The invention can reduce algorithm complexity and improve the operating efficiency of the overall process when data is processed with the map/reduce model.

Description

Data processing method and device

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus based on distributed computing.

Background

With the development of computer technology, the amount of data that computers need to process keeps growing, and a single computer cannot handle some large-scale data processing tasks, for example, searching for users who meet given requirements among the massive user base of a social networking platform. Therefore, it is generally necessary to combine a plurality of computers into a computer cluster and process large-scale data in parallel. To allow a plurality of computers to process large-scale data in parallel, the Map/Reduce model for parallel processing of large-scale data was developed.

Data processing with the map/reduce model generally includes a map phase and a reduce phase. In the map phase, a plurality of mappers (user applications that implement the map process in the map/reduce model) read the data to be processed, which is based on key-value pairs, from a plurality of input paths, and sort and stack that data to generate intermediate result data that is also based on key-value pairs. In the reduce phase, a plurality of reducers (user applications that implement the reduce process in the map/reduce model) summarize the intermediate result data into final result data and output the final result data through a plurality of output paths.

However, existing methods for processing data with the map/reduce model generally suffer from complex algorithms and low operating efficiency of the overall process. For example, when data is processed with the map/reduce model, a plurality of reducers must be designated in the reduce phase to summarize the intermediate result data, with each reducer processing a part of the intermediate result data. This raises the problem of how to partition the intermediate result data among the reducers. The existing approach is generally to partition the intermediate result data according to the key of each key-value pair, so that key-value pairs with the same key are processed by the same reducer. The algorithmic complexity of this partitioning is high, and the number of key-value pairs processed by each reducer may differ greatly, so the workload of the reducers is uneven. As a result, the reduce phase may process the intermediate result data inefficiently, which lowers the operating efficiency of the whole map/reduce flow.

Disclosure of Invention

In view of the above, it is necessary to provide a data processing method and apparatus that can reduce algorithm complexity during data processing and improve the operating efficiency of the overall process when the map/reduce model is used to process data.

A data processing method comprising the following steps: receiving a map/reduce calculation request and acquiring data to be processed based on key-value pairs, wherein the data type of the keys in the data to be processed is an integer; calling a mapper to map the data to be processed to obtain intermediate result data; calling the corresponding reducers, according to the key of each key-value pair in the intermediate result data and the preset number of reducers, to perform reduction processing on the intermediate result data to obtain final result data; and outputting the final result data.

A data processing apparatus comprising: an acquisition module, configured to receive a map/reduce calculation request and acquire data to be processed based on key-value pairs, wherein the data type of the keys in the data to be processed is an integer; a first processing module, configured to call a mapper to map the data to be processed to obtain intermediate result data; a second processing module, configured to call the corresponding reducers, according to the key of each key-value pair in the intermediate result data and the preset number of reducers, to perform reduction processing on the intermediate result data to obtain final result data; and an output module, configured to output the final result data.

Compared with the prior art, the data processing method and apparatus target key-value pairs whose keys are of integer type. After the mapper is called to map the data to be processed and obtain the intermediate result data, the corresponding reducers are called according to the key of each key-value pair in the intermediate result data and the preset number of reducers to reduce the intermediate result data. The intermediate result data can thus be distributed more evenly among the preset number of reducers, the algorithm is simpler, and the operating efficiency of the overall process can be improved when data is processed with the map/reduce model.

In order to make the aforementioned and other objects, features and advantages of the invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

Fig. 1 is a block diagram of a computer.

Fig. 2 is a schematic environmental diagram of the data processing method according to the embodiment of the present invention.

Fig. 3 is a flowchart of a data processing method according to a first embodiment of the present invention.

Fig. 4 is a flowchart of a data processing method according to a third embodiment of the present invention.

Fig. 5 is a block diagram of a data processing apparatus according to a fourth embodiment of the present invention.

Fig. 6 is a block diagram of a data processing apparatus according to a fifth embodiment of the present invention.

Detailed Description

To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects according to the present invention will be made with reference to the accompanying drawings and preferred embodiments.

Fig. 1 shows a block diagram of a computer. As shown in fig. 1, the computer 1 includes one or more memories 11 (only one is shown in the figure), a processor 12, a memory controller 13, a peripheral interface 14, a communication module 15, an input unit 16, and a display unit 17. These components communicate with each other via one or more communication buses/signal lines.

It will be understood by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting on the structure of the computer 1. For example, computer 1 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.

The memory 11 can be used for storing software programs and modules, such as program instructions/modules corresponding to the data processing method and apparatus in the embodiment of the present invention, and the processor 12 executes various functional applications and data processing by running the software programs and modules stored in the memory 11, so as to implement the data processing method described above.

The memory 11 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 11 may further include memory remotely located from processor 12, which may be connected to computer 1 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. Access to the memory 11 by the processor 12 and possibly other components may be under the control of a memory controller 13.

The peripheral interface 14 couples various input/output devices to the processor 12 and to the memory 11. The processor 12 runs various software, instructions, and performs various functions of the computer 1 and data processing within the memory 11.

The communication module 15 is used for communicating with a communication network or other devices. Specifically, the communication module 15 may be, for example, a network card 151 or an RF (Radio Frequency) module 152. The network card 151 serves as the interface between a computer and the transmission medium in a local area network. It implements the physical connection to, and electrical signal matching with, the transmission medium of the local area network, thereby establishing the local area network and connecting to the Internet to communicate with various networks such as local area networks, metropolitan area networks, and wide area networks. The network card 151 may include various conventional circuit elements for performing these functions, such as a processor and memory (including ROM and RAM). The RF module 152 is used for receiving and transmitting electromagnetic waves and for converting between electromagnetic waves and electrical signals, thereby communicating with a communication network or other devices. The RF module 152 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so forth. The RF module 152 may communicate with various networks such as the Internet, an intranet, or a wireless network, or with other devices over a wireless network. The wireless network may comprise a cellular telephone network, a wireless local area network, or a metropolitan area network. The wireless network may use various communication standards, protocols, and technologies, including, but not limited to, Global System for Mobile Communication (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Wireless Fidelity (WiFi) (e.g., Institute of Electrical and Electronics Engineers (IEEE) standards IEEE 802.11a, IEEE 802.11b, IEEE 802.11g and/or IEEE 802.11n), Voice over Internet Protocol (VoIP), Worldwide Interoperability for Microwave Access (Wi-Max), other short message communication protocols, and any other suitable communication protocols, and may even include protocols that have not yet been developed.

The input unit 16 may be used to receive input character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. Specifically, the input unit 16 may include a key 161 and a touch surface 162. The keys 161 may include, for example, character keys for inputting characters, and control keys for triggering control functions. Examples of control keys include a "back to home" key, a power on/off key, a take picture key, and the like. The touch surface 162 may collect touch operations by a user on or near the touch surface 162 (e.g., operations by a user on or near the touch surface 162 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a predetermined program. Alternatively, the touch surface 162 may include two portions, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 12, and can receive and execute commands sent by the processor 12. In addition, the touch surface 162 may be implemented using various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 16 may include other input devices in addition to the touch surface 162. Such other input devices include, but are not limited to, one or more of a physical keyboard, trackball, mouse, joystick, and the like.

The display unit 17 is used to display information input by the user, information provided to the user, and the various graphical user interfaces of the computer 1. These graphical user interfaces may be made up of graphics, text, icons, video, and any combination thereof. In one example, the display unit 17 includes a display panel 171. The display panel 171 may be, for example, a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display panel, an Electrophoretic Display (EPD), or the like. Further, the touch surface 162 may be disposed on the display panel 171 so as to form an integral unit with the display panel 171.

Fig. 2 is a schematic diagram of an environment in which the data processing method according to the embodiment of the present invention is applied. In this embodiment, the data processing method is applied to a computer cluster 3 formed by connecting one or more computers 1 via a network 2. The computers 1 in the computer cluster 3 perform network communication and data interaction with each other via the network 2. The computer cluster 3 may also communicate with one or more terminals 4 (only one is shown in fig. 2) via the network 2. The computer cluster 3 may receive a data processing request from the terminal 4, process the data to be processed according to that request (for example, by sorting, stacking, or screening it), and then return the processing result to the terminal 4.

In this embodiment, the terminal 4 may also request large-scale data processing from the computer cluster 3. For example, a social networking platform is deployed in the computer cluster 3, and the platform uses a user account management system, that is, each legitimate user of the platform has a user account and a password that are valid for the platform. The user account may consist of letters, numbers, symbols, or a combination thereof, set by the user or allocated by the system. A client of the social networking platform is installed and runs in the terminal 4. The user can log in to the platform through the client in the terminal 4 with a valid user account and password, and send network requests to the computer cluster 3 through the client, or receive network information returned by the computer cluster 3, so as to access or use the various services of the platform, for example, sending instant messages to other users, browsing articles or comments published by other users, and searching the platform for users, articles, or comments that meet given requirements. If the platform already has a large number of users who have published a large number of articles or comments, then the data that the computer cluster 3 must process becomes very large in scale when the terminal 4 sends a data processing request asking it to search the users by information such as nickname, location, or school of graduation, or to search the articles or comments by information such as publisher, publication time, or content keywords.

In order to allow the computers 1 in the computer cluster 3 to process large-scale data in parallel, a distributed computing model is deployed in the computer cluster 3. In this embodiment, the distributed computing model may be a Map/Reduce model. Data processing with the map/reduce model generally includes a map phase and a reduce phase. In the map phase, a plurality of mappers (user applications that implement the map process in the map/reduce model) read the data to be processed, which is based on key-value pairs, from a plurality of input paths, and sort and stack that data to generate intermediate result data that is also based on key-value pairs. In the reduce phase, a plurality of reducers (user applications that implement the reduce process in the map/reduce model) summarize the intermediate result data into final result data and output the final result data through a plurality of output paths. The map/reduce model may be implemented with the Hadoop Streaming programming tool, with which a user may write an executable file or script to serve as the mapper and the reducer.
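As a rough illustration of the two phases described above, the following Python sketch models a mapper and a reducer in the style of a streaming script. The tab-separated `key<TAB>value` line format and the function names `mapper` and `reducer` are assumptions made for this example; they are not taken from the patent.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: parse each 'key<TAB>value' line into an (int key, value) pair."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        yield int(key), value


def reducer(pairs):
    """Reduce phase: collect all values that share a key into one record."""
    by_key = sorted(pairs, key=lambda kv: kv[0])  # stable sort keeps value order
    for key, group in groupby(by_key, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


# Hypothetical input lines, for illustration only.
lines = ["3\talice\n", "1\tbob\n", "3\tcarol\n"]
result = list(reducer(mapper(lines)))
```

Here the sort stands in for the grouping that the framework performs between the two phases; in a real cluster that grouping is done by the framework rather than by the reducer itself.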

Specific examples of the computer 1 include, but are not limited to, a desktop computer, a portable computer, and other computer devices with high computing capability. Specific examples of the terminal 4 include, but are not limited to, a smart phone, a tablet computer, a vehicle-mounted computer, a Personal Digital Assistant (PDA), an electronic book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a set-top box, a smart television, a wearable device, or other similar computing devices. The network 2 may be any kind of network connection, such as the Internet, a mobile Internet (e.g., a 2G or 3G network provided by a telecom operator), or a local area network (wired or wireless).

First embodiment

Referring to fig. 3, a first embodiment of the present invention provides a data processing method, which includes the following steps:

Step S1: receiving a map/reduce calculation request, and acquiring data to be processed based on key-value pairs, wherein the data type of the keys in the data to be processed is an integer;

Step S2: calling a mapper to map the data to be processed to obtain intermediate result data;

Step S3: calling the corresponding reducers, according to the key of each key-value pair in the intermediate result data and the preset number of reducers, to perform reduction processing on the intermediate result data to obtain final result data; and

Step S4: outputting the final result data.

This data processing method targets key-value pairs whose keys are of integer type. After the mapper is called to map the data to be processed and obtain the intermediate result data, the corresponding reducers are called according to the key of each key-value pair in the intermediate result data and the preset number of reducers to reduce the intermediate result data. The intermediate result data can thus be distributed more evenly among the preset number of reducers, the algorithm is simpler, and the operating efficiency of the overall process when data is processed with the map/reduce model can be improved.
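Steps S1 to S4 can be sketched end to end in a single process. The following Python example is only a minimal model under stated assumptions: the function name `run_job`, the callback signatures, and the sample records are hypothetical, and a real map/reduce framework would run the mappers and reducers on separate machines.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn, num_reducers):
    """Single-process model of steps S1 to S4 (illustrative only)."""
    # S1: `records` are the data to be processed, as (integer key, value) pairs.
    # S2: the mapper turns each record into zero or more intermediate pairs.
    intermediate = [pair for record in records for pair in map_fn(record)]

    # S3: route every intermediate pair to reducer number (key % num_reducers).
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        buckets[key % num_reducers][key].append(value)

    final = []
    for bucket in buckets:  # each bucket is the share of one reducer
        for key in sorted(bucket):
            final.append(reduce_fn(key, bucket[key]))

    # S4: output the final result data.
    return final


records = [(1, 2), (2, 3), (1, 5), (4, 1)]
final = run_job(
    records,
    map_fn=lambda record: [record],                     # identity mapper
    reduce_fn=lambda key, values: (key, sum(values)),   # sum values per key
    num_reducers=2,
)
```

With two reducers, keys 2 and 4 go to reducer 0 and key 1 goes to reducer 1, so the output lists the results of reducer 0 first.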

In some examples, implementation details of the steps of the above method are as follows:

The map/reduce calculation request of step S1 may be generated from the various data processing requests that the client in the terminal 4 sends to the computer cluster 3. For example, when the client in the terminal 4 sends a search request to the computer cluster 3 to find, among the massive user base of the social networking platform, a designated user matching a preset keyword, the search request involves large-scale data processing. The computer cluster 3 may therefore issue a map/reduce calculation request based on the search request, asking that the search request be processed with the map/reduce distributed computing model in the computer cluster 3, so that the designated user is finally found among the massive user base.

The data to be processed includes, for example, the index information of the designated user, the index information of the massive users, and storage positions and sorting modes of the index information. The data to be processed is based on key-value pairs, i.e. the information in the data to be processed is in the form of key-value pairs (key-value pair). Specifically, when the computer cluster 3 receives various data processing requests sent by the terminal 4, the data to be processed is determined, and the determined data to be processed is arranged into a key value pair form. In this embodiment, the data type of the key in each key value pair in the data to be processed is an integer, for example, a 64-bit integer. In one example, the to-be-processed data may include an index identifier (ID, Identity) for data retrieval, where the data type of the index identifier is an integer, and step S1 may set the index identifier in the to-be-processed data to be the key of each key-value pair in the to-be-processed data in the step of acquiring the to-be-processed data based on the key-value pair. Furthermore, the data to be processed may be stored under a plurality of input paths in the computer cluster 3, each input path storing a portion of the data to be processed. Step S1 may obtain the data to be processed from the specified input path according to the map/reduce calculation request.
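As a small illustration of arranging records into key-value form with the integer index identifier as the key, consider the following Python sketch. The field name `index_id`, the record layout, and the signed 64-bit range check are assumptions for this example and are not specified by the patent.

```python
INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1  # assumed signed 64-bit range


def to_key_value_pairs(records, id_field="index_id"):
    """Use each record's integer index ID as the key; the rest becomes the value."""
    pairs = []
    for record in records:
        key = int(record[id_field])
        if not INT64_MIN <= key <= INT64_MAX:
            raise ValueError(f"key {key} does not fit in a 64-bit integer")
        value = {name: v for name, v in record.items() if name != id_field}
        pairs.append((key, value))
    return pairs


# Hypothetical user-index records, for illustration only.
records = [
    {"index_id": 42, "nickname": "alice"},
    {"index_id": 7, "nickname": "bob"},
]
pairs = to_key_value_pairs(records)
```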

Step S2 calls a mapper to perform mapping processing on the data to be processed. In this embodiment, the mapper performs the mapping according to the input path of the data to be processed. Specifically, step S2 splits the data to be processed into as many data blocks as there are mappers in the map/reduce model, and each data block is mapped by one mapper in the map/reduce model. The number of mappers may be specified by the user. The data is split so that the workload is spread as evenly as possible across the mappers. When processing its data block, each mapper may first identify the input path of that block. For example, the mapper may obtain the input path of the data block through the standard C library call getenv("map_input_file"), with code as follows:

char *pPath;

pPath = getenv("map_input_file");

where pPath is the input path of the data block to be processed.
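When the mapper is written as a script rather than as a compiled program, the same lookup can be done through the process environment. The following Python sketch assumes that the framework exports the input path in the `map_input_file` environment variable named above; the fallback name `mapreduce_map_input_file` is an assumption about newer Hadoop releases and should be checked against the version in use.

```python
import os


def current_input_path(env=os.environ):
    """Return the input path of the data block this mapper is processing.

    `map_input_file` is the variable named in the text; the second name is an
    assumed alternative for newer releases. Returns None if neither is set.
    """
    for name in ("map_input_file", "mapreduce_map_input_file"):
        path = env.get(name)
        if path:
            return path
    return None


# Example with an explicit environment mapping (hypothetical path).
example_path = current_input_path({"map_input_file": "/data/input/part-00000"})
```

Passing the environment as a parameter keeps the function easy to exercise outside a cluster; inside a running mapper it would simply be called with no arguments.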

The mapper then performs mapping processing on the corresponding data block according to the identified input path, so as to obtain the intermediate result data. The intermediate result data is likewise based on key-value pairs; the keys of its key-value pairs are carried over from the keys of the key-value pairs in the data to be processed, and their data type is also an integer.

The preset number of reducers described in step S3 may also be specified by the user. For the intermediate result data obtained in step S2, step S3 needs to call a preset number of reducers to summarize the intermediate result data, i.e., perform a reduction process on the intermediate result data. Each of the reducers is used for carrying out reduction processing on a part of data in the intermediate result data.

Step S3 calls the corresponding reducers to reduce the intermediate result data according to the key of each key-value pair in the intermediate result data and the preset number of reducers, so as to obtain the final result data. Specifically, since the data type of the key of each key-value pair in the intermediate result data is an integer, step S3 may take the key of each key-value pair modulo the preset number of reducers, and then submit the key-value pairs whose keys yield the same remainder to the same reducer for reduction. Each of the preset number of reducers thus processes the key-value pairs whose keys yield one particular remainder. For example, the key-value pairs whose keys yield a remainder of 0 are all processed by the first reducer, the key-value pairs whose keys yield a remainder of 1 are all processed by the second reducer, and so on. Calling the reducers according to the remainder of each key modulo the preset number of reducers makes the workload of the reducers more even.
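A short worked example makes the modulo rule concrete. The following Python sketch assigns six consecutive integer keys to three reducers; the key values and the function name `assign_reducer` are chosen only for illustration.

```python
def assign_reducer(key, num_reducers):
    """Index of the reducer that handles `key`: the remainder key mod num_reducers."""
    return key % num_reducers


keys = [10, 11, 12, 13, 14, 15]
num_reducers = 3

# Group the keys by the reducer they are routed to.
assignment = {
    reducer: [key for key in keys if assign_reducer(key, num_reducers) == reducer]
    for reducer in range(num_reducers)
}
```

Each reducer receives two of the six keys here. Note that Python's `%` always returns a non-negative result for a positive divisor, whereas in some other languages, such as Java and C, the remainder of a negative key can be negative and would need to be normalised before being used as a reducer index.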

In one example, step S3 may inherit from the partitioner class that already exists in the map/reduce model to generate a new class, which in this embodiment is named the DocidPartitioner class. The DocidPartitioner class implements the scheme of taking the key of each key-value pair in the intermediate result data modulo the preset number of reducers and submitting the key-value pairs whose keys yield the same remainder to the same reducer for reduction. The partitioner class is the class originally used in the map/reduce model to divide the intermediate result data among the preset number of reducers for reduction; through the partitioner class, key-value pairs with the same key in the intermediate result data are handed to the same reducer for processing.

Specifically, step S3 may generate the DocidPartitioner class by inheriting from the partitioner class with the following Java code:
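The Java listing itself does not appear in this text. As a stand-in, the following Python sketch shows the general shape of such a subclass: a base partitioner with an overridable method, and a DocidPartitioner that overrides it with the modulo rule. The base class `Partitioner` and the method name `get_partition` are hypothetical stand-ins written for this example; they are not Hadoop's actual API and not the patent's code.

```python
class Partitioner:
    """Hypothetical stand-in for the framework's base partitioner class."""

    def get_partition(self, key, value, num_partitions):
        raise NotImplementedError("subclasses decide how keys map to reducers")


class DocidPartitioner(Partitioner):
    """Routes each pair to the reducer given by its integer key mod the reducer count."""

    def get_partition(self, key, value, num_partitions):
        if num_partitions <= 0:
            raise ValueError("the number of reducers must be positive")
        # int() also accepts keys that arrive as decimal strings.
        return int(key) % num_partitions


partitioner = DocidPartitioner()
# Keys 4 and 8 share remainder 0 with four reducers, so they meet at reducer 0.
targets = [partitioner.get_partition(key, None, 4) for key in (4, 7, 8, 13)]
```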

Step S4 acquires the generated final result data from the preset number of reducers and outputs it. In this embodiment, the final result data may be output by using the MultipleSequenceFileOutputFormat class in the map/reduce model. Through the MultipleSequenceFileOutputFormat class, the final result data can be output to a designated output path in SequenceFile format.

In summary, the data processing method of this embodiment targets key-value pairs whose keys are of integer type. After the mapper is called to map the data to be processed and obtain the intermediate result data, the corresponding reducers are called according to the key of each key-value pair in the intermediate result data and the preset number of reducers to reduce the intermediate result data. The intermediate result data can thus be distributed more evenly among the preset number of reducers, the algorithm is simpler, and the operating efficiency of the overall process when data is processed with the map/reduce model is improved.

Second embodiment

In the overall process of the data processing method provided in the first embodiment, the data storage format is not restricted, and the data to be processed, the intermediate result data, and the final result data are usually transmitted as character strings. However, to prevent the key of a key-value pair from being cut off by a special character in the string, such as a slash "/", the data to be processed, the intermediate result data, and the final result data must be encrypted before they are transmitted and decrypted before they are processed. The algorithms involved in encryption and decryption are complex and occupy a large amount of memory, so the data storage load throughout the data processing flow is high and the operating efficiency is low.

To further solve the above problem, the second embodiment of the present invention provides a data processing method that differs from the method of the first embodiment in that, in step S1, the data storage format is specified as typedbytes when the map/reduce calculation request is received. The typedbytes storage format is a binary file format. Once the data storage format is specified as typedbytes, all data in the overall flow of the data processing method of this embodiment, including the data to be processed, the intermediate result data, and the final result data, is represented as [1-byte type + 4-byte length + original bytes]. In one example, step S1 may specify the data storage format as typedbytes through the -io option of the Hadoop Streaming programming tool according to a user operation. Because the data processing method of this embodiment specifies the data storage format as typedbytes when receiving the map/reduce calculation request, the algorithm is simpler, the amount of data stored at each stage of the overall process can be reduced, and the operating efficiency of the overall process is improved.
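The [1-byte type + 4-byte length + original bytes] layout can be illustrated with a few lines of Python. This is a sketch of the framing idea only, not a complete implementation of the format: the big-endian byte order and the use of type code 0 for a raw byte payload are assumptions made for this example.

```python
import struct

HEADER = ">BI"                          # 1-byte type code, 4-byte big-endian length (assumed)
HEADER_SIZE = struct.calcsize(HEADER)   # 5 bytes
RAW_BYTES = 0                           # assumed type code for a raw byte payload


def encode(type_code, payload):
    """Frame `payload` as [1-byte type][4-byte length][original bytes]."""
    return struct.pack(HEADER, type_code, len(payload)) + payload


def decode(buffer, offset=0):
    """Read one framed record; return (type code, payload, offset of next record)."""
    type_code, length = struct.unpack_from(HEADER, buffer, offset)
    start = offset + HEADER_SIZE
    return type_code, buffer[start:start + length], start + length


# A payload containing "/" needs no escaping, because the length prefix,
# not a delimiter character, marks where the record ends.
frame = encode(RAW_BYTES, b"12/34")
decoded = decode(frame)
```

Because record boundaries come from the length field, a special character inside a key cannot truncate it, which is the problem the string-based transmission in the first embodiment had to work around.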

Third embodiment

In the data processing method of the first embodiment of the present invention, the final result data is output by using the MultipleSequenceFileOutputFormat class that already exists in the map/reduce model. However, for final result data whose keys are of integer type, output through the MultipleSequenceFileOutputFormat class can only go to one designated output path within a single map/reduce processing flow. If the final result data is to be output in shards to a plurality of designated output paths, a plurality of map/reduce processing flows must be run, which not only raises algorithm complexity but also lengthens the time needed to process the data with the map/reduce model.

To further address the above problem, referring to fig. 4, a third embodiment of the present invention provides a data processing method in which, compared with the data processing method of the first embodiment, step S4 includes:

Step S4.1: identifying the category of each key-value pair in the final result data. Step S4.1 may identify the category of each key-value pair in the final result data according to a specified field of the key-value pair. The category of a key-value pair may be, for example, abstract, inline, inverted, etc.

Step S4.2: marking each key-value pair in the final result data according to the category of the key-value pair. In this embodiment, step S4.2 may mark the key in each key-value pair in the final result data according to the category to which the key-value pair belongs, so that keys of key-value pairs in different categories carry different marks. In one example, because the data type of the keys in the final result data is an integer, the key in each key-value pair may be processed according to the category of the key-value pair as follows: erase the most significant digit of the key, then append a preset identification (flag) value corresponding to the category of the key-value pair as the least significant digit of the key, where key-value pairs of different categories have different preset identification values. For example, if the key is a decimal number, the most significant digit of the key may be erased, the result multiplied by 10, and the preset identification value corresponding to the category of the key-value pair added. Where the key is a 64-bit integer, this marking of the key with the preset identification value can be implemented by the following code:

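#include <stdint.h>

// flag must be less than 10 so that it occupies only the lowest decimal digit
uint64_t encode_key(uint64_t raw_key, int flag)
{
    // shift the key one decimal place to the left and store the flag in the freed digit
    return raw_key * 10 + flag;
}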
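As a hedged illustration of how such a marked key can be read back, the following Python sketch mirrors the multiply-by-10-and-add step of the listing above and adds the inverse operation. The name `decode_key` and the explicit range checks are assumptions of this sketch, not part of the patent. The erasure of the most significant digit is not shown; the sketch instead rejects keys that would no longer fit in 64 bits.

```python
UINT64_MAX = 2**64 - 1


def encode_key(raw_key: int, flag: int) -> int:
    """Append flag as the lowest decimal digit of raw_key."""
    if not 0 <= flag < 10:
        raise ValueError("flag must be a single decimal digit")
    encoded = raw_key * 10 + flag
    if raw_key < 0 or encoded > UINT64_MAX:
        # Assumption of this sketch: reject rather than truncate.
        raise OverflowError("encoded key does not fit in 64 bits")
    return encoded


def decode_key(encoded_key: int):
    """Split an encoded key into (shifted key, flag); hypothetical helper."""
    return divmod(encoded_key, 10)
```

For instance, `encode_key(12345, 1)` yields `123451`, and `decode_key(123451)` yields `(12345, 1)`, so the category flag can be recovered from the key alone at output time.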

Step S4.3: outputting the final result data according to the mark of each key-value pair in the final result data. Following the above example, step S4.3 may output the final result data according to the preset identification value in the key of each key-value pair of the final result data. Specifically, a corresponding output path may be established for each key-value pair in the final result data according to the preset identification value in its key, and the key-value pair may be output along the established output path. In this embodiment, step S4.3 may inherit the MultipleSequenceFileOutputFormat class existing in the map/reduce model to generate a new class, which is named the MultiplePathSeqOutputFormat class in this embodiment. Establishing the corresponding output path for each key-value pair according to the preset identification value in its key can be realized through the MultiplePathSeqOutputFormat class.

Specifically, step S4.3 may generate the MultiplePathSeqOutputFormat class by inheriting the MultipleSequenceFileOutputFormat class through the following Java code:

Here, outputName is the output path established for the key-value pair in the final result data. After the MultiplePathSeqOutputFormat class establishes a corresponding output path for a key-value pair in the final result data, the key-value pair can be output along the established output path through the following code:

// forward (fwd) data, flag: 0
out.write_long(encode_key(docid, 0));
out.write_string(fsort_data.fwd().data(), fsort_data.fwd().size());

// inverted (invt) data, flag: 1
out.write_long(encode_key(docid, 1));
out.write_string(fsort_data.invt().data(), fsort_data.invt().size());
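The routing idea of step S4.3 can be sketched in Python as follows: the lowest decimal digit of each key selects the output path. The table `FLAG_TO_DIR`, the directory names, the default leaf file name, and the functions `output_name` and `route` are hypothetical and only illustrate the principle; they are not the MultiplePathSeqOutputFormat implementation.

```python
from collections import defaultdict

# Hypothetical mapping from preset identification value to output directory.
FLAG_TO_DIR = {0: "fwd", 1: "invt"}


def output_name(encoded_key: int, leaf_name: str) -> str:
    """Build the output path for a key from its lowest decimal digit."""
    flag = encoded_key % 10
    return f"{FLAG_TO_DIR[flag]}/{leaf_name}"


def route(records, leaf_name: str = "part-00000"):
    """Group (encoded_key, value) pairs by the output path of their key."""
    outputs = defaultdict(list)
    for encoded_key, value in records:
        outputs[output_name(encoded_key, leaf_name)].append((encoded_key, value))
    return dict(outputs)
```

With this arrangement, records of several categories produced by a single reduce pass land in separate paths, which is the effect the embodiment attributes to running the map/reduce flow only once.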

In summary, compared with the data processing method of the first embodiment, the data processing method of this embodiment, for final result data whose keys are integers, marks each key-value pair in the final result data according to its category and then outputs the final result data according to those marks. The final result data can thus be output in shards to a plurality of specified output paths by running the map/reduce processing flow only once, which shortens the running time of processing data with the map/reduce model and makes the algorithm simpler.

Fourth embodiment

Referring to fig. 5, an embodiment of the invention provides a data processing apparatus 100, which includes an obtaining module 101, a first processing module 102, a second processing module 103, and an output module 104. It will be appreciated that the modules described above refer to computer programs or program segments for performing a certain function or functions. In addition, the distinction between the above-described modules does not mean that the actual program code must also be separated.

The obtaining module 101 is configured to receive a map/reduce calculation request and obtain to-be-processed data based on key-value pairs, where the data type of the keys in the to-be-processed data is an integer. The obtaining module 101 may set the index identifier in the to-be-processed data as the key of each key-value pair in the to-be-processed data, where the data type of the index identifier is an integer. The obtaining module 101 may also specify the data storage format as typebytes. Specifically, the obtaining module 101 may obtain the to-be-processed data from the specified input path according to the map/reduce calculation request.

The first processing module 102 is configured to invoke a mapper to perform mapping processing on the data to be processed, so as to obtain intermediate result data. The first processing module 102 may invoke a corresponding mapper to perform mapping processing on the data to be processed according to the input path of the data to be processed.

The second processing module 103 is configured to invoke the corresponding reducers to perform reduction processing on the intermediate result data according to the key of each key-value pair in the intermediate result data and the preset number of reducers, so as to obtain final result data. Specifically, the second processing module 103 may take the key of each key-value pair in the intermediate result data modulo the preset number of reducers in turn, and then submit the key-value pairs whose keys yield the same modulus value to the same reducer among the preset number of reducers for reduction, so that each reducer processes the key-value pairs in the intermediate result data whose keys share one modulus value.
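The modulo-based assignment described above can be sketched in Python as follows. The function names `reducer_index` and `partition` are illustrative assumptions, and the in-memory lists stand in for the reducers' input queues.

```python
def reducer_index(key: int, num_reducers: int) -> int:
    """Return the index of the reducer responsible for an integer key."""
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    return key % num_reducers


def partition(pairs, num_reducers: int):
    """Distribute (key, value) pairs so that equal modulus values share a reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[reducer_index(key, num_reducers)].append((key, value))
    return buckets
```

Because the key is already an integer, no hashing of the key is needed before the modulo step, and keys that are spread evenly over the integers are spread evenly over the reducers.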

The output module 104 is configured to output the final result data.

For the specific working processes of the above modules, reference may be further made to the data processing methods provided in the first embodiment and the second embodiment of the present invention, and details are not described here.

In summary, in the data processing apparatus provided in this embodiment, for key-value pairs whose keys are integers, after the mapper is invoked to map the data to be processed and obtain the intermediate result data, the corresponding reducers are invoked to reduce the intermediate result data according to the key of each key-value pair in the intermediate result data and the preset number of reducers. The intermediate result data can therefore be distributed more evenly among the preset number of reducers, and the algorithm is simpler, which improves the operating efficiency of the overall flow when data is processed with the map/reduce model.

In addition, when a mapping/simplification calculation request is received, the data storage format is specified as typebytes, so that the algorithm is simpler, the data storage capacity of each link of the whole process can be reduced, and the operation efficiency of the whole process is improved.

Fifth embodiment

Referring to fig. 6, a fifth embodiment of the present invention provides a data processing apparatus 200, wherein compared to the data processing apparatus 100 provided in the fourth embodiment, the output module 104 further includes:

The identifying sub-module 1041 is configured to identify the category of each key-value pair in the final result data. The identifying sub-module 1041 may identify the category of each key-value pair in the final result data according to a specified field of the key-value pair.

The labeling sub-module 1042 is configured to label each key-value pair in the final result data according to the category of the key-value pair. The labeling sub-module 1042 may label the key in each key-value pair in the final result data according to the category of the key-value pair. In one example, the labeling sub-module 1042 may process the key in each key-value pair of the final result data in turn according to the category of the key-value pair as follows: erase the most significant digit of the key, then append a preset identification value corresponding to the category of the key-value pair as the least significant digit of the key.

The output sub-module 1043 is configured to output the final result data according to the label of each key-value pair in the final result data. Following the above example, the output sub-module 1043 may output the final result data according to the preset identification value in the key of each key-value pair of the final result data. Specifically, the output sub-module 1043 may establish a corresponding output path for each labeled key-value pair in the final result data, and output the key-value pair along the established output path.

For the specific working processes of the above modules, reference may be further made to the data processing method provided in the third embodiment of the present invention, which is not described herein again.

In summary, in the data processing apparatus provided in this embodiment, for the final result data in which the data type of the key is an integer, each key value pair in the final result data is marked according to the type of the key value pair, and then the final result data is output according to the mark of each key value pair, and the final result data is output to a plurality of specified output paths in a fragmented manner only by running the mapping/reduction process once, so that the running time for processing the data using the mapping/reduction model can be shortened, and the algorithm is simpler.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, in which computer-executable instructions are stored, where the computer-readable storage medium is, for example, a non-volatile memory such as an optical disc, a hard disc, or a flash memory. The computer-executable instructions are used for making a computer or a similar operation device perform various operations in the data processing method.

Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method of data processing, the method comprising the steps of:

receiving a mapping/simplifying calculation request, and acquiring data to be processed based on a key value pair, wherein the data type of a key in the data to be processed is an integer;

evenly splitting the data to be processed into a first preset number of data blocks to be processed based on the first preset number of mappers in the map/reduce model, wherein the mappers correspond one-to-one to the data blocks to be processed;

calling the mapper to identify the input path of the corresponding data block to be processed, and mapping the corresponding data block to be processed based on the identified input path to obtain intermediate result data;

calling corresponding reducers according to the keys of the key value pairs in the intermediate result data and the second preset number of the reducers to carry out reduction processing on the intermediate result data to obtain final result data;

identifying the category of each key value pair in the final result data;

marking each key value pair in the final result data according to the category of the key value pair;

and establishing a corresponding output path according to each key value pair marked in the final result data, and outputting the corresponding key value pair according to the established output path.

2. The data processing method of claim 1, wherein the step of obtaining key-value pair-based data to be processed comprises:

and setting the index identifier in the data to be processed as a key of each key value pair in the data to be processed, wherein the data type of the index identifier is an integer.

3. The data processing method of claim 1, wherein the step of obtaining key-value pair-based data to be processed comprises:

and acquiring the data to be processed from the specified input path according to the mapping/simplifying calculation request.

4. The data processing method of claim 1, wherein the step of invoking corresponding reducers to reduce the intermediate result data according to the keys of the key value pairs in the intermediate result data and the second preset number of reducers comprises:

taking the key of each key-value pair in the intermediate result data modulo the second preset number of reducers in turn;

and submitting the key-value pairs in the intermediate result data whose keys yield the same modulus value after the modulo operation to the same reducer among the second preset number of reducers for reduction, wherein each reducer among the second preset number of reducers processes the key-value pairs in the intermediate result data whose keys share one modulus value.

5. The data processing method of claim 1, wherein the step of identifying the category of each key-value pair in the final result data comprises:

and identifying the category of each key value pair in the final result according to the appointed field of the key value pair.

6. The data processing method of claim 1, wherein the step of labeling each key-value pair in the final result data according to the category of the key-value pair comprises:

and marking the keys in each key value pair in the final result data according to the category of the key value pair.

7. The data processing method of claim 6, wherein the step of marking the keys in each key value pair in the final result data according to the category of the key value pair comprises:

and sequentially processing the keys in each key value pair of the final result data according to the category of the key value pair as follows:

erasing the highest bit of the key, and then taking a preset identification value corresponding to the category of the key value pair to which the key belongs as the lowest bit of the key;

the step of outputting the final result data according to the mark of each key value pair in the final result data comprises the following steps:

and outputting the final result data according to the preset identification value in the key of each key-value pair of the final result data.

8. A data processing apparatus, characterized in that the apparatus comprises:

an obtaining module, configured to receive a map/reduce calculation request and obtain data to be processed based on key-value pairs, wherein the data type of the keys in the data to be processed is an integer;

a first processing module, configured to evenly split the data to be processed into a first preset number of data blocks to be processed based on the first preset number of mappers in the map/reduce model, wherein the mappers correspond one-to-one to the data blocks to be processed;

the first processing module is further configured to invoke the mapper to identify an input path of the corresponding data block to be processed, and perform mapping processing on the corresponding data block to be processed based on the identified input path to obtain intermediate result data;

the second processing module is used for calling corresponding reducers to carry out reduction processing on the intermediate result data according to the keys of the key value pairs in the intermediate result data and a second preset number of the reducers to obtain final result data; and

the identification submodule is used for identifying the category of each key value pair in the final result data;

the marking submodule is used for marking each key value pair in the final result data according to the category of the key value pair;

and the output sub-module is used for establishing a corresponding output path according to each key value pair marked in the final result data and outputting the corresponding key value pair according to the established output path.

9. The data processing apparatus of claim 8, wherein the obtaining the to-be-processed data based on the key-value pair comprises:

10. The data processing apparatus of claim 8, wherein the obtaining the to-be-processed data based on the key-value pair comprises:

11. The data processing apparatus of claim 8, wherein the invoking the corresponding reducers to reduce the intermediate result data according to the keys of the key-value pairs in the intermediate result data and the second preset number of reducers comprises:

12. The data processing apparatus of claim 8, wherein the identification submodule is to:

13. The data processing apparatus of claim 8, wherein the tagging submodule is to:

14. The data processing apparatus of claim 13, wherein said marking the keys in each key-value pair in the final result data according to the category of the key-value pair comprises:

the output submodule is used for:

15. A storage medium storing executable instructions for implementing a data processing method as claimed in any one of claims 1 to 7 when executed.