CN104424220A - Data processing method and equipment

Info

Publication number: CN104424220A
Application number: CN201310373788.6A
Authority: CN
Inventors: 黄晓锋
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba China Network Technology Co Ltd
Priority date: 2013-08-23
Filing date: 2013-08-23
Publication date: 2015-03-18
Anticipated expiration: 2033-08-23
Also published as: CN104424220B

Abstract

The invention discloses a data processing method and equipment. The method comprises the steps: acquiring dimensional data of at least one to-be-processed dimension from a to-be-processed data record; with respect to each to-be-processed dimension and according to the dimensional data of the to-be-processed dimension, choosing a data processing unit for processing the to-be-processed data record from a plurality of preset data processing units corresponding to the to-be-processed dimension; distributing the to-be-processed record to the chosen data processing unit; and processing the dimensional data of the to-be-processed dimension of the to-be-processed data record by the chosen data processing unit. With adoption of the scheme, the data processing efficiency is improved.

Description

Data processing method and device

Technical field

The present application relates to the field of data processing within computer technology, and in particular to a data processing method and device.

Background

At present, in practical applications of computer and Internet technology, it is frequently necessary to perform statistics, aggregate computation, analysis and similar processing on large amounts of data, for example summing data, deduplicating data, and finding the maximum or minimum value of the data.

In the prior art, when stream data is processed, a data source sends data records in batches to a data processing device through message middleware. The data processing device processes the dimension data of the pending dimension of the data records and obtains a processing result for that batch of data records. Further, the multiple processing results obtained by processing multiple batches of data records may be comprehensively accumulated, and the data records and the final result are stored in a database.

In the above prior-art scheme, the data processing device processes data records serially: it must wait until one data record has been processed before processing the next. Moreover, for a batch of data records it can process the dimension data of only one dimension at a time; when multiple data dimensions need to be processed, they can only be processed one after another. As a result, the efficiency of data processing is low.

Summary of the invention

In view of this, the embodiments of the present application provide a data processing method and device, to solve the problem of low data processing efficiency in the prior art.

The embodiments of the present application are implemented through the following technical solutions:

An embodiment of the present application provides a data processing method, comprising:

Obtaining the dimension data of at least one pending dimension of a pending data record;

For each pending dimension, selecting, according to the dimension data of the pending dimension, the data processing unit that will process the pending data record from multiple preset data processing units corresponding to the pending dimension;

Distributing the pending data record to the selected data processing unit;

Processing, by the selected data processing unit, the dimension data of the pending dimension of the pending data record.

In the data processing method provided by the embodiment of the present application, corresponding data processing units are preset for the different dimensions of a data record, so that the dimension data of different dimensions can be processed in parallel by the data processing units corresponding to each dimension. In addition, multiple data processing units are set for each dimension, so that the dimension data of that dimension from multiple pending data records can also be processed in parallel. The efficiency of data processing is thereby improved.

Further, selecting, according to the dimension data of the pending dimension, the data processing unit that will process the pending data record from the multiple preset data processing units corresponding to the pending dimension specifically comprises:

Determining the hash code of the dimension data of the pending dimension;

Taking the remainder of the hash code of the dimension data divided by the number of data processing units corresponding to the pending dimension, to obtain a remainder value;

Selecting, from the multiple data processing units, the data processing unit whose unit ID equals the remainder value as the data processing unit that will process the pending data record.

In this way, the data processing unit that will process the pending data record can be selected accurately from the multiple data processing units according to the hash code of the dimension data of the pending dimension.

Further, processing, by the selected data processing unit, the dimension data of the pending dimension of the pending data record specifically comprises:

Determining, by the selected data processing unit, the hash code of the unique identification data of the pending data record;

According to the last predetermined number of digits of the hash code of the unique identification data, determining, from multiple data sets that store received data records, the data set corresponding to those digits as the data set to be queried, wherein the data records stored in any one of the multiple data sets share the same last predetermined number of digits in the hash codes of their unique identification data, and data records stored in different data sets differ in those digits;

When the pending data record does not exist in the determined data set to be queried, processing the dimension data of the pending dimension of the pending data record.

In this way, when the dimension data of a pending data record is processed, deduplication is first performed against the multiple data sets that store received data records. The deduplication no longer needs to search all received data records; it only needs to search one of the multiple data sets. This reduces the amount of computation required for deduplication and further improves the efficiency of data processing.

Further, the above data processing method also comprises:

According to the timestamps of the data records stored in the multiple data sets, discarding the data records in the multiple data sets that satisfy a preset discard condition, wherein the timestamp of a data record is the time at which the data record was saved to its data set.

In this way, the storage space of the data sets can be saved and the number of data records stored in the data sets can be reduced, which shortens the query time during deduplication and improves query efficiency.

Further, the above data processing method also comprises:

For the pending dimension, comprehensively accumulating the processing results that the multiple data processing units respectively obtain by processing the dimension data of the pending dimension of the data records each of them has received.

An embodiment of the present application also provides a data processing device, comprising:

An acquiring unit, configured to obtain the dimension data of at least one pending dimension of a pending data record;

A selection unit, configured to select, for each pending dimension and according to the dimension data of the pending dimension, the data processing unit that will process the pending data record from multiple preset data processing units corresponding to the pending dimension;

A distribution unit, configured to distribute the pending data record to the selected data processing unit;

A data processing unit, configured to process the dimension data of the pending dimension of the pending data record distributed to it.

In the data processing device provided by the embodiment of the present application, corresponding data processing units are preset for the different dimensions of a data record, so that the dimension data of different dimensions can be processed in parallel by the data processing units corresponding to each dimension. In addition, multiple data processing units are set for each dimension, so that the dimension data of that dimension from multiple pending data records can also be processed in parallel. The efficiency of data processing is thereby improved.

Further, the selection unit is specifically configured to determine the hash code of the dimension data of the pending dimension; to take the remainder of the hash code of the dimension data divided by the number of data processing units corresponding to the pending dimension, obtaining a remainder value; and to select, from the multiple data processing units, the data processing unit whose unit ID equals the remainder value as the data processing unit that will process the pending data record.

Further, the data processing unit is specifically configured to determine the hash code of the unique identification data of the pending data record; to determine, according to the last predetermined number of digits of the hash code of the unique identification data and from multiple data sets that store received data records, the data set corresponding to those digits as the data set to be queried, wherein the data records stored in any one of the multiple data sets share the same last predetermined number of digits in the hash codes of their unique identification data, and data records stored in different data sets differ in those digits; and, when the pending data record does not exist in the determined data set to be queried, to process the dimension data of the pending dimension of the pending data record.

Further, the above data processing device also comprises:

A discarding unit, configured to discard, according to the timestamps of the data records stored in the multiple data sets, the data records in the multiple data sets that satisfy a preset discard condition, wherein the timestamp of a data record is the time at which the data record was saved to its data set.

Further, the above data processing device also comprises:

A comprehensive accumulation unit, configured to comprehensively accumulate, for the pending dimension, the processing results that the multiple data processing units respectively obtain by processing the dimension data of the pending dimension of the data records each of them has received.

Other features and advantages of the present application will be set forth in the following description, and will in part become apparent from the specification or be understood by implementing the application. The objectives and other advantages of the application can be realized and obtained through the structures particularly pointed out in the written specification, the claims and the accompanying drawings.

Brief description of the drawings

The accompanying drawings provide a further understanding of the present application and form part of the specification. Together with the embodiments of the application they serve to explain the application and do not limit it. In the drawings:

Fig. 1 is a flowchart of the data processing method provided by an embodiment of the present application;

Fig. 2 is a flowchart of selecting the data processing unit that will process a pending data record, in the data processing method provided by an embodiment of the present application;

Fig. 3 is a flowchart of a data processing unit processing the dimension data of a pending data record, in the data processing method provided by an embodiment of the present application;

Fig. 4 is a structural diagram of the data processing device provided by an embodiment of the present application.

Detailed description of the embodiments

In order to provide an implementation that improves the efficiency of data processing, the embodiments of the present application provide a data processing method and device. The technical solution can be applied to the processing of data and can be implemented either as a method or as a device. Preferred embodiments of the application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the application and are not intended to limit it. Moreover, where no conflict arises, the embodiments of the application and the features of the embodiments may be combined with one another.

An embodiment of the present application provides a data processing method which, as shown in Fig. 1, comprises:

Step 101: obtain the dimension data of at least one pending dimension of a pending data record.

Step 102: for each pending dimension, select, according to the dimension data of the pending dimension, the data processing unit that will process the pending data record from multiple preset data processing units corresponding to the pending dimension.

Step 103: distribute the pending data record to the selected data processing unit.

Step 104: the selected data processing unit processes the dimension data of the pending dimension of the pending data record.
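Steps 101 to 104 can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the application: the names `make_inboxes`, `dispatch` and `process_sum` are hypothetical, each data processing unit is modelled as a plain list acting as an inbox, the selection rule is passed in as a parameter, and summation is only one example of the processing in step 104.

```python
def make_inboxes(units_per_dimension):
    """Preset a group of processing units (modelled as inbox lists) per dimension."""
    return {dim: [[] for _ in range(n)] for dim, n in units_per_dimension.items()}


def dispatch(record, inboxes, select_unit):
    """Route one record to one unit of every pending dimension.

    select_unit(value, unit_count) is a caller-supplied (assumed) selection rule
    that returns a unit ID in the range 0 .. unit_count - 1.
    """
    routed = {}
    for dim, units in inboxes.items():
        value = record[dim]                       # step 101: get the dimension data
        unit_id = select_unit(value, len(units))  # step 102: choose a unit
        units[unit_id].append(record)             # step 103: distribute the record
        routed[dim] = unit_id
    return routed


def process_sum(inboxes, dim):
    """Step 104 for a summation task: every unit sums its own inbox."""
    return [sum(r[dim] for r in inbox) for inbox in inboxes[dim]]
```

In a real deployment the units of different dimensions would run concurrently; the sketch keeps them sequential so that the routing logic stays visible.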

The method and device provided by the application are described in detail below through specific embodiments, with reference to the accompanying drawings.

In the embodiments of the present application, the pending data records obtained in step 101 may be transmitted continuously to the data processing device in the form of stream data. The pending data records may be data records of various types, for example data records related to Internet technology, such as the transaction data records of an e-commerce website.

The pending dimensions may be set in advance according to the actual needs of the data processing, and more than one may be set, so that data records can subsequently be processed in parallel for the different pending dimensions, improving data processing efficiency. A pending dimension may be any data dimension of the data record. For a transaction data record, for example, the pending dimension may be a buyer payment amount dimension, whose dimension data is the amount the buyer pays when purchasing goods in the transaction data record; or a seller received amount dimension, whose dimension data is the amount the seller receives when selling goods in the transaction data record; or a postage amount dimension, whose dimension data is the postage the buyer has to pay when the seller ships the goods to the buyer in the transaction data record.

Further, in order to reduce the amount of computation when the data records are subsequently processed, the original data records received as stream data may be preprocessed before step 101, filtering out the data needed for the subsequent processing to obtain the pending data records.

In the embodiments of the present application, multiple corresponding data processing units are set in advance for each data dimension, so that the dimension data of that pending dimension from multiple pending data records can be processed in parallel at the same time, improving processing efficiency. Further, a unit ID can be set for each data processing unit; the unit IDs can be integers ranging from 0 up to the number of data processing units (for N units, the remainders used for selection below take the values 0 to N-1).

Accordingly, in step 102, selecting the data processing unit that will process the pending data record from the multiple preset data processing units corresponding to a pending dimension, according to the dimension data of that pending dimension, may specifically comprise the following, as shown in Fig. 2:

Step 201: determine the hash code of the obtained dimension data of the pending dimension.

Step 202: take the remainder of the hash code of the dimension data divided by the number of data processing units corresponding to the pending dimension, obtaining a remainder value.

Step 203: from the multiple data processing units, select the data processing unit whose unit ID equals the remainder value as the data processing unit that will process the pending data record.
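The selection in steps 201 to 203 amounts to a hash followed by a modulo operation. The sketch below is illustrative only: the application does not name a hash function, so CRC32 from Python's `zlib` module is used here as an assumption (it is stable across runs and non-negative), and the function name `select_unit_id` is hypothetical.

```python
import zlib


def select_unit_id(dimension_value, unit_count):
    """Return the ID of the unit that should receive a record.

    Assumption: the hash code is the CRC32 of the value's text form. Any
    deterministic hash would serve; it only has to map equal dimension data
    to the same code every time.
    """
    hash_code = zlib.crc32(str(dimension_value).encode("utf-8"))  # step 201
    remainder = hash_code % unit_count                            # step 202
    return remainder                                              # step 203: unit ID == remainder
```

Because the result depends only on the dimension data, records carrying the same dimension value always reach the same unit. If a signed hash code were used instead (as in some languages), the remainder would need to be normalised to a non-negative value before being used as a unit ID.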

In the embodiments of the present application, other approaches similar to the selection shown in Fig. 2 may also be used to select, according to the obtained dimension data, the data processing unit that will process the pending data record from the multiple data processing units; they are not described in detail here.

In the method provided by the embodiments of the present application, after the pending data record has been distributed to the selected data processing unit, the selected data processing unit can process the dimension data of the pending dimension of the pending data record in step 104. As shown in Fig. 3, this may specifically comprise the following processing steps:

Step 301: the selected data processing unit determines the hash code of the unique identification data of the pending data record.

The unique identification data can be used to distinguish different pending data records. For transaction data records, for example, the unique identification data can be the transaction order number.

Step 302: according to the last predetermined number of digits of the hash code of the unique identification data, determine, from multiple data sets that store received data records, the data set corresponding to those digits as the data set to be queried.

The data records stored in any one of the multiple data sets share the same last predetermined number of digits in the hash codes of their unique identification data, and data records stored in different data sets differ in those digits.

In the embodiments of the present application, after receiving a pending data record, the data processing unit can save it to a data set. When saving, data records whose hash codes of the unique identification data have the same last predetermined number of digits are saved to the same data set, so that newly received pending data records can subsequently be deduplicated against the data records stored in the data sets.

The predetermined number can be set flexibly according to actual needs; for example, it can be set according to the total number of digits of the hash code of the unique identification data.

Step 303: query whether the pending data record exists in the data set to be queried. When the pending data record does not exist in the data set to be queried, process the dimension data of the pending dimension of the pending data record. When the pending data record does exist in the data set to be queried, the pending data record has already been received and does not need to be processed again, so the processing of the pending data record is cancelled; further, the pending data record can be discarded.
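The partitioned deduplication of steps 301 to 303 can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the "last predetermined number of digits" is interpreted here as the low-order bits of a CRC32 hash code, the class name `DedupUnit` and the constant `SUFFIX_BITS` are hypothetical, and summation stands in for whatever processing the unit performs.

```python
import zlib

SUFFIX_BITS = 4  # assumed "predetermined number": 4 low bits give 16 data sets


class DedupUnit:
    """One data processing unit that sums a dimension and skips repeated records."""

    def __init__(self, suffix_bits=SUFFIX_BITS):
        self.mask = (1 << suffix_bits) - 1
        # One data set per possible hash suffix; each stores received record IDs.
        self.data_sets = [set() for _ in range(1 << suffix_bits)]
        self.total = 0

    def _data_set_for(self, record_id):
        hash_code = zlib.crc32(str(record_id).encode("utf-8"))  # step 301
        return self.data_sets[hash_code & self.mask]            # step 302

    def handle(self, record_id, amount):
        """Process the record unless it was received before (step 303)."""
        data_set = self._data_set_for(record_id)
        if record_id in data_set:
            return False          # duplicate: cancel processing
        data_set.add(record_id)   # remember the record for later lookups
        self.total += amount      # the actual processing, here a running sum
        return True
```

Only one of the data sets is consulted per record, which is the saving the text describes. With in-memory Python sets the gain is small, since a set is already hashed; the partitioning matters more when each data set is a list, a file or an external store that would otherwise have to be searched in full.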

In the method provided by the embodiments of the present application, further, after the multiple data processing units corresponding to the pending dimension have each processed the dimension data of the pending dimension of the data records they received and obtained corresponding processing results, these processing results can be comprehensively accumulated for the pending dimension. For example, if the processing is data summation, the processing results can be added together; if the processing finds the maximum value of the data, the maximum can be taken over these processing results.
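A minimal sketch of this comprehensive accumulation, assuming each unit has already produced one partial result for the dimension. The function name `accumulate` and the operation labels are hypothetical; the minimum operation is included because the background section lists it among typical processing tasks.

```python
from functools import reduce

# Assumed mapping from the kind of processing to the rule that merges two
# partial results into one.
COMBINERS = {
    "sum": lambda a, b: a + b,
    "max": max,
    "min": min,
}


def accumulate(partial_results, operation):
    """Merge the per-unit results of one dimension into a final result."""
    return reduce(COMBINERS[operation], partial_results)
```

This works because summation, maximum and minimum are associative: combining partial results per unit gives the same answer as processing all records in one place. The sketch expects at least one partial result.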

After the final processing results of the comprehensive accumulation have been obtained, the final processing result corresponding to each dimension can also be output to a preset storage system and saved there.

In the method provided by the embodiments of the present application, further, when a data record is saved to a data set, the time at which it is saved can be recorded as a timestamp. The data records in the multiple data sets that satisfy a preset discard condition can then be discarded according to these timestamps. For example, data records that have been stored for longer than a preset duration can be discarded, or data records whose timestamp is earlier than a preset time can be discarded. This saves the storage space of the data sets and reduces the number of data records stored in them, which shortens the query time during deduplication and improves query efficiency.
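The first discard condition mentioned above (stored longer than a preset duration) can be sketched as follows. The representation is an assumption made for illustration: each data set is a dictionary mapping a record's unique identifier to the time it was saved, and the names `discard_expired` and `max_age_seconds` are hypothetical.

```python
import time


def discard_expired(data_sets, max_age_seconds, now=None):
    """Drop records stored for longer than max_age_seconds.

    data_sets: list of dicts, each mapping record ID -> save timestamp (seconds).
    Returns the number of discarded records.
    """
    now = time.time() if now is None else now
    removed = 0
    for data_set in data_sets:
        # Collect the IDs first so the dict is not modified while iterating.
        expired = [rid for rid, saved_at in data_set.items()
                   if now - saved_at > max_age_seconds]
        for rid in expired:
            del data_set[rid]
        removed += len(expired)
    return removed
```

The second condition in the text (timestamp earlier than a preset time) follows the same pattern with the comparison `saved_at < cutoff`. A record discarded this way can no longer be recognised as a duplicate, so the retention period would need to cover the window in which repeated deliveries are expected.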

Based on the same inventive concept, and corresponding to the data processing method provided by the above embodiments, another embodiment of the present application provides a data processing device. Its structure is shown in Fig. 4, and it specifically comprises:

An acquiring unit 401, configured to obtain the dimension data of at least one pending dimension of a pending data record;

A selection unit 402, configured to select, for each pending dimension and according to the dimension data of the pending dimension, the data processing unit that will process the pending data record from multiple preset data processing units corresponding to the pending dimension;

A distribution unit 403, configured to distribute the pending data record to the selected data processing unit;

A data processing unit 404, configured to process the dimension data of the pending dimension of the pending data record distributed to it.

Further, the selection unit 402 is specifically configured to determine the hash code of the dimension data of the pending dimension; to take the remainder of the hash code of the dimension data divided by the number of data processing units corresponding to the pending dimension, obtaining a remainder value; and to select, from the multiple data processing units, the data processing unit whose unit ID equals the remainder value as the data processing unit that will process the pending data record.

Further, the data processing unit 404 is specifically configured to determine the hash code of the unique identification data of the pending data record; to determine, according to the last predetermined number of digits of the hash code of the unique identification data and from multiple data sets that store received data records, the data set corresponding to those digits as the data set to be queried, wherein the data records stored in any one of the multiple data sets share the same last predetermined number of digits in the hash codes of their unique identification data, and data records stored in different data sets differ in those digits; and, when the pending data record does not exist in the determined data set to be queried, to process the dimension data of the pending dimension of the pending data record.

Further, the above data processing device also comprises:

A discarding unit 405, configured to discard, according to the timestamps of the data records stored in the multiple data sets, the data records in the multiple data sets that satisfy a preset discard condition, wherein the timestamp of a data record is the time at which the data record was saved to its data set.

Further, the above data processing device also comprises:

A comprehensive accumulation unit 406, configured to comprehensively accumulate, for the pending dimension, the processing results that the multiple data processing units respectively obtain by processing the dimension data of the pending dimension of the data records each of them has received.

The functions of the above units correspond to the respective processing steps in the flows shown in Fig. 1 to Fig. 3 and are not repeated here.

In summary, the scheme provided by the embodiments of the present application comprises: obtaining the dimension data of at least one pending dimension of a pending data record; for each pending dimension, selecting, according to the dimension data of the pending dimension, the data processing unit that will process the pending data record from multiple preset data processing units corresponding to the pending dimension; distributing the pending data record to the selected data processing unit; and processing, by the selected data processing unit, the dimension data of the pending dimension of the pending data record. Adopting the scheme provided by the embodiments of the present application improves the efficiency of data processing.

The data processing device provided by the embodiments of the application can be implemented by a computer program. Those skilled in the art should understand that the above module division is only one of many possible module divisions; a division into other modules, or no division into modules at all, falls within the scope of protection of the application as long as the data processing device has the above functions.

The application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, the computer device comprises one or more processors (CPUs), input/output interfaces, network interfaces and memory. The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

Obviously, those skilled in the art can make various changes and modifications to the application without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the application and their technical equivalents, the application is also intended to include them.

Claims

1. A data processing method, characterized by comprising:

2. The method of claim 1, characterized in that selecting, according to the dimension data of the pending dimension, the data processing unit that will process the pending data record from the multiple preset data processing units corresponding to the pending dimension specifically comprises:

Determining the hash code of the dimension data of the pending dimension;

3. The method of claim 1, characterized in that processing, by the selected data processing unit, the dimension data of the pending dimension of the pending data record specifically comprises:

4. The method of claim 3, characterized by further comprising:

5. The method of any one of claims 1 to 4, characterized by further comprising:

6. A data processing device, characterized by comprising:

7. The device of claim 6, characterized in that the selection unit is specifically configured to determine the hash code of the dimension data of the pending dimension; to take the remainder of the hash code of the dimension data divided by the number of data processing units corresponding to the pending dimension, obtaining a remainder value; and to select, from the multiple data processing units, the data processing unit whose unit ID equals the remainder value as the data processing unit that will process the pending data record.

8. The device of claim 6, characterized in that the data processing unit is specifically configured to determine the hash code of the unique identification data of the pending data record; to determine, according to the last predetermined number of digits of the hash code of the unique identification data and from multiple data sets that store received data records, the data set corresponding to those digits as the data set to be queried, wherein the data records stored in any one of the multiple data sets share the same last predetermined number of digits in the hash codes of their unique identification data, and data records stored in different data sets differ in those digits; and, when the pending data record does not exist in the determined data set to be queried, to process the dimension data of the pending dimension of the pending data record.

9. The device of claim 8, characterized by further comprising:

10. The device of any one of claims 6 to 9, characterized by further comprising: