CN104252486B - Method and device for data processing - Google Patents

Publication number
CN104252486B
Authority
CN
China
Prior art keywords
data
key
index
mapping
deleted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310268334.2A
Other languages
Chinese (zh)
Other versions
CN104252486A (en
Inventor
王立
刘立川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201310268334.2A priority Critical patent/CN104252486B/en
Publication of CN104252486A publication Critical patent/CN104252486A/en
Application granted granted Critical
Publication of CN104252486B publication Critical patent/CN104252486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G06F16/2386 Bulk updating operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and device for data processing. The method includes: comparing the first input data of the previous processing run with the second input data to be processed in the current run, to obtain deleted data and newly added data; performing first processing on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, together with a deleted mapping data set and a newly added mapping data set; deleting from a first mapping data set the mapping data corresponding to the deleted mapping data, and adding the newly added mapping data to the first mapping data set, to form a second mapping data set; performing second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain changed output data; and replacing the to-be-replaced output data in the first output data with the changed output data, to obtain second output data. The application avoids reprocessing unchanged input data.

Description

Method and device for data processing
Technical Field
The present application relates to the field of computer technology, and in particular to a method and device for data processing.
Background
In large-scale computing, cloud computing is attracting strong interest, and MapReduce, a core technology of cloud computing, has also received wide attention. A MapReduce system builds its basic computing unit from two simple concepts, Map (mapping) and Reduce (reduction). Users only need to write a Map function and a Reduce function to process massive data sets in parallel. Based on information such as the input data size and the job configuration, a MapReduce system automatically initializes a job into multiple identical Map tasks and Reduce tasks, each of which reads a different input data block and calls the Map function and the Reduce function to process it.
In current practice, a MapReduce data processing system is usually set to run on a schedule, for example once a day. The input of a MapReduce run is typically data accumulated over a period of time, for example the most recent 15 days. A characteristic of such workloads is that the input of the current run is largely identical to the input of the previous run: only part of the data is deleted in the current run, and/or some new data is added. Current applications ignore this property of the input data and run a complete MapReduce pass over all of the data. In fact, much of the data is unchanged between two adjacent MapReduce runs, so processing it again repeats earlier computation and wastes computing resources.
Summary of the Invention
To overcome the above drawback, the present application provides a method and device for data processing that avoid reprocessing unchanged data.
According to one aspect of the present application, a method of data processing is provided, including: comparing first input data of a previous processing run with second input data to be processed in the current run, to obtain changed data, the changed data including deleted data and newly added data of the second input data relative to the first input data; performing first processing on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set that correspond to the deleted-data key index set and the newly-added-data key index set, respectively; deleting, from a key-indexed first mapping data set, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and adding the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data, wherein the first mapping data set corresponds to the first input data and to key-indexed first output data; performing second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data; and replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain key-indexed second output data of the current run.
According to an embodiment of the present application, in the method, the newly added mapping data set, the first mapping data set, and the second mapping data set each include at least one data subset, wherein data corresponding to the same key is placed in one data subset indexed by that key.
According to an embodiment of the present application, in the method, the step of performing second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data, includes: determining, in the second mapping data set, the changed data subsets whose key indexes are the same as the key indexes in the deleted-data key index set and the newly-added-data key index set; and performing second processing on the changed data subsets to obtain the key-indexed changed output data.
According to an embodiment of the present application, in the method, the step of replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain key-indexed second output data of the current run, includes: searching the first output data for the to-be-replaced output data having the same key as each key in the changed output data; and replacing the to-be-replaced output data with the changed output data, and taking the first output data after replacement as the second output data of the current run.
According to an embodiment of the present application, in the method, the deleted data is data that appears in the first input data but not in the second input data; the newly added data is data that does not appear in the first input data but appears in the second input data.
According to an embodiment of the present application, in the method, the first processing includes: extracting key-value pairs from the data to be processed, to obtain a key index set and to form a key-indexed mapping data set. The first processing further includes: extracting a record marker of the data to be processed, the record marker including a file path and a line number. The second processing includes: processing the data to be processed according to a predetermined rule, to obtain key-indexed output data.
According to another aspect of the present application, a device for data processing is provided, including: a comparison module, for comparing first input data of a previous processing run with second input data to be processed in the current run, to obtain changed data, the changed data including deleted data and newly added data of the second input data relative to the first input data; a first processing module, for performing first processing on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set that correspond to the deleted-data key index set and the newly-added-data key index set, respectively; an intermediate processing module, for deleting, from a key-indexed first mapping data set, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and adding the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data, wherein the first mapping data set corresponds to the first input data and to key-indexed first output data; a second processing module, for performing second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data; and an output data acquisition module, for replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain key-indexed second output data of the current run.
According to an embodiment of the present application, in the device, the newly added mapping data set, the first mapping data set, and the second mapping data set each include at least one data subset, wherein data corresponding to the same key is placed in one data subset indexed by that key.
According to an embodiment of the present application, in the device, the second processing module includes: a determination submodule, for determining, in the second mapping data set, the changed data subsets whose key indexes are the same as the key indexes in the deleted-data key index set and the newly-added-data key index set; and a processing submodule, for performing second processing on the changed data subsets to obtain key-indexed changed output data.
According to an embodiment of the present application, in the device, the output data acquisition module includes: a search submodule, for searching the first output data for the to-be-replaced output data having the same key as each key in the changed output data; and a replacement submodule, for replacing the to-be-replaced output data with the changed output data, and taking the first output data after replacement as the second output data of the current run.
According to an embodiment of the present application, in the device, the deleted data is data that appears in the first input data but not in the second input data; the newly added data is data that does not appear in the first input data but appears in the second input data.
According to an embodiment of the present application, in the device, the first processing includes: extracting key-value pairs from the data to be processed, to obtain a key index set and to form a key-indexed mapping data set. The first processing further includes: extracting a record marker of the data to be processed, the record marker including a file path and a line number. The second processing includes: processing the data to be processed according to a predetermined rule, to obtain key-indexed output data.
Compared with the prior art, the technical solution of the present application avoids reprocessing unchanged input data, which shortens data processing time and saves data processing resources.
Brief Description of the Drawings
The accompanying drawings described here provide a further understanding of the present application and form part of it. The illustrative embodiments of the present application and their description are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of a data processing method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the data of the previous processing run and the data of the current run in a data processing method according to an embodiment of the present application;
Fig. 3 is a detailed flow chart of step S104 in Fig. 1 according to an embodiment of the present application;
Fig. 4 is a detailed flow chart of step S105 in Fig. 1 according to an embodiment of the present application; and
Fig. 5 is a block diagram of a data processing device according to an embodiment of the present application.
Detailed Description
The main idea of the present application is to compare the input data of the previous processing run with that of the current run to obtain the changed data, to process the changed input data into key-indexed changed output data, and to replace the corresponding output data of the previous run according to the key indexes of the changed output data, thereby obtaining the key-indexed output data of the current run.
To make the purpose, technical solution, and advantages of the present application clearer, the technical solution is described clearly and completely below with reference to specific embodiments of the application and the corresponding drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative work fall within the scope of protection of the application.
Referring to Fig. 1, Fig. 1 is a flow chart of a data processing method of the present application.
In step S101, the first input data of the previous processing run is compared with the second input data to be processed in the current run, to obtain changed data.
The changed data includes the deleted data and the newly added data of the second input data relative to the first input data. The deleted data is data that appears in the first input data but not in the second input data. The newly added data is data that does not appear in the first input data but appears in the second input data.
In practice, data processing is generally run at a set time interval or frequency, for example once a day, and the data processed is generally data accumulated over a period of time, for example the most recent 15 days. The previous run and the current run in the present application are two processing runs carried out one after the other. Each run is the current run while it executes; the run carried out immediately before it is its previous run, and it is in turn the previous run of the run that follows it. As shown in Fig. 2, for example, if the data accumulated over 5 days is processed every day, the data 210 processed yesterday is the first input data of the previous run, and the data 220 to be processed today is the second input data of the current run. Relative to the first input data accumulated over the 5 consecutive days processed yesterday, the second input data processed today drops the data produced on day 1 and adds the data newly produced today. That is, deleting the input data produced on day 1 (the deleted data) from the first input data and adding the input data produced today (the newly added data) yields the second input data to be processed in the current run.
For example, in an embodiment that processes user access logs, the data of the previous processing run (the first input data) is:
URL:111.com, date:20130214,11:00:00,8
URL:222.com, date:20130214,13:00:00,7
URL:111.com, date:20130214,15:00:00,5
URL:222.com, date:20130214,17:00:00,7
URL:111.com, date:20130215,14:00:00,5
URL:333.com, date:20130215,16:00:00,3
URL:333.com, date:20130216,15:00:00,8
URL:555.com, date:20130216,16:00:00,11
URL:222.com, date:20130217,15:00:00,6
URL:555.com, date:20130217,15:00:00,10
URL:666.com, date:20130218,14:00:00,8
URL:666.com, date:20130218,15:00:00,9
URL:666.com, date:20130218,16:00:00,5
The data to be processed in the current run (the second input data) is:
URL:111.com, date:20130215,14:00:00,5
URL:333.com, date:20130215,16:00:00,3
URL:333.com, date:20130216,15:00:00,8
URL:555.com, date:20130216,16:00:00,11
URL:222.com, date:20130217,15:00:00,6
URL:555.com, date:20130217,15:00:00,10
URL:666.com, date:20130218,14:00:00,8
URL:666.com, date:20130218,15:00:00,9
URL:666.com, date:20130218,16:00:00,5
URL:222.com, date:20130219,15:00:00,9
URL:333.com, date:20130219,16:00:00,6
URL:222.com, date:20130219,17:00:00,9
URL:333.com, date:20130219,18:00:00,8
Comparing the first input data with the second input data shows that the deleted data of the second input data relative to the first input data is:
URL:111.com, date:20130214,11:00:00,8
URL:222.com, date:20130214,13:00:00,7
URL:111.com, date:20130214,15:00:00,5
URL:222.com, date:20130214,17:00:00,7
And the newly added data is:
URL:222.com, date:20130219,15:00:00,9
URL:333.com, date:20130219,16:00:00,6
URL:222.com, date:20130219,17:00:00,9
URL:333.com, date:20130219,18:00:00,8
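The comparison in step S101 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `diff_inputs`, the comma-separated record strings, and the assumption that every record is unique within its input are all introduced here. The dates of the newly added records follow the second input data listing above.

```python
def diff_inputs(first_input, second_input):
    """Return (deleted, added) for second_input relative to first_input.

    Assumption: every record is unique within its input, as in the example.
    """
    first_set, second_set = set(first_input), set(second_input)
    # Deleted data: present in the first input, absent from the second.
    deleted = [rec for rec in first_input if rec not in second_set]
    # Newly added data: absent from the first input, present in the second.
    added = [rec for rec in second_input if rec not in first_set]
    return deleted, added


# Records from the example, written as "url,date,time,value".
day_14 = [
    "111.com,20130214,11:00:00,8",
    "222.com,20130214,13:00:00,7",
    "111.com,20130214,15:00:00,5",
    "222.com,20130214,17:00:00,7",
]
days_15_to_18 = [
    "111.com,20130215,14:00:00,5",
    "333.com,20130215,16:00:00,3",
    "333.com,20130216,15:00:00,8",
    "555.com,20130216,16:00:00,11",
    "222.com,20130217,15:00:00,6",
    "555.com,20130217,15:00:00,10",
    "666.com,20130218,14:00:00,8",
    "666.com,20130218,15:00:00,9",
    "666.com,20130218,16:00:00,5",
]
day_19 = [
    "222.com,20130219,15:00:00,9",
    "333.com,20130219,16:00:00,6",
    "222.com,20130219,17:00:00,9",
    "333.com,20130219,18:00:00,8",
]

first_input = day_14 + days_15_to_18
second_input = days_15_to_18 + day_19
deleted, added = diff_inputs(first_input, second_input)
```

With these inputs, `deleted` holds the four records of 20130214 and `added` holds the four records of 20130219; the nine shared records are never touched again.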
In step S102, first processing is performed on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set that correspond to them, respectively. The first processing may include: extracting key-value pairs from the data to be processed, to obtain a key index set and to form a key-indexed mapping data set. A key index means that the key of a data item is used as its index. That is, mapping processing can be applied to the data to be processed to obtain key-value pair data, in which the key of each data item is paired with its value, and a key-indexed mapping data set can then be generated from the key-value pair data by indexing on the key of each data item. The mapping data set may include at least one data subset, wherein data corresponding to the same key is placed in one data subset indexed by that key. In other words, the mapping data set may include one or more data subsets, each of which holds one or more data items with the same key and uses that shared key as its key index. Performing first processing on the deleted data and on the newly added data therefore yields the key of each deleted data item and the key of each newly added data item. The set formed by the keys of the deleted data items is the deleted-data key index set, and the set formed by the keys of the newly added data items is the newly-added-data key index set. The first processing may also include: extracting a record marker of the data to be processed. The record marker may include a file path and a line number and can be used to identify each data item; for example, the line number can be used as the record marker of a data item.
For example, with the URL as the key of each data item, mapping processing is applied to the deleted data and the newly added data obtained in step S101. The keys of the deleted data items are 111.com, 222.com, 111.com, and 222.com, and the keys of the newly added data items are 222.com, 333.com, 222.com, and 333.com. The deleted-data key index set is therefore {111.com, 222.com}, and the newly-added-data key index set is {222.com, 333.com}.
Also, from the key-value pair data obtained by applying mapping processing to the deleted data and the newly added data, a key-indexed deleted mapping data set and a key-indexed newly added mapping data set can be obtained; they correspond to the key indexes in the deleted-data key index set and the newly-added-data key index set, respectively. Each of these mapping data sets includes at least one data subset, wherein data corresponding to the same key is placed in one data subset indexed by that key. That is, the deleted mapping data set and the newly added mapping data set each include at least one subset made up of data items with the same key, and each subset is indexed by the key shared by its mapping data. The key indexes of the data subsets in the two mapping data sets therefore correspond to the deleted-data key index set and the newly-added-data key index set. The deleted mapping data set and the newly added mapping data set may also include the record marker of each data item, which identifies the data item within its key-indexed data subset; for example, the line number of each data item can be used as its record marker.
Thus, the key-indexed deleted mapping data set is:
111.com:
8^A0
5^A2
(The above is the data subset of the deleted mapping data set that is indexed by 111.com. It contains the data items 8^A0 and 5^A2, both of which have the key 111.com. Here 8 and 5 are the values of the data items, and A0 and A2 indicate that the data items are on lines 0 and 2, respectively; the line number can be used as the record marker of a data item.)
222.com:
7^A1
7^A3
The key-indexed newly added mapping data set is:
222.com:
9^A13
9^A14
333.com:
6^A15
8^A16
Here Ai denotes a line number, where i is 0, 1, 2, ..., n.
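The first processing of step S102 can be sketched as follows. This is a minimal illustration, assuming each record is a comma-separated string whose first field is the URL (the key) and whose last field is the value, and assuming the line number in the first input data serves as the record marker. The name `first_processing` and the `(value, line_number)` tuple layout are hypothetical choices made for this sketch.

```python
from collections import defaultdict


def first_processing(numbered_records):
    """Map (line_number, record) pairs to a key index set and a mapping data set.

    The mapping data set groups (value, line_number) entries by key, so each
    dictionary entry plays the role of one key-indexed data subset.
    """
    mapping = defaultdict(list)
    for line_no, record in numbered_records:
        url, _date, _time, value = record.split(",")
        mapping[url].append((int(value), line_no))
    return set(mapping), dict(mapping)


# The deleted data of the example, with its line numbers in the first input.
deleted = [
    (0, "111.com,20130214,11:00:00,8"),
    (1, "222.com,20130214,13:00:00,7"),
    (2, "111.com,20130214,15:00:00,5"),
    (3, "222.com,20130214,17:00:00,7"),
]
deleted_keys, deleted_mapping = first_processing(deleted)
```

Here `deleted_keys` is the deleted-data key index set and `deleted_mapping` is the deleted mapping data set; the entry `(8, 0)` corresponds to the notation 8^A0 used in the listing above.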
In step S103, the mapping data corresponding to the deleted mapping data in the deleted mapping data set is deleted from the key-indexed first mapping data set, and the newly added mapping data in the newly added mapping data set is added to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data. The first mapping data set corresponds to the first input data and to the key-indexed first output data. The first mapping data set is the set of key-indexed data obtained in the previous processing run. It includes at least one data subset, wherein data corresponding to the same key is placed in one data subset indexed by that key. That is, the first mapping data set includes one or more subsets made up of data items with the same key, each indexed by the key shared by its mapping data. The first mapping data set may also include the record marker of each data item as its identifier: the key is the index of each data subset in the first mapping data set, and the record marker identifies each mapping data item within a subset. For example, the line number of each data item can be used as its record marker.
That is, the corresponding mapping data in the first mapping data set can be deleted according to all the key indexes in the deleted mapping data set and the record marker of each data item. For example, the deleted mapping data set obtained in step S102 above includes mapping data indexed by 111.com and 222.com, with the line number of each data item as its record marker; the corresponding mapping data in the first mapping data set is then found by key index and line number and deleted. Specifically, using the key indexes 111.com and 222.com of the deleted mapping data set, the data subsets with key indexes 111.com and 222.com are located in the first mapping data set, and the line numbers recorded for the deleted data in the deleted mapping data set are used to find the corresponding mapping data, which is then deleted. Because the first mapping data set corresponds to the first input data and to the key-indexed first output data, the line number of each deleted data item in the first input data can be extracted as the record marker of each deleted mapping data item in the first mapping data set, so that the corresponding mapping data can be deleted from the matching subset of the first mapping data set according to the record marker of each deleted data item. In addition, the newly added mapping data in the newly added mapping data set is added to the first mapping data set, which yields the key-indexed second mapping data set corresponding to the second input data. Specifically, each key-indexed data subset of the newly added mapping data set can be added, according to its key index, to the data subset with the same key index in the first mapping data set, yielding the key-indexed second mapping data set. Like the first mapping data set, the second mapping data set therefore includes at least one subset made up of data items with the same key, each indexed by the key shared by its mapping data. The second mapping data set may also include the record marker of each data item as its identifier: the key is the index of each data subset in the second mapping data set, and the record marker identifies each mapping data item.
For example, the first mapping data set obtained in the previous processing run is:
111.com:
8^A0
5^A2
8^A4
222.com:
7^A1
7^A3
6^A8
333.com:
3^A5
8^A6
555.com:
11^A7
10^A9
666.com:
8^A10
9^A11
5^A12
Also, step S102 yielded the deleted-data key index set {111.com, 222.com} and the newly-added-data key index set {222.com, 333.com}. The deleted mapping data set is:
111.com:
8^A0
5^A2
222.com:
7^A1
7^A3
The newly added mapping data set is:
222.com:
9^A13
9^A14
333.com:
6^A15
8^A16
Therefore, deleting the deleted mapping data of the deleted mapping data set from the first mapping data set, and adding the newly added mapping data of the newly added mapping data set to the first mapping data set, yields the following second mapping data set:
111.com:
8^A4
222.com:
6^A8
9^A13
9^A14
333.com:
3^A5
8^A6
6^A15
8^A16
555.com:
11^A7
10^A9
666.com:
8^A10
9^A11
5^A12
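The update in step S103 can be sketched as follows, using the mapping data sets listed above. This is an illustrative sketch only: the name `apply_delta` and the representation of each entry as a `(value, line_number)` tuple are assumptions, and entries are matched for deletion by line number, which stands in for the record marker.

```python
def apply_delta(first_mapping, deleted_mapping, added_mapping):
    """Build the second mapping data set from the first one and the delta."""
    # Copy so that the first mapping data set is left untouched.
    second = {key: list(entries) for key, entries in first_mapping.items()}
    for key, entries in deleted_mapping.items():
        doomed_lines = {line_no for _value, line_no in entries}
        # Drop entries whose record marker (line number) was deleted.
        second[key] = [e for e in second.get(key, []) if e[1] not in doomed_lines]
    for key, entries in added_mapping.items():
        # Append new entries to the subset with the same key index.
        second.setdefault(key, []).extend(entries)
    return second


first_mapping = {
    "111.com": [(8, 0), (5, 2), (8, 4)],
    "222.com": [(7, 1), (7, 3), (6, 8)],
    "333.com": [(3, 5), (8, 6)],
    "555.com": [(11, 7), (10, 9)],
    "666.com": [(8, 10), (9, 11), (5, 12)],
}
deleted_mapping = {"111.com": [(8, 0), (5, 2)], "222.com": [(7, 1), (7, 3)]}
added_mapping = {"222.com": [(9, 13), (9, 14)], "333.com": [(6, 15), (8, 16)]}

second_mapping = apply_delta(first_mapping, deleted_mapping, added_mapping)
```

The subsets for 555.com and 666.com are copied over unchanged. How a subset that becomes empty should be treated is not stated in the description; this sketch simply keeps it as an empty list.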
In step S104, second processing is performed on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data. Step S104 is detailed in Fig. 3, which is a detailed flow chart of step S104. As shown in Fig. 3:
In step S301, the changed data subsets in the second mapping data set whose key indexes are the same as the key indexes in the deleted-data key index set and the newly-added-data key index set are determined. Step S103 above has already produced the key-indexed second mapping data set, which includes at least one data subset made up of data items with the same key, wherein data corresponding to the same key is placed in one data subset indexed by that key. Searching the second mapping data set for the data subsets whose key indexes match those in the deleted-data key index set and the newly-added-data key index set determines the data subsets of the second mapping data set that changed relative to the first mapping data set.
In step S302, second processing is performed on the changed data subsets to obtain changed output data. The second processing may include: processing the data to be processed according to a predetermined rule, to obtain key-indexed output data. The predetermined rule can be set according to the specific needs of the data processing. That is, the changed data subsets obtained above are processed according to the predetermined rule to obtain the changed output data, which may be key-indexed output data.
For example, the second mapping data set obtained in step S103 above is:
111.com:
8^A4
222.com:
6^A8
9^A13
9^A14
333.com:
3^A5
8^A6
6^A15
8^A16
555.com:
11^A7
10^A9
666.com:
8^A10
9^A11
5^A12
The deleted-data key index set and the newly-added-data key index set are {111.com, 222.com} and {222.com, 333.com}, respectively.
Therefore, in step S301, the changed data subsets are the data subsets indexed by 111.com, 222.com, and 333.com, namely:
111.com:
8^A4
222.com:
6^A8
9^A13
9^A14
333.com:
3^A5
8^A6
6^A15
8^A16
In step S302, the obtained data subsets are processed according to the predetermined rule, for example by addition, which yields the following key-indexed change output data:
111.com:
8
222.com:
24
333.com:
25
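The worked example for steps S301 and S302 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the dictionary layout, the function names `change_subsets` and `second_processing`, and the reading of an entry such as `8^A4` as "value 8 with record mark 4" are assumptions made for the example. Addition is used as the predetermined rule, as in the text above.

```python
# Hypothetical in-memory form of the second mapping data set from the example:
# key -> list of (value, record_mark) pairs; "8^A4" is read as value 8, mark 4.
second_mapping = {
    "111.com": [(8, 4)],
    "222.com": [(6, 8), (9, 13), (9, 14)],
    "333.com": [(3, 5), (8, 6), (6, 15), (8, 16)],
    "555.com": [(11, 7), (10, 9)],
    "666.com": [(8, 10), (9, 11), (5, 12)],
}
deleted_keys = {"111.com", "222.com"}
added_keys = {"222.com", "333.com"}


def change_subsets(mapping, deleted_keys, added_keys):
    """Step S301: keep only the subsets whose key is in either key index set."""
    changed = deleted_keys | added_keys
    # Assumption: a changed key with no remaining data is simply skipped.
    return {key: mapping[key] for key in sorted(changed) if key in mapping}


def second_processing(subsets):
    """Step S302, with addition as the assumed predetermined rule."""
    return {key: sum(value for value, _mark in rows) for key, rows in subsets.items()}


change_output = second_processing(
    change_subsets(second_mapping, deleted_keys, added_keys)
)
print(change_output)  # {'111.com': 8, '222.com': 24, '333.com': 25}
```

Only three of the five subsets are read, so the subsets indexed by 555.com and 666.com are never touched by the second processing.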
In step S105, the to-be-replaced output data in the first output data, that is, the output data corresponding to the key indexes of the change output data, is replaced with the change output data to obtain the key-indexed second output data of the current processing run.
Step S105 is detailed in Fig. 4, which is a detailed flowchart of step S105. As shown in Fig. 4:
In step S401, the first output data is searched for the to-be-replaced output data that has the same key as each key in the change output data. That is, according to all key indexes of the key-indexed change output data, the key-indexed output data in the first output data whose key is the same as a key of the change output data is found; this is the to-be-replaced output data.
In step S402, the to-be-replaced output data is replaced with the change output data, and the first output data after replacement is taken as the second output data of the current processing run. That is, each piece of to-be-replaced output data found in the first output data is replaced with the change output data that has the same key index, and the first output data after replacement is the second output data of the current run. Specifically, in the output data, the key-indexed output data corresponding to changed data is replaced with the change output data that has the same key, while the key-indexed output data corresponding to unchanged data is output unchanged from the original first output data.
For example, suppose the first output data obtained by the previous processing run is:
111.com:
21
222.com:
20
333.com:
11
555.com:
21
666.com:
22
Therefore, based on the key-indexed change output data obtained in step S104, the to-be-replaced output data that has the same keys as the change output data can be found in the first output data (step S401):
111.com:
21
222.com:
20
333.com:
11
In step S402, these entries are replaced with the change output data obtained in step S104, which gives the second output data of the current processing run:
111.com:
8
222.com:
24
333.com:
25
555.com:
21
666.com:
22
In the second output data, the output data indexed by 555.com and 666.com is unchanged, and the data for those keys did not have to go through the two processing stages described above, which reduces the amount of processing.
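The lookup and replacement of steps S401 and S402 can be sketched as a dictionary update. This is a minimal sketch under assumptions: the function name `replace_output` and the use of plain dictionaries are illustrative, and the handling of a key that appears in the change output data but not in the first output data (it is simply added) is an assumption that the text above does not spell out.

```python
first_output = {
    "111.com": 21,
    "222.com": 20,
    "333.com": 11,
    "555.com": 21,
    "666.com": 22,
}
change_output = {"111.com": 8, "222.com": 24, "333.com": 25}


def replace_output(first_output, change_output):
    """Steps S401/S402: overwrite entries whose key occurs in change_output."""
    second_output = dict(first_output)   # leave the previous result untouched
    second_output.update(change_output)  # same key -> replaced; others kept as-is
    return second_output


second_output = replace_output(first_output, change_output)
print(second_output)
# {'111.com': 8, '222.com': 24, '333.com': 25, '555.com': 21, '666.com': 22}
```

Copying the first output data before updating keeps the previous run's result available, which can be convenient if the previous state must be retained until the current run completes.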
In addition, for the first processing run (that is, when there is no previous run), all data of the previous run that is involved, such as the first input data, the first mapping data set, and the first output data, is treated as empty. That is, during this run: in step S101, the entire second input data is the newly added data relative to the (empty) first input data, and the deleted data is empty; in step S102, the newly added data key index set, the deleted data key index set, the newly added mapping data set, and the second mapping data set are obtained, where the deleted data key index set is empty and the second mapping data set is the newly added mapping data set; steps S103 to S105 follow by analogy and are not repeated here. The second output data finally obtained by this run is therefore the change output data obtained by performing the second processing on the newly added mapping data.
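To illustrate how the first run falls out of the general procedure, the sketch below chains steps S101 to S105 in one function and calls it with empty previous state. Everything here is a hypothetical illustration: the function `run_once`, the `(key, value, mark)` record layout, the sample records, the use of addition as the predetermined rule, and the choice to drop a key once it has no remaining data are all assumptions.

```python
def run_once(prev_input, prev_mapping, prev_output, new_input):
    """One pass of steps S101 to S105 over (key, value, mark) records."""
    prev_set, new_set = set(prev_input), set(new_input)
    deleted = [r for r in prev_input if r not in new_set]  # S101: deleted data
    added = [r for r in new_input if r not in prev_set]    # S101: newly added data

    # S102/S103: apply the deletions and additions to a copy of the mapping set.
    mapping = {key: list(rows) for key, rows in prev_mapping.items()}
    changed_keys = set()
    for key, value, mark in deleted:
        mapping[key].remove((value, mark))
        changed_keys.add(key)
    for key, value, mark in added:
        mapping.setdefault(key, []).append((value, mark))
        changed_keys.add(key)

    # S104/S105: recompute only the changed keys (assumed rule: addition).
    output = dict(prev_output)
    for key in sorted(changed_keys):
        rows = mapping.get(key)
        if rows:
            output[key] = sum(value for value, _mark in rows)
        else:
            # Assumption: a key with no remaining data is dropped entirely.
            mapping.pop(key, None)
            output.pop(key, None)
    return mapping, output


records = [("a.com", 3, 1), ("a.com", 4, 2), ("b.com", 5, 3)]
# First run: every piece of previous state is empty.
mapping1, output1 = run_once([], {}, {}, records)
print(output1)  # {'a.com': 7, 'b.com': 5}
```

With empty previous state, nothing is deleted, the second mapping data set equals the newly added mapping data set, and the result is the change output data for every key, which matches the description above.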
Referring to Fig. 5, which is a block diagram of a data processing device according to another aspect of the present application. As shown in Fig. 5, the device may include: a comparison module 510, a first processing module 520, an intermediate processing module 530, a second processing module 540, and an output data acquisition module 550.
The comparison module 510 may be configured to compare the first input data of the previous processing run with the second input data to be processed in the current run to obtain change data, where the change data includes the deleted data and the newly added data of the second input data relative to the first input data.
The first processing module 520 may be configured to perform first processing on the deleted data and the newly added data to obtain a deleted data key index set and a newly added data key index set, together with a deleted mapping data set and a newly added mapping data set that correspond to the deleted data key index set and the newly added data key index set, respectively.
The intermediate processing module 530 may be configured to delete, from the key-indexed first mapping data set, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and to add the newly added mapping data in the newly added mapping data set to the first mapping data set, so as to form a second mapping data set corresponding to the second input data, where the first mapping data set corresponds to both the first input data and the key-indexed first output data.
The second processing module 540 is configured to perform second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain key-indexed change output data.
The output data acquisition module 550 is configured to replace the to-be-replaced output data in the first output data that corresponds to the key indexes of the change output data with the change output data, to obtain the key-indexed second output data of the current processing run.
The newly added mapping data set, the first mapping data set, and the second mapping data set each include at least one data subset, where data corresponding to the same key is placed in one data subset indexed by that key.
The second processing module 540 may include a determination submodule 541 and a processing submodule 542.
The determination submodule 541 may be configured to determine the change data subsets in the second mapping data set whose key index is the same as a key index in the deleted data key index set or the newly added data key index set.
The processing submodule 542 may be configured to perform second processing on the change data subsets to obtain key-indexed change output data.
The output data acquisition module 550 may include a lookup submodule 551 and a replacement submodule 552.
The lookup submodule 551 may be configured to search the first output data for the to-be-replaced output data that has the same key as each key in the change output data.
The replacement submodule 552 may be configured to replace the to-be-replaced output data with the change output data, and to take the first output data after replacement as the second output data of the current processing run.
The deleted data is data that appears in the first input data but does not appear in the second input data; the newly added data is data that does not appear in the first input data but appears in the second input data.
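The definition of deleted data and newly added data amounts to two set differences. A minimal sketch follows, assuming each input is a list of unique text lines; the function name `compare_inputs` and the "key value" line format are hypothetical.

```python
def compare_inputs(first_input, second_input):
    """Return (deleted, added) for two lists of unique input lines."""
    first_set, second_set = set(first_input), set(second_input)
    # Deleted: present in the first input data, absent from the second.
    deleted = [line for line in first_input if line not in second_set]
    # Newly added: absent from the first input data, present in the second.
    added = [line for line in second_input if line not in first_set]
    return deleted, added


first_input = ["111.com 8", "222.com 6", "555.com 11"]
second_input = ["222.com 6", "555.com 11", "333.com 3"]
deleted, added = compare_inputs(first_input, second_input)
print(deleted)  # ['111.com 8']
print(added)    # ['333.com 3']
```

Using sets for the membership test keeps the comparison roughly linear in the input size, while the list comprehensions preserve the original line order.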
The first processing may include: extracting key-value pairs from the data to be processed to obtain a key index set indexed by key, and forming a key-indexed mapping data set. The first processing further includes: extracting a record mark of the data to be processed, where the record mark includes a file path and a line number.
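A possible shape for the first processing is sketched below. The function name `first_processing`, the input layout of `(file_path, line_number, text)` tuples, the "key value" text format, and the sample file path are all assumptions made for illustration; only the idea of extracting key-value pairs together with a record mark consisting of file path and line number comes from the description.

```python
from collections import defaultdict


def first_processing(records):
    """Group values by key and keep a (file_path, line_number) record mark.

    records: iterable of (file_path, line_number, text), where text is
    assumed to look like "key value".
    """
    mapping = defaultdict(list)
    for file_path, line_number, text in records:
        key, value = text.split()
        mapping[key].append((int(value), (file_path, line_number)))
    key_index_set = set(mapping)
    return key_index_set, dict(mapping)


added = [
    ("logs/a.txt", 13, "222.com 9"),
    ("logs/a.txt", 14, "222.com 9"),
    ("logs/a.txt", 15, "333.com 6"),
]
keys, mapping = first_processing(added)
print(sorted(keys))        # ['222.com', '333.com']
print(mapping["222.com"])  # [(9, ('logs/a.txt', 13)), (9, ('logs/a.txt', 14))]
```

The record mark is what lets two otherwise identical values under the same key, such as the two 9s under 222.com, be told apart when one of them is later deleted.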
The second processing may include: processing the data to be processed according to a predetermined rule to obtain key-indexed output data.
The functions implemented by the device of this embodiment essentially correspond to the method embodiments shown in Figs. 1 to 4 above. For anything not detailed in the description of this embodiment, refer to the related description in the foregoing embodiments; it is not repeated here.
The present application may be described in the general context of computer-executable instructions, such as program modules or units. Generally, a program module or unit may include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. In general, a program module or unit may be implemented in software, hardware, or a combination of both. The application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules or units may be located in both local and remote computer storage media, including storage devices.
Finally, it should also be noted that the terms "include", "comprise", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity, or device that includes the element.
Specific examples are used herein to explain the principles and embodiments of the present application; the description of the above embodiments is only intended to help in understanding the method of the application and its main idea. For those of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application in accordance with the idea of the application. Any modification, equivalent substitution, or improvement made within the spirit and principles of the application shall fall within the scope of its claims. In summary, the content of this specification should not be construed as limiting the application.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory. The memory may include volatile memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the application shall fall within the scope of its claims.

Claims (12)

1. A data processing method, characterized by comprising:
comparing the first input data of a previous processing run with the second input data to be processed in the current run to obtain change data, the change data including deleted data and newly added data of the second input data relative to the first input data;
performing first processing on the deleted data and the newly added data to obtain a deleted data key index set and a newly added data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set corresponding to the deleted data key index set and the newly added data key index set, respectively;
deleting, from a key-indexed first mapping data set, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and adding the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data, wherein the first mapping data set corresponds to both the first input data and key-indexed first output data;
performing second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain key-indexed change output data; and
replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the change output data with the change output data, to obtain key-indexed second output data of the current processing run.
2. The method according to claim 1, characterized in that the newly added mapping data set, the first mapping data set, and the second mapping data set each include at least one data subset, wherein data corresponding to the same key is placed in one data subset indexed by that key.
3. The method according to claim 2, characterized in that the step of performing second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain key-indexed change output data, comprises:
determining the change data subsets in the second mapping data set whose key index is the same as a key index in the deleted data key index set or the newly added data key index set;
performing second processing on the change data subsets to obtain the key-indexed change output data.
4. The method according to claim 2, characterized in that the step of replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the change output data with the change output data, to obtain key-indexed second output data of the current processing run, comprises:
searching the first output data for the to-be-replaced output data that has the same key as each key in the change output data;
replacing the to-be-replaced output data with the change output data, and taking the first output data after replacement as the second output data of the current processing run.
5. The method according to claim 1, characterized in that:
the deleted data is data that appears in the first input data but does not appear in the second input data;
the newly added data is data that does not appear in the first input data but appears in the second input data.
6. The method according to claim 1, characterized in that:
the first processing includes: extracting key-value pairs from the data to be processed to obtain a key index set indexed by key, and forming a key-indexed mapping data set; and wherein the first processing further includes: extracting a record mark of the data to be processed, the record mark including a file path and a line number;
the second processing includes: processing the data to be processed according to a predetermined rule to obtain key-indexed output data.
7. A data processing device, characterized by comprising:
a comparison module, configured to compare the first input data of a previous processing run with the second input data to be processed in the current run to obtain change data, the change data including deleted data and newly added data of the second input data relative to the first input data;
a first processing module, configured to perform first processing on the deleted data and the newly added data to obtain a deleted data key index set and a newly added data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set corresponding to the deleted data key index set and the newly added data key index set, respectively;
an intermediate processing module, configured to delete, from a key-indexed first mapping data set, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and to add the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data, wherein the first mapping data set corresponds to both the first input data and key-indexed first output data;
a second processing module, configured to perform second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain key-indexed change output data; and
an output data acquisition module, configured to replace the to-be-replaced output data in the first output data that corresponds to the key indexes of the change output data with the change output data, to obtain key-indexed second output data of the current processing run.
8. The device according to claim 7, characterized in that the newly added mapping data set, the first mapping data set, and the second mapping data set each include at least one data subset, wherein data corresponding to the same key is placed in one data subset indexed by that key.
9. The device according to claim 8, characterized in that the second processing module includes:
a determination submodule, configured to determine the change data subsets in the second mapping data set whose key index is the same as a key index in the deleted data key index set or the newly added data key index set;
a processing submodule, configured to perform second processing on the change data subsets to obtain the key-indexed change output data.
10. The device according to claim 8, characterized in that the output data acquisition module includes:
a lookup submodule, configured to search the first output data for the to-be-replaced output data that has the same key as each key in the change output data;
a replacement submodule, configured to replace the to-be-replaced output data with the change output data, and to take the first output data after replacement as the second output data of the current processing run.
11. The device according to claim 7, characterized in that:
the deleted data is data that appears in the first input data but does not appear in the second input data;
the newly added data is data that does not appear in the first input data but appears in the second input data.
12. The device according to claim 7, characterized in that:
the first processing includes: extracting key-value pairs from the data to be processed to obtain a key index set indexed by key, and forming a key-indexed mapping data set; and wherein the first processing further includes: extracting a record mark of the data to be processed, the record mark including a file path and a line number;
the second processing includes: processing the data to be processed according to a predetermined rule to obtain key-indexed output data.
CN201310268334.2A 2013-06-28 2013-06-28 A kind of method and device of data processing Active CN104252486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310268334.2A CN104252486B (en) 2013-06-28 2013-06-28 A kind of method and device of data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310268334.2A CN104252486B (en) 2013-06-28 2013-06-28 A kind of method and device of data processing

Publications (2)

Publication Number Publication Date
CN104252486A CN104252486A (en) 2014-12-31
CN104252486B true CN104252486B (en) 2017-09-12

Family

ID=52187388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310268334.2A Active CN104252486B (en) 2013-06-28 2013-06-28 A kind of method and device of data processing

Country Status (1)

Country Link
CN (1) CN104252486B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399151B (en) * 2017-02-06 2022-02-15 百度在线网络技术(北京)有限公司 Data comparison system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129458A (en) * 2011-03-09 2011-07-20 胡劲松 Method and device for storing relational database
CN102402559A (en) * 2010-09-16 2012-04-04 中兴通讯股份有限公司 Database upgrade script generating method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040049767A1 (en) * 2002-09-05 2004-03-11 International Business Machines Corporation Method and apparatus for comparing computer code listings
TW201007557A (en) * 2008-08-06 2010-02-16 Inventec Corp Method for reading/writing data in a multithread system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A comparison of data warehousing methodologies;Arun Sen等;《Communications of The ACM》;20050331;第48卷(第3期);第79-84页 *
Beyond Constant Comparison Qualitative Data Analysis: Using NVivo;Nancy L Leech等;《School Psychology Quarterly》;20110331;第26卷(第1期);第70-84页 *
基于Hibernate的数据整合系统的研究与开发;王兰香;《中国优秀硕士学位论文全文数据库 信息科技辑》;20070715(第01期);第I138-135页 *
面向服务的Web异构数据集成存取体系结构研究;巫丹丹;《中国优秀硕士学位论文全文数据库 信息科技辑》;20070715(第01期);第I138-48页 *

Also Published As

Publication number Publication date
CN104252486A (en) 2014-12-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant