CN104252486B - Method and device for data processing
- Publication number: CN104252486B
- Application number: CN201310268334.2A
- Authority: CN (China)
- Prior art keywords: data, key, index, mapping, deleted
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2379—Updates performed during online database operations; commit processing
- G06F16/2386—Bulk updating operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a method and device for data processing. The method includes: comparing the first input data of the previous processing run with the second input data to be processed in the current run, to obtain deleted data and newly added data; performing first processing on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, together with a deleted mapping data set and a newly added mapping data set; deleting from the first mapping data set the mapping data corresponding to the deleted mapping data, and adding the newly added mapping data to the first mapping data set, to form a second mapping data set; performing second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain changed output data; and replacing the to-be-replaced output data in the first output data with the changed output data, to obtain second output data. The application avoids reprocessing input data that has not changed.
Description
Technical field
The application relates to the field of computer technology, and in particular to a method and device for data processing.
Background
In the field of large-scale computing, cloud computing is attracting strong interest, and MapReduce, as a core technology of cloud computing, has also received wide attention. A MapReduce system builds its basic unit of computation from two simple concepts, Map (mapping) and Reduce (reduction). A user only needs to write a Map function and a Reduce function to process a massive data set in parallel. Based on information such as the size of the input data and the job configuration, the MapReduce system automatically initializes the job into multiple identical Map tasks and Reduce tasks, each of which reads a different block of input data and calls the Map function or the Reduce function to process it.
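As a rough illustration of the programming model described above, the following single-process sketch shows the two functions a user supplies and a toy driver that groups mapped values by key. The names `map_fn`, `reduce_fn` and `run_job` are hypothetical, and a real MapReduce system would distribute this work across many Map and Reduce tasks.

```python
from collections import defaultdict

def map_fn(record):
    """User-written Map: emit (key, value) pairs for one input record."""
    url, hits = record
    yield url, hits

def reduce_fn(key, values):
    """User-written Reduce: combine all values that share one key."""
    return key, sum(values)

def run_job(records):
    """Toy driver (assumption): group mapped pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

result = run_job([("111.com", 8), ("222.com", 7), ("111.com", 5)])
```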
In current practice, a MapReduce data processing system is usually set to run on a schedule, for example once a day. The input to a MapReduce run is typically the data accumulated over a period of time, for example the last 15 days. A characteristic of running MapReduce on such data is that the input of the current run is largely identical to the input of the previous run: only part of the data has been deleted for the current run, and/or part of the data has been newly added. Existing applications ignore this property of the input data and run a complete MapReduce pass over all the data. Yet much of the data is in fact unchanged between two adjacent MapReduce runs, so the computation on that data is repeated and wastes computing resources.
Summary of the invention
To overcome the above drawback, the application provides a method and device for data processing that avoid reprocessing unchanged data.
According to one aspect of the application, a method for data processing is provided, including: comparing the first input data of the previous processing run with the second input data to be processed in the current run, to obtain changed data, where the changed data includes the deleted data and the newly added data of the second input data relative to the first input data; performing first processing on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set corresponding to the deleted-data key index set and the newly-added-data key index set respectively; deleting, from a key-indexed first mapping data set, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and adding the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data, where the first mapping data set corresponds to the first input data and to key-indexed first output data; performing second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data; and replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain the key-indexed second output data of the current run.
According to an embodiment of the application, in the method, the newly added mapping data set, the first mapping data set and the second mapping data set each include at least one data subset, where data corresponding to the same key is in one data subset indexed by that key.
According to an embodiment of the application, in the method, the step of performing second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data, includes: determining the changed data subsets in the second mapping data set whose key indexes are the same as those in the deleted-data key index set and the newly-added-data key index set; and performing second processing on the changed data subsets to obtain the key-indexed changed output data.
According to an embodiment of the application, in the method, the step of replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain the key-indexed second output data of the current run, includes: searching the first output data for the to-be-replaced output data whose keys are the same as the keys in the changed output data; and replacing the to-be-replaced output data with the changed output data, and taking the first output data after replacement as the second output data of the current run.
According to an embodiment of the application, in the method, the deleted data is data that appears in the first input data and does not appear in the second input data; the newly added data is data that does not appear in the first input data and appears in the second input data.
According to an embodiment of the application, in the method, the first processing includes: extracting key-value pairs from the data to be processed, to obtain a key index set and to form a key-indexed mapping data set. The first processing further includes: extracting a record tag of the data to be processed, where the record tag includes a file path and a line number. The second processing includes: processing the data to be processed according to a predetermined rule, to obtain key-indexed output data.
According to another aspect of the application, a device for data processing is provided, including: a comparison module, configured to compare the first input data of the previous processing run with the second input data to be processed in the current run, to obtain changed data, where the changed data includes the deleted data and the newly added data of the second input data relative to the first input data; a first processing module, configured to perform first processing on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set corresponding to the deleted-data key index set and the newly-added-data key index set respectively; an intermediate processing module, configured to delete, from a key-indexed first mapping data set, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and to add the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data, where the first mapping data set corresponds to the first input data and to key-indexed first output data; a second processing module, configured to perform second processing on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data; and an output data acquisition module, configured to replace the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain the key-indexed second output data of the current run.
According to an embodiment of the application, in the device, the newly added mapping data set, the first mapping data set and the second mapping data set each include at least one data subset, where data corresponding to the same key is in one data subset indexed by that key.
According to an embodiment of the application, in the device, the second processing module includes: a determination submodule, configured to determine the changed data subsets in the second mapping data set whose key indexes are the same as those in the deleted-data key index set and the newly-added-data key index set; and a processing submodule, configured to perform second processing on the changed data subsets to obtain the key-indexed changed output data.
According to an embodiment of the application, in the device, the output data acquisition module includes: a search submodule, configured to search the first output data for the to-be-replaced output data whose keys are the same as the keys in the changed output data; and a replacement submodule, configured to replace the to-be-replaced output data with the changed output data and to take the first output data after replacement as the second output data of the current run.
According to an embodiment of the application, in the device, the deleted data is data that appears in the first input data and does not appear in the second input data; the newly added data is data that does not appear in the first input data and appears in the second input data.
According to an embodiment of the application, in the device, the first processing includes: extracting key-value pairs from the data to be processed, to obtain a key index set and to form a key-indexed mapping data set. The first processing further includes: extracting a record tag of the data to be processed, where the record tag includes a file path and a line number. The second processing includes: processing the data to be processed according to a predetermined rule, to obtain key-indexed output data.
Compared with the prior art, the technical solution of the application avoids reprocessing unchanged input data, which shortens data processing time and saves data processing resources.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the application and form a part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a data processing method according to an embodiment of the application;
Fig. 2 is a schematic diagram of the previously processed data and the currently processed data in a data processing method according to an embodiment of the application;
Fig. 3 is a detailed flowchart of step S104 in Fig. 1 according to an embodiment of the application;
Fig. 4 is a detailed flowchart of step S105 in Fig. 1 according to an embodiment of the application; and
Fig. 5 is a structural block diagram of a data processing device according to an embodiment of the application.
Detailed description
The main idea of the application is to compare the input data of the previous run with that of the current run to obtain the data that changed, to process the changed input data into key-indexed changed output data, and to replace the corresponding output data of the previous run according to the key indexes of the changed output data, so as to obtain the key-indexed output data of the current run.
To make the purpose, technical solution and advantages of the application clearer, the technical solution is described clearly and completely below with reference to specific embodiments of the application and the corresponding drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the application without creative effort fall within the scope of protection of the application.
Referring to Fig. 1, Fig. 1 is a flowchart of a data processing method of the application.
In step S101, the first input data of the previous processing run is compared with the second input data to be processed in the current run, to obtain changed data.
The changed data includes the deleted data and the newly added data of the second input data relative to the first input data. The deleted data is data that appears in the first input data and does not appear in the second input data. The newly added data is data that does not appear in the first input data and appears in the second input data.
In practice, data processing is usually set to run at a fixed interval or frequency, for example once a day, and the data processed is usually the data accumulated over a period of time, for example the last 15 days. In the application, the previous run and the current run are two processing runs carried out one after the other in time. Any run is the current run while it executes; the run carried out immediately before it is its previous run, and it is in turn the previous run of the run that follows it. As shown in Fig. 2, if the data accumulated over 5 days is processed every day, the data 210 processed yesterday is the first input data of the previous run, and the data 220 to be processed today is the second input data of the current run. Relative to the first input data, which yesterday's run accumulated over 5 consecutive days, today's second input data drops the data produced on day 1 and adds the data newly produced today. That is, deleting from the first input data the input data produced on day 1 (the deleted data) and adding the input data produced today (the newly added data) yields the second input data to be processed in the current run.
For example, in an embodiment that processes user access logs, the data of the previous run (the first input data) is:
URL:111.com, date:20130214,11:00:00,8
URL:222.com, date:20130214,13:00:00,7
URL:111.com, date:20130214,15:00:00,5
URL:222.com, date:20130214,17:00:00,7
URL:111.com, date:20130215,14:00:00,5
URL:333.com, date:20130215,16:00:00,3
URL:333.com, date:20130216,15:00:00,8
URL:555.com, date:20130216,16:00:00,11
URL:222.com, date:20130217,15:00:00,6
URL:555.com, date:20130217,15:00:00,10
URL:666.com, date:20130218,14:00:00,8
URL:666.com, date:20130218,15:00:00,9
URL:666.com, date:20130218,16:00:00,5
The data to be processed in the current run (the second input data) is:
URL:111.com, date:20130215,14:00:00,5
URL:333.com, date:20130215,16:00:00,3
URL:333.com, date:20130216,15:00:00,8
URL:555.com, date:20130216,16:00:00,11
URL:222.com, date:20130217,15:00:00,6
URL:555.com, date:20130217,15:00:00,10
URL:666.com, date:20130218,14:00:00,8
URL:666.com, date:20130218,15:00:00,9
URL:666.com, date:20130218,16:00:00,5
URL:222.com, date:20130219,15:00:00,9
URL:333.com, date:20130219,16:00:00,6
URL:222.com, date:20130219,17:00:00,9
URL:333.com, date:20130219,18:00:00,8
Comparing the first input data with the second input data gives the deleted data of the second input data relative to the first input data:
URL:111.com, date:20130214,11:00:00,8
URL:222.com, date:20130214,13:00:00,7
URL:111.com, date:20130214,15:00:00,5
URL:222.com, date:20130214,17:00:00,7
And the newly added data is:
URL:222.com, date:20130219,15:00:00,9
URL:333.com, date:20130219,16:00:00,6
URL:222.com, date:20130219,17:00:00,9
URL:333.com, date:20130219,18:00:00,8
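The comparison in step S101 can be sketched as a set difference over whole records. This is a minimal sketch, assuming records are compared as exact strings and that no record repeats within one input; `diff_inputs` is a hypothetical name, and the inputs are a shortened excerpt of the example data.

```python
def diff_inputs(first_input, second_input):
    """Return (deleted, added) records, each in its original input order."""
    first_set, second_set = set(first_input), set(second_input)
    # Deleted: appears in the first input but not in the second.
    deleted = [r for r in first_input if r not in second_set]
    # Newly added: appears in the second input but not in the first.
    added = [r for r in second_input if r not in first_set]
    return deleted, added

first_input = [
    "URL:111.com, date:20130214,11:00:00,8",
    "URL:222.com, date:20130214,13:00:00,7",
    "URL:111.com, date:20130215,14:00:00,5",
]
second_input = [
    "URL:111.com, date:20130215,14:00:00,5",
    "URL:222.com, date:20130219,15:00:00,9",
]
deleted, added = diff_inputs(first_input, second_input)
```

Only the records in `deleted` and `added` need to go through the first processing; the record shared by both inputs is left alone.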
In step S102, first processing is performed on the deleted data and the newly added data, to obtain a deleted-data key index set and a newly-added-data key index set, and a key-indexed deleted mapping data set and a key-indexed newly added mapping data set corresponding to those two key index sets respectively. The first processing can include: extracting key-value pairs from the data to be processed, to obtain a key index set and to form a key-indexed mapping data set. A key index means that the key of a piece of data is used as its index. That is, the data to be processed is mapped into key-value pair data: mapping processing is applied to the data to obtain, for each piece of data, a key-value pair made up of its key and its value, and a key-indexed mapping data set is then generated from those key-value pairs using the key of each piece of data as the index. The mapping data set can include at least one data subset, where data corresponding to the same key is in the data subset indexed by that key. In other words, the mapping data set can include one or more data subsets, each of which holds one or more pieces of data with the same key and uses that shared key as its key index. Performing the first processing on the deleted data and on the newly added data therefore yields the key of each piece of deleted data and the key of each piece of newly added data. The keys of the deleted data together form the deleted-data key index set, and the keys of the newly added data together form the newly-added-data key index set. The first processing can also include: extracting a record tag of the data to be processed. The record tag can include a file path and a line number and can be used to identify each piece of data; for example, the line number can be used as the record tag of the data.
For example, with the URL used as the key of the data, mapping processing of the deleted data and the newly added data obtained in step S101 gives the keys 111.com, 222.com, 111.com, 222.com for the deleted data and the keys 222.com, 333.com, 222.com, 333.com for the newly added data. The deleted-data key index set is therefore {111.com, 222.com}, and the newly-added-data key index set is {222.com, 333.com}.
From the key-value pair data obtained by mapping the deleted data and the newly added data, a key-indexed deleted mapping data set and a key-indexed newly added mapping data set can also be obtained. These two sets correspond to the key indexes in the deleted-data key index set and the newly-added-data key index set respectively. Each of them includes at least one data subset, where data corresponding to the same key is in one data subset indexed by that key. Because each subset is indexed by the key shared by its mapping data, the key indexes of the data subsets in the deleted mapping data set and the newly added mapping data set match the deleted-data key index set and the newly-added-data key index set. The deleted mapping data set and the newly added mapping data set can also include the record tags of the data, which identify each piece of data within its key-indexed data subset; for example, the line number of each piece of data can be used as its record tag.
The key-indexed deleted mapping data set is thus:
111.com:
8^A0
5^A2
(The above is the data subset of the deleted mapping data set indexed by 111.com. It contains the data 8^A0 and 5^A2, whose key is 111.com in both cases; 8 and 5 are the values of the data, and A0 and A2 indicate that the data is on lines 0 and 2. The line number can be used as the record tag of the data.)
222.com:
7^A1
7^A3
The key-indexed newly added mapping data set is:
222.com:
9^A13
9^A14
333.com:
6^A15
8^A16
Here Ai denotes a line number, with i = 0, 1, 2, ..., n.
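The first processing of step S102 can be sketched as follows. This is a minimal sketch, assuming each record arrives together with its line number, that the URL is the key and that the last comma-separated field is the value; `first_processing` is a hypothetical name. Each mapping entry is stored as a `(value, line_number)` pair, mirroring the `value^Ai` notation above.

```python
from collections import defaultdict

def first_processing(records):
    """records: iterable of (line_number, text).

    Returns (key_index_set, mapping), where mapping[key] is a list of
    (value, line_number) pairs that share that key.
    """
    mapping = defaultdict(list)
    for line_no, text in records:
        url_part, rest = text.split(", ", 1)   # "URL:111.com" / "date:...,8"
        key = url_part.split(":", 1)[1]         # "111.com"
        value = int(rest.rsplit(",", 1)[1])     # trailing numeric field
        mapping[key].append((value, line_no))
    return set(mapping), dict(mapping)

deleted_records = [
    (0, "URL:111.com, date:20130214,11:00:00,8"),
    (1, "URL:222.com, date:20130214,13:00:00,7"),
    (2, "URL:111.com, date:20130214,15:00:00,5"),
    (3, "URL:222.com, date:20130214,17:00:00,7"),
]
deleted_keys, deleted_mapping = first_processing(deleted_records)
```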
In step S103, the mapping data corresponding to the deleted mapping data in the deleted mapping data set is deleted from the key-indexed first mapping data set, and the newly added mapping data in the newly added mapping data set is added to the first mapping data set, to form a key-indexed second mapping data set corresponding to the second input data. The first mapping data set corresponds to the first input data and to the key-indexed first output data. The first mapping data set is the set of key-indexed data obtained in the previous run. It includes at least one data subset, where data corresponding to the same key is in one data subset indexed by that key. In other words, the first mapping data set includes one or more subsets of data with the same key, each indexed by the key shared by its mapping data. The first mapping data set can also include the record tags of the data: the key is the index of each data subset in the first mapping data set, and the record tag identifies each piece of mapping data within a subset. For example, the line number of each piece of data can be used as its record tag.
That is, the corresponding mapping data in the first mapping data set can be deleted according to the key indexes in the deleted mapping data set and the record tag of each piece of data. For example, the deleted mapping data set obtained in step S102 contains mapping data indexed by 111.com and 222.com, and its data uses line numbers as record tags; the corresponding mapping data is then found in the first mapping data set by key index and line number and deleted. Specifically, the data subsets with the key indexes 111.com and 222.com are looked up in the first mapping data set, and within them the mapping data whose line numbers match those recorded in the deleted mapping data set is found and deleted. Because the first mapping data set corresponds to the first input data and to the key-indexed first output data, the line number of each piece of deleted data in the first input data can be extracted and used as the record tag of the deleted mapping data in the first mapping data set, so that the corresponding mapping data can be deleted from the relevant subset of the first mapping data set according to the record tag of each piece of deleted data. The newly added mapping data in the newly added mapping data set is then added to the first mapping data set, which gives the key-indexed second mapping data set corresponding to the second input data. Specifically, each key-indexed data subset in the newly added mapping data set is added, by its key index, to the data subset with the same key in the first mapping data set. Like the first mapping data set, the second mapping data set includes at least one subset of data with the same key, each indexed by the key shared by its mapping data. The second mapping data set can also include the record tags of the data: the key is the index of each data subset in the second mapping data set, and the record tag identifies each piece of mapping data.
For example, the first mapping data set obtained in the previous run is:
111.com:
8^A0
5^A2
8^A4
222.com:
7^A1
7^A3
6^A8
333.com:
3^A5
8^A6
555.com:
11^A7
10^A9
666.com:
8^A10
9^A11
5^A12
In step S102, the deleted-data key index set was found to be {111.com, 222.com} and the newly-added-data key index set {222.com, 333.com}. The deleted mapping data set is:
111.com:
8^A0
5^A2
222.com:
7^A1
7^A3
The newly added mapping data set is:
222.com:
9^A13
9^A14
333.com:
6^A15
8^A16
Therefore, deleting the deleted mapping data in the deleted mapping data set from the first mapping data set, and adding the newly added mapping data in the newly added mapping data set to the first mapping data set, gives the second mapping data set:
111.com:
8^A4
222.com:
6^A8
9^A13
9^A14
333.com:
3^A5
8^A6
6^A15
8^A16
555.com:
11^A7
10^A9
666.com:
8^A10
9^A11
5^A12
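The update in step S103 can be sketched with plain dictionaries, using the mapping data sets listed above. This is a sketch under assumptions: entries are `(value, line_number)` pairs, line numbers serve as record tags, and `apply_delta` is a hypothetical name. Dropping a subset once it becomes empty is also an assumption; the text does not say what happens in that case.

```python
import copy

def apply_delta(first_mapping, deleted_mapping, added_mapping):
    """Return the second mapping data set without modifying the first."""
    second = copy.deepcopy(first_mapping)
    # Delete entries whose key and record tag (line number) match.
    for key, entries in deleted_mapping.items():
        tags = {line_no for _, line_no in entries}
        kept = [e for e in second.get(key, []) if e[1] not in tags]
        if kept:
            second[key] = kept
        else:
            second.pop(key, None)  # assumption: drop a subset once it is empty
    # Append newly added entries to the subset with the same key.
    for key, entries in added_mapping.items():
        second.setdefault(key, []).extend(entries)
    return second

first_mapping = {
    "111.com": [(8, 0), (5, 2), (8, 4)],
    "222.com": [(7, 1), (7, 3), (6, 8)],
    "333.com": [(3, 5), (8, 6)],
    "555.com": [(11, 7), (10, 9)],
    "666.com": [(8, 10), (9, 11), (5, 12)],
}
deleted_mapping = {"111.com": [(8, 0), (5, 2)], "222.com": [(7, 1), (7, 3)]}
added_mapping = {"222.com": [(9, 13), (9, 14)], "333.com": [(6, 15), (8, 16)]}
second_mapping = apply_delta(first_mapping, deleted_mapping, added_mapping)
```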
In step S104, second processing is performed on the mapping data in the second mapping data set that corresponds to the deleted-data key index set and the newly-added-data key index set, to obtain key-indexed changed output data. Fig. 3 is a detailed flowchart of step S104. As shown in Fig. 3:
In step S301, the changed data subsets are determined: the data subsets in the second mapping data set whose key indexes are the same as those in the deleted-data key index set and the newly-added-data key index set. Step S103 above produced the key-indexed second mapping data set, which includes at least one subset of data with the same key, where data corresponding to the same key is in one data subset indexed by that key. Looking up, in the second mapping data set, the data subsets whose key indexes appear in the deleted-data key index set or the newly-added-data key index set determines the data subsets of the second mapping data set that changed relative to the first mapping data set.
In step S302, second processing is performed on the changed data subsets to obtain the changed output data. The second processing can include: processing the data to be processed according to a predetermined rule, to obtain key-indexed output data. The predetermined rule can be set according to the specific needs of the data processing. That is, the changed data subsets obtained above are processed according to the predetermined rule to obtain the changed output data, which can be key-indexed output data.
For example, the second mapping data set obtained in step S103 above is:
111.com:
8^A4
222.com:
6^A8
9^A13
9^A14
333.com:
3^A5
8^A6
6^A15
8^A16
555.com:
11^A7
10^A9
666.com:
8^A10
9^A11
5^A12
The deleted-data key index set and the newly-added-data key index set are {111.com, 222.com} and {222.com, 333.com} respectively. Therefore, in step S301, the changed data subsets are the data subsets indexed by 111.com, 222.com and 333.com, that is:
111.com:
8^A4
222.com:
6^A8
9^A13
9^A14
333.com:
3^A5
8^A6
6^A15
8^A16
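The selection in step S301 amounts to taking the union of the two key index sets and keeping only the subsets with those keys. A minimal sketch, with `select_changed_subsets` as a hypothetical name and entries again stored as `(value, line_number)` pairs:

```python
def select_changed_subsets(second_mapping, deleted_keys, added_keys):
    """Keep only the subsets whose key is in either key index set."""
    changed_keys = deleted_keys | added_keys
    return {
        key: second_mapping[key]
        for key in changed_keys
        if key in second_mapping  # a key may have lost all of its data
    }

second_mapping = {
    "111.com": [(8, 4)],
    "222.com": [(6, 8), (9, 13), (9, 14)],
    "333.com": [(3, 5), (8, 6), (6, 15), (8, 16)],
    "555.com": [(11, 7), (10, 9)],
    "666.com": [(8, 10), (9, 11), (5, 12)],
}
changed = select_changed_subsets(
    second_mapping, {"111.com", "222.com"}, {"222.com", "333.com"}
)
```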
In step S302, the data subsets obtained are processed according to the predetermined rule, for example by summing the values, which gives the following key-indexed changed output data:
111.com:
8
222.com:
24
333.com:
25
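With addition as the predetermined rule, the second processing of step S302 reduces each changed subset to the sum of its values. A minimal sketch, where `second_processing` is a hypothetical name and the rule is only the one used in this example:

```python
def second_processing(subsets):
    """Sum the values in each key-indexed subset, ignoring the record tags."""
    return {
        key: sum(value for value, _line_no in entries)
        for key, entries in subsets.items()
    }

changed_subsets = {
    "111.com": [(8, 4)],
    "222.com": [(6, 8), (9, 13), (9, 14)],
    "333.com": [(3, 5), (8, 6), (6, 15), (8, 16)],
}
changed_output = second_processing(changed_subsets)
```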
In step S105, the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data is replaced with the changed output data, to obtain the key-indexed second output data of the current run.
Fig. 4 is a detailed flowchart of step S105. As shown in Fig. 4:
In step S401, the first output data is searched for the to-be-replaced output data whose keys are the same as the keys in the changed output data. That is, using all key indexes of the key-indexed changed output data, the output data in the first output data with the same key indexes is looked up; this is the to-be-replaced output data.
In step S402, the output data to be replaced is substituted for the change output data, and by after replacement
The first output data as this processing procedure the second output data.That is, by will quilt in the first output data found
Replace output data and be substituted for and change output data with its key index identical, the first output data after replacement is at this
Second output data of reason process.Specifically, it is that is, in output data, the data changed are corresponding with key
The change output data output of bonding identical is replaced for index output data, and the corresponding key index of constant data exports number
According to according to the constant output of the first original output data.
For example, the first output data obtained by the previous processing is:
111.com:
21
222.com:
20
333.com:
11
555.com:
21
666.com:
22
Therefore, in step S402, based on the key-indexed changed output data obtained in step S104, the to-be-replaced output data having the same keys as the changed output data can be found as:
111.com:
21
222.com:
20
333.com:
11
Replacing them with the changed output data obtained in step S104 yields the second output data of the current processing:
111.com:
8
222.com:
24
333.com:
25
555.com:
21
666.com:
22
In the second output data, the output data indexed by 555.com and 666.com is unchanged and did not need to go through the two processing stages described above, which reduces the amount of processing.
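The replacement in steps S401 and S402 can be sketched in Python as follows. This is a minimal illustration, assuming that the output data is modeled as a dict from key to value; the function name `replace_changed_output` is hypothetical. In the example every changed key already exists in the first output data; the sketch would simply add a changed key that is missing, which is an assumption and not something stated in the text.

```python
def replace_changed_output(first_output, changed_output):
    """Steps S401/S402 (sketch): overwrite entries whose key appears in the changed output."""
    second_output = dict(first_output)  # copy, so the previous output stays intact
    for key, value in changed_output.items():
        second_output[key] = value      # same key index: replace with the changed value
    return second_output


first_output = {"111.com": 21, "222.com": 20, "333.com": 11, "555.com": 21, "666.com": 22}
changed_output = {"111.com": 8, "222.com": 24, "333.com": 25}

second_output = replace_changed_output(first_output, changed_output)
print(second_output)
# {'111.com': 8, '222.com': 24, '333.com': 25, '555.com': 21, '666.com': 22}
```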
In addition, for the first run (that is, when there is no previous processing), all data relating to the previous processing, such as the first input data, the first mapping data set and the first output data, is treated as empty. That is, during this run, in step S101 the entire second input data is newly added data relative to the empty first input data, and the deleted data is empty. In step S102, the newly added data key index set, the deleted data key index set, the newly added mapping data set and the second mapping data set are obtained, where the deleted data key index set is empty and the second mapping data set is the newly added mapping data set. Steps S103 to S105 follow by analogy and are not repeated here. The second output data finally obtained by this run is therefore the changed output data obtained by performing the second processing on the newly added mapping data.
Fig. 5 is a block diagram of a data processing device according to another aspect of the application. As shown in Fig. 5, the device can include a comparison module 510, a first processing module 520, an intermediate processing module 530, a second processing module 540 and an output data acquisition module 550.
The comparison module 510 can be used to compare the first input data of the previous processing with the second input data to be processed this time to obtain changed data, the changed data including the deleted data and the newly added data of the second input data relative to the first input data.
The first processing module 520 can be used to perform the first processing on the deleted data and the newly added data to obtain the deleted data key index set and the newly added data key index set, as well as the deleted mapping data set and the newly added mapping data set that correspond to the deleted data key index set and the newly added data key index set respectively.
The intermediate processing module 530 can be used to delete, from the first mapping data set indexed by key, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and to add the newly added mapping data in the newly added mapping data set to the first mapping data set, to form the second mapping data set corresponding to the second input data, wherein the first mapping data set corresponds to the first input data and to the first output data indexed by key.
The second processing module 540 is used to perform the second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain the changed output data indexed by key.
The output data acquisition module 550 is used to replace the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain the second output data, indexed by key, of the current processing.
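The work of the intermediate processing module can be sketched in Python as follows. This is a minimal illustration under assumptions: mapping data sets are modeled as dicts from key to a list of (value, record mark) pairs, the function name `update_mapping` is hypothetical, the deleted entry (14, 2) is invented for illustration, and dropping a key whose subset becomes empty is a design choice of this sketch rather than something the text specifies.

```python
def update_mapping(first_mapping, deleted_mapping, added_mapping):
    """Form the second mapping data set from the first one (sketch)."""
    second_mapping = {key: list(entries) for key, entries in first_mapping.items()}
    # Remove the mapping data that corresponds to the deleted mapping data.
    for key, entries in deleted_mapping.items():
        remaining = [e for e in second_mapping.get(key, []) if e not in entries]
        if remaining:
            second_mapping[key] = remaining
        else:
            second_mapping.pop(key, None)  # assumption: an empty subset is dropped
    # Append the newly added mapping data under its key.
    for key, entries in added_mapping.items():
        second_mapping.setdefault(key, []).extend(entries)
    return second_mapping


# Hypothetical first mapping data set; (14, 2) stands in for a record that was later deleted.
first_mapping = {"222.com": [(14, 2), (6, 8)], "555.com": [(11, 7), (10, 9)]}
deleted_mapping = {"222.com": [(14, 2)]}
added_mapping = {"222.com": [(9, 13), (9, 14)]}

second_mapping = update_mapping(first_mapping, deleted_mapping, added_mapping)
print(second_mapping["222.com"])  # [(6, 8), (9, 13), (9, 14)]
```

Keeping the record mark next to each value is what lets the sketch remove exactly the entries that came from deleted input records.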
The newly added mapping data set, the first mapping data set and the second mapping data set each include at least one data subset, wherein data corresponding to the same key is in one data subset indexed by that key.
The second processing module 540 can include a determination submodule 541 and a processing submodule 542.
The determination submodule 541 is used to determine, in the second mapping data set, the changed data subsets whose key indexes are the same as those in the deleted data key index set and the newly added data key index set.
The processing submodule 542 can be used to perform the second processing on the changed data subsets to obtain the changed output data indexed by key.
The output data acquisition module 550 can include a lookup submodule 551 and a replacement submodule 552.
The lookup submodule 551 can be used to search the first output data for the to-be-replaced output data having the same keys as the changed output data.
The replacement submodule 552 can be used to replace the to-be-replaced output data with the changed output data, and to use the first output data after replacement as the second output data of the current processing.
The deleted data is data that appears in the first input data but does not appear in the second input data; the newly added data is data that does not appear in the first input data but appears in the second input data.
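These two definitions amount to set differences, which can be sketched in Python as follows. This is a minimal illustration assuming that each input record is unique and hashable; the (key, value, line number) tuples and the name `compare_inputs` are hypothetical and chosen only for the example.

```python
def compare_inputs(first_input, second_input):
    """Return (deleted data, newly added data) of the second input relative to the first."""
    deleted = first_input - second_input   # in the first input, not in the second
    added = second_input - first_input     # in the second input, not in the first
    return deleted, added


# Hypothetical records: (key, value, line number).
first_input = {("111.com", 13, 1), ("111.com", 8, 4), ("555.com", 11, 7)}
second_input = {("111.com", 8, 4), ("555.com", 11, 7), ("333.com", 6, 15)}

deleted, added = compare_inputs(first_input, second_input)
print(deleted)  # {('111.com', 13, 1)}
print(added)    # {('333.com', 6, 15)}
```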
The first processing can include: extracting key-value pairs from the data to be processed to obtain a key index set indexed by key, and forming a mapping data set indexed by key. The first processing also includes: extracting the record mark of the data to be processed, the record mark including a file path and a line number.
The second processing can include: processing the data to be processed according to a predetermined rule to obtain output data indexed by key.
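The first processing can be sketched in Python as follows. This is a minimal illustration under assumptions: each input line is taken to hold a key and an integer value separated by whitespace, the record mark is modeled as a (file path, line number) tuple, and the function name `first_processing` and the path `input/part-0.txt` are hypothetical.

```python
from collections import defaultdict


def first_processing(path, lines):
    """Extract key-value pairs and record marks; return (key index set, mapping data set)."""
    key_index = set()
    mapping = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        key, value = line.split()            # assumed line format: "<key> <value>"
        key_index.add(key)
        record_mark = (path, line_no)        # record mark: file path and line number
        mapping[key].append((int(value), record_mark))
    return key_index, dict(mapping)


lines = ["111.com 8", "222.com 6", "222.com 9"]
key_index, mapping = first_processing("input/part-0.txt", lines)
print(sorted(key_index))   # ['111.com', '222.com']
print(mapping["222.com"])  # [(6, ('input/part-0.txt', 2)), (9, ('input/part-0.txt', 3))]
```

Data with the same key ends up in one subset indexed by that key, and the record mark distinguishes entries that carry the same value.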
Because the functions realized by the device of this embodiment essentially correspond to the method embodiments shown in Fig. 1 to Fig. 4 above, for any part not detailed in the description of this embodiment, refer to the related description in the foregoing embodiments, which is not repeated here.
The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules or units. Generally, program modules or units can include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. In general, program modules or units can be implemented by software, hardware or a combination of both. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules or units can be located in both local and remote computer storage media, including storage devices.
Finally, it should also be noted that the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.
Specific examples are used herein to explain the principles and embodiments of the application; the description of the above embodiments is only intended to help in understanding the method of the application and its main idea. At the same time, for those of ordinary skill in the art, the specific embodiments and the scope of application may change in accordance with the idea of the application; any modification, equivalent substitution or improvement made within the spirit and principles of the application shall be included within the scope of its claims. In summary, the content of this specification should not be construed as limiting the application.
Those skilled in the art should understand that embodiments of the application can be provided as a method, a system or a computer program product. Therefore, the application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.
In a typical configuration, a computing device includes one or more processors (CPU), an input/output interface, a network interface and memory. The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The foregoing is only embodiments of the application and is not intended to limit the application. For those skilled in the art, the application can have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the application shall be included within the scope of the claims of the application.
Claims (12)
1. A data processing method, characterized by comprising:
comparing first input data of a previous processing with second input data to be processed this time to obtain changed data, the changed data comprising deleted data and newly added data of the second input data relative to the first input data;
performing first processing on the deleted data and the newly added data to obtain a deleted data key index set and a newly added data key index set, as well as a deleted mapping data set and a newly added mapping data set, both indexed by key, that correspond to the deleted data key index set and the newly added data key index set respectively;
deleting, from a first mapping data set indexed by key, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and adding the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a second mapping data set indexed by key that corresponds to the second input data, wherein the first mapping data set corresponds to the first input data and to first output data indexed by key;
performing second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain changed output data indexed by key; and
replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain second output data, indexed by key, of the current processing.
2. The method according to claim 1, characterized in that the newly added mapping data set, the first mapping data set and the second mapping data set each comprise at least one data subset, wherein data corresponding to the same key is in one data subset indexed by that key.
3. The method according to claim 2, characterized in that the step of performing second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain changed output data indexed by key, comprises:
determining, in the second mapping data set, the changed data subsets whose key indexes are the same as those in the deleted data key index set and the newly added data key index set;
performing second processing on the changed data subsets to obtain the changed output data indexed by key.
4. The method according to claim 2, characterized in that the step of replacing the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain second output data, indexed by key, of the current processing, comprises:
searching the first output data for the to-be-replaced output data having the same keys as the changed output data;
replacing the to-be-replaced output data with the changed output data, and using the first output data after replacement as the second output data of the current processing.
5. The method according to claim 1, characterized in that
the deleted data is data that appears in the first input data but does not appear in the second input data;
the newly added data is data that does not appear in the first input data but appears in the second input data.
6. The method according to claim 1, characterized in that
the first processing comprises: extracting key-value pairs from the data to be processed to obtain a key index set indexed by key, and forming a mapping data set indexed by key; and wherein the first processing further comprises: extracting the record mark of the data to be processed, the record mark comprising a file path and a line number;
the second processing comprises: processing the data to be processed according to a predetermined rule to obtain output data indexed by key.
7. A data processing device, characterized by comprising:
a comparison module, configured to compare first input data of a previous processing with second input data to be processed this time to obtain changed data, the changed data comprising deleted data and newly added data of the second input data relative to the first input data;
a first processing module, configured to perform first processing on the deleted data and the newly added data to obtain a deleted data key index set and a newly added data key index set, as well as a deleted mapping data set and a newly added mapping data set, both indexed by key, that correspond to the deleted data key index set and the newly added data key index set respectively;
an intermediate processing module, configured to delete, from a first mapping data set indexed by key, the mapping data corresponding to the deleted mapping data in the deleted mapping data set, and to add the newly added mapping data in the newly added mapping data set to the first mapping data set, to form a second mapping data set indexed by key that corresponds to the second input data, wherein the first mapping data set corresponds to the first input data and to first output data indexed by key;
a second processing module, configured to perform second processing on the mapping data in the second mapping data set that corresponds to the deleted data key index set and the newly added data key index set, to obtain changed output data indexed by key; and
an output data acquisition module, configured to replace the to-be-replaced output data in the first output data that corresponds to the key indexes of the changed output data with the changed output data, to obtain second output data, indexed by key, of the current processing.
8. The device according to claim 7, characterized in that the newly added mapping data set, the first mapping data set and the second mapping data set each comprise at least one data subset, wherein data corresponding to the same key is in one data subset indexed by that key.
9. The device according to claim 8, characterized in that the second processing module comprises:
a determination submodule, configured to determine, in the second mapping data set, the changed data subsets whose key indexes are the same as those in the deleted data key index set and the newly added data key index set;
a processing submodule, configured to perform second processing on the changed data subsets to obtain the changed output data indexed by key.
10. The device according to claim 8, characterized in that the output data acquisition module comprises:
a lookup submodule, configured to search the first output data for the to-be-replaced output data having the same keys as the changed output data;
a replacement submodule, configured to replace the to-be-replaced output data with the changed output data, and to use the first output data after replacement as the second output data of the current processing.
11. The device according to claim 7, characterized in that
the deleted data is data that appears in the first input data but does not appear in the second input data;
the newly added data is data that does not appear in the first input data but appears in the second input data.
12. The device according to claim 7, characterized in that
the first processing comprises: extracting key-value pairs from the data to be processed to obtain a key index set indexed by key, and forming a mapping data set indexed by key; and wherein the first processing further comprises: extracting the record mark of the data to be processed, the record mark comprising a file path and a line number;
the second processing comprises: processing the data to be processed according to a predetermined rule to obtain output data indexed by key.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310268334.2A CN104252486B (en) | 2013-06-28 | 2013-06-28 | A kind of method and device of data processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310268334.2A CN104252486B (en) | 2013-06-28 | 2013-06-28 | A kind of method and device of data processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104252486A CN104252486A (en) | 2014-12-31 |
CN104252486B true CN104252486B (en) | 2017-09-12 |
Family
ID=52187388
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310268334.2A Active CN104252486B (en) | 2013-06-28 | 2013-06-28 | A kind of method and device of data processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104252486B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108399151B (en) * | 2017-02-06 | 2022-02-15 | 百度在线网络技术(北京)有限公司 | Data comparison system and method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102129458A (en) * | 2011-03-09 | 2011-07-20 | 胡劲松 | Method and device for storing relational database |
CN102402559A (en) * | 2010-09-16 | 2012-04-04 | 中兴通讯股份有限公司 | Database upgrade script generating method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040049767A1 (en) * | 2002-09-05 | 2004-03-11 | International Business Machines Corporation | Method and apparatus for comparing computer code listings |
TW201007557A (en) * | 2008-08-06 | 2010-02-16 | Inventec Corp | Method for reading/writing data in a multithread system |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402559A (en) * | 2010-09-16 | 2012-04-04 | 中兴通讯股份有限公司 | Database upgrade script generating method and device |
CN102129458A (en) * | 2011-03-09 | 2011-07-20 | 胡劲松 | Method and device for storing relational database |
Non-Patent Citations (4)
Title |
---|
A comparison of data warehousing methodologies;Arun Sen等;《Communications of The ACM》;20050331;第48卷(第3期);第79-84页 * |
Beyond Constant Comparison Qualitative Data Analysis: Using NVivo;Nancy L Leech等;《School Psychology Quarterly》;20110331;第26卷(第1期);第70-84页 * |
基于Hibernate的数据整合系统的研究与开发;王兰香;《中国优秀硕士学位论文全文数据库 信息科技辑》;20070715(第01期);第I138-135页 * |
面向服务的Web异构数据集成存取体系结构研究;巫丹丹;《中国优秀硕士学位论文全文数据库 信息科技辑》;20070715(第01期);第I138-48页 * |
Also Published As
Publication number | Publication date |
---|---|
CN104252486A (en) | 2014-12-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230126005A1 (en) | Consistent filtering of machine learning data | |
US20220335338A1 (en) | Feature processing tradeoff management | |
US11100420B2 (en) | Input processing for machine learning | |
US10452691B2 (en) | Method and apparatus for generating search results using inverted index | |
US10366053B1 (en) | Consistent randomized record-level splitting of machine learning data | |
US10318882B2 (en) | Optimized training of linear machine learning models | |
US9846702B2 (en) | Indexing of file in a hadoop cluster | |
CN111258978B (en) | Data storage method | |
CN110321329A (en) | Data processing method and device based on big data | |
CN105786808A (en) | Method and apparatus for executing relation type calculating instruction in distributed way | |
US20160171047A1 (en) | Dynamic creation and configuration of partitioned index through analytics based on existing data population | |
CN110674360B (en) | Tracing method and system for data | |
CN105677904B (en) | Small documents storage method and device based on distributed file system | |
CN108062384A (en) | The method and apparatus of data retrieval | |
CN104778182A (en) | Data import method and system based on HBase (Hadoop Database) | |
CN110019298A (en) | Data processing method and device | |
CN108874379A (en) | The processing method and processing device of the page | |
CN112364185B (en) | Method and device for determining characteristics of multimedia resources, electronic equipment and storage medium | |
CN103530369A (en) | De-weight method and system | |
CN109582476A (en) | Data processing method, apparatus and system | |
CN109947702A (en) | Index structuring method and device, electronic equipment | |
CN104252486B (en) | A kind of method and device of data processing | |
US11093566B2 (en) | Router based query results | |
CN103995831A (en) | Object processing method, system and device based on similarity among objects | |
CN106294700A (en) | The storage of a kind of daily record and read method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |