Detailed Description of the Embodiments
To achieve the purpose of the present application, the embodiments of the present application provide a data processing method and apparatus. In a distributed system, when a master processing node receives data information sent by an auxiliary processing node, for two data sets whose index information differs, if the two data sets contain identical to-be-processed data, the master processing node can aggregate the two data sets. This effectively increases the speed of data aggregation and thereby improves the efficiency of processing massive data.
The technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present application. The method may be as follows.
Step 101: A first processing node receives data information sent by at least one second processing node.
The data information contains a first data set and index information corresponding to the first data set, and the index information is used to characterize a common feature vector of the to-be-processed data contained in the first data set.
In step 101, the terms "first" and "second" in "first processing node" and "second processing node" are used only to distinguish different processing nodes and have no other special meaning. A processing node described in the embodiments of the present application may be any processing node in a distributed system (for example, a processor or a server) or a processing node in a non-distributed system; such a processing node may also be referred to as a computing node.
Different processing nodes classify locally stored to-be-processed data to form data sets of different types. Each data set corresponds to one piece of index information.
The to-be-processed data described in the embodiments of the present application may be text data, picture data, or image data, which is not specifically limited here.
The manner of determining a data set is described below, taking one processing node as an example.
Fig. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application. Fig. 2 shows how a processing node determines data sets (in other words, how it classifies locally stored to-be-processed data). The method may be as follows.
S201: The first processing node obtains at least two pieces of to-be-processed data and determines a feature set of each piece of to-be-processed data.
The feature set contains at least one feature vector.
In S201, the first processing node obtains multiple pieces of to-be-processed data (referred to here as at least two pieces of to-be-processed data) from a local database or another third-party database, and classifies the obtained to-be-processed data.
Specifically, because each piece of to-be-processed data has feature values that distinguish it from other to-be-processed data, the processing node extracts the feature vectors of each piece of to-be-processed data that needs processing, forming a feature vector set for that piece of data.
Specifically, the feature vectors extracted differ for different kinds of to-be-processed data. For example, if the to-be-processed data is text data, the keywords of each piece of text data are determined, each extracted keyword is used as a feature vector of the text data, and the keyword set formed by the extracted keywords is used as the feature vector set of the text data.
If the to-be-processed data is picture data, feature values such as points, lines, contours, and pixels of each piece of picture data may be extracted by a picture feature extraction method. Each extracted feature value is used as a feature vector of the picture data, and the feature value set formed by the extracted feature values is used as the feature vector set of the picture data.
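As a rough illustration of the text-data case described above, the following Python sketch builds a keyword-based feature vector set. The function name `keyword_feature_set`, the stop-word list, and the frequency-based ranking are assumptions made for illustration only; the application does not specify how keywords are determined.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not part of the application).
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def keyword_feature_set(text, top_k=3):
    """Return the top_k most frequent non-stop-words as the feature vector set."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(top_k)]

features = keyword_feature_set(
    "Distributed data processing of distributed data sets", top_k=2
)
```

Any other keyword extraction technique could be substituted; only the resulting set of keywords matters for the classification that follows.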
For example, assume that the first processing node obtains 5 pieces of picture data, and that the feature vector set of each piece of picture data is to be determined.
Specifically, the following operation is performed for each piece of picture data until the feature vector set of every piece of picture data has been obtained:
Select one piece of picture data, extract its feature values using a picture feature extraction method, and use the feature set formed by the extracted feature values as the feature vector set of this piece of picture data.
Assume that the 5 pieces of picture data obtained by the first processing node are P0, P1, P2, P3, and P4, and that the feature vector sets obtained in the above manner are: {A1, A2, A3} for P0; {A2, A3, A4} for P1; {A3, A4, A5} for P2; {A5, A6, A7} for P3; and {A7, A8, A9} for P4.
It should be noted that the methods by which a processing node extracts the feature vector set of to-be-processed data may also include the scale-invariant feature transform (SIFT) method, the simhash algorithm, and the like; the method for extracting the feature vector set of to-be-processed data is not specifically limited here.
S202: The first processing node, using the feature vector as the division granularity, groups at least one piece of to-be-processed data containing the same feature vector into one data set.
In S202, the first processing node determines, according to the feature vectors contained in the feature set of each piece of to-be-processed data, the feature vectors contained in the obtained at least two pieces of to-be-processed data.
The first processing node selects one of the feature vectors and, from the obtained at least two pieces of to-be-processed data, finds the to-be-processed data containing the selected feature vector.
The first processing node combines the found to-be-processed data into one data set.
For example, the 5 pieces of picture data obtained by the first processing node are P0, P1, P2, P3, and P4, and the feature vector sets obtained in the above manner are: {A1, A2, A3} for P0; {A2, A3, A4} for P1; {A3, A4, A5} for P2; {A5, A6, A7} for P3; and {A7, A8, A9} for P4.
After the feature vector set of each piece of picture data is obtained, the feature vectors contained in the 5 pieces of picture data can be derived from these feature vector sets: A1, A2, A3, A4, A5, A6, A7, A8, and A9.
One feature vector is selected from the feature vectors contained in the 5 pieces of picture data, and the picture data containing the selected feature vector is determined. If the selected feature vector is A1, the picture data containing A1 is P0; if it is A2, the picture data containing A2 is P0 and P1; if it is A3, the picture data containing A3 is P0, P1, and P2; if it is A4, the picture data containing A4 is P1 and P2; if it is A5, the picture data containing A5 is P2 and P3; if it is A6, the picture data containing A6 is P3; if it is A7, the picture data containing A7 is P3 and P4; if it is A8, the picture data containing A8 is P4; and if it is A9, the picture data containing A9 is P4.
In this way, 9 data sets are obtained: A1 corresponds to {P0}; A2 corresponds to {P0, P1}; A3 corresponds to {P0, P1, P2}; A4 corresponds to {P1, P2}; A5 corresponds to {P2, P3}; A6 corresponds to {P3}; A7 corresponds to {P3, P4}; A8 corresponds to {P4}; and A9 corresponds to {P4}.
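The division in S202 can be sketched in Python as follows, using the P0 to P4 example above. The function name `divide_by_feature` and the dictionary representation are illustrative assumptions, not the application's implementation.

```python
from collections import defaultdict

# Feature vector sets from the example above (picture -> feature vectors).
feature_sets = {
    "P0": ["A1", "A2", "A3"],
    "P1": ["A2", "A3", "A4"],
    "P2": ["A3", "A4", "A5"],
    "P3": ["A5", "A6", "A7"],
    "P4": ["A7", "A8", "A9"],
}

def divide_by_feature(feature_sets):
    """Group data items by feature vector: one data set per feature vector."""
    data_sets = defaultdict(list)
    for item, features in feature_sets.items():
        for feature in features:
            # Items are appended in the order in which they are found.
            data_sets[feature].append(item)
    return dict(data_sets)

data_sets = divide_by_feature(feature_sets)
```

Each key of `data_sets` plays the role of the feature vector used as the division granularity, and each value is the data set obtained for it.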
Optionally, when obtaining a data set, the first processing node determines the index information of the data set.
The index information is used to characterize the common feature vector of the to-be-processed data contained in the data set.
Still taking the above example, when the 9 data sets are obtained, index information is determined for each data set. For example: the index information of {P0} is A1; the index information of {P0, P1} is A2; the index information of {P0, P1, P2} is A3; the index information of {P1, P2} is A4; the index information of {P2, P3} is A5; the index information of {P3} is A6; the index information of {P3, P4} is A7; the index information of one data set {P4} is A8; and the index information of the other data set {P4} is A9.
It should be noted that the index information described in the embodiments of the present application may take the form of an inverted index, in which the index information contains an attribute value and the address information of each piece of data contained in the data set.
Optionally, the first processing node combining the found to-be-processed data into one data set includes:
The first processing node generates a tree according to the order in which the to-be-processed data was found, and treats the tree as one data set.
Each node of the tree corresponds to one piece of to-be-processed data.
In the embodiments of the present application, when the first processing node, using the feature vector as the division granularity, groups at least one piece of to-be-processed data containing the same feature vector into one data set, it may store the resulting data set in the form of a tree. That is, the feature vector is the root node of the tree, and the leaf nodes under the root node are generated in sequence, in the order in which the to-be-processed data containing that feature vector was found.
Still taking the above example, when the 9 data sets are obtained, each data set is stored in the form of a tree. The structure of the tree is illustrated using {P0, P1, P2} as an example. Fig. 3 is a schematic structural diagram of the data set corresponding to feature vector A3.
As can be seen from Fig. 3, the root node of the tree is the index information A3 of the data set, and P0, P1, and P2 are leaf nodes of the root node.
It should be noted that Fig. 3 shows only one possible tree structure. The relationship among the leaf nodes P0, P1, and P2 may be obtained in a predefined manner; for example, each root node contains two leaf nodes, and the leaf node found first is placed on the left side of the root node. The tree may also be generated according to actual needs; the manner of generating the tree is not specifically limited here.
In this way, after the first processing node classifies the to-be-processed data stored locally or obtained elsewhere, it obtains the data set corresponding to each feature vector, that is, a "forest" of the to-be-processed data.
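One possible way to generate such a tree is sketched below in Python. It assumes the predefined manner mentioned above (at most two children per node, with the item found first placed on the left) and fills the tree level by level; the `Node` class and `build_tree` function are hypothetical names, and other generation rules are equally permissible.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def build_tree(index_info, items, max_children=2):
    """Build a tree rooted at the index information, in the order items were found."""
    root = Node(index_info)
    open_nodes = deque([root])  # nodes that may still accept children
    for item in items:
        # Skip nodes that already hold the maximum number of children.
        while len(open_nodes[0].children) >= max_children:
            open_nodes.popleft()
        child = Node(item)
        open_nodes[0].children.append(child)  # earlier items end up further left
        open_nodes.append(child)
    return root

tree = build_tree("A3", ["P0", "P1", "P2"])
```

Under this assumed rule, the tree for A3 has P0 and P1 directly under the root and P2 under P0. Passing `max_children=3` instead would place P0, P1, and P2 all directly under the root.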
Optionally, the method further includes:
The first processing node stores the obtained data sets in the data set library of the first processing node.
The data set library here may be a data set library local to the first processing node, or a data set library that corresponds to the first processing node but is not local to it; this is not specifically limited here.
It should be noted that, when the first processing node stores the obtained data sets in the data set library, the index information of the data sets may be consolidated and stored in an index list. For each piece of index information in the index list, the corresponding data set can then be determined, and thus the to-be-processed data corresponding to that index information can be obtained.
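A minimal Python sketch of a data set library with an index list is given below. The class `DataSetLibrary` and its method names are hypothetical; the application does not prescribe a storage layout.

```python
class DataSetLibrary:
    """Hypothetical data set library: maps index information to a data set."""

    def __init__(self):
        self._sets = {}

    def store(self, index_info, items):
        """Store one data set under its index information."""
        self._sets[index_info] = list(items)

    def index_list(self):
        """Return the consolidated list of index information."""
        return list(self._sets)

    def lookup(self, index_info):
        """Return the data set for the index information, or None if absent."""
        return self._sets.get(index_info)

library = DataSetLibrary()
library.store("A2", ["P0", "P1"])
library.store("A3", ["P0", "P1", "P2"])
```

Looking up an entry of `library.index_list()` with `library.lookup(...)` yields the to-be-processed data grouped under that index information.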
The data set contained in the data information sent by the second processing node may also be obtained in the manner described above, which is not described in detail again here.
In a distributed system, after each processing node classifies the to-be-processed data at a specified location (for example, a local database or another third-party database), the data sets obtained by the processing nodes also need to be integrated so that massive data can be processed effectively. Therefore, any processing node in the distributed system may receive data information sent by other processing nodes.
Step 102: The first processing node searches the data set library of the first processing node, according to the index information contained in the data information, for a data set whose index information is identical to that index information. If no such data set is found, step 103 is performed; otherwise, step 106 is performed.
Step 103: The first processing node finds at least one second data set whose index information differs from the index information.
In step 103, the first processing node traverses the index information of each data set in the data set library and compares each traversed piece of index information with the index information contained in the received data information to determine whether the two are identical. The data sets whose index information differs from the index information contained in the received data information are taken as second data sets.
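The branching in steps 102 and 103 can be sketched as follows. The function `route_by_index` and the plain-dictionary library are assumptions made for illustration.

```python
def route_by_index(library, index_info):
    """Decide which branch applies to a received data set (step 102).

    Returns ("merge", third_set) when the library already holds a data set
    with the same index information; otherwise returns
    ("aggregate", second_sets), where second_sets maps each differing index
    to its data set (step 103).
    """
    if index_info in library:
        return "merge", library[index_info]
    # Traverse the library and keep every data set whose index differs.
    second_sets = {
        idx: items for idx, items in library.items() if idx != index_info
    }
    return "aggregate", second_sets

library = {"A1": ["P0"], "A5": ["P2", "P3"]}
same = route_by_index(library, "A5")
different = route_by_index(library, "A4")
```

Here `same` selects the merge branch (steps 106 and 107), while `different` selects the aggregation branch and carries the candidate second data sets.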
It should be noted that "first" and "second" in "first data set" and "second data set" in the embodiments of the present application have no special meaning and are used only to distinguish different data sets.
Step 104: The first processing node determines whether the first data set and the at least one second data set contain identical to-be-processed data. If so, step 105 is performed; if not, the current round of operation is abandoned.
In step 104, the first processing node compares the to-be-processed data contained in the first data set with the to-be-processed data contained in each second data set, to determine whether the two contain identical to-be-processed data.
Step 105: When determining that the first data set and the at least one second data set contain at least one piece of identical to-be-processed data, the first processing node aggregates the first data set and the at least one second data set into one data set.
In step 105, when the first processing node determines that the first data set and the at least one second data set contain at least one piece of identical to-be-processed data, this indicates that the first data set and the at least one second data set have the same or similar properties, so an aggregation operation can be performed on the first data set and the at least one second data set to obtain one data set.
Optionally, the data information further contains the position information, within the first data set, of each piece of to-be-processed data in the first data set. When performing the aggregation operation on the first data set and the at least one second data set, the first processing node determines the position information, within the first data set, of the at least one piece of identical to-be-processed data contained in both the first data set and the at least one second data set, and aggregates the first data set and the at least one second data set into one data set according to the determined position information.
Assume that the data set contained in the data information that the first processing node receives from the second processing node is {P0, P5, P6}, with index information A5. The data sets found in the data set library that have different index information and contain identical data are: {P0}, with index information A1; {P0, P1}, with index information A2; and {P0, P1, P2}, with index information A3.
That is, {P0, P5, P6} (index information A5) is aggregated with {P0} (index information A1), with {P0, P1} (index information A2), and with {P0, P1, P2} (index information A3).
It should be noted that the position information used for the aggregation operation is the position information corresponding to P0.
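The check in step 104 and the position lookup used in step 105 can be sketched in Python with the example above. Representing position information as a list offset within the first data set is an assumption, as is the function name `find_shared`.

```python
def find_shared(first_set, second_sets):
    """For each second data set, map shared items to their position in first_set."""
    positions = {item: pos for pos, item in enumerate(first_set)}
    shared = {}
    for index_info, items in second_sets.items():
        common = {item: positions[item] for item in items if item in positions}
        if common:  # keep only second data sets that overlap the first data set
            shared[index_info] = common
    return shared

library = {
    "A1": ["P0"], "A2": ["P0", "P1"], "A3": ["P0", "P1", "P2"],
    "A4": ["P1", "P2"], "A5": ["P2", "P3"], "A6": ["P3"],
    "A7": ["P3", "P4"], "A8": ["P4"], "A9": ["P4"],
}
first_index, first_set = "A5", ["P0", "P5", "P6"]
second_sets = {idx: s for idx, s in library.items() if idx != first_index}
shared = find_shared(first_set, second_sets)
```

With these inputs, only the data sets indexed by A1, A2, and A3 overlap the received data set, and in each case the shared item is P0 at the first position.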
Optionally, if the first data set corresponds to a first tree and the at least one second data set corresponds to a second tree, the first processing node aggregates the first tree and the second tree into one distributed tree according to the determined position information.
The following description takes the aggregation of {P0, P5, P6} (index information A5) with {P0, P1, P2} (index information A3) as an example.
Fig. 4 is a schematic diagram of performing the aggregation operation.
As can be seen from Fig. 4, tree structure 1, corresponding to {P0, P5, P6} with index information A5, has A5 as the root node, P0 and P5 as leaf nodes of the root node, and P6 as a leaf node of P0. Tree structure 2, corresponding to {P0, P1, P2} with index information A3, has A3 as the root node, P0 and P1 as leaf nodes of the root node, and P2 as a leaf node of P0. Because the position information used for the aggregation operation is the position information corresponding to P0, tree structure 1 and tree structure 2 are aggregated at node P0 to form a distributed tree.
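Because Fig. 4 is not reproduced here, the exact shape of the distributed tree is not known. The Python sketch below shows one possible reading, in which the two trees are joined at the shared node P0 so that P0 keeps the children it has in both trees and both roots point to it. The adjacency-list representation and the function names are assumptions.

```python
def tree_nodes(tree):
    """All labels that appear in an adjacency-list tree (parent -> children)."""
    return set(tree) | {c for children in tree.values() for c in children}

def aggregate_at(tree1, tree2, shared):
    """Join two trees at a node that both contain, keeping both roots."""
    if shared not in tree_nodes(tree1) or shared not in tree_nodes(tree2):
        raise ValueError(f"{shared!r} must appear in both trees")
    joined = {parent: list(children) for parent, children in tree1.items()}
    for parent, children in tree2.items():
        bucket = joined.setdefault(parent, [])
        bucket.extend([c for c in children if c not in bucket])
    return joined

tree1 = {"A5": ["P0", "P5"], "P0": ["P6"]}  # tree structure 1
tree2 = {"A3": ["P0", "P1"], "P0": ["P2"]}  # tree structure 2
distributed_tree = aggregate_at(tree1, tree2, "P0")
```

In the result, P0 is reachable from both A5 and A3 and has P6 and P2 as children; a different joining rule would be equally consistent with the text.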
Step 106: The first processing node finds at least one third data set whose index information is identical to the index information.
Step 107: The first processing node merges the first data set with the at least one third data set to generate the tree corresponding to the index information.
In step 107, when the first processing node determines that the index information of the first data set is identical to the index information of the at least one third data set, this indicates that the first data set and the at least one third data set share the same feature, so a merge operation can be performed on the first data set and the at least one third data set to obtain one data set.
Assume that the data set contained in the data information that the first processing node receives from the second processing node is {P0, P5, P6}, with index information A5. The data set found in the data set library for index information A5 is {P2, P3}. A merge operation is performed on {P0, P5, P6} and {P2, P3} to obtain one data set {P0, P5, P6, P2, P3}.
Fig. 5 is a schematic diagram of performing the merge operation.
As can be seen from Fig. 5, tree structure 1, corresponding to {P0, P5, P6} with index information A5, has A5 as the root node, P0 and P5 as leaf nodes of the root node, and P6 as a leaf node of P0. Tree structure 2, corresponding to {P2, P3} with index information A5, has A5 as the root node and P2 and P3 as leaf nodes of the root node. Because a merge operation is performed, two leaf nodes, P2 and P3, are generated at root node A5, forming a distributed tree.
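The merge operation of step 107 can be sketched in Python as follows. The function names and the adjacency-list tree representation are illustrative assumptions; the sketch simply appends the children of the second root to the shared root A5.

```python
def merge_same_index(first_set, third_set):
    """Merge two data sets that share the same index information."""
    merged = list(first_set)
    merged.extend([item for item in third_set if item not in merged])
    return merged

def merge_trees(tree1, tree2):
    """Merge two adjacency-list trees that share the same root."""
    merged = {parent: list(children) for parent, children in tree1.items()}
    for parent, children in tree2.items():
        bucket = merged.setdefault(parent, [])
        bucket.extend([c for c in children if c not in bucket])
    return merged

merged_set = merge_same_index(["P0", "P5", "P6"], ["P2", "P3"])
merged_tree = merge_trees(
    {"A5": ["P0", "P5"], "P0": ["P6"]},  # tree structure 1
    {"A5": ["P2", "P3"]},                # tree structure 2
)
```

The merged tree keeps P6 under P0 and gains P2 and P3 as additional leaf nodes of root node A5, matching the description of Fig. 5.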
According to the solution of the embodiments of the present application, the first processing node receives data information sent by at least one second processing node, where the data information contains a first data set and index information corresponding to the first data set, and the index information is used to characterize the common feature vector of the to-be-processed data contained in the first data set. According to the index information contained in the data information, the first processing node finds, in its data set library, at least one second data set whose index information differs from that index information. When determining that the first data set and the at least one second data set contain at least one piece of identical to-be-processed data, the first processing node aggregates the first data set and the at least one second data set into one data set. In this way, in a distributed system, when a master processing node receives data information sent by an auxiliary processing node, for two data sets whose index information differs, if the two data sets contain identical to-be-processed data, the master processing node can aggregate the two data sets, which effectively increases the speed of data aggregation and thereby improves the efficiency of processing massive data.
The foregoing describes the data processing method provided by the embodiments of the present application. Based on the same idea, the embodiments of the present application further provide a data processing apparatus. Fig. 6 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application. The data processing apparatus includes a receiving unit 61, a search unit 62, and an aggregation unit 63, where:
The receiving unit 61 is configured to receive data information sent by at least one second processing node, where the data information contains a first data set and index information corresponding to the first data set, and the index information is used to characterize the common feature vector of the to-be-processed data contained in the first data set;
The search unit 62 is configured to find, according to the index information contained in the data information, at least one second data set whose index information differs from that index information in the data set library of the first processing node;
The aggregation unit 63 is configured to aggregate the first data set and the at least one second data set into one data set when determining that the first data set and the at least one second data set contain at least one piece of identical to-be-processed data.
Optionally, the data information further contains the position information, within the first data set, of each piece of to-be-processed data in the first data set;
The aggregation unit 63 is specifically configured to: when determining that the first data set and the at least one second data set contain at least one piece of identical to-be-processed data, determine the position information of the at least one piece of identical to-be-processed data within the first data set;
and aggregate the first data set and the at least one second data set into one data set according to the determined position information.
Specifically, the aggregation unit 63 is configured to: if the first data set corresponds to a first tree and the at least one second data set corresponds to a second tree, aggregate the first tree and the second tree into one distributed tree according to the determined position information.
Optionally, the data processing apparatus further includes a merging unit 64, where:
The merging unit 64 is configured to find, according to the index information contained in the data information, at least one third data set whose index information is identical to that index information in the data set library of the first processing node;
and merge the first data set with the at least one third data set to generate the tree corresponding to the index information.
Optionally, the data processing apparatus further includes a classification unit 65, where:
The classification unit 65 is configured to: before the data information sent by the at least one second processing node is received, obtain at least two pieces of to-be-processed data and determine the feature set of each piece of to-be-processed data, where the feature set contains at least one feature vector; and, using the feature vector as the division granularity, group at least one piece of to-be-processed data containing the same feature vector into one data set.
Specifically, the classification unit 65 is configured to: determine one of the feature vectors; find, from the obtained at least two pieces of to-be-processed data, the to-be-processed data containing the determined feature vector; and combine the found to-be-processed data into one data set.
Optionally, the data processing apparatus further includes a determining unit 66, where:
The determining unit 66 is configured to determine the index information of a data set when the data set is obtained, where the index information is used to characterize the common feature vector of the to-be-processed data contained in the data set.
Specifically, the classification unit 65 combining the found to-be-processed data into one data set includes:
generating a tree according to the order in which the to-be-processed data was found, and treating the tree as one data set, where each node of the tree corresponds to one piece of to-be-processed data.
Optionally, the data processing apparatus further includes a storage unit 67, where:
The storage unit 67 is configured to store the obtained data sets in the data set library of the first processing node.
It should be noted that the data processing apparatus described in the embodiments of the present application may be implemented in hardware or in software; the implementation of the data processing apparatus is not specifically limited here.
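For a software implementation, the units described above could be organized roughly as in the following Python sketch. The class name, the message format (a dictionary with `index` and `data_set` keys), and the flat-list result of aggregation are all assumptions made for illustration; they are not defined by the application.

```python
class DataProcessingApparatus:
    """Hypothetical software counterpart of units 61 to 64 and 67."""

    def __init__(self, library=None):
        # Storage unit 67: index information -> data set.
        self.library = dict(library or {})

    def receive(self, message):
        """Receiving unit 61: unpack the index information and first data set."""
        return message["index"], list(message["data_set"])

    def search(self, index_info):
        """Search unit 62: data sets whose index information differs."""
        return {i: s for i, s in self.library.items() if i != index_info}

    def aggregate(self, first_set, second_sets):
        """Aggregation unit 63: combine second data sets that share data."""
        wanted = set(first_set)
        overlapping = [s for s in second_sets.values() if wanted & set(s)]
        if not overlapping:
            return None  # nothing shared: abandon this round
        result = list(first_set)
        for items in overlapping:
            result.extend([x for x in items if x not in result])
        return result

    def merge(self, index_info, first_set):
        """Merging unit 64: combine with the data set of the same index."""
        third_set = self.library[index_info]
        return list(first_set) + [x for x in third_set if x not in first_set]

    def handle(self, message):
        index_info, first_set = self.receive(message)
        if index_info in self.library:
            return "merged", self.merge(index_info, first_set)
        result = self.aggregate(first_set, self.search(index_info))
        return ("skipped", first_set) if result is None else ("aggregated", result)

message = {"index": "A5", "data_set": ["P0", "P5", "P6"]}
apparatus = DataProcessingApparatus({"A2": ["P0", "P1"], "A3": ["P0", "P1", "P2"]})
outcome = apparatus.handle(message)
```

The `handle` method follows the order of steps 101 to 107: it merges when the index information already exists in the library and otherwise attempts aggregation.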
In this way, in a distributed system, if the data processing apparatus described in the embodiments of the present application is deployed on both the master processing node and the auxiliary processing nodes, then when the master processing node receives data information sent by an auxiliary processing node, for two data sets whose index information differs, if the two data sets contain identical to-be-processed data, the master processing node can aggregate the two data sets, which effectively increases the speed of data aggregation and thereby improves the efficiency of processing massive data.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity, or device that includes the element.
A person skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The foregoing is merely embodiments of the present application and is not intended to limit the present application. For a person skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.