CN106407215A - Data processing method and device - Google Patents

Info

Publication number
CN106407215A
Authority
CN
China
Prior art keywords
data
acquisition system
data acquisition
pending
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510463292.7A
Other languages
Chinese (zh)
Other versions
CN106407215B (en)
Inventor
郭真林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201510463292.7A
Publication of CN106407215A
Application granted
Publication of CN106407215B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method and device that aim to improve data processing efficiency. The data processing method comprises the following steps: a first processing node receives data information sent by at least one second processing node, wherein the data information comprises a first data set and index information corresponding to the first data set, and the index information represents the common feature vector of the to-be-processed data in the first data set; according to the index information in the data information, the first processing node searches its data set library for at least one second data set whose index information differs from that index information; and, when it is determined that the first data set and the at least one second data set comprise at least one identical item of to-be-processed data, the first data set and the at least one second data set are aggregated into one data set.

Description

Data processing method and device
Technical field
The present application relates to the field of Internet information processing technology, and in particular to a data processing method and device.
Background
At present, with the rapid development of network technology, a large number of users generate large amounts of data of different types every day. Accordingly, servers receive the massive data uploaded by users and process it. For example, large amounts of video data, image data, text data, user-related data and the like are uploaded to servers every day, and the servers usually process the data they receive.
When processing data, a server often determines the similarity between every two data items by pairwise comparison and then decides how to process those items. For example, given 100,000 images to be processed, the server compares each feature of every image with each feature of the other images, determines the similarity between every image and the others, and then decides how to process the images according to the similarity.
For massive data, processing by pairwise comparison is inefficient and also imposes a large overhead on the server.
Summary of the invention
The embodiments of the present application provide a data processing method and device for improving data processing efficiency.
An embodiment of the present application provides a data processing method, comprising:
a first processing node receiving data information sent by at least one second processing node, wherein the data information contains a first data set and index information corresponding to the first data set, and the index information characterizes the common feature vector of the to-be-processed data contained in the first data set;
the first processing node searching, according to the index information contained in the data information, the data set library of the first processing node for at least one second data set whose index information differs from that index information;
the first processing node aggregating the first data set and the at least one second data set into one data set upon determining that the first data set and the at least one second data set contain at least one identical item of to-be-processed data.
An embodiment of the present application provides a data processing device, comprising:
a receiving unit, configured to receive data information sent by at least one second processing node, wherein the data information contains a first data set and index information corresponding to the first data set, and the index information characterizes the common feature vector of the to-be-processed data contained in the first data set;
a searching unit, configured to search, according to the index information contained in the data information, the data set library of the first processing node for at least one second data set whose index information differs from that index information;
an aggregation unit, configured to aggregate the first data set and the at least one second data set into one data set upon determining that the first data set and the at least one second data set contain at least one identical item of to-be-processed data.
Beneficial effects:
In the embodiments of the present application, a first processing node receives data information sent by at least one second processing node, the data information containing a first data set and index information corresponding to the first data set, the index information characterizing the common feature vector of the to-be-processed data contained in the first data set; according to the index information contained in the data information, at least one second data set whose index information differs from that index information is found in the data set library of the first processing node; and, upon determining that the first data set and the at least one second data set contain at least one identical item of to-be-processed data, the first data set and the at least one second data set are aggregated into one data set. In this way, in a distributed system, when the main processing node receives data information sent by an auxiliary processing node and two data sets have different index information but contain identical to-be-processed data, the main processing node can aggregate the two data sets. This effectively increases the speed of data aggregation and thereby improves the efficiency of processing massive data.
Brief description of the drawings
The accompanying drawings described here are provided for a further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of the data set corresponding to feature vector A3;
Fig. 4 is a schematic diagram of performing the aggregation operation;
Fig. 5 is a schematic diagram of performing the merge operation;
Fig. 6 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
Detailed description
To achieve the purpose of the present application, the embodiments of the present application provide a data processing method and device. In a distributed system, when the main processing node receives data information sent by an auxiliary processing node and two data sets have different index information but contain identical to-be-processed data, the main processing node can aggregate the two data sets. This effectively increases the speed of data aggregation and thereby improves the efficiency of processing massive data.
The technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
Fig. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present application. The method can be described as follows.
Step 101: A first processing node receives data information sent by at least one second processing node.
The data information contains a first data set and index information corresponding to the first data set, and the index information characterizes the common feature vector of the to-be-processed data contained in the first data set.
In step 101, "first" and "second" in the "first processing node" and "second processing node" described in the embodiments of the present application are used only to distinguish different processing nodes and have no other special meaning. A processing node described in the embodiments of the present application may be any processing node in a distributed system (for example, a processor or a server) or a processing node in a non-distributed system; such a processing node may also be called a compute node.
Different processing nodes classify the to-be-processed data they store locally to form data sets of different types. Each data set corresponds to one item of index information.
The to-be-processed data described in the embodiments of the present application may be text data, picture data, or image data; no specific limitation is imposed here.
How a data set is determined is described below, taking one processing node as an example.
Fig. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application. Fig. 2 shows how a processing node determines its data sets (in other words, how it classifies the to-be-processed data stored locally). The method can be as follows.
S201: The first processing node obtains at least two items of to-be-processed data and determines the feature set of each item.
The feature set contains at least one feature vector.
In S201, the first processing node obtains multiple items of to-be-processed data (referred to here as at least two items) from a local database or another third-party database, and classifies the items it has obtained.
Specifically, because each item of to-be-processed data has feature values that distinguish it from other items, the processing node extracts the feature vectors of each item it needs to process and forms a feature vector set for that item.
The feature vectors extracted differ for different kinds of to-be-processed data. For example, if the to-be-processed data is text data, the keywords of each text are determined; each extracted keyword serves as a feature vector of the text, and the keyword set formed by the extracted keywords serves as the feature vector set of the text.
If the to-be-processed data is picture data, feature values such as the points, lines, contours and pixels of each picture can be extracted by a picture feature extraction method; each extracted feature value serves as a feature vector of the picture, and the feature value set formed by the extracted feature values serves as the feature vector set of the picture.
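For the text case, the idea of treating a keyword set as the feature vector set can be sketched as follows. This is only an illustration: the source does not say how keywords are chosen, so the frequency-based selection, the `STOP_WORDS` list, the `top_k` parameter and the function name `keyword_feature_set` are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def keyword_feature_set(text, top_k=3):
    """Return the top_k most frequent non-stop-words as the feature set.

    Assumption: "keyword" means "most frequent word"; the source leaves
    the keyword selection method open.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return {word for word, _ in counts.most_common(top_k)}

features = keyword_feature_set("the cat and the cat sat in a hat hat hat", top_k=2)
```

Any other keyword extractor could be substituted; the later steps only need each item to end up with a set of feature identifiers.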
For example, suppose the first processing node obtains five pictures and determines the feature vector set of each picture.
Specifically, the following operation is performed for each picture until the feature vector set of every picture has been obtained:
select one of the pictures, extract the feature values of the selected picture using a picture feature extraction method, and take the set formed by the extracted feature values as the feature vector set of that picture.
Suppose the five pictures obtained by the first processing node are P0, P1, P2, P3 and P4, and the feature vector sets obtained in the above manner are: {A1, A2, A3} for P0; {A2, A3, A4} for P1; {A3, A4, A5} for P2; {A5, A6, A7} for P3; and {A7, A8, A9} for P4.
It should be noted that the methods by which a processing node extracts the feature vector set of to-be-processed data may also include the scale-invariant feature transform (SIFT) algorithm, the simhash algorithm, and so on; the method of extracting the feature vector set is not specifically limited here.
S202: Using the feature vector as the granularity of division, the first processing node groups the items of to-be-processed data that contain the same feature vector into one data set.
In S202, the first processing node determines, from the feature vectors contained in the feature set of each item, the feature vectors contained in the at least two items obtained.
The first processing node selects one of the feature vectors and finds, among the at least two items obtained, the items that contain the selected feature vector;
the first processing node combines the items found into one data set.
For example, the five pictures obtained by the first processing node are P0, P1, P2, P3 and P4, and the feature vector sets obtained in the above manner are: {A1, A2, A3} for P0; {A2, A3, A4} for P1; {A3, A4, A5} for P2; {A5, A6, A7} for P3; and {A7, A8, A9} for P4.
Once the feature vector set of each picture has been obtained, the feature vectors contained in the five pictures can be derived from those sets: A1, A2, A3, A4, A5, A6, A7, A8 and A9.
One feature vector is selected from these, and the pictures containing it are determined: if A1 is selected, the picture containing A1 is P0; if A2 is selected, the pictures containing A2 are P0 and P1; if A3 is selected, the pictures containing A3 are P0, P1 and P2; if A4 is selected, the pictures containing A4 are P1 and P2; if A5 is selected, the pictures containing A5 are P2 and P3; if A6 is selected, the picture containing A6 is P3; if A7 is selected, the pictures containing A7 are P3 and P4; if A8 is selected, the picture containing A8 is P4; if A9 is selected, the picture containing A9 is P4.
This yields nine data sets: A1 corresponds to {P0}; A2 to {P0, P1}; A3 to {P0, P1, P2}; A4 to {P1, P2}; A5 to {P2, P3}; A6 to {P3}; A7 to {P3, P4}; A8 to {P4}; A9 to {P4}.
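The grouping in S202 amounts to inverting the mapping from items to feature vectors. A minimal sketch, using the five-picture example from the text, could look like this; the function name `build_data_sets` and the use of plain string identifiers are illustrative assumptions, not part of the source.

```python
from collections import defaultdict

def build_data_sets(feature_sets):
    """Group item ids by feature vector, keeping the order in which items are found."""
    data_sets = defaultdict(list)
    for item_id, features in feature_sets.items():
        for feature in features:
            data_sets[feature].append(item_id)
    return dict(data_sets)

# Feature vector sets of the five pictures in the example.
feature_sets = {
    "P0": ["A1", "A2", "A3"],
    "P1": ["A2", "A3", "A4"],
    "P2": ["A3", "A4", "A5"],
    "P3": ["A5", "A6", "A7"],
    "P4": ["A7", "A8", "A9"],
}

data_sets = build_data_sets(feature_sets)
```

Each key of `data_sets` is a feature vector and each value is the data set of items that contain it, which matches the nine data sets listed in the example.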
Optionally, when the first processing node obtains a data set, it determines the index information of that data set.
The index information characterizes the common feature vector of the to-be-processed data contained in the data set.
Continuing the above example, once the nine data sets have been obtained, index information is determined for each of them. For example: the index information of {P0} is A1; of {P0, P1} is A2; of {P0, P1, P2} is A3; of {P1, P2} is A4; of {P2, P3} is A5; of {P3} is A6; of {P3, P4} is A7; of {P4} is A8; and of {P4} is A9.
It should be noted that the index information described in the embodiments of the present application may take the form of an inverted index: the index information contains an attribute value and the address information of each data item contained in the data set.
Optionally, the first processing node combining the items found into one data set includes:
the first processing node generating a tree according to the order in which the items were found, and treating the tree as one data set.
Each node of the tree corresponds to one item of to-be-processed data.
In the embodiments of the present application, when the first processing node uses the feature vector as the granularity of division and groups the items containing the same feature vector into one data set, it can store the resulting data set as a tree: the feature vector is the root node of the tree, and the leaf nodes under the root node are generated in the order in which the items containing that feature vector are found.
Continuing the above example, once the nine data sets have been obtained, each data set is stored as a tree. The structure of the tree is explained taking {P0, P1, P2} as an example. Fig. 3 is a schematic structural diagram of the data set corresponding to feature vector A3.
As can be seen from Fig. 3, the root node of the tree is the index information A3 of the data set, and P0, P1 and P2 are leaf nodes under the root node.
It should be noted that Fig. 3 shows only one possible tree structure. The relationship among the leaf nodes P0, P1 and P2 can be obtained in a predefined manner; for example, each root node contains two leaf nodes, and the leaf node found first is placed on the left of the root node. The tree can also be generated according to actual needs; the way the tree is generated is not specifically limited here.
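One way to read the predefined rule above (two nodes under each parent, earlier finds placed to the left) is a level-order fill of a binary tree. The sketch below follows that reading; it is an assumption, as are the `TreeNode` class, the `build_tree` function and the `max_children` parameter. The source explicitly leaves the tree generation method open.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    label: str
    children: list = field(default_factory=list)

def build_tree(index, items, max_children=2):
    """Attach items in the order found, filling each node left to right.

    Assumes max_children >= 1. The root carries the index information
    (the feature vector); every other node carries one data item.
    """
    root = TreeNode(index)
    open_nodes = deque([root])  # nodes that can still accept a child
    for item in items:
        parent = open_nodes[0]
        child = TreeNode(item)
        parent.children.append(child)
        open_nodes.append(child)
        if len(parent.children) == max_children:
            open_nodes.popleft()
    return root

tree = build_tree("A3", ["P0", "P1", "P2"])
```

With this rule, P0 and P1 sit directly under A3 and P2 sits under P0, which agrees with the description of the A3 tree given later for Fig. 4.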
In this way, after the first processing node classifies the to-be-processed data that is stored locally or has been obtained, it obtains the data set corresponding to each feature vector, that is, a "forest" of to-be-processed data.
Optionally, the method further includes:
the first processing node storing the obtained data sets in the data set library of the first processing node.
The data set library here may be the local data set library of the first processing node, or a data set library that corresponds to the first processing node but is not local to it; no specific limitation is imposed here.
It should be noted that, when the first processing node stores the obtained data sets in the data set library, the index information of the data sets can be consolidated and stored in an index list. For each item of index information in the index list, the corresponding data set can then be determined, and from it the to-be-processed data corresponding to that index information.
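A data set library with an index list can be sketched as below. The class name `DataSetLibrary`, its methods and the in-memory dictionary are illustrative assumptions; the source does not specify how the library is stored.

```python
class DataSetLibrary:
    """Hypothetical in-memory library: an index list plus the data sets it points to."""

    def __init__(self):
        self.index_list = []  # index information, in storage order
        self._sets = {}       # index information -> list of item ids

    def store(self, index, items):
        """Store a data set under its index information, skipping duplicate items."""
        if index not in self._sets:
            self.index_list.append(index)
            self._sets[index] = []
        for item in items:
            if item not in self._sets[index]:
                self._sets[index].append(item)

    def get(self, index):
        """Return a copy of the data set for this index, or an empty list."""
        return list(self._sets.get(index, []))

library = DataSetLibrary()
library.store("A2", ["P0", "P1"])
library.store("A3", ["P0", "P1", "P2"])
```

Walking `library.index_list` and calling `library.get` for each entry recovers every data set and therefore every stored item.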
The data set contained in the data information sent by the second processing node can also be obtained in the manner described above, which is not described again here.
In a distributed system, after each processing node classifies the to-be-processed data at a specified location (for example, a local or other third-party database), the data sets obtained by the processing nodes also need to be integrated so that massive data can be processed effectively. Therefore, any processing node in the distributed system may receive data information sent by other processing nodes.
Step 102: According to the index information contained in the data information, the first processing node searches its data set library for a data set with the same index information. If no such data set is found, step 103 is performed; otherwise, step 106 is performed.
Step 103: The first processing node finds at least one second data set whose index information differs from the received index information.
In step 103, the first processing node traverses the index information of each data set in the data set library and compares each traversed item of index information with the index information contained in the received data information to determine whether the two are identical. The data sets whose index information differs from the index information contained in the received data information are taken as second data sets.
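The branch taken in step 102 and the traversal in step 103 can be sketched with a plain dictionary standing in for the data set library. The helper names `split_by_index` and `next_step` are hypothetical and exist only for this illustration.

```python
def split_by_index(library, received_index):
    """Return (same, different): library data sets whose index matches or differs."""
    same = {i: s for i, s in library.items() if i == received_index}
    different = {i: s for i, s in library.items() if i != received_index}
    return same, different

def next_step(library, received_index):
    """Step 102: go to step 106 if the index is already stored, else to step 103."""
    same, _ = split_by_index(library, received_index)
    return 106 if same else 103

# Small hypothetical library: index information -> data set.
library = {"A3": ["P0", "P1", "P2"], "A4": ["P1", "P2"]}
```

For a received index of A5, `next_step` selects step 103 and every stored data set is a candidate second data set; for A3 it selects step 106.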
It should be noted that "first" and "second" in the "first data set" and "second data set" in the embodiments of the present application have no special meaning and are used only to distinguish different data sets.
Step 104: The first processing node judges whether the first data set and the at least one second data set contain identical to-be-processed data. If so, step 105 is performed; if not, the current round of operation is abandoned.
In step 104, the first processing node compares the to-be-processed data contained in the first data set with the to-be-processed data contained in each second data set, and determines whether any item in the first data set is identical to an item in each second data set.
Step 105: Upon determining that the first data set and the at least one second data set contain at least one identical item of to-be-processed data, the first processing node aggregates the first data set and the at least one second data set into one data set.
In step 105, when the first processing node determines that the first data set and the at least one second data set contain at least one identical item of to-be-processed data, this indicates that the first data set and the at least one second data set have the same or similar properties, so an aggregation operation can be performed on them to obtain one data set.
Optionally, the data information also contains the position information of each item of to-be-processed data within the first data set. When performing the aggregation operation on the first data set and the at least one second data set, the first processing node determines the position information, within the first data set, of the at least one identical item contained in both the first data set and the at least one second data set, and aggregates the first data set and the at least one second data set into one data set according to the determined position information.
Suppose the data set contained in the data information that the first processing node receives from the second processing node is {P0, P5, P6}, with index information A5. The data sets found in the data set library that have different index information and contain identical data are: {P0}, with index information A1; {P0, P1}, with index information A2; and {P0, P1, P2}, with index information A3.
That is, {P0, P5, P6} (index information A5) is aggregated with {P0} (index information A1); {P0, P5, P6} (index information A5) is aggregated with {P0, P1} (index information A2); and {P0, P5, P6} (index information A5) is aggregated with {P0, P1, P2} (index information A3).
It should be noted that the position information at which the aggregation operation is performed is the position information corresponding to P0.
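The selection of aggregation targets in steps 103 to 105 can be sketched as follows, using the nine data sets built earlier as the library and {P0, P5, P6} with index A5 as the received data set. The function name `find_aggregation_targets` and the dictionary representation are assumptions made for illustration.

```python
def find_aggregation_targets(library, received_index, received_items):
    """Return {index: shared items} for library sets with a different index
    that share at least one item with the received data set."""
    received = set(received_items)
    targets = {}
    for index, items in library.items():
        if index == received_index:
            continue  # same index: handled by the merge branch instead
        shared = [item for item in items if item in received]
        if shared:
            targets[index] = shared
    return targets

library = {
    "A1": ["P0"],
    "A2": ["P0", "P1"],
    "A3": ["P0", "P1", "P2"],
    "A4": ["P1", "P2"],
    "A5": ["P2", "P3"],
    "A6": ["P3"],
    "A7": ["P3", "P4"],
    "A8": ["P4"],
    "A9": ["P4"],
}

targets = find_aggregation_targets(library, "A5", ["P0", "P5", "P6"])
```

The shared items returned for each target identify where the aggregation takes place; in this example that is P0 in all three cases.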
Optionally, if the first data set corresponds to a first tree and the at least one second data set corresponds to a second tree, the first processing node aggregates the first tree and the second tree into one distributed tree according to the determined position information.
The aggregation of {P0, P5, P6} (index information A5) with {P0, P1, P2} (index information A3) is taken as an example below.
Fig. 4 is a schematic diagram of performing the aggregation operation.
As can be seen from Fig. 4, tree structure 1, which corresponds to {P0, P5, P6} with index information A5, has A5 as its root node; P0 and P5 are leaf nodes of the root node, and P6 is a leaf node of P0. Tree structure 2, which corresponds to {P0, P1, P2} with index information A3, has A3 as its root node; P0 and P1 are leaf nodes of the root node, and P2 is a leaf node of P0. Because the position information at which the aggregation operation is performed is the position information corresponding to P0, tree structure 1 and tree structure 2 are aggregated at the P0 node to form a distributed tree.
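Fig. 4 itself is not reproduced here, so the exact shape of the resulting distributed tree is not known from the text. One plausible reading is that the two trees are joined by sharing the P0 node, which then hangs under both roots and carries the nodes below it from both trees. The sketch below implements that reading only; the adjacency-dictionary representation and the names `nodes_of` and `aggregate_at` are assumptions.

```python
def nodes_of(tree):
    """All labels that appear in a tree given as {parent: [children]}."""
    nodes = set(tree)
    for children in tree.values():
        nodes.update(children)
    return nodes

def aggregate_at(tree_a, tree_b, shared):
    """Join two trees at a node they share by taking the union of their edges."""
    if shared not in nodes_of(tree_a) or shared not in nodes_of(tree_b):
        raise ValueError(f"{shared} is not present in both trees")
    combined = {}
    for tree in (tree_a, tree_b):
        for parent, children in tree.items():
            bucket = combined.setdefault(parent, [])
            for child in children:
                if child not in bucket:
                    bucket.append(child)
    return combined

# Tree structure 1 (index A5) and tree structure 2 (index A3) as described for Fig. 4.
tree_1 = {"A5": ["P0", "P5"], "P0": ["P6"]}
tree_2 = {"A3": ["P0", "P1"], "P0": ["P2"]}

combined = aggregate_at(tree_1, tree_2, "P0")
```

In the combined structure P0 is reachable from both A5 and A3, so a lookup that starts from either index information reaches the items of both data sets.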
Step 106: The first processing node finds at least one third data set whose index information is identical to the received index information.
Step 107: The first processing node merges the first data set with the at least one third data set to generate the tree corresponding to the index information.
In step 107, when the first processing node determines that the index information of the first data set is identical to the index information of the at least one third data set, this indicates that the first data set and the at least one third data set share the same feature, so a merge operation can be performed on the first data set and the at least one third data set to obtain one data set.
Suppose the data set contained in the data information that the first processing node receives from the second processing node is {P0, P5, P6}, with index information A5, and the data set found in the data set library for index information A5 is {P2, P3}. A merge operation is performed on {P0, P5, P6} and {P2, P3} to obtain one data set {P0, P5, P6, P2, P3}.
Fig. 5 is a schematic diagram of performing the merge operation.
As can be seen from Fig. 5, tree structure 1, which corresponds to {P0, P5, P6} with index information A5, has A5 as its root node; P0 and P5 are leaf nodes of the root node, and P6 is a leaf node of P0. Tree structure 2, which corresponds to {P2, P3} with index information A5, has A5 as its root node; P2 and P3 are leaf nodes of the root node. Because a merge operation is performed, two leaf nodes, P2 and P3, are generated at the A5 root node to form a distributed tree.
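At the level of data sets, the merge operation is an order-preserving union of two sets that share the same index information. A minimal sketch, with the hypothetical function name `merge_data_sets`:

```python
def merge_data_sets(received_items, stored_items):
    """Merge two data sets with the same index, keeping order and dropping duplicates."""
    merged = list(received_items)
    for item in stored_items:
        if item not in merged:
            merged.append(item)
    return merged

merged = merge_data_sets(["P0", "P5", "P6"], ["P2", "P3"])
```

For the example in the text this gives the data set {P0, P5, P6, P2, P3}. How the merged items are arranged in the tree is a separate choice that this sketch does not model.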
With the scheme of the embodiments of the present application, the first processing node receives data information sent by at least one second processing node. The data information contains a first data set and the index information corresponding to the first data set, where the index information characterizes the common feature vector of the data to be processed contained in the first data set. According to the index information contained in the data information, the first processing node finds, in its data set library, at least one second data set whose index information differs from that index information. Upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed, the first processing node aggregates the first data set and the at least one second data set into one data set. In this way, in a distributed system, when the main processing node receives data information sent by an auxiliary processing node and finds two data sets with different index information that contain identical data to be processed, it can aggregate those two data sets, which effectively improves the speed of data aggregation and, in turn, the efficiency of processing massive data.
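The receive, look up, and aggregate flow summarized above can be sketched under simple assumptions: the data set library is a dictionary from index information to a set of item identifiers, and aggregation is shown as a plain set union. The function name `aggregate_candidates` and the index A7 are hypothetical.

```python
def aggregate_candidates(library, index, first_set):
    """Find library sets with a different index that share an item.

    Returns (other_index, shared_items, aggregated_set) tuples.
    """
    results = []
    for other_index, second_set in library.items():
        if other_index == index:
            continue  # same index: handled by the merge path, not aggregation
        shared = first_set & second_set
        if shared:
            results.append((other_index, shared, first_set | second_set))
    return results


# Library on the first processing node; A7 is an invented unrelated entry.
library = {"A3": {"P0", "P1", "P2"}, "A7": {"P8"}}
found = aggregate_candidates(library, "A5", {"P0", "P5", "P6"})
```

Only the A3 data set qualifies in this example, because it differs in index information from A5 yet shares P0 with the received data set.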
The above is the data processing method provided by the embodiments of the present application. Based on the same idea, the embodiments of the present application also provide a data processing device. Figure 6 is a schematic structural diagram of a data processing device provided by an embodiment of the present application. The data processing device includes a receiving unit 61, a searching unit 62, and an aggregation unit 63, wherein:
The receiving unit 61 is configured to receive data information sent by at least one second processing node, where the data information contains a first data set and the index information corresponding to the first data set, and the index information characterizes the common feature vector of the data to be processed contained in the first data set;
The searching unit 62 is configured to find, according to the index information contained in the data information, at least one second data set whose index information differs from that index information in the data set library of the first processing node;
The aggregation unit 63 is configured to aggregate the first data set and the at least one second data set into one data set upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed.
Optionally, the data information also contains position information of each piece of data to be processed in the first data set, indicating its position within the first data set;
The aggregation unit 63 is specifically configured to determine, upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed, the position information of the at least one identical piece of data to be processed in the first data set;
and to aggregate the first data set and the at least one second data set into one data set according to the determined position information.
Specifically, the aggregation unit 63 is configured so that, if the first data set corresponds to a first tree and the at least one second data set corresponds to a second tree, the first processing node aggregates the first tree and the second tree into one distributed tree according to the determined position information.
Optionally, the data processing device further includes a merging unit 64, wherein:
The merging unit 64 is configured to find, according to the index information contained in the data information, at least one third data set whose index information is identical to that index information in the data set library of the first processing node;
and to merge the first data set with the at least one third data set, generating the tree corresponding to the index information.
Optionally, the data processing device further includes a classification unit 65, wherein:
The classification unit 65 is configured to obtain, before the data information sent by at least one second processing node is received, at least two pieces of data to be processed, and to determine the feature set of each piece of data to be processed, where the feature set contains at least one feature vector; and, using the feature vector as the partitioning granularity, to group the at least one piece of data to be processed that contains the same feature vector into one data set.
Specifically, the classification unit 65 is configured to determine one feature vector; to find, among the at least two pieces of data to be processed that were obtained, the data to be processed that contains the determined feature vector; and to combine the data to be processed that was found into one data set.
Optionally, the data processing device further includes a determining unit 66, wherein:
The determining unit 66 is configured to determine, when a data set is obtained, the index information of the data set, where the index information characterizes the common feature vector of the data to be processed contained in the data set.
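The classification and index determination described above can be sketched as a grouping step in which the shared feature vector serves as the index information of the resulting data set. The input layout (a dictionary from item identifier to its list of feature labels) and the name `partition_by_feature` are assumptions; the feature labels reuse A3 and A5 from the earlier examples.

```python
def partition_by_feature(feature_sets):
    """Group items by each feature they contain.

    The key of every group plays the role of the index information,
    and the value lists items in the order they were found.
    """
    groups = {}
    for item, features in feature_sets.items():
        for feature in features:
            groups.setdefault(feature, []).append(item)
    return groups


# Hypothetical feature sets: P0 carries both A5 and A3.
feature_sets = {
    "P0": ["A5", "A3"],
    "P5": ["A5"],
    "P6": ["A5"],
    "P1": ["A3"],
    "P2": ["A3"],
}
data_sets = partition_by_feature(feature_sets)
```

An item with several feature vectors, such as P0 here, lands in several data sets, which is what later allows two data sets with different index information to be aggregated at that item.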
Specifically, the classification unit 65 combining the data to be processed that was found into one data set includes:
generating a tree according to the order in which the data to be processed was found, and treating the tree as one data set, where each node of the tree corresponds to one piece of data to be processed.
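The text does not spell out how the lookup order maps to tree positions. One reading that reproduces the example trees (A5 with P0 and P5 under the root and P6 under P0) is a level-order fill with two children per node; the sketch below adopts that reading purely as an assumption, and `build_tree` and `fanout` are hypothetical names.

```python
from collections import deque


def build_tree(index, items, fanout=2):
    """Place items under the index root in the order they were found.

    Assumed rule: fill each parent with `fanout` children, level by level.
    """
    tree = {index: []}
    open_parents = deque([index])
    for item in items:
        parent = open_parents[0]
        tree.setdefault(parent, []).append(item)
        open_parents.append(item)  # the new node can take children later
        if len(tree[parent]) == fanout:
            open_parents.popleft()  # this parent is full
    return tree


tree_a5 = build_tree("A5", ["P0", "P5", "P6"])
tree_a3 = build_tree("A3", ["P0", "P1", "P2"])
```

With a different placement rule the shapes would differ, so the fanout of two should be read as a guess that fits the two examples rather than as the method the application prescribes.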
Optionally, the data processing device further includes a storage unit 67, wherein:
The storage unit 67 is configured to store the obtained data set in the data set library of the first processing node.
It should be noted that the data processing device described in the embodiments of the present application may be implemented in hardware or in software; the implementation of the data processing device is not specifically limited here.
In this way, in a distributed system, if the data processing device described in the embodiments of the present application is deployed on both the main processing node and the auxiliary processing nodes, then when the main processing node receives data information sent by an auxiliary processing node and finds two data sets with different index information that contain identical data to be processed, it can aggregate those two data sets, which effectively improves the speed of data aggregation and, in turn, the efficiency of processing massive data.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, commodity, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The foregoing describes only embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (18)

1. A data processing method, characterized by comprising:
a first processing node receiving data information sent by at least one second processing node, wherein the data information contains a first data set and index information corresponding to the first data set, and the index information characterizes the common feature vector of the data to be processed contained in the first data set;
the first processing node finding, according to the index information contained in the data information, at least one second data set whose index information differs from the index information in the data set library of the first processing node;
the first processing node, upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed, aggregating the first data set and the at least one second data set into one data set.
2. The data processing method according to claim 1, characterized in that the data information also contains position information of each piece of data to be processed in the first data set, indicating its position within the first data set;
the first processing node, upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed, aggregating the first data set and the at least one second data set into one data set comprises:
the first processing node, upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed, determining the position information of the at least one identical piece of data to be processed in the first data set;
the first processing node aggregating the first data set and the at least one second data set into one data set according to the determined position information.
3. The data processing method according to claim 2, characterized in that the first processing node aggregating the first data set and the at least one second data set into one data set according to the determined position information comprises:
if the first data set corresponds to a first tree and the at least one second data set corresponds to a second tree, the first processing node aggregating the first tree and the second tree into one distributed tree according to the determined position information.
4. The data processing method according to any one of claims 1 to 3, characterized in that the method further comprises:
the first processing node finding, according to the index information contained in the data information, at least one third data set whose index information is identical to the index information in the data set library of the first processing node;
the first processing node merging the first data set with the at least one third data set, generating the tree corresponding to the index information.
5. The data processing method according to claim 1, characterized in that, before the first processing node receives the data information sent by the at least one second processing node, the method further comprises:
the first processing node obtaining at least two pieces of data to be processed and determining the feature set of each piece of data to be processed, wherein the feature set contains at least one feature vector;
the first processing node, using the feature vector as the partitioning granularity, grouping the at least one piece of data to be processed that contains the same feature vector into one data set.
6. The data processing method according to claim 5, characterized in that the first processing node, using the feature vector as the partitioning granularity, grouping the at least one piece of data to be processed that contains the same feature vector into one data set comprises:
the first processing node determining one feature vector;
the first processing node finding, among the at least two pieces of data to be processed that were obtained, the data to be processed that contains the determined feature vector;
the first processing node combining the data to be processed that was found into one data set.
7. The data processing method according to claim 5 or 6, characterized in that the method further comprises:
the first processing node, when obtaining a data set, determining the index information of the data set, wherein the index information characterizes the common feature vector of the data to be processed contained in the data set.
8. The data processing method according to claim 7, characterized in that the first processing node combining the data to be processed that was found into one data set comprises:
the first processing node generating a tree according to the order in which the data to be processed was found, and treating the tree as one data set, wherein each node of the tree corresponds to one piece of data to be processed.
9. The data processing method according to claim 8, characterized in that the method further comprises:
the first processing node storing the obtained data set in the data set library of the first processing node.
10. A data processing device, characterized by comprising:
a receiving unit, configured to receive data information sent by at least one second processing node, wherein the data information contains a first data set and index information corresponding to the first data set, and the index information characterizes the common feature vector of the data to be processed contained in the first data set;
a searching unit, configured to find, according to the index information contained in the data information, at least one second data set whose index information differs from the index information in the data set library of a first processing node;
an aggregation unit, configured to aggregate the first data set and the at least one second data set into one data set upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed.
11. The data processing device according to claim 10, characterized in that the data information also contains position information of each piece of data to be processed in the first data set, indicating its position within the first data set;
the aggregation unit is specifically configured to determine, upon determining that the first data set and the at least one second data set contain at least one identical piece of data to be processed, the position information of the at least one identical piece of data to be processed in the first data set;
and to aggregate the first data set and the at least one second data set into one data set according to the determined position information.
12. The data processing device according to claim 11, characterized in that
the aggregation unit is specifically configured so that, if the first data set corresponds to a first tree and the at least one second data set corresponds to a second tree, the first processing node aggregates the first tree and the second tree into one distributed tree according to the determined position information.
13. The data processing device according to any one of claims 10 to 12, characterized in that the data processing device further comprises a merging unit, wherein:
the merging unit is configured to find, according to the index information contained in the data information, at least one third data set whose index information is identical to the index information in the data set library of the first processing node;
and to merge the first data set with the at least one third data set, generating the tree corresponding to the index information.
14. The data processing device according to claim 13, characterized in that the data processing device further comprises a classification unit, wherein:
the classification unit is configured to obtain, before the data information sent by the at least one second processing node is received, at least two pieces of data to be processed, and to determine the feature set of each piece of data to be processed, wherein the feature set contains at least one feature vector; and, using the feature vector as the partitioning granularity, to group the at least one piece of data to be processed that contains the same feature vector into one data set.
15. The data processing device according to claim 14, characterized in that
the classification unit is specifically configured to determine one feature vector; to find, among the at least two pieces of data to be processed that were obtained, the data to be processed that contains the determined feature vector; and to combine the data to be processed that was found into one data set.
16. The data processing device according to claim 14 or 15, characterized in that the data processing device further comprises a determining unit, wherein:
the determining unit is configured to determine, when a data set is obtained, the index information of the data set, wherein the index information characterizes the common feature vector of the data to be processed contained in the data set.
17. The data processing device according to claim 16, characterized in that the classification unit combining the data to be processed that was found into one data set specifically comprises:
generating a tree according to the order in which the data to be processed was found, and treating the tree as one data set, wherein each node of the tree corresponds to one piece of data to be processed.
18. The data processing device according to claim 17, characterized in that the data processing device further comprises a storage unit, wherein:
the storage unit is configured to store the obtained data set in the data set library of the first processing node.
CN201510463292.7A 2015-07-31 2015-07-31 A kind of data processing method and device Active CN106407215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510463292.7A CN106407215B (en) 2015-07-31 2015-07-31 A kind of data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510463292.7A CN106407215B (en) 2015-07-31 2015-07-31 A kind of data processing method and device

Publications (2)

Publication Number Publication Date
CN106407215A true CN106407215A (en) 2017-02-15
CN106407215B CN106407215B (en) 2019-08-13

Family

ID=58007810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510463292.7A Active CN106407215B (en) 2015-07-31 2015-07-31 A kind of data processing method and device

Country Status (1)

Country Link
CN (1) CN106407215B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325167A (en) * 2017-07-31 2019-02-12 株式会社理光 Characteristic analysis method, device, equipment, computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694657A (en) * 2009-09-18 2010-04-14 浙江大学 Picture retrieval clustering method facing to Web2.0 label picture shared space
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure
CN101814112A (en) * 2010-01-11 2010-08-25 北京世纪高通科技有限公司 Method and device for processing data
CN102054001A (en) * 2009-10-28 2011-05-11 中国移动通信集团公司 Data preprocessing method, system and device in data mining system
CN102054002A (en) * 2009-10-28 2011-05-11 中国移动通信集团公司 Method and device for generating decision tree in data mining system
US20140095512A1 (en) * 2012-09-28 2014-04-03 International Business Machines Corporation Ranking supervised hashing
CN104063376A (en) * 2013-03-18 2014-09-24 阿里巴巴集团控股有限公司 Multi-dimensional grouping operation method and system
CN104239324A (en) * 2013-06-17 2014-12-24 阿里巴巴集团控股有限公司 Methods and systems for user behavior based feature extraction and personalized recommendation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694657A (en) * 2009-09-18 2010-04-14 浙江大学 Picture retrieval clustering method facing to Web2.0 label picture shared space
CN102054001A (en) * 2009-10-28 2011-05-11 中国移动通信集团公司 Data preprocessing method, system and device in data mining system
CN102054002A (en) * 2009-10-28 2011-05-11 中国移动通信集团公司 Method and device for generating decision tree in data mining system
CN101814112A (en) * 2010-01-11 2010-08-25 北京世纪高通科技有限公司 Method and device for processing data
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure
US20140095512A1 (en) * 2012-09-28 2014-04-03 International Business Machines Corporation Ranking supervised hashing
CN104063376A (en) * 2013-03-18 2014-09-24 阿里巴巴集团控股有限公司 Multi-dimensional grouping operation method and system
CN104239324A (en) * 2013-06-17 2014-12-24 阿里巴巴集团控股有限公司 Methods and systems for user behavior based feature extraction and personalized recommendation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325167A (en) * 2017-07-31 2019-02-12 株式会社理光 Characteristic analysis method, device, equipment, computer readable storage medium
CN109325167B (en) * 2017-07-31 2022-02-18 株式会社理光 Feature analysis method, device, equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN106407215B (en) 2019-08-13


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200921

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200921

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Ltd.