CN108280176A - Data mining optimization method based on MapReduce - Google Patents

Data mining optimization method based on MapReduce

Info

Publication number
CN108280176A
CN108280176A (Application CN201810059358.XA)
Authority
CN
China
Prior art keywords
node
data
hash
virtual machine
cluster
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810059358.XA
Other languages
Chinese (zh)
Inventor
李垚霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Boruide Science & Technology Co Ltd
Original Assignee
Chengdu Boruide Science & Technology Co Ltd
Application filed by Chengdu Boruide Science & Technology Co Ltd filed Critical Chengdu Boruide Science & Technology Co Ltd
Priority to CN201810059358.XA priority Critical patent/CN108280176A/en
Publication of CN108280176A publication Critical patent/CN108280176A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9014Indexing; Data structures therefor; Storage structures hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data mining optimization method based on MapReduce, comprising: defining, in the MapReduce computing framework, the mapping relations between virtual compute nodes and real compute nodes; in the Map stage, first locating a virtual compute node, then searching for the corresponding real compute node according to the virtual node, and mapping each cluster onto one node; and handing the node data to the Reduce stage for merging and outputting the query result. The present invention thereby improves the efficiency of data mining over the data nodes of a distributed environment.

Description

Data mining optimization method based on MapReduce
Technical field
The present invention relates to data mining, and more particularly to a data mining optimization method based on MapReduce.
Background art
Performing data aggregation and analysis on large-scale distributed data nodes requires efficient data mining methods. In the current related art, traditional centralized data management and retrieval methods suffer from single points of failure and poor scalability, and cannot satisfy the demand for flexible, scalable and robust data mining under a distributed environment. Therefore, how to use decentralized data node management and data mining methods to meet the scalable data node management and the data aggregation and analysis demands of building data services remains a challenging problem. In addition, existing big data parallel computing frameworks leave room for improvement in query time and cost during the indexing stage, and with traditional parallel sort-merge, if the data feature fields are unevenly distributed, efficiency drops markedly in the probe stage.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a data mining optimization method based on MapReduce, comprising:
defining, in the MapReduce computing framework, the mapping relations between virtual compute nodes and real compute nodes;
in the Map stage, first locating a virtual compute node, then searching for the corresponding real compute node according to the virtual node, and mapping each cluster onto one node;
handing the node data to the Reduce stage for merging, and outputting the query result.
Preferably, defining the mapping relations between virtual compute nodes and real compute nodes further comprises:
designing a new file format HMF at the bottom layer, such that for a user HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping relation between a virtual compute node and a real compute node.
Preferably, before the step of locating virtual compute nodes in the Map stage, the method further includes:
organizing the entire hash value space into a virtual ring joined end to end;
hashing the network address of each compute node as the keyword, so that each node determines its position on the hash space;
mapping an HMF file with the hash function to a value on the hash space and, walking clockwise from that value, taking the first node encountered as its processing node.
Preferably,
In the Map stage, when searching nodes according to HMF, the corresponding real compute node is searched according to the virtual compute node, and each cluster is mapped onto one node.
Preferably, after the step of mapping each cluster onto one node, the method further includes:
in the probe stage, collecting the load data of each node; once an imbalance is found, the clusters mapped to the node are reassigned to new nodes, the number of which is determined by the load; after the migration, the original node's resources are reclaimed for redistribution;
after the hash join of each node completes, the data of the new nodes and the original nodes are handed together to the Reduce stage for merging.
Preferably, the Map task of each node selects one of the tables as the hash-join base table to build the hash table, takes the join attribute participating in the join operation as the hash key, and reads the join attribute field of the base table from the HMF file system into the node memory of the MapReduce distributed system; then a hash function operation is performed on all key values of the join field;
the hash-processed base table join columns are stored, together with their data, in a memory area specially opened up for such data; then, according to the different hash function values, the base table is partitioned into clusters, each cluster containing all base table rows with the same hash function value.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data mining optimization method based on MapReduce that, for the data nodes of a distributed environment, lets users consume data conveniently by matching service description information and improves the efficiency of data mining; it also provides a feasible scheme for building data services on the computing or storage resources offered by cloud services.
Description of the drawings
Fig. 1 is a flowchart of the data mining optimization method based on MapReduce according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawings that illustrate the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment; its scope is limited only by the claims, and it covers many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention; they are provided for exemplary purposes, and the invention may also be practised according to the claims without some or all of these details.
One aspect of the present invention provides a data mining optimization method based on MapReduce. Fig. 1 is a flowchart of the method according to an embodiment of the present invention.
The data feature mining system of the present invention includes a storage subsystem, a feature classification subsystem, a trusted key subsystem, a feature mining subsystem and a task scheduling subsystem.
The trusted key subsystem ensures that data can be obtained only after identity authentication, and covers key generation, identity verification and decryption. The key generation algorithm is as follows:
1) the data is divided into blocks of the key string's length;
2) each character of the plaintext and the key is replaced by an integer in the range 0-26, with space = 00, A = 01, ..., Z = 26;
3) for each plaintext block, each of its characters is replaced by the corresponding calculated value, obtained by adding the character's integer code to the integer code of the character at the corresponding position in the key and taking the remainder modulo 27;
4) each calculated value is then substituted back with its equivalent character.
Identity verification is realized through user login and voiceprint verification; a user who passes identity verification obtains the key through the decryption module and completes decryption (a sketch of the key scheme follows).
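For illustration, the following is a minimal Python sketch of the 27-symbol key scheme above (space = 00, A = 01, ..., Z = 26), in which plaintext and key codes are added modulo 27; the function names are illustrative and not part of the specification.

```python
# Minimal sketch of the 27-symbol key scheme described above
# (space = 0, A = 1, ..., Z = 26); names are illustrative only.

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # index = integer code

def encode(ch: str) -> int:
    """Map a character to its integer code in the 27-symbol alphabet."""
    return ALPHABET.index(ch)

def encrypt(plaintext: str, key: str) -> str:
    """Add key codes to plaintext codes modulo 27, block by block."""
    out = []
    for i, ch in enumerate(plaintext):
        code = (encode(ch) + encode(key[i % len(key)])) % 27
        out.append(ALPHABET[code])
    return "".join(out)

def decrypt(ciphertext: str, key: str) -> str:
    """Subtract key codes modulo 27 to invert the substitution."""
    out = []
    for i, ch in enumerate(ciphertext):
        code = (encode(ch) - encode(key[i % len(key)])) % 27
        out.append(ALPHABET[code])
    return "".join(out)

if __name__ == "__main__":
    key = "KEY"
    ct = encrypt("HELLO WORLD", key)
    assert decrypt(ct, key) == "HELLO WORLD"
```

Decryption subtracts the same key codes modulo 27, which is why the back-substitution of step 4 recovers the plaintext exactly.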
The storage subsystem includes a storage module and a disaster tolerance module. The storage module authenticates the nodes in the network where information is to be stored and builds trust relations for the stored information; based on the data distributed under the distributed environment, it packages and stores the feature data, and uses a composite feature index structure to obtain faster query speed for both text-type and numeric-type data. The disaster tolerance module restores data in case of loss or destruction.
On the basis of traditional indexes, the storage module separates the data attribute keys and attribute values in the data feature set and builds a two-layer feature index structure. First, a high-layer index is built for the attributes of the data in the feature set. Then a feature index is constructed for the key values under each high-layer feature attribute: an R-tree feature index for numeric data and an inverted feature index for text data. A range query on numeric data is thus routed directly to the low-layer tree index, reducing query time and cost.
The high-layer tree feature index is built for the feature attributes. In this layer, the specific feature attributes contained in the data feature set are stored entirely in the non-leaf objects, while every leaf object of the R-tree stores three pieces of information, Ai, Pcat and Psi, whose meanings are: (1) Ai is a specific feature attribute of the indexed data feature set, where n is the number of all feature attributes and i ∈ [1, n]; (2) Pcat indicates the pointer type; (3) Psi is a pointer to the low-layer feature index; depending on the data type, it points either to the header of an inverted document table or to the root node of an R-tree.
The low-layer feature index is constructed for the key values under a high-layer feature attribute, and comprises R-tree feature indexes built for numeric data and inverted document table indexes built for text data. The actual key values are stored in the non-leaf objects of the R-tree structure, while the leaf objects are ordered and contain three pieces of feature index file information, RS, Pos and Fileid: (1) RS is the S-th attribute key value of the R-th feature attribute key, R ∈ [1, n2], S ∈ [1, p], where n2 is the number of numeric feature attributes in the data feature set and p is the number of feature values of the R-th attribute key; (2) Pos is the location of the file containing this attribute value; (3) Fileid is the ID of the file containing the query feature word.
The inverted feature index is divided into two parts: the first is a feature index table composed of the different index words, recording the distinct text keywords and their related information; the second records, for each index word, the set of documents in which it occurs and their storage addresses. Specifically, the inverted feature index structure contains four pieces of information, Aij, Fileid, Pos and Freq: (1) Aij is the j-th feature attribute value of the i-th feature attribute key, i ∈ [1, n1], j ∈ [1, m], where n1 is the number of text attributes and m is the number of attribute values of the i-th attribute key; (2) Fileid is the unique ID of the file containing the query feature word; (3) Pos is the position of that file; (4) Freq is the frequency with which the query feature word occurs in the data feature set. The feature index is created as follows:
Step 1: analyse the data for which a feature index is to be constructed; if the current data is not present in the indexes already built, create a new feature index object at the high layer of the composite feature index;
Step 2: judge the type of the new data's feature attribute value: if numeric, build an R-tree feature index for it; if text, build an inverted feature index structure for it;
Step 3: repeat step 1; if the current attribute already exists in a previously built feature index, do not add a new object to the high layer, and only add the data of that attribute to the corresponding low-layer feature index;
Step 4: repeat the above steps until feature indexes have been constructed for all the data (a sketch of this build procedure follows).
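A minimal sketch of the build procedure in steps 1-4, assuming a Python dict for the high-layer attribute index and simple stand-ins for the low-layer structures (a sorted list in place of the R-tree, a dict in place of the inverted table); all class and method names are illustrative.

```python
from collections import defaultdict

class CompositeFeatureIndex:
    """Sketch of the two-layer index: high layer keyed by attribute,
    low layer an R-tree stand-in (sorted list) for numeric values or
    an inverted table (dict) for text values."""

    def __init__(self):
        self.high_layer = {}  # attribute key -> ("numeric"|"text", low-layer index)

    def insert(self, attr: str, value, file_id: str):
        if attr not in self.high_layer:
            # Steps 1-2: new high-layer object plus a low-layer index
            # chosen by the attribute's value type.
            if isinstance(value, (int, float)):
                self.high_layer[attr] = ("numeric", [])               # R-tree stand-in
            else:
                self.high_layer[attr] = ("text", defaultdict(list))   # inverted table
        kind, low = self.high_layer[attr]
        # Step 3: existing attribute -> only extend the low-layer index.
        if kind == "numeric":
            low.append((value, file_id))
            low.sort()  # keeps range queries simple in this sketch
        else:
            low[value].append(file_id)

    def range_query(self, attr: str, lo, hi):
        kind, low = self.high_layer.get(attr, ("numeric", []))
        return [fid for v, fid in low if lo <= v <= hi] if kind == "numeric" else []

idx = CompositeFeatureIndex()
idx.insert("price", 10.5, "f1")
idx.insert("price", 20.0, "f2")
idx.insert("title", "mapreduce", "f1")
print(idx.range_query("price", 5, 15))  # ['f1']
```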
During index retrieval, the query condition is first analysed to obtain feature words, and each query feature word is looked up in the index dictionary. If the index flag is false, a null value is returned, indicating that the queried feature data does not exist in the index file; if true, the data type of the query word's result is judged, the corresponding feature index is located according to the type, and the feature word's ID and the number of documents containing it are read, from which the query condition's related information is obtained. The contents of the R-tree feature index or the inverted index are then read according to the feature word ID, the retrieved contents are integrated and compared for relevance against the search condition, and the query results are ranked to produce the final result returned to the user. The search algorithm takes the key value key_id in the feature table as input and outputs a Boolean value, as follows:
(1) with the root block root, key_id and the level number level as input parameters, call the search function lookup(root, key_id, level) and assign the search result to the leaf node record;
(2) if the leaf node record is empty, return a null value directly; otherwise return the real search result rid.
The search function lookup takes the current block as input, key as the search keyword and level as the initial level number, and outputs the leaf record that may contain the search keyword key, as follows (a sketch follows the steps):
(3.1) if the current block is a leaf node, search for the key with binary search and report the result;
(3.2) if the current block is not a leaf node, perform steps (3.3) to (3.6);
(3.3) using the current block and the key value, select the subtree containing the key value and obtain the block number of the child node;
(3.4) read the child node block it contains into the buffer according to the block number;
(3.5) if the child node block found is a leaf node, return to (3.1);
(3.6) if the child node block is a branch block, take the child node block, key and level minus 1 as the new input and call the function recursively to return the output result.
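The lookup of steps (1)-(3.6) can be sketched as a recursive descent with binary search at the leaves; the block layout below is an assumed simplification, not the actual storage format.

```python
import bisect

class Block:
    """Simplified tree block: leaves hold sorted (key, rid) pairs,
    branches hold separator keys and child blocks."""
    def __init__(self, is_leaf, keys, payload):
        self.is_leaf = is_leaf
        self.keys = keys          # sorted keys
        self.payload = payload    # rids at leaves, child Blocks at branches

def lookup(block, key, level):
    if block.is_leaf:
        # Step (3.1): binary search within the leaf block.
        i = bisect.bisect_left(block.keys, key)
        if i < len(block.keys) and block.keys[i] == key:
            return block.payload[i]
        return None
    # Steps (3.3)-(3.6): pick the subtree containing the key,
    # decrement the level, and recurse.
    i = bisect.bisect_right(block.keys, key)
    return lookup(block.payload[i], key, level - 1)

# Tiny two-level tree: separator 10 splits two leaves.
leaf1 = Block(True, [3, 7], ["rid3", "rid7"])
leaf2 = Block(True, [10, 15], ["rid10", "rid15"])
root = Block(False, [10], [leaf1, leaf2])
print(lookup(root, 15, 1))  # rid15
print(lookup(root, 8, 1))   # None -> search returns a null value
```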
The feature classification subsystem performs classification management on the feature data using clustering. The present invention determines the number of classes with the following method. First define the estimation density:
ps(k) = min_{1≤j≤k} { Σ_{i≠i′; i,i′∈A_kj} D[C(X_tr, k), X_te]_{i,i′} / (n_kj(n_kj − 1)) }
where X_tr and X_te are the feature training set and feature test set obtained by randomly dividing the original data; C(X_tr, k) denotes clustering the feature training set into k classes; A_k1, A_k2, ..., A_kk are the k classes into which the feature test set itself is clustered, with i and i′ sample points in the same class and n_kj the number of sample points in A_kj; D[C(X_tr, k), X_te] is a matrix over test sample pairs whose entry at row i and column i′ takes the value 0 or 1, value 1 indicating that the training-set clustering also places i and i′ in the same class and value 0 that it does not; and ps(k) is the estimation density of the clustering result when the number of classes is k.
The estimation density is computed as follows:
(1) the original data to be clustered is randomly divided into a feature training set and a feature test set;
(2) taking the number of classes as k, the above two subsets are clustered, the result being denoted type-I clustering;
(3) the feature test set is discriminated with the clustering result of the feature training set, the result being denoted type-II clustering;
(4) for the k-th class into which the feature test set itself is clustered, it is examined whether any pair of sample points i and i′ is wrongly divided into different classes in the type-II clustering, and the proportion of correctly divided pairs is recorded;
(5) the smallest of the k proportions is the estimation density under the current number of classes k.
With the estimation density as the optimization function, the number of classes and the variable subset are the factors influencing it; the estimation density is maximized by selecting a suitable number of classes and variable subset, as sketched below.
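A minimal sketch of the estimation-density computation in steps (1)-(5), assuming k-means (scikit-learn's KMeans) as the underlying clustering; the library choice and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimation_density(X, k, seed=0):
    """Steps (1)-(5): split the data, cluster both halves, and take the
    minimum over test clusters of the proportion of point pairs that the
    training clustering also puts in one class."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    X_tr, X_te = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]

    train_model = KMeans(n_clusters=k, n_init=10).fit(X_tr)    # type-I clustering
    te_labels = KMeans(n_clusters=k, n_init=10).fit_predict(X_te)
    te_by_train = train_model.predict(X_te)                    # type-II clustering

    ratios = []
    for j in range(k):
        members = np.where(te_labels == j)[0]
        n = len(members)
        if n < 2:
            continue
        # Count pairs i != i' that the training clustering keeps together.
        same = sum(
            te_by_train[a] == te_by_train[b]
            for ai, a in enumerate(members) for b in members[ai + 1:]
        )
        ratios.append(same / (n * (n - 1) / 2))
    return min(ratios) if ratios else 0.0

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(estimation_density(X, 2))  # close to 1 for well-separated clusters
```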
To address the decline of hash probe efficiency when feature fields are unevenly distributed, the present invention bases the hash join algorithm of MapReduce data queries on feature-field storage, so that fields are evenly distributed across the nodes of the MapReduce distributed environment and data processing efficiency improves. The projection operation of query execution is turned into an operation on the feature fields of each node, reducing the I/O waste that repeated accesses to wide tables bring. In feature-field storage, the target object is pushed down to a specific feature field, each feature field amounting to a small table composed of (lineid, value) pairs.
To solve the data imbalance problem, a new file format HMF is first designed at the bottom layer of the MapReduce distributed computing framework, such that for a user HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping relation between a virtual compute node and a real compute node.
The parallel computation based on the hash algorithm is realized in the following steps:
Step 1: organize the entire hash value space into a virtual ring joined end to end;
Step 2: hash the network address of each compute node as the keyword, so that each node determines its position on the hash space;
Step 3: map an HMF file with the hash function to a value on the hash space and, walking clockwise from that value, take the first node encountered as its processing node;
Step 4: in the Map stage, when searching nodes according to HMF, what is found is a virtual compute node; the corresponding real compute node is then searched according to the virtual compute node, and each cluster is mapped onto one node;
Step 5: in the probe stage, collect the load data of each node; once an imbalance is found, the clusters mapped to the node are reassigned to new nodes, the number of which is determined by the load; after the migration the original node's resources are reclaimed for redistribution;
Step 6: after the hash join of each node completes, the data of the new nodes and the original nodes are handed together to the Reduce stage for merging, and finally the query result is output. (Steps 1-4 are sketched below.)
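Steps 1-4 amount to consistent hashing with virtual nodes. A minimal sketch, assuming MD5 as the hash function and illustrative node addresses and file names:

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Map a string onto the hash ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Steps 1-3: nodes hash their network address onto a ring; a file
    hashes to a point and is served by the first node found clockwise."""

    def __init__(self, addresses, replicas=4):
        self.ring = []  # sorted (position, real node) pairs
        for addr in addresses:
            for r in range(replicas):  # virtual nodes v(γi) per real node
                self.ring.append((h(f"{addr}#v{r}"), addr))
        self.ring.sort()

    def node_for(self, hmf_file: str) -> str:
        """Step 4: walk clockwise from the file's hash to the first node."""
        pos = h(hmf_file)
        i = bisect.bisect(self.ring, (pos,)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.node_for("part-00001.hmf"))  # real compute node for this HMF file
```

Virtual nodes are what keep the load even: when a node is added or removed in step 5, only the keys on its ring segments move, and its replicas spread that movement across the remaining nodes.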
In a hash join there are two relations R and S with tuple counts T_R and T_S, where T_R > T_S. A hash function initially partitions S into B clusters numbered 1, 2, ..., B; the common attributes of R and S are A1, A2, ..., Ak. For the component of the join attribute Ai of relation R distributed over m nodes, the hash operation determines which of the B clusters it matches.
The optimization of the above hash join is divided into two stages: build and probe.
In the build stage: the Map task of each node selects one of the tables as the hash-join base table to build the hash table, takes the join attribute participating in the join operation as the hash key, and reads the join attribute field of the base table from the HMF file system into the node memory of the MapReduce distributed system; a hash function operation is then performed on all key values of the join field.
The hash-processed base table join columns are stored, together with their data, in a memory area specially opened up for such data. Then, according to the different hash function values, the base table is partitioned into clusters, each cluster containing all base table rows with the same hash function value.
In the probe stage: the table joined against the base table on each node serves as the fact table; the data fields to be joined are read in batches from HMF in turn, a hash operation is done on the join attribute field to determine which cluster it falls in, and the hash search algorithm is used to locate the appropriate cluster for the search;
within the located cluster, the qualifying row numbers are obtained by exact matching. The qualifying row numbers of each node are merged by Reduce; the query columns involved in the SQL statement are read from the HMF file system, and finally the query result is output (a sketch of the two stages follows).
Because the search range of the algorithm has been narrowed, the success rate and accuracy of matching are high; and since the matching operation is carried out in memory, it is much faster than sort-merge, achieving the optimization purpose.
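The build and probe stages above correspond to a classic partitioned hash join; below is a minimal sketch with Python dicts standing in for the per-cluster hash tables and dicts of fields standing in for HMF rows (all names illustrative):

```python
from collections import defaultdict

B = 4  # number of clusters produced by the initial hash partition

def hash_join(base_table, fact_table, key):
    """Build stage: partition the smaller (base) table into B clusters
    keyed by the join attribute. Probe stage: hash each fact row to a
    cluster and match exactly inside that cluster only."""
    clusters = [defaultdict(list) for _ in range(B)]
    for row in base_table:                      # build
        k = row[key]
        clusters[hash(k) % B][k].append(row)
    for row in fact_table:                      # probe
        k = row[key]
        for match in clusters[hash(k) % B].get(k, []):
            yield {**match, **row}              # merged row handed to Reduce

base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
fact = [{"id": 2, "val": 10}, {"id": 3, "val": 20}]
print(list(hash_join(base, fact, "id")))  # [{'id': 2, 'name': 'b', 'val': 10}]
```

Building on the smaller relation is what keeps the hash tables memory-resident, which is where the speedup over sort-merge comes from.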
To reduce the influence of accidental factors, the data feature set is preferably first divided randomly into several equal parts, each part is used in turn as the feature test set, and after the respective estimation densities are found, their average is taken as the estimation density under that number of classes. The hierarchical clustering method based on the improved estimation density gives credible and practically meaningful clustering results for the data in the example, and clusters better than conventional clustering methods.
The feature mining subsystem searches for and matches, under a verified secure cloud environment, the feature data that meets the application demand from the data providers dispersed throughout the data layer of the cloud platform, and forms the feature data to be processed through aggregation, analysis and arrangement. The modelling phase uses the storage cluster to model the compute nodes in the distributed environment; feature data is shared among the non-local nodes, and the feature data meeting the application demand is searched and matched. Let xi be a node in the storage cluster, {xi1, xi2, ..., xim} its non-local node set, PLi the local resource pool, PNi the data pool of the non-local nodes, i ∈ [1, n], where n is the total number of nodes in the storage cluster and m the number of non-local nodes, m < n.
Data sharing uses the following protocol between non-local nodes: when xi joins the P2P network, it builds connections with {xi1, xi2, ..., xim}; xi then creates shared feature data according to the service information in PLi and forwards the shared feature data to all non-local nodes xim for sharing. When any node in the storage cluster receives a piece of shared feature data, it judges from the data's ID information whether it has already received it; if so, the shared feature data is discarded; if received for the first time, the contents of PNi are updated according to the data and node location information of the shared feature data, and whether to forward or discard the shared feature data is decided according to its validity flag. Data between non-local nodes needs to be synchronized periodically.
In resource searching, the operations executed are: let the node initiating a sharing request Mj be xj, and let pj × {xj1, xj2, ..., xjm} be the node set picked at random from xj's non-local node set with probability pj, j ∈ [1, n]. When a node xi receives the sharing request Mj sent by xj, it checks whether PNi and PLi contain feature data that satisfies Mj; if so, it creates a query response message according to the feature data and the location of its node, returns the response message to xj according to xj's location information, and decrements xj's validity flag by 1. If xj's validity flag is 0, the sharing request Mj is discarded; if not 0, the expected value of each node in pj × {xj1, xj2, ..., xjm} is calculated with the EM algorithm, and Mj is forwarded to the node in pj × {xj1, xj2, ..., xjm} with the largest expected value. The expected value is calculated as:
E_new = E_old + α·E_learn + β × I[N_xjμ(t) × (T_xjμ − T′_xjμ)/(T_xjμ × T′_xjμ)] × N_xjμ(t)/T_xjμ
where E_new is the new value of E, E_old the old value, E_learn the learnt value, α the learning rate, β the congestion factor, N_xjμ(t) the number of pending sharing request messages in the buffer queue of node x_jμ at time t, T′_xjμ the stipulated time for a node x_jμ in pj × {xj1, xj2, ..., xjm} to handle one sharing request message, and T_xjμ the time actually required by node x_jμ to handle one sharing request message; the function I[x] takes the value 1 when x > 0 and 0 when x ≤ 0 (a sketch follows).
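Under the notation above, the expected value can be computed as in the following sketch, where the indicator I[x] is 1 for x > 0 and 0 otherwise; the parameter names mirror the formula and the sample figures are illustrative only.

```python
def expected_value(E_old, E_learn, alpha, beta, N_t, T, T_prime):
    """E_new = E_old + α·E_learn + β·I[N(t)·(T − T′)/(T·T′)]·N(t)/T,
    where T is the actual and T′ the stipulated handling time of one
    sharing request, and N(t) the pending requests in the buffer queue."""
    x = N_t * (T - T_prime) / (T * T_prime)
    indicator = 1 if x > 0 else 0
    return E_old + alpha * E_learn + beta * indicator * N_t / T

# Forward the request to the neighbour with the largest expected value.
neighbours = {  # name -> (E_old, E_learn, N_t, T, T_prime), illustrative
    "x1": (0.5, 0.2, 3, 2.0, 1.5),
    "x2": (0.4, 0.3, 1, 1.0, 1.5),
}
best = max(
    neighbours,
    key=lambda n: expected_value(neighbours[n][0], neighbours[n][1],
                                 alpha=0.1, beta=0.05,
                                 N_t=neighbours[n][2],
                                 T=neighbours[n][3],
                                 T_prime=neighbours[n][4]),
)
print(best)  # 'x1' with these illustrative figures
```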
The task scheduling subsystem schedules the data handling process: a complex data processing computing task is split into a group of single-function, independent subtasks, and each subtask is matched with a cloud service resource pool that meets its demand, forming a service composition scheme, so as to obtain the storage or computing resources the data handling process needs. The service composition schemes are estimated according to the task scheduling of the generated data service:
(1) according to the cloud service resource pool SPv and the corresponding service-quality history, the efficiency function X of CSγ is modelled and each parameter of the efficiency function in the model is initialized according to the application instance. Let the task scheduling correspond to a set of subtasks {Gv} with QoS constraint C = {C1, C2, ..., Cd}; each subtask Gv corresponds to a resource pool SPv sharing mv services; for each service SPvω in the cloud service resource pool SPv, the number of historical records it contains is L; the γ-th feasible service composition scheme formed from the SPv is CSγ, ω ∈ [1, mv]. The service model is defined in terms of the following quantities:
QoSmax(k), the maximum service-quality value of the k-th dimension; QoSmin(k), the minimum service-quality value of the k-th dimension; d, the maximum dimension; qd(·), the optimization function; SPRh, the historical records belonging to SPvω; and xvω−h, the parameters of the efficiency function in the model;
(2) the feasible service composition schemes are sorted in ascending order of efficiency function value, and the first Z are selected as preferred service composition schemes, the value of Z being set according to the application instance;
(3) for each group of preferred service composition schemes, the average of its efficiency function values is calculated;
(4) the preferred service composition scheme with the largest average efficiency function value is selected as the optimal service composition scheme.
The efficiency function values of the preferred service composition schemes and the optimal service composition scheme are recorded and learnt from as samples; if a new preferred service composition scheme has occurred before, its function value is invoked directly (a sketch of steps (2)-(4) follows).
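Steps (2)-(4) reduce to ranking feasible schemes by efficiency value and comparing group averages. A minimal sketch, assuming each scheme is represented by the list of efficiency values of its constituent services (the efficiency function itself, per the service model above, would be fitted from the pool's history):

```python
def pick_optimal(schemes, Z):
    """schemes: mapping scheme name -> list of per-service efficiency
    values. Steps (2)-(4): rank schemes by total efficiency ascending,
    keep the first Z as preferred, then return the preferred scheme
    whose average efficiency value is largest."""
    ranked = sorted(schemes, key=lambda s: sum(schemes[s]))
    preferred = ranked[:Z]
    return max(preferred, key=lambda s: sum(schemes[s]) / len(schemes[s]))

candidates = {  # illustrative feasible compositions CSγ
    "CS1": [0.8, 0.6, 0.7],
    "CS2": [0.9, 0.4],
    "CS3": [0.5, 0.5, 0.9],
}
print(pick_optimal(candidates, Z=2))  # 'CS2' with these illustrative values
```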
Taking network topic data mining as an example, on the basis of the constructed index structure, the present invention uses the differences in the contribution values of each class of training features to the construction of the sample space when the training feature set expresses the test feature set, and constructs a new sparse-expression dictionary with the matrix σ(Ic)Vc, where σ(Ic) is each class of training features and Vc is the dictionary contribution value matrix. A category-set constraint term is added to the sparse-expression constraint so that samples of the same category can cluster together in a space of smaller total size, effectively mining the hidden features of complex data. The network topic mining method based on big data of the present invention comprises the following steps:
Step 1: extract topic features from the topic text with a neural network.
Step 2: input the training feature set, training the classified dictionary with article samples of C types. The training feature set space is denoted I and expressed as I = [I1, I2, ..., Ic, ..., IC] ∈ R^(D×N), where D is the feature dimension of the training feature set, N the total number of training features and Ii the i-th class of samples; defining Ni as the number of training features per class, N = N1 + N2 + ... + Nc + ... + NC.
Step 3: regularize the training feature set, obtaining the regularized training feature set I;
Step 4: train a dictionary for each class of the training feature set; the training process is:
1. take out the c-th class of samples Ic and map Ic to the kernel space σ(Ic);
2. training of the sparse coding dictionary σ(Ic)Vc must satisfy a constraint whose optimization function combines the reconstruction error ||σ(Ic) − σ(Ic)VcSc||², a sparsity term weighted by α and a grouped-accumulation term weighted by δ;
here α is the constraint factor of the sparsity constraint in the sparse coding, δ the constraint factor of the grouped-accumulation constraint in the coding dictionary for Ic, and Sc the feature matrix of the c-th class of kernel-space training features, whose m-th row indicates the contribution of the kernel-space samples to each entry in the dictionary building; the dictionary is Bc = σ(Ic)Vc, with σ denoting the mapping of samples into the kernel space.
3. solve the optimization function of the constraint in step 2: first initialize Vc and Sc by randomly generating two matrices, where Vc is an Nc × K matrix, Sc a K × Nc matrix and K the dictionary size; then alternately iterate and update Vc and Sc, seeking the optimal contribution value matrix Vc and feature matrix Sc that minimize the optimization function; the contribution value matrices Vc of every class of training features are assembled into one matrix, giving the contribution value matrix V, which is the classified dictionary. The specific solution procedure is:
(1) fix Vc and update Sc: substituting Vc into the optimization function of the constraint leaves only Sc free, i.e. it converts into minimizing ||σ(Ic) − σ(Ic)VcSc||² plus the α-weighted sparsity term over Sc;
each element of the Sc matrix is updated while keeping the optimization function optimal, that is, the element in row k and column n of Sc is determined so as to seek the optimal feature matrix Sc;
(2) fix the feature matrix Sc just sought and update the contribution value matrix Vc; the optimization function converts into:
f(Vc) = ||σ(Ic) − σ(Ic)VcSc||²
each column of the contribution value matrix Vc is updated in turn; while one column is being updated, the remaining columns are held fixed;
traverse every column of Vc to update the contribution values of Vc;
(3) iterate the above steps (1) and (2) to update the contribution values of Sc and Vc; when the optimization function value f(Vc, Sc) tends to be stable, the update is finished;
(4) train the feature matrix Sc and contribution value matrix Vc of each class of the training feature set in turn;
(5) the contribution value matrices Vc of the individual classes are integrated into a contribution value matrix V with N rows and C × K columns, which is the classified dictionary (a sketch of the alternating update follows).
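A minimal numpy sketch of the alternating update in steps (1)-(4), assuming a linear kernel (so σ(Ic) = Ic), a soft-thresholded gradient step for Sc and a closed-form least-squares update for Vc; these concrete solver choices are assumptions for illustration.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal step for the sparsity term on S."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def train_class_dictionary(I_c, K, alpha=0.01, lr=1e-3, iters=200, seed=0):
    """Alternating minimization of ||I_c - I_c @ V_c @ S_c||^2 + sparsity:
    step (1) fixes V_c and updates S_c; step (2) fixes S_c and solves
    for V_c in closed form; step (3) iterates until stable."""
    rng = np.random.default_rng(seed)
    Nc = I_c.shape[1]
    V = rng.standard_normal((Nc, K))
    S = rng.standard_normal((K, Nc))
    for _ in range(iters):
        B = I_c @ V                       # current dictionary σ(I_c)V_c
        # (1) gradient step on S, then soft-threshold for sparsity.
        grad_S = B.T @ (B @ S - I_c)
        S = soft_threshold(S - lr * grad_S, alpha * lr)
        # (2) least-squares update of V with S fixed:
        # argmin_V ||I_c - I_c V S||^2 = pinv(I_c) I_c pinv(S).
        V = np.linalg.pinv(I_c) @ I_c @ np.linalg.pinv(S)
    return V, S

I_c = np.random.randn(20, 15)  # D=20 features, Nc=15 samples of one class
V_c, S_c = train_class_dictionary(I_c, K=8)
print(np.linalg.norm(I_c - I_c @ V_c @ S_c))  # reconstruction residual
```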
Step 5: identify the text, with the following steps:
(1) extract the text features of the test feature set to be identified with the neural network, defining y as the feature of the test sample topic to be identified;
(2) predict the test feature σ(y) with the obtained contribution value matrix V; the prediction function obtained is:
F(s) = ||σ(y) − σ(I)V·s||² + 2α·||s||1
where s denotes the sparse coding of the test feature σ(y) and σ(I) the mapping of the training feature set I in the kernel space;
(3) find the prediction error of the kernel-space feature σ(y) in the sample space constituted by each class of samples, denoted r(c) and expressed as:
r(c) = ||σ(y) − σ(Ic)Vc·sc||²
where sc is the part of the sparse coding s corresponding to class c;
(4) compare the prediction errors of the kernel-space feature σ(y) against each class of samples; the text to be identified belongs to the class with the smallest prediction error (a sketch follows).
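Identification then reduces to computing, for each class, the reconstruction residual r(c) of the test feature in that class's sample space and taking the class with the smallest error; a linear-kernel numpy sketch with ridge-regularized codes standing in for the sparse codes (an assumption for brevity):

```python
import numpy as np

def classify(y, class_dicts, alpha=0.01):
    """Return the class whose dictionary reconstructs y with the
    smallest prediction error r(c) = ||y - I_c @ V_c @ s_c||^2,
    with s_c obtained by ridge-regularized least squares."""
    errors = {}
    for c, (I_c, V_c) in class_dicts.items():
        B = I_c @ V_c                               # class-c dictionary
        s = np.linalg.solve(B.T @ B + alpha * np.eye(B.shape[1]), B.T @ y)
        errors[c] = np.linalg.norm(y - B @ s) ** 2  # r(c)
    return min(errors, key=errors.get)

rng = np.random.default_rng(1)
dicts = {c: (rng.standard_normal((20, 15)), rng.standard_normal((15, 8)))
         for c in ("sports", "tech")}
y = dicts["tech"][0] @ dicts["tech"][1] @ rng.standard_normal(8)  # in tech's span
print(classify(y, dicts))  # expected: 'tech'
```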
In conclusion the present invention proposes a kind of data mining optimization method based on MapReduce, for distributed ring The back end in border facilitates user by matching service description information to use data, improves the efficiency of data mining;It is logical The computing resource provided using cloud service or storage resource are provided and provide a feasible scheme to develop structure data service.
Obviously, those skilled in the art should understand that each module or each step of the above invention can be realized with a general-purpose computing system; they can be concentrated on a single computing system or distributed over a network formed by multiple computing systems, and optionally they can be realized with program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only for exemplary illustration or explanation of the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention shall be included in its protection scope. In addition, the appended claims are intended to cover all variations and modifications falling within the scope and boundary of the appended claims, or the equivalents of such scope and boundary.

Claims (6)

1. A data mining optimization method based on MapReduce, characterized by comprising:
defining, in the MapReduce computing framework, the mapping relations between virtual compute nodes and real compute nodes;
in the Map stage, first locating a virtual compute node, then searching for the corresponding real compute node according to the virtual compute node, and mapping each cluster onto one node;
handing the node data to the Reduce stage for merging, and outputting the query result.
2. The method according to claim 1, characterized in that defining the mapping relations between virtual compute nodes and real compute nodes further comprises:
designing a new file format HMF at the bottom layer, such that for a user HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping relation between a virtual compute node and a real compute node.
3. The method according to claim 1, characterized in that before the step of locating virtual compute nodes in the Map stage, the method further includes:
organizing the entire hash value space into a virtual ring joined end to end;
hashing the network address of each compute node as the keyword, so that each node determines its position on the hash space;
mapping an HMF file with the hash function to a value on the hash space and, walking clockwise from that value, taking the first node encountered as its processing node.
4. The method according to claim 3, characterized by further comprising:
in the Map stage, when searching nodes according to HMF, searching for the corresponding real compute node according to the virtual compute node, and mapping each cluster onto one node.
5. The method according to claim 1, characterized in that after the step of mapping each cluster onto one node, the method further includes:
in the probe stage, collecting the load data of each node; once an imbalance is found, the clusters mapped to the node are reassigned to new nodes, the number of which is determined by the load; after the migration, the original node's resources are reclaimed for redistribution;
after the hash join of each node completes, handing the data of the new nodes and the original nodes together to the Reduce stage for merging.
6. The method according to claim 1, characterized in that the Map task of each node selects one of the tables as the hash-join base table to build the hash table, takes the join attribute participating in the join operation as the hash key, and reads the join attribute field of the base table from the HMF file system into the node memory of the MapReduce distributed system; then a hash function operation is performed on all key values of the join field;
the hash-processed base table join columns are stored, together with their data, in a memory area specially opened up for such data; then, according to the different hash function values, the base table is partitioned into clusters, each cluster containing all base table rows with the same hash function value.
CN201810059358.XA 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce Pending CN108280176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810059358.XA CN108280176A (en) 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810059358.XA CN108280176A (en) 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce

Publications (1)

Publication Number Publication Date
CN108280176A true CN108280176A (en) 2018-07-13

Family

ID=62804453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810059358.XA Pending CN108280176A (en) 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce

Country Status (1)

Country Link
CN (1) CN108280176A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268333A (en) * 2021-06-21 2021-08-17 成都深思科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core calculation
CN114707039A (en) * 2022-03-29 2022-07-05 安徽体育运动职业技术学院 Rapid data management method based on mass data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945240A (en) * 2012-09-11 2013-02-27 杭州斯凯网络科技有限公司 Method and device for realizing association rule mining algorithm supporting distributed computation
CN104852934A (en) * 2014-02-13 2015-08-19 阿里巴巴集团控股有限公司 Method for realizing flow distribution based on front-end scheduling, device and system thereof
US9485197B2 (en) * 2014-01-15 2016-11-01 Cisco Technology, Inc. Task scheduling using virtual clusters
CN107197035A (en) * 2017-06-21 2017-09-22 中国民航大学 A kind of compatibility dynamic load balancing method based on uniformity hash algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945240A (en) * 2012-09-11 2013-02-27 杭州斯凯网络科技有限公司 Method and device for realizing association rule mining algorithm supporting distributed computation
US9485197B2 (en) * 2014-01-15 2016-11-01 Cisco Technology, Inc. Task scheduling using virtual clusters
CN104852934A (en) * 2014-02-13 2015-08-19 阿里巴巴集团控股有限公司 Method for realizing flow distribution based on front-end scheduling, device and system thereof
CN107197035A (en) * 2017-06-21 2017-09-22 中国民航大学 A kind of compatibility dynamic load balancing method based on uniformity hash algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GERMANY006: "Oracle table join operations — Hash Join (part 1)", http://blog.itpub.net/28371090/viewspace-1184848/ *
Li Weiwei et al.: "Research on massive data mining technology based on MapReduce", Computer Engineering and Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268333A (en) * 2021-06-21 2021-08-17 成都深思科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core calculation
CN113268333B (en) * 2021-06-21 2024-03-19 成都锋卫科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core computing
CN114707039A (en) * 2022-03-29 2022-07-05 安徽体育运动职业技术学院 Rapid data management method based on mass data

Similar Documents

Publication Publication Date Title
US9805079B2 (en) Executing constant time relational queries against structured and semi-structured data
CN109919316A (en) The method, apparatus and equipment and storage medium of acquisition network representation study vector
CN105653691B (en) Management of information resources method and managing device
CN105760443B (en) Item recommendation system, project recommendation device and item recommendation method
Lin et al. Website reorganization using an ant colony system
CN106462620A (en) Distance queries on massive networks
CN106815310A (en) A kind of hierarchy clustering method and system to magnanimity document sets
CN106095951B (en) Data space multi-dimensional indexing method based on load balancing and inquiry log
Cao et al. HitFraud: a broad learning approach for collective fraud detection in heterogeneous information networks
Motwani et al. A study on initial centroids selection for partitional clustering algorithms
CN113222181A (en) Federated learning method facing k-means clustering algorithm
CN108228787A (en) According to the method and apparatus of multistage classification processing information
Abdelli et al. A novel and efficient index based web service discovery approach
CN107066328A (en) The construction method of large-scale data processing platform
Rahman et al. Seed-Detective: A Novel Clustering Technique Using High Quality Seed for K-Means on Categorical and Numerical Attributes.
CN108280176A (en) Data mining optimization method based on MapReduce
CN107193940A (en) Big data method for optimization analysis
CN108256083A (en) Content recommendation method based on deep learning
CN108256086A (en) Data characteristics statistical analysis technique
Xiao et al. ORHRC: Optimized recommendations of heterogeneous resource configurations in cloud-fog orchestrated computing environments
Al Aghbari et al. Geosimmr: A mapreduce algorithm for detecting communities based on distance and interest in social networks
Alhaj Ali et al. Distributed data mining systems: techniques, approaches and algorithms
Guo et al. K-loop free assignment in conference review systems
CN107103095A (en) Method for computing data based on high performance network framework
Jayachitra Devi et al. Link prediction model based on geodesic distance measure using various machine learning classification models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180713