CN108280176A - Data mining optimization method based on MapReduce - Google Patents

Data mining optimization method based on MapReduce

Info

Publication number
CN108280176A
CN108280176A (Application CN201810059358.XA)
Authority
CN
China
Prior art keywords
node
data
hash
virtual machine
cluster
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810059358.XA
Other languages
Chinese (zh)
Inventor
李垚霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Boruide Science & Technology Co Ltd
Original Assignee
Chengdu Boruide Science & Technology Co Ltd
Application filed by Chengdu Boruide Science & Technology Co Ltd filed Critical Chengdu Boruide Science & Technology Co Ltd
Priority to CN201810059358.XA priority Critical patent/CN108280176A/en
Publication of CN108280176A publication Critical patent/CN108280176A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9014Indexing; Data structures therefor; Storage structures hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data mining optimization method based on MapReduce, comprising: defining, in the MapReduce computing framework, the mapping relations between virtual compute nodes and real compute nodes; in the Map stage, first locating a virtual compute node, then searching for the corresponding real compute node according to the virtual node, and mapping each cluster onto one node; and handing the node data to the Reduce stage for merging and outputting the query result. The present invention thereby improves the efficiency of data mining over the data nodes of a distributed environment.

Description

Data mining optimization method based on MapReduce
Technical field
The present invention relates to data mining, and more particularly to a data mining optimization method based on MapReduce.
Background art
Performing data aggregation and analysis on large-scale distributed data nodes requires efficient data mining methods. In the current related art, traditional centralized data management and retrieval methods suffer from single points of failure and poor scalability, and cannot satisfy the demand for flexible, scalable and robust data mining under a distributed environment. Therefore, how to use decentralized data node management and data mining methods to meet the scalable data node management and the data aggregation and analysis demands of building data services remains a challenging problem. In addition, existing big data parallel computing frameworks leave room for improvement in query time and cost during the indexing stage, and with traditional parallel sort-merge, if the data feature fields are unevenly distributed, efficiency drops markedly in the probe stage.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a data mining optimization method based on MapReduce, comprising:
defining, in the MapReduce computing framework, the mapping relations between virtual compute nodes and real compute nodes;
in the Map stage, first locating a virtual compute node, then searching for the corresponding real compute node according to the virtual node, and mapping each cluster onto one node;
handing the node data to the Reduce stage for merging, and outputting the query result.
Preferably, defining the mapping relations between virtual compute nodes and real compute nodes further comprises:
designing a new file format HMF at the bottom layer, such that for a user HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping relation between a virtual compute node and a real compute node.
Preferably, before the step of locating virtual compute nodes in the Map stage, the method further includes:
organizing the entire hash value space into a virtual ring joined end to end;
hashing the network address of each compute node as the keyword, so that each node determines its position on the hash space;
mapping an HMF file with the hash function to a value on the hash space and, walking clockwise from that value, taking the first node encountered as its processing node.
Preferably,
In the Map stage, when searching nodes according to HMF, the corresponding real compute node is searched according to the virtual compute node, and each cluster is mapped onto one node.
Preferably, after the step of mapping each cluster onto one node, the method further includes:
in the probe stage, collecting the load data of each node; once an imbalance is found, the clusters mapped to the node are reassigned to new nodes, the number of which is determined by the load; after the migration, the original node's resources are reclaimed for redistribution;
after the hash join of each node completes, the data of the new nodes and the original nodes are handed together to the Reduce stage for merging.
Preferably, the Map task of each node selects one of the tables as the hash-join base table to build the hash table, takes the join attribute participating in the join operation as the hash key, and reads the join attribute field of the base table from the HMF file system into the node memory of the MapReduce distributed system; then a hash function operation is performed on all key values of the join field;
the hash-processed base table join columns are stored, together with their data, in a memory area specially opened up for such data; then, according to the different hash function values, the base table is partitioned into clusters, each cluster containing all base table rows with the same hash function value.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data mining optimization method based on MapReduce that, for the data nodes of a distributed environment, lets users consume data conveniently by matching service description information and improves the efficiency of data mining; it also provides a feasible scheme for building data services on the computing or storage resources offered by cloud services.
Description of the drawings
Fig. 1 is a flowchart of the data mining optimization method based on MapReduce according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawings that illustrate the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment; its scope is limited only by the claims, and it covers many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention; they are provided for exemplary purposes, and the invention may also be practised according to the claims without some or all of these details.
One aspect of the present invention provides a data mining optimization method based on MapReduce. Fig. 1 is a flowchart of the method according to an embodiment of the present invention.
The data feature mining system of the present invention includes a storage subsystem, a feature classification subsystem, a trusted key subsystem, a feature mining subsystem and a task scheduling subsystem.
The trusted key subsystem ensures that data can be obtained only after identity authentication, and covers key generation, identity verification and decryption. The key generation algorithm is as follows:
1) the data is divided into blocks of the key string's length;
2) each character of the plaintext and the key is replaced by an integer in the range 0-26, with space = 00, A = 01, ..., Z = 26;
3) for each plaintext block, each of its characters is replaced by the corresponding calculated value, obtained by adding the character's integer code to the integer code of the character at the corresponding position in the key and taking the remainder modulo 27;
4) each calculated value is then substituted back with its equivalent character.
Identity verification is realized through user login and voiceprint verification; a user who passes identity verification obtains the key through the decryption module and completes decryption (a sketch of the key scheme follows).
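For illustration, the following is a minimal Python sketch of the 27-symbol key scheme above (space = 00, A = 01, ..., Z = 26), in which plaintext and key codes are added modulo 27; the function names are illustrative and not part of the specification.

```python
# Minimal sketch of the 27-symbol key scheme described above
# (space = 0, A = 1, ..., Z = 26); names are illustrative only.

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # index = integer code

def encode(ch: str) -> int:
    """Map a character to its integer code in the 27-symbol alphabet."""
    return ALPHABET.index(ch)

def encrypt(plaintext: str, key: str) -> str:
    """Add key codes to plaintext codes modulo 27, block by block."""
    out = []
    for i, ch in enumerate(plaintext):
        code = (encode(ch) + encode(key[i % len(key)])) % 27
        out.append(ALPHABET[code])
    return "".join(out)

def decrypt(ciphertext: str, key: str) -> str:
    """Subtract key codes modulo 27 to invert the substitution."""
    out = []
    for i, ch in enumerate(ciphertext):
        code = (encode(ch) - encode(key[i % len(key)])) % 27
        out.append(ALPHABET[code])
    return "".join(out)

if __name__ == "__main__":
    key = "KEY"
    ct = encrypt("HELLO WORLD", key)
    assert decrypt(ct, key) == "HELLO WORLD"
```

Decryption subtracts the same key codes modulo 27, which is why the back-substitution of step 4 recovers the plaintext exactly.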
The storage subsystem includes a storage module and a disaster tolerance module. The storage module authenticates the nodes in the network where information is to be stored and builds trust relations for the stored information; based on the data distributed under the distributed environment, it packages and stores the feature data, and uses a composite feature index structure to obtain faster query speed for both text-type and numeric-type data. The disaster tolerance module restores data in case of loss or destruction.
On the basis of traditional indexes, the storage module separates the data attribute keys and attribute values in the data feature set and builds a two-layer feature index structure. First, a high-layer index is built for the attributes of the data in the feature set. Then a feature index is constructed for the key values under each high-layer feature attribute: an R-tree feature index for numeric data and an inverted feature index for text data. A range query on numeric data is thus routed directly to the low-layer tree index, reducing query time and cost.
The high-layer tree feature index is built for the feature attributes. In this layer, the specific feature attributes contained in the data feature set are stored entirely in the non-leaf objects, while every leaf object of the R-tree stores three pieces of information, Ai, Pcat and Psi, whose meanings are: (1) Ai is a specific feature attribute of the indexed data feature set, where n is the number of all feature attributes and i ∈ [1, n]; (2) Pcat indicates the pointer type; (3) Psi is a pointer to the low-layer feature index; depending on the data type, it points either to the header of an inverted document table or to the root node of an R-tree.
The low-layer feature index is constructed for the key values under a high-layer feature attribute, and comprises R-tree feature indexes built for numeric data and inverted document table indexes built for text data. The actual key values are stored in the non-leaf objects of the R-tree structure, while the leaf objects are ordered and contain three pieces of feature index file information, RS, Pos and Fileid: (1) RS is the S-th attribute key value of the R-th feature attribute key, R ∈ [1, n2], S ∈ [1, p], where n2 is the number of numeric feature attributes in the data feature set and p is the number of feature values of the R-th attribute key; (2) Pos is the location of the file containing this attribute value; (3) Fileid is the ID of the file containing the query feature word.
The inverted feature index is divided into two parts: the first is a feature index table composed of the different index words, recording the distinct text keywords and their related information; the second records, for each index word, the set of documents in which it occurs and their storage addresses. Specifically, the inverted feature index structure contains four pieces of information, Aij, Fileid, Pos and Freq: (1) Aij is the j-th feature attribute value of the i-th feature attribute key, i ∈ [1, n1], j ∈ [1, m], where n1 is the number of text attributes and m is the number of attribute values of the i-th attribute key; (2) Fileid is the unique ID of the file containing the query feature word; (3) Pos is the position of that file; (4) Freq is the frequency with which the query feature word occurs in the data feature set. The feature index is created as follows:
Step 1: analyse the data for which a feature index is to be constructed; if the current data is not present in the indexes already built, create a new feature index object at the high layer of the composite feature index;
Step 2: judge the type of the new data's feature attribute value: if numeric, build an R-tree feature index for it; if text, build an inverted feature index structure for it;
Step 3: repeat step 1; if the current attribute already exists in a previously built feature index, do not add a new object to the high layer, and only add the data of that attribute to the corresponding low-layer feature index;
Step 4: repeat the above steps until feature indexes have been constructed for all the data (a sketch of this build procedure follows).
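A minimal sketch of the build procedure in steps 1-4, assuming a Python dict for the high-layer attribute index and simple stand-ins for the low-layer structures (a sorted list in place of the R-tree, a dict in place of the inverted table); all class and method names are illustrative.

```python
from collections import defaultdict

class CompositeFeatureIndex:
    """Sketch of the two-layer index: high layer keyed by attribute,
    low layer an R-tree stand-in (sorted list) for numeric values or
    an inverted table (dict) for text values."""

    def __init__(self):
        self.high_layer = {}  # attribute key -> ("numeric"|"text", low-layer index)

    def insert(self, attr: str, value, file_id: str):
        if attr not in self.high_layer:
            # Steps 1-2: new high-layer object plus a low-layer index
            # chosen by the attribute's value type.
            if isinstance(value, (int, float)):
                self.high_layer[attr] = ("numeric", [])               # R-tree stand-in
            else:
                self.high_layer[attr] = ("text", defaultdict(list))   # inverted table
        kind, low = self.high_layer[attr]
        # Step 3: existing attribute -> only extend the low-layer index.
        if kind == "numeric":
            low.append((value, file_id))
            low.sort()  # keeps range queries simple in this sketch
        else:
            low[value].append(file_id)

    def range_query(self, attr: str, lo, hi):
        kind, low = self.high_layer.get(attr, ("numeric", []))
        return [fid for v, fid in low if lo <= v <= hi] if kind == "numeric" else []

idx = CompositeFeatureIndex()
idx.insert("price", 10.5, "f1")
idx.insert("price", 20.0, "f2")
idx.insert("title", "mapreduce", "f1")
print(idx.range_query("price", 5, 15))  # ['f1']
```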
During index retrieval, the query condition is first analysed to obtain feature words, and each query feature word is looked up in the index dictionary. If the index flag is false, a null value is returned, indicating that the queried feature data does not exist in the index file; if true, the data type of the query word's result is judged, the corresponding feature index is located according to the type, and the feature word's ID and the number of documents containing it are read, from which the query condition's related information is obtained. The contents of the R-tree feature index or the inverted index are then read according to the feature word ID, the retrieved contents are integrated and compared for relevance against the search condition, and the query results are ranked to produce the final result returned to the user. The search algorithm takes the key value key_id in the feature table as input and outputs a Boolean value, as follows:
(1) with the root block root, key_id and the level number level as input parameters, call the search function lookup(root, key_id, level) and assign the search result to the leaf node record;
(2) if the leaf node record is empty, return a null value directly; otherwise return the real search result rid.
The search function lookup takes the current block as input, key as the search keyword and level as the initial level number, and outputs the leaf record that may contain the search keyword key, as follows (a sketch follows the steps):
(3.1) if the current block is a leaf node, search for the key with binary search and report the result;
(3.2) if the current block is not a leaf node, perform steps (3.3) to (3.6);
(3.3) using the current block and the key value, select the subtree containing the key value and obtain the block number of the child node;
(3.4) read the child node block it contains into the buffer according to the block number;
(3.5) if the child node block found is a leaf node, return to (3.1);
(3.6) if the child node block is a branch block, take the child node block, key and level minus 1 as the new input and call the function recursively to return the output result.
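The lookup of steps (1)-(3.6) can be sketched as a recursive descent with binary search at the leaves; the block layout below is an assumed simplification, not the actual storage format.

```python
import bisect

class Block:
    """Simplified tree block: leaves hold sorted (key, rid) pairs,
    branches hold separator keys and child blocks."""
    def __init__(self, is_leaf, keys, payload):
        self.is_leaf = is_leaf
        self.keys = keys          # sorted keys
        self.payload = payload    # rids at leaves, child Blocks at branches

def lookup(block, key, level):
    if block.is_leaf:
        # Step (3.1): binary search within the leaf block.
        i = bisect.bisect_left(block.keys, key)
        if i < len(block.keys) and block.keys[i] == key:
            return block.payload[i]
        return None
    # Steps (3.3)-(3.6): pick the subtree containing the key,
    # decrement the level, and recurse.
    i = bisect.bisect_right(block.keys, key)
    return lookup(block.payload[i], key, level - 1)

# Tiny two-level tree: separator 10 splits two leaves.
leaf1 = Block(True, [3, 7], ["rid3", "rid7"])
leaf2 = Block(True, [10, 15], ["rid10", "rid15"])
root = Block(False, [10], [leaf1, leaf2])
print(lookup(root, 15, 1))  # rid15
print(lookup(root, 8, 1))   # None -> search returns a null value
```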
The feature classification subsystem performs classification management on the feature data using clustering. The present invention determines the number of classes with the following method. First define the estimation density:
ps(k) = min_{1≤j≤k} { Σ_{i≠i′; i,i′∈A_kj} D[C(X_tr, k), X_te]_{i,i′} / (n_kj(n_kj − 1)) }
where X_tr and X_te are the feature training set and feature test set obtained by randomly dividing the original data; C(X_tr, k) denotes clustering the feature training set into k classes; A_k1, A_k2, ..., A_kk are the k classes into which the feature test set itself is clustered, with i and i′ sample points in the same class and n_kj the number of sample points in A_kj; D[C(X_tr, k), X_te] is a matrix over test sample pairs whose entry at row i and column i′ takes the value 0 or 1, value 1 indicating that the training-set clustering also places i and i′ in the same class and value 0 that it does not; and ps(k) is the estimation density of the clustering result when the number of classes is k.
The estimation density is computed as follows:
(1) the original data to be clustered is randomly divided into a feature training set and a feature test set;
(2) taking the number of classes as k, the above two subsets are clustered, the result being denoted type-I clustering;
(3) the feature test set is discriminated with the clustering result of the feature training set, the result being denoted type-II clustering;
(4) for the k-th class into which the feature test set itself is clustered, it is examined whether any pair of sample points i and i′ is wrongly divided into different classes in the type-II clustering, and the proportion of correctly divided pairs is recorded;
(5) the smallest of the k proportions is the estimation density under the current number of classes k.
With the estimation density as the optimization function, the number of classes and the variable subset are the factors influencing it; the estimation density is maximized by selecting a suitable number of classes and variable subset, as sketched below.
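A minimal sketch of the estimation-density computation in steps (1)-(5), assuming k-means (scikit-learn's KMeans) as the underlying clustering; the library choice and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimation_density(X, k, seed=0):
    """Steps (1)-(5): split the data, cluster both halves, and take the
    minimum over test clusters of the proportion of point pairs that the
    training clustering also puts in one class."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    X_tr, X_te = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]

    train_model = KMeans(n_clusters=k, n_init=10).fit(X_tr)    # type-I clustering
    te_labels = KMeans(n_clusters=k, n_init=10).fit_predict(X_te)
    te_by_train = train_model.predict(X_te)                    # type-II clustering

    ratios = []
    for j in range(k):
        members = np.where(te_labels == j)[0]
        n = len(members)
        if n < 2:
            continue
        # Count pairs i != i' that the training clustering keeps together.
        same = sum(
            te_by_train[a] == te_by_train[b]
            for ai, a in enumerate(members) for b in members[ai + 1:]
        )
        ratios.append(same / (n * (n - 1) / 2))
    return min(ratios) if ratios else 0.0

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(estimation_density(X, 2))  # close to 1 for well-separated clusters
```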
To address the decline of hash probe efficiency when feature fields are unevenly distributed, the present invention bases the hash join algorithm of MapReduce data queries on feature-field storage, so that fields are evenly distributed across the nodes of the MapReduce distributed environment and data processing efficiency improves. The projection operation of query execution is turned into an operation on the feature fields of each node, reducing the I/O waste that repeated accesses to wide tables bring. In feature-field storage, the target object is pushed down to a specific feature field, each feature field amounting to a small table composed of (lineid, value) pairs.
To solve the data imbalance problem, a new file format HMF is first designed at the bottom layer of the MapReduce distributed computing framework, such that for a user HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping relation between a virtual compute node and a real compute node.
The parallel computation based on the hash algorithm is realized in the following steps:
Step 1: organize the entire hash value space into a virtual ring joined end to end;
Step 2: hash the network address of each compute node as the keyword, so that each node determines its position on the hash space;
Step 3: map an HMF file with the hash function to a value on the hash space and, walking clockwise from that value, take the first node encountered as its processing node;
Step 4: in the Map stage, when searching nodes according to HMF, what is found is a virtual compute node; the corresponding real compute node is then searched according to the virtual compute node, and each cluster is mapped onto one node;
Step 5: in the probe stage, collect the load data of each node; once an imbalance is found, the clusters mapped to the node are reassigned to new nodes, the number of which is determined by the load; after the migration the original node's resources are reclaimed for redistribution;
Step 6: after the hash join of each node completes, the data of the new nodes and the original nodes are handed together to the Reduce stage for merging, and finally the query result is output. (Steps 1-4 are sketched below.)
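Steps 1-4 amount to consistent hashing with virtual nodes. A minimal sketch, assuming MD5 as the hash function and illustrative node addresses and file names:

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Map a string onto the hash ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Steps 1-3: nodes hash their network address onto a ring; a file
    hashes to a point and is served by the first node found clockwise."""

    def __init__(self, addresses, replicas=4):
        self.ring = []  # sorted (position, real node) pairs
        for addr in addresses:
            for r in range(replicas):  # virtual nodes v(γi) per real node
                self.ring.append((h(f"{addr}#v{r}"), addr))
        self.ring.sort()

    def node_for(self, hmf_file: str) -> str:
        """Step 4: walk clockwise from the file's hash to the first node."""
        pos = h(hmf_file)
        i = bisect.bisect(self.ring, (pos,)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.node_for("part-00001.hmf"))  # real compute node for this HMF file
```

Virtual nodes are what keep the load even: when a node is added or removed in step 5, only the keys on its ring segments move, and its replicas spread that movement across the remaining nodes.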
In a hash join there are two relations R and S with tuple counts T_R and T_S, where T_R > T_S. A hash function initially partitions S into B clusters numbered 1, 2, ..., B; the common attributes of R and S are A1, A2, ..., Ak. For the component of the join attribute Ai of relation R distributed over m nodes, the hash operation determines which of the B clusters it matches.
The optimization of the above hash join is divided into two stages: build and probe.
In the build stage: the Map task of each node selects one of the tables as the hash-join base table to build the hash table, takes the join attribute participating in the join operation as the hash key, and reads the join attribute field of the base table from the HMF file system into the node memory of the MapReduce distributed system; a hash function operation is then performed on all key values of the join field.
The hash-processed base table join columns are stored, together with their data, in a memory area specially opened up for such data. Then, according to the different hash function values, the base table is partitioned into clusters, each cluster containing all base table rows with the same hash function value.
In the probe stage: the table joined against the base table on each node serves as the fact table; the data fields to be joined are read in batches from HMF in turn, a hash operation is done on the join attribute field to determine which cluster it falls in, and the hash search algorithm is used to locate the appropriate cluster for the search;
within the located cluster, the qualifying row numbers are obtained by exact matching. The qualifying row numbers of each node are merged by Reduce; the query columns involved in the SQL statement are read from the HMF file system, and finally the query result is output (a sketch of the two stages follows).
Because the search range of the algorithm has been narrowed, the success rate and accuracy of matching are high; and since the matching operation is carried out in memory, it is much faster than sort-merge, achieving the optimization purpose.
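The build and probe stages above correspond to a classic partitioned hash join; below is a minimal sketch with Python dicts standing in for the per-cluster hash tables and dicts of fields standing in for HMF rows (all names illustrative):

```python
from collections import defaultdict

B = 4  # number of clusters produced by the initial hash partition

def hash_join(base_table, fact_table, key):
    """Build stage: partition the smaller (base) table into B clusters
    keyed by the join attribute. Probe stage: hash each fact row to a
    cluster and match exactly inside that cluster only."""
    clusters = [defaultdict(list) for _ in range(B)]
    for row in base_table:                      # build
        k = row[key]
        clusters[hash(k) % B][k].append(row)
    for row in fact_table:                      # probe
        k = row[key]
        for match in clusters[hash(k) % B].get(k, []):
            yield {**match, **row}              # merged row handed to Reduce

base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
fact = [{"id": 2, "val": 10}, {"id": 3, "val": 20}]
print(list(hash_join(base, fact, "id")))  # [{'id': 2, 'name': 'b', 'val': 10}]
```

Building on the smaller relation is what keeps the hash tables memory-resident, which is where the speedup over sort-merge comes from.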
To reduce the influence of accidental factors, the data feature set is preferably first divided randomly into several equal parts, each part is used in turn as the feature test set, and after the respective estimation densities are found, their average is taken as the estimation density under that number of classes. The hierarchical clustering method based on the improved estimation density gives credible and practically meaningful clustering results for the data in the example, and clusters better than conventional clustering methods.
The feature mining subsystem searches for and matches, under a verified secure cloud environment, the feature data that meets the application demand from the data providers dispersed throughout the data layer of the cloud platform, and forms the feature data to be processed through aggregation, analysis and arrangement. The modelling phase uses the storage cluster to model the compute nodes in the distributed environment; feature data is shared among the non-local nodes, and the feature data meeting the application demand is searched and matched. Let xi be a node in the storage cluster, {xi1, xi2, ..., xim} its non-local node set, PLi the local resource pool, PNi the data pool of the non-local nodes, i ∈ [1, n], where n is the total number of nodes in the storage cluster and m the number of non-local nodes, m < n.
Data sharing uses the following protocol between non-local nodes: when xi joins the P2P network, it builds connections with {xi1, xi2, ..., xim}; xi then creates shared feature data according to the service information in PLi and forwards the shared feature data to all non-local nodes xim for sharing. When any node in the storage cluster receives a piece of shared feature data, it judges from the data's ID information whether it has already received it; if so, the shared feature data is discarded; if received for the first time, the contents of PNi are updated according to the data and node location information of the shared feature data, and whether to forward or discard the shared feature data is decided according to its validity flag. Data between non-local nodes needs to be synchronized periodically.
In resource searching, the operations executed are: let the node initiating a sharing request Mj be xj, and let pj × {xj1, xj2, ..., xjm} be the node set picked at random from xj's non-local node set with probability pj, j ∈ [1, n]. When a node xi receives the sharing request Mj sent by xj, it checks whether PNi and PLi contain feature data that satisfies Mj; if so, it creates a query response message according to the feature data and the location of its node, returns the response message to xj according to xj's location information, and decrements xj's validity flag by 1. If xj's validity flag is 0, the sharing request Mj is discarded; if not 0, the expected value of each node in pj × {xj1, xj2, ..., xjm} is calculated with the EM algorithm, and Mj is forwarded to the node in pj × {xj1, xj2, ..., xjm} with the largest expected value. The expected value is calculated as:
E_new = E_old + α·E_learn + β × I[N_xjμ(t) × (T_xjμ − T′_xjμ)/(T_xjμ × T′_xjμ)] × N_xjμ(t)/T_xjμ
where E_new is the new value of E, E_old the old value, E_learn the learnt value, α the learning rate, β the congestion factor, N_xjμ(t) the number of pending sharing request messages in the buffer queue of node x_jμ at time t, T′_xjμ the stipulated time for a node x_jμ in pj × {xj1, xj2, ..., xjm} to handle one sharing request message, and T_xjμ the time actually required by node x_jμ to handle one sharing request message; the function I[x] takes the value 1 when x > 0 and 0 when x ≤ 0 (a sketch follows).
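Under the notation above, the expected value can be computed as in the following sketch, where the indicator I[x] is 1 for x > 0 and 0 otherwise; the parameter names mirror the formula and the sample figures are illustrative only.

```python
def expected_value(E_old, E_learn, alpha, beta, N_t, T, T_prime):
    """E_new = E_old + α·E_learn + β·I[N(t)·(T − T′)/(T·T′)]·N(t)/T,
    where T is the actual and T′ the stipulated handling time of one
    sharing request, and N(t) the pending requests in the buffer queue."""
    x = N_t * (T - T_prime) / (T * T_prime)
    indicator = 1 if x > 0 else 0
    return E_old + alpha * E_learn + beta * indicator * N_t / T

# Forward the request to the neighbour with the largest expected value.
neighbours = {  # name -> (E_old, E_learn, N_t, T, T_prime), illustrative
    "x1": (0.5, 0.2, 3, 2.0, 1.5),
    "x2": (0.4, 0.3, 1, 1.0, 1.5),
}
best = max(
    neighbours,
    key=lambda n: expected_value(neighbours[n][0], neighbours[n][1],
                                 alpha=0.1, beta=0.05,
                                 N_t=neighbours[n][2],
                                 T=neighbours[n][3],
                                 T_prime=neighbours[n][4]),
)
print(best)  # 'x1' with these illustrative figures
```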
The task scheduling subsystem schedules the data handling process: a complex data processing computing task is split into a group of single-function, independent subtasks, and each subtask is matched with a cloud service resource pool that meets its demand, forming a service composition scheme, so as to obtain the storage or computing resources the data handling process needs. The service composition schemes are estimated according to the task scheduling of the generated data service:
(1) according to the cloud service resource pool SPv and the corresponding service-quality history, the efficiency function X of CSγ is modelled and each parameter of the efficiency function in the model is initialized according to the application instance. Let the task scheduling correspond to a set of subtasks {Gv} with QoS constraint C = {C1, C2, ..., Cd}; each subtask Gv corresponds to a resource pool SPv sharing mv services; for each service SPvω in the cloud service resource pool SPv, the number of historical records it contains is L; the γ-th feasible service composition scheme formed from the SPv is CSγ, ω ∈ [1, mv]. The service model is defined in terms of the following quantities:
QoSmax(k), the maximum service-quality value of the k-th dimension; QoSmin(k), the minimum service-quality value of the k-th dimension; d, the maximum dimension; qd(·), the optimization function; SPRh, the historical records belonging to SPvω; and xvω−h, the parameters of the efficiency function in the model;
(2) the feasible service composition schemes are sorted in ascending order of efficiency function value, and the first Z are selected as preferred service composition schemes, the value of Z being set according to the application instance;
(3) for each group of preferred service composition schemes, the average of its efficiency function values is calculated;
(4) the preferred service composition scheme with the largest average efficiency function value is selected as the optimal service composition scheme.
The efficiency function values of the preferred service composition schemes and the optimal service composition scheme are recorded and learnt from as samples; if a new preferred service composition scheme has occurred before, its function value is invoked directly (a sketch of steps (2)-(4) follows).
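Steps (2)-(4) reduce to ranking feasible schemes by efficiency value and comparing group averages. A minimal sketch, assuming each scheme is represented by the list of efficiency values of its constituent services (the efficiency function itself, per the service model above, would be fitted from the pool's history):

```python
def pick_optimal(schemes, Z):
    """schemes: mapping scheme name -> list of per-service efficiency
    values. Steps (2)-(4): rank schemes by total efficiency ascending,
    keep the first Z as preferred, then return the preferred scheme
    whose average efficiency value is largest."""
    ranked = sorted(schemes, key=lambda s: sum(schemes[s]))
    preferred = ranked[:Z]
    return max(preferred, key=lambda s: sum(schemes[s]) / len(schemes[s]))

candidates = {  # illustrative feasible compositions CSγ
    "CS1": [0.8, 0.6, 0.7],
    "CS2": [0.9, 0.4],
    "CS3": [0.5, 0.5, 0.9],
}
print(pick_optimal(candidates, Z=2))  # 'CS2' with these illustrative values
```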
Taking network topic data mining as an example, on the basis of the constructed index structure, the present invention uses the differences in the contribution values of each class of training features to the construction of the sample space when the training feature set expresses the test feature set, and constructs a new sparse-expression dictionary with the matrix σ(Ic)Vc, where σ(Ic) is each class of training features and Vc is the dictionary contribution value matrix. A category-set constraint term is added to the sparse-expression constraint so that samples of the same category can cluster together in a space of smaller total size, effectively mining the hidden features of complex data. The network topic mining method based on big data of the present invention comprises the following steps:
Step 1: extract topic features from the topic text with a neural network.
Step 2: input the training feature set, training the classified dictionary with article samples of C types. The training feature set space is denoted I and expressed as I = [I1, I2, ..., Ic, ..., IC] ∈ R^(D×N), where D is the feature dimension of the training feature set, N the total number of training features and Ii the i-th class of samples; defining Ni as the number of training features per class, N = N1 + N2 + ... + Nc + ... + NC.
Step 3: regularize the training feature set, obtaining the regularized training feature set I;
Step 4: train a dictionary for each class of the training feature set; the training process is:
1. take out the c-th class of samples Ic and map Ic to the kernel space σ(Ic);
2. training of the sparse coding dictionary σ(Ic)Vc must satisfy a constraint whose optimization function combines the reconstruction error ||σ(Ic) − σ(Ic)VcSc||², a sparsity term weighted by α and a grouped-accumulation term weighted by δ;
here α is the constraint factor of the sparsity constraint in the sparse coding, δ the constraint factor of the grouped-accumulation constraint in the coding dictionary for Ic, and Sc the feature matrix of the c-th class of kernel-space training features, whose m-th row indicates the contribution of the kernel-space samples to each entry in the dictionary building; the dictionary is Bc = σ(Ic)Vc, with σ denoting the mapping of samples into the kernel space.
3. solve the optimization function of the constraint in step 2: first initialize Vc and Sc by randomly generating two matrices, where Vc is an Nc × K matrix, Sc a K × Nc matrix and K the dictionary size; then alternately iterate and update Vc and Sc, seeking the optimal contribution value matrix Vc and feature matrix Sc that minimize the optimization function; the contribution value matrices Vc of every class of training features are assembled into one matrix, giving the contribution value matrix V, which is the classified dictionary. The specific solution procedure is:
(1) fix Vc and update Sc: substituting Vc into the optimization function of the constraint leaves only Sc free, i.e. it converts into minimizing ||σ(Ic) − σ(Ic)VcSc||² plus the α-weighted sparsity term over Sc;
each element of the Sc matrix is updated while keeping the optimization function optimal, that is, the element in row k and column n of Sc is determined so as to seek the optimal feature matrix Sc;
(2) fix the feature matrix Sc just sought and update the contribution value matrix Vc; the optimization function converts into:
f(Vc) = ||σ(Ic) − σ(Ic)VcSc||²
each column of the contribution value matrix Vc is updated in turn; while one column is being updated, the remaining columns are held fixed;
traverse every column of Vc to update the contribution values of Vc;
(3) iterate the above steps (1) and (2) to update the contribution values of Sc and Vc; when the optimization function value f(Vc, Sc) tends to be stable, the update is finished;
(4) train the feature matrix Sc and contribution value matrix Vc of each class of the training feature set in turn;
(5) the contribution value matrices Vc of the individual classes are integrated into a contribution value matrix V with N rows and C × K columns, which is the classified dictionary (a sketch of the alternating update follows).
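A minimal numpy sketch of the alternating update in steps (1)-(4), assuming a linear kernel (so σ(Ic) = Ic), a soft-thresholded gradient step for Sc and a closed-form least-squares update for Vc; these concrete solver choices are assumptions for illustration.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal step for the sparsity term on S."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def train_class_dictionary(I_c, K, alpha=0.01, lr=1e-3, iters=200, seed=0):
    """Alternating minimization of ||I_c - I_c @ V_c @ S_c||^2 + sparsity:
    step (1) fixes V_c and updates S_c; step (2) fixes S_c and solves
    for V_c in closed form; step (3) iterates until stable."""
    rng = np.random.default_rng(seed)
    Nc = I_c.shape[1]
    V = rng.standard_normal((Nc, K))
    S = rng.standard_normal((K, Nc))
    for _ in range(iters):
        B = I_c @ V                       # current dictionary σ(I_c)V_c
        # (1) gradient step on S, then soft-threshold for sparsity.
        grad_S = B.T @ (B @ S - I_c)
        S = soft_threshold(S - lr * grad_S, alpha * lr)
        # (2) least-squares update of V with S fixed:
        # argmin_V ||I_c - I_c V S||^2 = pinv(I_c) I_c pinv(S).
        V = np.linalg.pinv(I_c) @ I_c @ np.linalg.pinv(S)
    return V, S

I_c = np.random.randn(20, 15)  # D=20 features, Nc=15 samples of one class
V_c, S_c = train_class_dictionary(I_c, K=8)
print(np.linalg.norm(I_c - I_c @ V_c @ S_c))  # reconstruction residual
```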
Step 5: identify the text, with the following steps:
(1) extract the text features of the test feature set to be identified with the neural network, defining y as the feature of the test sample topic to be identified;
(2) predict the test feature σ(y) with the obtained contribution value matrix V; the prediction function obtained is:
F(s) = ||σ(y) − σ(I)V·s||² + 2α·||s||1
where s denotes the sparse coding of the test feature σ(y) and σ(I) the mapping of the training feature set I in the kernel space;
(3) find the prediction error of the kernel-space feature σ(y) in the sample space constituted by each class of samples, denoted r(c) and expressed as:
r(c) = ||σ(y) − σ(Ic)Vc·sc||²
where sc is the part of the sparse coding s corresponding to class c;
(4) compare the prediction errors of the kernel-space feature σ(y) against each class of samples; the text to be identified belongs to the class with the smallest prediction error (a sketch follows).
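Identification then reduces to computing, for each class, the reconstruction residual r(c) of the test feature in that class's sample space and taking the class with the smallest error; a linear-kernel numpy sketch with ridge-regularized codes standing in for the sparse codes (an assumption for brevity):

```python
import numpy as np

def classify(y, class_dicts, alpha=0.01):
    """Return the class whose dictionary reconstructs y with the
    smallest prediction error r(c) = ||y - I_c @ V_c @ s_c||^2,
    with s_c obtained by ridge-regularized least squares."""
    errors = {}
    for c, (I_c, V_c) in class_dicts.items():
        B = I_c @ V_c                               # class-c dictionary
        s = np.linalg.solve(B.T @ B + alpha * np.eye(B.shape[1]), B.T @ y)
        errors[c] = np.linalg.norm(y - B @ s) ** 2  # r(c)
    return min(errors, key=errors.get)

rng = np.random.default_rng(1)
dicts = {c: (rng.standard_normal((20, 15)), rng.standard_normal((15, 8)))
         for c in ("sports", "tech")}
y = dicts["tech"][0] @ dicts["tech"][1] @ rng.standard_normal(8)  # in tech's span
print(classify(y, dicts))  # expected: 'tech'
```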
In conclusion the present invention proposes a kind of data mining optimization method based on MapReduce, for distributed ring The back end in border facilitates user by matching service description information to use data, improves the efficiency of data mining;It is logical The computing resource provided using cloud service or storage resource are provided and provide a feasible scheme to develop structure data service.
Obviously, those skilled in the art should understand that each module or each step of the above invention can be realized with a general-purpose computing system; they can be concentrated on a single computing system or distributed over a network formed by multiple computing systems, and optionally they can be realized with program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only for exemplary illustration or explanation of the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention shall be included in its protection scope. In addition, the appended claims are intended to cover all variations and modifications falling within the scope and boundary of the appended claims, or the equivalents of such scope and boundary.

Claims (6)

1. A data mining optimization method based on MapReduce, characterized by comprising:
defining, in the MapReduce computing framework, the mapping relations between virtual compute nodes and real compute nodes;
in the Map stage, first locating a virtual compute node, then searching for the corresponding real compute node according to the virtual compute node, and mapping each cluster onto one node;
handing the node data to the Reduce stage for merging, and outputting the query result.
2. The method according to claim 1, characterized in that defining the mapping relations between virtual compute nodes and real compute nodes further comprises:
designing a new file format HMF at the bottom layer, such that for a user HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping relation between a virtual compute node and a real compute node.
3. The method according to claim 1, characterized in that before the step of locating virtual compute nodes in the Map stage, the method further includes:
organizing the entire hash value space into a virtual ring joined end to end;
hashing the network address of each compute node as the keyword, so that each node determines its position on the hash space;
mapping an HMF file with the hash function to a value on the hash space and, walking clockwise from that value, taking the first node encountered as its processing node.
4. The method according to claim 3, characterized by further comprising:
in the Map stage, when searching nodes according to HMF, searching for the corresponding real compute node according to the virtual compute node, and mapping each cluster onto one node.
5. The method according to claim 1, characterized in that after the step of mapping each cluster onto one node, the method further includes:
in the probe stage, collecting the load data of each node; once an imbalance is found, the clusters mapped to the node are reassigned to new nodes, the number of which is determined by the load; after the migration, the original node's resources are reclaimed for redistribution;
after the hash join of each node completes, handing the data of the new nodes and the original nodes together to the Reduce stage for merging.
6. The method according to claim 1, characterized in that the Map task of each node selects one of the tables as the hash-join base table to build the hash table, takes the join attribute participating in the join operation as the hash key, and reads the join attribute field of the base table from the HMF file system into the node memory of the MapReduce distributed system; then a hash function operation is performed on all key values of the join field;
the hash-processed base table join columns are stored, together with their data, in a memory area specially opened up for such data; then, according to the different hash function values, the base table is partitioned into clusters, each cluster containing all base table rows with the same hash function value.
CN201810059358.XA 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce Pending CN108280176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810059358.XA CN108280176A (en) 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810059358.XA CN108280176A (en) 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce

Publications (1)

Publication Number Publication Date
CN108280176A true CN108280176A (en) 2018-07-13

Family

ID=62804453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810059358.XA Pending CN108280176A (en) 2018-01-22 2018-01-22 Data mining optimization method based on MapReduce

Country Status (1)

Country Link
CN (1) CN108280176A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268333A (en) * 2021-06-21 2021-08-17 成都深思科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core calculation
CN114707039A (en) * 2022-03-29 2022-07-05 安徽体育运动职业技术学院 Rapid data management method based on mass data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945240A (en) * 2012-09-11 2013-02-27 杭州斯凯网络科技有限公司 Method and device for realizing association rule mining algorithm supporting distributed computation
CN104852934A (en) * 2014-02-13 2015-08-19 阿里巴巴集团控股有限公司 Method for realizing flow distribution based on front-end scheduling, device and system thereof
US9485197B2 (en) * 2014-01-15 2016-11-01 Cisco Technology, Inc. Task scheduling using virtual clusters
CN107197035A (en) * 2017-06-21 2017-09-22 中国民航大学 A kind of compatibility dynamic load balancing method based on uniformity hash algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945240A (en) * 2012-09-11 2013-02-27 杭州斯凯网络科技有限公司 Method and device for realizing association rule mining algorithm supporting distributed computation
US9485197B2 (en) * 2014-01-15 2016-11-01 Cisco Technology, Inc. Task scheduling using virtual clusters
CN104852934A (en) * 2014-02-13 2015-08-19 阿里巴巴集团控股有限公司 Method for realizing flow distribution based on front-end scheduling, device and system thereof
CN107197035A (en) * 2017-06-21 2017-09-22 中国民航大学 A kind of compatibility dynamic load balancing method based on uniformity hash algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GERMANY006: "Oracle table join operations — Hash Join (part 1)", http://blog.itpub.net/28371090/viewspace-1184848/ *
Li Weiwei et al.: "Research on massive data mining technology based on MapReduce", Computer Engineering and Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268333A (en) * 2021-06-21 2021-08-17 成都深思科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core calculation
CN113268333B (en) * 2021-06-21 2024-03-19 成都锋卫科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core computing
CN114707039A (en) * 2022-03-29 2022-07-05 安徽体育运动职业技术学院 Rapid data management method based on mass data

Similar Documents

Publication Publication Date Title
US9805079B2 (en) Executing constant time relational queries against structured and semi-structured data
CN109919316A (en) The method, apparatus and equipment and storage medium of acquisition network representation study vector
CN105653691B (en) Management of information resources method and managing device
CN105760443B (en) Item recommendation system, project recommendation device and item recommendation method
Lin et al. Website reorganization using an ant colony system
CN106462620A (en) Distance queries on massive networks
CN106815310A (en) A kind of hierarchy clustering method and system to magnanimity document sets
CN106095951B (en) Data space multi-dimensional indexing method based on load balancing and inquiry log
Cao et al. HitFraud: a broad learning approach for collective fraud detection in heterogeneous information networks
Motwani et al. A study on initial centroids selection for partitional clustering algorithms
CN113222181A (en) Federated learning method facing k-means clustering algorithm
CN108228787A (en) According to the method and apparatus of multistage classification processing information
Abdelli et al. A novel and efficient index based web service discovery approach
CN107066328A (en) The construction method of large-scale data processing platform
Rahman et al. Seed-Detective: A Novel Clustering Technique Using High Quality Seed for K-Means on Categorical and Numerical Attributes.
CN108280176A (en) Data mining optimization method based on MapReduce
CN107193940A (en) Big data method for optimization analysis
CN108256083A (en) Content recommendation method based on deep learning
CN108256086A (en) Data characteristics statistical analysis technique
Xiao et al. ORHRC: Optimized recommendations of heterogeneous resource configurations in cloud-fog orchestrated computing environments
Al Aghbari et al. Geosimmr: A mapreduce algorithm for detecting communities based on distance and interest in social networks
Alhaj Ali et al. Distributed data mining systems: techniques, approaches and algorithms
Guo et al. K-loop free assignment in conference review systems
CN107103095A (en) Method for computing data based on high performance network framework
Jayachitra Devi et al. Link prediction model based on geodesic distance measure using various machine learning classification models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180713