CN108491437A - A Hadoop-based genetic method for mining massive web data - Google Patents


Info

Publication number
CN108491437A
Authority
CN
China
Prior art keywords
genetic
individual
data
hadoop
method described
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810141328.3A
Other languages
Chinese (zh)
Inventor
王利鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong ICity Information Technology Co., Ltd.
Original Assignee
Shandong Hui Trade Electronic Port Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Hui Trade Electronic Port Co Ltd
Priority to CN201810141328.3A
Publication of CN108491437A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/12: Computing arrangements based on biological models using genetic models
    • G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Abstract

The invention provides a Hadoop-based genetic method for mining massive web data, belonging to the field of data mining and analysis. The invention combines a genetic algorithm with MapReduce for web data analysis in a Hadoop cluster environment. Experimental results show that the platform can extract implicit, practically useful information with high execution efficiency. It both improves mining efficiency and overcomes the drawbacks of the network environment.

Description

A Hadoop-based genetic method for mining massive web data
Technical field
The invention relates to data mining and data analysis technology, and in particular to a Hadoop-based genetic method for mining massive web data.
Background art
At present, with the rapid growth in data volume, the computing power of a single node can no longer meet the requirements of large-scale data analysis and processing, and "cloud computing" technology, which can store and process massive data, has emerged in response. "Cloud computing" is a form of Internet-based computing in which shared resources, software, and information are provided to computers and other devices on demand. It uses the powerful computing resources available on the network to distribute complex, resource-intensive computations across multiple nodes, and is currently an effective solution. The Internet is the world's largest collection of data, and Web-based data mining has long been a research hotspot in China and abroad. However, current research on data mining focuses mainly on improving the effectiveness of mining systems and neglects the speed at which massive data is processed. With the rapid development of network technology, the data on the Internet is growing exponentially, so a mining platform based on a single node cannot complete the storage, analysis, and processing of today's massive web data. The powerful storage and computing capabilities of "cloud computing" are therefore needed to solve this problem.
Data mining methods or algorithms for massive web data must support parallel processing. On a real large enterprise website, the number of URLs typically reaches tens of thousands or even hundreds of thousands, which makes the constructed web access matrix too large, and the ability of traditional single-machine approaches to process massive data becomes a bottleneck. To address this, common data mining algorithms such as C4.5, SPRINT, association rules, and K-means have been improved, and many parallel algorithms based on traditional mining algorithms have been proposed. Among these, data partitioning is a widely used parallel-processing approach: the data set is first divided into suitable sub-blocks, each sub-block is then processed with a traditional mining algorithm (such as the Apriori algorithm), and finally the results from all sub-blocks are merged. MapReduce already implements this division of the data, but only on the premise that the data are loosely coupled.
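The partition, mine, and merge pattern described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names are hypothetical, and simple URL visit counting stands in for a real per-block mining algorithm such as Apriori. Merging by addition is only valid here because the blocks are assumed to be independent (loosely coupled).

```python
from collections import Counter
from typing import Iterable, List, Sequence


def split_into_blocks(records: Sequence[str], n_blocks: int) -> List[Sequence[str]]:
    """Divide the data set into roughly equal sub-blocks."""
    size = max(1, -(-len(records) // n_blocks))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def mine_block(block: Iterable[str]) -> Counter:
    """Stand-in for a per-block mining algorithm: count visits per URL."""
    return Counter(block)


def merge_results(partials: Iterable[Counter]) -> Counter:
    """Merge the per-block results by summing the counts."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Illustrative data only.
visits = ["/home", "/cart", "/home", "/item", "/home", "/cart"]
blocks = split_into_blocks(visits, n_blocks=3)
merged = merge_results(mine_block(b) for b in blocks)
```

Because each block is mined without reference to the others, the per-block step could run on separate nodes; only the small partial results need to be brought together for the merge.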
A genetic algorithm is a highly parallel, global, randomized search algorithm. When the problem being solved is discontinuous, multimodal, or noisy, it converges to an optimal or satisfactory solution with high probability and has a strong ability to find the global optimum. Moreover, because a genetic algorithm only needs to perform genetic evolution operations on an initial population and does not need to scan the database repeatedly, it greatly reduces the amount of data transferred. Applying a genetic algorithm to mining users' preferred browsing paths on a Hadoop cluster framework therefore reduces data transfer while improving the algorithm's execution efficiency.
The greatest advantage of the Hadoop "cloud computing" platform is that it implements the idea of "moving computation close to storage". The overhead of the traditional "moving data close to computation" model is too high once data reaches massive scale, whereas "moving computation close to storage" eliminates the large cost of transferring massive data over the network and can substantially shorten processing time.
Summary of the invention
To solve the above technical problems, the invention proposes a Hadoop-based genetic algorithm for mining massive web data. The invention designs a massive web data mining system on the Hadoop cloud computing platform by combining a traditional genetic algorithm with the Map/Reduce parallel computing framework of the Hadoop platform, and verifies the usability of the system.
The invention combines MapReduce-based data partitioning with a genetic algorithm; the resulting algorithm unites the distributed processing of data partitioning with the global search for an optimal solution offered by the genetic algorithm.
The technical solution of the invention is as follows:
A Hadoop-based genetic algorithm for mining massive web data mainly comprises the following steps:
Step 1: Data partitioning. Web data is partitioned according to its characteristics; for example, web log files are split by user and access date and transferred to different child nodes, and the user-defined support S is obtained at the same time.
Step 2: Population initialization. Each child node uses Map and Reduce operations on the Hadoop platform to convert the data set into 1-itemsets of preferred sub-paths that meet the user-defined support; these form the initial population of the genetic method.
Step 3: Fitness calculation. The frequency of an access path is used to judge whether it is a user's preferred access path; the fitness function is therefore defined as follows:
where $S'$ is the access frequency of the path. Individuals whose Fitness is greater than 1 are retained and enter the next generation.
Step 4: If the fitness value Fitness of every individual in the current generation equals 1, the population can no longer be improved by genetic evolution; go to step 7. Otherwise, continue.
Step 5: Selection and crossover for the current generation. During genetic evolution, an elite retention strategy is applied first: elite individuals are retained and pass directly into the next generation without taking part in crossover. The selection probability of the i-th individual is defined as:
$P_i = S'_i / S_{avg}$
where $S'_i$ is the path access frequency of the i-th individual and $S_{avg}$ is the average path access frequency of all individuals in the population.
Among the retained individuals, individuals are chosen according to the selection probability to undergo crossover. When the generation number is a multiple of 50, a migration operation is performed: two adjacent sub-populations are interbred, and the best individuals among the resulting offspring are copied into the corresponding source populations.
Step 6: Replace the parent population with the entire offspring population to form a new generation, then go to step 2.
Step 7: If the k value remains unchanged after a certain number of generations, stop evolving and output the final user preferred access paths.
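Steps 1 to 3 above can be illustrated with a small single-process Python sketch. It only mimics the Map and Reduce phases with plain functions and is not Hadoop code. The record layout, the function names, and the sample data are hypothetical. In particular, the fitness formula is not reproduced in this text, so the sketch assumes Fitness = S'/S (visit frequency divided by the user-defined support), which is one reading consistent with "retain individuals whose Fitness is greater than 1".

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical log record layout: (user, access date, URL path).
LogRecord = Tuple[str, str, str]


def split_log(records: Iterable[LogRecord]) -> Dict[Tuple[str, str], List[str]]:
    """Step 1: group the log by (user, date); each group would go to one child node."""
    parts: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for user, date, path in records:
        parts[(user, date)].append(path)
    return dict(parts)


def map_phase(paths: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit (path, 1) for every access."""
    return [(path, 1) for path in paths]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce: sum the counts per path to obtain its visit frequency S'."""
    freq: Counter = Counter()
    for path, count in pairs:
        freq[path] += count
    return freq


def fitness(s_prime: int, support: int) -> float:
    """ASSUMPTION: Fitness = S'/S; the source formula is not available."""
    return s_prime / support


def initial_population(freq: Counter, support: int) -> List[str]:
    """Step 2: keep the paths whose frequency meets the support S."""
    return [path for path, n in freq.items() if n >= support]


def survivors(freq: Counter, population: List[str], support: int) -> List[str]:
    """Step 3: retain individuals whose fitness is greater than 1."""
    return [p for p in population if fitness(freq[p], support) > 1]


# Illustrative data only.
records = [
    ("u1", "2018-02-01", "/home"), ("u1", "2018-02-01", "/cart"),
    ("u1", "2018-02-01", "/home"), ("u1", "2018-02-01", "/home"),
    ("u1", "2018-02-01", "/cart"), ("u2", "2018-02-01", "/help"),
]
parts = split_log(records)
freq = reduce_phase(map_phase(parts[("u1", "2018-02-01")]))
population = initial_population(freq, support=2)
kept = survivors(freq, population, support=2)
```

Under the assumed formula, a path visited exactly S times has fitness 1 and joins the initial population but does not pass the "greater than 1" filter; only more frequently visited paths move on.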
Beneficial effects of the invention
The invention combines a genetic algorithm with MapReduce for web data analysis in a Hadoop cluster environment. Experimental results show that the platform can extract implicit, practically useful information with high execution efficiency. Web mining based on "cloud computing" is significant: it both improves mining efficiency and overcomes the drawbacks of the network environment.
Description of the drawings
Fig. 1 is a schematic diagram of the similarity of the user preferred paths obtained on a single machine and in the Hadoop environment;
Fig. 2 is a schematic comparison of data processing time on a single-machine system and on the Hadoop platform.
Detailed description
The invention is described in more detail below.
Two groups of experiments were carried out. Experiment 1 verifies the similarity of the results obtained by the single-machine mining algorithm and by the MapReduce-based genetic algorithm at different data scales; experiment 2 measures the time cost at different data scales. Taking the load on the single-machine processor into account, the file block size in HDFS is set to 16 MB. When the data is smaller than 16 MB, it resides on a single data node; when the data is smaller than 80 MB, each data node will, with high probability, not hold multiple blocks of the file.
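The block arithmetic behind these thresholds can be checked with a short Python sketch. The 16 MB block size comes from the text; the helper name is hypothetical, and the sketch ignores replication and the number of data nodes, neither of which is stated in the source.

```python
MB = 1024 * 1024
BLOCK_SIZE = 16 * MB  # HDFS block size used in the experiments


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file of the given size is divided into."""
    if file_size_bytes <= 0:
        return 0
    return -(-file_size_bytes // block_size)  # integer ceiling division


small = block_count(10 * MB)   # fits in a single block
medium = block_count(80 * MB)  # exactly five full blocks
```

A file under 16 MB occupies one block and therefore sits on one data node, while a file under 80 MB occupies at most five blocks, which is presumably why that size is used as the second threshold.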
Experiment 1: verifying similarity
In the experiment, the data sets are all independent, with no containing or contained relationships between them, so the measured similarities are also independent. The results are shown in Fig. 1. They show that the similarity between the two processing environments reaches 90%, and that the similarity fluctuates only within a very small range as file size increases. This indicates that the Hadoop-based web data analysis platform can accurately find users' preferred access paths, and that the size of the web files processed does not affect the platform's validity.
Experiment 2: measuring time cost
This experiment compares the processing time of the two processing environments at different data scales. The results are shown in Fig. 2. When the amount of data processed is small, the Hadoop-based web data analysis platform has to generate and transfer intermediate and final files, and starting Hadoop also takes some time, so the total time of the parallel run is actually longer than the single-machine run. As the amount of data grows, however, the Hadoop-based parallel processing platform partitions the data and dispatches it to multiple nodes for parallel processing, so the total parallel time falls below the single-machine time, and the gap in execution efficiency between the two widens as the input grows. Fig. 2 also shows that the more nodes the cluster has, the more efficient the Hadoop-based parallel processing platform is.

Claims (8)

1. A Hadoop-based genetic method for mining massive web data, characterized in that it mainly comprises the following steps:
Step 1), data partitioning;
Step 2), population initialization;
Step 3), fitness calculation;
Step 4), if the fitness value Fitness of every individual in the current generation equals 1, the population can no longer be improved by genetic evolution; go to step 7; otherwise, continue;
Step 5), selection and crossover for the current generation;
Step 6), replace the parent population with the entire offspring population to form a new generation, then go to step 2;
Step 7), if the k value remains unchanged after a number of generations, stop evolving and output the final user preferred access paths.
2. The method according to claim 1, characterized in that, in step 1, web data is partitioned according to its characteristics.
3. The method according to claim 2, characterized in that web log files are split by user and access date and transferred to different child nodes, and the user-defined support S is obtained at the same time.
4. The method according to claim 1, characterized in that, in step 2, each child node uses Map and Reduce operations on the Hadoop platform to convert the data set into 1-itemsets of preferred sub-paths that meet the user-defined support, which form the initial population of the genetic method.
5. The method according to claim 1, characterized in that, in step 3), the frequency of an access path is used to judge whether it is a user's preferred access path; the fitness function is therefore defined as follows:
where $S'$ is the access frequency of the path. Individuals whose Fitness is greater than 1 are retained and enter the next generation.
6. The method according to claim 1, characterized in that, in step 5), during genetic evolution, an elite retention strategy is applied first: elite individuals are retained and pass directly into the next generation without taking part in crossover.
7. The method according to claim 6, characterized in that
the selection probability of the i-th individual is defined as:
$P_i = S'_i / S_{avg}$
where $S'_i$ is the path access frequency of the i-th individual and $S_{avg}$ is the average path access frequency of all individuals in the population.
8. The method according to claim 7, characterized in that, among the retained individuals, individuals are chosen according to the selection probability to undergo crossover; when the generation number is a multiple of 50, a migration operation is performed: two adjacent sub-populations are interbred, and the best individuals among the resulting offspring are copied into the corresponding source populations.
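The elite retention, selection probability, and migration schedule of claims 6 to 8 can be illustrated with a minimal Python sketch. All function names and the sample population are hypothetical. Because $P_i = S'_i / S_{avg}$ can exceed 1, the sketch treats it as a relative selection weight rather than a true probability, which is an interpretation and not something the text states. The crossover operator itself is not specified in the source and is deliberately left out.

```python
import random
from typing import Dict, List

Population = Dict[str, int]  # access path -> access frequency S'


def selection_weights(pop: Population) -> Dict[str, float]:
    """P_i = S'_i / S_avg for every individual in the population."""
    s_avg = sum(pop.values()) / len(pop)
    return {path: s / s_avg for path, s in pop.items()}


def pick_elites(pop: Population, n_elite: int) -> List[str]:
    """Elite retention: the most frequent paths skip crossover."""
    return sorted(pop, key=pop.get, reverse=True)[:n_elite]


def choose_parents(pop: Population, k: int, rng: random.Random) -> List[str]:
    """Draw k crossover candidates, weighted by P_i (assumed interpretation)."""
    weights = selection_weights(pop)
    paths = list(weights)
    return rng.choices(paths, weights=[weights[p] for p in paths], k=k)


def should_migrate(generation: int, interval: int = 50) -> bool:
    """Migration happens when the generation number is a multiple of 50.

    Generations are assumed to be counted from 1, so generation 0 is excluded.
    """
    return generation > 0 and generation % interval == 0


# Illustrative data only.
pop = {"/home": 6, "/cart": 3, "/help": 3}
weights = selection_weights(pop)
elites = pick_elites(pop, n_elite=1)
parents = choose_parents(pop, k=4, rng=random.Random(0))
```

With these sample frequencies the average is 4, so "/home" receives weight 1.5 and the other two paths 0.75 each; paths visited more often than average are thus drawn for crossover more often.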
CN201810141328.3A 2018-02-11 2018-02-11 A Hadoop-based genetic method for mining massive web data Pending CN108491437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810141328.3A CN108491437A (en) 2018-02-11 2018-02-11 A Hadoop-based genetic method for mining massive web data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810141328.3A CN108491437A (en) 2018-02-11 2018-02-11 A Hadoop-based genetic method for mining massive web data

Publications (1)

Publication Number Publication Date
CN108491437A true CN108491437A (en) 2018-09-04

Family

ID=63340238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810141328.3A Pending CN108491437A (en) 2018-02-11 2018-02-11 A Hadoop-based genetic method for mining massive web data

Country Status (1)

Country Link
CN (1) CN108491437A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162704A (en) * 2019-05-21 2019-08-23 西安电子科技大学 More scale key user extracting methods based on multiple-factor inheritance algorithm
CN110162704B (en) * 2019-05-21 2022-06-10 西安电子科技大学 Multi-scale key user extraction method based on multi-factor genetic algorithm

Similar Documents

Publication Publication Date Title
Ramírez‐Gallego et al. Fast‐mRMR: Fast minimum redundancy maximum relevance algorithm for high‐dimensional big data
Gupta et al. Scalable machine‐learning algorithms for big data analytics: a comprehensive review
CN103699606A (en) Large-scale graphical partition method based on vertex cut and community detection
He et al. Parallel implementation of classification algorithms based on MapReduce
CN105205052B (en) A kind of data digging method and device
Bagui et al. Positive and negative association rule mining in Hadoop’s MapReduce environment
CN111582325B (en) Multi-order feature combination method based on automatic feature coding
Madan et al. k-DDD measure and mapreduce based anonymity model for secured privacy-preserving big data publishing
Bernardino et al. Surrogate-assisted clonal selection algorithms for expensive optimization problems
Bamha et al. Frequency-adaptive join for shared nothing machines
Yimin et al. PFIMD: a parallel MapReduce-based algorithm for frequent itemset mining
KR101361080B1 (en) Apparatus, method and computer readable recording medium for calculating between matrices
Wang et al. Parable: A parallel random-partition based hierarchical clustering algorithm for the MapReduce framework
Senthilkumar et al. An efficient FP-Growth based association rule mining algorithm using Hadoop MapReduce
CN108491437A (en) A Hadoop-based genetic method for mining massive web data
Abdolazimi et al. Connected components of big graphs in fixed mapreduce rounds
US20220383137A1 (en) Enterprise Market Volatility Predictions through Synthetic DNA and Mutant Nucleotides
CN108280176A (en) Data mining optimization method based on MapReduce
Sarkar et al. MapReduce: A comprehensive study on applications, scope and challenges
Klos et al. Neural architecture search based on genetic algorithm and deployed in a bare-metal kubernetes cluster
Kolias et al. A Covering Classification Rule Induction Approach for Big Datasets
Vanahalli et al. Distributed mining of significant frequent colossal closed itemsets from long biological dataset
Savvas et al. Distributed and multi-core version of k-means algorithm
Wang et al. Clustering ensemble for categorical geological text based on diversity and quality
US11823064B2 (en) Enterprise market volatility prediction through synthetic DNA and mutant nucleotides

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200814

Address after: 250100 Room 3110, S01 Building, Tidal Building, 1036 Tidal Road, Jinan High-tech Zone, Shandong Province

Applicant after: Shandong Aicheng Network Information Technology Co.,Ltd.

Address before: 250100 S06 Floor, No. 1036 Tidal Road, Jinan High-tech Zone, Shandong Province

Applicant before: SHANDONG HUIMAO ELECTRONIC PORT Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180904
