CN108491437A

CN108491437A - A Hadoop-based genetic method for mining massive web data

Info

Publication number: CN108491437A
Application number: CN201810141328.3A
Authority: CN
Inventors: 王利鑫
Original assignee: Shandong Hui Trade Electronic Port Co Ltd
Current assignee: Shandong ICity Information Technology Co., Ltd.
Priority date: 2018-02-11
Filing date: 2018-02-11
Publication date: 2018-09-04

Abstract

The present invention provides a Hadoop-based genetic method for mining massive web data and belongs to the field of data mining and analysis. The invention combines a genetic algorithm with MapReduce for web data analysis in a Hadoop cluster environment. Experimental results show that the platform can extract implicit, practically valuable information with high execution efficiency. It not only improves mining efficiency but also overcomes the drawbacks of the network environment.

Description

A Hadoop-based genetic method for mining massive web data

Technical field

The present invention relates to data mining and data analysis technology, and more particularly to a Hadoop-based genetic method for mining massive web data.

Background art

At present, with the rapid growth of data scale, the computing power of a single node can no longer meet the requirements of large-scale data analysis and processing, and "cloud computing" technology, which can be used to store and process massive data, has emerged. Cloud computing is an Internet-based form of computing in which shared resources, software, and information are provided to computers and other devices on demand. It uses the powerful computing resources in the network to distribute complex, resource-intensive computations across multiple nodes, and is currently an effective solution. The Internet is the world's largest data collection, and Web-based data mining has long been a research hotspot at home and abroad. However, current research on data mining focuses mainly on improving the effectiveness of mining systems and neglects the speed at which massive data is processed. With the rapid development of network technology, the data on the Internet is growing exponentially. As a result, mining platforms based on a single node cannot complete the storage and analysis of today's massive web data. The powerful storage and computing capability of cloud computing is therefore needed to solve such problems.

Data mining methods or algorithms for massive web data must support parallel processing. On a real large enterprise web site, the number of URLs typically reaches tens of thousands or even hundreds of thousands, which makes the constructed web access matrix too large; the limited ability of traditional single-machine systems to handle massive data has become a development bottleneck. To solve this problem, common data mining algorithms such as C4.5, SPRINT, association rules, and K-means have been improved, and many parallel algorithms based on traditional mining algorithms have been proposed. Among them, data partitioning is a widely used parallel processing approach: the data set is first divided into suitable sub-blocks, each sub-block is then processed with a traditional mining algorithm (such as the Apriori algorithm), and finally the results from the sub-blocks are merged. MapReduce already implements this kind of data partitioning, provided that the data are loosely coupled with one another.

A genetic algorithm is a highly parallel, global, randomized search algorithm. When the problem being solved is discontinuous, multimodal, or noisy, it converges to an optimal or satisfactory solution with high probability and has a good capability for finding the global optimum. In addition, because a genetic algorithm only needs to perform genetic evolution operations on the initial population and does not need to scan the database repeatedly, it greatly reduces the amount of data transferred. Applying a genetic algorithm to the mining of users' preferred browsing paths on the Hadoop cluster framework therefore improves execution efficiency while also reducing data transfer.

The greatest advantage of the Hadoop "cloud computing" platform is that it implements the idea of "moving computation close to storage". The overhead of the traditional "moving data close to computation" model is too high once the data reaches massive scale, whereas "moving computation close to storage" eliminates the large overhead of transferring massive data over the network and can substantially shorten processing time.

Summary of the invention

To solve the above technical problems, the present invention proposes a Hadoop-based genetic algorithm for mining massive web data. The invention designs a massive web data mining system on the Hadoop cloud computing platform by combining a traditional genetic algorithm with the Map/Reduce parallel computing framework of the Hadoop platform, and verifies the usability of the system.

The new algorithm of the present invention merges MapReduce-based data partitioning with a genetic algorithm, combining the distributed processing of data partitioning with the genetic algorithm's global search for the optimal solution.

The technical solution of the present invention is as follows:

A Hadoop-based genetic algorithm for mining massive web data, mainly comprising the following steps:

Step 1: Data partitioning. The web data is split according to its characteristics; for example, web log files are split by user and access date and transferred to different child nodes. The user-defined support S is obtained at the same time.
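The partitioning in Step 1 can be pictured with a minimal sketch that runs on a single machine. The record layout (`LogRecord` with `user`, `date`, and `url` fields) and the function name `split_log` are assumptions made for illustration only; the source does not specify a log format or how splits are shipped to child nodes.

```python
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Hypothetical record layout; the source does not define the log format.
    user: str  # user identifier
    date: str  # access date, e.g. "2018-02-10"
    url: str   # requested page


def split_log(records):
    """Group log records by (user, access date).

    Each group stands for one split that would be sent to a child node.
    """
    splits = defaultdict(list)
    for rec in records:
        splits[(rec.user, rec.date)].append(rec.url)
    return dict(splits)


records = [
    LogRecord("u1", "2018-02-10", "/home"),
    LogRecord("u1", "2018-02-10", "/news"),
    LogRecord("u2", "2018-02-10", "/home"),
    LogRecord("u1", "2018-02-11", "/home"),
]
splits = split_log(records)
for key, urls in splits.items():
    print(key, urls)
```

Keying on the pair (user, date) keeps every page view of one user on one day inside a single split, which is what allows each child node to work on its split independently.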

Step 2: Population initialization. Under the Hadoop platform, each child node uses Map and Reduce operations to convert its data set into 1-itemsets of preferred subpaths that satisfy the user-defined support; these form the initial population of the genetic method.
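As a rough illustration of Step 2, the sketch below imitates the Map and Reduce phases in plain Python rather than on a real Hadoop cluster. It assumes that a "subpath" is a single hop between two consecutive pages of a session and that "satisfying the support" means a count of at least S; neither detail is stated in the source, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_session(session):
    """Map phase: emit (subpath, 1) for each pair of consecutive pages."""
    return [((a, b), 1) for a, b in zip(session, session[1:])]


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each subpath."""
    totals = defaultdict(int)
    for subpath, count in pairs:
        totals[subpath] += count
    return dict(totals)


def initial_population(sessions, support):
    """Keep the subpaths whose count reaches the user-defined support S.

    Assumption: the threshold test is count >= support.
    """
    emitted = chain.from_iterable(map_session(s) for s in sessions)
    counts = reduce_counts(emitted)
    return {path: n for path, n in counts.items() if n >= support}


sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news"],
    ["/home", "/about"],
]
population = initial_population(sessions, support=2)
print(population)
```

In this toy input only the hop from `/home` to `/news` occurs twice, so it is the only subpath that enters the initial population.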

Step 3: Fitness calculation. Whether an access path is a user-preferred access path is measured by how frequently it is accessed. The fitness function is therefore defined as follows:

where S' is the access frequency of the path. Individuals whose Fitness is greater than 1 are retained and enter the next generation.
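The fitness formula itself is not reproduced in this text. The sketch below therefore assumes, purely for illustration, that Fitness is the ratio of a path's access frequency S' to the user-defined support S. That form agrees with the rule that individuals with Fitness greater than 1 survive, but it is an assumption and is not confirmed by the source.

```python
def fitness(access_frequency, support):
    """Assumed form (not from the source): Fitness = S' / S."""
    return access_frequency / support


def survivors(frequencies, support):
    """Keep the paths whose fitness is strictly greater than 1."""
    return {
        path: freq
        for path, freq in frequencies.items()
        if fitness(freq, support) > 1
    }


# Hypothetical path frequencies and support value.
frequencies = {"A-B": 6, "B-C": 3, "A-C": 2}
kept = survivors(frequencies, support=3)
print(kept)
```

Under this assumed form, a path accessed exactly S times has fitness 1 and is not retained, which matches the strict "greater than 1" wording of Step 3.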

Step 4: If the fitness value Fitness of every individual in the current generation equals 1, the population can no longer be improved by genetic evolution; go to Step 7. Otherwise, continue.

Step 5: Selection and crossover for the current generation. During genetic evolution, an elitism strategy is applied first: the elite individuals are retained and pass directly into the next generation without taking part in crossover. The selection probability of the i-th individual is defined as:

P_i = S'_i / S_avg

where S'_i is the path access frequency of the i-th individual and S_avg is the average path access frequency of all individuals in the population.
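A small sketch of this selection rule follows. Because S'_i / S_avg can exceed 1, the sketch treats P_i as a relative weight and passes it to `random.choices`, which normalizes the weights internally; that reading, and the helper names, are assumptions rather than something the source states.

```python
import random


def selection_weights(frequencies):
    """Compute P_i = S'_i / S_avg for each individual."""
    s_avg = sum(frequencies) / len(frequencies)
    return [f / s_avg for f in frequencies]


def select_parents(individuals, frequencies, k, rng):
    """Draw k parents, using P_i as a relative selection weight (assumption)."""
    weights = selection_weights(frequencies)
    return rng.choices(individuals, weights=weights, k=k)


# Hypothetical individuals and their path access frequencies.
individuals = ["p1", "p2", "p3"]
freqs = [6, 3, 3]
weights = selection_weights(freqs)  # S_avg is 4 for these values
rng = random.Random(0)              # seeded only to make the run repeatable
parents = select_parents(individuals, freqs, k=2, rng=rng)
print(weights, parents)
```

Paths accessed more often than average get a weight above 1 and are drawn more often, so frequently visited paths dominate the crossover pool.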

Among the retained individuals, individuals are chosen for crossover according to their selection probability. When the generation number is a multiple of 50, a migration operation is performed: two adjacent subpopulations interbreed, and the best individuals among the resulting offspring are copied into the respective source populations.
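The migration step can be sketched as below. The source does not describe how paths are encoded or crossed, so the `crossover` helper (joining two paths when one ends where the other begins), the pairing of subpopulations as (0, 1), (2, 3), and so on, and the decision to skip generation 0 are all hypothetical choices made for illustration.

```python
MIGRATION_INTERVAL = 50  # from the text: migrate when the generation is a multiple of 50


def crossover(a, b):
    """Hypothetical crossover: join path a and path b if a ends where b starts."""
    if a[-1] == b[0]:
        return a + b[1:]
    return None


def migrate(pop_a, pop_b, frequency):
    """Interbreed two adjacent subpopulations and copy the best offspring to both."""
    offspring = [
        child
        for a in pop_a
        for b in pop_b
        for child in (crossover(a, b), crossover(b, a))
        if child is not None
    ]
    if not offspring:
        return pop_a, pop_b
    best = max(offspring, key=frequency)
    return pop_a + [best], pop_b + [best]


def maybe_migrate(generation, subpops, frequency):
    """Run migration only on multiples of 50 (generation 0 skipped by assumption)."""
    if generation == 0 or generation % MIGRATION_INTERVAL != 0:
        return subpops
    result = list(subpops)
    for i in range(0, len(result) - 1, 2):
        result[i], result[i + 1] = migrate(result[i], result[i + 1], frequency)
    return result


# Hypothetical access-frequency lookup for candidate paths.
freq_table = {("A", "B", "C"): 4}


def frequency(path):
    return freq_table.get(path, 0)


subpops = [[("A", "B")], [("B", "C")]]
print(maybe_migrate(50, subpops, frequency))
```

Copying only the best offspring back keeps the exchange between subpopulations small, which fits the text's emphasis on reducing the amount of data transferred.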

Step 6: The entire offspring population replaces the parent population to form a new generation; then go to Step 2.

Step 7: If the k value remains unchanged after a certain number of generations, evolution stops and the final user-preferred access paths are output.

The beneficial effects of the invention are as follows:

The present invention combines a genetic algorithm with MapReduce for web data analysis in a Hadoop cluster environment. Experimental results show that the platform can extract implicit, practically valuable information with high execution efficiency. Web mining based on "cloud computing" is significant: it not only improves mining efficiency but also overcomes the drawbacks of the network environment.

Description of the drawings

Fig. 1 is a schematic diagram of the similarity of the user-preferred paths obtained on a single machine and in the Hadoop environment;

Fig. 2 is a schematic diagram comparing data processing times on a single-machine system and on the Hadoop platform.

Detailed description of the embodiments

The content of the present invention is described in more detail below:

The present invention is expressed as follows:

Step 2: Population initialization. Under the Hadoop platform, each child node uses Map and Reduce operations to convert its data set into 1-itemsets of preferred subpaths that satisfy the user-defined support; these form the initial population of the genetic algorithm.

P_i = S'_i / S_avg

Two groups of experiments were carried out. Experiment 1 verifies the similarity of the results obtained by the single-machine mining algorithm and by the MapReduce-based genetic algorithm at different data scales; Experiment 2 verifies the time taken at different data scales. Taking the load on the single-machine processor into account, the file block size in HDFS is set to 16 MB. When the data is smaller than 16 MB, it resides on a single data node; when the data is smaller than 80 MB, it is very likely that no data node holds more than one block of the file.
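The block arithmetic behind this setup can be checked with a short sketch. It only computes how many 16 MB blocks a file of a given size occupies; the helper name and the whole-megabyte sizes are illustrative assumptions, and the cluster size used in the experiments is not stated in the source. In Hadoop the block size is normally configured through the `dfs.blocksize` property.

```python
BLOCK_SIZE_MB = 16  # block size used in the experiments, per the text


def block_count(file_size_mb):
    """Number of HDFS blocks for a file of the given size (integer megabytes)."""
    # Ceiling division without floating point.
    return -(-file_size_mb // BLOCK_SIZE_MB)


for size in (10, 16, 17, 80):
    print(size, "MB ->", block_count(size), "block(s)")
```

A file below 16 MB fits into one block and therefore sits on one data node (per replica), while an 80 MB file is cut into five blocks that can be spread across nodes.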

Experiment 1: Verifying similarity

In the experiments, the data sets are all independent, with no containment relationship between them, so the measured similarities are also independent. The results are shown in Fig. 1. They show that the similarity between the two processing environments reaches 90% and fluctuates only within a very small range as the file size grows. This indicates that the Hadoop-based web data analysis platform can accurately find users' preferred access paths, and that the size of the web files processed does not affect the validity of the platform.

Experiment 2: Measuring time overhead

This experiment compares the processing time of the two environments at different data scales. The results are shown in Fig. 2. When the amount of data is small, the Hadoop-based web data analysis platform has to generate and transfer intermediate and final files, and starting Hadoop also takes some time, so the total time of the parallel run is actually longer than that of single-machine execution. As the amount of data increases, however, the Hadoop-based parallel platform splits the data and dispatches it to multiple nodes for parallel processing, so the total parallel time becomes shorter than the single-machine time, and the gap in execution efficiency widens as the input grows. Fig. 2 also shows that the more nodes the cluster has, the more efficient the Hadoop-based parallel processing platform is.

Claims

1. A Hadoop-based genetic method for mining massive web data, characterized in that it mainly comprises the following steps:

Step 1), data partitioning;

Step 2), population initialization;

Step 3), fitness calculation;

Step 4), if the fitness value Fitness of every individual in the current generation equals 1, the population can no longer be improved by genetic evolution, and the method goes to step 7); otherwise, it continues;

Step 5), selection and crossover for the current generation;

Step 6), the entire offspring population replaces the parent population to form a new generation, and the method then goes to step 2);

Step 7), if the k value remains unchanged after a number of generations, evolution stops and the final user-preferred access paths are output.

2. The method according to claim 1, characterized in that, in step 1), the web data is split according to its characteristics.

3. The method according to claim 2, characterized in that web log files are split by user and access date and transferred to different child nodes, and the user-defined support S is obtained at the same time.

4. The method according to claim 1, characterized in that, in step 2), each child node under the Hadoop platform uses Map and Reduce operations to convert its data set into 1-itemsets of preferred subpaths that satisfy the user-defined support, which form the initial population of the genetic method.

5. The method according to claim 1, characterized in that, in step 3), whether an access path is a user-preferred access path is measured by how frequently it is accessed, and the fitness function is therefore defined as follows:

6. The method according to claim 1, characterized in that, in step 5), during genetic evolution an elitism strategy is applied first: the elite individuals are retained and pass directly into the next generation without taking part in crossover.

7. The method according to claim 6, characterized in that

the selection probability of the i-th individual is defined as:

P_i = S'_i / S_avg

where S'_i is the path access frequency of the i-th individual and S_avg is the average path access frequency of all individuals in the population.

8. The method according to claim 7, characterized in that, among the retained individuals, individuals are chosen for crossover according to their selection probability; when the generation number is a multiple of 50, a migration operation is performed in which two adjacent subpopulations interbreed, and the best individuals among the resulting offspring are copied into the respective source populations.