CN105930469A - Hadoop-based individualized tourism recommendation system and method

Info

Publication number: CN105930469A
Application number: CN201610258743.8A
Authority: CN
Inventors: 张新峰; 郑楠
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2016-04-23
Filing date: 2016-04-23
Publication date: 2016-09-07

Abstract

The invention discloses a Hadoop-based personalized tourism recommendation system and method, belonging to the fields of Internet technology and big data. Five modules complement one another to provide the full system functionality: a web crawler module, a data module, a big data processing module, a recommendation calculation module, and a UI (user interface) module. They are connected as follows: the web crawler module is unidirectionally connected to the data module and also unidirectionally connected to the UI module; the data module is unidirectionally connected to the big data processing module and bidirectionally connected to the UI module; the big data processing module is unidirectionally connected to the recommendation calculation module and bidirectionally connected to the UI module; and the recommendation calculation module is bidirectionally connected to the UI module. The resulting system can quickly and accurately produce personalized recommendations for tourists and offers them more suitable choices when selecting a destination.

Description

Hadoop-based personalized tourism recommendation system and method

Technical field

The present invention relates to the fields of Internet technology, big data, and data mining, and provides a personalized recommendation system developed for the tourism industry.

Background technology

Traditional travel websites mostly recommend scenic spots by popularity and do not personalize recommendations to an individual tourist's interests and behavior. As a result, tourists choose a destination from a large number of scenic spots almost blindly and have difficulty matching their own interests. In personalized recommendation systems in other fields, the conventional methods are content-based recommendation and collaborative filtering, but both have defects. Content-based recommendation always recommends items with similar content, so users tire of the results. Collaborative filtering gives popular items a disproportionately large share, which reduces the appearance of long-tail items, so the final recommendations offer the user little novelty. On the processing side, traditional workflows are slow and inefficient when faced with massive data, which conflicts with the requirement that a website run quickly and efficiently.

Summary of the invention

Addressing the three problems raised in the background section, the present invention develops a Hadoop-based personalized tourism recommendation system that can quickly and accurately produce personalized recommendations for tourists and offer them more suitable choices when selecting a destination.

To achieve the above object, the present invention provides the following technical solution:

The present invention uses Eclipse as the development tool, Hadoop as the big data processing platform, and Java as the programming language. JSch connects the local Windows system to the CentOS server across platforms, so that requests for operations on the server can be sent from a local browser. Driven by information from page interactions, the back end uses the MapReduce computing framework in Hadoop to search and compute step by step on the distributed file system, then consolidates the results and returns them to the front-end page.

The present invention comprises five modules that complement one another to provide the full system functionality: the web crawler module, the data module, the big data processing module, the recommendation calculation module, and the UI module. They are connected as follows: the web crawler module is unidirectionally connected to the data module and also unidirectionally connected to the UI module; the data module is unidirectionally connected to the big data processing module and bidirectionally connected to the UI module; the big data processing module is unidirectionally connected to the recommendation calculation module and bidirectionally connected to the UI module; and the recommendation calculation module is bidirectionally connected to the UI module. The connection flow between the modules is shown in Fig. 1, and the connections are described in detail below:

1. The web crawler module mainly crawls scenic-spot information and user information. Scenic-spot information is crawled in order of province and city. The module first traverses the province and city data in the data module; the back end then modifies the city name in the travel website's URL and obtains the site's cookie, retrieving in turn the list of scenic-spot names for each city of each province. Working through this list, it extracts the required fields for each scenic spot and stores them in the corresponding scenic-spot information table in the database. User information is obtained from each scenic spot's review page: the module collects the reviews of the scenic spot, obtains the details of each reviewer (that is, user) from the review information, and stores the user information and review information in the corresponding user information table and review table in the database. The crawl flow is as follows:

Country list → province list → city list → scenic-spot list → scenic-spot field information → scenic-spot reviews → reviewer information

The web crawler module is triggered in two ways. First, at a fixed time every day it reads the scenic-spot data from the database, triggers the corresponding programs that crawl scenic-spot and user information, and stores the results in the corresponding tables in the database. Second, it is triggered by the search function of the UI module: when a scenic-spot name entered in the search finds no match in the database, the crawler is triggered to query the travel website and crawl the relevant information. If the scenic spot is found, its fields are crawled and stored in the corresponding scenic-spot information table in the database, and the result is also fed back to the corresponding position on the UI page.

2. The data module stores the basic data, which falls into three categories: basic scenic-spot data, basic user data, and user-scenic-spot relation data. Basic scenic-spot data comprises the province list, the city list, the scenic-spot list, and the basic information table for each scenic spot; basic user data comprises users' basic information and the scenic spots each user has visited; user-scenic-spot relation data comprises users' ratings of scenic spots. The data module provides the underlying data for the big data processing module, and the information it holds can also be queried through the search function of the UI.

3. The big data processing module runs on the MapReduce computing framework (hereinafter MR) of the Hadoop platform. The framework has two main parts, Map and Reduce. The master node first splits the raw data and distributes the pieces to the worker nodes that execute map tasks, and these worker nodes execute their map tasks in parallel. When the map tasks finish, their output becomes the input of the reduce tasks and is sent to the worker nodes that execute them. Reduce merges the results produced by the map tasks and consolidates the final output. The MR framework flow is shown in Fig. 2.

The purpose of this module is to increase data processing speed. The similarity search, category word cloud, and scenic-spot type prediction functions of the UI trigger it to read data from the database, process it, and return the results to the corresponding position on the UI page. Parallel processing is applied mainly in the following areas. First, the web crawler adopts the MR approach, which effectively increases crawling speed. Second, it is applied to computing user similarity and scenic-spot similarity, so that similarity for users or scenic spots is computed in a short time; this uses the basic user data, basic scenic-spot data, and user-scenic-spot relation data in the data module. Third, it is applied to text mining: word segmentation statistics are computed for the scenic-spot information of each category and shown as a dynamic word cloud for that category, and the category of scenic spots of unknown type is predicted, using mainly the basic scenic-spot data as training data.

4. The recommendation calculation module computes targeted recommendations from the results of the big data processing module and feeds them back to the corresponding fields of the UI page. It provides three kinds of recommendation. The first recommends similar users to the logged-in user: a user-user similarity matrix is built from the user similarities computed by the big data processing module, and the ten users most similar to the logged-in user are returned. The second is content-based recommendation: the features of the scenic spots the user has visited are analyzed and extracted as the user's preferences, and similar scenic spots are looked up in the scenic-spot similarity matrix. The third is hybrid recommendation, that is, personalized recommendation. It combines content-based recommendation with item-based collaborative filtering to improve accuracy and give the user more suitable results. The hybrid method first builds a scenic-spot co-occurrence matrix from the user-scenic-spot relation data, then weights scenic-spot content features into each user's scenic-spot ratings to form a user behavior matrix. Multiplying the co-occurrence matrix by the behavior matrix gives the user's score for every scenic spot, and the ten highest-scoring scenic spots are presented to the user as the final recommendation.

5. The UI module is related to all four modules above. Its relation to the web crawler module is a unidirectional trigger; its relations to the other three modules are bidirectional: page functions trigger the back-end programs of the associated module, and each module feeds its results back to the corresponding fields of the page for display.

The UI module has three main pages: the popular recommendation page, the category recommendation page, and the personalized recommendation page. Every page offers scenic-spot content search, where entering a scenic-spot name displays that scenic spot's basic information, and scenic-spot similarity lookup, where entering two scenic-spot names returns their similarity value and a similarity result. The popular recommendation page is built from comprehensive statistics over the choices and reviews of the one million users in the database; it shows the top 10 scenic spots and their information, together with the number of tourists received by each province nationwide and each province's best travel season. The category recommendation page shows, for each category, the ten most popular scenic spots obtained from per-category statistics. All scenic spots are divided into 27 categories. Text mining is performed on each category to extract its key feature words, and a word cloud is generated for each category from the frequencies of those words so that users can see each category's feature words more intuitively. The page can also predict the type of a scenic spot whose type is unknown by computing which category it most probably belongs to. The personalized recommendation page can display results from either the content-based method or the hybrid method. It also shows a graph of the relations between users based on user similarity and recommends the ten most similar users to the current user.

Brief description of the drawings

Fig. 1 is a flow chart of the Hadoop-based personalized tourism recommendation system.

Fig. 2 is a workflow diagram of MapReduce in Hadoop.

Fig. 3 is a flow chart of word-frequency counting with MapReduce in Hadoop.

Detailed description of the invention

One, the idea of the hybrid recommendation algorithm and a sample implementation

1) Idea of the hybrid recommendation algorithm:

1. Prepare the raw data list, containing user ID, user level, scenic-spot ID, scenic-spot level, and the user's rating of the scenic spot;

2. Build the co-occurrence matrix of the scenic-spot data by counting how many times each scenic spot appears on its own and how many times it appears together with each other scenic spot;

3. Build the similarity matrix of the scenic-spot data, deriving scenic-spot similarity from the users that scenic spots have in common;

4. Build the users' weighted rating matrix for scenic spots, in which each value is a weighted combination of the original rating, the user level, and the scenic-spot level;

5. Multiply the similarity matrix by the weighted rating matrix to compute the recommendation scores;

6. Sort the scores in descending order, exclude the scenic spots the user has already visited, and recommend by score.

2) Sample implementation of the hybrid recommendation algorithm:

1. Prepare the raw data list, containing user ID, user level, scenic-spot ID, scenic-spot level, and the user's rating of the scenic spot. The sample raw data are listed below:

UserID	UserLevel	SceneID	Score	SceneLevel
User1	5	Scene1	5	5A
User1	5	Scene2	3	4A
User1	5	Scene4	2.5	2A
User2	4	Scene1	4	5A
User2	4	Scene3	4	3A
User2	4	Scene4	3	2A
User3	3	Scene2	4.5	4A
User3	3	Scene3	4.5	3A
User3	3	Scene4	3.5	2A
User3	3	Scene5	4	1A

2. Build the co-occurrence matrix of the scenic-spot data. Taking each user as a unit, count over the scenic spots that user has rated, computing both how many times each scenic spot appears on its own and how many times it appears together with each other scenic spot.

3. Build the similarity matrix of the scenic-spot data. The similarity between item i and item j is computed as:

w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)| |N(j)|}}

where |N(i)| is the number of users who have visited scenic spot i, |N(j)| is the number of users who have visited scenic spot j, and the numerator is the number of users who have visited both scenic spot i and scenic spot j. The denominator penalizes popular scenic spots, whose similarity to other scenic spots would otherwise approach 1. The sample scenic-spot similarity matrix is as follows:

	Scene1	Scene2	Scene3	Scene4	Scene5
Scene1	1	0.5	0.5	0.816	0
Scene2	0.5	1	0.5	0.816	0.707
Scene3	0.5	0.5	1	0.816	0.707
Scene4	0.816	0.816	0.816	1	0.447
Scene5	0	0.707	0.707	0.447	1
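The counting and similarity steps can be sketched in a few lines. The following Python sketch is an illustration only, not the patent's Java/MapReduce code; it assumes the similarity is the number of common users divided by the square root of the product of the two spots' user counts, and the function names `cooccurrence` and `similarity` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# (user, scenic spot) pairs taken from the sample raw-data table.
RATINGS = [
    ("User1", "Scene1"), ("User1", "Scene2"), ("User1", "Scene4"),
    ("User2", "Scene1"), ("User2", "Scene3"), ("User2", "Scene4"),
    ("User3", "Scene2"), ("User3", "Scene3"), ("User3", "Scene4"),
    ("User3", "Scene5"),
]

def cooccurrence(pairs):
    """Count, user by user, how often each spot appears alone and with another spot."""
    spots_by_user = defaultdict(set)
    for user, spot in pairs:
        spots_by_user[user].add(spot)
    counts = defaultdict(int)
    for spots in spots_by_user.values():
        for s in spots:
            counts[(s, s)] += 1      # diagonal: users who visited s
        for a, b in combinations(sorted(spots), 2):
            counts[(a, b)] += 1      # off-diagonal: users who visited both
            counts[(b, a)] += 1
    return counts

def similarity(counts, i, j):
    """Common users divided by sqrt(users of i * users of j)."""
    return counts[(i, j)] / sqrt(counts[(i, i)] * counts[(j, j)])

counts = cooccurrence(RATINGS)
print(round(similarity(counts, "Scene1", "Scene4"), 3))  # 0.816
print(round(similarity(counts, "Scene2", "Scene5"), 3))  # 0.707
```

Under this reading the sketch reproduces the sample similarity matrix except for the (Scene4, Scene5) entry, which evaluates to about 0.577 rather than the listed 0.447.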

4. Build the users' weighted rating matrix for scenic spots. Each score has three parts: the user's direct rating of the scenic spot, a weighted score for the scenic spot's level, and a weighted score for the user's level.

Score = w1 × original_score + w2 × (scene_level + 1) + w3 × user_level (1)

w1 + w2 + w3 = 1

The weights are derived from the counts under each index. S6-S1 are the numbers of ratings of 5 to 0 points, respectively; SL6-SL1 are the numbers of scenic spots at levels 5A to 0A, respectively; UL6-UL1 are the numbers of users in each user-level band, where UL6 is the number of users above level 15, UL5 the number at levels 13-15, UL4 the number at levels 10-12, UL3 the number at levels 7-9, UL2 the number at levels 4-6, and UL1 the number at levels 1-3.

Index	Weight	6	5	4	3	2	1
Original score	w1	S6	S5	S4	S3	S2	S1
Scenic-spot level	w2	SL6	SL5	SL4	SL3	SL2	SL1
User level	w3	UL6	UL5	UL4	UL3	UL2	UL1

The proportion assigned to each rank value is computed as follows:

P_6 = \frac{6}{6 + 5 + 4 + 3 + 2 + 1} = 0.29, P_5 = \frac{5}{6 + 5 + 4 + 3 + 2 + 1} = 0.24, P_4 = \frac{4}{6 + 5 + 4 + 3 + 2 + 1} = 0.19,

P_3 = \frac{3}{6 + 5 + 4 + 3 + 2 + 1} = 0.14, P_2 = \frac{2}{6 + 5 + 4 + 3 + 2 + 1} = 0.09, P_1 = \frac{1}{6 + 5 + 4 + 3 + 2 + 1} = 0.05

The weight of each index is computed as follows:

w_1 = \frac{\sum_{i=1}^{6} S_i P_i}{\sum_{i=1}^{6} (S_i P_i + SL_i P_i + UL_i P_i)}, w_2 = \frac{\sum_{i=1}^{6} SL_i P_i}{\sum_{i=1}^{6} (S_i P_i + SL_i P_i + UL_i P_i)}, w_3 = \frac{\sum_{i=1}^{6} UL_i P_i}{\sum_{i=1}^{6} (S_i P_i + SL_i P_i + UL_i P_i)}
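As a rough illustration of the weight calculation, the Python sketch below computes the proportions P_i and the three index weights from per-rank counts. It assumes that every term in the denominator carries the factor P_i, so that the three weights sum to 1. The counts are invented for illustration and are not the patent's data; `index_weights` is a hypothetical name.

```python
# Rank values 6..1 and their shares P_i = i / (6 + 5 + 4 + 3 + 2 + 1).
RANKS = [6, 5, 4, 3, 2, 1]
P = {i: i / sum(RANKS) for i in RANKS}

def index_weights(S, SL, UL):
    """Return (w1, w2, w3) from per-rank counts, each a dict keyed by rank 6..1."""
    s = sum(S[i] * P[i] for i in RANKS)
    sl = sum(SL[i] * P[i] for i in RANKS)
    ul = sum(UL[i] * P[i] for i in RANKS)
    total = s + sl + ul
    return s / total, sl / total, ul / total

# Hypothetical counts, invented for illustration only.
S = {6: 900, 5: 700, 4: 300, 3: 60, 2: 30, 1: 10}       # ratings per score rank
SL = {6: 20, 5: 80, 4: 150, 3: 120, 2: 60, 1: 70}       # scenic spots per level rank
UL = {6: 10, 5: 20, 4: 40, 3: 80, 2: 120, 1: 130}       # users per level band

w1, w2, w3 = index_weights(S, SL, UL)
print(round(w1 + w2 + w3, 6))  # 1.0
```

Because the three weights share one denominator, they sum to 1 for any non-empty set of counts.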

Computed in the back end over 50,000 users and 10,000 scenic spots, the weights are w1 = 0.72, w2 = 0.17, and w3 = 1 - w1 - w2 = 0.11.

Substituting the results above into scoring formula (1) gives each user's weighted score for each scenic spot. Converted to matrix form, the sample user-scenic-spot rating matrix is as follows:
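Formula (1) with these weights can be written as a small function. The Python sketch below is illustrative only. The mapping from a level label such as "3A" to the number 3 and the use of the raw user level are assumptions not stated in the source, so its outputs need not match the sample matrix; both function names are hypothetical.

```python
W1, W2, W3 = 0.72, 0.17, 0.11  # weights reported in the text

def level_from_label(label):
    """Assumed mapping: '5A' -> 5, '4A' -> 4, and so on (not stated in the source)."""
    return int(label.rstrip("A"))

def weighted_score(original_score, scene_level, user_level,
                   w1=W1, w2=W2, w3=W3):
    """Formula (1): w1*original_score + w2*(scene_level + 1) + w3*user_level."""
    return w1 * original_score + w2 * (scene_level + 1) + w3 * user_level

# A made-up rating: score 4 for a 3A scenic spot by a level-2 user.
score = weighted_score(original_score=4,
                       scene_level=level_from_label("3A"),
                       user_level=2)
print(round(score, 2))  # 3.78
```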

	User1	User2	User3
Scene1	5	4.3	1.5
Scene2	3.3	1.2	4.35
Scene3	0.9	0.9	4.05
Scene4	2.35	3.4	3.05
Scene5	0.3	2.4	3.1

5. Multiply the similarity matrix by the weighted rating matrix to compute the recommendation scores, taking User1 as an example.

6. Sort the scores from high to low, exclude the scenic spots User1 has already visited, and recommend the remaining scenic spots by score.

Recommendation result: Scene3. Reason for recommendation: Scene3 is highly similar to Scene4, which the user has visited, and the user's interest in Scene3 is relatively high.
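Steps 5 and 6 can be checked against the sample tables. The Python sketch below (an illustration, not the patent's implementation) multiplies the sample similarity matrix by User1's column of the sample rating matrix, drops the scenic spots User1 has visited, and sorts the rest; `recommend` is a hypothetical name.

```python
SCENES = ["Scene1", "Scene2", "Scene3", "Scene4", "Scene5"]

# Sample similarity matrix, rows and columns in SCENES order.
SIM = [
    [1.0,   0.5,   0.5,   0.816, 0.0],
    [0.5,   1.0,   0.5,   0.816, 0.707],
    [0.5,   0.5,   1.0,   0.816, 0.707],
    [0.816, 0.816, 0.816, 1.0,   0.447],
    [0.0,   0.707, 0.707, 0.447, 1.0],
]

# User1's column of the sample weighted rating matrix, in SCENES order.
USER1_SCORES = [5.0, 3.3, 0.9, 2.35, 0.3]
# Scenic spots User1 rated in the sample raw-data table.
USER1_VISITED = {"Scene1", "Scene2", "Scene4"}

def recommend(sim, scores, visited, top_n=10):
    """Row-by-vector product, then filter visited spots and sort descending."""
    totals = {
        scene: sum(s * r for s, r in zip(row, scores))
        for scene, row in zip(SCENES, sim)
    }
    candidates = [(sc, t) for sc, t in totals.items() if sc not in visited]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]

for scene, total in recommend(SIM, USER1_SCORES, USER1_VISITED):
    print(scene, round(total, 2))
# Scene3 7.18
# Scene5 4.32
```

Scene3 ranks first among the unvisited scenic spots, which agrees with the recommendation result stated for User1.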

Two, the idea of computing the similarity between two users and a sample process:

1) Idea of computing the similarity between two users

1. Prepare the raw data list;

2. Compute the similarity between each pair of users as a weighted combination of four partial similarities, forming a similarity matrix;

3. For a given user, sort the other users by similarity score from high to low and recommend the top-scoring users as similar users.

2) Sample similarity computation between two users

1. Prepare the raw data list, containing user ID, user level, scenic-spot ID, scenic-spot level, and scenic-spot type.

2. Compute the similarity between two users. It is a weighted combination of four parts: the similarity of the levels of the scenic spots the two users have visited, the similarity of the types of those scenic spots, whether the two users have visited the same scenic spots, and the similarity of the two users' levels. The four partial scores are added with weights to give the final similarity between the two users, which can be expressed by the following formula:

Similarity = w1 × sim_category + w2 × sim_sceneLevel + w3 × sim_userLevel + w4 × sim_scene

where w1 + w2 + w3 = 1, and w1, w2, and w3 are computed in a way similar to the rating-matrix weights above; sim_category is the similarity of scenic-spot types, sim_sceneLevel is the similarity of scenic-spot levels, sim_userLevel is the similarity of user levels, and sim_scene is the similarity measured by whether the two users have scenic spots in common.

sim\_category(x, y) = \frac{1}{1 + d(x, y)}, sim\_sceneLevel(x, y) = \frac{1}{1 + d(x, y)},

sim\_userLevel(x, y) = \frac{1}{1 + d(x, y)}, d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Here, when computing sim_category, x_i and y_i are the probabilities that user x and user y visited scenic spots of category i; when computing sim_sceneLevel, they are the probabilities that user x and user y visited scenic spots of level i; and when computing sim_userLevel, they are the user levels of user x and user y, respectively.

sim_scene is measured by whether the two users have visited the same scenic spots: it is the ratio of the number of scenic spots both users have visited to the larger of the two users' visited-spot counts. This keeps the similarity within the range 0-1 and measures the similarity between the two users well. Note that because some users have visited the same scenic spot several times, the visited scenic spots must be deduplicated before the similarity is computed.
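The partial similarities can be sketched as follows. This Python sketch is an illustration under assumptions: the weights passed to `user_similarity` are placeholders rather than the patent's values, the repeat visit in `user1` is invented to show deduplication, and all function names are hypothetical.

```python
from math import sqrt

def distance_similarity(x, y):
    """1 / (1 + Euclidean distance) between two equal-length vectors."""
    d = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)

def common_scene_similarity(visits_x, visits_y):
    """Shared spots divided by the larger visited-spot count, after deduplication."""
    a, b = set(visits_x), set(visits_y)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))

def user_similarity(sim_category, sim_scene_level, sim_user_level, sim_scene,
                    weights):
    """Weighted sum of the four partial similarities."""
    w1, w2, w3, w4 = weights
    return (w1 * sim_category + w2 * sim_scene_level
            + w3 * sim_user_level + w4 * sim_scene)

# Visit lists based on the sample raw data; the repeated Scene4 is invented.
user1 = ["Scene1", "Scene2", "Scene4", "Scene4"]
user3 = ["Scene2", "Scene3", "Scene4", "Scene5"]

print(common_scene_similarity(user1, user3))     # 0.5
print(round(distance_similarity([5], [3]), 3))   # 0.333 (user levels 5 and 3)
```

Converting the visit lists to sets is what performs the deduplication required by the text: User1's two visits to Scene4 count once.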

The computed similarity matrix between the users in the sample above is as follows:

	User1	User2	User3
User1	1	0.641	0.598
User2	0.641	1	0.613
User3	0.598	0.613	1

3. Sort the similarity scores from high to low and recommend the top-scoring users to the user as similar users.

The similar users recommended for User1, in order: User2, User3

The similar users recommended for User2, in order: User1, User3

The similar users recommended for User3, in order: User2, User1
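The ranking step can be reproduced directly from the sample user similarity matrix. The Python sketch below is illustrative; `similar_users` is a hypothetical name, and the default of ten results follows the top-ten rule described for the recommendation module.

```python
# Off-diagonal entries of the sample user similarity matrix.
USER_SIM = {
    "User1": {"User2": 0.641, "User3": 0.598},
    "User2": {"User1": 0.641, "User3": 0.613},
    "User3": {"User1": 0.598, "User2": 0.613},
}

def similar_users(user, sim=USER_SIM, top_n=10):
    """Other users sorted by similarity to `user`, highest first."""
    others = sim[user]
    return sorted(others, key=others.get, reverse=True)[:top_n]

for u in USER_SIM:
    print(u, similar_users(u))
# User1 ['User2', 'User3']
# User2 ['User1', 'User3']
# User3 ['User2', 'User1']
```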

Three, the MapReduce framework workflow and a sample application:

1) MapReduce framework workflow

Step 1: The client submits a MapReduce job to the master node;

Step 2: Hadoop divides the job's input data into equal-size small blocks, called input splits, builds a map task for each split, and distributes the map tasks and reduce tasks to different worker nodes;

Step 3: The worker node holding each split executes its map task in parallel; the map output is sorted and, as the input of the reduce tasks, copied to the worker nodes executing the reduce tasks.

Step 4: The worker nodes executing the reduce tasks consolidate the results of the map tasks and write the final result to the output file.

The MapReduce framework workflow is shown in Fig. 2.

2) Sample: counting word frequencies with the MapReduce framework in this system

Step 1: Hadoop splits the job's raw data into equal-size splits and distributes them to different map-task worker nodes;

Step 2: The worker node holding each split executes its map task in parallel; here this comprises word segmentation, stop-word removal, and counting;

Step 3: The counts are sorted in <key:value> form and passed, as the input of reduce, to the worker nodes executing the reduce tasks;

Step 4: The reduce worker nodes merge and tally the counts, and the final result is output and saved.

The flow chart of this word-frequency sample is shown in Fig. 3.
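The four steps can be imitated in a single process to show the data flow. The Python sketch below is not Hadoop code: it runs sequentially, uses whitespace splitting as a stand-in for Chinese word segmentation, and uses an invented stop-word list; the function names are hypothetical.

```python
from itertools import groupby

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop-word list

def map_task(split):
    """Tokenize one input split, drop stop words, emit (word, 1) pairs."""
    for word in split.lower().split():
        if word not in STOP_WORDS:
            yield word, 1

def shuffle(pairs):
    """Sort map output by key and group the values per key for reduce."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [value for _, value in group]

def reduce_task(key, values):
    """Merge the counts emitted for one word."""
    return key, sum(values)

def word_count(splits):
    mapped = [kv for split in splits for kv in map_task(split)]
    return dict(reduce_task(k, vs) for k, vs in shuffle(mapped))

splits = ["the lake of the mountain", "a mountain temple", "lake temple lake"]
print(word_count(splits))
# {'lake': 3, 'mountain': 2, 'temple': 2}
```

On a real cluster each call to `map_task` would run on the worker node holding that split, and the sort-and-group step would be performed by the framework between the map and reduce phases.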

Four, implementation of the UI

After the user fills in the "user name" and "password" fields correctly and clicks the "login" button, the back end checks the information against the records in the user database and, if it is correct, logs the user in. To register a new user, click the "register" button, enter the requested information in the registration dialog that pops up, and click the "OK" button; the new user's information is stored in the user database, and the new account can then be used to log in at the login screen.

After login, the functions common to all pages are scenic-spot similarity comparison and scenic-spot search. To compare two scenic spots, enter their names in the similarity comparison field. The back end calls the scenic-spot similarity formula: from the scenic spots' feature values and the users who have visited both, it computes a weighted Euclidean distance and takes the reciprocal of the result as the similarity of the two scenic spots, so the greater the distance, the smaller the similarity. Clicking the equals sign displays the similarity value and the similarity result. To search, enter a scenic-spot name in the search field and click search; the back end queries the scenic-spot database and returns part of the information to the page.

1. The popular recommendation page has four sections: popular scenic-spot recommendation, nationwide tourist numbers by province, nationwide best travel season by province, and popular scenic-spot description.

Popular scenic-spot recommendation: this section is derived from big data statistics. The number of visitors received by each scenic spot in the scenic-spot database is computed and weighted by tourists' ratings of the spot, and the ten top-ranked scenic spots are shown as recommendations.

Nationwide tourist numbers by province: this is the map section, which shows the total number of tourists each province received over the whole of last year. Color depth on the map indicates each province's visitor numbers: the darker the color, the more people traveled to the province; the lighter the color, the fewer. Hovering the mouse over a province displays the exact figure. The section gives travelers an intuitive impression of which provinces received more or fewer tourists last year.

Nationwide best travel season by province: this section shows the best month to visit each province. The horizontal axis shows the provinces and the vertical axis the 12 months; hovering over a province displays its name and best month. The section helps users pick the province best suited to travel in the current season.

Popular scenic-spot description: based on big data statistics, the ten scenic spots with the most visitors among all scenic spots are shown as popular recommendations. The ten names are listed under the popular scenic-spot heading on the left, and the description panel on the right of the page shows the details, one scenic spot per page: name, type, level, address, and a brief introduction. Paging moves to the introduction of the next popular scenic spot, and clicking a scenic-spot name jumps directly to its detail page. The search box at the upper right runs a distributed query against the database for the entered scenic spot and shows the corresponding information in the description panel.

2. The category recommendation page has four sections: scenic-spot recommendation by category, nationwide tourist numbers by province, category word clouds, and popular scenic-spot description by category.

Scenic-spot recommendation by category: this section lists the names of the 27 scenic-spot types, which were themselves derived from statistics over the scenic-spot data. Hovering over a category name expands the ten popular scenic spots in that category; clicking a scenic-spot name shows its details in the description panel on the right.

Nationwide tourist numbers by province: this section shows the total number of tourists each province received over the whole of last year. Color depth on the map indicates each province's visitor numbers: the darker the color, the more people traveled to the province; the lighter the color, the fewer. Hovering the mouse over a province displays the exact figure. The section gives travelers an intuitive impression of which provinces received more or fewer tourists last year.

Category word clouds: this section shows the feature words of each category so that tourists can see at a glance what characterizes each one; hovering over a word shows its counted frequency. The words are obtained by text processing of the detailed descriptions of the scenic spots in each category. First, the descriptions of all scenic spots in a category are combined into one long text. The MR computing framework then segments this text into words; a dictionary of tourist attractions is added to the segmentation algorithm so that a proper scenic-spot name is not split into several words. The segmentation output is then cleaned by removing stop words, symbols, and English. Symbol removal deletes punctuation and the whitespace in sentences, and English removal deletes all English text that appears. Stop-word removal deletes words such as auxiliary words and verbs, which should not enter the final statistics: a stop-word dictionary containing all unwanted words is built first, and the segmentation output is then checked against it in a loop, deleting each stop word that occurs. Besides showing word clouds, this section can predict the type of a scenic spot: a classification algorithm from data mining is trained on the existing data and then predicts the type of a scenic spot whose type is unknown.
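The cleaning of segmentation output can be sketched as a filter over tokens. The Python sketch below is an illustration under assumptions: the tokens are presumed to come from an upstream segmenter that is not shown, the stop-word dictionary holds three made-up entries, and `clean_tokens` is a hypothetical name.

```python
import re

STOP_WORDS = {"的", "了", "是"}             # hypothetical stop-word dictionary
SYMBOL_OR_SPACE = re.compile(r"^[\W_]+$")   # token made only of punctuation or blanks
ENGLISH = re.compile(r"[A-Za-z]+")          # runs of English letters

def clean_tokens(tokens):
    """Drop English text, symbols, whitespace, and stop words from segmented tokens."""
    cleaned = []
    for token in tokens:
        token = ENGLISH.sub("", token)              # remove English
        if not token or SYMBOL_OR_SPACE.match(token):
            continue                                # remove symbols and blanks
        if token in STOP_WORDS:
            continue                                # remove stop words
        cleaned.append(token)
    return cleaned

# Example segmentation output for one description sentence.
tokens = ["故宫", "是", "著名", "的", "景点", "，", " ", "Museum"]
print(clean_tokens(tokens))
# ['故宫', '著名', '景点']
```

Holding the stop words in a set makes each membership check constant-time, which is a simpler alternative to looping over the whole dictionary for every token.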

Classification hot spot describes: according to big data statistics, shows in each classification that all sight spots visit number is Many front ten sight spots, recommend as classification hot spot.The name of each classification is under the type of sight spot, left side Display, each item name of mouse-over, front ten hot spot titles of the category can be shown below classification, One of them sight name of click, the sight spot on the right side of the page describes a hurdle and will show the specifying information at sight spot, Wherein one page is that a sight spot describes, and particular content includes sight name, sight spot type, sight spot rank, sight spot Address and sight spot brief introduction.Page-turning function, can check the introduction of generic next hot spot, or directly point Hit hot spot title, it is possible to translate into corresponding sight spot lobby page.Function of search above You Ce, can be for defeated Enter sight spot in data base, carry out distribution inquiry, and corresponding informance is illustrated in the description of sight spot.

3. The personalized recommendation interface mainly includes four columns: personalized recommended attractions, visitor statistics for each province nationwide, user relationship network, and recommended attraction description.

Personalized recommended attractions: after the user logs in, this section lists ten attraction names. The names are computed from the individual's tourism information by item-based collaborative filtering combined with a content-based recommendation algorithm. First, the attractions the individual has visited and the individual's ratings of those attractions are collected to form an individual behavior matrix. Next, the big data platform computes a co-occurrence matrix from the attraction information of all users, and the partial results are merged. Finally, the co-occurrence matrix is multiplied by the individual behavior matrix to obtain the user's weighted score for every attraction, and the ten highest-scoring attractions are recommended as the personalized recommendation result. Each value in the individual behavior matrix is the user's rating of an attraction, weighted by attraction similarity and attraction attribute values. Attraction similarity can be understood as the proportion of users who like attraction i who also like attraction j. To avoid always surfacing hot attractions and to bring out long-tail attractions, the similarity formula penalizes the weight of popular attractions, which reduces the likelihood that a hot attraction appears similar to a large number of other attractions.
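
The calculation above can be sketched as follows. The text does not give the exact similarity formula, so this sketch assumes a common choice: the co-occurrence count of two attractions divided by the square root of the product of their visitor counts, which lowers the similarity of very popular attractions. The content-attribute weighting of the ratings is omitted for brevity, and the function names and sample data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def attraction_similarity(ratings):
    """ratings: {user: {attraction: score}} -> {i: {j: similarity}}."""
    visitors = defaultdict(int)                 # number of users who rated each attraction
    co = defaultdict(lambda: defaultdict(int))  # number of users who rated both i and j
    for items in ratings.values():
        for i in items:
            visitors[i] += 1
        for i, j in combinations(sorted(items), 2):
            co[i][j] += 1
            co[j][i] += 1
    # Assumed penalty: divide by sqrt(|N(i)| * |N(j)|) so popular attractions score lower.
    return {
        i: {j: c / math.sqrt(visitors[i] * visitors[j]) for j, c in row.items()}
        for i, row in co.items()
    }

def recommend(user, ratings, sim, top_n=10):
    """Multiply the similarity matrix by the user's rating vector and keep the top N
    attractions the user has not yet rated."""
    seen = ratings[user]
    scores = defaultdict(float)
    for i, rating in seen.items():
        for j, s in sim.get(i, {}).items():
            if j not in seen:
                scores[j] += s * rating
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 4, "B": 5, "C": 3},
    "u3": {"A": 5, "C": 4},
    "u4": {"A": 3, "D": 5},
}
sim = attraction_similarity(ratings)
top = recommend("u1", ratings, sim)
```

In this toy data attraction A is rated by every user, so without the penalty it would look strongly related to every other attraction; the normalization scales its similarities down in proportion to its popularity.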

Visitor statistics for each province nationwide: shows the number of visitors received by each province over the whole of the previous year. On the map, the number received by each province is shown by depth of color: the darker the color, the more people traveled to that province; the lighter the color, the fewer. The purpose of this column is to give travelers an intuitive impression of which provinces received more or fewer travelers over the previous year.

User relationship network: shows how closely all visitors are related; the purpose of this column is to help users make more like-minded friends. The chart divides closeness according to the similarity between one visitor and another. Similarity is computed from the users' common interests, that is, from statistics on whether the attractions they have visited are the same or similar; it can be understood as whether users A and B, who have both visited attraction i, have also both visited attraction j. The user database is traversed and the Euclidean distance formula is used to compute the distance between two users: the greater the distance, the lower the similarity; the smaller the distance, the higher the similarity. Hovering the mouse over a user shows the similarity with that user; clicking the user shows which attractions that user has visited; clicking the "recommend attractions by similar users" button produces a top-ten recommendation based on the attractions visited by the users with the highest similarity.
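
The distance-based user similarity can be sketched as follows. The text states only that a larger Euclidean distance means a lower similarity; the conversion `1 / (1 + distance)` used here is an assumed, commonly used mapping, and the function names and the choice to compare only attractions that both users have rated are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Distance between two users over the attractions both have rated.
    Returns None when the users share no attraction."""
    shared = set(a) & set(b)
    if not shared:
        return None
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in shared))

def similarity(a, b):
    """Assumed mapping: distance 0 gives 1.0, larger distances approach 0."""
    d = euclidean_distance(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)

def most_similar(user, ratings, top_n=10):
    """Names of the users most similar to `user`, highest similarity first."""
    others = [
        (similarity(ratings[user], r), name)
        for name, r in ratings.items()
        if name != user
    ]
    others.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in others[:top_n]]
```

The attractions visited by the users returned from `most_similar` could then serve as the pool for the "recommend attractions by similar users" list.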

Personalized recommended attraction description: the ten highest-scoring attractions in the result computed by the recommendation algorithm are recommended as the personalized attractions. The ten attraction names are displayed under "personalized recommended attractions" on the left, and the attraction description column on the right of the page shows the details of an attraction. Each page describes one attraction, and the content includes the attraction name, type, rank, address and a brief introduction. The page-turning function shows the introduction to the next attraction; alternatively, clicking an attraction name jumps directly to the corresponding introduction page. The search function at the upper right runs a distributed query in the database for the entered attraction and shows the corresponding information in the attraction description.

Claims

1. A Hadoop-based personalized tourism recommendation system, characterized in that: the system uses Eclipse as the development tool, Hadoop as the big data processing platform and Java as the programming language; the local Windows system and the server's CentOS system are connected across platforms through JSCH, so that the corresponding operation requests can be sent to the server from the local browser; based on the interaction information of the page, the back end uses the MapReduce computing framework in Hadoop to search and compute step by step in the distributed file system, and returns the integrated result to the front-end page;

The system has five modules that complement one another to complete the overall system function: the web crawler module, the data module, the big data processing module, the recommendation computing module and the UI interface module; their connection relationship is as follows: the web crawler module is unidirectionally connected to the metadata module and is also unidirectionally connected to the UI interface module; the data module is unidirectionally connected to the big data processing module and is bidirectionally connected to the UI interface module; the big data processing module is unidirectionally connected to the recommendation computing module and is bidirectionally connected to the UI interface module; the recommendation computing module is bidirectionally connected to the UI interface module; the specific connection process of each module is as follows,

1. The web crawler module mainly crawls attraction information and user information data; attraction information is crawled in order by province and city: the province and city data in the data module is traversed first; the back end modifies the city name in the URL of the tourism website and at the same time obtains the website's Cookie, and obtains in turn the list of attraction names under each city of each province; then, following this attraction list, the relevant fields required for each attraction are extracted in turn and recorded and stored in the corresponding attraction information table in the database; user information data is obtained from the review page of each attraction: the reviews of the attraction are fetched, the details of each reviewer, i.e. user, are obtained from the review information, and the user information and review information are recorded and stored in the corresponding user information table and evaluation table in the database, respectively; the crawl flow is as follows:

Country list → province list → city list → attraction list → attraction field information → attraction reviews → reviewer information

The web crawler module triggers the crawling program mainly in two ways; one is on a schedule: at a fixed time every day, the attraction data in the database is read and the corresponding attraction and user information crawlers are triggered, and the results are recorded and stored in the corresponding data tables in the database; the other is triggered by the search function of the UI interface module: when the queried attraction name has no matching result in the database, the crawler is triggered to query the tourism website and crawl the relevant information; if the corresponding attraction is found, its relevant fields are crawled, recorded and stored in the corresponding attraction information table in the database, and the result is at the same time fed back to the corresponding position on the UI page.

2. The data module is mainly used to store basic data, in three major categories: attraction basic data, user basic data and user-attraction relationship data; attraction basic data comprises the province list, the city list, the attraction list and the basic information table of each attraction; user basic data comprises users' basic information and information on the attractions users have visited; user-attraction relationship data comprises users' evaluation data for attractions; on the one hand the data module provides basic data support for the big data processing module, and on the other hand the required information can be queried from it through the search function of the UI interface.

3. The big data processing module runs on the MapReduce computing framework of the Hadoop platform; the framework is divided mainly into two parts, Map and Reduce: the raw data is first split and then distributed by the master node to the worker nodes that execute map tasks; the worker nodes start executing the map tasks at the same time; after the map tasks finish, their output is sent, as the input of the reduce tasks, to the worker nodes that execute reduce tasks; reduce is responsible for merging the results of the map processing and integrating the final result for output;

The purpose of this module is to improve data processing speed; it can be triggered through the similarity search function, the category attraction word cloud function and the attraction type prediction function of the UI interface to retrieve the data in the database for processing and computation, and the results are returned to the corresponding position of the UI page; the computation is parallelized mainly for the following four aspects: first, the web crawler process adopts the MR approach, which effectively increases crawling speed; second, it is applied to the computation of user similarity and attraction similarity, so that similarity can be computed for users or attractions in a short time, using the user basic data, attraction basic data and user-attraction relationship data in the data module; third, it is applied to text mining: word segmentation statistics are computed for the attraction information of each category, and the corresponding dynamic category word cloud is displayed; in addition, category prediction is computed for attractions of unknown type, mainly using the attraction basic data for classification training.

4. The recommendation computing module performs targeted recommendation calculations on the result data of the big data processing module and feeds the recommendation results back to the corresponding fields of the UI page; this module provides three kinds of recommendation: the first recommends similar users to the logged-in user, i.e. a user-to-user similarity matrix is built from the user similarities computed by the big data processing module, and the ten users most similar to this user are taken as the recommendation result; the second is content-based recommendation: the features of the attractions the user has previously visited are analyzed and extracted as the user's preferences, and similar attractions are found in the attraction similarity matrix; the third is hybrid recommendation, i.e. personalized recommendation, which combines the content-based recommendation method with the item-based collaborative filtering method to improve recommendation accuracy and give the user more suitable results; the hybrid method first forms the attraction co-occurrence matrix from the user-attraction relationship data, then weights the attraction content features into the specific user's attraction ratings to form the user behavior matrix, and multiplies the co-occurrence matrix by the behavior matrix to obtain the user's score for every attraction; the ten highest-scoring attractions are presented to the user as the final recommendation result.

5. The UI interface module is related to all four modules above; its relationship with the web crawler module is a unidirectional trigger, and its relationship with each of the other three modules is bidirectional: on the one hand, page functions trigger the back-end programs in each associated module, and on the other hand, each module feeds its computed results back to the corresponding fields of the page for display;

The UI interface module mainly has three pages: the hot attraction recommendation page, the category recommendation page and the personalized recommendation page; two functions are available on every page: attraction content retrieval, i.e. entering an attraction name displays the basic information of that attraction, and attraction similarity query, i.e. entering two attraction names helps the user find the similarity value and similarity result of the two attractions; the hot attraction recommendation page is based on comprehensive statistics of the selections and evaluations of 1,000,000 users in the database; this page shows the Top 10 attractions and their information, together with the number of visitors to each province nationwide and the best travel season for each province; the category recommendation page shows the ten most popular attractions in each category, obtained from per-category statistics; all attractions are divided into 27 categories; text data mining is performed for each category to extract its key feature words, and a word cloud is made for each category according to the frequency of occurrence of those words, so that users can see the feature words of each category more intuitively; in addition, type prediction can be performed for an attraction of unknown type, computing which category the attraction most probably belongs to; the personalized recommendation page can display recommendation results by the content-based method or by the hybrid recommendation method; in addition, it shows the relationship graph between users according to user similarity, and recommends to the user the ten users with the highest similarity.
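
The map, shuffle and reduce flow that the big data processing module relies on can be illustrated with a small in-memory simulation. This is a sketch of the programming model only, not Hadoop code: it runs in a single process, and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical.

```python
from collections import defaultdict

def map_phase(split):
    """Map task: emit a (word, 1) pair for each word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    """Group the values emitted by all map tasks by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: merge the values for one key into a final count."""
    return key, sum(values)

def run_job(splits):
    # On a cluster each split would be handled by a separate worker node.
    mapped = [map_phase(s) for s in splits]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())
```

Calling `run_job(["lake temple lake", "temple garden"])` counts each word across both splits, which mirrors how word frequencies for the category word clouds could be accumulated.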