CN103235824A - Method and system for determining web page texts of interest to users according to browsed web pages

Info

Publication number: CN103235824A
Application number: CN201310163619XA
Authority: CN
Inventors: 刘臻; 吕琳媛; 肖思源; 刘润然; 佘莉
Original assignee: SHANGHAI HEGUANG INFORMATION TECHNOLOGY Co Ltd
Current assignee: SHANGHAI HEGUANG INFORMATION TECHNOLOGY Co Ltd
Priority date: 2013-05-06
Filing date: 2013-05-06
Publication date: 2013-08-07

Abstract

A method for determining the web page texts a user is interested in according to browsed web page URLs (Uniform Resource Locators) comprises the steps of: filtering the web pages browsed by the user within a certain period of time to remove useless web pages and web pages that cannot be accessed; following the URL addresses that remain after filtering to obtain the text content of each page and extract its title and body text; assigning a category to every web page text in the web page document collection according to predefined topic categories; and computing access frequency statistics for every category, taking the web page set of the category with the highest access frequency as the web page texts the user is interested in. The result serves as analysis data for targeted pushing of data services and improves the credibility of data service pushes.

Description

Method and system for determining web page texts of interest to a user according to browsed web pages

Technical field

The present invention relates to a method and system for determining the web page texts a user is interested in according to browsed web page URLs, for use in the field of pushing data services according to user interest preferences.

Background art

Data service pushing began to flourish in 2011, and numerous organizations have emerged in the industry. Data service pushing has evolved from a first stage of website combination (where media selection is paramount, with media combined and selected according to the characteristics of their audiences), through a second stage of contextual targeting (where content optimization is paramount, with audience types attracted and combined according to content), to the present third stage of audience-targeted pushing built around audience targeting technology, which places more emphasis on identifying the audience. In addition, location-based data service pushing has been developing and maturing along another dimension.

The objective of the invention is to accurately determine the web page texts a user is interested in according to browsed web page URLs, so that each user's behavioral habits can be tracked, the user's behavior and browsed content analyzed, and the user's interest preferences predicted. Information is thereby concentrated on recipients who are interested in it and have a need for it, achieving targeted pushing of data services, improving the credibility of data service pushes, increasing user satisfaction, and better reducing data noise.

Summary of the invention

The invention provides a method for determining the web page texts a user is interested in according to browsed web page URLs, comprising the steps of: filtering the web pages browsed by the user within a certain period to exclude useless pages and pages that cannot be accessed; following the URL addresses that remain after screening to obtain the text content of each page and extract its title and body text; determining a category for each web document in the web document collection according to predefined topic categories; and computing access frequency statistics for each category, taking the web page set with the highest access frequency as the web pages the user is interested in.

In the web page classification step, a web page classifier must be built and trained. The input is a training text set; after text representation and feature selection, a classifier model is built according to the feature dictionary, and the output is a set of classification rules with a tree-like structure. Training the web page classifier consists of repeatedly partitioning the training samples: a classification and prediction model of the target variable with respect to each input variable is established, so that the data are successively grouped by the differing values of the input variables and the target variable, and the model is then used to classify and predict new data objects.

The web page classifier uses a decision tree classification method, with the following steps:

Represent the test sample in the same form as the training samples;

t ← root node of the decision tree;

Compare the value of the corresponding feature of the test sample against the test attribute and threshold of decision tree node t, and then decide according to the splitting criterion of node t whether t ← left child of t or t ← right child of t;

Repeat the previous step recursively until t is a leaf node;

The category of the test sample is the category represented by leaf t.
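The traversal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the `classify` function, the example tree, and the convention that values less than or equal to the threshold go to the left child are all assumptions made here for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Internal nodes carry a test attribute (feature index) and a threshold;
    # leaf nodes carry only a category label. (Hypothetical structure.)
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None


def classify(root: Node, sample: Sequence[float]) -> str:
    """Walk from the root to a leaf and return the leaf's category."""
    t = root
    while t.label is None:
        # Assumed splitting rule: value <= threshold goes left, otherwise right.
        t = t.left if sample[t.feature] <= t.threshold else t.right
    return t.label


# A tiny hand-built tree over two term-frequency features.
tree = Node(
    feature=0,
    threshold=2.0,
    left=Node(label="sports"),
    right=Node(feature=1, threshold=0.0,
               left=Node(label="IT"), right=Node(label="travel")),
)
```

The loop replaces the recursion of the text with an equivalent iteration; the test sample must use the same feature order as the training samples.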

In addition, in the web page classification step, the input is the text to be classified after processing by the text preprocessing module. After text representation, feature selection is performed according to the feature dictionary, text classification is carried out with the classification rules of the trained classifier model, and the output is the category to which each text belongs.

In addition, in the text representation step, a feature vector space is used to represent text features, and document i can be expressed as the following feature vector:

W_{i} = (W_{i1}, W_{i2}, ..., W_{im})

where W_{ij} is a function of the frequency f_{ij} with which term j occurs in document i. The occurrence frequency of the term in the document is used directly as the feature value, computed as:

W_{ij} = f_{ij}.
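As a small illustration of the representation W_{ij} = f_{ij}, the sketch below builds a raw term-frequency vector for one tokenized document. The function name `tf_vector` and the example dictionary are hypothetical; the document is assumed to be already segmented into tokens.

```python
from collections import Counter


def tf_vector(tokens, dictionary):
    """Return the term-frequency vector of one document.

    tokens     -- list of words of document i (already segmented)
    dictionary -- ordered list of the m feature terms
    """
    counts = Counter(tokens)
    # W_ij is simply the number of times term j occurs in document i.
    return [counts[term] for term in dictionary]


vector = tf_vector(["goal", "match", "goal"], ["goal", "hotel", "match"])
```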

Furthermore, in the feature selection step, a feature dimension reduction method based on an improved χ² statistic and pattern aggregation is adopted, with the following steps:

⑴ According to the formula

χ_{ij}^{'2} = \operatorname{sign}(n_{11} \times n_{22} - n_{12} \times n_{21}) \frac{n \times {(n_{11} \times n_{22} - n_{12} \times n_{21})}^{2}}{(n_{11} + n_{12}) \times (n_{21} + n_{22}) \times (n_{11} + n_{21}) \times (n_{12} + n_{22})}

calculate the improved χ² statistic of each term with respect to every class;

⑵ According to the formula

{CHI}_{i} = \max \{ |χ_{i1}^{'2}|, |χ_{i2}^{'2}|, \cdots, |χ_{is}^{'2}| \}

calculate the CHI value of each term, sort the features by CHI value from high to low, and select the M feature terms with the largest CHI values; the resulting feature matrix then has M patterns;

⑶ To compare whether the patterns contribute to the classification of the various classes in consistent proportions, first scale the improved statistics of each pattern into the interval [-1, 1], as follows:

A_{ij} = χ_{ij}^{'2} / (\max - \min)

where max and min are respectively the maximum and minimum of the improved χ² statistics of pattern i;

⑷ Cluster the patterns of A with a simple clustering algorithm (each row of A represents one pattern), aggregating patterns of the same class into a new pattern. This yields L new patterns, where L is much smaller than M. Agglomerative hierarchical clustering is used, with the commonly used Euclidean distance as the distance measure:

d(i, j) = \sqrt{{(A_{i1} - A_{j1})}^{2} + {(A_{i2} - A_{j2})}^{2} + \cdots + {(A_{is} - A_{js})}^{2}}

Patterns whose Euclidean distance d(i, j) is less than a certain threshold are clustered together. The clustering process is:

1. according to matrix A, find the patterns whose distance is less than the threshold and cluster them;

2. after clustering, merge the patterns in each class into a single pattern that contains all the terms of that class and whose term frequency is the sum of the term frequencies of those terms; recalculate the improved statistic of the new pattern and rebuild matrix A from the new patterns;

Repeat steps 1 and 2 until no patterns can be aggregated further;

⑸ Recalculate the CHI value of each feature item, and select the top L' feature items by CHI value.
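Steps ⑴ and ⑵ can be sketched as below. The text does not define the counts n_{11} to n_{22}; the sketch assumes the usual 2×2 contingency table of a term against a class (n11: documents of the class containing the term, n12: documents of other classes containing it, n21: documents of the class without it, n22: documents of other classes without it). The function names and the zero returned for a degenerate table are likewise assumptions.

```python
def signed_chi2(n11, n12, n21, n22):
    """Improved (signed) chi-square statistic of one term for one class."""
    n = n11 + n12 + n21 + n22
    diff = n11 * n22 - n12 * n21
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    if denom == 0:
        return 0.0  # assumed convention for a degenerate table
    sign = (diff > 0) - (diff < 0)
    return sign * n * diff ** 2 / denom


def chi_score(per_class_stats):
    """CHI_i: the largest absolute statistic of a term over all classes."""
    return max(abs(v) for v in per_class_stats)


def top_m(stats_by_term, m):
    """Rank terms by CHI from high to low and keep the first m."""
    ranked = sorted(stats_by_term,
                    key=lambda term: chi_score(stats_by_term[term]),
                    reverse=True)
    return ranked[:m]
```

The sign keeps track of whether a term is positively or negatively associated with a class, which the plain χ² statistic discards.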

The invention also provides a system for determining the web page texts a user is interested in according to browsed web page URLs, comprising a web page text acquisition submodule, a web page text classification submodule, an access frequency statistics submodule, and a user current content interest determination submodule. The web page text acquisition submodule filters the web pages browsed by the user within a certain period to exclude useless pages and pages that cannot be accessed, follows the URL addresses that remain after screening, obtains the text content of each page, and extracts its title and body text. The web page text classification submodule determines a category for each web document in the web document collection according to predefined topic categories. The access frequency statistics submodule computes access frequency statistics for each category, and the user current content interest determination submodule takes the web page set with the highest access frequency as the web pages the user is interested in.

In the web page text classification submodule, a web page classifier must be built and trained. The input is a training text set; after text representation and feature selection, a classifier model is built according to the feature dictionary, and the output is a set of classification rules with a tree-like structure. Training the web page classifier consists of repeatedly partitioning the training samples: a classification and prediction model of the target variable with respect to each input variable is established, so that the data are successively grouped by the differing values of the input variables and the target variable, and the model is then used to classify and predict new data objects.

In addition, the input of the web page text classification submodule is the text to be classified after processing by the text preprocessing module. After text representation, feature selection is performed according to the feature dictionary, text classification is carried out with the classification rules of the trained classifier model, and the output is the category to which each text belongs.

Brief description of the drawings

Fig. 1 is a system structure diagram of a mobile terminal browsing pages through a wireless gateway;

Fig. 2 shows a method of obtaining the interest preferences of a mobile terminal user in real time on a mobile server through a wireless gateway;

Fig. 3 is the operational flowchart of the time window adjustment and web page data classification statistics module of the invention;

Fig. 4 is the operational flowchart of the web page classification/content information processing submodule of the invention;

Fig. 5a shows the method of building the web page text classifier of the invention;

Fig. 5b shows the method of using the web page text classifier of the invention;

Fig. 6 is the operational flowchart of the user content interest extraction submodule of the invention;

Fig. 7 shows an exemplary tree structure of user interest preferences according to the invention;

Fig. 8 is the operational flowchart of the data service push module;

Fig. 9 is the operational flowchart of the location analysis module of the invention;

Fig. 10 is the flowchart of location information association according to the invention.

Embodiments

The method of the invention for determining the web page texts a user is interested in according to browsed web page URLs, and the data push service embodiments to which it applies, are further described below with reference to Figs. 1 to 10.

Fig. 1 is a system structure diagram of a mobile terminal browsing pages through a wireless gateway such as a WAP gateway.

The invention provides a data service push system based on a wireless network. After obtaining, through a wireless gateway, the log information of a user's use of a mobile terminal such as a mobile phone, the system filters the user's mobile phone usage behavior within a preceding time range to obtain the user's behavioral features, combines the user's content interests with the user's behavioral habits to form the user's interest preferences, associates these in real time with the location information of the mobile terminal, and pushes information to the mobile terminal. The system is shown by the part marked with the dashed frame in Fig. 1 and comprises a time window adjustment and web page data classification statistics module, a user interest extraction module, a data service push module, and a location analysis module, wherein:

The time window adjustment and web page data classification statistics module receives the URLs of browsed pages from the wireless gateway and filters the web pages browsed by the user within a preceding time range to obtain the web pages the user is interested in and the user's behavioral features;

The user interest extraction module comprises a behavioral information analysis submodule, a content information analysis submodule, and an ensemble learning submodule:

The behavioral information analysis submodule, based on the user's behavioral features, performs statistics, screening, and dimension reduction on the time series to form the user's behavioral interest, and outputs the user's current behavioral interest;

The content information analysis submodule, based on the URL addresses of the web pages the user is interested in, performs text processing on the page content, extracts the page topics, and forms the user's content interest from those topics and other attribute information of the pages, outputting the user's current content interest;

The ensemble learning submodule applies ensemble learning to the user's current behavioral interest and current content interest to form the user interest, and outputs the user's current interest;

The location analysis module obtains the user's current browsing location information through the GMLC gateway;

The data service push module, based on the current user interest output by the user interest extraction module, uses a rule association strategy to judge whether to perform a localized information push service. For a current user interest that does not fit the characteristics of localized service, the push module matches it against the corresponding pre-push information and selects the push information with the highest matching degree. For a current user interest that fits the characteristics of localized service, it obtains location-associated information from the user's current browsing location supplied by the location analysis module, then uses a matching strategy to match the user's current interest against the location-associated information, selects the location-associated information with the highest matching degree as the push information, and pushes it to the mobile terminal.

The wireless gateway includes devices such as a WAP GW, an enhanced GGSN, or a standalone integrated gateway; in the description below, the invention is presented using the common WAP GW as the example.

The browsed pages are provided by SP/CP servers in the network, and the mobile terminal accesses these pages through the wireless gateway.

The invention provides a data service push method based on a wireless network, as shown in Fig. 2. After obtaining, through a wireless gateway, the log information of a user's use of a mobile terminal such as a mobile phone, the method filters the user's mobile phone usage behavior within a preceding time range to obtain the user's behavioral features, combines the user's content interests with the user's behavioral habits to form the user's interest preferences, associates these in real time with the location information of the mobile terminal, and pushes information to the mobile terminal. The method comprises:

Receiving the URLs of browsed pages from the wireless gateway, and filtering the web pages browsed by the user within a preceding time range to obtain the web pages the user is interested in and the user's behavioral features;

Based on the user's behavioral features, performing statistics, screening, and dimension reduction on the time series to form the user's behavioral interest, taken as the user's current behavioral interest; based on the URL addresses of the web pages the user is interested in, performing text processing on the page content, extracting the page topics, and forming the user's content interest from those topics and other attribute information of the pages, taken as the user's current content interest; and applying ensemble learning to the current behavioral interest and current content interest to form the user interest, taken as the user's current interest;

Obtaining the user's current browsing location information through the GMLC gateway;

Based on the current user interest, using a rule association strategy to judge whether to perform a localized information push service. A current user interest that does not fit the characteristics of localized service is matched against the corresponding pre-push information, and the push information with the highest matching degree is selected. For a current user interest that fits the characteristics of localized service, location-associated information is obtained from the user's current browsing location; a matching strategy is then used to match the user's current interest against the location-associated information, and the location-associated information with the highest matching degree is selected as the push information and pushed to the mobile terminal.

The time window adjustment and web page data classification statistics module comprises a time window adjustment submodule and a web page data classification statistics submodule; the latter comprises a behavioral information statistics submodule and a web page classification submodule. Fig. 3 is the operational flowchart of the time window adjustment and web page data classification statistics module.

The time window adjustment submodule executes the time window adjustment method: according to the user's surfing speed and habits, it determines and adjusts the time window so as to reflect the user's concentrated interest in the current time period.

To obtain the web pages the user is interested in and the user's behavioral features, the system must filter the web pages browsed by the user within a preceding time range. In the prior art, the time interval over which statistics are processed is usually a fixed value. One approach processes the user's interest preferences over a long period, such as a day, a month, or even a year; although this analyzes user interest more comprehensively and accurately, the volume of page content to analyze is huge and real-time performance is poor. Another approach takes a single surfing action or a single browsed page as the trigger, making one recommendation each time the user goes online or browses a page; although this is real-time, the system returns too much recommended content, which increases the load on the wireless communication network and degrades the user experience.

In view of these problems in the prior art, the invention adopts a time window adjustment method that takes into account both the user's long-term and short-term interest preferences and balances between the two. Adjusting the time window controls the number of web pages obtained, and adjusting its size achieves a real-time effect, making the result more timely and accurate.

The time window adjustment method can be executed by the time window adjustment submodule.

The purpose of the method is to start from the user's current surfing time and, taking as reference a time range suited to the user's surfing speed and habits, analyze the interest categories reflected by the user's surfing within that range.

The time window adjustment method sets the initial set time value of the time window according to each user's surfing speed and habits; thereafter the set time of the window is adjusted automatically as the user's surfing habits change. The steps are:

Compute the user's historical surfing density

D = \frac{M}{T}

where T is a historical period of time and M is the number of the user's surfing actions within period T;

An initial set time value t is then chosen, where α is an empirical value used to adjust the size of the time window. The set time range ensures that the user has a certain amount of surfing activity and surfing time; a shorter time range makes the user's interests more concentrated and keeps the user's range of movement small;

After a certain period, recalculate the user's surfing density over a new time period,

d = \frac{M^{'}}{T^{'}};

The set time value becomes:

t^{'} = t + \frac{D - d}{D + d};

where the size of α is adjustable: after a longer period, the total surfing volume is tallied and α is adjusted according to the formula above.
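The update rule t' = t + (D - d)/(D + d) can be illustrated with a short sketch. The function names are hypothetical, the time unit is left unspecified as in the text, and returning t unchanged when both densities are zero is an assumption added to avoid division by zero.

```python
def surf_density(action_count, duration):
    """Surfing density: number of surfing actions per unit of time."""
    return action_count / duration


def adjust_window(t, historical_density, recent_density):
    """Return the new set time value t' = t + (D - d) / (D + d)."""
    total = historical_density + recent_density
    if total == 0:
        return t  # assumed: no activity at all, keep the window as it is
    return t + (historical_density - recent_density) / total


D = surf_density(360, 60)   # historical period: 6 actions per unit time
d = surf_density(120, 60)   # recent period: 2 actions per unit time
new_t = adjust_window(10.0, D, d)
```

Because the correction term always lies between -1 and 1, the window widens when recent activity is sparser than the historical density and narrows when the user surfs more densely than before.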

The web page data classification processing submodule comprises a behavioral information processing submodule and a web page classification/content information processing submodule, which process behavioral information and web page classification/content information to obtain the web pages the user is interested in and the user's behavioral features.

The behavioral information processing submodule comprises an SMS behavior statistics submodule, a communication behavior statistics submodule, a surfing behavior statistics submodule, a submodule that prunes user behavioral features by the PCA method, and a submodule that determines the user's current behavioral features. Based on the browsed web pages obtained, it compiles time statistics on the above user behaviors within the time window to obtain the user's behavioral features.

The operation steps of the behavioral information processing submodule are: compile SMS behavior statistics; compile communication behavior statistics; compile surfing behavior statistics; prune user behavioral features by the PCA method; determine the user's current behavioral features.

The web page classification/content information processing submodule comprises a web page text acquisition submodule, a web page text classification submodule, an access frequency statistics submodule, and a user current content interest determination submodule. Within the time window, it filters the web pages browsed by the user to obtain a group of related web pages, obtains the text content of each page from the URL address of the visited page, and classifies the text content. It then computes access frequency statistics for each category and takes the web page set with the highest access frequency as the web pages the user is interested in. Fig. 4 is the operational flowchart of the web page classification/content information processing submodule.

The operation steps of the web page classification/content information processing submodule are: obtain the web page text; classify the web page text; compute access frequency statistics; determine the web pages the user is interested in.
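The last two steps, access frequency statistics and selection of the most visited category, can be sketched as follows. The function name, the input format (pairs of URL and category produced by an earlier classification step), and the example URLs are hypothetical.

```python
from collections import Counter


def pages_of_interest(classified_pages):
    """Pick the most frequently visited category and its pages.

    classified_pages -- list of (url, category) pairs for one time window
    """
    counts = Counter(category for _, category in classified_pages)
    if not counts:
        return None, []
    top_category, _ = counts.most_common(1)[0]
    urls = [url for url, category in classified_pages if category == top_category]
    return top_category, urls


visits = [
    ("http://example.com/a", "sports"),
    ("http://example.com/b", "travel"),
    ("http://example.com/c", "sports"),
]
category, urls = pages_of_interest(visits)
```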

The web page text acquisition submodule takes the input URL addresses, excludes useless pages and pages that cannot be accessed, follows the URL addresses that remain after screening, and extracts the title and body text.

The textual information in a web page source file is generally distributed as follows:

where link 4 and link 5 are link information that is also body text.

Through format analysis, the <title> tag is matched to obtain the title information; useless link information is excluded to obtain the body text and useful link information, for example text 1, link 4, text 2, link 5, text 3.

The web page text acquisition submodule outputs the title and body text of the web page to the web page text classification submodule.
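A minimal sketch of the title and text extraction, using only Python's standard `html.parser` module. The class and function names are hypothetical, and the sketch does not attempt the distinction between useful and useless links that the text describes: it keeps all visible text, including link text, and skips only script and style content.

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collects the <title> content and the visible text of a page."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.text.append(data)


def extract(html):
    """Return (title, list of text fragments) for one HTML source string."""
    parser = TitleTextParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.text
```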

The web page text classification submodule determines a category for each web document in the web document collection according to predefined topic categories, such as sports, food and drink, IT, real estate, automobiles, and tourism. Fig. 5a shows the method of building the web page text classifier; Fig. 5b shows the method of using it.

The web page classifier comprises the following two parts:

The building and training part of the web page classifier: its input is a training text set; after text representation and feature selection, a classifier model is built according to the feature dictionary, and the output is a set of classification rules with a tree-like structure, as shown in Fig. 5a;

Training the web page classifier consists of repeatedly partitioning the training samples: a classification and prediction model of the target variable with respect to each input variable is established, so that the data are successively grouped by the differing values of the input variables and the target variable, and the model is then used to classify and predict new data objects.

The training step of the classifier is: when selecting an attribute at each level of decision tree nodes, the gain ratio is used as the attribute selection criterion.
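The gain ratio criterion can be illustrated for a single binary split on one numeric attribute. This is a sketch of the standard definition (information gain divided by split information, as used in C4.5), not code from the patent; the function names and the "less than or equal goes left" convention are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting the samples at `threshold` on one attribute."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - children
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info
```

At each node, a trainer following this criterion would evaluate candidate attributes and thresholds and keep the one with the highest gain ratio.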

The classification part of the web page classifier: its input is the text to be classified (a web document object) after processing by the text preprocessing module. After text representation, feature selection is performed according to the feature dictionary, text classification is carried out with the classification rules of the trained classifier model, and the output is the category to which each text belongs, as shown in Fig. 5b.

1. Represent the test sample in the same form as the training samples;

2. t ← root node of the decision tree;

3. Compare the value of the corresponding feature of the test sample against the test attribute and threshold of decision tree node t, and then decide according to the splitting criterion of node t whether t ← left child of t or t ← right child of t;

4. Repeat step 3 recursively until t is a leaf node;

5. The category of the test sample is the category represented by leaf t.

In the text representation step, a feature vector space is used to represent text features, and document i can be expressed as the following feature vector:

W_{i} = (W_{i1}, W_{i2}, ..., W_{im})

W_{ij} = f_{ij}

In the feature selection step, a feature dimension reduction method based on an improved χ² statistic and pattern aggregation is adopted, with the following steps:

⑴ According to the formula

χ_{ij}^{'2} = \operatorname{sign}(n_{11} \times n_{22} - n_{12} \times n_{21}) \frac{n \times {(n_{11} \times n_{22} - n_{12} \times n_{21})}^{2}}{(n_{11} + n_{12}) \times (n_{21} + n_{22}) \times (n_{11} + n_{21}) \times (n_{12} + n_{22})}

calculate the improved χ² statistic of each term with respect to every class;

⑵ According to the formula

{CHI}_{i} = \max \{ |χ_{i1}^{'2}|, |χ_{i2}^{'2}|, \cdots, |χ_{is}^{'2}| \}

calculate the CHI value of each term and select the M feature terms with the largest CHI values;

⑶ Scale the improved statistics of each pattern:

A_{ij} = χ_{ij}^{'2} / (\max - \min)

⑷ Cluster the patterns whose Euclidean distance is less than a threshold:

d(i, j) = \sqrt{{(A_{i1} - A_{j1})}^{2} + {(A_{i2} - A_{j2})}^{2} + \cdots + {(A_{is} - A_{js})}^{2}}

Repeat the two clustering steps (cluster the patterns closer than the threshold, then merge each cluster into one pattern and rebuild A) until no patterns can be aggregated further.
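The scaling and distance-threshold grouping of patterns can be sketched as follows. This is a deliberately simplified illustration: it groups rows of A that are transitively closer than the threshold in a single pass and does not recompute the improved statistic of merged patterns, which the full procedure requires. All function names are hypothetical.

```python
import math


def normalize(row):
    """Scale one pattern's statistics by (max - min) of that pattern."""
    span = max(row) - min(row)
    return [v / span for v in row] if span else list(row)


def euclid(a, b):
    """Euclidean distance d(i, j) between two rows of A."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_patterns(A, threshold):
    """Group row indices of A whose pairwise distance is below threshold."""
    parent = list(range(len(A)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            if euclid(A[i], A[j]) < threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(A)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group would then be merged into one new pattern whose term frequency is the sum of its members' frequencies, after which the statistics are recalculated and the grouping repeated.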

The ensemble learning submodule applies ensemble learning to the user's current behavioral interest and current content interest to form the user interest, and outputs the user's current interest.

User interest is divided into two parts, behavioral interest and content interest, which are extracted by the behavioral information analysis submodule and the user content interest analysis submodule respectively and finally integrated by the ensemble learning submodule.

User usage behavior analysis submodule: the user's current behavioral features are classified with a decision tree algorithm to obtain the user's current behavioral interest.

User content interest extraction submodule: text analysis is performed on the web pages of the user's current interest category to obtain web page text attribute information, from which the user's current content interest is obtained. The steps are:

(1) obtain the corresponding keywords and their indexes;

(2) calculate the user's attention degree for each keyword;

(3) obtain the user's current content interest according to an attention degree threshold.

The keyword acquisition process comprises:

1. Segment the full text into words (so that Chinese words are separated by spaces as in English, which is convenient for processing);

2. Filter out stop words (words that carry little semantic meaning, such as function words and some high-frequency words; because stop words appear in a great many documents, they contribute nothing to information analysis);

3. Extract the text title and store the title word set in vector V_h;

4. Extract the first paragraph, second paragraph, and last paragraph of the body text and store the content word set in vector V_c;

5. If |V_h ∩ V_c| < P, the text title is judged to be an "abstract type" title, where P is a given threshold, set to 3 based on experiment;

6. If ∃x ∈ {query dictionary}, the text title is also judged to be an "abstract type" title (x denotes any word taken from the title word set V_h);

7. If the title has neither the feature in step 5 nor that in step 6, it is judged to be a "concrete type" title;
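Steps 5 to 7 can be sketched as a small decision function. The function name and the example query dictionary are hypothetical; the default threshold p=3 follows the value stated in the text, and the inputs are assumed to be already segmented and stop-word filtered.

```python
def title_type(title_words, content_words, query_dictionary, p=3):
    """Classify a title as 'abstract' or 'concrete'.

    title_words      -- words of the title (V_h)
    content_words    -- words of the first, second and last paragraphs (V_c)
    query_dictionary -- set of question words (contents are an assumption)
    p                -- overlap threshold P
    """
    v_h, v_c = set(title_words), set(content_words)
    if len(v_h & v_c) < p:
        return "abstract"   # step 5: too little overlap with the body
    if v_h & set(query_dictionary):
        return "abstract"   # step 6: title contains a query word
    return "concrete"       # step 7


QUERY_WORDS = {"why", "how", "what"}  # hypothetical dictionary
body = ["city", "hotel", "prices", "rise", "summer"]
```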

For an "abstract type" title, the TF-IDF method is used to find words in the body text whose weight exceeds a certain threshold as candidate words; whether a candidate is a keyword is then judged from its position (the higher the weight of the sentence it appears in, the more likely it is to be a keyword).

For a "concrete type" title, the title is segmented and the resulting nouns and verbs are taken as the keywords of the text. When sentence weights are calculated, words in the title word list are given a larger weight scale factor.

By the above method, the weight of each sentence can be obtained, providing a basis for subsequent processing, and the weights of the keyword list are updated. The keyword linked list of each article holds that article's keywords sorted by weight.

Attention degree calculation: by analyzing the content information and the browsing behavior information of each of the user's browsing sessions, the user's degree of attention to each interest topic can be calculated quantitatively. The calculation steps are:

1. add the keywords of all topic vectors under the same category A to the keyword list K of that category;

2. merge the duplicate keywords that occur while the keywords of the same category are being added; each duplicate keyword triggers the clustering of candidate similar topics, and all web pages under that word are gathered together to form a candidate similar-topic group;

3. for the candidate similar-topic group of each duplicate keyword, compare the original weights of that word in the group's topic vectors, and take the topic vector with the largest weight as the core topic representing the group (and add it to K);

4. calculate the similarity between the core topic and each topic vector in its candidate similar-topic group, set a threshold, and add every topic vector that exceeds the threshold to the topic group Ki, forming the similar-topic group Ki, that is, a topic Ki;

5. take the core topic found above as the representative of topic Ki, sum the frequencies of the topics of all topic vectors in Ki to obtain the adjusted popularity of the core topic, and add the adjusted core topic to the candidate hot topic list;

6. calculate the attention degree of each topic in K according to the popularity measurement method described above.
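Steps 1 to 5 can be illustrated with a minimal sketch. It is not the implementation of the invention: the similarity measure (cosine similarity), the threshold value 0.5, and the data layout (`vector`, `freq`) are assumptions introduced only for illustration, since the text does not fix them.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse keyword->weight vectors (assumed measure)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def hot_topics(topics, threshold=0.5):
    """topics: list of {"vector": {keyword: weight}, "freq": int} for one category."""
    # Steps 1-2: collect keywords; a keyword shared by several topic vectors
    # triggers a candidate similar-topic group.
    by_keyword = defaultdict(list)
    for idx, topic in enumerate(topics):
        for kw in topic["vector"]:
            by_keyword[kw].append(idx)
    candidates = []
    for kw, members in by_keyword.items():
        if len(members) < 2:
            continue  # not a duplicate keyword
        # Step 3: the core topic is the vector where this keyword weighs most.
        core = max(members, key=lambda i: topics[i]["vector"][kw])
        # Step 4: keep the vectors whose similarity to the core exceeds the threshold.
        group = [i for i in members
                 if cosine(topics[core]["vector"], topics[i]["vector"]) > threshold]
        # Step 5: adjusted popularity = summed frequency of the grouped topics.
        heat = sum(topics[i]["freq"] for i in group)
        candidates.append({"keyword": kw, "core": core, "group": group, "heat": heat})
    return sorted(candidates, key=lambda c: c["heat"], reverse=True)

sample = [
    {"vector": {"football": 0.9, "league": 0.4}, "freq": 5},
    {"vector": {"football": 0.6, "league": 0.5}, "freq": 3},
    {"vector": {"football": 0.1, "recipe": 0.9}, "freq": 2},
]
ranked = hot_topics(sample)
```

In this sample the third vector shares the word "football" but is dissimilar to the core topic, so it is left out of the group and does not contribute to the popularity.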

The ensemble learning submodule trains different classifiers, namely decision tree weak classifiers, on the same training set, and then combines these weak classifiers into a stronger final classifier that produces the final classification of user interest. The AdaBoost algorithm is used to iteratively adjust the results of the user behavior classifier and the user content interest classifier, obtaining the weights of the different decision tree weak classifiers and, in turn, the user's current interest.
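The weighting of weak classifiers can be sketched with a generic two-class AdaBoost over decision stumps (decision trees of depth one). This is a textbook formulation given under stated assumptions, not the submodule itself: the features (dwell time in seconds, visit count), the labels (+1 interested, -1 not interested) and all function names are hypothetical.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """A depth-one decision tree: one feature compared against one threshold."""
    return polarity if row[feature] <= threshold else -polarity

def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(row, f, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                      # uniform sample weights to start
    ensemble = []
    for _ in range(rounds):
        err, f, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, f, thr, pol))
        # Raise the weight of misclassified samples, lower the others, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, f, thr, pol))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(a * stump_predict(row, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical training data: [dwell time in seconds, number of visits].
X = [[5, 1], [8, 1], [40, 3], [60, 5], [90, 2], [12, 1]]
y = [-1, -1, 1, 1, 1, -1]
model = adaboost(X, y)
```

The `alpha` values play the role of the weights of the different decision tree weak classifiers mentioned above; the final decision is the sign of their weighted vote.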

User interest preference comprises the interest item, interest category, attention degree and generation time. In a specific implementation, the user's interest preference can be expressed as a tree: the upper level of the tree represents the type of interest preference, and the lower level represents interest subclasses or topics. The tree structure can store both the confidence of the user's interest types and the information of the user's interest feature words. Fig. 7 shows an exemplary tree structure of the user interest preference of the present invention.

Data service push module: using the rule association strategy, it determines whether the user's interest and preference are suitable for local service. If the conditions for local service are satisfied, the location analysis module is triggered to obtain the current browsing location; otherwise, a general relevance-based information push service is performed.

The conditions for local service may be:

(1) the category of the website the user is currently browsing, such as catering, shopping, accommodation or transportation service websites, or city-edition value-added service providers;

(2) the category of the user's current interest, such as weather, traffic inquiry, ticket booking, discounts, classic tourist attractions, local specialty products, etc.

The above conditions can be combined. For example, if the website the user is currently browsing is the city edition of a housing search website and the interest reflected by the browsed pages is renting a home, a localized service recommendation is appropriate.
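The decision between local service and general push can be sketched as a simple rule check. The category labels, the function names and the `locate` callback (standing in for the location analysis module) are hypothetical; the text names the kinds of websites and interests but does not define identifiers for them.

```python
# Hypothetical category labels derived from the examples in the text.
LOCAL_SITE_CATEGORIES = {
    "catering", "shopping", "accommodation", "transportation",
    "city_value_added_service", "city_housing_search",
}
LOCAL_INTEREST_CATEGORIES = {
    "weather", "traffic_inquiry", "ticket_booking", "discounts",
    "tourist_attractions", "local_specialties", "renting",
}

def suits_local_service(site_category, interest_category, require_both=False):
    """Condition (1) tests the website category, condition (2) the interest category."""
    site_ok = site_category in LOCAL_SITE_CATEGORIES
    interest_ok = interest_category in LOCAL_INTEREST_CATEGORIES
    if require_both:
        return site_ok and interest_ok   # combined condition
    return site_ok or interest_ok        # either condition alone

def choose_push(site_category, interest_category, locate):
    """Trigger location analysis only when the local-service condition holds."""
    if suits_local_service(site_category, interest_category, require_both=True):
        return ("local", locate())
    return ("general", None)

decision = choose_push("city_housing_search", "renting", lambda: "Shanghai")
```

Whether the two conditions are combined with a logical AND or applied separately is a design choice left open by the text; the `require_both` flag makes that choice explicit.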

The location analysis module obtains the current browsing location, that is, the geographic location of the user when browsing the current web page, through the GMLC gateway. Fig. 9 is the operational flowchart of the location analysis module of the present invention.

Before the location analysis module sends the location information to the service push module, the method further comprises a step in which the location analysis module, based on the obtained browsing location information, customizes the URL or URL page content associated with the current location of the mobile terminal user. Fig. 10 is the flowchart of location information association of the present invention.

Location association information base: records the service information provided by, or the place attribute information of, places that are geographically identical or close, such as:

Location search matching: the process of matching the user interest preference, the user location information and the corresponding location association information, which specifically comprises:

(1) using the user's current location information as the query keyword, perform a location association query and obtain the location information records consistent with the input keyword;

(2) match the category of the user's current interest preference against the service information provided in the location association information and calculate the matching degree; if the matching degree exceeds a certain threshold, output this location association information;

1. if there are many matching results, match the topic of the user's current interest preference against the service information provided in the location association information and calculate the matching degree;

2. sort by matching degree;

3. output the location information whose matching degree exceeds the threshold.

(3) otherwise, use the core location in the user location information as the query keyword, perform a location association query, obtain the location information records consistent with the input keyword, and go to (2).

The above steps perform location analysis and location association for places identical or close to the user's current location.
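The matching flow above can be sketched as follows. The text does not define how the matching degree is computed, so the Jaccard overlap between term sets used here is an assumption, as are the threshold 0.3, the `many` cut-off, the record fields (`location`, `services`) and the sample data.

```python
def jaccard(a, b):
    """Assumed matching degree: overlap of two term sets, between 0 and 1."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_location(records, location, core_location, category_terms, topic_terms,
                   threshold=0.3, many=3):
    # Step (1): query the location association records by the current location.
    hits = [r for r in records if r["location"] == location]
    # Step (3): if nothing is found, retry with the core location.
    if not hits:
        hits = [r for r in records if r["location"] == core_location]
    # Step (2): category-level matching against the services on offer.
    matched = [r for r in hits if jaccard(category_terms, r["services"]) > threshold]
    # Sub-steps 1-3: with many results, refine by topic, sort, apply the threshold.
    if len(matched) > many:
        scored = [(jaccard(topic_terms, r["services"]), r) for r in matched]
        scored.sort(key=lambda s: s[0], reverse=True)
        matched = [r for score, r in scored if score > threshold]
    return matched

records = [
    {"name": "A", "location": "Xuhui", "services": ["catering", "noodles"]},
    {"name": "B", "location": "Xuhui", "services": ["cinema"]},
]
result = match_location(records, "Hengshan Road, Xuhui", "Xuhui",
                        ["catering"], ["noodles"])
```

An empty result corresponds to the case in which every matching degree falls below the threshold.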

If the matching degrees of the above information are all below the preset threshold, there is no place or service at the user's current location that suits the interest preference. A suitable place or service therefore needs to be found according to the user's interest and preference.

Target location analysis: the target location is the information, including an address or a scene, that matches the user's interest and preference. The process comprises:

(1) using the topic of the user's current interest preference as the query keyword, perform a location association query, obtain the location information records consistent with the input keyword, and output this location association information;

(2) if there is no consistent location information record, calculate the matching degree between the topic of the user's current interest preference and the service information provided in the location association information;

1. sort by matching degree;

2. output the location information whose matching degree exceeds the threshold.

(3) pass the output location information to the route recommendation unit.
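A minimal sketch of the target location analysis is given below, again under assumptions: a record is treated as "consistent" with the query when all topic terms appear among its services, the matching degree is a set overlap, and the record fields and sample data are hypothetical.

```python
def overlap(a, b):
    """Assumed matching degree between topic terms and offered services."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_target_locations(records, topic_terms, threshold=0.3):
    # Step (1): records consistent with the topic keyword are output directly.
    exact = [r for r in records if set(topic_terms) <= set(r["services"])]
    if exact:
        return exact
    # Step (2): otherwise score every record, sort, and keep those above the threshold.
    scored = sorted(((overlap(topic_terms, r["services"]), r) for r in records),
                    key=lambda s: s[0], reverse=True)
    return [r for score, r in scored if score > threshold]

places = [
    {"name": "Noodle House", "services": ["noodles", "catering"]},
    {"name": "Cinema", "services": ["film"]},
]
targets = find_target_locations(places, ["noodles"])
```

The returned records are what step (3) would hand to the route recommendation unit.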

The route recommendation unit comprises:

(1) a recommended route generation unit, used to calculate and select route data;

(2) a route data output, which generates the recommended route for moving from the departure point to the destination;

(3) a display unit, used to show the display information.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced, and that such modifications or equivalent replacements do not cause the modified technical solution to depart from the spirit and scope of the technical solution of the present invention.

Claims

1. A method for determining related web page text of interest to a user according to browsed web page URLs, characterized by comprising the steps of:

filtering the web pages browsed by the user within a certain period, excluding useless pages and web pages that cannot be accessed, following the URL addresses remaining after screening to obtain the text content of the pages, and extracting the title and body text information;

determining a category for each web page document in the web page document set according to predefined topic categories;

performing access frequency statistics on each category; and

taking the set of web pages with the highest access frequency value as the related web pages of interest to the user.

2. The method for determining related web page text of interest to a user according to browsed web page URLs as claimed in claim 1, characterized in that: the web page classification step requires constructing and training a web page classifier, wherein a training text set is input and, through text representation and feature selection, a classifier model is constructed according to the feature dictionary, the output being a set of classification rules resembling a tree structure;

the training process of the web page classifier consists of repeatedly partitioning the training samples: a classification prediction model of the target variable with respect to each input variable is established, the data are grouped in turn under the different values of the input variables and the target variable, and the model is then used to classify and predict new data objects.

3. The method for determining related web page text of interest to a user according to browsed web page URLs as claimed in claim 2, characterized in that: the web page classifier uses a decision tree classification method, the steps of which are:

⑴ express the test sample in the same form as the training samples;

⑵ t ← root node of the decision tree;

⑶ compare the value of the corresponding feature of the sample under test with the test attribute and threshold of decision tree node t and, according to the splitting criterion of node t, set t ← left child of t or t ← right child of t;

⑷ repeat ⑶ recursively until t is a leaf node.

4. The method for determining related web page text of interest to a user according to browsed web page URLs as claimed in claim 2, characterized in that: in the web page classification step, the text to be classified, having been processed by the text preprocessing module, is input; text representation is performed; feature selection is carried out according to the feature dictionary; text classification is performed using the classification rules of the trained classifier model; and the output is the category information of each text.

5. The method for determining related web page text of interest to a user according to browsed web page URLs as claimed in claim 2 or 4, characterized in that: in the text representation step, a feature vector space is used to represent text features, and document i can be expressed as the feature vector given by:

W_{ij} = (W_{i1}, W_{i2}, \cdots, W_{im})

W_{ij} = f_{ij}.

6. The method for determining related web page text of interest to a user according to browsed web page URLs as claimed in claim 2 or 4, characterized in that: in the feature selection step, a feature dimension reduction method based on the improved χ² statistic and pattern aggregation is used, the steps being:

⑴ according to the formula

\chi_{ij}^{\prime 2} = \operatorname{sign}(n_{11} \times n_{22} - n_{12} \times n_{21}) \frac{n \times {(n_{11} \times n_{22} - n_{12} \times n_{21})}^{2}}{(n_{11} + n_{12}) \times (n_{21} + n_{22}) \times (n_{11} + n_{21}) \times (n_{12} + n_{22})}

calculate the improved χ² statistic of each term with respect to each class;

⑵ according to the formulas

\mathrm{CHI}_{i} = \max \{ | \chi_{i1}^{\prime 2} |, | \chi_{i2}^{\prime 2} |, \cdots, | \chi_{is}^{\prime 2} | \}

A_{ij} = \chi_{ij}^{\prime 2} / (\max - \min)

d(i, j) = \sqrt{{(A_{i1} - A_{j1})}^{2} + {(A_{i2} - A_{j2})}^{2} + \cdots + {(A_{is} - A_{js})}^{2}}

repeat steps 1. and 2. until no more patterns can be aggregated.

7. A system for determining related web page text of interest to a user according to browsed web page URLs, characterized by comprising a web page text acquisition submodule, a web page text classification submodule, an access frequency statistics submodule and a user current content interest determination submodule, wherein:

the web page text acquisition submodule filters the web pages browsed by the user within a certain period, excludes useless pages and web pages that cannot be accessed, follows the URL addresses remaining after screening to obtain the text content of the pages, and extracts the title and body text information;

the web page text classification submodule determines a category for each web page document in the web page document set according to predefined topic categories;

the access frequency statistics submodule performs access frequency statistics on each category; and

the user current content interest determination submodule takes the set of web pages with the highest access frequency value as the related web pages of interest to the user.

8. The system for determining related web page text of interest to a user according to browsed web page URLs as claimed in claim 7, characterized in that: a web page classifier needs to be constructed and trained in the web page text classification submodule, wherein a training text set is input and, through text representation and feature selection, a classifier model is constructed according to the feature dictionary, the output being a set of classification rules resembling a tree structure.

9. The system for determining related web page text of interest to a user according to browsed web page URLs as claimed in claim 7 or 8, characterized in that: the web page text classification submodule takes as input the text to be classified, which has been processed by the text preprocessing module; performs text representation; carries out feature selection according to the feature dictionary; performs text classification using the classification rules of the trained classifier model; and outputs the category information of each text.