CN103678652B

CN103678652B - Information individualized recommendation method based on Web log data

Info

Publication number: CN103678652B
Application number: CN201310717507.4A
Authority: CN
Inventors: 袁东风; 马翠云
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2013-12-23
Filing date: 2013-12-23
Publication date: 2017-02-01
Anticipated expiration: 2033-12-23
Also published as: CN103678652A

Abstract

The invention discloses a personalized information recommendation method based on Web log data and belongs to the technical field of electronic information. The method applies to an information model consisting of a server, a broadband network, and multimedia thin clients. The method includes the following steps: users access Internet resources through the multimedia thin client, and the server records their behavior in server log files; the Web log file data on the server are analyzed and preprocessed to extract a clean, regular, and accurate data source; a user interest matrix is built using collaborative filtering, the similarity between users is calculated, and users with high similarity are selected as similar users; a recommendation resource pool is built according to the interests of the similar users; and the server selects the pages in the recommendation resource pool whose recommendation value exceeds the threshold and recommends them to the users. The advantages of the method are that preprocessing the data in the Web log files yields a clean and regular data source, and that drawing on the interests of similar users provides accurate, personalized information recommendation.

Description

A personalized information recommendation method based on Web log data

Technical field

The present invention relates to a personalized information recommendation method based on Web log data and belongs to the field of electronic information technology.

Background technology

With the rapid development of the Internet, a massive number of web pages are updated or published every day. It has become increasingly difficult for users to find satisfactory information in this volume of material, which leads to the contradictory phenomena of "information overload" and "information hunger". Personalized information service, an intelligent form of information service, has been proposed to solve this problem. It actively searches for relevant information according to the user's information needs and customization, and uses online intelligent recommendation or push technology to deliver the required information accurately to the corresponding user. Among personalized service techniques, collaborative filtering has been applied most successfully. In this method, users cooperate with other users according to their own needs to form certain cooperation rules, or the preferences of multiple information users are used to predict the interest of a single user; information is then evaluated according to users with the same interests, which yields the recommendation results. Because Web logs record a large amount of user behavior information, they can provide important data support for personalized services. However, raw log records are messy, incomplete, and unstructured, so they must be preprocessed efficiently. In addition, with respect to measuring user interest, existing methods that recommend by extracting user access patterns from access log files do not take into account the temporal characteristics of page visits, even though a user's level of interest in a page can be measured by how long the user stays on that page. For example, patent No. 103338223A filed by Tsinghua University, entitled "A recommendation method for mobile applications, client, and server", belongs to this category. To address this problem, a personalized information recommendation method based on Web log data is proposed. First, the data in the log files are analyzed and preprocessed to ensure that a clean, regular, and accurate data source is extracted. Second, the temporal characteristics of page visits are taken into account and combined with the interests of similar users, so as to provide users with more accurate, personalized information recommendation.

Content of the invention

In view of the defects and deficiencies of the existing technology, the present invention proposes a personalized information recommendation method based on Web log data. It aims to solve two problems of traditional information recommendation methods based on Web log data: the extracted data source is not sufficiently clean and regular, and the measurement of user interest is inadequate. The method can provide users with more accurate, personalized information recommendation.

The technical scheme of the invention is as follows:

A personalized information recommendation method based on Web log data, comprising the following steps:

A. The user accesses network resources through a multimedia thin client, and the server records this behavior in the server log file;

B. The data of the Web log file on the server are analyzed and preprocessed. User access records with few visits that are not representative, together with intermediate transit pages (collectively called junk data), are excluded, and the original semi-structured Web log data, which is not easy for people to read, is converted into structured data; for example, only the user IP, access time, URL of the accessed page, and number of bytes accessed are extracted into a data table as a regular, accurate data source;

Corresponding fields are created in the data table according to the content of the Web log file, and the text data are then imported into the data table;

The data in the data table are cleaned. Meaningless data in the user access information are deleted, including access records whose file suffix is bmp, jpg, jpeg, php, or jsp and log records whose status code is not 200, which indicate unsuccessful access; only log records with the suffix html, htm, or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote image file formats with slightly lossy compression; php denotes the Hypertext Preprocessor, a scripting language executed on the server and embedded in HTML documents; jsp denotes a script embedded in web pages; and html, htm, and xml are web page files;

By default, a status code in the Web log file beginning with 2 indicates that the request succeeded, one beginning with 3 indicates that the user's request was redirected to another location, one beginning with 4 indicates a client error, and one beginning with 5 indicates a server error;

Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;

Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if the threshold is exceeded, a new session is considered to have started;

Meaningful accessed pages and access paths are found in the user sessions, and the link pages that the user had to pass through to reach the target page, i.e. the intermediate transit pages, are deleted from the session;

C. A user interest matrix is built using collaborative filtering, the similarity between users is calculated, and users with high similarity are selected as similar users;

The user-page matrix is denoted r(m × n), where the entry r_{m,n} is the time user m spends browsing page n. The user-page matrix r(m × n) is converted into a user-resource-category matrix c(m × x), where the entry c_{m,x} is the time user m spends browsing a given resource category x. Weighted filtering data prediction is applied to the matrix c(m × x) to obtain standardized resources, thereby forming the user interest matrix;

Users are clustered using the k-means clustering algorithm, and cosine similarity is chosen to evaluate the similarity between users;

D. A recommendation resource pool is built according to the interests of the similar users;

The interest degree u_{i,j} of user i in page j can be expressed as the product of two ratios: the ratio of the total time user i stays on page j to the total time user i spends browsing all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, namely:

u_{i,j} = \frac{\sum \mathrm{time}_{i,j}}{\sum_{k=1}^{m} \mathrm{time}_{i,k}} \times \frac{\mathrm{size}_{i,j}}{\sum_{k=1}^{m} \mathrm{size}_{i,k}},

where the numerator \sum time_{i,j} is the total time user i stays on page j, the denominator \sum_{k=1}^{m} time_{i,k} is the total time user i spends browsing all pages, size_{i,j} is the byte count of page j, \sum_{k=1}^{m} size_{i,k} is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;

E. A recommendation threshold is set on the server through a threshold setting unit, and the server selects the pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.
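
The cleaning and session identification of step B could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout (dictionaries with `ip`, `time`, `url`, `status`, `bytes`), the function names `clean` and `sessionize`, and the 30-minute value are all hypothetical, since the text only says that a time threshold is set. The sketch also reads the threshold as the gap between consecutive requests from one IP, which is one possible interpretation.

```python
from datetime import datetime, timedelta

KEPT_SUFFIXES = (".html", ".htm", ".xml")  # page types retained in step B


def clean(records):
    """Keep only successful (status 200) requests for html/htm/xml pages."""
    return [
        r for r in records
        if r["status"] == 200 and r["url"].lower().endswith(KEPT_SUFFIXES)
    ]


def sessionize(records, gap=timedelta(minutes=30)):
    """Group each IP's requests into sessions.

    Assumption: a pause longer than `gap` between two consecutive
    requests from the same IP starts a new session.
    """
    by_ip = {}
    for r in sorted(records, key=lambda r: (r["ip"], r["time"])):
        sessions = by_ip.setdefault(r["ip"], [])
        if sessions and r["time"] - sessions[-1][-1]["time"] <= gap:
            sessions[-1].append(r)   # continue the current session
        else:
            sessions.append([r])     # threshold exceeded: new session
    return by_ip


# Hypothetical sample records for one user.
t0 = datetime(2013, 12, 23, 9, 0, 0)
raw = [
    {"ip": "10.0.0.1", "time": t0, "url": "/index.html", "status": 200, "bytes": 5120},
    {"ip": "10.0.0.1", "time": t0 + timedelta(seconds=5), "url": "/logo.jpg", "status": 200, "bytes": 20480},
    {"ip": "10.0.0.1", "time": t0 + timedelta(minutes=2), "url": "/news.htm", "status": 404, "bytes": 0},
    {"ip": "10.0.0.1", "time": t0 + timedelta(minutes=3), "url": "/news.htm", "status": 200, "bytes": 2048},
    {"ip": "10.0.0.1", "time": t0 + timedelta(hours=2), "url": "/feed.xml", "status": 200, "bytes": 1024},
]
cleaned = clean(raw)
sessions = sessionize(cleaned)
```

In this sample the image request and the 404 request are dropped, and the request two hours later opens a second session for the same IP. Removing intermediate transit pages is not shown, because the text does not specify how such pages are detected.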

The term URL stands for Uniform Resource Locator. It is a concise representation of the location of a resource available on the Internet and of the method for accessing it, and it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which contains information indicating the location of the file and how the browser should handle it.

The weighted filtering data prediction computes predicted scores by jointly processing the weighted average scores of the rows and columns of the matrix. This gives each user a score for every accessed page and thereby alleviates the sparsity problem. It also evens out differences in how different users rate accessed pages, normalizing the evaluation results to obtain standardized resources.
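
The text does not give an exact formula for this prediction step, so the following Python sketch is only one plausible reading, stated as an assumption: each missing (zero) entry is filled with a weighted blend of its row mean and its column mean, and each row is then scaled to sum to 1. The function name `fill_and_normalize` and the blend weight `alpha` are hypothetical.

```python
def fill_and_normalize(c, alpha=0.5):
    """Fill zero entries of a user-by-category matrix, then normalize rows.

    Assumption: a missing entry is predicted as
    alpha * (mean of the user's observed values)
    + (1 - alpha) * (mean of the category's observed values).
    """
    rows, cols = len(c), len(c[0])

    def mean_of_seen(values):
        seen = [v for v in values if v > 0]
        return sum(seen) / len(seen) if seen else 0.0

    row_mean = [mean_of_seen(row) for row in c]
    col_mean = [mean_of_seen([c[i][j] for i in range(rows)]) for j in range(cols)]

    filled = [
        [
            c[i][j] if c[i][j] > 0
            else alpha * row_mean[i] + (1 - alpha) * col_mean[j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]

    # Scale each row to sum to 1 so users with different total
    # browsing times become comparable.
    normalized = []
    for row in filled:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Hypothetical browsing times (seconds): user 0 never visited category 1.
interest = fill_and_normalize([[10, 0], [20, 40]])
```

With these inputs the missing entry is predicted as 0.5 × 10 + 0.5 × 40 = 25 before normalization, so no user is left with an empty score.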

The k-means clustering algorithm is as follows: k cluster centers are initially given at random, and each sample point to be classified is assigned to a cluster according to the nearest-distance principle. The centroid of each cluster is then recomputed by averaging, which determines the new cluster centers. This is iterated until the displacement of the cluster centers is less than a specified value.
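
The procedure just described can be sketched in plain Python as below. This is an illustrative sketch, not the authors' code: the function name `kmeans`, the parameters `tol`, `init`, `seed`, and `max_iter`, and the use of Euclidean distance for the nearest-distance rule are assumptions. The optional `init` argument exists only to make the example reproducible; by default the centers are picked at random as the text states.

```python
import math
import random


def kmeans(points, k, tol=1e-6, init=None, seed=0, max_iter=100):
    """Cluster `points` (tuples of numbers) into k groups."""
    if init is not None:
        centers = list(init)
    else:
        centers = random.Random(seed).sample(points, k)  # random initial centers

    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # empty cluster keeps its center
        # Stop once no center moved by more than the tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


# Hypothetical 2-D interest vectors forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, 2, init=[(0, 0), (10, 10)])
```

Here the first three points end up in one cluster and the last three in the other. `math.dist` requires Python 3.8 or later.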

Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. A similarity measure calculates the degree of similarity between individuals; in contrast to a distance measure, a smaller value of the similarity measure indicates less similarity and greater difference between individuals. Compared with distance measures, cosine similarity emphasizes the difference in direction between two vectors rather than the difference in distance or length.
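
As a small illustration of this measure, the Python sketch below computes the cosine of the angle between two interest vectors and ranks candidate users by it. The function names `cosine_similarity` and `most_similar`, the user labels, and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(target, others, top=2):
    """Return the names of the `top` users whose vectors are closest in direction."""
    ranked = sorted(
        others,
        key=lambda name: cosine_similarity(target, others[name]),
        reverse=True,
    )
    return ranked[:top]


# Hypothetical interest vectors over three resource categories.
others = {"u2": [6, 2, 0], "u3": [0, 1, 5], "u4": [2, 2, 0]}
similar = most_similar([3, 1, 0], others, top=2)
```

User `u2` browses twice as long as the target but in the same proportions, so its similarity is 1; this shows the emphasis on direction rather than length.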

The beneficial effects of the invention are as follows: by preprocessing the data in the Web log files through data conversion, data cleaning, user identification, session identification, and other steps, a cleaner, more accurate, and more regular data source is obtained; and, in measuring user interest, the temporal characteristics of page visits are taken into account and combined with the interests of similar users, providing users with more accurate, personalized information recommendation.

Brief description

Specific embodiment

The invention is further described below with reference to an embodiment, but is not limited thereto.

Embodiment:

D. A recommendation resource pool is built according to the interests of the similar users;

u_{i,j} = \frac{\sum \mathrm{time}_{i,j}}{\sum_{k=1}^{m} \mathrm{time}_{i,k}} \times \frac{\mathrm{size}_{i,j}}{\sum_{k=1}^{m} \mathrm{size}_{i,k}},
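
A worked numeric illustration of the interest-degree formula above, followed by the threshold selection of step E, might look like the Python sketch below. It is a hedged sketch: the function names `interest_degrees` and `recommend`, the visit tuples, and the threshold value are hypothetical. Two further assumptions are made: page sizes are summed once per distinct page, and the recommendation value of a page in the pool is taken to be its interest degree, which the text does not state explicitly.

```python
def interest_degrees(visits):
    """Compute u_{i,j} for one user.

    visits: list of (page, dwell_seconds, size_bytes) tuples.
    """
    time_on = {}
    size_of = {}
    for page, seconds, size in visits:
        time_on[page] = time_on.get(page, 0.0) + seconds  # total stay on the page
        size_of[page] = size                               # byte count of the page

    total_time = sum(time_on.values())
    total_size = sum(size_of.values())  # assumption: each distinct page counted once
    if total_time == 0 or total_size == 0:
        return {page: 0.0 for page in time_on}

    return {
        page: (time_on[page] / total_time) * (size_of[page] / total_size)
        for page in time_on
    }


def recommend(pool, threshold):
    """Select pages whose value is strictly greater than the threshold."""
    return sorted(page for page, value in pool.items() if value > threshold)


# Hypothetical visits of one similar user: a.html is visited twice.
visits = [("a.html", 30, 2000), ("b.html", 60, 6000), ("a.html", 30, 2000)]
u = interest_degrees(visits)
picked = recommend(u, threshold=0.2)
```

In this sample both pages receive half of the browsing time, but `b.html` accounts for three quarters of the bytes, so its interest degree is 0.5 × 0.75 = 0.375 against 0.5 × 0.25 = 0.125 for `a.html`, and only `b.html` passes a threshold of 0.2.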

Claims

1. A personalized information recommendation method based on Web log data, comprising the following steps:

A. The user accesses network resources through a multimedia thin client, and the server records this behavior in the server log file;

B. The data of the Web log file on the server are analyzed and preprocessed; user access records with few visits that are not representative, together with intermediate transit pages (collectively called junk data), are excluded, and the original semi-structured Web log data, which is not easy for people to read, is converted into structured data;

Corresponding fields are created in the data table according to the content of the Web log file, and the text data are then imported into the data table;

The data in the data table are cleaned. Meaningless data in the user access information are deleted, including access records whose file suffix is bmp, jpg, jpeg, php, or jsp and log records whose status code is not 200, which indicate unsuccessful access; only log records with the suffix html, htm, or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote image file formats with slightly lossy compression; php denotes the Hypertext Preprocessor, a scripting language executed on the server and embedded in HTML documents; jsp denotes a script embedded in web pages; and html, htm, and xml are web page files;

By default, a status code in the Web log file beginning with 2 indicates that the request succeeded, one beginning with 3 indicates that the user's request was redirected to another location, one beginning with 4 indicates a client error, and one beginning with 5 indicates a server error;

Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;

Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if the threshold is exceeded, a new session is considered to have started;

Meaningful accessed pages and access paths are found in the user sessions, and the link pages that the user had to pass through to reach the target page, i.e. the intermediate transit pages, are deleted from the session;

C. A user interest matrix is built using collaborative filtering, the similarity between users is calculated, and several users with high similarity are selected as similar users;

The user-page matrix is denoted r(m × n), where the entry r_{m,n} is the time user m spends browsing page n. The user-page matrix r(m × n) is converted into a user-resource-category matrix c(m × x), where the entry c_{m,x} is the time user m spends browsing a given resource category x. Weighted filtering data prediction is applied to the matrix c(m × x) to obtain standardized resources, thereby forming the user interest matrix;

D. A recommendation resource pool is built according to the interests of the similar users;

The interest degree u_{i,j} of user i in page j can be expressed as the product of two ratios: the ratio of the total time user i stays on page j to the total time user i spends browsing all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, namely:

u_{i,j} = \frac{\sum \mathrm{time}_{i,j}}{\sum_{k=1}^{m} \mathrm{time}_{i,k}} \times \frac{\mathrm{size}_{i,j}}{\sum_{k=1}^{m} \mathrm{size}_{i,k}},

where the numerator \sum time_{i,j} is the total time user i stays on page j, the denominator \sum_{k=1}^{m} time_{i,k} is the total time user i spends browsing all pages, size_{i,j} is the byte count of page j, \sum_{k=1}^{m} size_{i,k} is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;

E. A recommendation threshold is set on the server through a threshold setting unit, and the server selects the pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.