CN103678652B - Information individualized recommendation method based on Web log data - Google Patents


Publication number
CN103678652B
Authority
CN
China
Prior art keywords
user
data
page
users
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310717507.4A
Other languages
Chinese (zh)
Other versions
CN103678652A (en)
Inventor
袁东风
马翠云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN201310717507.4A
Publication of CN103678652A
Application granted
Publication of CN103678652B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/337 Profile generation, learning or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a personalized information recommendation method based on Web log data and belongs to the technical field of electronic information. The method is intended for an information service model consisting of a server, a broadband network and multimedia thin clients. The method comprises the following steps: users access Internet resources through a multimedia thin client, and the server records the users' behavior in server log files; a clean, regular and accurate data source is extracted by analyzing and preprocessing the data in the Web log files on the server; a user interest matrix is built using collaborative filtering, the similarity between users is calculated, and users with high similarity are selected as similar users; a recommendation resource pool is built according to the interests of the similar users; and the server selects the pages in the recommendation resource pool whose recommendation value is greater than a threshold and recommends them to the users. The advantages of the method are that preprocessing the data in the Web log files yields a clean and regular data source, and that combining the interests of similar users provides users with accurate, personalized information recommendation.

Description

A personalized information recommendation method based on Web log data
Technical field
The present invention relates to a personalized information recommendation method based on Web log data and belongs to the field of electronic information technology.
Background art
With the rapid development of the Internet, a massive number of web pages are updated or published every day. It has become increasingly difficult for users to find information that satisfies them within this volume, which gives rise to the contradictory phenomena of "information overload" and "information hunger". Personalized information service was proposed to solve this problem. It is an intelligent form of information service that actively searches for relevant information according to the user's information needs and customization, and uses online intelligent recommendation or push technology to deliver the needed information accurately to the corresponding user. Among personalized service technologies, collaborative filtering has been applied comparatively successfully. In this method, a user cooperates with other users according to his or her own needs to form certain cooperation rules, or the tendencies of many information users are used to predict the interests of a single user; information is then evaluated on the basis of users with the same interests to obtain recommendation results. Because Web logs record a large amount of user behavior information, they can provide important data support for personalized services. However, raw log records are messy, incomplete and unstructured, so they need efficient preprocessing. In addition, with regard to measuring user interest, existing methods that make recommendations by extracting user access patterns from access log files do not take into account the time characteristics of users' page visits, even though a user's level of interest in a page can be measured by how long the user stays on that page. For example, the patent application No. 103338223A filed by Tsinghua University, entitled "A mobile application recommendation method, client and server", belongs to this category. To address this problem, a personalized information recommendation method based on Web log data is proposed. First, the data in the log file is analyzed and preprocessed to ensure that a clean, regular and accurate data source is extracted. Second, the time characteristics of users' page visits are taken into account and combined with the interests of similar users, so as to provide users with more accurate, personalized information recommendation.
Summary of the invention
In view of the defects and deficiencies of the existing background art, the present invention proposes a personalized information recommendation method based on Web log data. It aims to solve two problems of conventional information recommendation methods based on Web log data: the extracted data source is not sufficiently clean and regular, and user interest is measured inadequately. The method can provide users with more accurate, personalized information recommendation.
The technical scheme of the invention is as follows:
A personalized information recommendation method based on Web log data, with the following steps:
A. Users access network resources through a multimedia thin client, and the server records this behavior in the server log file;
B. The data in the Web log file on the server is analyzed and preprocessed. Records of users with few visits, which are not representative, and transit pages, a class of data referred to as junk data, are excluded. The original semi-structured Web log data, which is hard for people to read, is converted into structured data; for example, only fields such as the user IP, access time, URL of the accessed page and number of bytes accessed are extracted into a data table, yielding a regular and accurate data source;
Fields corresponding to the content of the Web log file are created in a data table, and the text data is then imported into the data table;
The data in the data table is cleaned. Meaningless data in the user access information is deleted, namely access records whose file suffix is bmp, jpg, jpeg, php or jsp, and log records whose status code is not 200, which indicate unsuccessful access; only log records with the suffix html, htm or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote a lossy compressed image file format; php (Hypertext Preprocessor) is a scripting language that is executed on the server side and embedded in HTML documents; jsp denotes an embedded web page script; and html, htm and xml are web page files;
By default, a status code in the Web log file beginning with 2 indicates that the request succeeded, one beginning with 3 indicates that the user's request was redirected to another location, one beginning with 4 indicates a client error, and one beginning with 5 indicates a server error;
Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;
Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if the threshold is exceeded a new session is considered to have started;
Meaningful accessed pages and access paths are found in the user sessions; the link pages that a user has to pass through in order to reach a target page, i.e. transit pages, are deleted from the session;
C. A user interest matrix is built using collaborative filtering, the similarity between users is calculated, and users with high similarity are selected as similar users;
The user-page matrix is denoted r(m × n), where the entry r_{m,n} is the time user m spent browsing page n. The user-page matrix r(m × n) is converted into a user-resource category matrix c(m × x), where the entry c_{m,x} is the time user m spent browsing resource category x. Weighted filtering data prediction is applied to the matrix c(m × x) to obtain standardized resources, thereby forming the user interest matrix;
Users are clustered with the k-means clustering algorithm, and the similarity of users is evaluated with cosine similarity;
D. A recommendation resource pool is built from the interests of the similar users;
The interest degree $u_{i,j}$ of user i in page j can be expressed as the product of two ratios: the ratio of user i's total dwell time on page j to the sum of user i's browsing time over all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, that is:
$$u_{i,j} = \frac{\sum \mathrm{time}_{i,j}}{\sum_{k=1}^{m} \mathrm{time}_{i,k}} \times \frac{\mathrm{size}_{i,j}}{\sum_{k=1}^{m} \mathrm{size}_{i,k}},$$
where $\sum \mathrm{time}_{i,j}$ is the total time user i stays on page j, $\sum_{k=1}^{m} \mathrm{time}_{i,k}$ is the sum of user i's browsing time over all pages, $\mathrm{size}_{i,j}$ is the byte count of page j, $\sum_{k=1}^{m} \mathrm{size}_{i,k}$ is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;
E. A recommendation threshold is set on the server through a threshold setting unit, and the server selects the pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.
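The cleaning rules of step B can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the record layout (`ip`, `time`, `url`, `status`, `bytes`) and the helper name `keep_record` are hypothetical and do not come from the patent, which does not specify a log format.

```python
import os
from urllib.parse import urlparse

# Suffixes the method retains; everything else (bmp, jpg, jpeg, php, jsp, ...) is dropped.
KEPT_SUFFIXES = {".html", ".htm", ".xml"}

def keep_record(record):
    """Return True if a parsed log record survives the cleaning step.

    `record` is a hypothetical dict with at least 'url' and 'status' keys.
    """
    path = urlparse(record["url"]).path           # strip any query string
    suffix = os.path.splitext(path)[1].lower()    # e.g. ".html"
    return record["status"] == 200 and suffix in KEPT_SUFFIXES

# Made-up sample records, for illustration only.
records = [
    {"ip": "10.0.0.1", "time": 0, "url": "/news/a.html", "status": 200, "bytes": 5120},
    {"ip": "10.0.0.1", "time": 5, "url": "/img/logo.jpg", "status": 200, "bytes": 20480},
    {"ip": "10.0.0.2", "time": 9, "url": "/old.htm", "status": 404, "bytes": 312},
    {"ip": "10.0.0.2", "time": 12, "url": "/feed.xml?x=1", "status": 200, "bytes": 900},
]
cleaned = [r for r in records if keep_record(r)]
```

Only the `.html` and `.xml` records with status 200 remain in `cleaned`; the image request and the 404 response are discarded.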
The URL, short for Uniform Resource Locator, is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating the location of the file and how a browser should handle it.
The weighted filtering data prediction jointly processes the weighted average scores of the rows and columns of the matrix to calculate predicted scores, so that every user has a score for every accessed page, which alleviates the sparsity problem. It also evens out the differences in how different users rate accessed pages, normalizing the evaluation results and yielding standardized resources.
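The patent does not give a formula for this prediction, so the following is only one plausible sketch: a missing entry is filled with a weighted blend of its row mean and its column mean. The function name `fill_missing`, the use of `None` for missing entries, and the blending weight `row_weight` are all assumptions made for illustration.

```python
def fill_missing(matrix, row_weight=0.5):
    """Fill None entries with a weighted blend of row mean and column mean.

    `row_weight` is a hypothetical parameter; the patent does not specify
    how row and column averages are combined.
    """
    def mean(values):
        known = [v for v in values if v is not None]
        return sum(known) / len(known) if known else 0.0

    row_means = [mean(row) for row in matrix]
    col_means = [mean(col) for col in zip(*matrix)]
    return [
        [
            v if v is not None
            else row_weight * row_means[i] + (1 - row_weight) * col_means[j]
            for j, v in enumerate(row)
        ]
        for i, row in enumerate(matrix)
    ]

# Two users, three resource categories; None marks a category never browsed.
c = [
    [10.0, None, 30.0],
    [20.0, 40.0, None],
]
filled = fill_missing(c)
```

Known entries are left untouched, so the sketch only densifies the matrix; any further normalization of user scales would be a separate step.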
The k-means clustering algorithm is as follows: k cluster centers are initially chosen at random, and each sample point to be classified is assigned to a cluster according to the nearest-neighbor principle. The centroid of each cluster is then recomputed by averaging, which determines the new cluster centers. This is iterated until the displacement of the cluster centers is smaller than a specified value.
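A minimal sketch of this loop is shown below, assuming Euclidean distance for the nearest-center assignment and a seeded random choice of initial centers. The parameter names `tol`, `max_iter` and `seed` are illustrative choices, not taken from the patent.

```python
import math
import random

def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # random initial cluster centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Recompute each centroid by averaging its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        # Stop once no center moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return labels, centers

# Two well-separated groups of made-up user interest vectors.
points = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (8.8, 9.2)]
labels, centers = kmeans(points, 2)
```

On this toy data the first two points end up in one cluster and the last two in the other; which cluster receives label 0 depends on the random initialization.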
Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. A similarity measure computes how similar individuals are; in contrast to a distance measure, the smaller the value of a similarity measure, the less similar and the more different the individuals. Compared with distance measures, cosine similarity focuses on the difference in direction between two vectors rather than on the difference in distance or length.
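As a small illustration of this definition, the function below computes the cosine of the angle between two interest vectors. Returning 0.0 for a zero vector is a convention chosen here for the sketch, not something the patent states.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for users with no recorded interest
    return dot / (norm_u * norm_v)
```

Two users whose browsing times differ only by a constant factor, such as `(1, 2, 3)` and `(2, 4, 6)`, have similarity 1, which shows that the measure looks at direction rather than length.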
The beneficial effects of the invention are as follows: by preprocessing the data in the Web log file through steps such as data conversion, data cleaning, user identification and session identification, a cleaner, more accurate and more regular data source is obtained; and, in measuring user interest, the time characteristics of users' page visits are taken into account and combined with the interests of similar users, providing users with more accurate, personalized information recommendation.
Brief description
Detailed description of the embodiment
The invention is further described below with reference to an embodiment, but is not limited thereto.
Embodiment:
A personalized information recommendation method based on Web log data, with the following steps:
A. Users access network resources through a multimedia thin client, and the server records this behavior in the server log file;
B. The data in the Web log file on the server is analyzed and preprocessed. Records of users with few visits, which are not representative, and transit pages, a class of data referred to as junk data, are excluded. The original semi-structured Web log data, which is hard for people to read, is converted into structured data; for example, only fields such as the user IP, access time, URL of the accessed page and number of bytes accessed are extracted into a data table, yielding a regular and accurate data source;
Fields corresponding to the content of the Web log file are created in a data table, and the text data is then imported into the data table;
The data in the data table is cleaned. Meaningless data in the user access information is deleted, namely access records whose file suffix is bmp, jpg, jpeg, php or jsp, and log records whose status code is not 200, which indicate unsuccessful access; only log records with the suffix html, htm or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote a lossy compressed image file format; php (Hypertext Preprocessor) is a scripting language that is executed on the server side and embedded in HTML documents; jsp denotes an embedded web page script; and html, htm and xml are web page files;
By default, a status code in the Web log file beginning with 2 indicates that the request succeeded, one beginning with 3 indicates that the user's request was redirected to another location, one beginning with 4 indicates a client error, and one beginning with 5 indicates a server error;
Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;
Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if the threshold is exceeded a new session is considered to have started;
Meaningful accessed pages and access paths are found in the user sessions; the link pages that a user has to pass through in order to reach a target page, i.e. transit pages, are deleted from the session;
C. A user interest matrix is built using collaborative filtering, the similarity between users is calculated, and users with high similarity are selected as similar users;
The user-page matrix is denoted r(m × n), where the entry r_{m,n} is the time user m spent browsing page n. The user-page matrix r(m × n) is converted into a user-resource category matrix c(m × x), where the entry c_{m,x} is the time user m spent browsing resource category x. Weighted filtering data prediction is applied to the matrix c(m × x) to obtain standardized resources, thereby forming the user interest matrix;
Users are clustered with the k-means clustering algorithm, and the similarity of users is evaluated with cosine similarity;
D. A recommendation resource pool is built from the interests of the similar users;
The interest degree $u_{i,j}$ of user i in page j can be expressed as the product of two ratios: the ratio of user i's total dwell time on page j to the sum of user i's browsing time over all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, that is:
$$u_{i,j} = \frac{\sum \mathrm{time}_{i,j}}{\sum_{k=1}^{m} \mathrm{time}_{i,k}} \times \frac{\mathrm{size}_{i,j}}{\sum_{k=1}^{m} \mathrm{size}_{i,k}},$$
where $\sum \mathrm{time}_{i,j}$ is the total time user i stays on page j, $\sum_{k=1}^{m} \mathrm{time}_{i,k}$ is the sum of user i's browsing time over all pages, $\mathrm{size}_{i,j}$ is the byte count of page j, $\sum_{k=1}^{m} \mathrm{size}_{i,k}$ is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;
E. A recommendation threshold is set on the server through a threshold setting unit, and the server selects the pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.
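The interest degree of step D and the threshold selection of step E can be sketched as follows. The data layout (a mapping from page to total dwell time and page size for one user) and the function names are hypothetical. The patent also does not say how the recommendation value of a pooled page is derived from the similar users' interest degrees, so `recommend` simply assumes that a mapping from page to recommendation value is already available.

```python
def interest_degrees(visits):
    """Interest degree per page for one user.

    `visits` maps page -> (total_dwell_seconds, page_size_bytes); this layout
    is an assumption made for the sketch.
    """
    total_time = sum(t for t, _ in visits.values())
    total_size = sum(s for _, s in visits.values())
    # (share of dwell time) x (share of bytes), as in the formula of step D.
    return {
        page: (t / total_time) * (s / total_size)
        for page, (t, s) in visits.items()
    }

def recommend(pool, threshold):
    """Pages whose recommendation value exceeds the threshold, best first."""
    return sorted(
        (page for page, value in pool.items() if value > threshold),
        key=lambda page: -pool[page],
    )

# Made-up visit data for one similar user.
visits_a = {
    "/a.html": (60, 2000),
    "/b.html": (20, 1000),
    "/c.html": (20, 1000),
}
u = interest_degrees(visits_a)
picked = recommend(u, threshold=0.1)
```

With these numbers `/a.html` takes 60% of the dwell time and 50% of the bytes, giving an interest degree of about 0.3, and it is the only page above the illustrative threshold of 0.1.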

Claims (1)

1. A personalized information recommendation method based on Web log data, with the following steps:
A. Users access network resources through a multimedia thin client, and the server records this behavior in the server log file;
B. The data in the Web log file on the server is analyzed and preprocessed; records of users with few visits, which are not representative, and transit pages, a class of data referred to as junk data, are excluded, and the original semi-structured Web log data, which is hard for people to read, is converted into structured data;
Fields corresponding to the content of the Web log file are created in a data table, and the text data is then imported into the data table;
The data in the data table is cleaned. Meaningless data in the user access information is deleted, namely access records whose file suffix is bmp, jpg, jpeg, php or jsp, and log records whose status code is not 200, which indicate unsuccessful access; only log records with the suffix html, htm or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote a lossy compressed image file format; php (Hypertext Preprocessor) is a scripting language that is executed on the server side and embedded in HTML documents; jsp denotes an embedded web page script; and html, htm and xml are web page files;
By default, a status code in the Web log file beginning with 2 indicates that the request succeeded, one beginning with 3 indicates that the user's request was redirected to another location, one beginning with 4 indicates a client error, and one beginning with 5 indicates a server error;
Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;
Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if the threshold is exceeded a new session is considered to have started;
Meaningful accessed pages and access paths are found in the user sessions; the link pages that a user has to pass through in order to reach a target page, i.e. transit pages, are deleted from the session;
C. A user interest matrix is built using collaborative filtering, the similarity between users is calculated, and several users with high similarity are selected as similar users;
The user-page matrix is denoted r(m × n), where the entry r_{m,n} is the time user m spent browsing page n. The user-page matrix r(m × n) is converted into a user-resource category matrix c(m × x), where the entry c_{m,x} is the time user m spent browsing resource category x. Weighted filtering data prediction is applied to the matrix c(m × x) to obtain standardized resources, thereby forming the user interest matrix;
Users are clustered with the k-means clustering algorithm, and the similarity of users is evaluated with cosine similarity;
D. A recommendation resource pool is built from the interests of the similar users;
The interest degree $u_{i,j}$ of user i in page j can be expressed as the product of two ratios: the ratio of user i's total dwell time on page j to the sum of user i's browsing time over all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, that is:
$$u_{i,j} = \frac{\sum \mathrm{time}_{i,j}}{\sum_{k=1}^{m} \mathrm{time}_{i,k}} \times \frac{\mathrm{size}_{i,j}}{\sum_{k=1}^{m} \mathrm{size}_{i,k}},$$
where $\sum \mathrm{time}_{i,j}$ is the total time user i stays on page j, $\sum_{k=1}^{m} \mathrm{time}_{i,k}$ is the sum of user i's browsing time over all pages, $\mathrm{size}_{i,j}$ is the byte count of page j, $\sum_{k=1}^{m} \mathrm{size}_{i,k}$ is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;
E. A recommendation threshold is set on the server through a threshold setting unit, and the server selects the pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.
CN201310717507.4A 2013-12-23 2013-12-23 Information individualized recommendation method based on Web log data Active CN103678652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310717507.4A CN103678652B (en) 2013-12-23 2013-12-23 Information individualized recommendation method based on Web log data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310717507.4A CN103678652B (en) 2013-12-23 2013-12-23 Information individualized recommendation method based on Web log data

Publications (2)

Publication Number Publication Date
CN103678652A (en) 2014-03-26
CN103678652B (en) 2017-02-01

Family

ID=50316196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310717507.4A Active CN103678652B (en) 2013-12-23 2013-12-23 Information individualized recommendation method based on Web log data

Country Status (1)

Country Link
CN (1) CN103678652B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506480B (en) * 2014-06-27 2018-11-23 深圳市永达电子信息股份有限公司 The cross-domain access control method and system combined based on label with audit
CN104331433B (en) * 2014-10-22 2017-12-29 浙江中烟工业有限责任公司 A kind of Tobacco Reference based on mobile terminal user's daily record recommends method
CN105589905B (en) * 2014-12-26 2019-06-18 中国银联股份有限公司 The analysis of user interest data and collection system and its method
CN104866540B (en) * 2015-05-04 2018-04-27 华中科技大学 A kind of personalized recommendation method based on group of subscribers behavioural analysis
CN105589917B (en) * 2015-09-17 2017-05-03 广州市动景计算机科技有限公司 Method and device for analyzing log information of browser
CN106302849A (en) * 2016-08-04 2017-01-04 北京集奥聚合科技有限公司 A kind of method carrying out moving solid fusion by carrier data
CN106302851B (en) * 2016-08-09 2019-08-02 厦门天锐科技股份有限公司 A method of judging that server is accessed by which kind of network type
CN106528852A (en) * 2016-11-25 2017-03-22 盐城工学院 Method and device for conducting redirecting by accessing data
CN107256261B (en) * 2017-06-13 2021-03-19 中原工学院 Electronic information transmission system and method thereof
CN107341397A (en) * 2017-06-30 2017-11-10 福建师范大学 Big data platform session recognition methods based on dynamic time threshold value
CN109388737B (en) * 2017-08-03 2023-03-31 腾讯科技(北京)有限公司 Method and device for sending exposure data of content item and storage medium
CN108109035A (en) * 2017-12-08 2018-06-01 上海电机学院 Webpage recommending method based on Web personalizations
CN108109043A (en) * 2017-12-22 2018-06-01 重庆邮电大学 A kind of commending system reduces the method for repeating to recommend
CN109299375A (en) * 2018-10-24 2019-02-01 中国平安人寿保险股份有限公司 Information personalized push method, device, electronic equipment and storage medium
CN110188566A (en) * 2019-05-19 2019-08-30 复旦大学 A method of the test access behavior based on sequence analysis damages data equity
CN111400628B (en) * 2020-03-12 2023-04-07 腾讯科技(深圳)有限公司 Information propagation method, device, equipment and medium
CN114840486B (en) * 2022-06-28 2022-09-16 广州趣米网络科技有限公司 User behavior data acquisition method and system and cloud platform
CN117194804B (en) * 2023-11-08 2024-01-26 上海银行股份有限公司 Guiding recommendation method and system suitable for operation management system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1967533A (en) * 2006-07-17 2007-05-23 北京航空航天大学 Gateway personalized recommendation service method and system introduced yuan recommendation engine
CN102141986A (en) * 2010-01-28 2011-08-03 北京邮电大学 Individualized information providing method and system based on user behaviors
CN102819575A (en) * 2012-07-20 2012-12-12 南京大学 Personalized search method for Web service recommendation

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A Taxonomy of Recommender Agents on the Internet;MIQUEL MONTANER 等;《Artificial Intelligence Review》;20030831;第19卷(第4期);全文 *
Toward the Next Generation of Recommender;Gediminas Adomavicius等;《IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING》;20050630;第17卷(第6期);全文 *
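Web日志挖掘技术的研究与应用 (Research and application of Web log mining technology); 陈文臣 (Chen Wenchen); China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2007-02-28; full text *
基于Web日志挖掘的个性化推荐研究 (Research on personalized recommendation based on Web log mining); 张海鹏 (Zhang Haipeng); China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2007-06-30; full text *
基于Web日志的用户兴趣聚类研究 (Research on user interest clustering based on Web logs); 陈峰 (Chen Feng); China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2008-11-30; full text *
用户profile中用户兴趣度计算方法的改进 (Improvement of the method for calculating user interest degree in user profiles); 温彩玲 (Wen Cailing); Journal of Taiyuan Urban Vocational College; 2010-01-31; No. 1; full text *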

Also Published As

Publication number Publication date
CN103678652A (en) 2014-03-26

Similar Documents

Publication Publication Date Title
CN103678652B (en) Information individualized recommendation method based on Web log data
US10776885B2 (en) Mutually reinforcing ranking of social media accounts and contents
US20090319449A1 (en) Providing context for web articles
CN108334489B (en) Text core word recognition method and device
CN103778260A (en) Individualized microblog information recommending system and method
US20140351267A1 (en) Overlapping Community Detection in Weighted Graphs
US20170235726A1 (en) Information identification and extraction
CN107153716B (en) Webpage content extraction method and device
CN107241215B (en) User behavior prediction method and device
CN102207967B (en) Method and system for automatically providing new browser plugin
CN112749326A (en) Information processing method, information processing device, computer equipment and storage medium
US20180046628A1 (en) Ranking social media content
CN111429161B (en) Feature extraction method, feature extraction device, storage medium and electronic equipment
CN113592522A (en) Method and apparatus for processing traffic data, and computer-readable storage medium
CN102662972A (en) A visually disabled person-oriented automatic picture description method for web content barrier-free access
US11269896B2 (en) System and method for automatic difficulty level estimation
CN109819002B (en) Data pushing method and device, storage medium and electronic device
CN112287272A (en) Method, system and storage medium for classifying website list pages
CN103870452A (en) Method and method for recommending data
CN112307352A (en) Content recommendation method, system, device and storage medium
CN104537080B (en) Information recommends method and system
CN115204436A (en) Method, device, equipment and medium for detecting abnormal reasons of business indexes
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
CN108920546B (en) Steady-state label development method and system based on user requirements
Lin et al. Combining a segmentation-like approach and a density-based approach in content extraction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant