CN103678652B - Information individualized recommendation method based on Web log data - Google Patents
Information individualized recommendation method based on Web log data
- Publication number
- CN103678652B (application number CN201310717507.4A)
- Authority
- CN
- China
- Prior art keywords
- user
- data
- page
- users
- server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
- G06F16/337—Profile generation, learning or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a personalized information recommendation method based on Web log data and belongs to the technical field of electronic information. The method applies to an information model consisting of a server, a broadband network, and multimedia thin clients. The method comprises the following steps: users access Internet resources through a multimedia thin client, and the server records their behavior in server log files; a clean, regular, and accurate data source is extracted by analyzing and preprocessing the Web log file data on the server; a user interest matrix is built using collaborative filtering, the similarity between users is computed, and users with high similarity are selected as similar users; a recommendation resource pool is built according to the interests and preferences of the similar users; the server selects the pages in the recommendation resource pool whose recommendation value is greater than the threshold and recommends them to the users. The advantages of the method are that preprocessing the data in the Web log files yields a clean and regular data source, and that combining the interests and preferences of similar users provides users with accurate, personalized information recommendations.
Description
Technical field
The present invention relates to a personalized information recommendation method based on Web log data and belongs to the field of electronic information technology.
Background technology
With the rapid development of the Internet, a massive number of web pages are updated or published on the Internet every day. For users, finding satisfactory information in this volume of information has become increasingly difficult, which gives rise to the contradictory phenomena of "information overload" and "information hunger". Personalized information services were proposed to solve this problem. They are an intelligent form of information service: relevant information is actively searched for according to the user's information needs and customization, and online intelligent recommendation or push technology is used to deliver the information a user needs accurately to that user.

Among personalized service technologies, collaborative filtering has been applied with notable success. In this method, users cooperate with other users according to their own needs to form certain cooperation rules, or the interests of a single user are predicted from the tendencies of multiple information users; information is then evaluated according to users with the same interests and preferences, which yields the recommendation results.

Because Web logs record a large amount of user behavior information, they can provide important data support for personalized services. However, raw log records are messy, incomplete, and unstructured, so they must be preprocessed efficiently. In addition, with respect to measuring user interest, existing recommendation methods that extract users' access patterns from access log files do not take into account the temporal characteristics of page visits, even though a user's level of interest in a page can be measured by how long the user stays on that page. Patent No. 103338223A, filed by Tsinghua University and entitled "A mobile application recommendation method, client and server", falls into this category.

To address this problem, a personalized information recommendation method based on Web log data is proposed. First, the data in the log files are analyzed and preprocessed to ensure that a clean, regular, and accurate data source is extracted. Second, the temporal characteristics of users' page visits are taken into account and combined with the interests and preferences of similar users, in order to provide users with more accurate, personalized information recommendations.
Content of the invention
In view of the defects and shortcomings of the existing background art, the present invention proposes a personalized information recommendation method based on Web log data. It aims to solve two problems of traditional information recommendation methods based on Web log data: the extracted data source is not sufficiently clean and regular, and user interest is measured inadequately. The method can provide users with more accurate, personalized information recommendations.
The technical scheme is as follows:
A personalized information recommendation method based on Web log data, with the following steps:
A. The user accesses resources on the network through a multimedia thin client, and the server records this user behavior in server log files;
B. The data of the Web log files on the server are analyzed and preprocessed. Access records of unrepresentative users with few visits, together with intermediate (transit) pages, form a class of data referred to as junk data and are excluded. The original semi-structured Web log data, which are hard for people to read and contain, for example, only the fields user IP, access time, URL of the accessed page, and number of bytes accessed, are converted into a structured data table and extracted as a regular, accurate data source;
According to the content of the Web log files, the corresponding fields are created in a data table, and the text data are then imported into the data table;
The data in the data table are cleaned. Meaningless data in the user access information are deleted, including access records whose file suffix is bmp, jpg, jpeg, php, or jsp, and log records whose status code is not 200, which represent unsuccessful accesses; only log records whose suffix is html, htm, or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote lossy-compressed image file formats; php (Hypertext Preprocessor) is a scripting language executed on the server and embedded in HTML documents; jsp (JavaServer Pages) denotes an embedded web page script; and html, htm, and xml are web page files;
By default, a status code in a Web log file that begins with 2 indicates a successful request, one that begins with 3 indicates that the user's request was redirected to another location, one that begins with 4 indicates a client error, and one that begins with 5 indicates a server error;
Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;
Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if it is exceeded, a new session is considered to have started;
Meaningful accessed pages and access paths are found from the user sessions; link pages that the user had to visit only in order to reach the target page, that is, intermediate pages, are deleted from the session;
C. A user interest matrix is built using collaborative filtering, the similarity between users is computed, and users with high similarity are selected as similar users;
The user-page matrix is denoted r(m × n), where the entry $r_{m,n}$ is the time user m spent browsing page n. The user-page matrix r(m × n) is converted into a user-resource-category matrix c(m × x), where the entry $c_{m,x}$ is the time user m spent browsing resource category x. Weighted-filtering data prediction is applied to c(m × x) to obtain standardized resources, thereby forming the user interest matrix;
Users are clustered with the k-means clustering algorithm, and cosine similarity is chosen to evaluate the similarity between users;
D. A recommendation resource pool is built according to the interests and preferences of the similar users;
The interest degree $u_{i,j}$ of user i in page j can be expressed as the product of two ratios: the ratio of user i's total dwell time on page j to user i's total browsing time over all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, that is:

$$u_{i,j} = \frac{time_{i,j}}{\sum_{k=1}^{m} time_{i,k}} \times \frac{size_{i,j}}{\sum_{k=1}^{m} size_{i,k}}$$

where $time_{i,j}$ is the total dwell time of user i on page j, $\sum_{k=1}^{m} time_{i,k}$ is the total time user i spent browsing all pages, $size_{i,j}$ is the byte count of page j, $\sum_{k=1}^{m} size_{i,k}$ is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;
E. A recommendation threshold is set on the server through a threshold setting unit; the server selects the web pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.
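Steps D and E can be illustrated with a minimal Python sketch. It assumes that one user's dwell times and page sizes are already available as dictionaries keyed by page URL, and that the recommendation value of a page is simply its interest degree; the function names `interest_degrees` and `recommend` and all sample numbers are hypothetical and do not come from the patent.

```python
def interest_degrees(dwell, size):
    """Return u[j] = (time_j / sum_k time_k) * (size_j / sum_k size_k) for one user."""
    total_time = sum(dwell.values())
    total_size = sum(size[page] for page in dwell)
    if total_time == 0 or total_size == 0:
        # No usable history: every page gets interest degree 0.
        return {page: 0.0 for page in dwell}
    return {
        page: (dwell[page] / total_time) * (size[page] / total_size)
        for page in dwell
    }


def recommend(pool, threshold):
    """Return pages whose value is strictly greater than the threshold, best first."""
    ranked = sorted(pool.items(), key=lambda item: item[1], reverse=True)
    return [page for page, value in ranked if value > threshold]


# Hypothetical sample data: dwell time in seconds, page size in bytes.
dwell = {"/a.html": 60.0, "/b.html": 30.0, "/c.html": 10.0}
size = {"/a.html": 2000, "/b.html": 6000, "/c.html": 2000}

u = interest_degrees(dwell, size)
picked = recommend(u, threshold=0.1)
print(picked)
```

In this sketch a long dwell time on a small page and a short dwell time on a large page can produce similar interest degrees, because the two ratios are multiplied.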
The URL (Uniform Resource Locator) is a concise representation of the location of, and the access method for, a resource available on the Internet; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which contains information indicating the location of the file and how a browser should handle it.
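Because the cleaning step filters log records by file suffix, it helps to see how a suffix can be read from a URL. The sketch below uses Python's standard `urllib.parse.urlsplit` and `posixpath.splitext`; the helper name `url_suffix` and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def url_suffix(url):
    """Return the lower-case file suffix of the URL path, without the dot."""
    path = urlsplit(url).path          # drops scheme, host, query, and fragment
    return posixpath.splitext(path)[1].lstrip(".").lower()


print(url_suffix("http://example.com/news/index.html?id=3"))
print(url_suffix("/img/logo.JPG"))
```

Splitting off the query string first avoids misreading a URL such as `/index.html?file=a.jpg` as an image request.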
The weighted-filtering data prediction computes a predicted score by jointly processing the weighted average scores of the rows and columns of the matrix. In this way every user has a score for every accessed page, which alleviates the sparsity problem. It also evens out the differences in how different users rate accessed pages, normalizing the evaluation results and yielding standardized resources.
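The patent does not give a formula for this prediction, so the following is only one possible reading, sketched in Python. It first aggregates per-page dwell times into the user-resource-category matrix c(m × x) and then fills each missing entry with a weighted combination of its row mean and column mean. The weight `alpha`, the function names, the page-to-category mapping, and the sample data are all assumptions made for illustration.

```python
def to_category_matrix(dwell, page_category, users, categories):
    """Aggregate dwell[user][page] into rows of per-category totals; None = never visited."""
    matrix = []
    for user in users:
        totals = {}
        for page, seconds in dwell.get(user, {}).items():
            category = page_category[page]
            totals[category] = totals.get(category, 0.0) + seconds
        matrix.append([totals.get(category) for category in categories])
    return matrix


def fill_missing(matrix, alpha=0.5):
    """Replace None with alpha * row mean + (1 - alpha) * column mean (assumed rule)."""
    def mean(values):
        known = [v for v in values if v is not None]
        return sum(known) / len(known) if known else 0.0

    row_means = [mean(row) for row in matrix]
    col_means = [mean(col) for col in zip(*matrix)]
    return [
        [v if v is not None else alpha * row_means[i] + (1 - alpha) * col_means[j]
         for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Hypothetical sample data: dwell times in seconds.
dwell = {
    "u1": {"/sports/a.html": 30, "/sports/b.html": 10, "/tech/x.html": 20},
    "u2": {"/news/n.html": 60, "/tech/x.html": 20},
}
page_category = {"/sports/a.html": "sports", "/sports/b.html": "sports",
                 "/news/n.html": "news", "/tech/x.html": "tech"}

c = to_category_matrix(dwell, page_category, ["u1", "u2"], ["sports", "news", "tech"])
filled = fill_missing(c)
print(filled)
```

Filling the gaps this way gives every user a value for every category, which is the stated purpose of the step; other weightings or normalizations would fit the description equally well.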
The k-means clustering algorithm works as follows: k cluster centers are initially chosen at random, and each sample point to be classified is assigned to a cluster according to the nearest-neighbor principle. The centroid of each cluster is then recomputed by averaging, which determines the new cluster center. This is iterated until the displacement of the cluster centers is less than a specified value.
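A minimal pure-Python sketch of this loop follows. It uses Euclidean distance for the nearest-center step, which is an assumption (the patent evaluates user similarity with cosine similarity), and the names `kmeans`, `tol`, `max_iter`, and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups; return (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # random initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(xs) / len(cluster) for xs in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                        # stop once the centers barely move
            break
    return centers, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

In the patent's setting each point would be one row of the user interest matrix, so that users with similar browsing-time profiles end up in the same cluster.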
Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. A similarity measure computes the degree of similarity between individuals; in contrast to a distance measure, a smaller similarity value means that the individuals are less similar and differ more. Compared with distance measures, cosine similarity focuses on the difference in direction between two vectors rather than on differences in distance or length.
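As a small illustration, the cosine of the angle between two interest vectors can be computed as below. The function name `cosine_similarity` and the sample vectors are hypothetical; returning 0.0 for a zero vector is a convention chosen here, not something the patent specifies.

```python
import math


def cosine_similarity(a, b):
    """Return cos(angle) between vectors a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])   # same direction, different length
orthogonal = cosine_similarity([1, 0], [0, 1])             # no shared interest
print(same_direction, orthogonal)
```

The first pair differs only in length, so the two users are treated as having the same interest profile even though one of them browsed twice as long.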
The beneficial effects of the invention are as follows. By preprocessing the data in the Web log files through data conversion, data cleaning, user identification, session identification, and similar steps, a cleaner, more accurate, and more regular data source is obtained. Moreover, in measuring user interest, the temporal characteristics of users' page visits are taken into account and combined with the interests and preferences of similar users, which provides users with more accurate, personalized information recommendations.
Brief description
Specific embodiment
The invention is further described below with reference to an embodiment, but is not limited thereto.
Embodiment:
A personalized information recommendation method based on Web log data, with the following steps:
A. The user accesses resources on the network through a multimedia thin client, and the server records this user behavior in server log files;
B. The data of the Web log files on the server are analyzed and preprocessed. Access records of unrepresentative users with few visits, together with intermediate (transit) pages, form a class of data referred to as junk data and are excluded. The original semi-structured Web log data, which are hard for people to read and contain, for example, only the fields user IP, access time, URL of the accessed page, and number of bytes accessed, are converted into a structured data table and extracted as a regular, accurate data source;
According to the content of the Web log files, the corresponding fields are created in a data table, and the text data are then imported into the data table;
The data in the data table are cleaned. Meaningless data in the user access information are deleted, including access records whose file suffix is bmp, jpg, jpeg, php, or jsp, and log records whose status code is not 200, which represent unsuccessful accesses; only log records whose suffix is html, htm, or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote lossy-compressed image file formats; php (Hypertext Preprocessor) is a scripting language executed on the server and embedded in HTML documents; jsp (JavaServer Pages) denotes an embedded web page script; and html, htm, and xml are web page files;
By default, a status code in a Web log file that begins with 2 indicates a successful request, one that begins with 3 indicates that the user's request was redirected to another location, one that begins with 4 indicates a client error, and one that begins with 5 indicates a server error;
Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;
Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if it is exceeded, a new session is considered to have started;
Meaningful accessed pages and access paths are found from the user sessions; link pages that the user had to visit only in order to reach the target page, that is, intermediate pages, are deleted from the session;
C. A user interest matrix is built using collaborative filtering, the similarity between users is computed, and users with high similarity are selected as similar users;
The user-page matrix is denoted r(m × n), where the entry $r_{m,n}$ is the time user m spent browsing page n. The user-page matrix r(m × n) is converted into a user-resource-category matrix c(m × x), where the entry $c_{m,x}$ is the time user m spent browsing resource category x. Weighted-filtering data prediction is applied to c(m × x) to obtain standardized resources, thereby forming the user interest matrix;
Users are clustered with the k-means clustering algorithm, and cosine similarity is chosen to evaluate the similarity between users;
D. A recommendation resource pool is built according to the interests and preferences of the similar users;
The interest degree $u_{i,j}$ of user i in page j can be expressed as the product of two ratios: the ratio of user i's total dwell time on page j to user i's total browsing time over all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, that is:

$$u_{i,j} = \frac{time_{i,j}}{\sum_{k=1}^{m} time_{i,k}} \times \frac{size_{i,j}}{\sum_{k=1}^{m} size_{i,k}}$$

where $time_{i,j}$ is the total dwell time of user i on page j, $\sum_{k=1}^{m} time_{i,k}$ is the total time user i spent browsing all pages, $size_{i,j}$ is the byte count of page j, $\sum_{k=1}^{m} size_{i,k}$ is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;
E. A recommendation threshold is set on the server through a threshold setting unit; the server selects the web pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.
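The preprocessing in step B of the embodiment can be sketched in Python as follows. The log line layout (IP, ISO timestamp, URL, status code, byte count, separated by spaces), the 30-minute session gap, the minimum visit count, and every function name are assumptions made for illustration; the patent fixes none of them. The time threshold is interpreted here as the maximum gap between consecutive requests of one user, which is only one possible reading.

```python
import posixpath
from collections import defaultdict
from datetime import datetime, timedelta
from urllib.parse import urlsplit

KEEP_SUFFIXES = {"html", "htm", "xml"}   # suffixes the method retains


def parse_line(line):
    """Turn one raw log line into a structured record (assumed field order)."""
    ip, ts, url, status, nbytes = line.split()
    return {"ip": ip, "time": datetime.fromisoformat(ts), "url": url,
            "status": int(status), "bytes": int(nbytes)}


def is_clean(rec):
    """Keep only successful (200) requests for html, htm, or xml pages."""
    suffix = posixpath.splitext(urlsplit(rec["url"]).path)[1].lstrip(".").lower()
    return rec["status"] == 200 and suffix in KEEP_SUFFIXES


def sessions_by_user(records, min_visits=2, gap=timedelta(minutes=30)):
    """Group records by IP, drop low-traffic users, and split sessions on long gaps."""
    by_ip = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["time"]):
        by_ip[rec["ip"]].append(rec)
    result = {}
    for ip, recs in by_ip.items():
        if len(recs) < min_visits:          # not representative enough
            continue
        sessions = [[recs[0]]]
        for prev, cur in zip(recs, recs[1:]):
            if cur["time"] - prev["time"] > gap:
                sessions.append([])         # threshold exceeded: new session
            sessions[-1].append(cur)
        result[ip] = sessions
    return result


lines = [
    "10.0.0.1 2013-12-23T09:00:00 /index.html 200 5120",
    "10.0.0.1 2013-12-23T09:00:01 /logo.jpg 200 20480",
    "10.0.0.1 2013-12-23T09:05:00 /news.html 200 8192",
    "10.0.0.1 2013-12-23T11:00:00 /about.htm 200 1024",
    "10.0.0.1 2013-12-23T11:01:00 /missing.html 404 0",
    "10.0.0.2 2013-12-23T09:10:00 /index.html 200 5120",
]
records = [rec for rec in map(parse_line, lines) if is_clean(rec)]
sessions = sessions_by_user(records)
print({ip: [[r["url"] for r in s] for s in ss] for ip, ss in sessions.items()})
```

The sketch leaves out the removal of intermediate pages, since deciding which pages are mere transit pages requires site-specific knowledge that the text does not spell out.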
Claims (1)
1. A personalized information recommendation method based on Web log data, with the following steps:
A. The user accesses resources on the network through a multimedia thin client, and the server records this user behavior in server log files;
B. The data of the Web log files on the server are analyzed and preprocessed. Access records of unrepresentative users with few visits, together with intermediate (transit) pages, form a class of data referred to as junk data and are excluded. The original semi-structured Web log data, which are hard for people to read, are converted into structured data;
According to the content of the Web log files, the corresponding fields are created in a data table, and the text data are then imported into the data table;
The data in the data table are cleaned. Meaningless data in the user access information are deleted, including access records whose file suffix is bmp, jpg, jpeg, php, or jsp, and log records whose status code is not 200, which represent unsuccessful accesses; only log records whose suffix is html, htm, or xml are retained. Here bmp denotes a bitmap; jpg and jpeg denote lossy-compressed image file formats; php (Hypertext Preprocessor) is a scripting language executed on the server and embedded in HTML documents; jsp (JavaServer Pages) denotes an embedded web page script; and html, htm, and xml are web page files;
By default, a status code in a Web log file that begins with 2 indicates a successful request, one that begins with 3 indicates that the user's request was redirected to another location, one that begins with 4 indicates a client error, and one that begins with 5 indicates a server error;
Different users are identified by their IP addresses, and users whose number of visits reaches a certain value are selected for behavior analysis;
Sessions are identified according to the time the user stays on the whole website: a time threshold is set, and if it is exceeded, a new session is considered to have started;
Meaningful accessed pages and access paths are found from the user sessions; link pages that the user had to visit only in order to reach the target page, that is, intermediate pages, are deleted from the session;
C. A user interest matrix is built using collaborative filtering, the similarity between users is computed, and a number of users with high similarity are selected as similar users;
The user-page matrix is denoted r(m × n), where the entry $r_{m,n}$ is the time user m spent browsing page n. The user-page matrix r(m × n) is converted into a user-resource-category matrix c(m × x), where the entry $c_{m,x}$ is the time user m spent browsing resource category x. Weighted-filtering data prediction is applied to c(m × x) to obtain standardized resources, thereby forming the user interest matrix;
Users are clustered with the k-means clustering algorithm, and cosine similarity is chosen to evaluate the similarity between users;
D. A recommendation resource pool is built according to the interests and preferences of the similar users;
The interest degree $u_{i,j}$ of user i in page j can be expressed as the product of two ratios: the ratio of user i's total dwell time on page j to user i's total browsing time over all pages, and the ratio of the byte count of page j to the sum of the byte counts of all accessed pages, that is:

$$u_{i,j} = \frac{time_{i,j}}{\sum_{k=1}^{m} time_{i,k}} \times \frac{size_{i,j}}{\sum_{k=1}^{m} size_{i,k}}$$

where $time_{i,j}$ is the total dwell time of user i on page j, $\sum_{k=1}^{m} time_{i,k}$ is the total time user i spent browsing all pages, $size_{i,j}$ is the byte count of page j, $\sum_{k=1}^{m} size_{i,k}$ is the sum of the byte counts of all accessed pages, k is a natural number, and m is the total number of pages;
E. A recommendation threshold is set on the server through a threshold setting unit; the server selects the web pages in the recommendation resource pool whose recommendation value is greater than the specified threshold and recommends them to the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310717507.4A CN103678652B (en) | 2013-12-23 | 2013-12-23 | Information individualized recommendation method based on Web log data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310717507.4A CN103678652B (en) | 2013-12-23 | 2013-12-23 | Information individualized recommendation method based on Web log data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103678652A CN103678652A (en) | 2014-03-26 |
CN103678652B true CN103678652B (en) | 2017-02-01 |
Family
ID=50316196
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310717507.4A Active CN103678652B (en) | 2013-12-23 | 2013-12-23 | Information individualized recommendation method based on Web log data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103678652B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104506480B (en) * | 2014-06-27 | 2018-11-23 | 深圳市永达电子信息股份有限公司 | The cross-domain access control method and system combined based on label with audit |
CN104331433B (en) * | 2014-10-22 | 2017-12-29 | 浙江中烟工业有限责任公司 | A kind of Tobacco Reference based on mobile terminal user's daily record recommends method |
CN105589905B (en) * | 2014-12-26 | 2019-06-18 | 中国银联股份有限公司 | The analysis of user interest data and collection system and its method |
CN104866540B (en) * | 2015-05-04 | 2018-04-27 | 华中科技大学 | A kind of personalized recommendation method based on group of subscribers behavioural analysis |
CN105589917B (en) * | 2015-09-17 | 2017-05-03 | 广州市动景计算机科技有限公司 | Method and device for analyzing log information of browser |
CN106302849A (en) * | 2016-08-04 | 2017-01-04 | 北京集奥聚合科技有限公司 | A kind of method carrying out moving solid fusion by carrier data |
CN106302851B (en) * | 2016-08-09 | 2019-08-02 | 厦门天锐科技股份有限公司 | A method of judging that server is accessed by which kind of network type |
CN106528852A (en) * | 2016-11-25 | 2017-03-22 | 盐城工学院 | Method and device for conducting redirecting by accessing data |
CN107256261B (en) * | 2017-06-13 | 2021-03-19 | 中原工学院 | Electronic information transmission system and method thereof |
CN107341397A (en) * | 2017-06-30 | 2017-11-10 | 福建师范大学 | Big data platform session recognition methods based on dynamic time threshold value |
CN109388737B (en) * | 2017-08-03 | 2023-03-31 | 腾讯科技(北京)有限公司 | Method and device for sending exposure data of content item and storage medium |
CN108109035A (en) * | 2017-12-08 | 2018-06-01 | 上海电机学院 | Webpage recommending method based on Web personalizations |
CN108109043A (en) * | 2017-12-22 | 2018-06-01 | 重庆邮电大学 | A kind of commending system reduces the method for repeating to recommend |
CN109299375A (en) * | 2018-10-24 | 2019-02-01 | 中国平安人寿保险股份有限公司 | Information personalized push method, device, electronic equipment and storage medium |
CN110188566A (en) * | 2019-05-19 | 2019-08-30 | 复旦大学 | A method of the test access behavior based on sequence analysis damages data equity |
CN111400628B (en) * | 2020-03-12 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Information propagation method, device, equipment and medium |
CN114840486B (en) * | 2022-06-28 | 2022-09-16 | 广州趣米网络科技有限公司 | User behavior data acquisition method and system and cloud platform |
CN117194804B (en) * | 2023-11-08 | 2024-01-26 | 上海银行股份有限公司 | Guiding recommendation method and system suitable for operation management system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1967533A (en) * | 2006-07-17 | 2007-05-23 | 北京航空航天大学 | Gateway personalized recommendation service method and system introduced yuan recommendation engine |
CN102141986A (en) * | 2010-01-28 | 2011-08-03 | 北京邮电大学 | Individualized information providing method and system based on user behaviors |
CN102819575A (en) * | 2012-07-20 | 2012-12-12 | 南京大学 | Personalized search method for Web service recommendation |
Non-Patent Citations (6)
Title |
---|
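A Taxonomy of Recommender Agents on the Internet; Miquel Montaner et al.; Artificial Intelligence Review; 20030831; Vol. 19, No. 4; full text * |
Toward the Next Generation of Recommender; Gediminas Adomavicius et al.; IEEE Transactions on Knowledge and Data Engineering; 20050630; Vol. 17, No. 6; full text * |
Research and Application of Web Log Mining Technology (Web日志挖掘技术的研究与应用); Chen Wenchen; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 20070228; full text * |
Research on Personalized Recommendation Based on Web Log Mining (基于Web日志挖掘的个性化推荐研究); Zhang Haipeng; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 20070630; full text * |
Research on User Interest Clustering Based on Web Logs (基于Web日志的用户兴趣聚类研究); Chen Feng; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 20081130; full text * |
Improvement of the Method for Calculating User Interest Degree in User Profiles (用户profile中用户兴趣度计算方法的改进); Wen Cailing; Journal of Taiyuan Urban Vocational College; 20100131; No. 1; full text * |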
Also Published As
Publication number | Publication date |
---|---|
CN103678652A (en) | 2014-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103678652B (en) | Information individualized recommendation method based on Web log data | |
US10776885B2 (en) | Mutually reinforcing ranking of social media accounts and contents | |
US20090319449A1 (en) | Providing context for web articles | |
CN108334489B (en) | Text core word recognition method and device | |
CN103778260A (en) | Individualized microblog information recommending system and method | |
US20140351267A1 (en) | Overlapping Community Detection in Weighted Graphs | |
US20170235726A1 (en) | Information identification and extraction | |
CN107153716B (en) | Webpage content extraction method and device | |
CN107241215B (en) | User behavior prediction method and device | |
CN102207967B (en) | Method and system for automatically providing new browser plugin | |
CN112749326A (en) | Information processing method, information processing device, computer equipment and storage medium | |
US20180046628A1 (en) | Ranking social media content | |
CN111429161B (en) | Feature extraction method, feature extraction device, storage medium and electronic equipment | |
CN113592522A (en) | Method and apparatus for processing traffic data, and computer-readable storage medium | |
CN102662972A (en) | A visually disabled person-oriented automatic picture description method for web content barrier-free access | |
US11269896B2 (en) | System and method for automatic difficulty level estimation | |
CN109819002B (en) | Data pushing method and device, storage medium and electronic device | |
CN112287272A (en) | Method, system and storage medium for classifying website list pages | |
CN103870452A (en) | Method and method for recommending data | |
CN112307352A (en) | Content recommendation method, system, device and storage medium | |
CN104537080B (en) | Information recommends method and system | |
CN115204436A (en) | Method, device, equipment and medium for detecting abnormal reasons of business indexes | |
CN116956183A (en) | Multimedia resource recommendation method, model training method, device and storage medium | |
CN108920546B (en) | Steady-state label development method and system based on user requirements | |
Lin et al. | Combining a segmentation-like approach and a density-based approach in content extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |