CN103401849A - Abnormal session analyzing method for website logs - Google Patents

Abnormal session analyzing method for website logs

Info

Publication number
CN103401849A
Authority
CN
China
Prior art keywords
session
access
abnormal
page
conversation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310303384XA
Other languages
Chinese (zh)
Other versions
CN103401849B (en)
Inventor
陆道宏
汤伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rock Software (shanghai) Co Ltd
Original Assignee
Rock Software (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rock Software (shanghai) Co Ltd
Priority to CN201310303384.XA
Publication of CN103401849A
Application granted
Publication of CN103401849B
Active
Anticipated expiration

Abstract

The invention discloses an abnormal session analysis method for website logs. The method treats each user session as an independent, purposeful access unit (for example, access to a page, a picture, or a script). In the initial analysis stage, normal session flows are derived by combining the logs with automated access to the website; during subsequent processing, further normal session flows are added continuously, and every session that is not normal is classified as abnormal. Abnormal sessions can be divided into different levels, which are displayed and processed separately. By taking anomaly analysis as its starting point, the method can greatly improve the efficiency of anomaly analysis of website logs without missing any abnormal access session, which makes it particularly valuable for security and forensic applications that analyze hacker attacks.

Description

An abnormal session analysis method for website logs
Technical field
The present invention relates to techniques for analyzing and processing website data, and in particular to techniques for analyzing website logs.
Background art
Website log analysis has many different applications, and the requirements on log analysis and the way logs are processed differ with the purpose of the application. Web data mining research aimed at analyzing system performance mostly uses statistical methods; data mining aimed at improving system design mostly uses association rule mining; data mining research aimed at understanding user intent mostly uses cluster mining and classification mining. All of these fields involve techniques such as data cleaning, session identification, and user identification, but because the application requirements differ, the way website logs are analyzed and processed also differs widely.
As regards abnormal use of website data, anomalies include anomalies of access purpose, access mode, access behavior, and access tool. A website provides content and application information; a normal visitor uses a browser to access page content on the website and uses the website to solve a particular kind of problem. An abnormal user accesses the site to obtain different information and data, so the access mode, access behavior, and access tool all differ. When abnormal session analysis of website data is applied to forensic analysis, the requirements on log processing are stricter: not even the slightest anomaly may be missed if the source of an attack or of a website anomaly is ultimately to be found.
Data cleaning means removing irrelevant data from the log records. Traditional web analytics is concerned with the linked pages that a user visits; the pictures, display markup, scripts, and so on that a page contains are all regarded as part of the page, and these entries are deleted outright as irrelevant during data cleaning. For anomaly analysis, no data is irrelevant: the number of items each page contains is ultimately a fixed number, and a session that fetches fewer or more than this number can be regarded as abnormal.
A user session is one effective access by a user to the server. The user session that a normal website expects is simply that the user clicks a link and the website sends the user all of the data content associated with that link. Traditional session analysis may treat a session as a chain of several linked pages that a user visits on the website and uses a simple session determination method; the log entries that remain after data cleaning form the content of the session.
Traditional website log analysis focuses on the normal access patterns of a website, for example load optimization analysis and user pattern analysis. The number of log entries that such methods must analyze is very large, so using them for anomaly analysis of website logs is very inefficient.
Moreover, the methods used in traditional website log analysis are not suited to security and forensic applications that analyze hacker attacks. In hacking cases, the website-related activities of the attacker include reconnaissance, scanning, launching attacks, uploading, privilege escalation, taking control, denial-of-service attacks, and so on, and these differ greatly from a normal user's access mode, access behavior, and access tools. Traditional website log analysis methods are therefore not applicable to security and forensic analysis of hacker attacks.
Summary of the invention
To address the low efficiency of existing website log analysis and its unsuitability for security and forensic analysis of hacker attacks, the present invention provides an abnormal session analysis method for website logs. The method not only greatly improves the efficiency of anomaly analysis of website logs but is also useful for security and forensic applications that analyze hacker attacks.
To achieve the above object, the present invention adopts the following technical solution:
An abnormal session analysis method for website logs, in which each user session is treated as an independent, purposeful access unit; in the initial analysis stage, normal session flows are derived by combining the logs with automated access to the website, and further normal session flows are added continuously during subsequent processing; every session that is not normal is classified as abnormal, and anomalies can be divided into different levels that are displayed and processed separately.
In a preferred embodiment of the present invention, the method is implemented in the following steps:
(1) load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the page/file access addresses;
(2) simulate a browser and user behavior: start a crawler engine to fetch the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents;
(3) using the document object model and page link information generated by the crawler, together with the access information in the logs, perform a second analysis of the website logs, generate a preliminary session information stream, and mark each website session as normal or abnormal;
(4) classify the abnormal and normal sessions by URL pattern, access mode, returned result, and so on, and feed the result back to the user; the user can modify the abnormal/normal attribute of a session and can merge, split, or adjust the categories;
(5) process the log sessions according to the user's adjustments and output all abnormal session streams; anomalies can be divided into different levels according to the built-in configuration for display and further processing.
Further, step (1) scans each log entry and parses the client IP, access time, access method, accessed page link, client program, server return value, server status, and so on; each accessed page link (URL) obtained by the analysis is a page/file access address of the website.
Further, in step (2) a simulated browser accesses each distinct URL of the website and the document object model of the returned page content is analyzed. If accessing a document object does not cause any referenced access to another object, the access is defined as an atomic access. If accessing a document simultaneously fetches the content of other document objects, the links to those document objects are included with the document to form one atomic access. An atomic access that contains links to several document objects forms a standard session; an atomic access that contains only a single document object access does not form a standard session.
Further, step (3) scans the logs according to the sessions and atomic accesses determined in step (2), converting the website log stream into a session information stream; all standard sessions are defined as normal sessions, and all non-standard sessions are defined as abnormal sessions.
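The conversion from a log stream to a session stream can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `build_sessions`, `matches_standard`, and `standard` are hypothetical, the log is assumed to be already reduced to time-ordered `(client, url)` pairs, and a request is attached to a client's most recent page session only if the crawler saw that page load the requested object.

```python
from collections import namedtuple

# One session: the client, the page that opened it, and every URL it contains.
Session = namedtuple("Session", "client root urls")

def build_sessions(entries, standard):
    """Group time-ordered (client, url) pairs into sessions.

    standard maps a page URL to the set of objects the crawler saw it load
    (a hypothetical representation of the standard sessions from step (2)).
    """
    sessions = []
    open_page = {}  # client -> index of that client's most recent page session
    for client, url in entries:
        idx = open_page.get(client)
        if idx is not None:
            current = sessions[idx]
            if url in standard.get(current.root, ()) and url not in current.urls:
                current.urls.append(url)  # object belongs to the open page session
                continue
        sessions.append(Session(client, url, [url]))
        if url in standard:
            open_page[client] = len(sessions) - 1
    return sessions

def matches_standard(session, standard):
    """True if the session fetched exactly the objects the crawler saw for its page."""
    expected = standard.get(session.root)
    return expected is not None and set(session.urls[1:]) == set(expected)
```

A session for which `matches_standard` returns true corresponds to a normal session in the sense of step (3); every other session is a candidate abnormal session.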
Further, in step (4) the categories are merged, split, or adjusted by means of pattern matching.
Further, step (3) processes the website logs into sessions as follows:
(31) a page session that fully conforms to the access rules of the simulated browser is identified as normal;
(32) a session that matches a preconfigured pattern is identified as abnormal;
(33) a session in which a user's repeated accesses exceed the configured value is identified as abnormal;
(33) a page session that falls below a configured value is identified as abnormal;
(34) a page session that cannot be marked as normal is identified as abnormal;
(35) the session analysis results are processed through human-computer interaction: the normal/abnormal marking of sessions is modified and session patterns are merged, so that the number of kinds of normal session decreases while the number of sessions they cover increases.
Further, step (5) filters out all normal sessions and displays only the abnormal sessions in the logs; different abnormal sessions are assigned different levels according to the configuration, and anomalies with a similar level and access type are displayed as the same pattern.
When the method of the present invention is used for anomaly analysis of website logs, the volume to be analyzed can be reduced to 1/8 of the original number of log entries at the session analysis stage, and to 1/100 of the original number once normal access sessions have been excluded, which greatly improves the efficiency of anomaly analysis of website logs.
Embodiments
To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the invention is further described below with reference to an example.
In the abnormal session analysis method for website logs provided by the present invention, each user session is treated as an independent, purposeful access unit (for example, access to a page, a picture, or a script). In the initial analysis stage, normal session flows are derived by combining the logs with automated access to the website, and further normal session flows are added continuously during subsequent processing; every session that is not normal is classified as abnormal, and anomalies can be divided into different levels that are displayed and processed separately.
In a specific implementation, the scheme mainly comprises the following three parts:
1) Process sessions: by scanning and analyzing the original website, the session logic in the website logs can be worked out, which yields an accurate session stream for the logs.
2) Enumerate the patterns of all user sessions: the number of session patterns on a real website may be small. For a corporate website, for example, the pages, once consolidated, will not exceed 20. A forum website does not have many session patterns either: they essentially reduce to access to a module and access to a specific page, so its session patterns also number no more than 20.
3) Distinguish abnormal access sessions from normal user sessions: website access made for an abnormal purpose produces sessions that differ from the normal session patterns.
The abnormal session analysis method for website logs therefore first analyzes the sessions in the logs and then analyzes anomalies on the basis of the website's session stream information. It is implemented as follows:
(1) Load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the link information between page/file contents.
In a specific implementation, this step scans each log entry and parses the client IP, access time, access method, accessed page link, client program, server return value, server status, and so on; each URL obtained by the analysis is a page/file access address of the website.
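A minimal sketch of this parsing step is given below. The text names the fields to extract but not a log format, so the sketch assumes the common Apache/Nginx "combined" access-log layout; `LOG_PATTERN`, `LogEntry`, `parse_line`, and `access_addresses` are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout: "combined" access-log format (an assumption, not from the patent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

class LogEntry(NamedTuple):
    ip: str          # client IP
    time: datetime   # access time
    method: str      # access method (GET, POST, ...)
    url: str         # accessed page link
    status: int      # server status code
    agent: str       # client program (User-Agent), empty if not logged

def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the assumed format."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return LogEntry(
        ip=m["ip"],
        time=datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        method=m["method"],
        url=m["url"],
        status=int(m["status"]),
        agent=m["agent"] or "",
    )

def access_addresses(lines):
    """Collect the distinct URLs in the log, i.e. the page/file access addresses."""
    return sorted({e.url for e in map(parse_line, lines) if e is not None})
```

Lines that do not match are returned as `None` rather than dropped silently, so a caller can count or inspect them; for anomaly analysis, unparseable entries may themselves be of interest.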
(2) Simulate a browser and user behavior: start a crawler engine to fetch the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents.
In a specific implementation, this step uses a simulated browser (crawler) to access each distinct URL of the website and analyzes the document object model of the returned page content. If accessing a document object does not cause any referenced access to another object, the access is defined as an atomic access. If accessing a document simultaneously fetches the content of other document objects, the links to those document objects are included with the document to form one atomic access. An atomic access that contains links to several document objects forms a standard session; an atomic access that contains only a single document object access does not form a standard session.
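The distinction between a single-object atomic access and a standard session can be illustrated with a short sketch that inspects fetched HTML. It is an approximation under stated assumptions: a real simulated browser would also execute scripts and follow style-sheet imports, whereas this sketch only reads static markup with Python's standard `html.parser`. The `EMBEDDED` tag list and the function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs assumed to make a browser fetch another object automatically.
# This list is an illustrative assumption; the patent does not enumerate one.
EMBEDDED = {("img", "src"), ("script", "src"), ("link", "href"), ("iframe", "src")}

class EmbeddedCollector(HTMLParser):
    """Collect the URLs of objects that a page pulls in when it is loaded."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embedded = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in EMBEDDED:
                url = urljoin(self.base_url, value)
                if url not in self.embedded:
                    self.embedded.append(url)

def atomic_access(page_url, html):
    """Return the page URL followed by the objects the page references."""
    collector = EmbeddedCollector(page_url)
    collector.feed(html)
    return [page_url] + collector.embedded

def is_standard_session(access):
    """An atomic access with several linked objects forms a standard session."""
    return len(access) > 1
```

Ordinary hyperlinks (`<a href>`) are deliberately excluded: following them requires a user click, so they start a new session rather than extend the current atomic access.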
(3) Using the document object model and page link information generated by the crawler, together with the access information in the logs, perform a second analysis of the website logs. Specifically, the logs are scanned according to the sessions and atomic accesses determined in step (2), converting the website log stream into a session information stream; all standard sessions are defined as normal sessions and all non-standard sessions as abnormal sessions, and each website session is marked as normal or abnormal accordingly. The marking is carried out through the following determinations in sequence:
(31) determine whether the website session is a page session that fully conforms to the access rules of the simulated browser; if so, it is identified as normal;
(32) determine whether the website session matches a preconfigured pattern; if so, it is identified as abnormal;
(33) determine whether a user's repeated accesses within the website session exceed the configured value; if so, it is identified as abnormal;
(33) determine whether the website session falls below the corresponding configured value; if so, the session is identified as abnormal;
(34) a page session that steps (31) to (33) cannot mark as normal is identified as abnormal;
(35) the session analysis results are processed through human-computer interaction: the normal/abnormal marking of sessions is modified and session patterns are merged, so that the number of kinds of normal session decreases while the number of sessions they cover increases.
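The automatic determinations listed above can be sketched as a rule cascade in which the first matching rule decides the label. Several points are assumptions made for illustration: the first-match precedence, the reading of "falls below the configured value" as "fetched too small a fraction of the objects the page requires", and all names (`label_sessions`, `bad_patterns`, `max_repeats`, `min_fraction`). The final human-computer interaction step is not modeled.

```python
import re
from collections import Counter

NORMAL, ABNORMAL = "normal", "abnormal"

def label_sessions(sessions, standard, bad_patterns, max_repeats, min_fraction):
    """Label each (client, urls) session; urls[0] is the page, the rest its objects.

    standard maps a page URL to the set of objects the crawler saw it load.
    Returns a list of (label, reason) pairs in input order.
    """
    # How often each client opened a session on the same page.
    repeats = Counter((client, urls[0]) for client, urls in sessions)
    results = []
    for client, urls in sessions:
        page, fetched = urls[0], set(urls[1:])
        expected = standard.get(page)
        if expected is not None and fetched == expected:
            # Fully conforms to what the simulated browser observed.
            results.append((NORMAL, "standard"))
        elif any(p.search(u) for p in bad_patterns for u in urls):
            # Matches a preconfigured (known-bad) pattern.
            results.append((ABNORMAL, "pattern"))
        elif repeats[(client, page)] > max_repeats:
            # Same client repeats the same access more often than configured.
            results.append((ABNORMAL, "repeat"))
        elif expected and len(fetched & expected) < min_fraction * len(expected):
            # Page fetched with too few of its embedded objects (assumed reading).
            results.append((ABNORMAL, "incomplete"))
        else:
            # Anything that cannot be marked normal is abnormal.
            results.append((ABNORMAL, "unmatched"))
    return results
```

Returning a reason alongside the label keeps the information needed later to grade anomalies into levels and to let an analyst review and re-mark sessions.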
(4) Classify the abnormal and normal sessions by URL pattern, access mode, returned result, and so on, and feed the result back to the user; the user can modify the abnormal/normal attribute of a session and can merge, split, or adjust the categories by means of pattern matching.
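One way to picture this classification is to reduce each URL to a pattern and count sessions per (pattern, method, status, label) category, then apply the user's corrections to whole categories at once. The normalization rule used here (digit runs in the path become `{n}`, query values are dropped) and the names `url_pattern`, `categorize`, and `apply_overrides` are assumptions for illustration; the text does not specify how URL patterns are formed.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def url_pattern(url):
    """Reduce a URL to a pattern: digit runs in the path become {n}, query values are dropped."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return path + ("?" + "&".join(keys) if keys else "")

def categorize(sessions):
    """sessions: iterable of (url, method, status, label). Returns a Counter keyed by category."""
    return Counter((url_pattern(u), m, s, label) for u, m, s, label in sessions)

def apply_overrides(categories, overrides):
    """Re-label whole categories.

    overrides: {(pattern, method, status): new_label}, as a user might supply after review.
    Categories that end up with the same key are merged.
    """
    merged = Counter()
    for (pattern, method, status, label), n in categories.items():
        label = overrides.get((pattern, method, status), label)
        merged[(pattern, method, status, label)] += n
    return merged
```

Because an override applies to a pattern rather than to a single session, one user decision can re-mark many sessions, which is how the set of normal session patterns grows over time.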
(5) Process the log sessions according to the user's adjustments and output all abnormal session streams; anomalies can be divided into different levels according to the built-in configuration for display and further processing.
In a specific implementation, this step filters out all normal sessions and displays only the abnormal sessions in the logs; different abnormal sessions are assigned different levels according to the configuration, and anomalies with a similar level and access type are displayed as the same pattern.
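The filtering and grading in this step might look like the following sketch. The level table `LEVELS`, the default level, and the dictionary keys used for each session (`label`, `reason`, `access_type`, `url`) are hypothetical; the text only says that levels come from a built-in configuration.

```python
from collections import defaultdict

# Hypothetical configuration: anomaly reason -> severity level (higher is more severe).
LEVELS = {"pattern": 3, "repeat": 2, "incomplete": 1}
DEFAULT_LEVEL = 1

def abnormal_report(sessions, levels=LEVELS):
    """Drop normal sessions and group the rest by (level, access type).

    sessions: iterable of dicts with keys label, reason, access_type, url.
    Returns a list of ((level, access_type), [urls]) sorted by descending level.
    """
    groups = defaultdict(list)
    for s in sessions:
        if s["label"] == "normal":
            continue  # only abnormal sessions are displayed
        level = levels.get(s["reason"], DEFAULT_LEVEL)
        groups[(level, s["access_type"])].append(s["url"])
    return sorted(groups.items(), key=lambda kv: (-kv[0][0], kv[0][1]))
```

Sorting by descending level puts the most severe group first, so an analyst reading the report top to bottom sees likely attacks before low-grade noise.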
The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims (8)

1. An abnormal session analysis method for website logs, characterized in that each user session is treated as an independent, purposeful access unit; in the initial analysis stage, normal session flows are derived by combining the logs with automated access to the website, and further normal session flows are added continuously during subsequent processing; every session that is not normal is classified as abnormal, and anomalies can be divided into different levels that are displayed and processed separately.
2. The abnormal session analysis method for website logs according to claim 1, characterized in that the method is implemented in the following steps:
(1) load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the page/file access addresses;
(2) simulate a browser and user behavior: start a crawler engine to fetch the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents;
(3) using the document object model and page link information generated by the crawler, together with the access information in the logs, perform a second analysis of the website logs, generate a preliminary session information stream, and mark each website session as normal or abnormal;
(4) classify the abnormal and normal sessions by URL pattern, access mode, returned result, and so on, and feed the result back to the user; the user can modify the abnormal/normal attribute of a session and can merge, split, or adjust the categories;
(5) process the log sessions according to the user's adjustments and output all abnormal session streams; anomalies can be divided into different levels according to the built-in configuration for display and further processing.
3. The abnormal session analysis method for website logs according to claim 2, characterized in that step (1) scans each log entry and parses the client IP, access time, access method, accessed page link, client program, server return value, server status, and so on; each accessed page link obtained by the analysis is a page/file access address of the website.
4. The abnormal session analysis method for website logs according to claim 2, characterized in that in step (2) a simulated browser accesses each distinct URL of the website and the document object model of the returned page content is analyzed; if accessing a document object does not cause any referenced access to another object, the access is defined as an atomic access; if accessing a document simultaneously fetches the content of other document objects, the links to those document objects are included with the document to form one atomic access; an atomic access that contains links to several document objects forms a standard session, and an atomic access that contains only a single document object access does not form a standard session.
5. The abnormal session analysis method for website logs according to claim 2 or 4, characterized in that step (3) scans the logs according to the sessions and atomic accesses determined in step (2), converting the website log stream into a session information stream; all standard sessions are defined as normal sessions, and all non-standard sessions are defined as abnormal sessions.
6. The abnormal session analysis method for website logs according to claim 5, characterized in that step (3) processes the website logs into sessions as follows:
(31) a page session that fully conforms to the access rules of the simulated browser is identified as normal;
(32) a session that matches a preconfigured pattern is identified as abnormal;
(33) a session in which a user's repeated accesses exceed the configured value is identified as abnormal;
(33) a page session that falls below a configured value is identified as abnormal;
(34) a page session that cannot be marked as normal is identified as abnormal;
(35) the session analysis results are processed through human-computer interaction: the normal/abnormal marking of sessions is modified and session patterns are merged, so that the number of kinds of normal session decreases while the number of sessions they cover increases.
7. The abnormal session analysis method for website logs according to claim 2, characterized in that in step (4) the categories are merged, split, or adjusted by means of pattern matching.
8. The abnormal session analysis method for website logs according to claim 2, characterized in that step (5) filters out all normal sessions and displays only the abnormal sessions in the logs; different abnormal sessions are assigned different levels according to the configuration, and anomalies with a similar level and access type are displayed as the same pattern.
CN201310303384.XA 2013-07-18 2013-07-18 Abnormal session analyzing method for website logs Active CN103401849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310303384.XA CN103401849B (en) 2013-07-18 2013-07-18 Abnormal session analyzing method for website logs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310303384.XA CN103401849B (en) 2013-07-18 2013-07-18 Abnormal session analyzing method for website logs

Publications (2)

Publication Number Publication Date
CN103401849A (en) 2013-11-20
CN103401849B (en) 2017-02-15

Family

ID=49565375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310303384.XA Active CN103401849B (en) 2013-07-18 2013-07-18 Abnormal session analyzing method for website logs

Country Status (1)

Country Link
CN (1) CN103401849B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101232399A (en) * 2008-02-18 2008-07-30 刘峰 Analytical method of website abnormal visit
CN101242307A (en) * 2008-02-01 2008-08-13 刘峰 Website access analysis system and method based on built-in code proxy log
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
US20100064281A1 (en) * 2008-07-22 2010-03-11 Kimball Dean C Method and system for web-site testing
CN103178982A (en) * 2011-12-23 2013-06-26 阿里巴巴集团控股有限公司 Method and device for analyzing log
EP2610776A2 (en) * 2011-09-16 2013-07-03 Veracode, Inc. Automated behavioural and static analysis using an instrumented sandbox and machine learning classification for mobile security

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105262720A (en) * 2015-09-07 2016-01-20 深信服网络科技(深圳)有限公司 Web robot traffic identification method and device
CN106649312A (en) * 2015-10-29 2017-05-10 北京北方微电子基地设备工艺研究中心有限责任公司 Log file analysis method and system
CN106649312B (en) * 2015-10-29 2019-10-29 北京北方华创微电子装备有限公司 The analysis method and system of journal file
CN108027839A (en) * 2016-03-21 2018-05-11 谷歌有限责任公司 System and method for identifying non-standard session
CN108027839B (en) * 2016-03-21 2023-09-15 谷歌有限责任公司 System and method for identifying non-canonical sessions
CN108304410A (en) * 2017-01-13 2018-07-20 阿里巴巴集团控股有限公司 A kind of detection method, device and the data analysing method of the abnormal access page
CN107204991A (en) * 2017-07-06 2017-09-26 深信服科技股份有限公司 A kind of server exception detection method and system
CN107590227A (en) * 2017-09-05 2018-01-16 成都知道创宇信息技术有限公司 A kind of log analysis method of combination reptile
CN107483507B (en) * 2017-09-30 2020-11-13 北京东土军悦科技有限公司 Session analysis method, device and storage medium
CN109241733A (en) * 2018-08-07 2019-01-18 北京神州绿盟信息安全科技股份有限公司 Crawler Activity recognition method and device based on web access log
CN111224963A (en) * 2019-12-30 2020-06-02 北京安码科技有限公司 Network shooting range task duplication method, system, electronic equipment and storage medium
CN111224823B (en) * 2020-01-06 2022-08-16 杭州数群科技有限公司 Method based on different network log analysis
CN111224823A (en) * 2020-01-06 2020-06-02 杭州数群科技有限公司 Method based on different network log analysis

Also Published As

Publication number Publication date
CN103401849B (en) 2017-02-15

Similar Documents

Publication Publication Date Title
CN103401849A (en) Abnormal session analyzing method for website logs
US9424319B2 (en) Social media based content selection system
US8286248B1 (en) System and method of web application discovery via capture and analysis of HTTP requests for external resources
CN105844140A (en) Website login brute force crack method and system capable of identifying verification code
US11537745B2 (en) Deep learning-based detection and data loss prevention of image-borne sensitive documents
CN103888490A (en) Automatic WEB client man-machine identification method
CN102436564A (en) Method and device for identifying falsified webpage
CN108768921B (en) Malicious webpage discovery method and system based on feature detection
CN102486799B (en) World wide web (WWW) page processing method and device
US20210383518A1 (en) Training and configuration of dl stack to detect attempted exfiltration of sensitive screenshot-borne data
AU2014400621B2 (en) System and method for providing contextual analytics data
Samtani et al. Identifying SCADA systems and their vulnerabilities on the internet of things: A text-mining approach
US20140331142A1 (en) Method and system for recommending contents
CN102006174B (en) Data processing method and device based on online behavior of mobile phone user
KR101005866B1 (en) Method And A system of Advanced Web Log Preprocess Algorithm for Rule Based Web IDS System
US20210383159A1 (en) Deep learning stack used in production to prevent exfiltration of image-borne identification documents
CN102185830B (en) A kind of method and system of security filtration of network television browser
CN105635064A (en) CSRF attack detection method and device
CN103488947A (en) Method and device for identifying instant messaging client-side account number stealing Trojan horse program
CN114244564A (en) Attack defense method, device, equipment and readable storage medium
CN105262720A (en) Web robot traffic identification method and device
CN110442582B (en) Scene detection method, device, equipment and medium
Ham et al. Big Data Preprocessing Mechanism for Analytics of Mobile Web Log.
CN104811418A (en) Virus detection method and apparatus
CN111353116B (en) Content detection method, system and device, client device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant