CN103401849B - Abnormal session analyzing method for website logs - Google Patents

Info

Publication number
CN103401849B
Authority
CN
China
Prior art keywords
session
access
page
exception
conversation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310303384.XA
Other languages
Chinese (zh)
Other versions
CN103401849A (en)
Inventor
陆道宏
汤伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rock Software (shanghai) Co Ltd
Original Assignee
Rock Software (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-07-18
Filing date
2013-07-18
Publication date
2017-02-15
Application filed by Rock Software (shanghai) Co Ltd
Priority to CN201310303384.XA
Publication of CN103401849A
Application granted
Publication of CN103401849B

Landscapes

  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses an abnormal session analysis method for website logs. The method treats a user session as an independent, purposeful access unit (for example, access to a page, a picture, or a script). In the initial stage of analysis, the normal session flow is derived by automatically accessing the website in combination with the logs. Normal session streams are continuously added during subsequent processing, and every session other than the normal ones is classified as abnormal. The abnormal sessions are divided into different levels, which are displayed and processed separately. By starting from anomaly analysis, the method can greatly improve the efficiency of anomaly analysis of website logs without missing any abnormal access session, which makes it particularly valuable for security/forensic applications that analyze hacker attacks.

Description

An abnormal session analysis method for website logs
Technical field
The present invention relates to techniques for analyzing and processing website data, and in particular to the analysis of website logs.
Background
Website log analysis has many different applications, and the requirements and processing methods for log analysis differ according to the purpose of the application. Research on web data mining aimed at analyzing system performance mostly uses statistical methods; data mining aimed at improving system design mostly uses association rule mining; and data mining aimed at understanding user intent mostly uses clustering and classification. All of these areas involve techniques such as data cleaning, session identification, and user identification, but because the application requirements differ, the ways in which website logs are analyzed and processed differ considerably.
Abnormal use of website data includes abnormal access purposes, abnormal access modes, abnormal access behavior, and abnormal access tools. A website provides content and application information, and a normal user uses a browser to access the page content on the website in order to solve a certain class of problem. An abnormal user's purpose is to obtain different information and data, so the access mode, access behavior, and access tools all differ. When abnormal session analysis of website data is applied to forensic analysis, the processing of website logs must be much finer: not even the slightest anomaly may be missed, so that the attack or the source of the website anomaly can ultimately be found.
Data cleaning refers to removing irrelevant data from log records. Traditional web analytics is concerned with the page links that users access; the pictures, display styles, scripts, and so on that a page includes are regarded as part of the page, and these irrelevant items are deleted outright during data cleaning. For anomaly analysis, no data is irrelevant: the number of items each page includes is ultimately fixed, and a session containing fewer or more items than this fixed number should be regarded as abnormal.
A user session is one effective access by a user to the server. The user session that a normal website expects is one in which the user clicks a link and the website sends the user all the data content related to that link. Traditional session analysis may cover the links among the multiple pages visited during one visit to the website and uses a simple session determination method, in which the log entries remaining after data cleaning constitute the content of the session.
Traditional website log analysis focuses on the normal access patterns of a website, for example load optimization analysis and user pattern analysis. This kind of analysis has to process a very large number of log entries, so using it for anomaly analysis of website logs is very inefficient.
Furthermore, the methods used in traditional website log analysis are not suited to security/forensic applications that analyze hacker attacks. In hacking cases, the hacker's activities on the website include reconnaissance, scanning, launching attacks, uploading files, privilege escalation, taking control, denial-of-service attacks, and so on, and these access behaviors differ greatly from a normal user's access mode, access behavior, and access tools. Traditional website log analysis methods are therefore not suitable for security/forensic analysis of hacker attacks.
Summary of the invention
The present invention addresses the problems that website log analysis is inefficient and unsuited to security/forensic analysis of hacker attacks, and provides an abnormal session analysis method for website logs. The method not only greatly improves the efficiency of anomaly analysis of website logs but is also useful for security/forensic applications that analyze hacker attacks.
To achieve the above object, the present invention adopts the following technical solution:
An abnormal session analysis method for website logs, in which a user session is treated as an independent, purposeful access unit. In the initial stage of analysis, the normal session flow is derived by automatically accessing the website in combination with the logs, and normal session streams are continuously added during subsequent processing. Every session other than the normal ones is classified as abnormal, and the abnormal sessions can be divided into different levels that are displayed and processed separately.
In a preferred embodiment of the present invention, the analysis method is implemented in the following steps:
(1) Load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the page/file access address information;
(2) Simulate a browser and user behavior, start a crawler engine to fetch the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents;
(3) Combine the document object model and page link information generated by the crawler with the access information in the logs, perform a second analysis of the website logs, generate a preliminary session information stream, and at the same time label the website sessions as normal or abnormal;
(4) Classify the abnormal and normal sessions by URL pattern, access mode, returned result, and so on, and feed the result back to the user; the user can modify the abnormal/normal attribute of a session and can merge, split, or adjust the classification;
(5) Process the log sessions according to the user's adjustments and output all abnormal session streams; the abnormal sessions are divided into different levels according to the built-in configuration for display and further processing.
Further, step (1) scans each log entry and parses its client IP, access time, access method, accessed page link, client program, server return value, server status, and so on; each accessed page link (URL) obtained by the analysis is a page/file access address of the website.
Further, in step (2) the simulated browser accesses each distinct URL of the website and analyzes the document object model of the returned page content. If access to a document object does not cause referenced accesses to other objects, it is determined to be one atomic access. If accessing a document also retrieves the content of other document objects, the links to those other document objects are included in the document, which then forms one atomic access. An atomic access that contains links to multiple document objects constitutes a standard session; an atomic access that contains only a single document object access does not constitute a standard session.
Further, step (3) scans the logs according to the sessions and atomic accesses determined in step (2) and converts the website log stream into a session information stream; all standard sessions are determined to be normal sessions, and all non-normal sessions are determined to be abnormal sessions.
Further, in step (4) the merging, splitting, and adjustment of the classification are performed by pattern matching.
Further, step (3) performs session processing on the website logs as follows:
(31) A page session that fully complies with the simulated browser's access rules is identified as normal;
(32) A session that matches a preconfigured pattern is identified as abnormal;
(33) A session in which the user's repeated accesses exceed the configured value is identified as abnormal;
(33) A page session that falls below the configured value is identified as abnormal;
(34) A session that cannot be labeled as a normal page session is identified as abnormal;
(35) The session analysis results are processed through human-computer interaction: the normal/abnormal labels of sessions are changed and session patterns are merged, so that the number of normal session types decreases while the number of normal sessions increases.
Further, step (5) filters out all normal sessions and displays only the abnormal sessions in the logs; abnormal sessions are assigned different levels according to the configuration, and anomalies with similar level and access type are displayed as the same pattern.
When the method of the present invention is used for anomaly analysis of website logs, the amount of analysis can be reduced in the session analysis stage to 1/8 of the original number of log entries. After the normal access sessions are excluded, the scale of log analysis can be reduced to 1/100 of the original number of log entries, which greatly improves the efficiency of anomaly analysis of website logs.
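As a rough illustration of the ratios stated above, the following sketch applies them to a log of a given size. The function name `estimated_workload` and the figure of 1,000,000 entries are assumptions chosen for the example, not values from the patent.

```python
def estimated_workload(log_entries: int) -> dict:
    """Apply the reduction ratios stated in the text to a log of a given size."""
    return {
        "raw_log_entries": log_entries,
        # Session analysis stage: about 1/8 of the original entry count.
        "after_session_analysis": log_entries // 8,
        # After normal sessions are excluded: about 1/100 of the original count.
        "after_excluding_normal": log_entries // 100,
    }


# Hypothetical log size used only for illustration.
workload = estimated_workload(1_000_000)
```

Under these assumptions, a log of 1,000,000 entries would leave about 125,000 units after session analysis and about 10,000 once normal sessions are excluded.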
Detailed description of the embodiments
To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the invention is further described below with reference to an example.
The abnormal session analysis method for website logs provided by the present invention treats a user session as an independent, purposeful access unit (for example, access to a page, a picture, or a script). In the initial stage of analysis, the normal session flow is derived by automatically accessing the website in combination with the logs, and normal session streams are continuously added during subsequent processing. Every session other than the normal ones is classified as abnormal, and the abnormal sessions can be divided into different levels that are displayed and processed separately.
In a specific implementation, the scheme mainly comprises the following three parts:
1) Processing sessions: by scanning and analyzing the original website, the session logic in the website logs can be found, yielding an accurate session stream for the website logs.
2) Enumerating the patterns of all user sessions: the number of session patterns on an actual website is likely to be small. A company website, for example, has no more than 20 pages after consolidation. A forum website does not have many sessions either; they can basically be classified into accesses to modules and accesses to specific pages, so its session patterns also number within 20.
3) Distinguishing abnormal access sessions from normal user sessions: for website access behavior with an abnormal purpose, the sessions differ from the normal session patterns.
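The first part above, turning raw log entries into a session stream, might be sketched as follows. This is a minimal illustration under stated assumptions: the function `sessionize`, the `page_model` mapping (page URL to the set of objects that page embeds), and the sample addresses are all hypothetical, and the patent does not prescribe this particular grouping logic.

```python
def sessionize(entries, page_model):
    """Group log requests into sessions using a crawler-derived page model.

    entries: iterable of (client_ip, url) pairs in log order.
    page_model: dict mapping a page URL to the set of objects it embeds.
    """
    open_sessions = {}  # client_ip -> most recent page session of that client
    stream = []
    for client_ip, url in entries:
        current = open_sessions.get(client_ip)
        if url in page_model:
            # A known page starts a new session for this client.
            current = {"client_ip": client_ip, "page": url, "requests": [url]}
            open_sessions[client_ip] = current
            stream.append(current)
        elif current is not None and url in page_model[current["page"]]:
            # An embedded object joins the session of the page that owns it.
            current["requests"].append(url)
        else:
            # A request that fits no known page is kept as its own session.
            stream.append({"client_ip": client_ip, "page": None, "requests": [url]})
    return stream


page_model = {"/index.html": {"/style.css", "/logo.png"}}
entries = [
    ("203.0.113.7", "/index.html"),
    ("203.0.113.7", "/style.css"),
    ("198.51.100.9", "/admin.php"),
    ("203.0.113.7", "/logo.png"),
]
session_stream = sessionize(entries, page_model)
```

Keeping the unmatched request as a separate session, rather than discarding it, follows the text's point that no log data is irrelevant for anomaly analysis.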
Accordingly, the abnormal session analysis method first analyzes the sessions in the website logs and then analyzes website anomalies on the basis of the session stream information. The specific implementation is as follows:
(1) Load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the link information between page/file contents.
In a specific implementation, this step scans each log entry and parses its client IP, access time, access method, accessed page link, client program, server return value, server status, and so on; each URL obtained by the analysis is a page/file access address of the website.
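The parsing in step (1) could look like the sketch below. The patent does not name a log format, so the code assumes the widely used "combined" access-log layout; `LogEntry`, `LOG_PATTERN`, `parse_line`, and the sample line are illustrative names and data, not part of the patent.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    client_ip: str
    time: datetime
    method: str
    url: str
    status: int
    user_agent: str


# Assumed "combined" access-log layout; the patent does not specify a format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return LogEntry(
        client_ip=match["ip"],
        time=datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
        method=match["method"],
        url=match["url"],
        status=int(match["status"]),
        user_agent=match["agent"],
    )


# Hypothetical log line used only for illustration.
sample = (
    '203.0.113.7 - - [18/Jul/2013:10:15:32 +0800] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"'
)
entry = parse_line(sample)
```

The set of distinct `url` values collected this way would serve as the page/file access addresses that step (2) hands to the crawler.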
(2) Simulate a browser and user behavior, start a crawler engine to fetch the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents.
In a specific implementation, the simulated browser (crawler) accesses each distinct URL of the website and analyzes the document object model of the returned page content. If access to a document object does not cause referenced accesses to other objects, it is determined to be one atomic access. If accessing a document also retrieves the content of other document objects, the links to those other document objects are included in the document, which then forms one atomic access. An atomic access that contains links to multiple document objects constitutes a standard session; an atomic access that contains only a single document object access does not constitute a standard session.
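The distinction between an atomic access and a standard session can be illustrated with Python's standard `html.parser`. This is only a sketch: it works on HTML passed in as a string instead of crawling a live site, and it assumes that `img`, `script`, and `link` tags are the ones that trigger automatic sub-requests. `ResourceCollector` and `build_atomic_access` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect the URLs a browser would fetch automatically for one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Assumption: only these tags cause referenced accesses to other objects.
        if tag in ("img", "script") and attributes.get("src"):
            self.resources.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            self.resources.append(urljoin(self.base_url, attributes["href"]))


def build_atomic_access(page_url, html):
    """Describe one atomic access and whether it forms a standard session."""
    collector = ResourceCollector(page_url)
    collector.feed(html)
    return {
        "page": page_url,
        "resources": collector.resources,
        # Links to further document objects make this a standard session.
        "is_standard_session": len(collector.resources) > 0,
    }


page_html = (
    '<html><head><link rel="stylesheet" href="/style.css">'
    '<script src="/app.js"></script></head>'
    '<body><img src="logo.png"></body></html>'
)
page_access = build_atomic_access("http://example.com/index.html", page_html)
plain_access = build_atomic_access("http://example.com/robots.txt", "User-agent: *")
```

In this example the HTML page yields a standard session with three linked objects, while the plain text file is a single-object atomic access and therefore not a standard session.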
(3) Combine the document object model and page link information generated by the crawler with the access information in the logs, and perform a second analysis of the website logs. Specifically, the logs are scanned according to the sessions and atomic accesses determined in step (2), and the website log stream is converted into a session information stream. All standard sessions are determined to be normal sessions and all non-normal sessions are determined to be abnormal sessions, and the website sessions are labeled normal or abnormal at the same time. The labeling is carried out through the following judgment steps in sequence:
(31) Judge whether the website session is a page session that fully complies with the simulated browser's access rules; if so, identify it as normal;
(32) Judge whether the website session matches a preconfigured pattern; if so, identify it as abnormal;
(33) Judge whether the repeated accesses of a session user in the website session exceed the configured value; if so, identify it as abnormal;
(33) Judge whether the website session falls below the corresponding configured value; if so, identify the session as abnormal;
(34) A session that, after steps (31) to (33), cannot be labeled as a normal page session is identified as abnormal;
(35) Process the session analysis results through human-computer interaction: change the normal/abnormal labels of sessions and merge session patterns, so that the number of normal session types decreases while the number of normal sessions increases.
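The automatic judgment steps (31) to (34) can be sketched as a single function. Several details here are assumptions for illustration only: the session is represented as a list of requested URLs, the "configured value" for the lower bound is taken to be the number of objects the page comprises, `max_repeats` defaults to 3, and the two regular expressions stand in for a preconfigured pattern list. Step (35), the manual review, is not modeled.

```python
import re
from collections import Counter


def label_session(requests, expected, abnormal_patterns, max_repeats=3):
    """Label one session as normal or abnormal and name the rule that fired.

    requests: URLs requested in the session, in log order.
    expected: URLs that one simulated browser visit to the page fetches.
    abnormal_patterns: compiled regular expressions for known-bad requests.
    max_repeats: assumed configured limit on repeated requests for one URL.
    """
    counts = Counter(requests)
    repeats = max(counts.values(), default=0)
    # (31) The session matches the simulated browser access exactly.
    if set(requests) == set(expected) and repeats == 1:
        return "normal", "31"
    # (32) A request matches a preconfigured abnormal pattern.
    if any(p.search(url) for p in abnormal_patterns for url in requests):
        return "abnormal", "32"
    # (33) Some URL is requested more often than the configured value.
    if repeats > max_repeats:
        return "abnormal", "33-repeat"
    # (33) The session fetched fewer objects than the page comprises.
    if len(counts) < len(set(expected)):
        return "abnormal", "33-incomplete"
    # (34) Nothing allowed the session to be labeled normal.
    return "abnormal", "34"


page_objects = ["/index.html", "/style.css", "/logo.png"]
patterns = [re.compile(r"\.\./"), re.compile(r"union.+select", re.IGNORECASE)]

results = [
    label_session(page_objects, page_objects, patterns),
    label_session(["/download?file=../../etc/passwd"], page_objects, patterns),
    label_session(["/login.php"] * 10, page_objects, patterns),
    label_session(["/index.html"], page_objects, patterns),
    label_session(page_objects + ["/admin.php"], page_objects, patterns),
]
```

Returning the rule identifier along with the label keeps the reason for each decision available to the later classification and grading steps.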
(4) Classify the abnormal and normal sessions by URL pattern, access mode, returned result, and so on, and feed the result back to the user; the user can modify the abnormal/normal attribute of a session and can merge, split, or adjust the classification by pattern matching.
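One way to picture step (4) is to group sessions under a key built from label, access method, generalized URL pattern, and return status, and then let a reviewer relabel a whole group. The key layout, the rule of replacing digit runs with `{n}`, and the names `url_pattern`, `classify`, and `relabel` are assumptions made for this sketch.

```python
import re
from collections import defaultdict


def url_pattern(url):
    """Generalize a URL: drop the query string and replace digit runs."""
    path = url.split("?", 1)[0]
    return re.sub(r"\d+", "{n}", path)


def classify(sessions):
    """Group session ids by (label, method, URL pattern, status)."""
    groups = defaultdict(list)
    for session in sessions:
        key = (
            session["label"],
            session["method"],
            url_pattern(session["url"]),
            session["status"],
        )
        groups[key].append(session["id"])
    return dict(groups)


def relabel(groups, key, new_label):
    """Apply a reviewer's decision: move a whole group to another label."""
    ids = groups.pop(key)
    new_key = (new_label,) + key[1:]
    groups.setdefault(new_key, []).extend(ids)
    return groups


# Hypothetical sessions used only for illustration.
sessions = [
    {"id": 1, "label": "abnormal", "method": "GET",
     "url": "/forum/thread/101?page=2", "status": 404},
    {"id": 2, "label": "abnormal", "method": "GET",
     "url": "/forum/thread/205", "status": 404},
    {"id": 3, "label": "normal", "method": "GET",
     "url": "/index.html", "status": 200},
]
groups = classify(sessions)
reviewed = relabel(dict(groups), ("abnormal", "GET", "/forum/thread/{n}", 404), "normal")
```

Grouping by pattern means the reviewer decides once per pattern instead of once per session, which is how the number of session types can shrink while the number of sessions labeled normal grows.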
(5) Process the log sessions according to the user's adjustments and output all abnormal session streams; the abnormal sessions are divided into different levels according to the built-in configuration for display and further processing.
In a specific implementation, this step filters out all normal sessions and displays only the abnormal sessions in the logs; abnormal sessions are assigned different levels according to the configuration, and anomalies with similar level and access type are displayed as the same pattern.
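A minimal sketch of step (5) follows. The grade names, the `GRADE_BY_REASON` mapping, and the use of the HTTP method as the "access type" are assumptions; the patent only states that levels come from a built-in configuration.

```python
from collections import defaultdict

# Assumed built-in configuration: which rule produces which level.
GRADE_BY_REASON = {
    "32": "high",             # matched a preconfigured abnormal pattern
    "33-repeat": "medium",    # repeated access above the configured value
    "33-incomplete": "low",   # fewer objects than the page comprises
    "34": "low",              # could not be labeled normal
}


def abnormal_report(sessions, grade_by_reason=GRADE_BY_REASON):
    """Drop normal sessions and group the rest by (level, access type)."""
    report = defaultdict(list)
    for session in sessions:
        if session["label"] == "normal":
            continue
        grade = grade_by_reason.get(session["reason"], "low")
        # Sessions with the same level and access type share one display row.
        report[(grade, session["method"])].append(session["id"])
    return dict(report)


# Hypothetical labeled sessions used only for illustration.
labeled = [
    {"id": 1, "label": "normal", "reason": "31", "method": "GET"},
    {"id": 2, "label": "abnormal", "reason": "32", "method": "GET"},
    {"id": 3, "label": "abnormal", "reason": "33-repeat", "method": "POST"},
    {"id": 4, "label": "abnormal", "reason": "32", "method": "GET"},
]
report = abnormal_report(labeled)
```

Only the abnormal sessions reach the report, so the reviewer sees the reduced set that the method is designed to produce.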
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited by the above embodiment; the above embodiment and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims (7)

1. An abnormal session analysis method for website logs, characterized in that the analysis method treats a user session as an independent, purposeful access unit; in the initial stage of analysis, the normal session flow is derived by automatically accessing the website in combination with the logs, and normal session streams are continuously added during subsequent processing; every session other than the normal ones is classified as abnormal, and the abnormal sessions can be divided into different levels that are displayed and processed separately; the analysis method is implemented in the following steps:
(1) loading the website logs and, through an initial analysis of the logs, obtaining the entry points of website access and the page/file access address information;
(2) simulating a browser and user behavior, starting a crawler engine to fetch the pages of the website, analyzing the structure of each page, and generating the page document object model and the link information between page/file contents;
(3) combining the document object model and page link information generated by the crawler with the access information in the logs, performing a second analysis of the website logs, generating a preliminary session information stream, and at the same time labeling the website sessions as normal or abnormal;
(4) classifying the abnormal and normal sessions by URL pattern, access mode, and returned result, and feeding the result back to the user, the user modifying the abnormal/normal attribute of a session and being able to merge, split, or adjust the classification;
(5) processing the log sessions according to the user's adjustments and outputting all abnormal session streams, the abnormal sessions being divided into different levels according to the built-in configuration for display and further processing.
2. The abnormal session analysis method for website logs according to claim 1, characterized in that step (1) scans each log entry and parses its client IP, access time, access method, accessed page link, client program, server return value, and server status; each accessed page link obtained by the analysis is a page/file access address of the website.
3. The abnormal session analysis method for website logs according to claim 1, characterized in that in step (2) the simulated browser accesses each distinct URL of the website and analyzes the document object model of the returned page content; if access to a document object does not cause referenced accesses to other objects, it is determined to be one atomic access; if accessing a document also retrieves the content of other document objects, the links to those other document objects are included in the document, which then forms one atomic access; an atomic access that contains links to multiple document objects constitutes a standard session, and an atomic access that contains only a single document object access does not constitute a standard session.
4. The abnormal session analysis method for website logs according to claim 3, characterized in that step (3) scans the logs according to the sessions and atomic accesses determined in step (2) and converts the website log stream into a session information stream; all standard sessions are determined to be normal sessions, and all non-normal sessions are determined to be abnormal sessions.
5. The abnormal session analysis method for website logs according to claim 4, characterized in that step (3) performs session processing on the website logs as follows:
(31) a page session that fully complies with the simulated browser's access rules is identified as normal;
(32) a session that matches a preconfigured pattern is identified as abnormal;
(33) a session in which the user's repeated accesses exceed the configured value is identified as abnormal;
(33) a page session that falls below the configured value is identified as abnormal;
(34) a session that cannot be labeled as a normal page session is identified as abnormal;
(35) the session analysis results are processed through human-computer interaction: the normal/abnormal labels of sessions are changed and session patterns are merged, so that the number of normal session types decreases while the number of normal sessions increases.
6. The abnormal session analysis method for website logs according to claim 1, characterized in that in step (4) the merging, splitting, and adjustment of the classification are performed by pattern matching.
7. The abnormal session analysis method for website logs according to claim 1, characterized in that step (5) filters out all normal sessions and displays only the abnormal sessions in the logs; abnormal sessions are assigned different levels according to the configuration, and anomalies with similar level and access type are displayed as the same pattern.
CN201310303384.XA 2013-07-18 2013-07-18 Abnormal session analyzing method for website logs Active CN103401849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310303384.XA CN103401849B (en) 2013-07-18 2013-07-18 Abnormal session analyzing method for website logs

Publications (2)

Publication Number Publication Date
CN103401849A CN103401849A (en) 2013-11-20
CN103401849B true CN103401849B (en) 2017-02-15

Family

ID=49565375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310303384.XA Active CN103401849B (en) 2013-07-18 2013-07-18 Abnormal session analyzing method for website logs

Country Status (1)

Country Link
CN (1) CN103401849B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105262720A (en) * 2015-09-07 2016-01-20 深信服网络科技(深圳)有限公司 Web robot traffic identification method and device
CN106649312B (en) * 2015-10-29 2019-10-29 北京北方华创微电子装备有限公司 The analysis method and system of journal file
US9872072B2 (en) * 2016-03-21 2018-01-16 Google Llc Systems and methods for identifying non-canonical sessions
CN114417197A (en) * 2017-01-13 2022-04-29 阿里巴巴集团控股有限公司 Access record processing method and device and storage medium
CN107204991A (en) * 2017-07-06 2017-09-26 深信服科技股份有限公司 A kind of server exception detection method and system
CN107590227A (en) * 2017-09-05 2018-01-16 成都知道创宇信息技术有限公司 A kind of log analysis method of combination reptile
CN107483507B (en) * 2017-09-30 2020-11-13 北京东土军悦科技有限公司 Session analysis method, device and storage medium
CN109241733A (en) * 2018-08-07 2019-01-18 北京神州绿盟信息安全科技股份有限公司 Crawler Activity recognition method and device based on web access log
CN111224963A (en) * 2019-12-30 2020-06-02 北京安码科技有限公司 Network shooting range task duplication method, system, electronic equipment and storage medium
CN111224823B (en) * 2020-01-06 2022-08-16 杭州数群科技有限公司 Method based on different network log analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101232399A (en) * 2008-02-18 2008-07-30 刘峰 Analytical method of website abnormal visit
CN101242307A (en) * 2008-02-01 2008-08-13 刘峰 Website access analysis system and method based on built-in code proxy log
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
CN103178982A (en) * 2011-12-23 2013-06-26 阿里巴巴集团控股有限公司 Method and device for analyzing log
EP2610776A2 (en) * 2011-09-16 2013-07-03 Veracode, Inc. Automated behavioural and static analysis using an instrumented sandbox and machine learning classification for mobile security

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010011792A2 (en) * 2008-07-22 2010-01-28 Widemile Inc. Method and system for web-site testing

Also Published As

Publication number Publication date
CN103401849A (en) 2013-11-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant