CN103401849B - Abnormal session analyzing method for website logs - Google Patents
- Publication number: CN103401849B
- Application number: CN201310303384.XA
- Authority: CN (China)
- Prior art keywords: session, access, page, exception, conversation
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
- Landscapes: Debugging And Monitoring; Computer And Data Communications
Abstract
The invention discloses an abnormal session analysis method for website logs. The method treats each user session as an independent, purposeful access unit (for example, one access to a page, a picture, or a script). In the initial analysis stage, the normal session flow is derived by automatically accessing the website and combining the result with the logs; further normal session streams are added continuously during subsequent processing, and every session other than the normal ones is classified as abnormal. The abnormal sessions are divided into different levels that are displayed and processed separately. Because it starts from anomaly analysis, the method can greatly improve the efficiency of anomaly analysis of website logs without missing any abnormal access session, which makes it particularly valuable for security and forensic applications that analyze hacker attacks.
Description
Technical field
The present invention relates to techniques for analyzing and processing website data, and in particular to techniques for analyzing website logs.
Background
Website log analysis has many different applications, and the requirements on log analysis and the processing methods differ with the purpose of the application. Research on web data mining aimed at analyzing system performance mostly uses statistical methods; data mining aimed at improving system design mostly uses association rule mining; and data mining aimed at understanding user intent mostly uses clustering and classification. All of these areas involve techniques such as data cleaning, session identification, and user identification, but because the application requirements differ, the ways in which the website logs are analyzed and processed also differ considerably.
Abnormal use of website data includes anomalies in the purpose of access, the access mode, the access behavior, and the access tool. A website provides content and application information; a normal user accesses the page content of the website with a browser and uses the website to solve a certain class of problem. An abnormal user accesses the website to obtain different information and data, so the access mode, access behavior, and access tool all differ. When abnormal session analysis of website data is applied to forensic analysis, the website logs must be processed at a finer granularity: not even the smallest anomaly may be missed, so that the attack or the source of the website anomaly can ultimately be identified.
Data cleaning means removing irrelevant data from the log records. Traditional web analytics is concerned with the page links that users access; the pictures, style definitions, scripts, and other items that a page includes are regarded as part of the page, and these items are deleted outright during data cleaning. For anomaly analysis, no data is irrelevant: the number of content items that each page includes is ultimately fixed, and an access that retrieves fewer or more items than this fixed number should be regarded as abnormal.
A user session is one effective access by a user to the server. The user session that a normal website expects is one in which the user clicks a link and the website sends the user all data content related to that link. A traditional session analysis may span the links among the multiple pages that a user visits during a single visit to the website, uses a simple session determination method, and builds the session content from the log entries that remain after data cleaning.
Traditional website log analysis focuses on the normal access patterns of a website, for example load optimization analysis and user pattern analysis. Such methods have to analyze a very large number of log entries, so they are very inefficient when used for anomaly analysis of website logs.
Furthermore, the methods used in traditional website log analysis are not suited to security and forensic applications that analyze hacker attacks. In hacking cases, the hacker's activities on the website include reconnaissance, scanning, launching attacks, uploading files, privilege escalation, taking control, and denial-of-service attacks. These activities differ greatly from normal user access in access mode, access behavior, and access tool, so traditional website log analysis methods are not suitable for the security and forensic analysis of hacker attacks.
Summary of the invention
The present invention addresses the problems that website log analysis is inefficient and unsuited to the security and forensic analysis of hacker attacks, and provides an abnormal session analysis method for website logs. The method not only greatly improves the efficiency of anomaly analysis of website logs but is also useful for security and forensic applications that analyze hacker attacks.
To achieve the above object, the present invention adopts the following technical solution:
An abnormal session analysis method for website logs, in which a user session is treated as an independent, purposeful access unit. In the initial stage of analysis, the normal session flow is derived by automatically accessing the website in combination with the logs; normal session streams are added continuously during subsequent processing; and all sessions other than the normal ones are classified as abnormal. The abnormal sessions can be divided into different levels that are displayed and processed separately.
In a preferred embodiment of the present invention, the analysis method is implemented in the following steps:
(1) Load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the page/file access address information;
(2) Simulate a browser and user behavior: start a crawler engine to crawl the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents;
(3) Combine the document object model and page link information generated by the crawler with the access information in the logs, perform a second analysis of the website logs, generate a preliminary session information stream, and at the same time mark each website session as normal or abnormal;
(4) Classify the abnormal and normal sessions by URL pattern, access mode, returned result, and so on, and feed the result back to the user, who can modify the abnormal/normal attribute of a session and can merge, split, or adjust the categories;
(5) Process the log sessions according to the user's adjustments and output all abnormal session streams; the abnormal sessions are divided into different levels according to the built-in configuration for display and further processing.
Further, step (1) scans each log entry and parses its client IP address, access time, access method, accessed page link, client program, server return value, server status, and so on; each accessed page link (URL) obtained by the analysis is a page/file access address of the website.
Further, in step (2) the simulated browser accesses each distinct URL of the website and analyzes the document object model of the returned page content. If accessing a document object does not trigger referenced accesses to other objects, the access is determined to be an atomic access. If accessing a document also retrieves the content of other document objects, the links to those document objects are included with the document to form a single atomic access. An atomic access that contains links to multiple document objects constitutes a standard session; an atomic access that contains only a single document object access does not constitute a standard session.
Further, step (3) scans the logs according to the sessions and atomic accesses determined in step (2) and converts the website log stream into a session information stream; all standard sessions are determined to be normal sessions, and all other sessions are determined to be abnormal sessions.
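The conversion of a log stream into a session stream can be illustrated with a minimal Python sketch. The patent does not give an algorithm, so the grouping rule used here (a request for a crawled page opens a new session for that client, and later requests from the same client for that page's embedded objects join it) is an assumption, and the names `Entry`, `to_session_stream`, `is_standard`, and `atomic_map` are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal log record: client address and requested URL.
Entry = namedtuple("Entry", "ip url")


def to_session_stream(entries, atomic_map):
    """Group log entries into sessions.

    atomic_map maps each crawled page URL to the set of embedded
    object URLs that the crawl found for it (assumed input).
    """
    open_session = {}  # ip -> session currently being filled
    stream = []
    for e in entries:
        current = open_session.get(e.ip)
        if e.url in atomic_map:
            # A request for a known page starts a new session for this client.
            current = {"ip": e.ip, "page": e.url, "resources": set()}
            open_session[e.ip] = current
            stream.append(current)
        elif current is not None and e.url in atomic_map[current["page"]]:
            # An embedded object of the page the client is currently loading.
            current["resources"].add(e.url)
        else:
            # Not explained by the crawl: keep it as a session of its own.
            stream.append({"ip": e.ip, "page": None, "resources": {e.url}})
    return stream


def is_standard(session, atomic_map):
    """True only if the page has embedded objects and all of them were fetched."""
    page = session["page"]
    return (
        page is not None
        and bool(atomic_map[page])
        and session["resources"] == atomic_map[page]
    )
```

Under the patent's definition, a session for which `is_standard` is false would be treated as abnormal until a user reclassifies it; note that this includes pages without embedded objects, which the text says do not constitute a standard session.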
Further, the merging, splitting, and adjustment of categories in step (4) are performed by pattern matching.
Further, step (3) processes the website log sessions as follows:
(31) A page session that fully complies with the access rules of the simulated browser is identified as normal;
(32) A session that matches a pre-configured pattern is identified as abnormal;
(33) A session in which a user's repeated accesses exceed the configured value is identified as abnormal;
(33) A page session that falls below the configured value is identified as abnormal;
(34) A session that cannot be marked as a normal page session is identified as abnormal;
(35) The session analysis results are processed through human-computer interaction: the normal/abnormal marks of sessions are changed and session patterns are merged, so that the number of kinds of normal session decreases while the number of sessions they cover increases.
Further, step (5) filters out all normal sessions and displays only the abnormal sessions in the logs; abnormal sessions are assigned different levels according to the configuration, and anomalies with a similar level and access type are displayed as the same pattern.
When the method of the present invention is used for anomaly analysis of website logs, the session analysis stage can reduce the amount of analysis to 1/8 of the original number of log entries, and after normal access sessions are excluded, the scale of log analysis can be reduced to 1/100 of the original number of log entries, which greatly improves the efficiency of anomaly analysis of website logs.
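As a rough worked example of these ratios, the following Python sketch applies the two factors stated above (1/8 and 1/100) to a log size. The factors are the figures given in the text, not measurements, and the function name `estimated_workload` is hypothetical.

```python
def estimated_workload(log_entries):
    """Apply the reduction ratios stated in the text to a number of log entries."""
    after_sessionizing = log_entries // 8        # entries grouped into sessions
    after_normal_filter = log_entries // 100     # normal sessions excluded
    return after_sessionizing, after_normal_filter
```

For a log of 800,000 entries this gives 100,000 units after session analysis and 8,000 after normal sessions are excluded.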
Detailed description
To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the invention is further described below with reference to an example.
The abnormal session analysis method for website logs provided by the present invention treats a user session as an independent, purposeful access unit (for example, one access to a page, a picture, or a script). In the initial stage of analysis, the normal session flow is derived by automatically accessing the website in combination with the logs; normal session streams are added continuously during subsequent processing; and all sessions other than the normal ones are classified as abnormal. The abnormal sessions can be divided into different levels that are displayed and processed separately.
In a specific implementation, the scheme mainly comprises the following three parts:
1) Processing sessions: by crawling and analyzing the original website, the session logic in the website logs can be determined, which yields the accurate session stream in the website logs.
2) Enumerating the patterns of all user sessions: the number of session patterns on a real website is likely to be small. For a company website, for example, there are no more than 20 pages after consolidation. A forum website also has few session patterns, which essentially fall into accesses to modules and accesses to specific pages, so its session patterns likewise number no more than 20.
3) Distinguishing abnormal access sessions from normal user sessions: when a website is accessed for an abnormal purpose, the resulting sessions differ from the normal session patterns.
Accordingly, the abnormal session analysis method first analyzes the sessions in the website logs and then analyzes website anomalies on the basis of the session stream information. The specific implementation is as follows:
(1) Load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the link information between page/file contents.
In a specific implementation, this step scans each log entry and parses its client IP address, access time, access method, accessed page link, client program, server return value, server status, and so on; each URL obtained by the analysis is a page/file access address of the website.
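The parsing in step (1) might look like the following minimal Python sketch. The patent does not name a log format; the Apache/Nginx "combined" format assumed here, as well as the names `LOG_PATTERN`, `LogEntry`, `parse_line`, and `site_urls`, are illustrative choices only.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed input: Apache/Nginx "combined" log format (not specified by the patent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: \S+)?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


class LogEntry(NamedTuple):
    ip: str          # client IP address
    time: datetime   # access time
    method: str      # access method, e.g. GET or POST
    url: str         # accessed page link
    status: int      # server return value
    agent: str       # client program (User-Agent), empty if not logged


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the assumed format."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return LogEntry(
        ip=m["ip"],
        time=datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        method=m["method"],
        url=m["url"],
        status=int(m["status"]),
        agent=m["agent"] or "",
    )


def site_urls(lines):
    """Collect every distinct URL seen in the log: the page/file access addresses."""
    return {e.url for e in map(parse_line, lines) if e is not None}
```

Lines that do not match are skipped here for brevity; in a forensic setting they would more plausibly be kept and flagged, since the method stresses that no data is irrelevant.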
(2) Simulate a browser and user behavior: start a crawler engine to crawl the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents.
In a specific implementation, this step uses a simulated browser (crawler) to access each distinct URL of the website and analyzes the document object model of the returned page content. If accessing a document object does not trigger referenced accesses to other objects, the access is determined to be an atomic access. If accessing a document also retrieves the content of other document objects, the links to those document objects are included with the document to form a single atomic access. An atomic access that contains links to multiple document objects constitutes a standard session; an atomic access that contains only a single document object access does not constitute a standard session.
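One way to derive an atomic access from a crawled page is sketched below in Python, using only the standard library HTML parser. Which tags count as embedded document objects is an assumption of this sketch (`img`, `script`, `iframe`, and `link`), hyperlinks in `a` tags are treated as links between pages rather than as part of the atomic access, and the names `ResourceCollector` and `atomic_access` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect embedded objects and outgoing page links from one HTML page."""

    # Assumed mapping: tag -> attribute that references an embedded object.
    # Simplification: every <link href> is counted, not only style sheets.
    EMBED_ATTRS = {"img": "src", "script": "src", "iframe": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embedded = []  # objects a browser fetches together with the page
        self.links = []     # hyperlinks to other pages

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in self.EMBED_ATTRS and attrs.get(self.EMBED_ATTRS[tag]):
            ref = attrs[self.EMBED_ATTRS[tag]]
            self.embedded.append(urljoin(self.base_url, ref))


def atomic_access(page_url, html):
    """Return the atomic access for a page and whether it is a standard session."""
    collector = ResourceCollector(page_url)
    collector.feed(html)
    urls = [page_url] + collector.embedded
    return {
        "urls": urls,
        # Per the text, only an atomic access with several objects is standard.
        "is_standard_session": len(urls) > 1,
        "links": collector.links,
    }
```

A real crawler would also need to execute scripts and follow style sheet imports to see every object a browser requests; this sketch only reads static markup.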
(3) Combine the document object model and page link information generated by the crawler with the access information in the logs and perform a second analysis of the website logs. Specifically, the logs are scanned according to the sessions and atomic accesses determined in step (2), and the website log stream is converted into a session information stream; all standard sessions are determined to be normal sessions, all other sessions are determined to be abnormal sessions, and each website session is marked as normal or abnormal accordingly. The marking is carried out by applying the following checks in sequence:
(31) Judge whether the website session is a page session that fully complies with the access rules of the simulated browser; if so, it is identified as normal;
(32) Judge whether the website session matches a pre-configured pattern; if so, it is identified as abnormal;
(33) Judge whether a user's repeated accesses in the website session exceed the configured value; if so, the session is identified as abnormal;
(33) Judge whether the website session falls below the corresponding configured value; if so, the session is identified as abnormal;
(34) Any session that cannot be marked as a normal page session after steps (31) to (33) is identified as abnormal;
(35) The session analysis results are processed through human-computer interaction: the normal/abnormal marks of sessions are changed and session patterns are merged, so that the number of kinds of normal session decreases while the number of sessions they cover increases.
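The sequential checks (31) to (34) can be sketched in Python as a first-match rule chain. This is one reading of the text, not the patent's implementation: the session layout, the meaning given to the second rule (33) (fewer requested objects than a configured minimum), and the names `label_sessions`, `bad_patterns`, `max_repeats`, and `min_objects` are all assumptions.

```python
import re
from collections import Counter


def label_sessions(sessions, expected, bad_patterns, max_repeats, min_objects):
    """Label sessions as normal or abnormal and record which rule decided.

    sessions: dicts with 'ip', 'page' (or None) and 'resources' (set of URLs)
    expected: page URL -> set of embedded object URLs found by the crawl
    """
    repeats = Counter((s["ip"], s["page"]) for s in sessions)
    compiled = [re.compile(p) for p in bad_patterns]
    results = []
    for s in sessions:
        page, got = s["page"], s["resources"]
        urls = ([page] if page else []) + sorted(got)
        if page in expected and got == expected[page]:
            results.append(("normal", "31"))               # matches the crawl exactly
        elif any(p.search(u) for p in compiled for u in urls):
            results.append(("abnormal", "32"))             # pre-configured pattern
        elif repeats[(s["ip"], page)] > max_repeats:
            results.append(("abnormal", "33-repeat"))      # too many repeated accesses
        elif len(urls) < min_objects:
            results.append(("abnormal", "33-too-few"))     # below configured minimum
        else:
            results.append(("abnormal", "34"))             # default: not provably normal
    return results
```

With first-match ordering, rules (32) and (33) mainly record why a session is abnormal, because anything that fails (31) ends up abnormal anyway. A deployment that wants to flag heavy repetition of otherwise normal sessions would need to test the repeat count before rule (31).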
(4) Classify the abnormal and normal sessions by URL pattern, access mode, returned result, and so on, and feed the result back to the user, who can modify the abnormal/normal attribute of a session and can merge, split, or adjust the categories by pattern matching.
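A possible shape for the classification in step (4) is sketched below in Python. How a URL is reduced to a pattern is not described in the source, so the normalization here (digits replaced by a placeholder, query values dropped), the use of the HTTP method as the access mode, and the names `url_pattern`, `categorize`, and `merge` are assumptions for illustration.

```python
import re
from collections import defaultdict


def url_pattern(url):
    """Collapse a URL into a coarse pattern: mask digits, keep only query keys."""
    path, _, query = url.partition("?")
    path = re.sub(r"\d+", "{n}", path)
    keys = sorted(part.split("=", 1)[0] for part in query.split("&") if part)
    return path + ("?" + "&".join(keys) if keys else "")


def categorize(sessions):
    """Group sessions by label, access mode, URL pattern, and status class."""
    groups = defaultdict(list)
    for s in sessions:
        key = (s["label"], s["method"], url_pattern(s["url"]), s["status"] // 100)
        groups[key].append(s)
    return groups


def merge(groups, keys, new_key):
    """User adjustment: merge several categories into one named category."""
    merged = [s for k in keys for s in groups.pop(k)]
    groups[new_key].extend(merged)
    return groups
```

Splitting a category or flipping a session's normal/abnormal attribute would follow the same pattern: move the affected sessions to a new key and let the user confirm the result.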
(5) Process the log sessions according to the user's adjustments and output all abnormal session streams; the abnormal sessions are divided into different levels according to the built-in configuration for display and further processing.
In a specific implementation, this step filters out all normal sessions and displays only the abnormal sessions in the logs; abnormal sessions are assigned different levels according to the configuration, and anomalies with a similar level and access type are displayed as the same pattern.
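The filtering and grading in step (5) could be sketched in Python as follows. The patent does not disclose its built-in configuration, so the grade names, the regular expressions in `GRADE_RULES`, and the function names `grade` and `abnormal_report` are hypothetical placeholders.

```python
import re

# Hypothetical built-in configuration: (grade, regex) pairs checked in order.
GRADE_RULES = [
    ("high", re.compile(r"(?i)(union.+select|\.\./|<script|cmd=)")),
    ("medium", re.compile(r"(?i)(/admin|/wp-login|\.bak$|\.sql$)")),
]


def grade(url):
    """Return the first matching grade, or 'low' if no rule matches."""
    for level, pattern in GRADE_RULES:
        if pattern.search(url):
            return level
    return "low"


def abnormal_report(sessions):
    """Drop normal sessions and group the rest by (grade, access type)."""
    report = {}
    for s in sessions:
        if s["label"] != "abnormal":
            continue  # normal sessions are filtered out of the display
        key = (grade(s["url"]), s["method"])
        report.setdefault(key, []).append(s["url"])
    return report
```

Each key of the returned dictionary corresponds to one displayed pattern, so sessions with the same grade and access type appear together.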
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited by the above embodiment; the embodiment and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection is defined by the appended claims and their equivalents.
Claims (7)
1. An abnormal session analysis method for website logs, characterized in that the analysis method treats a user session as an independent, purposeful access unit; in the initial stage of analysis, the normal session flow is derived by automatically accessing the website in combination with the logs; normal session streams are added continuously during subsequent processing; all sessions other than the normal ones are classified as abnormal; and the abnormal sessions can be divided into different levels that are displayed and processed separately; the analysis method is implemented in the following steps:
(1) load the website logs and, through an initial analysis of the logs, obtain the entry points of website access and the page/file access address information;
(2) simulate a browser and user behavior: start a crawler engine to crawl the pages of the website, analyze the structure of each page, and generate the page document object model and the link information between page/file contents;
(3) combine the document object model and page link information generated by the crawler with the access information in the logs, perform a second analysis of the website logs, generate a preliminary session information stream, and at the same time mark each website session as normal or abnormal;
(4) classify the abnormal and normal sessions by URL pattern, access mode, and returned result, and feed the result back to the user, who modifies the abnormal/normal attribute of a session and can merge, split, or adjust the categories;
(5) process the log sessions according to the user's adjustments and output all abnormal session streams, the abnormal sessions being divided into different levels according to the built-in configuration for display and further processing.
2. The abnormal session analysis method for website logs according to claim 1, characterized in that step (1) scans each log entry and parses its client IP address, access time, access method, accessed page link, client program, server return value, and server status, and each accessed page link obtained by the analysis is a page/file access address of the website.
3. The abnormal session analysis method for website logs according to claim 1, characterized in that in step (2) the simulated browser accesses each distinct URL of the website and analyzes the document object model of the returned page content; if accessing a document object does not trigger referenced accesses to other objects, the access is determined to be an atomic access; if accessing a document also retrieves the content of other document objects, the links to those document objects are included with the document to form a single atomic access; an atomic access that contains links to multiple document objects constitutes a standard session, and an atomic access that contains only a single document object access does not constitute a standard session.
4. The abnormal session analysis method for website logs according to claim 3, characterized in that step (3) scans the logs according to the sessions and atomic accesses determined in step (2) and converts the website log stream into a session information stream; all standard sessions are determined to be normal sessions, and all other sessions are determined to be abnormal sessions.
5. The abnormal session analysis method for website logs according to claim 4, characterized in that step (3) processes the website log sessions as follows:
(31) a page session that fully complies with the access rules of the simulated browser is identified as normal;
(32) a session that matches a pre-configured pattern is identified as abnormal;
(33) a session in which a user's repeated accesses exceed the configured value is identified as abnormal;
(33) a page session that falls below the configured value is identified as abnormal;
(34) a session that cannot be marked as a normal page session is identified as abnormal;
(35) the session analysis results are processed through human-computer interaction: the normal/abnormal marks of sessions are changed and session patterns are merged, so that the number of kinds of normal session decreases while the number of sessions they cover increases.
6. The abnormal session analysis method for website logs according to claim 1, characterized in that the merging, splitting, and adjustment of categories in step (4) are performed by pattern matching.
7. The abnormal session analysis method for website logs according to claim 1, characterized in that step (5) filters out all normal sessions and displays only the abnormal sessions in the logs; abnormal sessions are assigned different levels according to the configuration, and anomalies with a similar level and access type are displayed as the same pattern.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310303384.XA CN103401849B (en) | 2013-07-18 | 2013-07-18 | Abnormal session analyzing method for website logs |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103401849A CN103401849A (en) | 2013-11-20 |
CN103401849B true CN103401849B (en) | 2017-02-15 |
Family
ID=49565375
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310303384.XA Active CN103401849B (en) | 2013-07-18 | 2013-07-18 | Abnormal session analyzing method for website logs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103401849B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105262720A (en) * | 2015-09-07 | 2016-01-20 | 深信服网络科技(深圳)有限公司 | Web robot traffic identification method and device |
CN106649312B (en) * | 2015-10-29 | 2019-10-29 | 北京北方华创微电子装备有限公司 | The analysis method and system of journal file |
US9872072B2 (en) * | 2016-03-21 | 2018-01-16 | Google Llc | Systems and methods for identifying non-canonical sessions |
CN114417197A (en) * | 2017-01-13 | 2022-04-29 | 阿里巴巴集团控股有限公司 | Access record processing method and device and storage medium |
CN107204991A (en) * | 2017-07-06 | 2017-09-26 | 深信服科技股份有限公司 | A kind of server exception detection method and system |
CN107590227A (en) * | 2017-09-05 | 2018-01-16 | 成都知道创宇信息技术有限公司 | A kind of log analysis method of combination reptile |
CN107483507B (en) * | 2017-09-30 | 2020-11-13 | 北京东土军悦科技有限公司 | Session analysis method, device and storage medium |
CN109241733A (en) * | 2018-08-07 | 2019-01-18 | 北京神州绿盟信息安全科技股份有限公司 | Crawler Activity recognition method and device based on web access log |
CN111224963A (en) * | 2019-12-30 | 2020-06-02 | 北京安码科技有限公司 | Network shooting range task duplication method, system, electronic equipment and storage medium |
CN111224823B (en) * | 2020-01-06 | 2022-08-16 | 杭州数群科技有限公司 | Method based on different network log analysis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101232399A (en) * | 2008-02-18 | 2008-07-30 | 刘峰 | Analytical method of website abnormal visit |
CN101242307A (en) * | 2008-02-01 | 2008-08-13 | 刘峰 | Website access analysis system and method based on built-in code proxy log |
CN101635718A (en) * | 2009-08-26 | 2010-01-27 | 中兴通讯股份有限公司 | Network crawler system and method for acquiring resource as well as network resource gripping device |
CN103178982A (en) * | 2011-12-23 | 2013-06-26 | 阿里巴巴集团控股有限公司 | Method and device for analyzing log |
EP2610776A2 (en) * | 2011-09-16 | 2013-07-03 | Veracode, Inc. | Automated behavioural and static analysis using an instrumented sandbox and machine learning classification for mobile security |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010011792A2 (en) * | 2008-07-22 | 2010-01-28 | Widemile Inc. | Method and system for web-site testing |
Also Published As
Publication number | Publication date |
---|---|
CN103401849A (en) | 2013-11-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |