CN109408745A - Web data analysis and processing method and device - Google Patents

Info

Publication number
CN109408745A
Authority
CN
China
Prior art keywords
url
binary group
data
information
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811084330.8A
Other languages
Chinese (zh)
Inventor
曹严清
王慧生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guomei Netan Technology Co Ltd
Original Assignee
Guomei Netan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guomei Netan Technology Co Ltd
Priority to CN201811084330.8A
Publication of CN109408745A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the present invention discloses a web data analysis and processing method and device. The method comprises: obtaining the URL data of a target webpage; splitting the URL data to obtain a binary group set, that is, the set of binary groups formed from the element information produced by the split, where each group of element information corresponds to one binary group and a binary group contains an element and the position information of that element; and compressing the URL data according to the frequency with which the element of each binary group occurs, to obtain the pattern mode of the URL. The method can compress the large volume of URL data in the access data of various Web applications into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes greatly reduces the amount of data processing and computation, and the processed pattern mode data can be used directly for manual inspection and security analysis.

Description

Web data analysis and processing method and device
Technical field
The present invention relates to the technical field of computer information security, and in particular to a web data analysis and processing method and device.
Background technique
With the explosive growth of e-commerce websites, there is an increasingly strong demand for statistical analysis of the data of mainstream retail e-commerce websites, such as access records, average daily visits and security.
The traditional statistical method is based on the uniform resource locator (Uniform Resource Locator, URL): the total number of accesses to each URL is aggregated in order to count the access records of the whole website and the visits within a particular period. Because the number of distinct URLs is large, the statistical result is enormous and is difficult to scan and analyze quickly. Security analysis of website access, in turn, is based on regular-expression analysis, in which access URLs are matched against regular expressions. An attack generally leaves a direct signature in the URL, so matching against specific regular expressions can determine whether an access URL is malicious. However, different access URLs may correspond to the same attack type, and a single URL may even contain several attack types. For a business system with hundreds or thousands of server nodes, millions of users and millions of URLs, in an Internet field where business iteration is fast and URLs change frequently, this method requires enormous manual effort and maintenance and is difficult to implement.
Summary of the invention
To solve the above technical problems, embodiments of the present invention aim to provide a web data analysis and processing method and device that reduce the amount of data processing and computation required for network access data.
The technical scheme of the present invention is realized as follows:
A web data analysis and processing method, the method comprising:
Obtaining the URL data of a target webpage;
Splitting the URL data to obtain a binary group set, the binary group set being the set of binary groups formed from the element information produced by the split, where each group of element information corresponds to one binary group and the binary group contains an element and the position information of the element;
Compressing the URL data according to the frequency information of the element corresponding to each binary group, to obtain the pattern mode of the URL.
In the above scheme, after the URL data is split and the binary group set is obtained, the method further comprises:
Screening the binary groups in the binary group set according to a preset condition to obtain a frequent item collection, the preset condition being that the frequency with which the binary group occurs is greater than a first preset threshold;
Correspondingly, compressing the URL data according to the frequency information of the element corresponding to the binary group to obtain the pattern mode of the URL is specifically:
Compressing the URL data according to the frequency information of the elements corresponding to the binary groups in the frequent item collection, to obtain the pattern mode of the URL.
In the above scheme, splitting the URL data comprises:
Splitting the URL using preset character information, where the split URL comprises directories, parameters and parameter values.
In the above scheme, compressing the URL data according to the frequency information of the element corresponding to the binary group to obtain the pattern mode of the URL comprises:
Retaining the elements of the URL that correspond to binary groups in the frequent item collection and replacing the other elements with a specific character to obtain the candidate set of the URL, where the specific character is a wildcard character that a computer can recognize;
Judging whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
If the frequency with which the elements in the candidate set occur is greater than the second preset threshold, using the candidate set as the pattern mode of the URL;
If the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, using the original URL as the pattern mode of the URL.
In the above scheme, after the URL data is compressed according to the frequency information of each binary group and the pattern mode of the URL is obtained, the method further comprises:
Performing feature extraction using the pattern mode data of the URL as the training set for machine learning;
Performing behavior pattern security analysis on the extracted information to determine abnormal behavior patterns.
An embodiment of the present invention also provides a web data analysis and processing device, the device comprising:
An obtaining module, configured to obtain the URL data of a target webpage;
A splitting module, configured to split the URL data obtained by the obtaining module to obtain a binary group set, the binary group set being the set of binary groups formed from the element information produced by the split, where each group of element information corresponds to one binary group and the binary group contains an element and the position information of the element;
A processing module, configured to compress the URL data according to the frequency information of the element corresponding to each binary group, to obtain the pattern mode of the URL.
In the above scheme, the device further comprises:
A screening module, configured to screen the binary groups in the binary group set according to a preset condition to obtain a frequent item collection, the preset condition being that the frequency with which the binary group occurs is greater than a first preset threshold;
Correspondingly, the processing module is specifically configured to:
Compress the URL data according to the frequency information of the elements corresponding to the binary groups in the frequent item collection, to obtain the pattern mode of the URL.
In the above scheme, the splitting module is configured to:
Split the URL using preset character information, where the split URL comprises directories, parameters and parameter values.
In the above scheme, the processing module is specifically configured to:
Retain the elements of the URL that correspond to binary groups in the frequent item collection and replace the other elements with a specific character to obtain the candidate set of the URL, where the specific character is a wildcard character that a computer can recognize;
Judge whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
If the frequency with which the elements in the candidate set occur is greater than the second preset threshold, use the candidate set as the pattern mode of the URL;
If the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, use the original URL as the pattern mode of the URL.
In the above scheme, the device further comprises:
A feature extraction module, configured to perform feature extraction using the pattern mode data of the URL as the training set for machine learning;
An abnormal behavior detection module, configured to perform behavior pattern security analysis on the extracted information to determine abnormal behavior patterns.
Embodiments of the present invention provide a web data analysis and processing method and device. After obtaining multiple pieces of URL data of the target webpage, the method splits the obtained URL data to obtain a binary group set and then compresses the URL data according to the frequency information of the binary groups in that set to obtain the pattern mode of the URL. In this way, the large volume of URL data in the access data of various Web applications can be compressed into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes greatly reduces the amount of data processing and computation, and the processed pattern mode data can be used directly for manual inspection and security analysis.
Brief description of the drawings
Fig. 1 is a flow chart of embodiment one of the web data analysis and processing method provided by the present invention;
Fig. 2 is a flow chart of embodiment two of the web data analysis and processing method provided by the present invention;
Fig. 3 is a flow chart of embodiment three of the web data analysis and processing method provided by the present invention;
Fig. 4 is a flow chart of embodiment four of the web data analysis and processing method provided by the present invention;
Fig. 5 is the structural schematic diagram of web data analysis processing device embodiment one provided by the invention;
Fig. 6 is the structural schematic diagram of web data analysis processing device embodiment two provided by the invention;
Fig. 7 is the structural schematic diagram of web data analysis processing device embodiment three provided by the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention.
Fig. 1 is a flow chart of embodiment one of the web data analysis and processing method provided by the present invention. As shown in Fig. 1, the method comprises:
Step 101: obtaining the URL data of the target webpage.
In this step, the URL data of the target webpage is obtained first. The target webpage is a World Wide Web (WEB) website to be monitored, such as a retail e-commerce website. The URL data is the access data of each IP user visiting each webpage over a period of time: the same IP user corresponds to different URL addresses on different websites, and different IP users may also produce multiple pieces of URL data on the same website. The URL data in this step includes the statistical data of all of these cases. For example, the following access log is obtained:
10.112.73.45 - - [02/Feb/2018:09:11:20 +0800] "GET /dev/codeCheck?userName=18900964569&timestamp=1517533878415 HTTP/1.0" 200 1983;
10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "POST /login HTTP/1.0" 200 226;
10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "GET / HTTP/1.0" 200 10400;
10.112.73.45 - - [02/Feb/2018:09:11:28 +0800] "GET /pushlet.srv?p_event=join-listen&p_format=xml-strict&p_mode=pull&p_subject=/device/844274E3EAD8A3A5A9EAEBF23DEE3E5B HTTP/1.0" 200 260;
10.112.73.46 - - [02/Feb/2018:09:11:35 +0800] "GET /auth/products/index HTTP/1.0" 200 13127;
10.112.73.46 - - [02/Feb/2018:09:11:36 +0800] "GET /pushlet.srv?p_event=join-listen&p_format=xml-strict&p_mode=pull&p_subject=/device/844274E3EAD8A3A5A9EAEBF23DEE3E5B HTTP/1.0" 200 260;
10.112.73.46 - - [02/Feb/2018:09:11:44 +0800] "GET /auth/product/detail/2301 HTTP/1.0" 200 37601;
10.112.73.46 - - [02/Feb/2018:09:11:45 +0800] "GET /v1.0/product/2301?_=1517533902735 HTTP/1.0" 200 1196;
In the above log, one IP address represents one user, and the URLs in the log entries of one IP represent that user's URL access behavior; multiple log entries represent the different access behaviors of different IP users. A series of time-ordered URL access records represents the different behavior paths of different users.
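As an illustration of how such a log could be turned into per-user URL sequences, the following is a minimal sketch. It assumes each entry follows the common `ip - - [timestamp] "METHOD url HTTP/x.y" status size` layout; the names `LOG_RE` and `group_urls_by_ip` are hypothetical and do not come from the patent.

```python
import re
from collections import defaultdict

# Assumed layout: ip - - [timestamp] "METHOD url HTTP/x.y" status size
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) HTTP/[\d.]+" (?P<status>\d{3}) (?P<size>\d+)'
)

def group_urls_by_ip(lines):
    """Return {ip: [(method, url), ...]} in log order; unparsable lines are skipped."""
    sessions = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m:
            sessions[m["ip"]].append((m["method"], m["url"]))
    return dict(sessions)

sample = [
    '10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "POST /login HTTP/1.0" 200 226',
    '10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "GET / HTTP/1.0" 200 10400',
    '10.112.73.46 - - [02/Feb/2018:09:11:35 +0800] "GET /auth/products/index HTTP/1.0" 200 13127',
]
sessions = group_urls_by_ip(sample)
```

Keeping the entries in log order preserves the time sequence, so each list can be read as one user's behavior path.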
Step 102: splitting the URL data to obtain a binary group set, the binary group set being the set of binary groups formed from the element information produced by the split, where each group of element information corresponds to one binary group and the binary group contains an element and the position information of the element.
In this step, the URL data from step 101 is split to obtain the binary group set. The elements of the set are binary groups; each binary group corresponds to one group of element information produced by the split and contains an element and its position information. For example, in the above log, "GET" is one group of element information and can be represented by the binary group P1 = {"GET", 01}, which indicates that P1 contains the element "GET" and its position 01 in this URL.
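The binary group construction can be sketched as follows. This is only one possible reading of the step: it assumes the request method is counted as the first element, positions start at 1, and empty fragments are dropped; the function name `to_binary_groups` is hypothetical.

```python
import re

def to_binary_groups(method, url):
    """Split a request into (element, position) pairs, with positions starting at 1."""
    # Delimiters assumed from the description: "/", "?", "=", "&".
    parts = [p for p in re.split(r"[/?=&]", url) if p]
    elements = [method] + parts
    return [(element, pos) for pos, element in enumerate(elements, start=1)]

groups = to_binary_groups("GET", "/auth/products/index")
```

Here the first pair, `("GET", 1)`, plays the role of the binary group P1 in the text.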
Step 103: compressing the URL data according to the frequency information of the element corresponding to each binary group, to obtain the pattern mode of the URL.
In this step, after the processing of steps 101 and 102, every URL has been split into multiple binary groups. The URL data is compressed according to the frequency with which the element of each binary group occurs. For example, for binary groups that satisfy the condition of occurring 3 times within a particular period, the corresponding elements are retained in the URL and the other elements are replaced with the wildcard * or another substitute character, giving the pattern mode of the URL. A pattern mode is the form obtained when certain positions of a set of URLs are replaced by the wildcard * or another character; URLs that share this form are said to have the same pattern mode.
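The replacement itself can be sketched in a few lines. The sketch assumes the frequent binary groups are already known and that the masked elements are rejoined with "/"; `to_candidate` and the sample values are hypothetical.

```python
def to_candidate(groups, frequent, wildcard="*"):
    """Keep elements whose (element, position) pair is frequent; mask the rest."""
    return [elem if (elem, pos) in frequent else wildcard for elem, pos in groups]

# Hypothetical frequent binary groups and one split URL.
frequent = {("GET", 1), ("v1.0", 2), ("product", 3)}
groups = [("GET", 1), ("v1.0", 2), ("product", 3), ("2301", 4)]
pattern = "/".join(to_candidate(groups, frequent))
```

Every URL of the form `GET /v1.0/product/<id>` collapses to the same string, which is what makes the compressed data small enough to inspect by hand.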
In the present embodiment, after multiple pieces of URL data of the target webpage are obtained, the obtained URL data is split to obtain a binary group set, and the URL data is then compressed according to the frequency information of the element corresponding to each binary group to obtain the pattern mode of the URL. In this way, the large volume of URL data in the access data of various Web applications can be compressed into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes greatly reduces the amount of data processing and computation, and the processed pattern mode data can be used directly for manual inspection and security analysis.
It should be noted that the necessary character information refers to the information that is retained because it occurs often enough. Such information may be the name of a directory level, the name of a passed parameter or the value of a passed parameter. It is retained precisely because it occurs enough times, and this frequently occurring behavior is exactly what is to be observed.
Fig. 2 is a flow chart of embodiment two of the web data analysis and processing method provided by the present invention. As shown in Fig. 2, on the basis of embodiment one, after step 102 splits the URL data and obtains the binary group set, the method further comprises:
Step 1021: screening the binary groups in the binary group set according to a preset condition to obtain a frequent item collection, the preset condition being that the frequency with which the binary group occurs is greater than the first preset threshold;
Correspondingly, step 103, compressing the URL data according to the frequency information of the element corresponding to the binary group to obtain the pattern mode of the URL, is specifically:
Compressing the URL data according to the frequency information of the elements corresponding to the binary groups in the frequent item collection, to obtain the pattern mode of the URL.
Specifically, the binary group set obtained in step 102 of embodiment one is screened: if the frequency with which a binary group occurs in the set is greater than the first preset threshold, the binary group is retained; otherwise it is removed. This yields the frequent item collection (the frequent 1-itemset) of binary groups, that is, the set of binary groups whose frequency of occurrence is greater than the first preset threshold. The first preset threshold may be set by the user or may be a default value, depending on the user's requirements.
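The screening step amounts to counting and thresholding. The sketch below assumes "frequency" means a raw occurrence count across all split URLs; `frequent_items`, `min_count` and the sample data are hypothetical.

```python
from collections import Counter

def frequent_items(group_lists, min_count):
    """Keep binary groups whose occurrence count is greater than min_count."""
    counts = Counter(g for groups in group_lists for g in groups)
    return {g for g, n in counts.items() if n > min_count}

# Three split URLs that differ only in the last element.
urls = [
    [("GET", 1), ("v1.0", 2), ("product", 3), ("2301", 4)],
    [("GET", 1), ("v1.0", 2), ("product", 3), ("2302", 4)],
    [("GET", 1), ("v1.0", 2), ("product", 3), ("2303", 4)],
]
frequent = frequent_items(urls, min_count=2)
```

With a threshold of 2, the three product identifiers are dropped because each occurs only once, while the shared directory elements survive.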
Further, splitting the URL data in step 102 comprises:
Splitting the URL using preset character information, where the split URL comprises directories, parameters and parameter values.
The preset character information includes any of the following: slash, question mark, equals sign or "&".
Specifically, specific characters such as "/", "?", "=" and "&" can be used for the split; these symbols split the URL into meaningful parts such as directories, parameters and parameter values. For example, splitting the URL "GET /auth/products/index HTTP/1.0" 200 13127 on "/" yields "GET", "auth", "products", "index HTTP", "1.0" and so on.
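For comparison, the same three kinds of parts can be obtained with Python's standard URL parser rather than raw delimiter splitting. This is a swapped-in technique, not the patent's own procedure, and the helper name `split_url` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def split_url(url):
    """Separate a URL into directory segments, parameter names and parameter values."""
    parts = urlsplit(url)
    directories = [seg for seg in parts.path.split("/") if seg]
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "directories": directories,
        "parameters": [name for name, _ in pairs],
        "values": [value for _, value in pairs],
    }

result = split_url("/dev/codeCheck?userName=18900964569&timestamp=1517533878415")
```

Using a parser keeps the role of each fragment explicit, which may help when deciding later whether a retained element is a directory name, a parameter name or a parameter value.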
Fig. 3 is a flow chart of embodiment three of the web data analysis and processing method provided by the present invention. As shown in Fig. 3, on the basis of embodiment two, step 103, compressing the URL data according to the frequency information of the element corresponding to the binary group to obtain the pattern mode of the URL, comprises:
Step 1031: retaining the elements of the URL that correspond to binary groups in the frequent item collection and replacing the other elements with a specific character to obtain the candidate set of the URL, where the specific character is a wildcard character that a computer can recognize.
In this step, the URL is compressed. Specifically, the elements corresponding to the binary groups in the frequent item collection of embodiment two are retained, and the other elements are replaced with a wildcard character, such as "*", to obtain the candidate set of the URL.
Step 1032: judging whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
The second preset threshold may be set by the user or may be a default value, depending on the user's requirements.
Step 1033: if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, using the candidate set as the pattern mode of the URL.
Step 1034: if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, using the original URL as the pattern mode of the URL.
In steps 1032 to 1034, the elements in the candidate set are screened further by judging whether the frequency with which they occur is greater than the second preset threshold. If it is, the candidate set is used as the pattern mode of the URL; otherwise no processing is done and the original URL is used as its own pattern mode.
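The second screening can be sketched as below. The sketch interprets "the frequency with which the elements in the candidate set occur" as the number of URLs that produce the same candidate string, which is an assumption; `choose_patterns` and the sample data are hypothetical.

```python
from collections import Counter

def choose_patterns(pairs, min_count):
    """pairs: list of (original_url, candidate). Returns the pattern chosen for each URL."""
    counts = Counter(candidate for _, candidate in pairs)
    # A candidate is kept only if it occurs more than min_count times;
    # otherwise the original URL stands in as its own pattern.
    return [cand if counts[cand] > min_count else original for original, cand in pairs]

pairs = [
    ("/v1.0/product/2301", "/v1.0/product/*"),
    ("/v1.0/product/2302", "/v1.0/product/*"),
    ("/v1.0/product/2303", "/v1.0/product/*"),
    ("/dev/codeCheck", "/*/*"),
]
patterns = choose_patterns(pairs, min_count=2)
```

The fallback to the original URL keeps rare requests visible instead of hiding them behind an overly general pattern.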
In the present embodiment, the elements in the candidate set are screened further, and the elements whose frequency of occurrence satisfies the second preset threshold are retained as the final pattern mode of the URL. The URLs can thus be compressed into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes allows manual inspection and security analysis to be carried out directly and greatly reduces the amount of data processing and computation.
Fig. 4 is a flow chart of embodiment four of the web data analysis and processing method provided by the present invention. As shown in Fig. 4, on the basis of embodiment one or embodiment two, after step 103 compresses the URL data according to the frequency information of each binary group and obtains the pattern mode of the URL, the method further comprises:
Step 104: performing feature extraction using the pattern mode data of the URL as the training set for machine learning.
Step 105: performing behavior pattern security analysis on the extracted information to determine abnormal behavior patterns.
In steps 104 and 105, the pattern modes of the URLs are input as the training set for machine learning and features are extracted; user behavior pattern security analysis is then modeled and carried out in order to discover abnormal user behavior patterns.
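The patent does not specify which features are extracted. One simple possibility, shown here purely as an assumption, is a per-IP count vector over the observed pattern modes; `pattern_features` and the sample visits are hypothetical.

```python
from collections import Counter

def pattern_features(visits):
    """visits: list of (ip, pattern). Returns (vocabulary, {ip: count_vector})."""
    vocab = sorted({p for _, p in visits})
    per_ip = {}
    for ip, pattern in visits:
        per_ip.setdefault(ip, Counter())[pattern] += 1
    # One vector per IP, with one count per pattern in vocabulary order.
    return vocab, {ip: [c[p] for p in vocab] for ip, c in per_ip.items()}

visits = [
    ("10.112.73.45", "/login"),
    ("10.112.73.46", "/v1.0/product/*"),
    ("10.112.73.46", "/v1.0/product/*"),
    ("10.112.73.46", "/login"),
]
vocab, features = pattern_features(visits)
```

Because the vocabulary is the set of pattern modes rather than raw URLs, the vectors stay short, which is consistent with the reduced training cost the description claims.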
A large number of URLs are summarized into particular categories, and the nature of the original data is judged by analyzing the abstracted categories. For example, in user behavior analysis, association rules can be used, and it may not be necessary to know exactly which webpages a user visited: a website may have many subdomains and related URLs, and different users are unlikely to repeat exactly the same behavior. By determining which pattern modes the URLs visited by a user belong to, the security of the user's behavior can be analyzed and judged quickly.
Single-user behavior pattern analysis refers to analysis based on the series of URL access records made by one user over a period of time. It covers, for example: which URLs were accessed, in what order and how frequently; which businesses were accessed (URLs with similar structure and body content are regarded as the same business); and whether the sets of access behavior are similar.
Building on single-user behavior pattern analysis, the URL access behavior of multiple user IPs within the same period is analyzed jointly, including which IPs did what most IPs did and which IPs did something that only a minority did, in order to achieve multi-user IP behavior pattern analysis.
The behavior patterns of different users are compared to find the few abnormal user behavior patterns, and a qualitative security analysis is made of these abnormal patterns, in order to achieve user behavior pattern security analysis.
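A very simple form of the majority and minority comparison can be sketched as follows. It flags pattern modes that only a small share of IPs used; the cut-off `max_share`, the function name `rare_pattern_users` and the sample data are assumptions for illustration, and a real analysis would likely be more elaborate.

```python
from collections import defaultdict

def rare_pattern_users(visits, max_share):
    """Return {ip: patterns} for patterns used by at most max_share of all IPs."""
    ips_by_pattern = defaultdict(set)
    for ip, pattern in visits:
        ips_by_pattern[pattern].add(ip)
    total = len({ip for ip, _ in visits})
    flagged = defaultdict(set)
    for pattern, ips in ips_by_pattern.items():
        if len(ips) / total <= max_share:
            for ip in ips:
                flagged[ip].add(pattern)
    return dict(flagged)

visits = [
    ("A", "/login"), ("B", "/login"), ("C", "/login"),
    ("C", "/admin/*"),
]
flagged = rare_pattern_users(visits, max_share=0.34)
```

An IP that appears in the result has done something most other IPs did not, which marks it as a candidate for the qualitative security review described above rather than as a confirmed attacker.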
In the present embodiment, the volume of pattern mode data after the URLs are compressed is relatively small. A machine learning algorithm uses the compressed URLs as the training set to extract features and perform user behavior pattern security analysis, which greatly reduces the model training time and improves the efficiency of machine learning data processing.
The present invention also provides a web data analysis and processing device. Fig. 5 is a structural schematic diagram of embodiment one of the web data analysis and processing device provided by the present invention. As shown in Fig. 5, the device comprises:
An obtaining module 11, configured to obtain the URL data of a target webpage;
A splitting module 12, configured to split the URL data obtained by the obtaining module to obtain a binary group set, the binary group set being the set of binary groups formed from the element information produced by the split, where each group of element information corresponds to one binary group and the binary group contains an element and the position information of the element;
A processing module 13, configured to compress the URL data according to the frequency information of the element corresponding to each binary group, to obtain the pattern mode of the URL.
The web data analysis and processing device of this embodiment is the device embodiment corresponding to the web data analysis and processing method of embodiment one; its principle and effect are similar and are not repeated here.
Fig. 6 is a structural schematic diagram of embodiment two of the web data analysis and processing device provided by the present invention. As shown in Fig. 6, the device further comprises:
A screening module 14, configured to screen the binary groups in the binary group set according to a preset condition to obtain a frequent item collection, the preset condition being that the frequency with which the binary group occurs is greater than the first preset threshold;
Correspondingly, the processing module 13 is specifically configured to:
Compress the URL data according to the frequency information of the elements corresponding to the binary groups in the frequent item collection, to obtain the pattern mode of the URL.
Further, the splitting module 12 is configured to:
Split the URL using preset character information, where the split URL comprises directories, parameters and parameter values.
Further, the processing module 13 is specifically configured to:
Retain the elements of the URL that correspond to binary groups in the frequent item collection and replace the other elements with a specific character to obtain the candidate set of the URL, where the specific character is a wildcard character that a computer can recognize;
Judge whether the frequency with which the elements in the candidate set occur is greater than the second preset threshold;
If the frequency with which the elements in the candidate set occur is greater than the second preset threshold, use the candidate set as the pattern mode of the URL;
If the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, use the original URL as the pattern mode of the URL.
Fig. 7 is a structural schematic diagram of embodiment three of the web data analysis and processing device provided by the present invention. As shown in Fig. 7, the device further comprises:
A feature extraction module 15, configured to perform feature extraction using the pattern mode data of the URL as the training set for machine learning;
An abnormal behavior detection module 16, configured to perform behavior pattern security analysis on the extracted information to determine abnormal behavior patterns.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction device, the instruction device implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The foregoing is only a preferred embodiment of the present invention, is not intended to limit the scope of the present invention.

Claims (10)

1. A web data analysis and processing method, characterized in that the method comprises:
Obtaining the uniform resource locator (URL) data of a target webpage;
Splitting the URL data to obtain a binary group set, the binary group set being the set of binary groups formed from the element information produced by the split, where each group of element information corresponds to one binary group and the binary group contains an element and the position information of the element;
Compressing the URL data according to the frequency information of the element corresponding to each binary group, to obtain the pattern mode of the URL.
2. The method according to claim 1, characterized in that, after splitting the URL data to obtain the binary group set, the method further comprises:
screening the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, compressing the URL data according to the frequency information on the occurrence of the elements corresponding to the binary groups, to obtain the pattern mode of the URL, specifically comprises:
compressing the URL data according to the frequency information on the occurrence of the elements corresponding to the binary groups in the frequent 1-itemset, to obtain the pattern mode of the URL.
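
The screening step of claim 2 can be illustrated with a minimal sketch. It assumes that "frequency" means the number of URLs in which a binary group appears and that a binary group is stored as an `(element, position)` pair; the function name `frequent_one_itemset`, the parameter `first_threshold`, and the sample data are hypothetical and do not come from the patent text.

```python
from collections import Counter

def frequent_one_itemset(url_tuples, first_threshold):
    """Keep the (element, position) pairs seen in more than first_threshold URLs."""
    counts = Counter()
    for tuples in url_tuples:
        # Count each pair at most once per URL (an assumption of this sketch).
        counts.update(set(tuples))
    return {pair for pair, count in counts.items() if count > first_threshold}

# Hypothetical sample: three URLs already split into (element, position) pairs.
url_tuples = [
    [("shop", 0), ("item", 1), ("id", 2), ("1001", 3)],
    [("shop", 0), ("item", 1), ("id", 2), ("1002", 3)],
    [("shop", 0), ("cart", 1), ("id", 2), ("1001", 3)],
]
frequent = frequent_one_itemset(url_tuples, first_threshold=1)
```

With a threshold of 1, pairs that occur in only one URL, such as `("cart", 1)`, are dropped. Using a ratio instead of a raw count would be an equally plausible reading of the claim.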
3. The method according to claim 1, characterized in that splitting the URL data comprises:
splitting the URL using preset character information, wherein the split URL includes directories, parameters, and parameter values.
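
As an illustration of the splitting in claim 3, the following sketch assumes that the preset characters are `/`, `?`, `&`, and `=`, and that the location information is the zero-based index of each element after the scheme and host are dropped. The function name `split_url` and the example URL are hypothetical.

```python
import re
from urllib.parse import urlsplit

def split_url(url, delimiters="/?&="):
    """Split the path and query of a URL on preset characters.

    Returns a list of (element, position) pairs covering directories,
    parameter names, and parameter values.
    """
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    pattern = "[" + re.escape(delimiters) + "]"
    # Empty strings produced by leading or adjacent delimiters are discarded.
    elements = [e for e in re.split(pattern, target) if e]
    return [(element, position) for position, element in enumerate(elements)]

tuples = split_url("http://example.com/shop/item?id=1001&sort=asc")
# tuples == [("shop", 0), ("item", 1), ("id", 2),
#            ("1001", 3), ("sort", 4), ("asc", 5)]
```

Because empty elements are discarded, a parameter with an empty value shifts the positions of the elements after it; an implementation that needs stable positions might keep a placeholder instead.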
4. The method according to claim 2, characterized in that compressing the URL data according to the frequency information on the occurrence of the elements corresponding to the binary groups, to obtain the pattern mode of the URL, comprises:
retaining the elements of the URL that correspond to binary groups in the frequent 1-itemset and replacing the other elements with a specific character, to obtain a candidate set of the URL, wherein the specific character is a wildcard character that a computer can recognize;
judging whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, using the candidate set as the pattern mode of the URL;
if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, using the original URL as the pattern mode of the URL.
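
The compression in claim 4 can be sketched as follows. This is one possible reading, not the patent's implementation: it assumes the wildcard is `*`, that the frequency checked against the second threshold is the number of URLs that collapse to the same candidate, and that binary groups are `(element, position)` pairs. The names `to_candidate`, `compress`, and `second_threshold` are hypothetical.

```python
from collections import Counter

WILDCARD = "*"  # assumed wildcard character

def to_candidate(tuples, frequent):
    """Keep elements whose (element, position) pair is frequent; mask the rest."""
    return tuple(el if (el, pos) in frequent else WILDCARD for el, pos in tuples)

def compress(url_tuples, frequent, second_threshold):
    """Return one pattern per URL: the candidate if common enough, else the original."""
    candidates = [to_candidate(t, frequent) for t in url_tuples]
    counts = Counter(candidates)
    patterns = []
    for tuples, candidate in zip(url_tuples, candidates):
        if counts[candidate] > second_threshold:
            patterns.append(candidate)
        else:
            # Fall back to the unmasked elements of the original URL.
            patterns.append(tuple(el for el, _ in tuples))
    return patterns

url_tuples = [
    [("shop", 0), ("item", 1), ("id", 2), ("1001", 3)],
    [("shop", 0), ("item", 1), ("id", 2), ("1002", 3)],
    [("shop", 0), ("cart", 1), ("id", 2), ("1001", 3)],
]
frequent = {("shop", 0), ("item", 1), ("id", 2)}
patterns = compress(url_tuples, frequent, second_threshold=1)
```

In this sample the first two URLs share the candidate `("shop", "item", "id", "*")` and are compressed to it, while the third URL's candidate occurs only once and so the original elements are kept.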
5. The method according to claim 1 or 2, characterized in that, after compressing the URL data according to the frequency information on the occurrence of each binary group to obtain the pattern mode of the URL, the method further comprises:
performing feature extraction using the pattern mode data of the URL as a training set for machine learning;
performing behavior pattern security analysis according to the extracted information to determine abnormal behavior patterns.
6. A web data analysis and processing device, characterized in that the device comprises:
an obtaining module, configured to obtain URL data of a target webpage;
a splitting module, configured to split the URL data obtained by the obtaining module to obtain a binary group set, the binary group set being the set of binary groups formed from the element information obtained by the splitting, wherein each group of element information corresponds to one binary group, and each binary group includes an element and the location information of that element;
a processing module, configured to compress the URL data according to frequency information on the occurrence of the elements corresponding to the binary groups, to obtain the pattern mode of the URL.
7. The device according to claim 6, characterized in that the device further comprises:
a screening module, configured to screen the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, the processing module is specifically configured to:
compress the URL data according to the frequency information on the occurrence of the elements corresponding to the binary groups in the frequent 1-itemset, to obtain the pattern mode of the URL.
8. The device according to claim 6, characterized in that the splitting module is configured to:
split the URL using preset character information, wherein the split URL includes directories, parameters, and parameter values.
9. The device according to claim 7, characterized in that the processing module is specifically configured to:
retain the elements of the URL that correspond to binary groups in the frequent 1-itemset and replace the other elements with a specific character, to obtain a candidate set of the URL, wherein the specific character is a wildcard character that a computer can recognize;
judge whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, use the candidate set as the pattern mode of the URL;
if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, use the original URL as the pattern mode of the URL.
10. The device according to claim 6 or 7, characterized in that the device further comprises:
a feature extraction module, configured to perform feature extraction using the pattern mode data of the URL as a training set for machine learning;
an abnormal behavior detection module, configured to perform behavior pattern security analysis according to the extracted information to determine abnormal behavior patterns.
CN201811084330.8A 2018-09-17 2018-09-17 Web data analysis and processing method and device Pending CN109408745A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811084330.8A CN109408745A (en) 2018-09-17 2018-09-17 Web data analysis and processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811084330.8A CN109408745A (en) 2018-09-17 2018-09-17 Web data analysis and processing method and device

Publications (1)

Publication Number Publication Date
CN109408745A true CN109408745A (en) 2019-03-01

Family

ID=65465024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811084330.8A Pending CN109408745A (en) 2018-09-17 2018-09-17 Web data analysis and processing method and device

Country Status (1)

Country Link
CN (1) CN109408745A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120158626A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Detection and categorization of malicious urls
CN104778164A (en) * 2014-01-09 2015-07-15 中国银联股份有限公司 Method and device for detecting repeated URL (Uniform Resource Locator)
CN103823892A (en) * 2014-03-10 2014-05-28 北京奇虎科技有限公司 Method and device of determining webpage clustering mode
CN105095209A (en) * 2014-04-21 2015-11-25 北京金山网络科技有限公司 Document clustering method, document clustering device and network equipment
CN105721427A (en) * 2016-01-14 2016-06-29 湖南大学 Method for mining attack frequent sequence mode from Web log
CN106095979A (en) * 2016-06-20 2016-11-09 百度在线网络技术(北京)有限公司 URL merging treatment method and apparatus

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111163053A (en) * 2019-11-29 2020-05-15 深圳市任子行科技开发有限公司 Malicious URL detection method and system
CN111163053B (en) * 2019-11-29 2022-05-03 深圳市任子行科技开发有限公司 Malicious URL detection method and system
CN111258574A (en) * 2020-01-14 2020-06-09 中科驭数(北京)科技有限公司 Programming method and system for accelerator architecture
CN111258574B (en) * 2020-01-14 2021-01-15 中科驭数(北京)科技有限公司 Programming method and system for accelerator architecture
CN115130023A (en) * 2022-07-08 2022-09-30 阿里巴巴(中国)有限公司 Regular expression generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109525595A (en) A kind of black production account recognition methods and equipment based on time flow feature
Shi et al. Detecting malicious social bots based on clickstream sequences
CN110351244A (en) A kind of network inbreak detection method and system based on multireel product neural network fusion
CN105809190B (en) A kind of SVM cascade classifier methods based on Feature Selection
CN109922052A (en) A kind of malice URL detection method of combination multiple characteristics
CN109408745A (en) Web data analysis and processing method and device
CN105069355A (en) Static detection method and apparatus for webshell deformation
CN108491714A (en) The man-machine recognition methods of identifying code
Sahlabadi et al. Detecting abnormal behavior in social network websites by using a process mining technique
CN103679030B (en) Malicious code analysis and detection method based on dynamic semantic features
CN111786950A (en) Situation awareness-based network security monitoring method, device, equipment and medium
CN109462575A (en) A kind of webshell detection method and device
Liao et al. Feature extraction and construction of application layer DDoS attack based on user behavior
CN108229170B (en) Software analysis method and apparatus using big data and neural network
CN103457909A (en) Botnet detection method and device
CN114422211A (en) HTTP malicious traffic detection method and device based on graph attention network
CN114422271B (en) Data processing method, device, equipment and readable storage medium
Hostiadi et al. Dataset for Botnet group activity with adaptive generator
CN111784360B (en) Anti-fraud prediction method and system based on network link backtracking
CN111200607A (en) Online user behavior analysis method based on multilayer LSTM
CN110457603A (en) Customer relationship abstracting method, device, electronic equipment and readable storage medium storing program for executing
CN115221509A (en) Authentication behavior portrait method based on 5W1H account
Pan Network security and user abnormal behavior detection by using deep neural network
CN114676428A (en) Application program malicious behavior detection method and device based on dynamic characteristics
Sun et al. Visual analytics for anomaly classification in LAN based on deep convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190301