CN109408745A - Web data analysis and processing method and device - Google Patents
Web data analysis and processing method and device
- Publication number
- CN109408745A (application number CN201811084330.8)
- Authority
- CN
- China
- Prior art keywords
- url
- binary group
- data
- information
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The embodiments of the present invention disclose a web data analysis and processing method and device. The method comprises: obtaining the URL data of a target web page; splitting the URL data to obtain a binary group set, that is, a set of binary groups (2-tuples) formed from the element information produced by the split, where each group of element information corresponds to one binary group and each binary group contains an element and the position information of that element; and compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL. The method can compress the large volume of URL data in the access data of various Web applications into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes greatly reduces the amount of data processing and computation, and the processed pattern-mode data can be used directly for manual inspection and security analysis.
Description
Technical field
The present invention relates to the technical field of computer information security, and in particular to a web data analysis and processing method and device.
Background technique
With the explosive growth of e-commerce websites, there is an increasingly strong demand for statistical analysis of the data of mainstream retail e-commerce websites, such as access records, average daily visits and security.
The traditional statistical method is based on the uniform resource locator (Uniform Resource Locator, URL): identical URLs are merged and the total number of visits to each URL is computed, which yields the access records of the whole website and the number of visits in a given time period. Because few URLs repeat exactly, the statistical result is enormous and is difficult to scan and analyze quickly. Security analysis of website access, in turn, is based on regular-expression analysis, in which access URLs are matched against regular expressions. Malicious attack behavior generally shows direct features in the URL, so matching against specific regular expressions can determine whether an access URL is a malicious attack. However, different access URLs may correspond to the same attack type, and a single URL may even contain several attack types. For business systems with hundreds or thousands of server nodes, millions of users and millions of URLs, in an Internet field where business iterates quickly and URLs change frequently, this method requires a huge manual investment and maintenance effort and is difficult to put into practice.
Summary of the invention
To solve the above technical problem, the embodiments of the present invention aim to provide a web data analysis and processing method and device that reduce the amount of data processing and computation for network access data.
The technical scheme of the present invention is realized as follows:
A web data analysis and processing method, the method comprising:
obtaining the URL data of a target web page;
splitting the URL data to obtain a binary group set, the binary group set being a set of binary groups (2-tuples) formed from the element information produced by the split, wherein each group of element information corresponds to one binary group, and each binary group comprises an element and the position information of that element;
compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL.
In the above scheme, after the URL data is split to obtain the binary group set, the method further comprises:
screening the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL, specifically comprises:
compressing the URL data according to the frequency information with which the elements corresponding to the binary groups in the frequent 1-itemset occur, to obtain the pattern mode of the URL.
In the above scheme, splitting the URL data comprises:
splitting the URL using preset character information, wherein the split URL comprises directories, parameters and parameter values.
In the above scheme, compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL, comprises:
retaining, in the URL, the elements corresponding to the binary groups of the frequent 1-itemset and replacing the other elements with a specific character, to obtain the candidate set of the URL, wherein the specific character is a wildcard character that a computer can recognize;
judging whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, using the candidate set as the pattern mode of the URL;
if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, using the original URL as the pattern mode of the URL.
In the above scheme, after the URL data is compressed according to the frequency information with which each binary group occurs to obtain the pattern mode of the URL, the method further comprises:
performing feature extraction using the pattern-mode data of the URL as the training set for machine learning;
performing behavior pattern security analysis according to the extracted information, and determining abnormal behavior patterns.
The embodiments of the present invention also provide a web data analysis and processing device, the device comprising:
an obtaining module, configured to obtain the URL data of a target web page;
a splitting module, configured to split the URL data obtained by the obtaining module to obtain a binary group set, the binary group set being a set of binary groups formed from the element information produced by the split, wherein each group of element information corresponds to one binary group, and each binary group comprises an element and the position information of that element;
a processing module, configured to compress the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL.
In the above scheme, the device further comprises:
a screening module, configured to screen the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, the processing module is specifically configured to:
compress the URL data according to the frequency information with which the elements corresponding to the binary groups in the frequent 1-itemset occur, to obtain the pattern mode of the URL.
In the above scheme, the splitting module is configured to:
split the URL using preset character information, wherein the split URL comprises directories, parameters and parameter values.
In the above scheme, the processing module is specifically configured to:
retain, in the URL, the elements corresponding to the binary groups of the frequent 1-itemset and replace the other elements with a specific character, to obtain the candidate set of the URL, wherein the specific character is a wildcard character that a computer can recognize;
judge whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, use the candidate set as the pattern mode of the URL;
if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, use the original URL as the pattern mode of the URL.
In the above scheme, the device further comprises:
a feature extraction module, configured to perform feature extraction using the pattern-mode data of the URL as the training set for machine learning;
an abnormal behavior detection module, configured to perform behavior pattern security analysis according to the extracted information and determine abnormal behavior patterns.
The embodiments of the present invention provide a web data analysis and processing method and device. After obtaining multiple pieces of URL data of a target web page, the method splits the obtained URL data to obtain a binary group set, and then compresses the URL data according to the frequency information with which the binary groups in the set occur, to obtain the pattern mode of the URL. A large volume of URL data in the access data of various Web applications can thereby be compressed into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes greatly reduces the amount of data processing and computation, and the processed pattern-mode data can be used directly for manual inspection and security analysis.
Brief description of the drawings
Fig. 1 is a flow chart of embodiment one of the data analysis and processing method provided by the present invention;
Fig. 2 is a flow chart of embodiment two of the web data analysis and processing method provided by the present invention;
Fig. 3 is a flow chart of embodiment three of the web data analysis and processing method provided by the present invention;
Fig. 4 is a flow chart of embodiment four of the web data analysis and processing method provided by the present invention;
Fig. 5 is a structural schematic diagram of embodiment one of the web data analysis and processing device provided by the present invention;
Fig. 6 is a structural schematic diagram of embodiment two of the web data analysis and processing device provided by the present invention;
Fig. 7 is a structural schematic diagram of embodiment three of the web data analysis and processing device provided by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention.
Fig. 1 is a flow chart of embodiment one of the data analysis and processing method provided by the present invention. As shown in Fig. 1, the method includes:
Step 101: obtain the URL data of the target web page.
In this step, the URL data of the target web page is obtained first. The target web page is a World Wide Web (WEB) site to be monitored, such as a retail e-commerce website, and the URL data is the access data of each IP user for each web page over a period of time. The same IP user corresponds to different URL addresses on different websites, and different IP users may also have multiple pieces of URL data on the same website. The URL data in this step includes statistics covering all of these cases. For example, the following access log is obtained:
10.112.73.45 - - [02/Feb/2018:09:11:20 +0800] "GET /dev/codeCheck?userName=18900964569&timestamp=1517533878415 HTTP/1.0" 200 1983;
10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "POST /login HTTP/1.0" 200 226;
10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "GET / HTTP/1.0" 200 10400;
10.112.73.45 - - [02/Feb/2018:09:11:28 +0800] "GET /pushlet.srv?p_event=join-listen&p_format=xml-strict&p_mode=pull&p_subject=/device/844274E3EAD8A3A5A9EAEBF23DEE3E5B HTTP/1.0" 200 260;
10.112.73.46 - - [02/Feb/2018:09:11:35 +0800] "GET /auth/products/index HTTP/1.0" 200 13127;
10.112.73.46 - - [02/Feb/2018:09:11:36 +0800] "GET /pushlet.srv?p_event=join-listen&p_format=xml-strict&p_mode=pull&p_subject=/device/844274E3EAD8A3A5A9EAEBF23DEE3E5B HTTP/1.0" 200 260;
10.112.73.46 - - [02/Feb/2018:09:11:44 +0800] "GET /auth/product/detail/2301 HTTP/1.0" 200 37601;
10.112.73.46 - - [02/Feb/2018:09:11:45 +0800] "GET /v1.0/product/2301?_=1517533902735 HTTP/1.0" 200 1196;
In the above log, one IP address represents one user, and the URLs in the log entries for an IP represent that user's URL access behavior; multiple log entries represent the different access behaviors of different IP users. A chronologically ordered series of URL access records therefore describes the different behavior paths of different users.
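The grouping of log entries into per-user behavior paths can be sketched as follows. This is a minimal illustration, not the method of the invention: it assumes the log follows the Common Log Format layout, and the names `LOG_LINE` and `group_urls_by_ip` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: IP - - [timestamp] "METHOD URL PROTOCOL" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def group_urls_by_ip(lines):
    """Return {ip: [(method, url), ...]} in the order the lines appear."""
    paths = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        paths[match["ip"]].append((match["method"], match["url"]))
    return dict(paths)

# Sample entries taken from the log excerpt, normalized to the assumed layout.
sample = [
    '10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "POST /login HTTP/1.0" 200 226',
    '10.112.73.45 - - [02/Feb/2018:09:11:27 +0800] "GET / HTTP/1.0" 200 10400',
    '10.112.73.46 - - [02/Feb/2018:09:11:35 +0800] "GET /auth/products/index HTTP/1.0" 200 13127',
]
behavior_paths = group_urls_by_ip(sample)
```

Each value in `behavior_paths` is the time-ordered request sequence of one IP, which is the input the later steps operate on.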
Step 102: split the URL data to obtain a binary group set, the binary group set being a set of binary groups formed from the element information produced by the split, wherein each group of element information corresponds to one binary group, and each binary group comprises an element and the position information of that element.
In this step, the URL data obtained in step 101 is split to obtain a binary group set. The elements of the set are binary groups; each binary group corresponds to one group of element information after the split and contains an element and the position information of that element. For example, in the above log, "GET" is one group of element information and can be represented by the binary group P1 = {"GET", 01}, which indicates that P1 contains the element "GET" and its position information 01 in this URL.
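A minimal sketch of the binary group representation is given below. It assumes that the position information is the 1-based ordinal of the element within the request (the text writes it as 01), and the function name `to_binary_groups` is hypothetical.

```python
import re

# Delimiters assumed for the split: slash, question mark, equals sign, ampersand.
DELIMITERS = re.compile(r"[/?=&]")

def to_binary_groups(request):
    """Split a request such as 'GET /a/b?x=1' into (element, position) pairs."""
    method, _, url = request.partition(" ")
    elements = [method] + [part for part in DELIMITERS.split(url) if part]
    # Position is the 1-based index of the element (an assumption of this sketch).
    return [(element, position) for position, element in enumerate(elements, start=1)]

groups = to_binary_groups("GET /auth/products/index")
```

Keeping the position alongside the element means the same string at two different depths of a URL is counted as two different binary groups.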
Step 103: compress the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL.
In this step, after the processing of steps 101 and 102, each URL has been split into multiple binary groups, and the URL data is compressed according to the frequency information with which the elements corresponding to these binary groups occur. For example, for binary groups that occur 3 times within a specific time period, the corresponding elements are retained in the URL and the other elements are replaced with the wildcard * or another replacement character, which yields the pattern mode of the URL. A pattern mode is a URL in which certain positions have been replaced by the wildcard * or another character; URLs that share the same form after this replacement are said to have the same pattern mode.
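The replacement itself can be sketched as follows, assuming the set of frequent binary groups has already been determined. The function name `compress`, the choice to join elements with "/", and the second product id 2302 are assumptions made for illustration only.

```python
def compress(groups, frequent, wildcard="*"):
    """Keep elements whose (element, position) pair is frequent; mask the rest."""
    return "/".join(
        element if (element, position) in frequent else wildcard
        for element, position in groups
    )

# Binary groups assumed to have met the frequency condition.
frequent = {("GET", 1), ("auth", 2), ("product", 3), ("detail", 4)}

first = [("GET", 1), ("auth", 2), ("product", 3), ("detail", 4), ("2301", 5)]
second = [("GET", 1), ("auth", 2), ("product", 3), ("detail", 4), ("2302", 5)]  # hypothetical id

pattern_a = compress(first, frequent)
pattern_b = compress(second, frequent)
```

Both requests collapse to the same string, which is the sense in which many URLs are compressed into one pattern mode.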
In this embodiment, after multiple pieces of URL data of the target web page are obtained, the obtained URL data is split to obtain a binary group set, and the URL data is then compressed according to the frequency information with which the element corresponding to each binary group occurs, to obtain the pattern mode of the URL. A large volume of URL data in the access data of various Web applications can thereby be compressed into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes greatly reduces the amount of data processing and computation, and the processed pattern-mode data can be used directly for manual inspection and security analysis.
It should be noted that the necessary character information is the information that is retained because it occurs often enough. It may be the name of a directory level, the name of a passed parameter, or the value of a passed parameter. This information is retained precisely because it occurs often enough, and such frequently occurring behavior is exactly what is to be observed.
Fig. 2 is a flow chart of embodiment two of the web data analysis and processing method provided by the present invention. As shown in Fig. 2, on the basis of embodiment one, after step 102 splits the URL data to obtain the binary group set, the method further includes:
Step 1021: screen the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, step 103, compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur to obtain the pattern mode of the URL, is specifically:
compressing the URL data according to the frequency information with which the elements corresponding to the binary groups in the frequent 1-itemset occur, to obtain the pattern mode of the URL.
Specifically, after step 102 of embodiment one, the resulting binary group set is screened: if the frequency with which a binary group in the set occurs is greater than the first preset threshold, the binary group is retained; otherwise it is removed. This yields the frequent 1-itemset of binary groups, that is, the set of binary groups whose frequency of occurrence is greater than the first preset threshold. The first preset threshold may be set by the user or may be a default value, depending on the user's needs.
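The screening step can be sketched with a simple counter. This sketch assumes that the frequency of a binary group is the number of URLs that contain it; the function name `frequent_one_itemset` and the second product id 2302 are hypothetical.

```python
from collections import Counter

def frequent_one_itemset(urls_as_groups, first_threshold):
    """Return the binary groups that occur in more than first_threshold URLs."""
    counts = Counter(
        group for groups in urls_as_groups for group in set(groups)
    )
    return {group for group, count in counts.items() if count > first_threshold}

urls_as_groups = [
    [("GET", 1), ("auth", 2), ("product", 3), ("detail", 4), ("2301", 5)],
    [("GET", 1), ("auth", 2), ("product", 3), ("detail", 4), ("2302", 5)],  # hypothetical id
    [("GET", 1), ("auth", 2), ("products", 3), ("index", 4)],
]
itemset = frequent_one_itemset(urls_as_groups, first_threshold=1)
```

With a threshold of 1, the product ids and the single `products/index` request are dropped, while the shared prefix elements are kept.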
Further, splitting the URL data in step 102 includes:
splitting the URL using preset character information, wherein the split URL comprises directories, parameters and parameter values.
The preset character information includes any of the following: slash, question mark, equals sign or "&".
Specifically, specific characters such as "/", "?", "=" and "&" can be used for the split; these symbols divide the URL into meaningful parts such as directories, parameters and parameter values. For example, splitting the entry "GET /auth/products/index HTTP/1.0" 200 13127 on "/" gives "GET", "auth", "products", "index HTTP", "1.0" and so on.
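To show how the split parts map to directories, parameters and parameter values, the sketch below labels each part. It uses Python's standard `urllib.parse` module rather than the raw delimiter split described above, which is a substitution made for this illustration; the function name `label_url_parts` is hypothetical. Note that `parse_qsl` also decodes percent-encoded characters, which a plain delimiter split would not do.

```python
from urllib.parse import urlsplit, parse_qsl

def label_url_parts(url):
    """Return (kind, text) pairs: directories first, then parameter/value pairs."""
    parts = urlsplit(url)
    labelled = [("directory", segment) for segment in parts.path.split("/") if segment]
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        labelled.append(("parameter", name))
        labelled.append(("value", value))
    return labelled

labelled = label_url_parts("/dev/codeCheck?userName=18900964569&timestamp=1517533878415")
```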
Fig. 3 is a flow chart of embodiment three of the web data analysis and processing method provided by the present invention. As shown in Fig. 3, on the basis of embodiment two, step 103, compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur to obtain the pattern mode of the URL, includes:
Step 1031: retain, in the URL, the elements corresponding to the binary groups of the frequent 1-itemset and replace the other elements with a specific character, to obtain the candidate set of the URL, wherein the specific character is a wildcard character that a computer can recognize.
In this step, the URL is compressed. Specifically, the elements corresponding to the binary groups in the frequent 1-itemset of embodiment two are retained, and the other elements are replaced with a wildcard character such as "*", which yields the candidate set of the URL.
Step 1032: judge whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold.
The second preset threshold may be set by the user or may be a default value, depending on the user's needs.
Step 1033: if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, use the candidate set as the pattern mode of the URL.
Step 1034: if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, use the original URL as the pattern mode of the URL.
In steps 1032 to 1034, the elements in the candidate set are screened further: it is judged whether the frequency with which the elements in the candidate set occur is greater than the second preset threshold. If it is, the candidate set is used as the pattern mode of the URL; otherwise no processing is done and the original URL is used as its pattern mode.
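Steps 1032 to 1034 can be sketched as follows. The sketch reads "the frequency with which the elements in the candidate set occur" as the number of URLs that produce the same candidate string, which is an interpretation rather than something the text states; the function name `final_patterns` and the id 2302 are hypothetical.

```python
from collections import Counter

def final_patterns(originals, candidates, second_threshold):
    """Use the candidate when it is frequent enough, else fall back to the original URL."""
    counts = Counter(candidates)
    return [
        candidate if counts[candidate] > second_threshold else original
        for original, candidate in zip(originals, candidates)
    ]

originals = [
    "/auth/product/detail/2301",
    "/auth/product/detail/2302",  # hypothetical id
    "/v1.0/product/2301",
]
candidates = [
    "/auth/product/detail/*",
    "/auth/product/detail/*",
    "/*/product/*",
]
patterns = final_patterns(originals, candidates, second_threshold=1)
```

The fallback keeps rare requests visible in their original form instead of hiding them behind a wildcard pattern that only one URL matches.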
In this embodiment, the elements in the candidate set are screened further, and the elements whose frequency of occurrence meets the second preset threshold are retained as the final pattern mode of the URL, so that the URLs are compressed into a small number of pattern modes that retain the necessary character information. Presenting the data through these few pattern modes allows manual inspection and security analysis to be carried out directly and greatly reduces the amount of data processing and computation.
Fig. 4 is a flow chart of embodiment four of the web data analysis and processing method provided by the present invention. As shown in Fig. 4, on the basis of embodiment one or embodiment two, after step 103 compresses the URL data according to the frequency information with which each binary group occurs to obtain the pattern mode of the URL, the method further includes:
Step 104: perform feature extraction using the pattern-mode data of the URL as the training set for machine learning.
Step 105: perform behavior pattern security analysis according to the extracted information and determine abnormal behavior patterns.
In steps 104 and 105, the pattern modes of the URLs are input as the training set for machine learning and features are extracted; user behavior patterns are then modeled for security analysis, so that abnormal user behavior patterns are discovered.
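The text does not specify which features are extracted. One simple possibility, shown purely as an assumption, is a count vector per IP over the observed pattern modes; the function name `pattern_count_vectors` and the sample IP addresses are hypothetical.

```python
def pattern_count_vectors(patterns_by_ip):
    """Return (vocabulary, {ip: counts}) with one count per pattern in the vocabulary."""
    vocabulary = sorted(
        {pattern for patterns in patterns_by_ip.values() for pattern in patterns}
    )
    vectors = {
        ip: [patterns.count(pattern) for pattern in vocabulary]
        for ip, patterns in patterns_by_ip.items()
    }
    return vocabulary, vectors

patterns_by_ip = {
    "10.0.0.1": ["GET /", "GET /", "GET /auth/*"],  # hypothetical users
    "10.0.0.2": ["GET /"],
}
vocabulary, vectors = pattern_count_vectors(patterns_by_ip)
```

Because the vocabulary is built from pattern modes rather than raw URLs, the vectors stay short even when the underlying log contains many distinct URLs.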
A large number of URLs are summarized into particular categories, and the nature of the original data is judged by analyzing these abstracted categories. In user behavior analysis, for example, association rules can be used, and it may not be necessary to know exactly which web page a user accessed: a website may have many subdomains and related URLs, and different users rarely repeat exactly the same behavior. By determining which pattern mode each URL accessed by a user belongs to, the security of the user's behavior can be analyzed and judged quickly.
Single-user behavior pattern analysis is based on the series of URL access records made by one user over a period of time. It covers, for example, which URLs were accessed, in what order and how frequently, which services were accessed (URLs with similar structure and main content are regarded as the same service), and whether the set of access behaviors shows similarity.
On the basis of single-user behavior pattern analysis, the URL access behaviors of multiple user IPs within the same period of time are analyzed in association, including which IPs did what most IPs did and which IPs did what only a few IPs did, so as to achieve multi-user IP behavior pattern analysis.
The behavior patterns of different users are compared to find the minority of abnormal user behavior patterns, and a qualitative security analysis is made of the abnormal behavior patterns, so as to achieve user behavior pattern security analysis.
In this embodiment, the pattern-mode data of the URLs after compression is small. A machine learning algorithm uses the compressed URLs as the training set to extract features and carry out user behavior pattern security analysis, which greatly reduces the training time of the model and improves the data processing efficiency of machine learning.
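The multi-user comparison can be illustrated with a simple rarity heuristic. This is a stand-in for the unspecified machine learning model, not the analysis of the invention: it flags IPs that requested a pattern mode used by only a small share of all IPs. The function name `rare_pattern_users`, the `min_share` parameter, the sample IPs and the sample patterns are all hypothetical.

```python
from collections import Counter

def rare_pattern_users(patterns_by_ip, min_share=0.5):
    """Flag IPs that requested a pattern used by fewer than min_share of all IPs."""
    total = len(patterns_by_ip)
    users_per_pattern = Counter(
        pattern for patterns in patterns_by_ip.values() for pattern in set(patterns)
    )
    flagged = {}
    for ip, patterns in patterns_by_ip.items():
        rare = sorted(
            {p for p in patterns if users_per_pattern[p] / total < min_share}
        )
        if rare:
            flagged[ip] = rare
    return flagged

patterns_by_ip = {
    "10.0.0.1": ["GET /", "GET /auth/product/detail/*"],
    "10.0.0.2": ["GET /", "GET /auth/product/detail/*"],
    "10.0.0.3": ["GET /", "GET /admin/*"],
}
flagged = rare_pattern_users(patterns_by_ip)
```

A flagged IP is only a candidate for the qualitative security analysis described above; rarity alone does not establish that the behavior is malicious.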
The present invention also provides a web data analysis and processing device. Fig. 5 is a structural schematic diagram of embodiment one of the web data analysis and processing device provided by the present invention. As shown in Fig. 5, the device includes:
an obtaining module 11, configured to obtain the URL data of a target web page;
a splitting module 12, configured to split the URL data obtained by the obtaining module to obtain a binary group set, the binary group set being a set of binary groups formed from the element information produced by the split, wherein each group of element information corresponds to one binary group, and each binary group comprises an element and the position information of that element;
a processing module 13, configured to compress the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL.
The web data analysis and processing device of this embodiment is the device embodiment corresponding to the web data analysis and processing method of embodiment one. Its principle and effect are similar and are not repeated here.
Fig. 6 is a structural schematic diagram of embodiment two of the web data analysis and processing device provided by the present invention. As shown in Fig. 6, the device further includes:
a screening module 14, configured to screen the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, the processing module 13 is specifically configured to:
compress the URL data according to the frequency information with which the elements corresponding to the binary groups in the frequent 1-itemset occur, to obtain the pattern mode of the URL.
Further, the splitting module 12 is configured to:
split the URL using preset character information, wherein the split URL comprises directories, parameters and parameter values.
Further, the processing module 13 is specifically configured to:
retain, in the URL, the elements corresponding to the binary groups of the frequent 1-itemset and replace the other elements with a specific character, to obtain the candidate set of the URL, wherein the specific character is a wildcard character that a computer can recognize;
judge whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, use the candidate set as the pattern mode of the URL;
if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, use the original URL as the pattern mode of the URL.
Fig. 7 is a structural schematic diagram of embodiment three of the web data analysis and processing device provided by the present invention. As shown in Fig. 7, the device further includes:
a feature extraction module 15, configured to perform feature extraction using the pattern-mode data of the URL as the training set for machine learning;
an abnormal behavior detection module 16, configured to perform behavior pattern security analysis according to the extracted information and determine abnormal behavior patterns.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. The present invention may therefore take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention.
Claims (10)
1. A web data analysis and processing method, characterized in that the method comprises:
obtaining the uniform resource locator (URL) data of a target web page;
splitting the URL data to obtain a binary group set, the binary group set being a set of binary groups formed from the element information produced by the split, wherein each group of element information corresponds to one binary group, and each binary group comprises an element and the position information of that element;
compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL.
2. The method according to claim 1, characterized in that, after the URL data is split to obtain the binary group set, the method further comprises:
screening the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, compressing the URL data according to the frequency information with which the elements corresponding to the binary groups occur, to obtain the pattern mode of the URL, is specifically:
compressing the URL data according to the frequency information with which the elements corresponding to the binary groups in the frequent 1-itemset occur, to obtain the pattern mode of the URL.
3. the method according to claim 1, wherein it is described by the url data carry out split include:
The URL is split using preset character information;Wherein, the URL after fractionation includes catalogue, parameter and parameter
Value.
4. according to the method described in claim 2, it is characterized in that, the frequency occurred according to the corresponding element of the binary group
Rate information carries out compression processing to the url data, and the pattern mode for obtaining the URL includes:
The corresponding element of binary group of a frequent item collection described in the URL is retained, other elements utilize specific character
It is replaced, obtains the Candidate Set of the URL, wherein the specific character is the universal character that computer can identify;
Whether the frequency for judging that the element in the Candidate Set occurs is greater than the second preset threshold;
If the frequency that the element in the Candidate Set occurs is greater than the second preset threshold, using the Candidate Set as the URL
Pattern mode;
If the frequency that the element in the Candidate Set occurs is less than or equal to the second preset threshold, use original URL as described in
The pattern mode of URL.
5. method according to claim 1 or 2, which is characterized in that the frequency occurred according to each binary group
Information carries out compression processing to the url data, after obtaining the pattern mode of the URL further include:
Feature extraction is carried out using the pattern mode data of the URL as the training set of machine learning;
Behavior pattern safety analysis is carried out according to the information after extraction, determines abnormal behaviour mode.
6. A web data analysis and processing device, characterized in that the device comprises:
an obtaining module, configured to obtain URL data of a target webpage;
a splitting module, configured to split the URL data obtained by the obtaining module to obtain a binary group set, the binary group set being the set of binary groups formed from the element information obtained by the splitting, wherein each group of element information corresponds to one binary group, and each binary group comprises an element and the location information of that element;
a processing module, configured to compress the URL data according to frequency information on the occurrence of the elements corresponding to the binary groups, to obtain the pattern mode of the URL.
7. The device according to claim 6, characterized in that the device further comprises:
a screening module, configured to screen the binary groups in the binary group set according to a preset condition to obtain a frequent 1-itemset, the preset condition being that the frequency with which a binary group occurs is greater than a first preset threshold;
correspondingly, the processing module is specifically configured to:
compress the URL data according to frequency information on the occurrence of the elements corresponding to the binary groups in the frequent 1-itemset, to obtain the pattern mode of the URL.
8. The device according to claim 6, characterized in that the splitting module is configured to:
split the URL using preset character information, wherein the split URL comprises directories, parameters, and parameter values.
9. The device according to claim 7, characterized in that the processing module is specifically configured to:
retain the elements of the URL that correspond to binary groups in the frequent 1-itemset and replace the other elements with a specific character, to obtain a candidate set of the URL, wherein the specific character is a wildcard character that a computer can recognize;
judge whether the frequency with which the elements in the candidate set occur is greater than a second preset threshold;
if the frequency with which the elements in the candidate set occur is greater than the second preset threshold, use the candidate set as the pattern mode of the URL;
if the frequency with which the elements in the candidate set occur is less than or equal to the second preset threshold, use the original URL as the pattern mode of the URL.
10. The device according to claim 6 or 7, characterized in that the device further comprises:
a feature extraction module, configured to perform feature extraction using the pattern mode data of the URL as a training set for machine learning;
an abnormal behavior detection module, configured to perform behavior pattern security analysis according to the extracted information to determine abnormal behavior patterns.
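The flow recited in claims 1 to 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and several details are assumptions made only for illustration: the function names (`split_url`, `to_candidate`, `mine_patterns`), the position labels (`dir0`, `name`, `value:<parameter>`), the use of `*` as the wildcard, the use of raw occurrence counts as "frequency", and the reading of the second threshold as a count of how often a whole candidate occurs. The scheme and host are ignored.

```python
from collections import Counter
from urllib.parse import urlsplit

WILDCARD = "*"  # assumed wildcard; the claims only require a recognizable wildcard character


def split_url(url):
    """Split a URL into binary groups of (element, position).

    Position labels are hypothetical: 'dir<N>' for the N-th directory,
    'name' for a parameter name, 'value:<name>' for that parameter's value.
    """
    parts = urlsplit(url)
    groups = []
    segments = [s for s in parts.path.split("/") if s]
    for depth, segment in enumerate(segments):
        groups.append((segment, f"dir{depth}"))
    for pair in parts.query.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        groups.append((name, "name"))
        groups.append((value, f"value:{name}"))
    return groups


def to_candidate(url, frequent):
    """Keep elements whose binary group is frequent; replace the rest with the wildcard."""
    path, query = [], []
    for element, position in split_url(url):
        shown = element if (element, position) in frequent else WILDCARD
        if position.startswith("dir"):
            path.append(shown)
        elif position == "name":
            query.append(shown)        # start a new name=value pair
        else:
            query[-1] += "=" + shown   # attach the value to its parameter name
    candidate = "/" + "/".join(path)
    return candidate + ("?" + "&".join(query) if query else "")


def mine_patterns(urls, first_threshold, second_threshold):
    """Return one pattern per input URL, in input order."""
    # Count each binary group once per URL (assumption: frequency = number of URLs).
    counts = Counter(g for url in urls for g in set(split_url(url)))
    # Frequent 1-itemset: binary groups occurring more often than the first threshold.
    frequent = {g for g, n in counts.items() if n > first_threshold}
    candidates = [to_candidate(url, frequent) for url in urls]
    candidate_counts = Counter(candidates)
    # Keep a candidate only if it occurs often enough; otherwise keep the original URL.
    return [
        cand if candidate_counts[cand] > second_threshold else url
        for url, cand in zip(urls, candidates)
    ]


if __name__ == "__main__":
    sample = [
        "/shop/item?id=1&page=2",
        "/shop/item?id=7&page=2",
        "/shop/item?id=9&page=3",
        "/admin/login?user=bob",
    ]
    for pattern in mine_patterns(sample, first_threshold=1, second_threshold=1):
        print(pattern)
```

With both thresholds set to 1, the first two sample URLs collapse into `/shop/item?id=*&page=2`, while the other two keep their original form because their candidates occur only once. In practice the thresholds would be tuned to the volume of access data.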
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811084330.8A CN109408745A (en) | 2018-09-17 | 2018-09-17 | Web data analysis and processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811084330.8A CN109408745A (en) | 2018-09-17 | 2018-09-17 | Web data analysis and processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109408745A (en) | 2019-03-01 |
Family
ID=65465024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811084330.8A Pending CN109408745A (en) | 2018-09-17 | 2018-09-17 | Web data analysis and processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408745A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111163053A (en) * | 2019-11-29 | 2020-05-15 | 深圳市任子行科技开发有限公司 | Malicious URL detection method and system |
CN111258574A (en) * | 2020-01-14 | 2020-06-09 | 中科驭数(北京)科技有限公司 | Programming method and system for accelerator architecture |
CN115130023A (en) * | 2022-07-08 | 2022-09-30 | 阿里巴巴(中国)有限公司 | Regular expression generation method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120158626A1 (en) * | 2010-12-15 | 2012-06-21 | Microsoft Corporation | Detection and categorization of malicious urls |
CN103823892A (en) * | 2014-03-10 | 2014-05-28 | 北京奇虎科技有限公司 | Method and device of determining webpage clustering mode |
CN104778164A (en) * | 2014-01-09 | 2015-07-15 | 中国银联股份有限公司 | Method and device for detecting repeated URL (Uniform Resource Locator) |
CN105095209A (en) * | 2014-04-21 | 2015-11-25 | 北京金山网络科技有限公司 | Document clustering method, document clustering device and network equipment |
CN105721427A (en) * | 2016-01-14 | 2016-06-29 | 湖南大学 | Method for mining attack frequent sequence mode from Web log |
CN106095979A (en) * | 2016-06-20 | 2016-11-09 | 百度在线网络技术(北京)有限公司 | URL merging treatment method and apparatus |
- 2018-09-17: CN application CN201811084330.8A (CN109408745A), status: Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111163053A (en) * | 2019-11-29 | 2020-05-15 | 深圳市任子行科技开发有限公司 | Malicious URL detection method and system |
CN111163053B (en) * | 2019-11-29 | 2022-05-03 | 深圳市任子行科技开发有限公司 | Malicious URL detection method and system |
CN111258574A (en) * | 2020-01-14 | 2020-06-09 | 中科驭数(北京)科技有限公司 | Programming method and system for accelerator architecture |
CN111258574B (en) * | 2020-01-14 | 2021-01-15 | 中科驭数(北京)科技有限公司 | Programming method and system for accelerator architecture |
CN115130023A (en) * | 2022-07-08 | 2022-09-30 | 阿里巴巴(中国)有限公司 | Regular expression generation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109525595A (en) | A kind of black production account recognition methods and equipment based on time flow feature | |
Shi et al. | Detecting malicious social bots based on clickstream sequences | |
CN110351244A (en) | A kind of network inbreak detection method and system based on multireel product neural network fusion | |
CN105809190B (en) | A kind of SVM cascade classifier methods based on Feature Selection | |
CN109922052A (en) | A kind of malice URL detection method of combination multiple characteristics | |
CN109408745A (en) | Web data analysis and processing method and device | |
CN105069355A (en) | Static detection method and apparatus for webshell deformation | |
CN108491714A (en) | The man-machine recognition methods of identifying code | |
Sahlabadi et al. | Detecting abnormal behavior in social network websites by using a process mining technique | |
CN103679030B (en) | Malicious code analysis and detection method based on dynamic semantic features | |
CN111786950A (en) | Situation awareness-based network security monitoring method, device, equipment and medium | |
CN109462575A (en) | A kind of webshell detection method and device | |
Liao et al. | Feature extraction and construction of application layer DDoS attack based on user behavior | |
CN108229170B (en) | Software analysis method and apparatus using big data and neural network | |
CN103457909A (en) | Botnet detection method and device | |
CN114422211A (en) | HTTP malicious traffic detection method and device based on graph attention network | |
CN114422271B (en) | Data processing method, device, equipment and readable storage medium | |
Hostiadi et al. | Dataset for Botnet group activity with adaptive generator | |
CN111784360B (en) | Anti-fraud prediction method and system based on network link backtracking | |
CN111200607A (en) | Online user behavior analysis method based on multilayer LSTM | |
CN110457603A (en) | Customer relationship abstracting method, device, electronic equipment and readable storage medium storing program for executing | |
CN115221509A (en) | Authentication behavior portrait method based on 5W1H account | |
Pan | Network security and user abnormal behavior detection by using deep neural network | |
CN114676428A (en) | Application program malicious behavior detection method and device based on dynamic characteristics | |
Sun et al. | Visual analytics for anomaly classification in LAN based on deep convolutional neural network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190301 |