URL analysis method, device, equipment and medium
Technical field
The present invention relates to the technical field of network security, and more particularly to a URL analysis method, device, equipment, and medium.
Background art
A traditional gateway provides Internet access but cannot inspect the websites its users visit, so its security level is low. As traffic grows, users frequently encounter dangerous websites, such as phishing sites and sites carrying Trojans or viruses, and may be attacked or infected when visiting them. Intelligent gateways with firewall functions have therefore appeared; they do not reduce the operating efficiency of the user's smart devices and can protect every device connected to the same gateway. While providing Internet access, such a gateway also records traces of the user's browsing, and the URL is one of them. By inspecting the URL data of the websites a user visits, the gateway can record possible threats, warn the user that those websites are dangerous, or block access to websites containing Trojans, which protects the user from most threat attacks. In addition, some intelligent gateways analyze URL data so that users can review their own Internet behavior.
However, in current intelligent gateways, threat detection and Internet behavior analysis are implemented separately, or only one of the two functions is provided. As a result, behavior analysis also processes URL data that carries threats or high risk, such as phishing sites and pages containing Trojans, and treats these sites like any safe website. In many cases these sites have already been intercepted and were never actually browsed by the user, or the content of these dangerous URLs is used to build the user behavior database, which makes the behavior analysis results inaccurate. Furthermore, the behavior library must be updated manually, which is inefficient, and URLs that are absent from the behavior library are difficult to classify accurately.
Summary of the invention
To overcome the deficiencies of the prior art, a first object of the present invention is to provide a URL analysis method that performs threat analysis first and behavior analysis second, and that uses keyword analysis to obtain accurate URL behavior analysis results.
The first object of the present invention is achieved by the following technical solution:
A URL analysis method, comprising the following steps:
receiving URL data and storing the URL data in a URL database;
matching the URL data against a threat library, filtering out the URL data that carries threats to obtain safe URL data, and storing threat records in a threat record library;
matching the safe URL data against the known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, referred to as an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
updating the behavior library according to the target keywords in the analysis result, the unknown URL, and its corresponding behavior category.
Further, updating the behavior library according to the target keywords in the analysis result, the unknown URL, and its corresponding behavior category comprises the following steps:
adding the target keywords and their frequencies to the behavior library;
recalculating the weight of each keyword according to the updated keywords, and updating the behavior library according to the weights.
Further, the unknown URLs also include URLs crawled at random by a crawler.
Further, the behavior library includes a URL library and a keyword library. The URL library includes known URLs and their corresponding behavior categories, and the keyword library holds the keywords corresponding to each behavior category. The keyword library is divided by weight into a judgment keyword library, whose keywords are above a preset weight, and other keyword libraries, whose keywords are below the preset weight; the other keyword libraries are further divided by weight, from high to low, into an undetermined keyword library and a non-judgment keyword library. The judgment keyword library stores, for each behavior category, one or more groups of judgment keywords and their frequencies, and the keyword analysis obtains the analysis result by matching against the judgment keyword library.
Further, obtaining the analysis result by matching against the judgment keyword library comprises the following steps:
choosing any behavior category, obtaining the judgment keywords corresponding to that behavior category and the frequencies of those judgment keywords, denoted the first frequencies, and constructing a first array from the first frequencies;
counting the target keywords in the unknown URL whose weight is above the preset weight and the frequencies of those target keywords, denoted the second frequencies, and constructing a second array from the second frequencies;
comparing the first array with the second array for similarity to obtain the similarity value between the unknown URL and the behavior category;
calculating, in the same way, the similarity between the unknown URL and every behavior category to obtain the analysis result, the analysis result being the behavior category with the greatest similarity to the unknown URL data.
Further, counting the target keywords in the unknown URL whose weight is above the preset weight and their frequencies comprises the following steps:
crawling the web page of the unknown URL;
segmenting the content of the web page into words to obtain all keywords in the web page;
calculating the weight of every keyword;
filtering out the keywords whose weight is above the preset weight to obtain the target keywords.
Further, the data in the URL database is pushed to a security platform, and the threat library is updated according to the result returned by the security platform, the returned result being newly added threat URLs.
The second object of the present invention is achieved by the following technical solution:
A URL analysis device, comprising:
an acquisition module, configured to receive URL data and store the URL data in a URL database;
a filtering module, configured to match the URL data against a threat library, filter out the URL data that carries threats to obtain safe URL data, and store threat records in a threat record library;
an analysis module, configured to match the safe URL data against the known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, referred to as an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
an update module, configured to update the behavior library according to the target keywords in the analysis result, the unknown URL, and its corresponding behavior category.
A third object of the present invention is to provide an electronic device that carries out the first object, comprising a processor, a storage medium, and a computer program, wherein the computer program is stored in the storage medium and implements the above URL analysis method when executed by the processor.
A fourth object of the present invention is to provide a computer-readable storage medium that stores the first object, on which a computer program is stored, wherein the computer program implements the above URL analysis method when executed by a processor.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention filters out threat URLs to obtain safe URLs and performs behavior analysis only on the safe URLs, which avoids the inaccurate results caused by analyzing threat URLs during behavior analysis. Because matches against the threat library are recorded, threat records can be queried directly; because matches against the behavior library are recorded, behavior records can be queried directly. For a URL that is absent from the behavior library, keyword analysis yields its behavior category, and the behavior library is updated in real time from the analysis result, which further improves the accuracy of the analysis and removes the need to update the behavior library manually.
Brief description of the drawings
Fig. 1 is the flow chart of the URL analysis method of embodiment one;
Fig. 2 is the flow chart of the keyword analysis of embodiment three;
Fig. 3 is the structural block diagram of the URL analytical equipment of embodiment five;
Fig. 4 is the structural block diagram of the electronic equipment of embodiment six.
Detailed description of the embodiments
The present invention is described in more detail below with reference to the drawings. It should be noted that the following description is illustrative only and not restrictive. The different embodiments may be combined with one another to form further embodiments not shown in the following description.
Embodiment one
Embodiment one provides a URL analysis method that first records threat URLs and then analyzes the behavior categories of safe URLs, thereby obtaining accurate threat records and behavior records. Keyword comparison completes the behavior records, and its results are used to update the behavior library in real time, so that the behavior category of every URL can be obtained. Updating the behavior library from keyword comparison results replaces the manual process of updating the URL behavior library.
Referring to Fig. 1, a URL analysis method comprises the following steps:
S110: receive URL data and store the URL data in a URL database;
The received URL data is usually collected on the gateway box and mainly includes the full path of each URL, the number of visits, the access time, and so on. These data are pushed in real time to a designated topic, and Structured Streaming or a similar stream-processing engine receives the topic data in real time to obtain information such as the URLs the user visits.
S120: match the URL data against the threat library, filter out the URL data that carries threats to obtain safe URL data, and store threat records in the threat record library;
When queried, the threat history can be read directly from the threat record library.
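A minimal sketch of step S120 follows, assuming the threat library can be represented as a dictionary from URL to threat type and that matching is an exact lookup. Both are assumptions of this sketch; the text does not fix the matching rule, and the example entries are hypothetical.

```python
def filter_threats(urls, threat_library):
    """Split URLs into safe URLs and threat records (exact-match lookup)."""
    safe, threat_records = [], []
    for url in urls:
        if url in threat_library:
            # Keep a record so that threat history can be queried later.
            threat_records.append({"url": url, "threat": threat_library[url]})
        else:
            safe.append(url)
    return safe, threat_records

threat_library = {"phish.example.com/login": "phishing"}  # hypothetical entry
safe_urls, threat_record_library = filter_threats(
    ["www.taobao.com/item", "phish.example.com/login"], threat_library
)
```

Only `safe_urls` is passed on to behavior analysis, which is the ordering the method relies on.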
S130: match the safe URL data against the known URLs in the behavior library:
when the safe URL data is matched successfully, obtain a behavior record and store it in the behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extract target keywords from the URL that failed to match, referred to as an unknown URL, perform keyword analysis against the keywords in the behavior library, and store the analysis result in the behavior record library as the behavior record;
The unknown URLs also include URLs crawled at random by a crawler; adding randomly crawled URLs improves the efficiency with which the behavior library is updated.
During matching, each safe URL is compared with the URLs in the behavior library. For example, if the behavior library maps the URL address "www.taobao.com" to the behavior category "shopping", then every safe URL that contains "www.taobao.com" has the behavior category "shopping". Before matching, the safe URL usually also needs to be normalized, for example by removing the URL protocol header.
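The matching rule above can be sketched as follows. The function names and the lower-casing step are assumptions of this sketch; the text only names removal of the protocol header as an example of normalization.

```python
import re

def normalize(url):
    """Strip the protocol header (e.g. 'http://') and lower-case the URL."""
    return re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url.strip()).lower()

def match_behavior(url, url_library):
    """Return the category of the first known URL contained in `url`, else None."""
    normalized = normalize(url)
    for known_url, category in url_library.items():
        if known_url in normalized:
            return category
    return None  # no match: the URL is treated as an unknown URL

url_library = {"www.taobao.com": "shopping"}
category = match_behavior("https://www.taobao.com/item?id=42", url_library)
unknown = match_behavior("http://unknown.example.net/", url_library)
```

A URL for which `match_behavior` returns `None` would go on to keyword analysis.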
During keyword analysis, the keywords of the web page of the unknown URL are compared with those in the behavior library. For example, if the behavior library holds a group of keywords "SF Express" and "discount" for the behavior category "shopping", and the extracted target keywords are "SF Express", "discount", and "apple", the behavior category of the unknown URL can be determined to be "shopping". The target keywords are extracted according to a set condition, usually by filtering out the keywords whose weight is above the preset weight; alternatively, the keywords may be sorted by weight and a preset number of the highest-weight keywords taken as the target keywords.
When queried, the behavior history can be read directly from the behavior record library.
S140: update the behavior library according to the target keywords in the analysis result, the unknown URL, and its corresponding behavior category.
Embodiment two
Embodiment two is an improvement on embodiment one and mainly explains the behavior library and the calculation of weights.
To facilitate keyword analysis and behavior-category matching, the behavior library includes a URL library and a keyword library. The URL library includes known URLs and their corresponding behavior categories, and the keyword library holds the keywords corresponding to each behavior category. The keyword library is divided by weight into a judgment keyword library, whose keywords are above the preset weight, and other keyword libraries, whose keywords are below the preset weight; the other keyword libraries are further divided by weight, from high to low, into an undetermined keyword library and a non-judgment keyword library. The judgment keyword library stores, for each behavior category, one or more groups of judgment keywords and their frequencies, and the keyword analysis obtains the analysis result by matching against the judgment keyword library.
When the behavior library is updated, the weights are recalculated with the newly added target keywords. Keywords in the judgment keyword library whose weight has fallen below the preset weight are moved to the undetermined keyword library; likewise, keywords in the undetermined keyword library whose weight has fallen are moved to the non-judgment keyword library, and keywords whose weight is above the preset weight are moved to the judgment keyword library.
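The reassignment of keywords after a weight update can be sketched as below. The second threshold `lower_weight`, which separates the undetermined library from the non-judgment library, is an assumption of this sketch: the text says only that the split is made by weight from high to low. How a keyword exactly at a threshold is handled is also not specified; here it falls into the lower tier.

```python
def assign_tiers(weights, preset_weight, lower_weight):
    """Place each keyword in one of three tiers by its recomputed weight."""
    tiers = {"judgment": {}, "undetermined": {}, "non_judgment": {}}
    for keyword, weight in weights.items():
        if weight > preset_weight:
            tiers["judgment"][keyword] = weight
        elif weight > lower_weight:
            tiers["undetermined"][keyword] = weight
        else:
            tiers["non_judgment"][keyword] = weight
    return tiers

# Hypothetical weights and thresholds, for illustration only.
tiers = assign_tiers(
    {"discount": 0.05, "delivery": 0.02, "page": 0.001},
    preset_weight=0.03,
    lower_weight=0.01,
)
```

Rebuilding all three tiers from the recomputed weights gives the same result as moving keywords one by one between libraries.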
Specifically, the weights may be calculated with the TF-IDF algorithm or with any other suitable keyword-weighting algorithm.
Taking the TF-IDF algorithm as an example, TF denotes the frequency with which a word or phrase occurs in a document; here it is the frequency with which a keyword occurs in a web page, for example how often "discount" occurs in a web page whose behavior category is shopping. The formula is:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $i$ is the $i$-th word in the keyword library, $j$ is the number of the web page the keyword belongs to, $n_{i,j}$ is the number of times keyword $i$ occurs in page $j$, and the denominator is the total number of keyword occurrences in page $j$. For example, if "discount" occurs 5 times in the shopping page numbered "1" and that page contains 100 keywords in total, the TF value of "discount" is 5/100 = 0.05. The judgment keyword library stores the keywords and frequencies of page "1" together with the TF value of each keyword.
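A minimal sketch of the TF calculation, reproducing the 5-out-of-100 example above. The token list is synthetic, and the example keyword is written here as "discount"; the filler token `other` is a placeholder for the page's remaining keywords.

```python
from collections import Counter

def term_frequency(tokens):
    """Return tf = n_ij / sum_k n_kj for every keyword in one page."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# A page with 100 keyword occurrences, 5 of which are the example keyword.
page_tokens = ["discount"] * 5 + ["other"] * 95
tf = term_frequency(page_tokens)
```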
IDF denotes inverse document frequency; here it measures how important a keyword is for judging a behavior category. The formula is:
$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}$$
where $|D|$ is the number of web pages in a given behavior category and $|\{\, j : t_i \in d_j \,\}|$ is the number of those pages that contain the keyword $t_i$. For example, if the behavior library holds 100 "shopping" pages and 10 of them contain the keyword "discount", the IDF value is $\log_{10}(100/10) = 1$.
The TF-IDF value is the product of TF and IDF. In the example above, the TF and IDF of the keyword "discount" are 0.05 and 1 respectively, so its TF-IDF value is 0.05.
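The IDF and TF-IDF arithmetic of the worked example can be checked with the short sketch below. The base-10 logarithm is inferred from the example, where 100 pages with 10 hits give an IDF of 1; other TF-IDF variants use the natural logarithm or add smoothing, so this is one possible reading rather than a fixed definition. Representing each page as a set of keywords is an assumption of the sketch.

```python
import math

def inverse_document_frequency(keyword, pages):
    """idf = log10(|D| / number of pages containing the keyword)."""
    containing = sum(1 for page in pages if keyword in page)
    if containing == 0:
        raise ValueError("keyword does not occur in any page")
    return math.log10(len(pages) / containing)

# 100 pages of one behavior category, 10 of which contain the example keyword.
pages = [{"discount"}] * 10 + [{"other"}] * 90
idf = inverse_document_frequency("discount", pages)
tfidf = 0.05 * idf  # TF value taken from the worked example
```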
A weight threshold is preset according to the actual situation. Keywords whose weight is above the threshold are used as judgment keywords for keyword judgment, which prevents low-weight keywords, such as common high-frequency words, from affecting the judgment of the behavior category.
Embodiment three
Embodiment three is carried out on the basis of embodiment one and/or embodiment two and mainly explains the detailed process of keyword analysis.
Keyword analysis comprises the following steps:
S210: choose any behavior category, obtain the judgment keywords corresponding to that behavior category and the frequencies of those judgment keywords, denoted the first frequencies, and construct a first array from the first frequencies;
S220: count the target keywords in the unknown URL whose weight is above the preset weight and the frequencies of those target keywords, denoted the second frequencies, and construct a second array from the second frequencies;
Specifically, counting the target keywords in the unknown URL whose weight is above the preset weight and their frequencies comprises the following steps:
crawling the web page of the unknown URL;
segmenting the content of the web page into words to obtain all keywords in the web page;
calculating the weight of every keyword;
filtering out the keywords whose weight is above the preset weight to obtain the target keywords.
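The extraction sub-steps can be sketched as follows, assuming the page text has already been fetched by a crawler and that an IDF table is available. The regular-expression tokenizer handles only lower-case Latin words and stands in for a real word segmenter, which Chinese pages would need; the IDF values and the threshold are hypothetical.

```python
import re
from collections import Counter

def extract_target_keywords(page_text, idf, preset_weight):
    """Tokenize a page, weight each keyword by TF-IDF, keep those above the threshold.

    Returns {keyword: frequency} for the target keywords.
    """
    tokens = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    targets = {}
    for word, n in counts.items():
        # Words missing from the IDF table get weight 0 and are never selected.
        weight = (n / total) * idf.get(word, 0.0)
        if weight > preset_weight:
            targets[word] = n
    return targets

idf = {"discount": 1.0, "delivery": 1.0, "the": 0.0}  # hypothetical IDF table
text = "The discount on the delivery: discount discount"
targets = extract_target_keywords(text, idf, preset_weight=0.2)
```

The returned frequencies are the second frequencies from which the second array is built.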
S230: compare the first array with the second array for similarity to obtain the similarity value between the unknown URL and the behavior category;
In the same way, the similarity between the unknown URL and every behavior category is calculated to obtain the analysis result, the analysis result being the behavior category with the greatest similarity to the unknown URL data.
Specifically, the similarity may be calculated by the law of cosines (cosine similarity) or by any other method capable of calculating similarity.
Taking the law of cosines as an example, the similarity satisfies the formula:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where $A$ and $B$ denote the first array and the second array respectively. The closer the result is to 1, the more similar the two groups of keywords are.
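Steps S210 to S230 can be sketched as below. How the two arrays are aligned is not stated in the text; this sketch assumes both are built over the union of the two keyword sets, with 0 for a keyword missing from one side. The judgment library contents and the frequencies are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| * |B|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(target_freqs, judgment_library):
    """Return (best category, similarity) over all behavior categories."""
    best = (None, -1.0)
    for category, judged in judgment_library.items():
        vocab = sorted(set(judged) | set(target_freqs))
        first = [judged.get(w, 0) for w in vocab]         # first array
        second = [target_freqs.get(w, 0) for w in vocab]  # second array
        score = cosine_similarity(first, second)
        if score > best[1]:
            best = (category, score)
    return best

judgment_library = {
    "shopping": {"discount": 5, "delivery": 3},
    "news": {"report": 4, "election": 2},
}
category, score = classify({"discount": 4, "delivery": 2, "apple": 1}, judgment_library)
```

With these inputs the unknown URL shares no keywords with "news", so its similarity there is 0, and "shopping" is selected.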
Embodiment four
Embodiment four is carried out on the basis of embodiment one and mainly explains the updating of the threat library.
Specifically, the data in the URL database is pushed to a security platform, and the threat library is updated according to the result returned by the security platform, the returned result being newly added threat URLs.
To update the threat library, threat event analysis is performed on the URLs in the URL database or on URLs crawled at random by the crawler. Threat detection may be performed by sending the URL data to the security platform, and the full threat library is updated in real time according to the detection results. For websites that pose serious threats, the website or the IP corresponding to the URL is issued to the firewall system of the intelligent gateway box so that access is blocked.
Through the security platform, further threat detection can be performed on URLs and the threat library updated with the detection results, which makes matching against the threat library more accurate.
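The update of the threat library from the security platform's result can be sketched as follows. The shape of the returned result, a list of dictionaries with `url`, `threat`, and an optional `severe` flag, is an assumption of this sketch; the text says only that the result contains newly added threat URLs and that seriously threatening sites are issued to the firewall.

```python
def update_threat_library(threat_library, platform_result):
    """Merge newly reported threat URLs; return those severe enough to block."""
    to_block = []
    for item in platform_result:
        threat_library[item["url"]] = item["threat"]
        if item.get("severe"):
            # Candidates to be issued to the gateway firewall for blocking.
            to_block.append(item["url"])
    return to_block

threat_library = {}
blocked = update_threat_library(threat_library, [
    {"url": "phish.example.com", "threat": "phishing", "severe": True},
    {"url": "ads.example.net", "threat": "adware"},
])
```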
Embodiment five
Embodiment five discloses a device corresponding to the above URL analysis method, which is the virtual device structure of the above embodiments. Referring to Fig. 3, it comprises:
an acquisition module 310, configured to receive URL data and store the URL data in a URL database;
a filtering module 320, configured to match the URL data against a threat library, filter out the URL data that carries threats to obtain safe URL data, and store threat records in a threat record library;
an analysis module 330, configured to match the safe URL data against the known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, referred to as an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
an update module 340, configured to update the behavior library according to the target keywords in the analysis result, the unknown URL, and its corresponding behavior category.
Preferably, updating the behavior library according to the target keywords in the analysis result, the unknown URL, and its corresponding behavior category comprises the following steps:
adding the target keywords and their frequencies to the behavior library;
recalculating the weight of each keyword according to the updated keywords, and updating the behavior library according to the weights.
The unknown URLs also include URLs crawled at random by a crawler.
Preferably, the behavior library includes a URL library and a keyword library. The URL library includes known URLs and their corresponding behavior categories, and the keyword library holds the keywords corresponding to each behavior category. The keyword library is divided by weight into a judgment keyword library, whose keywords are above the preset weight, and other keyword libraries, whose keywords are below the preset weight; the other keyword libraries are further divided by weight, from high to low, into an undetermined keyword library and a non-judgment keyword library. The judgment keyword library stores, for each behavior category, one or more groups of judgment keywords and their frequencies, and the keyword analysis obtains the analysis result by matching against the judgment keyword library.
Preferably, obtaining the analysis result by matching against the judgment keyword library comprises the following steps:
choosing any behavior category, obtaining the judgment keywords corresponding to that behavior category and the frequencies of those judgment keywords, denoted the first frequencies, and constructing a first array from the first frequencies;
counting the target keywords in the unknown URL whose weight is above the preset weight and the frequencies of those target keywords, denoted the second frequencies, and constructing a second array from the second frequencies;
comparing the first array with the second array for similarity to obtain the similarity value between the unknown URL and the behavior category;
calculating, in the same way, the similarity between the unknown URL and every behavior category to obtain the analysis result, the analysis result being the behavior category with the greatest similarity to the unknown URL data.
Counting the target keywords in the unknown URL whose weight is above the preset weight and their frequencies comprises the following steps:
crawling the web page of the unknown URL;
segmenting the content of the web page into words to obtain all keywords in the web page;
calculating the weight of every keyword;
filtering out the keywords whose weight is above the preset weight to obtain the target keywords.
Preferably, the data in the URL database is pushed to a security platform, and the threat library is updated according to the result returned by the security platform, the returned result being newly added threat URLs.
Embodiment six
Fig. 4 is a structural schematic diagram of an electronic device provided by embodiment six of the present invention. As shown in Fig. 4, the electronic device includes a processor 410, a memory 420, an input device 430, and an output device 440. The electronic device may have one or more processors 410; Fig. 4 takes one processor 410 as an example. The processor 410, the memory 420, the input device 430, and the output device 440 in the electronic device may be connected by a bus or in other ways; Fig. 4 takes a bus connection as an example.
As a computer-readable storage medium, the memory 420 can store software programs, computer-executable programs, and modules, such as the program instructions and modules corresponding to the URL analysis method in the embodiments of the present invention (for example, the acquisition module 310, the filtering module 320, the analysis module 330, and the update module 340 in the URL analysis device). By running the software programs, instructions, and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the electronic device, that is, it implements the URL analysis method of embodiments one to four above.
The memory 420 may mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created through use of the terminal. In addition, the memory 420 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, and such remote memory can be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 430 can be used to receive input such as user identity information and the preset weight. The output device 440 may include a display device such as a display screen.
Embodiment seven
Embodiment seven of the present invention also provides a storage medium containing computer-executable instructions, the computer-executable instructions being used to execute a URL analysis method when executed by a computer processor, the method comprising:
receiving URL data and storing the URL data in a URL database;
matching the URL data against a threat library, filtering out the URL data that carries threats to obtain safe URL data, and storing threat records in a threat record library;
matching the safe URL data against the known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, referred to as an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
updating the behavior library according to the target keywords in the analysis result, the unknown URL, and its corresponding behavior category.
Of course, in the storage medium containing computer-executable instructions provided by this embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above and can also perform related operations in the URL analysis method provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and can of course also be implemented by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), a hard disk, or an optical disc, and includes instructions that cause an electronic device (which may be a mobile phone, a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiment of the URL analysis device, the units and modules are divided only according to functional logic; the division is not limited to the one described above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Those skilled in the art can make various other corresponding changes and modifications according to the technical solution and ideas described above, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.