CN105956472B - Method and system for identifying whether a webpage contains malicious content - Google Patents
- Publication number: CN105956472B
- Application number: CN201610313359.3A
- Authority: CN (China)
- Prior art keywords: webpage, feature, identified, result, url
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2119—Authenticating web pages, e.g. with suspicious links
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses methods for identifying whether a webpage contains malicious content. One of the identification methods comprises the steps of: parsing the URL of a webpage to be identified and extracting URL features from it to generate a first feature set; generating a first feature vector from the first feature set; and processing the first feature vector with a first feature model and outputting a first result that indicates whether the webpage to be identified contains malicious content. The invention also discloses three other identification methods and the corresponding systems for identifying whether a webpage contains malicious content.
Description
Technical field
The present invention relates to the technical field of network security, and in particular to a method and system for identifying whether a webpage contains malicious content.
Background
With the development of the Internet, web-based applications have become increasingly popular. People can check bank accounts, shop online and so on through a browser, and the web provides a convenient and efficient mode of interaction. The accompanying problem is that attacks by malicious websites multiply year after year. Such websites disguise their identity through a series of technical means to gain users' trust by deception and then seek illegal profit, and users suffer heavy economic losses under their attacks. How to identify malicious content in webpages and guard against malicious websites has therefore become an important research topic in the field of network security.
Existing techniques for guarding against malicious websites mainly take the URL of a suspicious webpage and query it against a blacklist database. However, because phishing websites are constantly updated, this approach has a low detection rate for malicious websites such as phishing sites and lags behind them. Other approaches scan the webpage content to check whether malicious keywords are present, or extract basic features of the webpage image and compute the similarity between the suspicious webpage and the genuine one to judge whether the suspicious webpage is an imitation. Each of these methods has its own limitations, which leads to a high misjudgment rate.
Summary of the invention
To this end, the present invention provides methods and systems for identifying whether a webpage contains malicious content, in an effort to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: parsing the URL of a webpage to be identified and extracting URL features from it to generate a first feature set; generating a first feature vector from the first feature set; and processing the first feature vector with a first feature model and outputting a first result that indicates whether the webpage to be identified contains malicious content.
The identification method according to the invention further comprises a preprocessing step: extracting the URL of the webpage to be identified and checking whether it matches a URL in a pre-stored database; if the URL is in a first pre-stored database, judging that the webpage to be identified contains malicious content; and if the URL is in a second pre-stored database, judging that the webpage to be identified does not contain malicious content.
According to another aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: crawling the content of the webpage to be identified and performing word segmentation on the crawled content to obtain a word sequence; constructing a second feature vector whose dimension is a first predetermined number according to whether each feature word of a second feature set is present in the word sequence, wherein the second feature set pre-stores the first predetermined number of feature words; and processing the second feature vector with a second feature model and outputting a second result that indicates whether the webpage to be identified contains malicious content.
According to a further aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: extracting first identity information of the webpage to be identified from its URL; extracting all external links of the webpage to be identified; determining second identity information of the webpage to be identified from the external links; and comparing the first identity information with the second identity information and outputting a third result that indicates whether the webpage to be identified contains malicious content.
According to yet another aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: executing the identification methods described above to output the first result, the second result and the third result, respectively; applying a weighting algorithm to the first, second and third results to obtain a final result; if the final result is greater than a threshold, determining that the webpage to be identified contains malicious content; and if the final result is not greater than the threshold, determining that the webpage to be identified does not contain malicious content.
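The weighted combination of the three results can be sketched as follows. This is only a minimal illustration: the function name `fuse_results`, the weights and the threshold value are hypothetical, since the text above does not specify them.

```python
def fuse_results(first, second, third, weights=(0.4, 0.4, 0.2), threshold=0.5):
    """Combine three 0/1 results into a final verdict.

    The weights and the threshold are illustrative assumptions; the text
    only states that a weighted final result is compared with a threshold.
    """
    w1, w2, w3 = weights
    final = w1 * first + w2 * second + w3 * third
    # Strictly greater than the threshold means "contains malicious content".
    return final > threshold
```

With these assumed weights, a page flagged by the first two methods is judged malicious, while a page flagged only by the third is not.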
Correspondingly, the invention also provides four systems for identifying whether a webpage contains malicious content, corresponding respectively to the four identification methods above.
Based on the description above, this solution aims to provide an efficient and widely applicable scheme for identifying malicious webpages, which includes the following identification methods:
First, the URL of the webpage to be identified is filtered through black and white lists;
Then, the URL of the webpage to be identified is parsed to extract a first feature set, and a machine learning model processes the first feature set and outputs a first result indicating whether the webpage to be identified contains malicious content;
Meanwhile, a second feature vector is extracted from the content of the webpage to be identified, and a machine learning model processes the second feature vector and outputs a second result indicating whether the webpage to be identified contains malicious content;
Alternatively, the identity information of the webpage to be identified and of the webpages its external links point to is analyzed to judge whether the webpage is a suspected imitation, and a third result is output indicating whether the webpage to be identified contains malicious content;
Finally, a weighted operation can also be performed on the first, second and third results to reach a more comprehensive judgment.
In this way, on the basis of the traditional black/white list identification method, this solution combines machine learning models with an imitation-suspicion identification method and considers both URL features and webpage content. It addresses the lag of black/white list identification, has some ability to detect unknown malicious websites, and saves human resources by identifying webpages automatically. Moreover, the identification methods above can be flexibly selected and combined according to the needs of the application scenario, so as to identify quickly and accurately whether a webpage contains malicious content.
Brief description of the drawings
To accomplish the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the accompanying drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout the disclosure, the same reference numerals generally refer to the same components or elements.
Fig. 1 shows a flowchart of a method 100 for identifying whether a webpage contains malicious content according to one embodiment of the invention;
Fig. 2 shows a flowchart of a method 200 for identifying whether a webpage contains malicious content according to another embodiment of the invention;
Fig. 3 shows a flowchart of a method 300 for identifying whether a webpage contains malicious content according to a further embodiment of the invention;
Fig. 4 shows a flowchart of a method 400 for identifying whether a webpage contains malicious content according to yet another embodiment of the invention;
Fig. 5 shows a schematic diagram of a system 500 for identifying whether a webpage contains malicious content according to one embodiment of the invention;
Fig. 6 shows a schematic diagram of a system 600 for identifying whether a webpage contains malicious content according to another embodiment of the invention;
Fig. 7 shows a schematic diagram of a system 700 for identifying whether a webpage contains malicious content according to a further embodiment of the invention; and
Fig. 8 shows a schematic diagram of a system 800 for identifying whether a webpage contains malicious content according to yet another embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a flowchart of a method 100 for identifying whether a webpage contains malicious content according to one embodiment of the invention.
According to one embodiment of the invention, to improve the efficiency of identifying malicious webpages, a preprocessing operation is performed on the input webpage to be identified: the webpage is filtered using black and white lists to screen out webpages that are easy to identify. Specifically, the URL of the webpage to be identified is extracted and compared with the URLs in the pre-stored databases (i.e., the blacklist and the white list). If the URL is in the first pre-stored database (the blacklist), the webpage is judged to contain malicious content; if the URL is in the second pre-stored database (the white list), the webpage is judged not to contain malicious content; for the remaining unmatched webpages, the operation of step S110 is carried out to continue the analysis.
The code logic executed during black/white list filtering is as follows, where whitelist denotes the white list and blacklist denotes the blacklist:
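The original code listing is not reproduced in this text. A minimal sketch of the filtering logic described above might look like the following; the function name `prefilter`, the return convention and the list entries are assumptions made for illustration.

```python
def prefilter(url, blacklist, whitelist):
    """Return 1 (malicious), 0 (not malicious) or None (needs analysis)."""
    if url in blacklist:
        return 1
    if url in whitelist:
        return 0
    # Not matched by either list: continue with step S110.
    return None


# Hypothetical list entries, for illustration only.
blacklist = {"http://phish.example/login"}
whitelist = {"http://www.example.com/"}
```

Storing both lists as sets keeps each lookup at roughly constant time, which suits a pre-filter that runs before the more expensive analysis.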
Through the preprocessing step, easily identified webpages are screened out first, and the remaining webpages are then analyzed. The preprocessing step can also be combined with the other identification methods; the invention is not limited in this respect.
In step S110, the URL of the webpage to be identified is parsed and URL features are extracted from it to generate the first feature set.
Each segment of a URL conveys specific information to the client and the server. The URL of a webpage can be decomposed into several main parts, such as the protocol, host and path; these elements are not described in detail here. Take the following URL as an example:
http://www.baidu.com/path/index.hrml?q=adf
Parsing it gives:
protocol: http
host: www.baidu.com
path: path/index.hrml?q=adf
pathname: path/index.hrml
query: ?q=adf
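The decomposition above can be reproduced with Python's standard `urllib.parse` module. This is a sketch, not the patent's parser: the function name `parse_url` is hypothetical, and the mapping from `urlsplit` fields to the names used above (path without the leading slash, query with a leading '?') is an assumption chosen to match the example.

```python
from urllib.parse import urlsplit


def parse_url(url):
    """Split a URL into the parts named in the example above."""
    parts = urlsplit(url)
    pathname = parts.path.lstrip("/")
    query = "?" + parts.query if parts.query else ""
    return {
        "protocol": parts.scheme,
        "host": parts.hostname or "",
        "path": pathname + query,   # pathname plus query string
        "pathname": pathname,
        "query": query,
    }


parsed = parse_url("http://www.baidu.com/path/index.hrml?q=adf")
```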
URL features are then extracted to generate the first feature set.
According to an embodiment of the invention, 18 structural features and 7 lexical features of the URL are extracted in total, as follows (Fi denotes the i-th feature):
F1: url_len, the length of the URL; the URLs of malicious webpages are usually overly long;
F2: http_n, the number of times the http protocol appears. A webpage containing malicious content, such as a phishing link, often uses the http protocol multiple times to change where the link leads and steer the user to the prepared phishing website. For example, http://www.taobao.com/url?q=http://www.59adfadss123.com appears to lead to the Taobao home page, but when the user clicks it, it actually redirects to the phishing website that follows. A link that uses the http protocol multiple times is therefore likely to be a phishing link;
F3: tld_inht, whether the top-level domain is legitimate, where 1 means legitimate and 0 means not legitimate;
F4: is_ip, whether the link contains an IP address. A link containing an IP address is usually likely to be a phishing link, while legitimate links rarely contain one. Likewise, 1 means yes and 0 means no;
F5 and F6 give the number of specified characters in the URL:
F5: url_n_percent, the number of '%' characters in the link. A URL containing '%' usually uses encoded characters (percent-encoding), e.g., http://www.taobao.com@%77%77%77%2E%70%68%69%73%68%2E%63%6F%6D;
F6: url_n_token, the number of separators such as '_', '-', '&', '#', '?' in the link;
F7: host_len, the length of the host string;
F8 and F9 give the number of specified characters in the host string:
F8: host_n_dot, the number of dot separators in the host string;
F9: host_n_token, the number of separators such as '_', '-', '&', '#', '?' in the host string;
F10: host_max_len, the length of the longest substring after the host string is split on dots. For example, www.taobao.1242.59adfadss123.com splits into "www", "taobao", "1242", "59adfadss123", "com", so F10 = 12;
F11 and F12 give the number of specified characters in the path:
F11: path_n_dot, the number of dot separators in the path;
F12: path_n_token, the number of separators such as '_', '-', '&', '#', '?' in the path;
F13: pathname_len, the length of the pathname;
F14 and F15 give the number of specified characters in the pathname:
F14: pathname_n_dot, the number of dot separators in the pathname;
F15: pathname_n_token, the number of separators such as '_', '-', '&', '#', '?' in the pathname;
F16: pathname_max_len, the length of the longest substring after the pathname is split on '/', analogous to F10;
F17: n_subdir, the path depth, measured by the number of '/' characters in the pathname; malicious links usually deepen the path to confuse users;
F18: query_len, the length of the query field;
F19 to F25: whether the URL contains the strings "secure", "account", "webscr", "login", "signin", "banking", "confirm", respectively; malicious links often contain these strings.
This embodiment gives only one example of the first feature set. The first feature set may include at least one of the URL features above, and other URL features may also be extracted; the invention is not limited in this respect.
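A sketch of how the 25 features listed above could be computed is given below. It is an illustration under stated assumptions, not the patent's implementation: the function names, the small `KNOWN_TLDS` set (a stand-in for a real top-level domain list), the IPv4 regular expression, and the choice to count "http" case-insensitively (which also counts "https") are all hypothetical.

```python
import re
from urllib.parse import urlsplit

TOKENS = "_-&#?"
KEYWORDS = ["secure", "account", "webscr", "login", "signin", "banking", "confirm"]
KNOWN_TLDS = {"com", "net", "org", "cn", "in"}  # hypothetical stand-in list
IPV4 = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def count_tokens(text):
    return sum(text.count(ch) for ch in TOKENS)


def longest_piece(text, sep):
    return max(len(piece) for piece in text.split(sep))


def extract_url_features(url):
    """Return the feature values F1..F25 as a list of 25 numbers."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    pathname = parts.path.lstrip("/")
    query = "?" + parts.query if parts.query else ""
    path = pathname + query
    lowered = url.lower()
    features = [
        len(url),                                     # F1  url_len
        lowered.count("http"),                        # F2  http_n
        int(host.rsplit(".", 1)[-1] in KNOWN_TLDS),   # F3  tld_inht
        int(bool(IPV4.search(url))),                  # F4  is_ip
        url.count("%"),                               # F5  url_n_percent
        count_tokens(url),                            # F6  url_n_token
        len(host),                                    # F7  host_len
        host.count("."),                              # F8  host_n_dot
        count_tokens(host),                           # F9  host_n_token
        longest_piece(host, "."),                     # F10 host_max_len
        path.count("."),                              # F11 path_n_dot
        count_tokens(path),                           # F12 path_n_token
        len(pathname),                                # F13 pathname_len
        pathname.count("."),                          # F14 pathname_n_dot
        count_tokens(pathname),                       # F15 pathname_n_token
        longest_piece(pathname, "/"),                 # F16 pathname_max_len
        pathname.count("/"),                          # F17 n_subdir
        len(query),                                   # F18 query_len
    ]
    # F19..F25: 1 if the keyword occurs anywhere in the URL, else 0.
    features += [int(word in lowered) for word in KEYWORDS]
    return features
```

For the Taobao redirect example in F2, this sketch yields an http count of 2, and for the host in the F10 example it yields a longest label length of 12.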
Then, in step S120, a first feature vector is generated from the first feature set.
a) Each feature in the first feature set is first quantized to obtain a feature value, and all the feature values are assembled into a feature vector. Taking the 25 URL features above as an example, for the following URL:
http://www.dyfdzx.com/js/?app=com-d3&ref=http://jebvahnus.battle.net/d3/en/index
features F1 to F25 are extracted to obtain the feature values, which form a 25-dimensional feature vector.
b) Each dimension of this feature vector is then normalized to generate the first feature vector.
According to one embodiment of the invention, each feature value in the feature vector is normalized to the interval [-1, 1] as follows:
where Fi is the feature value of the i-th dimension, and the remaining terms are the average value of the i-th feature value, its maximum value Fi,max and its minimum value Fi,min.
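The normalization formula itself is not reproduced in this text. One form that is consistent with the symbols defined above (the feature value, its average, its maximum and its minimum) and with the stated range [-1, 1] is mean normalization, (Fi - mean) / (Fi,max - Fi,min). The sketch below assumes that form; the function name and the choice to take the statistics over the set of vectors passed in are also assumptions.

```python
def mean_normalize(vectors):
    """Normalize each dimension to [-1, 1] using (x - mean) / (max - min)."""
    dims = len(vectors[0])
    stats = []
    for i in range(dims):
        column = [v[i] for v in vectors]
        mean = sum(column) / len(column)
        spread = max(column) - min(column)
        stats.append((mean, spread))
    normalized = []
    for v in vectors:
        row = []
        for x, (mean, spread) in zip(v, stats):
            # A constant dimension carries no information; map it to 0.
            row.append((x - mean) / spread if spread else 0.0)
        normalized.append(row)
    return normalized


sample = [[10, 0], [20, 1], [60, 1]]
result = mean_normalize(sample)
```

Because every value lies between the minimum and the maximum of its dimension, and so does the mean, the quotient cannot leave [-1, 1].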
The feature vector generated in step a) is normalized in this way to give the first feature vector.
Then, in step S130, the first feature vector obtained in step S120 is processed with the first feature model, and a first result is output to indicate whether the webpage to be identified contains malicious content.
According to an embodiment of the invention, the first feature vector is classified with the support vector machine (SVM) algorithm, which outputs 0 or 1 as the first result. Specifically, a first result of 1 indicates that the webpage to be identified contains malicious content, and a first result of 0 indicates that it does not.
A support vector machine (SVM) is a machine learning method based on statistical learning theory. Its core is to find a hyperplane that separates the training data while maximizing the margin on either side of the hyperplane. In other words, the SVM algorithm improves the generalization ability of the learner by seeking structural risk minimization, minimizing both the empirical risk and the confidence interval, so that good statistical regularities can be obtained even when the number of samples is small. In theory it is a binary classifier, but it can be extended to a multi-class classifier. Note that the feature models trained in the invention (for example, the first feature model) are not limited to this.
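For reference, the maximum-margin hyperplane mentioned above is commonly written as the following optimization problem for linearly separable training data (x_i, y_i) with labels y_i in {-1, +1}. This is the standard textbook hard-margin formulation, not a formula taken from the patent; the symbols w (normal vector) and b (offset) are introduced here for illustration.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The distance between the two supporting hyperplanes is 2 / ||w||, so minimizing ||w|| maximizes the margin.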
For example, for a webpage A to be identified, whose URL is:
http://ssol.iitk.ac.in/wp-content/onlineinformationnabaustralia/informationsecureonline/login.php?NAB82515Reset-Online-Account7137
its URL features are extracted to generate a feature vector, which is then normalized to obtain the first feature vector. The first feature vector is input to the first feature model, and the output first result is 1, indicating that webpage A contains malicious content.
As another example, for a webpage B to be identified, whose URL is:
http://www.annyway.com/annyway/MMSC.84+M5d637b1e38d.0.html
its URL features are extracted to generate a feature vector, which is then normalized to obtain the first feature vector. After the first feature vector is input to the first feature model, the output first result is 0, indicating that webpage B does not contain malicious content.
According to an implementation of the invention, the identification method 100 further includes the step of training the first feature model:
(1) A large number of URLs of webpages already labeled as not containing malicious content and of webpages labeled as containing malicious content are chosen as sample data, and the operation of step S110 is executed on the sample data to obtain the first feature sets of the sample data.
(2) As in step S120, the corresponding first feature vectors are generated from the first feature sets of the sample data and serve as training parameters.
(3) A machine learning algorithm (the support vector machine algorithm) is trained on the training parameters from step (2) to obtain the original classification learning model SVM-Model, i.e., the first feature model.
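Training step (3) can be sketched in a few lines. The text does not state which kernel or solver is used, so the example below assumes a linear SVM trained by sub-gradient descent on the hinge loss; the function names, hyperparameters and toy training vectors are all hypothetical.

```python
def train_linear_svm(vectors, labels, epochs=200, lr=0.1, lam=0.01):
    """Train a linear SVM by sub-gradient descent on the hinge loss.

    Labels follow the convention of the text: 1 = malicious, 0 = not.
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(vectors, labels):
            y = 1 if label == 1 else -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                # The sample violates the margin: move the hyperplane.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(model, x):
    """Return 1 (malicious) or 0 (not malicious) for a feature vector."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical, already-normalized two-dimensional training vectors.
train_x = [[0.8, 0.6], [0.9, 0.7], [-0.7, -0.8], [-0.9, -0.6]]
train_y = [1, 1, 0, 0]
svm_model = train_linear_svm(train_x, train_y)
```

In practice the vectors would be the normalized 25-dimensional first feature vectors of the labeled sample URLs, and an established SVM library would normally replace this hand-written trainer.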
According to an embodiment of the invention, to cope with the variability of malicious website attacks, the identification method 100 further includes a step of updating the first feature model online: the sample data is updated within a predetermined time, steps (1) and (2) above are executed to generate the first feature vectors of the new sample data, and the updated first feature vectors are input for training to generate a new first feature model that replaces the old one.
In addition, because malicious links change frequently, this solution can also update the algorithm that generates the first feature vector, for example by adding new URL features, deleting existing URL features, or changing the dimension of the first feature vector.
As described above for identification method 100, the URL of the webpage to be identified is parsed to extract the first feature set, and the corresponding first feature vector is input to the first feature model to obtain the feature space to which the webpage belongs, so as to judge whether that feature space belongs to the feature space of webpages containing malicious content; if so, 1 is output to indicate that the webpage contains malicious content. Method 100 requires neither manual identification of URLs nor hand-written rules, which saves manpower. In addition, considering the variability of malicious websites, the first feature model is updated periodically, which also mitigates the lag of existing identification methods.
Fig. 2 shows a flowchart of a method 200 for identifying whether a webpage contains malicious content according to another embodiment of the invention. As shown in Fig. 2, the identification method 200 includes the following steps:
In step S210, the content of the webpage to be identified is crawled, and word segmentation is performed on the crawled content to obtain a word sequence.
According to one embodiment of the invention, the webpage content is crawled with the scrapy framework, and MMSEG is then used to segment the crawled content into a word sequence. MMSEG is a common dictionary-based segmentation algorithm for Chinese; it is simple and intuitive, easy to implement, and fast. Briefly, the algorithm consists of a "matching algorithm" and "disambiguation rules". The matching algorithm determines how the sentence to be segmented is matched against the words stored in the dictionary. The disambiguation rules determine which segmentation to use when a sentence can be split in more than one way. For example, the phrase "facility and service" (设施和服务) can be segmented as "facility / kimono / business" (设施/和服/务) or as "facility / and / service" (设施/和/服务); choosing between these results is the job of the disambiguation rules. MMSEG defines two matching algorithms, simple maximum matching and complex maximum matching, and four disambiguation rules: maximum matching (corresponding to the two matching algorithms above), largest average word length, smallest variance of word lengths, and largest sum of degree of morphemic freedom of one-character words (compute the natural logarithm of the frequency of every single-character word in the phrase, sum the values, and take the phrase with the largest sum).
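Simple maximum matching can be sketched as a greedy scan that always takes the longest dictionary word at the current position. The example below is an illustration only, not the MMSEG reference implementation: the function name and the toy dictionary are hypothetical, and the Chinese form 设施和服务 of the "facility and service" phrase is an assumption about the original wording.

```python
def simple_maximum_matching(text, dictionary, max_word_len=4):
    """Segment text by repeatedly taking the longest dictionary word."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical toy dictionary; a real system would load a full lexicon.
toy_dictionary = {"设施", "和服", "服务", "和", "务"}
segments = simple_maximum_matching("设施和服务", toy_dictionary)
```

On this phrase the greedy scan produces the "facility / kimono / business" split, which is exactly the kind of ambiguity that complex maximum matching and the disambiguation rules are meant to resolve.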
Then, in step S220, a second feature vector whose dimension is the first predetermined number is constructed according to whether the feature words of the second feature set are present in the word sequence, where the second feature set pre-stores the first predetermined number of feature words.
First, according to one embodiment of the invention, the second feature set is generated as follows: the content of preset webpages is obtained and segmented into a word sequence; for each word in the word sequence, a second feature value characterizing the importance of the word is computed; and the first predetermined number (for example, 500) of words are chosen as feature words in descending order of second feature value to form the second feature set.
The second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word occurs and the probability distribution of whether a webpage contains malicious content, that is, the expected cross entropy of the word. In general, the larger the expected cross entropy of a word w, the stronger its ability to distinguish samples. The expected cross entropy is calculated as follows:
where P(phish | w) is the probability that the webpage to be identified is a phishing webpage given that word w occurs, P(phish) is the probability of a phishing webpage, P(nophish | w) is the probability that the webpage to be identified is not a phishing webpage given that word w occurs, and P(nophish) is the probability of a non-phishing webpage.
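The formula itself is not reproduced in this text. A form consistent with the four probabilities defined above is P(phish|w) log(P(phish|w) / P(phish)) + P(nophish|w) log(P(nophish|w) / P(nophish)); some formulations additionally multiply by P(w). The sketch below assumes the first form and estimates the probabilities by simple document counting; the function names and the toy data are hypothetical.

```python
import math


def expected_cross_entropy(word, documents, labels):
    """Score a word from labeled documents (label 1 = phishing, 0 = not).

    documents is a list of word lists. Estimating probabilities by
    counting documents is an assumption of this sketch.
    """
    p_phish = sum(labels) / len(documents)
    with_word = [lab for doc, lab in zip(documents, labels) if word in doc]
    if not with_word:
        return 0.0
    p_phish_w = sum(with_word) / len(with_word)
    score = 0.0
    for p_cond, p_prior in ((p_phish_w, p_phish), (1 - p_phish_w, 1 - p_phish)):
        if p_cond > 0 and p_prior > 0:  # treat 0 * log(0) as 0
            score += p_cond * math.log(p_cond / p_prior)
    return score


def select_feature_words(documents, labels, count):
    """Pick the `count` words with the largest expected cross entropy."""
    vocabulary = sorted({word for doc in documents for word in doc})
    ranked = sorted(vocabulary,
                    key=lambda w: expected_cross_entropy(w, documents, labels),
                    reverse=True)
    return ranked[:count]


docs = [["verify", "account", "now"], ["account", "password"],
        ["weather", "today"], ["news", "today", "account"]]
tags = [1, 1, 0, 0]
```

In this toy data, "account" appears in both classes and therefore scores lower than words that appear in only one class, such as "verify".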
The steps of constructing the second feature vector according to whether the feature words are present in the word sequence are then:
1. For each feature word in the second feature set, search the word sequence in order for that word:
If the word is present in the word sequence, the value at the word's position in the second feature set is set to 1;
If the word is not present in the word sequence, the value at the word's position in the second feature set is set to 0.
2. Generate the second feature vector, whose dimension is the first predetermined number, from the values assigned at the feature-word positions. For example, if N words are chosen as feature words (according to an embodiment of the invention, N is generally between 450 and 550), the second feature vector is an N-dimensional vector whose elements are each 0 or 1.
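The two steps above amount to building a binary presence vector. A minimal sketch follows, with a hypothetical three-word feature set standing in for the roughly 500 feature words mentioned in the text.

```python
def build_second_feature_vector(word_sequence, feature_words):
    """Return a 0/1 vector: one entry per feature word, 1 if present."""
    present = set(word_sequence)
    return [1 if word in present else 0 for word in feature_words]


# Hypothetical feature set and word sequence, for illustration only.
feature_words = ["account", "password", "verify"]
vector = build_second_feature_vector(
    ["please", "verify", "your", "account"], feature_words)
```

The order of the entries follows the order of the feature words, so the same feature set must be used for training and for identification.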
Then, in step S230, the second feature vector generated in step S220 is processed with the second feature model, and a second result is output to indicate whether the webpage to be identified contains malicious content. According to an embodiment of the invention, a second result of 1 indicates that the webpage to be identified contains malicious content, and a second result of 0 indicates that it does not.
As with identification method 100, identification method 200 also includes the step of training the second feature model:
(1) The content of a large number of webpages already labeled as containing malicious content and of webpages labeled as not containing malicious content is chosen as sample data and, as in step S210, the crawled content is segmented into word sequences.
(2) Using the feature words of the second feature set, the operation of step S220 is executed to generate the second feature vectors of the sample webpage content as training parameters.
(3) A machine learning algorithm (the support vector machine method) is trained on the training parameters from step (2) to obtain the original classification learning model SVM-Model, i.e., the second feature model.
Similarly, identification method 200 further includes a step of updating the second feature model online: the sample data is updated within a predetermined time, and training steps (2) and (3) are repeated to generate a new second feature model that replaces the original one.
As described above, identification method 200 differs from the traditional keyword scanning of webpage content, which simply assigns a weighted score to each keyword. Instead, it vectorizes the crawled webpage content and then classifies the webpage automatically with a machine learning algorithm, improving the accuracy of webpage identification.
In general, the topological structure of malicious websites is simple and the domain name of exterior chain and itself domain name are inconsistent, it is based on this point,
The present invention provides in another webpage for identification whether include hostile content method.As shown in figure 3, the recognition methods 300
Mainly judge whether the webpage contains hostile content by the outer number of links of webpage to be identified and webpage identity.
This method 300 starts from step S310, is believed according to the first identity that the URL of webpage to be identified extracts webpage to be identified
Breath.Specifically, the URL for parsing webpage to be identified first obtains the domain name of webpage to be identified, then using the domain name as this wait know
First identity information of other webpage.Such as the URL of webpage to be identified is:
http://likersgames.netne.net/
Parsing the URL yields the domain name netne.net, so the first identity information of the web page to be identified is netne.net.
Then, in step S320, all external links of the web page to be identified are extracted.
Informally, an external link is a link that leads from another website into one's own website. The HTML page at the URL can be fetched and all of its external links extracted; the present invention places no restriction on the method used to extract external links.
Then, in step S330, the second identity information of the web page to be identified is determined from all of the extracted external links. According to one embodiment of the present invention, the number of occurrences of each external link of the web page to be identified is counted, and the domain name of the most frequent external link is used as the second identity information of the page. Continuing with the URL from step S310, the extracted external link domains and their occurrence counts are:
000webhost.com:16
serviceuptime.com:1
hosting24.com:5
The second identity information of the web page to be identified is therefore 000webhost.com.
In step S340, the first identity information (obtained in step S310) is compared with the second identity information (obtained in step S330), and a third result is output to indicate whether the web page to be identified contains malicious content.
For the URL above, the first identity information (netne.net) and the second identity information (000webhost.com) do not match, so the third result is output as 1, indicating that the web page to be identified contains malicious content. Conversely, if the second identity information matches the first identity information, the third result is output as 0, indicating that the web page to be identified does not contain malicious content.
As another example, the URL of the web page to be identified is:
http://www.baidu.com
Parsing the URL yields the first identity information: baidu.com;
The external link domains extracted from the page and their occurrence counts are:
bdstatic.com:5
hao123.com:2
baidu.com:27
The second identity information is therefore baidu.com;
The second identity information is identical to the first identity information, so the third result is output as 0, and the web page to be identified is judged not to contain malicious content.
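Steps S310 to S340 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names (`registered_domain`, `LinkCollector`, `third_result`) are hypothetical, the domain is approximated naively as the last two labels of the host name, and only absolute `http`/`https` links in `<a>` tags are counted.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


def registered_domain(url):
    """Naive assumption: the domain is the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that holds an absolute http(s) URL."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(("http://", "https://")):
                self.links.append(href)


def third_result(page_url, html):
    """Return 1 if the most frequent link domain differs from the page's own domain."""
    collector = LinkCollector()
    collector.feed(html)
    counts = Counter(registered_domain(link) for link in collector.links)
    if not counts:
        return 0  # assumption: a page without links reports no mismatch
    second_identity = counts.most_common(1)[0][0]   # step S330
    first_identity = registered_domain(page_url)    # step S310
    return 0 if second_identity == first_identity else 1  # step S340
```

The two-label rule is only adequate for domains such as netne.net or baidu.com; suffixes such as .co.uk would need a public-suffix list, which this sketch leaves out.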
In summary, recognition methods 100, 200 and 300 describe three ways of identifying malicious web pages (web pages containing malicious content): recognition method 100 analyzes the URL of the web page, extracts URL features, and classifies the page with a machine learning model; recognition method 200 crawls the web page content, vectorizes it according to preset feature words, and classifies the page with a machine learning model; recognition method 300 analyzes the web page identity to identify malicious pages suspected of imitation. The three methods identify whether a web page contains malicious content from different angles. According to one embodiment of the present invention, the three methods can be combined to analyze comprehensively whether the web page to be identified contains malicious content; this is recognition method 400.
The flowchart of recognition method 400 is shown in Fig. 4. As noted above, recognition method 400 builds on traditional blacklist and whitelist filtering and considers both the URL features and the content features of the web page. Because malicious websites habitually use imitation and disguise techniques, it also analyzes the web page identity to identify malicious pages suspected of imitation. In implementation, it classifies web pages with machine learning models. This overcomes the lag of traditional recognition methods, provides some ability to detect previously unknown malicious web pages, and improves recognition accuracy.
Specifically, the steps of recognition method 400 are as follows:
In step S410, recognition method 100 shown in Fig. 1 is executed to output the first result.
In step S420, recognition method 200 shown in Fig. 2 is executed to output the second result.
In step S430, recognition method 300 shown in Fig. 3 is executed to output the third result.
Then, in step S440, a weighting algorithm is applied to the first result, the second result and the third result to obtain a final result, which is judged as follows:
If the final result is greater than a threshold (0.5 in this embodiment), the web page to be identified is determined to contain malicious content;
If the final result is not greater than the threshold, the web page to be identified is determined not to contain malicious content.
According to one embodiment of the present invention, a simple weighting algorithm can be applied to the first result (r1), the second result (r2) and the third result (r3) to obtain the final result (r):
r = w1 × r1 + w2 × r2 + w3 × r3
where w1, w2 and w3 are the weights of the first result, the second result and the third result respectively and, according to one embodiment of the present invention, take the values 0.4, 0.4 and 0.2.
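The weighted decision of step S440 can be sketched as below. The weights and the threshold are the values given in this embodiment; the function names are hypothetical and chosen only for illustration.

```python
WEIGHTS = (0.4, 0.4, 0.2)  # w1, w2, w3 from the embodiment
THRESHOLD = 0.5            # threshold from the embodiment


def final_result(r1, r2, r3, weights=WEIGHTS):
    """Weighted sum r = w1*r1 + w2*r2 + w3*r3 of the three 0/1 results."""
    w1, w2, w3 = weights
    return w1 * r1 + w2 * r2 + w3 * r3


def contains_malicious_content(r1, r2, r3, threshold=THRESHOLD):
    """A page is flagged only when the final result is strictly greater than the threshold."""
    return final_result(r1, r2, r3) > threshold
```

With these values a page is flagged when both classifiers agree (0.8) or when one classifier and the identity check both fire (0.6); a single classifier (0.4) or the identity check alone (0.2) stays below the threshold.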
Correspondingly, Figs. 5 to 8 show identification systems according to embodiments of the present invention that implement the four recognition methods above. They are described in turn below.
Fig. 5 shows a schematic diagram of a system 500 for identifying whether a web page contains malicious content according to one embodiment of the present invention. The system 500 includes at least a URL extractor 510, a first feature extractor 520 and a first recognition unit 530.
According to one implementation, system 500 further includes a judging and filtering unit 540, adapted to judge whether the URL of the web page to be identified matches a URL in a pre-stored database:
If the URL of the web page to be identified is in the first pre-stored database (that is, the blacklist), the web page to be identified is judged to contain malicious content; and
If the URL of the web page to be identified is in the second pre-stored database (that is, the whitelist), the web page to be identified is judged not to contain malicious content.
A URL that neither the blacklist nor the whitelist identifies is sent to the URL extractor 510.
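The filtering performed by unit 540 can be sketched as a lookup that either decides immediately or defers to the later stages. This is a minimal illustration; the function name `prefilter` and the use of in-memory sets in place of the pre-stored databases are assumptions.

```python
def prefilter(url, blacklist, whitelist):
    """Return 1 (malicious), 0 (not malicious), or None when neither list decides."""
    if url in blacklist:   # first pre-stored database
        return 1
    if url in whitelist:   # second pre-stored database
        return 0
    return None            # undecided: the caller forwards the URL to the URL extractor
```

Returning `None` for the undecided case keeps the list lookup separate from the model-based classification that follows it.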
The URL extractor 510 is adapted to parse the URL of the web page to be identified.
The first feature extractor 520 is adapted to extract URL features from the parsed URL to generate a first feature set. According to one embodiment of the present invention, the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is legal, whether the URL contains an IP address, number of specified characters in the URL, host string length, number of specified characters in the host string, length of the longest string in the host string, number of specified characters in the path, path name length, number of specified characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a specified string. Each feature is described in detail in the description given with reference to Fig. 1.
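A subset of these URL features can be computed with the Python standard library, as in the sketch below. The specified characters and specified strings are not listed in this passage, so `SPECIAL_CHARS` and `SUSPICIOUS_STRINGS` are placeholder assumptions, as are all function names. The top-level-domain legality check is omitted because it needs a reference list of valid domains, and the path and path name features are not distinguished here.

```python
import ipaddress
import re
from urllib.parse import urlparse

SPECIAL_CHARS = "-_@%=&"                  # assumption: stands in for the "specified characters"
SUSPICIOUS_STRINGS = ("login", "verify")  # assumption: stands in for the "specified strings"


def longest_token(text):
    """Length of the longest alphanumeric run in text."""
    return max((len(t) for t in re.split(r"[^A-Za-z0-9]+", text)), default=0)


def host_is_ip(host):
    """1 if the host is a literal IP address, else 0."""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def url_features(url):
    """Return a list of numeric URL features in a fixed order."""
    parts = urlparse(url)
    host = parts.hostname or ""
    path = parts.path
    return [
        len(url),                                      # 0: URL length
        url.lower().count("http"),                     # 1: occurrences of "http"
        host_is_ip(host),                              # 2: host is an IP address
        sum(url.count(c) for c in SPECIAL_CHARS),      # 3: specified characters in URL
        len(host),                                     # 4: host string length
        sum(host.count(c) for c in SPECIAL_CHARS),     # 5: specified characters in host
        longest_token(host),                           # 6: longest string in host
        len(path),                                     # 7: path length
        sum(path.count(c) for c in SPECIAL_CHARS),     # 8: specified characters in path
        longest_token(path),                           # 9: longest string in path
        len([seg for seg in path.split("/") if seg]),  # 10: path depth
        len(parts.query),                              # 11: query field length
        int(any(s in url.lower() for s in SUSPICIOUS_STRINGS)),  # 12: specified string present
    ]
```

Because every feature is already numeric, the list can serve directly as the unnormalized feature vector.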
The first feature extractor 520 is further adapted to generate a first feature vector from the first feature set. According to one embodiment of the present invention, the first feature extractor 520 includes a numeralization subunit 522 and a normalization subunit 524.
The numeralization subunit 522 is adapted to convert each feature in the first feature set into a numeric feature value and to assemble the feature values into a feature vector.
The normalization subunit 524 is adapted to normalize the feature value of each dimension of the numeralized feature vector to generate the first feature vector. For example, the normalization subunit 524 is configured to normalize the feature value of each dimension of the feature vector to the range [-1, 1] using, for the i-th dimension feature value F_i, its average value, its maximum value F_{i,max} and its minimum value F_{i,min}.
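A minimal sketch of the normalization follows, assuming it is mean normalization, F'_i = (F_i - mean_i) / (F_{i,max} - F_{i,min}). This form is consistent with the quantities named above and with the target range [-1, 1], but it is an assumption rather than a formula confirmed by this text. The function name and the handling of constant dimensions are likewise illustrative.

```python
def mean_normalize(vectors):
    """Normalize each dimension to [-1, 1] via (F - mean) / (max - min)."""
    stats = []
    for column in zip(*vectors):
        mean = sum(column) / len(column)
        spread = max(column) - min(column)
        stats.append((mean, spread))
    return [
        # assumption: a dimension with no spread carries no information and maps to 0.0
        [(value - mean) / spread if spread else 0.0
         for value, (mean, spread) in zip(vector, stats)]
        for vector in vectors
    ]
```

The statistics are computed over the whole sample set, so the same mean, maximum and minimum would have to be stored and reused when a new page is normalized at identification time.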
The first recognition unit 530 is adapted to process the first feature vector with the first feature model and to output a first result indicating whether the web page to be identified contains malicious content. A first result of 1 indicates that the web page to be identified contains malicious content; a first result of 0 indicates that it does not.
According to an embodiment of the present invention, system 500 is also configured to perform the operation of training the first feature model.
Here, the URL extractor 510 is further adapted to extract the URLs of a large number of web pages that have been labelled as not containing malicious content or as containing malicious content, to serve as sample data. The first feature extractor 520 is further adapted to form the first feature set from these URLs and to generate the corresponding first feature vectors from the first feature set as training parameters. In addition, system 500 further includes a first training unit 550 coupled to the first feature extractor 520, adapted to train on the training parameters extracted by the first feature extractor 520 with a machine learning algorithm (for example, the support vector machine method, SVM) to obtain the first feature model.
In this embodiment, to cope with the variability of malicious website attacks, system 500 may further include a first updating unit 560, adapted to update the sample data within a predetermined time, generate the first feature vectors of the new sample data, and input the updated first feature vectors into the first feature model for training, so that the first feature model is updated regularly.
In addition, the first updating unit 560 is further adapted to change the dimension of the first feature vector by adding or deleting features in the first feature set, so as to generate a new first feature vector.
Fig. 6 shows a schematic diagram of a system 600 for identifying whether a web page contains malicious content according to another embodiment of the present invention. The system 600 includes at least a page analyzer 610, a second feature extractor 620 and a second recognition unit 630.
The page analyzer 610 is adapted to crawl the content of the web page to be identified and to perform word segmentation on the crawled content to obtain a word sequence. According to one implementation, the page analyzer 610 includes a segmenter adapted to perform word segmentation on web page content with a dictionary-based segmentation algorithm. The segmentation algorithm may be the MMSEG algorithm, which comprises a dictionary, two matching algorithms and four disambiguation rules.
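Dictionary-based segmentation can be illustrated with forward maximum matching, sketched below. This is a simplified stand-in and not the full MMSEG algorithm: it performs a single greedy match and applies none of the disambiguation rules. The function name, the `max_len` parameter and the dictionary passed in are assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

Characters that match no dictionary entry are emitted one at a time, so the function always consumes the whole input.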
The page analyzer 610 is further adapted to obtain the web page content of preset web pages and to perform word segmentation on the obtained content to obtain word sequences.
The second feature extractor 620 is adapted to construct a second feature vector whose dimension is a first predetermined number (for example, a first predetermined number chosen between 450 and 550) according to whether the feature words of the second feature set appear in the word sequence, where the second feature set pre-stores the first predetermined number of feature words.
According to this implementation, the second feature extractor 620 further includes a coupling subunit 622. The coupling subunit 622 is adapted to search the word sequence, in order, for each feature word in the second feature set:
If a feature word is matched in the word sequence, the value at that feature word's position in the second feature set is set to 1;
If a feature word is not matched in the word sequence, the value at that feature word's position in the second feature set is set to 0.
The second feature extractor 620 is further adapted to generate, from the values assigned at the feature word positions, the second feature vector whose dimension is the first predetermined number.
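The construction of the second feature vector can be sketched as a presence test per feature word. This is a minimal illustration; the function name and the example words are hypothetical.

```python
def second_feature_vector(word_sequence, feature_words):
    """Return one 0/1 entry per feature word, in the order of the feature set."""
    present = set(word_sequence)  # set lookup avoids rescanning the sequence per feature word
    return [1 if word in present else 0 for word in feature_words]
```

For example, with the feature words `["bank", "login", "prize"]` and the word sequence `["please", "login", "to", "bank"]`, the resulting vector is `[1, 1, 0]`. The vector length always equals the number of feature words, which is the first predetermined number.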
The system 600 further includes a feature set generation unit 640, adapted to compute, for each word in the word sequence, a second feature value characterizing the importance of that word, and to choose the first predetermined number of words as feature words in descending order of second feature value to form the second feature set. The second feature value is defined as the distance between the probability distribution of whether a web page contains malicious content given that a certain word appears and the probability distribution of whether a web page contains malicious content. It can be expressed by the expected cross entropy of the word, which is defined in terms of the following probabilities: P(phish|w) is the probability that the web page to be identified is a phishing page given that the word w appears, P(phish) is the probability of a phishing page, P(nophish|w) is the probability that the web page to be identified is not a phishing page given that the word w appears, and P(nophish) is the probability of a non-phishing page.
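Feature word selection can be sketched as follows, assuming the expected cross entropy takes the form CE(w) = P(phish|w) log(P(phish|w) / P(phish)) + P(nophish|w) log(P(nophish|w) / P(nophish)). That form uses exactly the four probabilities defined above, but it is an assumption: some formulations of expected cross entropy also multiply by P(w), which is left out here because it is not among the definitions. Probabilities are estimated by simple document counts, and all names are hypothetical.

```python
import math


def expected_cross_entropy(word, docs):
    """docs is a list of (set_of_words, is_phish) pairs; returns the assumed CE(w)."""
    labels_with_word = [is_phish for words, is_phish in docs if word in words]
    if not labels_with_word:
        return 0.0
    p_phish = sum(1 for _, is_phish in docs if is_phish) / len(docs)
    p_phish_w = sum(labels_with_word) / len(labels_with_word)
    score = 0.0
    for posterior, prior in ((p_phish_w, p_phish), (1 - p_phish_w, 1 - p_phish)):
        if posterior > 0 and prior > 0:  # a zero-probability term contributes nothing
            score += posterior * math.log(posterior / prior)
    return score


def select_feature_words(docs, count):
    """Pick the `count` words with the highest score; ties are broken alphabetically."""
    vocabulary = set().union(*(words for words, _ in docs))
    ranked = sorted(vocabulary, key=lambda w: (-expected_cross_entropy(w, docs), w))
    return ranked[:count]
```

Under this form a word scores high when its presence shifts the class distribution in either direction, so words typical of benign pages can also be selected as feature words.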
The second recognition unit 630 is adapted to process the second feature vector with the second feature model and to output a second result indicating whether the web page to be identified contains malicious content. A second result of 1 indicates that the web page to be identified contains malicious content; a second result of 0 indicates that it does not.
Like system 500, system 600 is also arranged to perform the operation of training the second feature model. In this case, the page analyzer 610 is further adapted to crawl the content of a large number of web pages that have been labelled as not containing malicious content or as containing malicious content, to serve as sample data. The second feature extractor 620 is further adapted to generate, from the feature words in the second feature set, the second feature vectors of the sample web page content as training parameters. In addition, system 600 further includes a second training unit 650, adapted to train on the training parameters with the machine learning algorithm to obtain the second feature model.
Furthermore, to cope with the variability of malicious website attacks, system 600 further includes a second updating unit 660, adapted to update the sample data within a predetermined time and repeat the training steps, so that the second feature model is updated regularly.
Fig. 7 shows a schematic diagram of a system 700 for identifying whether a web page contains malicious content according to yet another embodiment of the present invention. The system 700 includes a first information acquisition unit 710, a second information acquisition unit 720 and a third recognition unit 730.
The first information acquisition unit 710 is adapted to extract the first identity information of the web page to be identified from its URL. Specifically, the first information acquisition unit 710 is adapted to parse the URL of the web page to be identified, obtain the domain name of the page, and use that domain name as the first identity information of the web page to be identified.
The second information acquisition unit 720 is adapted to extract all external links of the web page to be identified and to determine the second identity information of the page from those external links. According to one implementation, the second information acquisition unit 720 may include a statistics subunit 722, adapted to count the number of occurrences of each extracted external link of the web page to be identified; the second information acquisition unit 720 is then adapted to choose the domain name of the most frequent external link as the second identity information. For example, for the URL http://www.baidu.com, the extracted external links are bdstatic.com (5 occurrences) and baidu.com (27 occurrences), so baidu.com is determined to be the second identity information of that URL.
The third recognition unit 730 is adapted to compare the first identity information with the second identity information and to output a third result indicating whether the web page to be identified contains malicious content. Specifically, if the second identity information does not match the first identity information, the third result is output as 1, indicating that the web page to be identified contains malicious content; if the second identity information matches the first identity information, the third result is output as 0, indicating that the web page to be identified does not contain malicious content.
Fig. 8 shows a schematic diagram of a system 800 for identifying whether a web page contains malicious content according to yet another embodiment of the present invention. The system 800 combines the above systems 500, 600 and 700 with a weighting unit 810 and a fourth recognition unit 820.
The identification system 500 is adapted to output the first result;
The identification system 600 is adapted to output the second result;
The identification system 700 is adapted to output the third result;
The weighting unit 810 is adapted to apply a weighting algorithm to the first result, the second result and the third result to obtain a final result.
According to one embodiment of the present invention, a simple weighting algorithm can be applied to the first result (r1), the second result (r2) and the third result (r3) to obtain the final result (r):
r = w1 × r1 + w2 × r2 + w3 × r3
where w1, w2 and w3 are the weights of the first result, the second result and the third result respectively and, according to one embodiment of the present invention, take the values 0.4, 0.4 and 0.2.
The fourth recognition unit 820 is adapted to identify the web page to be identified as containing malicious content if the final result is greater than a threshold (for example, 0.5), and as not containing malicious content if the final result is not greater than the threshold.
The identification system 800 builds on traditional blacklist and whitelist filtering and considers both the URL features and the content features of the web page. Because malicious websites habitually use imitation and disguise, it also analyzes the web page identity to identify malicious pages suspected of imitation. In implementation, it classifies web pages with machine learning models, which overcomes the lag of traditional recognition methods, provides some ability to detect previously unknown malicious web pages, and thereby improves recognition accuracy.
It should be appreciated that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may further be divided into multiple submodules.
Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment can be combined into one module, unit or component, and can further be divided into multiple submodules, subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
A3. The method of A1 or A2, wherein the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is legal, whether the URL contains an IP address, number of specified characters in the URL, host string length, number of specified characters in the host string, length of the longest string in the host string, number of specified characters in the path, path name length, number of specified characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a specified string.
A4. The method of any one of A1-3, wherein the step of generating the first feature vector from the first feature set further includes: converting each feature in the first feature set into a numeric feature value and assembling the feature values into a feature vector; and normalizing the feature value of each dimension of the feature vector to generate the first feature vector.
A5. The method of A4, wherein the normalizing step includes normalizing the feature value of each dimension of the feature vector to the range [-1, 1] using, for the i-th dimension feature value F_i, its average value, its maximum value F_{i,max} and its minimum value F_{i,min}.
A6. The method of any one of A1-5, further including a step of training the first feature model: choosing the URLs of a large number of web pages labelled as not containing malicious content or as containing malicious content as sample data, and forming the first feature set from these URLs; generating the corresponding first feature vectors from the first feature set of the sample data as training parameters; and training on the training parameters with a machine learning algorithm to obtain the first feature model.
A7. The method of A6, further including the steps of: updating the sample data within a predetermined time and generating the first feature vectors of the new sample data; and inputting the updated first feature vectors into the first feature model for training, so that the first feature model is updated regularly.
A8. The method of A7, wherein the step of generating the first feature vectors of the new sample data further includes: changing the dimension of the first feature vector by adding or deleting features in the first feature set.
A9. The method of any one of A1-8, wherein the step of outputting the first result to indicate whether the web page to be identified contains malicious content includes: a first result of 1 indicating that the web page to be identified contains malicious content; and a first result of 0 indicating that the web page to be identified does not contain malicious content.
A10. The method of any one of A6-9, wherein the machine learning algorithm is the support vector machine method.
B13. The method of B11 or B12, wherein the step of constructing the second feature vector according to whether the feature words appear in the word sequence includes: for each feature word in the second feature set, searching the word sequence in order for that feature word; if the feature word appears in the word sequence, setting the value at that feature word's position in the second feature set to 1; if the feature word does not appear in the word sequence, setting the value at that feature word's position in the second feature set to 0; and generating, from the values assigned at the feature word positions, the second feature vector whose dimension is the first predetermined number.
B14. The method of any one of B11-13, wherein the second feature set is generated by the following steps: obtaining the web page content of preset web pages and performing word segmentation on the obtained content to obtain a word sequence; computing, for each word in the word sequence, a second feature value characterizing the importance of that word; and choosing the first predetermined number of words as feature words according to the second feature value to form the second feature set.
B15. The method of B14, wherein the second feature value is defined as the distance between the probability distribution of whether a web page contains malicious content given that a certain word appears and the probability distribution of whether a web page contains malicious content.
B16. The method of B15, wherein the second feature value is the expected cross entropy CE(w) of the word w, defined in terms of the following probabilities: P(phish|w) is the probability that the web page to be identified is a phishing page given that the word w appears, P(phish) is the probability of a phishing page, P(nophish|w) is the probability that the web page to be identified is not a phishing page given that the word w appears, and P(nophish) is the probability of a non-phishing page.
B17. The method of any one of B14-16, wherein the step of choosing the first predetermined number of words according to the second feature value to form the second feature set includes: choosing the first predetermined number of words as feature words in descending order of second feature value to form the second feature set.
B18. The method of any one of B11-17, further including a step of training the second feature model: choosing the content of a large number of web pages labelled as containing malicious content or as not containing malicious content as sample data; generating, from the feature words in the second feature set, the second feature vectors of the sample web page content as training parameters; and training on the training parameters with the machine learning algorithm to obtain the second feature model.
B19. The method of B18, further including the step of: updating the sample data within a predetermined time and repeating the training steps, so that the second feature model is updated regularly.
B20. The method of any one of B11-19, wherein the first predetermined number is between 450 and 550.
B21. The method of any one of B11-20, wherein the step of outputting the second result to indicate whether the web page to be identified contains malicious content includes: a second result of 1 indicating that the web page to be identified contains malicious content; and a second result of 0 indicating that the web page to be identified does not contain malicious content.
B22. The method of any one of B18-21, wherein the machine learning algorithm is the support vector machine method.
C24. The method of C23, wherein the step of extracting the first identity information of the web page to be identified includes: parsing the URL of the web page to be identified to obtain its domain name; and using the domain name as the first identity information of the web page to be identified.
C25. The method of C23 or C24, wherein the step of determining the second identity information from the external links includes: counting the number of occurrences of each external link of the web page to be identified; and choosing the domain name of the most frequent external link as the second identity information.
C26. The method of any one of C23-25, wherein the step of comparing the first identity information with the second identity information and outputting the third result includes: if the second identity information does not match the first identity information, outputting the third result as 1, indicating that the web page to be identified contains malicious content; and if the second identity information matches the first identity information, outputting the third result as 0, indicating that the web page to be identified does not contain malicious content.
D28. The method of D27, wherein the weights of the first result, the second result and the third result are 0.4, 0.4 and 0.2 respectively; and the threshold is 0.5.
E30. The system according to E29, further comprising: a judging and filtering unit adapted to judge whether the URL of the webpage to be identified matches a URL in a pre-stored database; if the URL of the webpage to be identified is in a first pre-stored database, the webpage to be identified is judged to contain malicious content; and if the URL of the webpage to be identified is in a second pre-stored database, the webpage to be identified is judged not to contain malicious content.
E31. The system according to E29 or E30, wherein the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is valid, whether an IP address is included, number of designated characters in the URL, host string length, number of designated characters in the host string, length of the longest string in the host string, number of designated characters in the path, path name length, number of designated characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a designated character string.
E32. The system according to any one of E29-E31, wherein the first feature extractor includes: a numericalization subunit adapted to convert each feature in the first feature set into a numerical feature value and to combine the feature values into a feature vector; and a normalization subunit adapted to normalize each dimension of the numericalized feature vector to generate the first feature vector.
E33. The system according to E32, wherein the normalization subunit is configured to normalize each dimension of the feature vector to the range [-1, 1], where $F_i$ is the feature value of the i-th dimension, $\bar{F}_i$ is the average of the i-th dimension feature values, $F_{i,\max}$ is the maximum of the i-th dimension feature values, and $F_{i,\min}$ is the minimum of the i-th dimension feature values.
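The normalization formula itself does not appear in this text. One form that is consistent with the four symbols defined here, and that maps every value into [-1, 1], is mean normalization. It is given as an assumption, not as the formula of the original publication; $F_i'$ denotes the normalized value and is a symbol introduced here for illustration.

```latex
F_i' = \frac{F_i - \bar{F}_i}{F_{i,\max} - F_{i,\min}}
```

Because both $F_i$ and $\bar{F}_i$ lie between $F_{i,\min}$ and $F_{i,\max}$, the numerator never exceeds the denominator in absolute value, so $-1 \le F_i' \le 1$.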
E34. The system according to any one of E29-E33, wherein the URL extractor is further adapted to extract, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content and of webpages that have been labeled as containing malicious content; the first feature extractor is further adapted to form a first feature set from these URLs and to generate the corresponding first feature vectors from the first feature set as training parameters; and the system further includes a first training unit adapted to train on the training parameters with a machine learning algorithm to obtain the first feature model.
E35. The system according to E34, further comprising: a first updating unit adapted to update the sample data within a predetermined time, generate the first feature vectors of the new sample data, and input the updated first feature vectors into the first feature model for training, so that the first feature model is updated regularly.
E36. The system according to E35, wherein the first updating unit is further adapted to change the dimension of the first feature vector by adding or deleting features in the first feature set, so as to generate a new first feature vector.
E37. The system according to any one of E29-E36, wherein an output first result of 1 indicates that the webpage to be identified contains malicious content, and an output first result of 0 indicates that the webpage to be identified does not contain malicious content.
E38. The system according to any one of E34-E37, wherein the machine learning algorithm is a support vector machine (SVM) method.
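The URL features listed in E31 can be sketched with the Python standard library. This is an illustrative subset only: the set of "designated characters" is not given in the text, so `SPECIAL_CHARS` is an assumption, as are the function names and the regular expression used to detect an IPv4 host.

```python
import re
from urllib.parse import urlsplit

# Assumed set of "designated characters"; the text does not enumerate them.
SPECIAL_CHARS = "-_@%=&"


def count_special(text):
    """Count occurrences of the assumed designated characters in text."""
    return sum(text.count(ch) for ch in SPECIAL_CHARS)


def longest_token(text):
    """Length of the longest alphanumeric run in text (0 if there is none)."""
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    return max((len(t) for t in tokens), default=0)


def url_features(url):
    """Turn a URL into numeric features (a subset of the list in E31)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    return {
        "url_length": float(len(url)),
        "http_count": float(url.lower().count("http")),
        "has_ip": 1.0 if is_ip else 0.0,
        "special_in_url": float(count_special(url)),
        "host_length": float(len(host)),
        "host_longest_token": float(longest_token(host)),
        "path_length": float(len(path)),
        "path_depth": float(len([seg for seg in path.split("/") if seg])),
        "query_length": float(len(parts.query)),
    }


features = url_features("http://192.168.0.1/a/b/login.php?id=1")
print(features["has_ip"], features["path_depth"])  # 1.0 3.0
```

The values of the dictionary, taken in a fixed order, would form the un-normalized feature vector that the normalization subunit of E32 then rescales.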
F40. The system according to F39, wherein the page analyzer further includes: a word segmenter adapted to perform word segmentation on the webpage content using a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms and four disambiguation rules.
F41. The system according to F39 or F40, wherein the second feature extractor includes: a matching subunit adapted to search the word sequence, for each feature word in the second feature set in turn, for that feature word; if the feature word is matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 1, and if the feature word is not matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 0; and the second feature extractor is further adapted to generate, from the values assigned to the positions of the feature words, a second feature vector whose dimension is the first predetermined number.
F42. The system according to any one of F39-F41, wherein the page analyzer is further adapted to obtain the webpage content of preset webpages and to perform word segmentation on the obtained webpage content to obtain a word sequence; and the system further includes: a feature set generation unit adapted to compute, for each word in the word sequence, a second feature value characterizing the importance of that word, and to select the first predetermined number of words as feature words according to the second feature values, forming the second feature set.
F43. The system according to F42, wherein the second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word occurs, and the probability distribution of whether a webpage contains malicious content.
F44. The system according to F43, wherein the second feature value is the expected cross entropy CE(w) of the word w, where P(phish | w) is the probability that the webpage to be identified is a phishing webpage given that the word w occurs, P(phish) is the probability of a phishing webpage, P(nophish | w) is the probability that the webpage to be identified is not a phishing webpage given that the word w occurs, and P(nophish) is the probability of a non-phishing webpage.
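The expression for CE(w) does not appear in this text. A form that uses exactly the four probabilities defined here, and that matches the description in F43 of a distance between the conditional and the unconditional class distributions, is shown below as an assumption:

```latex
CE(w) = P(\mathrm{phish} \mid w)\,\log\frac{P(\mathrm{phish} \mid w)}{P(\mathrm{phish})}
      + P(\mathrm{nophish} \mid w)\,\log\frac{P(\mathrm{nophish} \mid w)}{P(\mathrm{nophish})}
```

Some formulations of expected cross entropy multiply this sum by $P(w)$, the probability that the word occurs. That factor is omitted here because the text defines no such symbol. Under this form, $CE(w) = 0$ when the occurrence of $w$ leaves the class distribution unchanged, and $CE(w)$ grows as the word becomes more informative.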
F45. The system according to any one of F42-F44, wherein the feature set generation unit is configured to select the first predetermined number of words as feature words in descending order of the second feature value, forming the second feature set.
F46. The system according to any one of F39-F45, wherein the page analyzer is further adapted to crawl, as sample data, the webpage content of a large number of webpages that have been labeled as not containing malicious content and of webpages that have been labeled as containing malicious content; the second feature extractor is further adapted to generate, according to the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters; and the system further includes a second training unit adapted to train on the training parameters with a machine learning algorithm to obtain the second feature model.
F47. The system according to F46, further comprising: a second updating unit adapted to update the sample data within a predetermined time and repeat the training step, so that the second feature model is updated regularly.
F48. The system according to any one of F39-F47, wherein the first predetermined number is between 450 and 550.
F49. The system according to any one of F39-F48, wherein an output second result of 1 indicates that the webpage to be identified contains malicious content, and an output second result of 0 indicates that the webpage to be identified does not contain malicious content.
F50. The system according to any one of F46-F49, wherein the machine learning algorithm is a support vector machine (SVM) method.
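The feature-word selection of F42-F45 and the 0/1 vector of F41 can be sketched as follows. The sketch makes several assumptions that are not in the text: probabilities are estimated from document frequencies over a small labeled sample, CE(w) is taken as the sum of P(c | w) log(P(c | w) / P(c)) over the two classes, ties are broken alphabetically, and all function names are hypothetical.

```python
import math
from collections import Counter


def expected_cross_entropy(docs, labels):
    """Score each word by an assumed form of CE(w).

    docs: list of token lists; labels: 1 = phishing, 0 = not phishing.
    """
    n = len(docs)
    p_phish = sum(labels) / n
    priors = {1: p_phish, 0: 1.0 - p_phish}
    doc_freq = Counter()    # number of documents containing the word
    phish_freq = Counter()  # number of phishing documents containing the word
    for tokens, label in zip(docs, labels):
        for word in set(tokens):
            doc_freq[word] += 1
            if label == 1:
                phish_freq[word] += 1
    scores = {}
    for word, df in doc_freq.items():
        p1 = phish_freq[word] / df
        cond = {1: p1, 0: 1.0 - p1}
        # Terms with zero conditional probability contribute nothing.
        scores[word] = sum(
            cond[c] * math.log(cond[c] / priors[c])
            for c in (0, 1)
            if cond[c] > 0 and priors[c] > 0
        )
    return scores


def select_feature_words(scores, k):
    """Pick the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def binary_vector(tokens, feature_words):
    """1 where the feature word occurs in the word sequence, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in feature_words]


docs = [["bank", "login", "verify"], ["bank", "news"],
        ["login", "verify", "password"], ["news", "weather"]]
labels = [1, 0, 1, 0]
scores = expected_cross_entropy(docs, labels)
feature_words = select_feature_words(scores, 2)
print(feature_words)                                          # ['login', 'news']
print(binary_vector(["please", "login", "now"], feature_words))  # [1, 0]
```

In this toy sample "bank" occurs equally often in both classes and scores 0, while "login" (phishing only) and "news" (non-phishing only) score highest. The measure rewards words that are informative in either direction. In the text, k would be the first predetermined number, between 450 and 550.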
G52. The system according to G51, wherein the first information acquisition unit is further adapted to parse the URL of the webpage to be identified, obtain the domain name of the webpage to be identified, and use the domain name as the first identity information of the webpage to be identified.
G53. The system according to G51 or G52, wherein the second information acquisition unit further includes: a statistics subunit adapted to count the number of occurrences of each of the extracted external links of the webpage to be identified; and the second information acquisition unit is further adapted to select the domain name of the most frequently occurring external link as the second identity information.
G54. The system according to any one of G51-G53, wherein the third identification unit is adapted to: if the second identity information does not match the first identity information, output a third result of 1, indicating that the webpage to be identified contains malicious content; and if the second identity information matches the first identity information, output a third result of 0, indicating that the webpage to be identified does not contain malicious content.
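The identity comparison of G52-G54 can be sketched with the standard library. Several points are assumptions: "external links" is read as every hyperlink on the page, resolved against the page URL (if only links to other domains were counted, the two identities could never match); the host name stands in for the domain name; a page with no links yields 0; and the class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def host_of(url):
    """Lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def third_result(page_url, html):
    """Return 1 if the page's own host differs from its most-linked host."""
    first_identity = host_of(page_url)
    collector = LinkCollector()
    collector.feed(html)
    # Resolve relative links against the page URL, then count hosts.
    hosts = (host_of(urljoin(page_url, href)) for href in collector.hrefs)
    counts = Counter(host for host in hosts if host)
    if not counts:
        return 0  # assumption: no links means no evidence of a mismatch
    second_identity = counts.most_common(1)[0][0]
    return 0 if second_identity == first_identity else 1


phish_html = ('<a href="https://bank.example/login">Log in</a>'
              '<a href="https://bank.example/help">Help</a>'
              '<a href="/local">Local</a>')
print(third_result("http://evil.example/x", phish_html))  # 1
```

The intuition is that a page imitating another site tends to link mostly to the imitated site, so its most frequent link domain differs from its own domain.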
H56. The system according to H55, wherein the weight factors corresponding to the first result, the second result and the third result are 0.4, 0.4 and 0.2 respectively; and the threshold is 0.5.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
In addition, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. A processor with the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or limit the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is intended to be illustrative, not limiting, of the scope of the invention, which is defined by the appended claims.
Claims (48)
1. A method for identifying whether a webpage contains malicious content, the method comprising the steps of:
executing a first identification method to output a first result, wherein the first identification method comprises:
parsing the URL of the webpage to be identified and extracting URL features from the URL to generate a first feature set;
generating a first feature vector from the first feature set, wherein each feature in the first feature set is converted into a numerical feature value, the feature values are combined into a feature vector, and each dimension of the feature vector is normalized to generate the first feature vector; and
processing the first feature vector with a first feature model and outputting the first result to indicate whether the webpage to be identified contains malicious content;
executing a second identification method to output a second result, wherein the second identification method comprises:
crawling the content of the webpage to be identified, and performing word segmentation on the crawled webpage content to obtain a word sequence;
constructing a second feature vector whose dimension is a first predetermined number according to whether the feature words of a second feature set are present in the word sequence, wherein the first predetermined number of feature words are pre-stored in the second feature set;
processing the second feature vector with a second feature model and outputting the second result to indicate whether the webpage to be identified contains malicious content;
executing a third identification method to output a third result, wherein the third identification method comprises:
extracting first identity information of the webpage to be identified from the URL of the webpage to be identified;
extracting all external links of the webpage to be identified;
determining second identity information of the webpage to be identified from the external links;
comparing the first identity information with the second identity information and outputting the third result to indicate whether the webpage to be identified contains malicious content;
applying a weighting algorithm to the first result, the second result and the third result to obtain a final result;
if the final result is greater than a threshold, determining that the webpage to be identified contains malicious content; and
if the final result is not greater than the threshold, determining that the webpage to be identified does not contain malicious content.
2. The method according to claim 1, wherein the weight factors corresponding to the first result, the second result and the third result are 0.4, 0.4 and 0.2 respectively; and
the threshold is 0.5.
3. The method according to claim 2, further comprising a preprocessing step of:
extracting the URL of the webpage to be identified, and judging whether the URL of the webpage to be identified matches a URL in a pre-stored database;
if the URL of the webpage to be identified is in a first pre-stored database, judging that the webpage to be identified contains malicious content; and
if the URL of the webpage to be identified is in a second pre-stored database, judging that the webpage to be identified does not contain malicious content.
4. The method according to claim 3, wherein the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is valid, whether an IP address is included, number of designated characters in the URL, host string length, number of designated characters in the host string, length of the longest string in the host string, number of designated characters in the path, path name length, number of designated characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a designated character string.
5. The method according to claim 4, wherein the normalization step comprises:
normalizing each dimension of the feature vector to the range [-1, 1],
where $F_i$ is the feature value of the i-th dimension, $\bar{F}_i$ is the average of the i-th dimension feature values, $F_{i,\max}$ is the maximum of the i-th dimension feature values, and $F_{i,\min}$ is the minimum of the i-th dimension feature values.
6. The method according to claim 5, further comprising a step of training the first feature model:
selecting, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content and of webpages that have been labeled as containing malicious content, and forming a first feature set from these URLs;
generating the corresponding first feature vectors from the first feature set of the sample data, as training parameters; and
training on the training parameters with a machine learning algorithm to obtain the first feature model.
7. The method according to claim 6, further comprising the steps of:
updating the sample data within a predetermined time, and generating the first feature vectors of the new sample data; and
inputting the updated first feature vectors into the first feature model for training, so that the first feature model is updated regularly.
8. The method according to claim 7, wherein the step of generating the first feature vectors of the new sample data further comprises:
changing the dimension of the first feature vector by adding or deleting features in the first feature set.
9. The method according to claim 8, wherein the step of outputting the first result to indicate whether the webpage to be identified contains malicious content comprises:
an output first result of 1 indicating that the webpage to be identified contains malicious content; and
an output first result of 0 indicating that the webpage to be identified does not contain malicious content.
10. The method according to claim 9, wherein the machine learning algorithm is a support vector machine (SVM) method.
11. The method according to claim 10, wherein the step of performing word segmentation on the webpage content comprises:
performing word segmentation using a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms and four disambiguation rules.
12. The method according to claim 11, wherein the step of constructing the second feature vector according to whether the feature words are present in the word sequence comprises:
for each feature word in the second feature set in turn, searching the word sequence for that feature word;
if the feature word is present in the word sequence, setting the value at the position corresponding to that feature word in the second feature set to 1;
if the feature word is not present in the word sequence, setting the value at the position corresponding to that feature word in the second feature set to 0; and
generating, from the values assigned to the positions of the feature words, a second feature vector whose dimension is the first predetermined number.
13. The method according to claim 12, wherein the second feature set is generated by the following steps:
obtaining the webpage content of preset webpages, and performing word segmentation on the obtained webpage content to obtain a word sequence;
for each word in the word sequence, computing a second feature value characterizing the importance of that word; and
selecting the first predetermined number of words as feature words according to the second feature values, forming the second feature set.
14. The method according to claim 13, wherein the second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word occurs, and the probability distribution of whether a webpage contains malicious content.
15. The method according to claim 14, wherein the second feature value is the expected cross entropy CE(w) of the word w,
where P(phish | w) is the probability that the webpage to be identified is a phishing webpage given that the word w occurs, P(phish) is the probability of a phishing webpage, P(nophish | w) is the probability that the webpage to be identified is not a phishing webpage given that the word w occurs, and P(nophish) is the probability of a non-phishing webpage.
16. The method according to claim 15, wherein the step of selecting the first predetermined number of words according to the second feature values to form the second feature set comprises:
selecting the first predetermined number of words as feature words in descending order of the second feature value, forming the second feature set.
17. The method according to claim 16, further comprising a step of training the second feature model:
selecting, as sample data, the webpage content of a large number of webpages that have been labeled as containing malicious content and of webpages that have been labeled as not containing malicious content;
generating, according to the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters; and
training on the training parameters with a machine learning algorithm to obtain the second feature model.
18. The method according to claim 17, further comprising the step of:
updating the sample data within a predetermined time and repeating the training step, so that the second feature model is updated regularly.
19. The method according to claim 18, wherein the first predetermined number is between 450 and 550.
20. The method according to claim 19, wherein the step of outputting the second result to indicate whether the webpage to be identified contains malicious content comprises:
an output second result of 1 indicating that the webpage to be identified contains malicious content; and
an output second result of 0 indicating that the webpage to be identified does not contain malicious content.
21. The method according to claim 20, wherein the machine learning algorithm is a support vector machine (SVM) method.
22. The method according to claim 21, wherein the step of extracting the first identity information of the webpage to be identified comprises:
parsing the URL of the webpage to be identified to obtain the domain name of the webpage to be identified; and
using the domain name as the first identity information of the webpage to be identified.
23. The method according to claim 22, wherein the step of determining the second identity information from the external links comprises:
counting the number of occurrences of each of the external links of the webpage to be identified; and
selecting the domain name of the most frequently occurring external link as the second identity information.
24. The method according to claim 23, wherein the step of comparing the first identity information with the second identity information and outputting the third result comprises:
if the second identity information does not match the first identity information, outputting a third result of 1, indicating that the webpage to be identified contains malicious content; and
if the second identity information matches the first identity information, outputting a third result of 0, indicating that the webpage to be identified does not contain malicious content.
25. A system for identifying whether a webpage contains malicious content, the system comprising:
a URL extractor adapted to parse the URL of the webpage to be identified;
a first feature extractor adapted to extract URL features from the URL to generate a first feature set, and further adapted to generate a first feature vector from the first feature set, the first feature extractor comprising:
a numericalization subunit adapted to convert each feature in the first feature set into a numerical feature value and to combine the feature values into a feature vector;
a normalization subunit adapted to normalize each dimension of the numericalized feature vector to generate the first feature vector; and
a first identification unit adapted to process the first feature vector with a first feature model and to output a first result to indicate whether the webpage to be identified contains malicious content;
a page analyzer adapted to crawl the content of the webpage to be identified and to perform word segmentation on the crawled webpage content to obtain a word sequence;
a second feature extractor adapted to construct a second feature vector whose dimension is a first predetermined number according to whether the feature words of a second feature set are present in the word sequence, wherein the first predetermined number of feature words are pre-stored in the second feature set;
a second identification unit adapted to process the second feature vector with a second feature model and to output a second result;
a first information acquisition unit adapted to extract first identity information of the webpage to be identified from the URL of the webpage to be identified;
a second information acquisition unit adapted to extract all external links of the webpage to be identified and to determine second identity information of the webpage to be identified from the external links;
a third identification unit adapted to compare the first identity information with the second identity information and to output a third result;
a weighting unit adapted to apply a weighting algorithm to the first result, the second result and the third result to obtain a final result; and
a fourth identification unit adapted to identify that the webpage to be identified contains malicious content if the final result is greater than a threshold, and to identify that the webpage to be identified does not contain malicious content if the final result is not greater than the threshold.
26. The system according to claim 25, wherein the weight factors corresponding to the first result, the second result and the third result are 0.4, 0.4 and 0.2 respectively; and
the threshold is 0.5.
27. The system according to claim 26, further comprising:
a judging and filtering unit adapted to judge whether the URL of the webpage to be identified matches a URL in a pre-stored database;
if the URL of the webpage to be identified is in a first pre-stored database, the webpage to be identified is judged to contain malicious content; and
if the URL of the webpage to be identified is in a second pre-stored database, the webpage to be identified is judged not to contain malicious content.
28. The system according to claim 27, wherein the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is valid, whether an IP address is included, number of designated characters in the URL, host string length, number of designated characters in the host string, length of the longest string in the host string, number of designated characters in the path, path name length, number of designated characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a designated character string.
29. The system according to claim 28, wherein the normalization subunit is configured to normalize each dimension of the feature vector to the range [-1, 1],
where $F_i$ is the feature value of the i-th dimension, $\bar{F}_i$ is the average of the i-th dimension feature values, $F_{i,\max}$ is the maximum of the i-th dimension feature values, and $F_{i,\min}$ is the minimum of the i-th dimension feature values.
30. The system according to claim 29, wherein
the URL extractor is further adapted to extract, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content and of webpages that have been labeled as containing malicious content;
the first feature extractor is further adapted to form a first feature set from these URLs and to generate the corresponding first feature vectors from the first feature set, as training parameters; and
the system further comprises a first training unit adapted to train on the training parameters with a machine learning algorithm to obtain the first feature model.
31. The system according to claim 30, further comprising:
a first updating unit adapted to update the sample data within a predetermined time, generate the first feature vectors of the new sample data, and input the updated first feature vectors into the first feature model for training, so that the first feature model is updated regularly.
32. The system according to claim 31, wherein
the first updating unit is further adapted to change the dimension of the first feature vector by adding or deleting features in the first feature set, so as to generate a new first feature vector.
33. The system according to claim 32, wherein
an output first result of 1 indicates that the webpage to be identified contains malicious content; and
an output first result of 0 indicates that the webpage to be identified does not contain malicious content.
34. The system according to any one of claims 30-33, wherein the machine learning algorithm is a support vector machine (SVM) method.
35. The system according to claim 34, wherein the page analyzer further comprises:
a word segmenter adapted to perform word segmentation on the webpage content using a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms and four disambiguation rules.
36. The system according to claim 35, wherein the second feature extractor comprises:
a matching subunit adapted to search the word sequence, for each feature word in the second feature set in turn, for that feature word;
if the feature word is matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 1;
if the feature word is not matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 0; and
the second feature extractor is further adapted to generate, from the values assigned to the positions of the feature words, a second feature vector whose dimension is the first predetermined number.
37. The system according to claim 36, wherein
the page analyzer is further adapted to obtain the webpage content of preset webpages and to perform word segmentation on the obtained webpage content to obtain a word sequence;
the system further comprising:
a feature set generation unit adapted to compute, for each word in the word sequence, a second feature value characterizing the importance of that word, and to select the first predetermined number of words as feature words according to the second feature values, forming the second feature set.
38. The system according to claim 37, wherein the second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word occurs, and the probability distribution of whether a webpage contains malicious content.
39. The system according to claim 38, wherein the second feature value is the expected cross entropy CE(w) of the word w,
where P(phish | w) is the probability that the webpage to be identified is a phishing webpage given that the word w occurs, P(phish) is the probability of a phishing webpage, P(nophish | w) is the probability that the webpage to be identified is not a phishing webpage given that the word w occurs, and P(nophish) is the probability of a non-phishing webpage.
40. The system according to claim 39, wherein
the feature set generation unit is configured to select the first predetermined number of words as feature words in descending order of the second feature value, forming the second feature set.
41. The system according to claim 40, wherein
the page analyzer is further adapted to crawl, as sample data, the webpage content of a large number of webpages that have been labeled as not containing malicious content and of webpages that have been labeled as containing malicious content;
the second feature extractor is further adapted to generate, according to the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters; and
the system further comprises a second training unit adapted to train on the training parameters with a machine learning algorithm to obtain the second feature model.
42. The system according to claim 41, further comprising:
a second updating unit adapted to update the sample data within a predetermined time and repeat the training step, so that the second feature model is updated regularly.
43. The system according to claim 42, wherein the first predetermined number is between 450 and 550.
44. The system according to claim 43, wherein
an output second result of 1 indicates that the webpage to be identified contains malicious content; and
an output second result of 0 indicates that the webpage to be identified does not contain malicious content.
45. The system according to claim 44, wherein the machine learning algorithm is a support vector machine (SVM) method.
46. system as claimed in claim 45, wherein
The first information acquiring unit is further adapted for parsing the URL of webpage to be identified, obtains the domain name, simultaneously of the webpage to be identified
And using domain name as the first identity information of the webpage to be identified.
47. system as claimed in claim 46, wherein second information acquisition unit further include:
Subelement is counted, all exterior chains suitable for counting the webpage to be identified extracted pick out existing number;And
Second information acquisition unit is further adapted for choosing the domain name of the most outer link of frequency of occurrence as the second identity information.
48. The system as claimed in claim 47, wherein the third recognition unit is adapted to:
If the second identity information is not consistent with the first identity information, output a third result of 1, indicating that the webpage to be identified includes hostile content; and
If the second identity information is consistent with the first identity information, output a third result of 0, indicating that the webpage to be identified does not include hostile content.
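The identity comparison recited in claims 46 to 48 can be illustrated with a minimal sketch. It is not the patented implementation. It rests on several assumptions that the claims do not state: the host name of a URL (with a leading "www." removed) stands in for the domain name, every hyperlink in an `<a>` tag of the page counts as an external link (including links back to the page's own domain), and a page with no links yields a third result of 0. The names `domain_of`, `LinkCollector` and `third_result` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def domain_of(url):
    """Return the lower-cased host name of a URL, or None if it has none.

    Assumption: the host name (minus a leading "www.") stands in for the
    domain name of claim 46.
    """
    host = urlparse(url).hostname
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def third_result(page_url, html):
    """Return 1 if the most frequently linked domain differs from the page's
    own domain, otherwise 0 (claims 47 and 48)."""
    # First identity information: domain name parsed from the page URL.
    first_identity = domain_of(page_url)

    # Count how often each linked domain occurs in the page.
    collector = LinkCollector()
    collector.feed(html)
    counts = Counter()
    for href in collector.hrefs:
        domain = domain_of(urljoin(page_url, href))  # resolve relative links
        if domain is not None:  # skips mailto:, javascript:, and similar
            counts[domain] += 1

    # Assumption: with no usable links there is nothing to compare.
    if not counts:
        return 0

    # Second identity information: the most frequently linked domain.
    second_identity = counts.most_common(1)[0][0]
    return 0 if second_identity == first_identity else 1
```

As a usage example, a page served from one domain whose links mostly point to a different domain produces a third result of 1, while the same markup served from the linked domain itself produces 0.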
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610313359.3A CN105956472B (en) | 2016-05-12 | 2016-05-12 | Identify webpage in whether include hostile content method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610313359.3A CN105956472B (en) | 2016-05-12 | 2016-05-12 | Identify webpage in whether include hostile content method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105956472A (en) | 2016-09-21 |
CN105956472B (en) | 2019-10-18 |
Family
ID=56912414
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610313359.3A Active CN105956472B (en) | 2016-05-12 | 2016-05-12 | Identify webpage in whether include hostile content method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105956472B (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107979560A (en) * | 2016-10-21 | 2018-05-01 | 北京计算机技术及应用研究所 | It is a kind of that attack defense method is applied based on Multiple detection |
CN106776958A (en) * | 2016-12-05 | 2017-05-31 | 公安部第三研究所 | Illegal website identifying system and its method based on critical path |
EP3593508A4 (en) | 2017-03-10 | 2020-02-26 | Visa International Service Association | Identifying malicious network devices |
CN107644162A (en) * | 2017-09-04 | 2018-01-30 | 北京知道未来信息技术有限公司 | A kind of Web attack recognitions method and apparatus |
CN107679401A (en) * | 2017-09-04 | 2018-02-09 | 北京知道未来信息技术有限公司 | A kind of malicious web pages recognition methods and device |
CN107992469A (en) * | 2017-10-13 | 2018-05-04 | 中国科学院信息工程研究所 | A kind of fishing URL detection methods and system based on word sequence |
CN107992741B (en) * | 2017-10-24 | 2020-08-28 | 阿里巴巴集团控股有限公司 | Model training method, URL detection method and device |
CN108881138B (en) * | 2017-10-26 | 2020-06-26 | 新华三信息安全技术有限公司 | Webpage request identification method and device |
CN107807987B (en) * | 2017-10-31 | 2021-07-02 | 广东工业大学 | Character string classification method and system and character string classification equipment |
CN108111478A (en) * | 2017-11-07 | 2018-06-01 | 中国互联网络信息中心 | A kind of phishing recognition methods and device based on semantic understanding |
CN107888616B (en) * | 2017-12-06 | 2020-06-05 | 北京知道创宇信息技术股份有限公司 | Construction method of classification model based on URI and detection method of Webshell attack website |
CN107896225A (en) * | 2017-12-08 | 2018-04-10 | 深信服科技股份有限公司 | Fishing website decision method, server and storage medium |
CN108718296A (en) * | 2018-04-27 | 2018-10-30 | 广州西麦科技股份有限公司 | Network management-control method, device and computer readable storage medium based on SDN network |
CN109104429B (en) * | 2018-09-05 | 2021-09-28 | 广东石油化工学院 | Detection method for phishing information |
CN110427755A (en) * | 2018-10-16 | 2019-11-08 | 新华三信息安全技术有限公司 | A kind of method and device identifying script file |
CN110365691B (en) * | 2019-07-22 | 2021-12-28 | 云南财经大学 | Phishing website distinguishing method and device based on deep learning |
CN110580408B (en) * | 2019-09-19 | 2022-03-11 | 北京天融信网络安全技术有限公司 | Data processing method and electronic equipment |
CN111222031A (en) * | 2019-11-22 | 2020-06-02 | 成都市映潮科技股份有限公司 | Website distinguishing method and system |
CN111091019B (en) * | 2019-12-23 | 2024-03-01 | 支付宝(杭州)信息技术有限公司 | Information prompting method, device and equipment |
CN111556036A (en) * | 2020-04-20 | 2020-08-18 | 杭州安恒信息技术股份有限公司 | Detection method, device and equipment for phishing attack |
CN114885334B (en) * | 2022-07-13 | 2022-09-27 | 安徽创瑞信息技术有限公司 | High-concurrency short message processing method |
CN116527373B (en) * | 2023-05-18 | 2023-10-20 | 清华大学 | Back door attack method and device for malicious URL detection system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101571934A (en) * | 2009-05-26 | 2009-11-04 | 北京航空航天大学 | Enterprise independent innovation ability prediction method based on support vector machine |
CN102708186A (en) * | 2012-05-11 | 2012-10-03 | 上海交通大学 | Identification method of phishing sites |
CN103336766A (en) * | 2013-07-04 | 2013-10-02 | 微梦创科网络科技(中国)有限公司 | Short text garbage identification and modeling method and device |
CN104965867A (en) * | 2015-06-08 | 2015-10-07 | 南京师范大学 | Text event classification method based on CHI feature selection |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101763431A (en) * | 2010-01-06 | 2010-06-30 | 电子科技大学 | PL clustering method based on massive network public sentiment information |
CN102880622A (en) * | 2011-07-15 | 2013-01-16 | 祁勇 | Method and system for determining user characteristics on internet |
CN102663000B (en) * | 2012-03-15 | 2016-08-03 | 北京百度网讯科技有限公司 | The maliciously recognition methods of the method for building up of network address database, maliciously network address and device |
CN103530367B (en) * | 2013-10-12 | 2017-07-18 | 深圳先进技术研究院 | A kind of fishing website identification system and method |
Non-Patent Citations (1)
Title |
---|
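Survey of research on malicious web page identification (恶意网页识别研究综述); Sha Hongzhou et al.; Chinese Journal of Computers (《计算机学报》); 2016-03-31; Sections 3.1 and 4.3 and Figure 4 * |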
Also Published As
Publication number | Publication date |
---|---|
CN105956472A (en) | 2016-09-21 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN105956472B (en) | Identify webpage in whether include hostile content method and system | |
Rodriguez et al. | Automatic detection of hate speech on facebook using sentiment and emotion analysis | |
WO2019085275A1 (en) | Character string classification method and system, and character string classification device | |
Ito et al. | Web application firewall using character-level convolutional neural network | |
CN112073551B (en) | DGA domain name detection system based on character-level sliding window and depth residual error network | |
CN109873810B (en) | Network fishing detection method based on goblet sea squirt group algorithm support vector machine | |
CN106375345B (en) | It is a kind of based on the Malware domain name detection method being periodically detected and system | |
CN111310476B (en) | Public opinion monitoring method and system using aspect-based emotion analysis method | |
CN104679825B (en) | Macroscopic abnormity of earthquake acquisition of information based on network text and screening technique | |
CN104156490A (en) | Method and device for detecting suspicious fishing webpage based on character recognition | |
CN110909531B (en) | Information security screening method, device, equipment and storage medium | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
CN111758098B (en) | Named entity identification and extraction using genetic programming | |
CN107341399A (en) | Assess the method and device of code file security | |
CN110572359A (en) | Phishing webpage detection method based on machine learning | |
CN112217787A (en) | Method and system for generating mock domain name training data based on ED-GAN | |
CN115757991A (en) | Webpage identification method and device, electronic equipment and storage medium | |
CN113438209B (en) | Phishing website detection method based on improved Stacking strategy | |
CN111460100A (en) | Criminal legal document and criminal name recommendation method and system | |
Mvula et al. | COVID-19 malicious domain names classification | |
Pham et al. | Exploring efficiency of GAN-based generated URLs for phishing URL detection | |
CN110704611B (en) | Illegal text recognition method and device based on feature de-interleaving | |
CN112966507A (en) | Method, device, equipment and storage medium for constructing recognition model and identifying attack | |
Rayyan et al. | Uniform resource locator classification using classical machine learning & deep learning techniques | |
CN116684144A (en) | Malicious domain name detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2020-01-22. Address after: 100094 west side of the first floor of Building 1, yard 68, Beiqing Road, Haidian District, Beijing. Patentee after: Quantum innovation (Beijing) Information Technology Co., Ltd. Address before: 100086, A, building 1, building 48, No. 3 West Third Ring Road, Haidian District, Beijing, 23E. Patentee before: Baoli Nine Chapters (Beijing) Data Technology Co., Ltd. |