CN107948168A - Page detection method and device - Google Patents
- Publication number
- CN107948168A (application number CN201711230916.6A)
- Authority
- CN
- China
- Prior art keywords
- link
- blacklist
- webpage
- database
- white list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/101—Access control lists [ACL]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1466—Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
Abstract
The present invention relates to the field of Internet technology, and in particular to a web page detection method and device applied to a terminal device that includes a database. The method includes: obtaining malicious link addresses from the Internet to update the blacklist links in the database; matching the link of a web page to be detected against the blacklist links and the whitelist links in the updated database, and marking the page as an unknown web page if it matches neither; generating a classifier from the blacklist links and the whitelist links; extracting script code fragments from the unknown web page and extracting a feature vector from them according to feature keywords; using the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page; and, when it is, marking its address as a blacklist address and saving it to the updated database. The method helps ensure accurate and efficient web page detection and updates the database in real time during detection.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a web page detection method and device.
Background
With the development of network technology, the security of personal information online is attracting more and more attention. A user's account and password are an important part of that information. In the prior art, illegal websites often use account-fraud web pages to trick users into revealing their accounts and passwords and then use them for unlawful activities, seriously harming users' rights and interests. Detecting web pages to determine whether they are account-fraud pages is therefore particularly important for network security.
At present, the common detection approaches are pattern-matching detection, heuristic analysis, and honeypot techniques. However, pattern matching cannot detect malicious web pages that contain obfuscated code or attacks of unknown type; heuristic analysis has relatively high false-positive and missed-detection rates; and honeypot techniques have a long detection cycle.
Summary of the invention
In view of this, an object of the present invention is to provide a web page detection method and device that help ensure accurate and efficient web page detection and update the database in real time during detection.
To achieve the above object, a preferred embodiment of the present invention provides a web page detection method applied to a terminal device that includes a database, the database storing blacklist links and whitelist links. The method includes:
obtaining malicious link addresses from the Internet to update the blacklist links in the database, yielding an updated database;
matching the link of a web page to be detected against the blacklist links and the whitelist links in the updated database; if it matches a blacklist link, marking the page as a malicious web page; if it matches a whitelist link, marking the page as a safe web page; and if it matches neither, marking the page as an unknown web page;
generating a classifier from the blacklist links and the whitelist links;
extracting script code fragments from the unknown web page, and extracting a feature vector from the script code fragments according to feature keywords;
using the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page, and, when it is, obtaining the address of the unknown web page, marking it as a blacklist address, and saving it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection method, the step of generating a classifier from the blacklist links and the whitelist links includes:
obtaining multiple blacklist links and multiple whitelist links, whose total number equals a set value, as an initial sample;
assigning the same weight to every code feature in the initial sample, and performing multiple rounds of SVM classification learning to generate an initial classifier;
obtaining multiple unlabeled links from the Internet as candidate samples;
classifying all unlabeled links in the candidate samples with the initial classifier and then selecting a preset number of unlabeled links, the selected links being those close to the classification hyperplane of the initial classifier;
receiving labeling instructions sent by a user for each of the preset number of unlabeled links, and labeling the corresponding unlabeled links according to the instructions to obtain labeled links, each labeled link being a blacklist link or a whitelist link;
adding the preset number of labeled links to the initial sample, again assigning the same weight to every code feature in the initial sample, and performing multiple rounds of SVM classification learning to obtain the classifier.
In a preferred embodiment of the present invention, in the above web page detection method, the step of matching the link of the web page to be detected against the blacklist links and the whitelist links in the updated database includes:
using a regular expression to extract the main domain name of the address of the web page to be detected, the main domain names of the blacklist addresses in the updated database, and the main domain names of the whitelist addresses;
computing an MD5 value to be detected from the main domain name of the address of the web page to be detected, computing blacklist MD5 values from the main domain names of the blacklist addresses in the updated database, and computing whitelist MD5 values from the main domain names of the whitelist links;
matching the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values.
In a preferred embodiment of the present invention, in the above web page detection method, malicious code features are stored in the database in advance, and after the link of the web page to be detected has been matched against the blacklist links and the whitelist links in the updated database, the method further includes:
matching the source code of the unknown web page against the malicious code features and, if they match, marking the unknown web page as a malicious web page, marking its address as a blacklist address, and saving it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection method, the step of extracting a feature vector from the script code fragments according to feature keywords includes:
extracting the feature vector from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy value.
The present invention also provides a web page detection device applied to a terminal device that includes a database. The device includes:
an update module, configured to obtain malicious link addresses from the Internet to update the blacklist links in the database, yielding an updated database;
a first detection module, configured to match the link of a web page to be detected against the blacklist links and the whitelist links in the updated database, to mark the page as a malicious web page if it matches a blacklist link, as a safe web page if it matches a whitelist link, and as an unknown web page if it matches neither;
a classifier generation module, configured to generate a classifier from the blacklist links and the whitelist links;
an extraction module, configured to extract script code fragments from the unknown web page and to extract a feature vector from the script code fragments according to feature keywords;
a second detection module, configured to use the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page, and, when it is, to obtain the address of the unknown web page, mark it as a blacklist address, and save it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection device, the classifier generation module includes:
a first acquisition submodule, configured to obtain multiple blacklist links and multiple whitelist links, whose total number equals a set value, as an initial sample;
a first generation submodule, configured to assign the same weight to every code feature in the initial sample and to perform multiple rounds of SVM classification learning to generate an initial classifier;
a second acquisition submodule, configured to obtain multiple unlabeled links from the Internet as candidate samples;
a classification submodule, configured to classify all unlabeled links in the candidate samples with the initial classifier and then select a preset number of unlabeled links, the selected links being those close to the classification hyperplane of the initial classifier;
a receiving submodule, configured to receive labeling instructions sent by a user for each of the preset number of unlabeled links and to label the corresponding unlabeled links according to the instructions to obtain labeled links, each labeled link being a blacklist link or a whitelist link;
a second generation submodule, configured to add the preset number of labeled links to the initial sample, again assign the same weight to every code feature in the initial sample, and perform multiple rounds of SVM classification learning to obtain the classifier.
In a preferred embodiment of the present invention, in the above web page detection device, the first detection module includes:
a main domain name acquisition submodule, configured to use a regular expression to extract the main domain name of the address of the web page to be detected, the main domain names of the blacklist addresses in the updated database, and the main domain names of the whitelist addresses;
a detection submodule, configured to compute an MD5 value to be detected from the main domain name of the address of the web page to be detected, to compute blacklist MD5 values from the main domain names of the blacklist addresses in the updated database, and to compute whitelist MD5 values from the main domain names of the whitelist links;
a matching submodule, configured to match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values.
In a preferred embodiment of the present invention, in the above web page detection device, malicious code features are stored in the database in advance, and the device further includes:
a third detection module, configured to match the source code of the unknown web page against the malicious code features and, if they match, to mark the unknown web page as a malicious web page, mark its address as a blacklist address, and save it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection device, the extraction module is further configured to extract the feature vector from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy value.
The web page detection method and device provided by the present invention are applied to a terminal device that includes a database. The method includes: obtaining malicious link addresses from the Internet to update the blacklist links in the database; matching the link of a web page to be detected against the blacklist links and the whitelist links in the updated database, and marking the page as an unknown web page if it matches neither; generating a classifier from the blacklist links and the whitelist links; extracting script code fragments from the unknown web page and extracting a feature vector from them according to feature keywords; using the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page; and, when it is, marking its address as a blacklist address and saving it to the updated database. The method helps ensure accurate and efficient web page detection and updates the database in real time during detection.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a block diagram of a terminal device provided by an embodiment of the present invention.
Fig. 2 is a flow chart of a web page detection method provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of the sub-steps of step S120 in Fig. 2.
Fig. 4 is a schematic diagram of the sub-steps of step S130 in Fig. 2.
Fig. 5 is a block diagram of a web page detection device provided by an embodiment of the present invention.
Fig. 6 is a block diagram of the first detection module provided by an embodiment of the present invention.
Fig. 7 is a block diagram of the classifier generation module provided by an embodiment of the present invention.
Reference numerals: 10 - terminal device; 12 - memory; 14 - processor; 100 - web page detection device; 110 - update module; 120 - first detection module; 121 - main domain name acquisition submodule; 123 - detection submodule; 125 - matching submodule; 130 - classifier generation module; 131 - first acquisition submodule; 132 - first generation submodule; 133 - second acquisition submodule; 134 - classification submodule; 135 - receiving submodule; 136 - second generation submodule; 140 - extraction module; 150 - second detection module.
Detailed description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments, as generally described and illustrated in the drawings, can be arranged and designed in a variety of configurations.
Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item has been defined in one drawing, it need not be defined or explained again in subsequent drawings.
Fig. 1 is a block diagram of the terminal device 10 provided by a preferred embodiment of the present invention. The terminal device 10 may be a server, a computer, or another device with data-processing capability, and it includes a database or is associated with one. As shown in Fig. 1, the terminal device 10 includes a memory 12 and a processor 14.
The memory 12 and the processor 14 are electrically connected to each other, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected through one or more communication buses or signal lines. The memory 12 stores software function modules in the form of software or firmware. By running the software programs and modules stored in the memory 12, such as the web page detection device 100 in the embodiment of the present invention, the processor 14 performs various functional applications and data processing, that is, implements the web page detection method of the embodiment.
The memory 12 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The memory 12 is used to store a program, and the processor 14 executes the program after receiving an execution instruction.
The processor 14 may be an integrated circuit chip with signal-processing capability. It may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor.
It can be understood that the structure shown in Fig. 1 is only illustrative; the terminal device 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software, or a combination of the two.
Referring to Fig. 2, the present invention provides a web page detection method applied to the above terminal device 10. The terminal device 10 includes a database in which blacklist links and whitelist links are stored. The method includes five steps, S110 to S150.
Step S110: Obtain malicious link addresses from the Internet to update the blacklist links in the database, yielding an updated database.
The malicious link addresses may be those obtained from the Internet within a preset time period.
The addresses may be obtained in several ways. One way is to start from a given initial link address and obtain the page source of that address by page crawling. The script code fragments in the page source are extracted and added to a script code fragment set. The link addresses in the page source are then extracted: if an extracted link address is identical to the address of an already crawled page stored in the database, it is ignored; otherwise it is placed in the queue of link addresses to be crawled, where it waits for the next crawling operation. Another way is to use a regular expression to extract the main domain names of all malicious link addresses from the page source of recently published malicious-website lists on the Internet and compute their MD5 values, then search the blacklist links in the database for an identical link address. If one exists, the address is treated as a duplicate malicious link address and ignored; otherwise it is treated as a new malicious link and saved to the database. The crawling submodule can also be implemented with Scrapy, a web crawling framework written in Python for fast, high-level crawling of websites and extraction of the relevant data from the crawled pages. No specific limitation is imposed here; the approach is chosen according to actual requirements.
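The first acquisition approach described above (crawl from a seed address, collect script fragments, skip links that were already crawled, queue the rest) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `crawl`, the caller-supplied `fetch` callable, the `max_pages` limit, and both regular expressions are hypothetical and do not come from the patent, and an in-memory set stands in for the crawled-address records kept in the database.

```python
import re
from collections import deque

# Hypothetical patterns: inline <script> bodies and absolute http(s) links.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)


def crawl(seed_url, fetch, crawled=None, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch(url) is supplied by the caller and returns the page source,
    or None if the page cannot be retrieved. Returns the collected
    script fragments and the set of crawled addresses.
    """
    crawled = set() if crawled is None else crawled
    queue = deque([seed_url])          # link addresses waiting to be crawled
    script_fragments = []              # the "script code fragment set"
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        if url in crawled:             # already crawled: ignore
            continue
        crawled.add(url)
        source = fetch(url)
        if source is None:
            continue
        # Keep only non-empty inline script bodies.
        script_fragments.extend(
            s.strip() for s in SCRIPT_RE.findall(source) if s.strip()
        )
        for link in HREF_RE.findall(source):
            if link not in crawled:    # new address: queue it
                queue.append(link)
    return script_fragments, crawled
```

Passing `fetch` in as a parameter keeps the sketch free of network code; in practice it could wrap an HTTP client or be replaced entirely by a Scrapy spider.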
Step S120: Match the link of the web page to be detected against the blacklist links and the whitelist links in the updated database. If it matches a blacklist link, mark the page as a malicious web page; if it matches a whitelist link, mark the page as a safe web page; if it matches neither, mark the page as an unknown web page.
Specifically, with reference to Fig. 3, in this embodiment the matching of the link of the web page to be detected against the blacklist links and the whitelist links in the updated database optionally includes the following sub-steps:
S121: Use a regular expression to extract the main domain name of the address of the web page to be detected, the main domain names of the blacklist addresses in the updated database, and the main domain names of the whitelist addresses.
S123: Compute an MD5 value to be detected from the main domain name of the address of the web page to be detected, compute blacklist MD5 values from the main domain names of the blacklist addresses in the updated database, and compute whitelist MD5 values from the main domain names of the whitelist links.
S125: Match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values.
The above arrangement provides a fast preliminary check of the web page to be detected, which is marked according to the result. Note that a page marked as malicious or safe is not examined further, whereas a page marked as unknown proceeds to the next detection step.
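Sub-steps S121 to S125 can be illustrated with a short sketch. It rests on simplifying assumptions: the "main domain name" is taken naively as the last two labels of the host (which mishandles multi-part suffixes such as `.com.cn`), and the names `main_domain`, `md5_hex`, and `match_lists`, as well as the regular expression, are hypothetical rather than taken from the patent.

```python
import hashlib
import re

# Hypothetical pattern: optional http(s) scheme, then the host up to
# the first '/', ':', '?' or '#'.
HOST_RE = re.compile(r"^(?:https?://)?([^/:?#]+)", re.IGNORECASE)


def main_domain(url):
    """Return a naive main domain (last two host labels), or None."""
    match = HOST_RE.match(url.strip())
    if match is None:
        return None
    labels = match.group(1).lower().split(".")
    return ".".join(labels[-2:])


def md5_hex(text):
    """MD5 digest of a string as lowercase hexadecimal."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def match_lists(url, blacklist_md5, whitelist_md5):
    """Label a page 'malicious', 'safe', or 'unknown' by MD5 lookup."""
    domain = main_domain(url)
    if domain is None:
        return "unknown"
    digest = md5_hex(domain)
    if digest in blacklist_md5:      # blacklist is checked first
        return "malicious"
    if digest in whitelist_md5:
        return "safe"
    return "unknown"
```

Storing the MD5 values in sets makes each lookup a constant-time membership test, which fits the role of this step as a fast preliminary check; MD5 is used here only as a fixed-length lookup key, not for any security property.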
Step S130: Generate a classifier from the blacklist links and the whitelist links.
With reference to Fig. 4, in this embodiment the step of generating a classifier from the blacklist addresses and the whitelist addresses optionally proceeds as follows:
S131: Obtain multiple blacklist links and multiple whitelist links, whose total number equals a set value, as an initial sample.
The set value may be, but is not limited to, 10, 20, 50, or 100 and is chosen according to actual requirements. The number of blacklist links may be the same as or different from the number of whitelist links; no specific limitation is imposed here.
S132: Assign the same weight to every code feature in the initial sample, and perform multiple rounds of SVM classification learning to generate an initial classifier.
For example, when the set value is n, each feature sample is given the same weight 1/n. The i-th (i = 1, 2, 3, ...) round of SVM classification learning is then performed, the samples in the initial sample that lie within the margin between the supporting planes are identified, and their weights are made larger than those of the samples outside the supporting planes, so that the initial classifier is generated from the initial sample.
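The weighting scheme in the example for S132 can be sketched as below. The patent states only that samples inside the margin receive larger weights than samples outside it; the multiplicative `boost` factor, the renormalisation, and the margin test `|f(x)| < 1` (the usual convention when the supporting planes are at f(x) = ±1) are assumptions made for illustration. The decision values are taken as given, so the SVM training itself is not shown.

```python
def initial_weights(n):
    """Give each of the n samples the same starting weight 1/n."""
    return [1.0 / n] * n


def reweight(weights, decision_values, boost=2.0):
    """Raise the weights of samples that fall inside the SVM margin.

    decision_values[i] is the current classifier's output f(x_i);
    a sample is treated as inside the margin when |f(x_i)| < 1.
    The boost factor is a hypothetical choice. Weights are
    renormalised so that they sum to 1.
    """
    raised = [
        w * boost if abs(f) < 1.0 else w
        for w, f in zip(weights, decision_values)
    ]
    total = sum(raised)
    return [w / total for w in raised]
```

In each round the updated weights would be passed back to the SVM trainer as per-sample weights, so that the next round concentrates on the samples nearest the decision boundary.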
S133: Obtain multiple unlabeled links from the Internet as candidate samples.
The number of unlabeled links may be, but is not limited to, 20, 40, 100, or all of the candidate samples; no specific limitation is imposed here, and the number is chosen according to actual requirements.
S134: Classify all unlabeled links in the candidate samples with the initial classifier and then select a preset number of unlabeled links, the selected links being those close to the classification hyperplane of the initial classifier.
S135: Receive labeling instructions sent by a user for each of the preset number of unlabeled links, and label the corresponding unlabeled links according to the instructions to obtain labeled links, each labeled link being a blacklist link or a whitelist link.
S136: Add the preset number of labeled links to the initial sample, again assign the same weight to every code feature in the initial sample, and perform multiple rounds of SVM classification learning to obtain the classifier.
With the above arrangement, using this classifier when detecting the unknown web page makes the detection result more accurate.
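The selection in sub-step S134 amounts to picking the candidates the initial classifier is least certain about. A minimal sketch, assuming a linear classifier with weight vector `weights` and offset `bias` and candidates already converted to numeric feature vectors (the function names and the dictionary layout are hypothetical):

```python
def decision_value(weights, bias, features):
    """Linear decision function f(x) = w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def select_uncertain(candidates, weights, bias, k):
    """Return the k links whose feature vectors lie closest to the
    classification hyperplane w . x + b = 0.

    candidates maps each unlabeled link to its feature vector.
    """
    norm = sum(w * w for w in weights) ** 0.5
    if norm == 0:
        raise ValueError("weight vector must be non-zero")
    ranked = sorted(
        candidates.items(),
        key=lambda item: abs(decision_value(weights, bias, item[1])) / norm,
    )
    return [link for link, _ in ranked[:k]]
```

The returned links are the ones that would be shown to the user for labeling in S135 and then added to the initial sample in S136 before retraining.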
Step S140: Extract script code fragments from the unknown web page, and extract a feature vector from the script code fragments according to feature keywords.
In this embodiment, the feature vector may be extracted from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy value.
The feature keywords may include, but are not limited to, iframe-based Trojan embedding, JavaScript encryption and obfuscation, DOM properties or methods, JavaScript execution functions, string manipulation functions, CLSID identifiers, space characters, special strings, and the exe suffix. The string length refers to the very long strings that attackers add to malicious web pages as interference; such strings seldom appear in normal web pages, so the length of the longest string can serve as a detection feature. Information entropy is often used as a feature in machine learning: it can be computed separately for malicious and normal web page samples, and because the entropy distributions of the two differ considerably, the entropy value can also serve as a classification feature for the machine learning algorithm.
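The three feature groups named above (keyword counts, longest string length, information entropy) can be computed along the following lines. The keyword list and the string-literal pattern are hypothetical stand-ins, since the patent names keyword categories rather than concrete tokens, and the entropy is the standard Shannon entropy over characters, which is one plausible reading of "information entropy value".

```python
import math
import re
from collections import Counter

# Hypothetical keyword tokens, one per category for illustration only.
KEYWORDS = ("eval", "unescape", "document.write", "iframe", "fromCharCode", ".exe")

# Simplified pattern for quoted string literals (no escape handling).
STRING_RE = re.compile(r"""["']([^"']*)["']""")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def longest_string_literal(script):
    """Length of the longest quoted string in the script, 0 if none."""
    return max((len(s) for s in STRING_RE.findall(script)), default=0)


def feature_vector(script):
    """Keyword counts, then longest string length, then entropy."""
    counts = [script.count(keyword) for keyword in KEYWORDS]
    return counts + [longest_string_literal(script), shannon_entropy(script)]
```

A vector built this way has a fixed length (one entry per keyword plus two), so it can be fed directly to the classifier generated in step S130.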
Step S150: Use the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page, and, when it is, obtain the address of the unknown web page, mark it as a blacklist address, and save it to the updated database.
The number of occurrences of the feature keywords, the string length, and the information entropy value of the web page to be detected are each fed into the classifier built by active learning for comparison, and the final class is tallied, giving either a malicious website or a normal website; the result is stored in the database.
The above method helps ensure accurate and efficient web page detection and updates the database in real time during detection.
To make web page detection more efficient, optionally, in this embodiment malicious code features are stored in the database in advance, and after the link of the web page to be detected has been matched against the blacklist links and the whitelist links in the updated database, the method further includes:
matching the source code of the unknown web page against the malicious code features and, if they match, marking the unknown web page as a malicious web page, marking its address as a blacklist address, and saving it to the updated database.
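This optional signature check can be sketched as a scan of the page source against stored patterns. The two signatures below are invented examples (a zero-sized iframe and an `eval(unescape(` chain), not features listed in the patent, and an in-memory set stands in for the blacklist table of the database.

```python
import re

# Hypothetical malicious code features, stored here as compiled patterns.
SIGNATURES = [
    re.compile(r"<iframe[^>]*(?:width|height)\s*=\s*[\"']?0", re.IGNORECASE),
    re.compile(r"eval\s*\(\s*unescape\s*\(", re.IGNORECASE),
]


def matches_malicious_signature(source, signatures=SIGNATURES):
    """True if any stored signature occurs in the page source."""
    return any(sig.search(source) for sig in signatures)


def check_unknown_page(url, source, blacklist):
    """Mark an unknown page as malicious on a signature hit and
    record its address in the blacklist; otherwise leave it unknown."""
    if matches_malicious_signature(source):
        blacklist.add(url)
        return "malicious"
    return "unknown"
```

Pages that stay "unknown" after this check would still go on to feature extraction and classification in steps S140 and S150.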
Referring to Fig. 5, on the basis of the above, the present invention also provides a kind of webpage detection device 100, the webpage detection dress
Putting 100 includes:Update module 110, first detection module 120, grader generation module 130, extraction module 140 and second are examined
Survey module 150.
The update module 110 is used to obtain the malicious link address on internet to update the black name in the database
Singular link updated after database.Specifically, the update module 110 can be used for performing the step S110 shown in Fig. 2,
Specific operating method refers to the detailed description of step S110.
The first detection module 120 is used in the database after the web page interlinkage of webpage to be detected and the renewal
Blacklist link and white list link is matched respectively, if link with blacklist match if by the Web Page Tags to be detected be evil
Anticipate webpage, if link with white list match if by the Web Page Tags to be detected be secure web-page, be labeled as not if all mismatching
Hownet page.Specifically, the first detection module 120 can be used for performing the step S120 shown in Fig. 2, specific operating method
Refer to the detailed description of step S120.
With reference to Fig. 6, optionally, in this embodiment, the first detection module 120 includes a main domain name acquisition submodule 121, a detection submodule 123, and a matching submodule 125.
The main domain name acquisition submodule 121 is configured to use a regular expression to extract the main domain name of the address of the webpage to be detected, the main domain names of the blacklist addresses in the updated database, and the main domain names of the whitelist addresses. Specifically, the main domain name acquisition submodule 121 may be used to perform sub-step S121 shown in Fig. 3; for the specific operation, refer to the detailed description of sub-step S121.
The detection submodule 123 is configured to obtain an MD5 value to be detected from the main domain name of the address of the webpage to be detected, to obtain blacklist MD5 values from the main domain names of the blacklist addresses in the updated database, and to obtain whitelist MD5 values from the main domain names of the whitelist links. Specifically, the detection submodule 123 may be used to perform sub-step S123 shown in Fig. 3; for the specific operation, refer to the detailed description of sub-step S123.
The matching submodule 125 is configured to match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values, respectively. Specifically, the matching submodule 125 may be used to perform sub-step S125 shown in Fig. 3; for the specific operation, refer to the detailed description of sub-step S125.
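The three submodules above (regular-expression extraction of the main domain name, MD5 hashing, and list matching) can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not give the regular expression or define "main domain name" precisely, so `HOST_RE`, the "last two labels" rule, and the function names are hypothetical. MD5 serves here purely as a fixed-length lookup key, as in the source.

```python
import hashlib
import re

# Simplified pattern (an assumption): captures the host part of a URL,
# skipping an optional http(s) scheme and optional user-info.
HOST_RE = re.compile(r"^(?:https?://)?(?:[^@/]*@)?([^/:?#]+)", re.IGNORECASE)


def main_domain(url):
    """Extract a main domain name such as 'example.com' from a URL."""
    m = HOST_RE.match(url.strip())
    if not m:
        return ""
    host = m.group(1).lower()
    labels = host.split(".")
    # Naive rule: keep the last two labels. Multi-part suffixes such as
    # 'co.uk' would need a public suffix list, which is omitted here.
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def domain_md5(url):
    """MD5 hex digest of the URL's main domain name."""
    return hashlib.md5(main_domain(url).encode("utf-8")).hexdigest()


def classify_link(url, blacklist_md5, whitelist_md5):
    """Return 'malicious', 'safe', or 'unknown' by comparing MD5 values."""
    digest = domain_md5(url)
    if digest in blacklist_md5:
        return "malicious"
    if digest in whitelist_md5:
        return "safe"
    return "unknown"
```

Storing the blacklist and whitelist digests in sets keeps each lookup constant-time regardless of how many links the database holds.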
The classifier generation module 130 is configured to generate a classifier from the blacklist links and the whitelist links. Specifically, the classifier generation module 130 may be used to perform step S130 shown in Fig. 2; for the specific operation, refer to the detailed description of step S130.
With reference to Fig. 7, optionally, in this embodiment, the classifier generation module 130 includes a first acquisition submodule 131, a first generation submodule 132, a second acquisition submodule 133, a classification submodule 134, a receiving submodule 135, and a second generation submodule 136.
The first acquisition submodule 131 is configured to obtain multiple blacklist links and multiple whitelist links, whose total number equals a set value, as the initial samples. Specifically, the first acquisition submodule 131 may be used to perform sub-step S131 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S131.
The first generation submodule 132 is configured to assign the same weight to each code feature in the initial samples and to perform SVM classification learning multiple times to generate an initial classifier. Specifically, the first generation submodule 132 may be used to perform sub-step S132 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S132.
The second acquisition submodule 133 is configured to obtain multiple unlabeled links from the Internet as candidate samples. Specifically, the second acquisition submodule 133 may be used to perform sub-step S133 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S133.
The classification submodule 134 is configured to select a preset number of unlabeled links after the initial classifier has classified all unlabeled links in the candidate samples, where the selected unlabeled links are those close to the classification hyperplane of the initial classifier. Specifically, the classification submodule 134 may be used to perform sub-step S134 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S134.
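Selecting the unlabeled links nearest the classification hyperplane is the sample-selection step of margin-based active learning. The sketch below assumes a linear classifier whose weight vector `w` and bias `b` have already been learned; the patent does not state the kernel or how closeness is measured, so the Euclidean distance |w·x + b| / ‖w‖ and all names here are assumptions.

```python
import math


def distance_to_hyperplane(x, w, b):
    """Euclidean distance from feature vector x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("weight vector must be non-zero")
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_uncertain(candidates, w, b, k):
    """Pick the k candidate links closest to the hyperplane.

    `candidates` is a list of (url, feature_vector) pairs. The returned
    URLs are the ones a user would be asked to label.
    """
    ranked = sorted(
        candidates,
        key=lambda item: distance_to_hyperplane(item[1], w, b),
    )
    return [url for url, _ in ranked[:k]]
```

Links near the hyperplane are the ones the initial classifier is least certain about, so labeling them tends to improve the retrained classifier more than labeling links far from the boundary.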
The receiving submodule 135 is configured to receive the labeling instructions sent by a user for each of the preset number of unlabeled links, and to label the corresponding unlabeled link according to each labeling instruction to obtain a labeled link, each labeled link being either a blacklist link or a whitelist link. Specifically, the receiving submodule 135 may be used to perform sub-step S135 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S135.
The second generation submodule 136 is configured to add the preset number of labeled links to the initial samples, again assign the same weight to each code feature in the initial samples, and perform SVM classification learning multiple times to obtain the classifier. Specifically, the second generation submodule 136 may be used to perform sub-step S136 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S136.
The extraction module 140 is configured to extract script code fragments from the unknown webpage and to extract a feature vector from the script code fragments according to feature keywords. Specifically, the extraction module 140 may be used to perform step S140 shown in Fig. 2; for the specific operation, refer to the detailed description of step S140.
Optionally, in this embodiment, the extraction module 140 is further configured to extract the feature vector from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy.
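A feature vector built from keyword counts, string length, and information entropy might look like the following Python sketch. Several details are assumptions not given in the source: the keyword list `FEATURE_KEYWORDS` is hypothetical, "string length" is read as the length of the longest quoted string literal, and the entropy is the Shannon entropy of the fragment's character distribution, H = -Σ p(c) log2 p(c).

```python
import math
import re
from collections import Counter

# Hypothetical keyword list: the patent does not enumerate its feature keywords.
FEATURE_KEYWORDS = ("eval", "unescape", "document.write", "fromCharCode")

# Naive string-literal matcher: single- or double-quoted, escapes ignored.
STRING_LITERAL_RE = re.compile(r"'[^']*'" + "|" + r'"[^"]*"')


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def extract_features(script, keywords=FEATURE_KEYWORDS):
    """Return [count per keyword..., longest string literal length, entropy]."""
    keyword_counts = [script.count(kw) for kw in keywords]
    literals = STRING_LITERAL_RE.findall(script)
    # Subtract 2 so the surrounding quotes are not counted.
    longest = max((len(s) - 2 for s in literals), default=0)
    return keyword_counts + [longest, shannon_entropy(script)]
```

Heavily obfuscated scripts often carry long encoded string literals and unusual character distributions, which is presumably why length and entropy accompany the keyword counts.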
The second detection module 150 is configured to use the classifier to perform a classification judgment on the feature vector and a preset vector, thereby determining whether the unknown webpage is a malicious webpage, and, when the unknown webpage is a malicious webpage, to obtain the URL of the unknown webpage, mark it as a blacklist URL, and save it to the updated database. Specifically, the second detection module 150 may be used to perform step S150 shown in Fig. 2; for the specific operation, refer to the detailed description of step S150.
Optionally, malicious code features are pre-stored in the database, and the webpage detection device 100 further includes a third detection module. The third detection module is configured to match the source code of the unknown webpage against the malicious code features and, if a match is found, to mark the unknown webpage as a malicious webpage, mark the URL of the unknown webpage as a blacklist URL, and save that URL to the updated database.
In summary, the present invention provides a webpage detection method and device applied to a terminal device 10 that includes a database. The method includes: obtaining malicious link addresses from the Internet to update the blacklist links in the database; matching the web page link of the webpage to be detected against the blacklist links and the whitelist links in the updated database, and marking the webpage as an unknown webpage if it matches neither; generating a classifier from the blacklist links and the whitelist links; extracting script code fragments from the unknown webpage and extracting a feature vector from them according to feature keywords; using the classifier to perform a classification judgment on the feature vector and a preset vector to determine whether the unknown webpage is a malicious webpage; and, when it is, marking its URL as a blacklist URL and saving it to the updated database. This method effectively ensures the accuracy and efficiency of webpage detection and updates the database in real time as webpages are detected.
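The overall flow summarized above can be sketched as a single Python function. This is an illustration only: the `extract` and `classify` callables are placeholders for the feature extraction and the trained classifier, the sets stand in for the database, and the "benign" label for unknown pages the classifier does not flag is an assumption, since the source does not name that outcome.

```python
def detect_page(url, script, blacklist, whitelist, extract, classify):
    """Two-stage detection: list lookup first, classifier for unknown pages.

    blacklist / whitelist: sets of known links (stand-ins for the database).
    extract:  callable mapping script code to a feature vector (placeholder).
    classify: callable returning True when a feature vector is judged
              malicious (placeholder for the trained classifier).
    """
    if url in blacklist:
        return "malicious"
    if url in whitelist:
        return "safe"
    # Neither list matched, so the page is unknown and goes to the classifier.
    if classify(extract(script)):
        blacklist.add(url)  # the database is updated during detection
        return "malicious"
    return "benign"  # label chosen for this sketch
```

Because newly detected malicious URLs are written back to the blacklist, later requests for the same link are resolved by the cheap lookup stage rather than the classifier.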
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may also be implemented in other ways. The device and method embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architectures, functions, and operations of devices, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. It should further be noted that each block in the block diagrams and/or flowcharts, and any combination of such blocks, may be implemented by a dedicated hardware-based device that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. In addition, the functional modules in the embodiments of the present invention may be integrated to form a single independent part, each module may exist separately, or two or more modules may be integrated to form a single independent part.
If the functions are implemented as software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
It should be noted that, in this document, the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may be subject to various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A webpage detection method, applied to a terminal device including a database, characterized in that blacklist links and whitelist links are stored in the database, the method comprising:
obtaining malicious link addresses from the Internet to update the blacklist links in the database, thereby obtaining an updated database;
matching the web page link of a webpage to be detected against the blacklist links and the whitelist links in the updated database respectively, marking the webpage to be detected as a malicious webpage if it matches a blacklist link, marking the webpage to be detected as a safe webpage if it matches a whitelist link, and marking it as an unknown webpage if it matches neither;
generating a classifier according to the blacklist links and the whitelist links;
extracting script code fragments from the unknown webpage, and extracting a feature vector from the script code fragments according to feature keywords;
using the classifier to perform a classification judgment on the feature vector and a preset vector to determine whether the unknown webpage is a malicious webpage, and, when the unknown webpage is a malicious webpage, obtaining the URL of the unknown webpage, marking it as a blacklist URL, and saving it to the updated database.
2. The webpage detection method according to claim 1, characterized in that the step of generating a classifier according to the blacklist links and the whitelist links comprises:
obtaining multiple blacklist links and multiple whitelist links, whose total number equals a set value, as initial samples;
assigning the same weight to each code feature in the initial samples, and performing SVM classification learning multiple times to generate an initial classifier;
obtaining multiple unlabeled links from the Internet as candidate samples;
selecting a preset number of unlabeled links after the initial classifier has classified all unlabeled links in the candidate samples, wherein the preset number of unlabeled links are close to the classification hyperplane of the initial classifier;
receiving the labeling instructions sent by a user for each of the preset number of unlabeled links, and labeling the corresponding unlabeled link according to each labeling instruction to obtain a labeled link, each labeled link being a blacklist link or a whitelist link;
adding the preset number of labeled links to the initial samples, again assigning the same weight to each code feature in the initial samples, and performing SVM classification learning multiple times to obtain the classifier.
3. The webpage detection method according to claim 1, characterized in that the step of matching the web page link of the webpage to be detected against the blacklist links and the whitelist links in the updated database respectively comprises:
using a regular expression to extract the main domain name of the address of the webpage to be detected, the main domain names of the blacklist addresses in the updated database, and the main domain names of the whitelist addresses;
obtaining an MD5 value to be detected from the main domain name of the address of the webpage to be detected, obtaining blacklist MD5 values from the main domain names of the blacklist addresses in the updated database, and obtaining whitelist MD5 values from the main domain names of the whitelist links;
matching the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values respectively.
4. The webpage detection method according to claim 1, characterized in that malicious code features are pre-stored in the database, and after the web page link of the webpage to be detected has been matched against the blacklist links and the whitelist links in the updated database respectively, the method further comprises:
matching the source code of the unknown webpage against the malicious code features, and, if a match is found, marking the unknown webpage as a malicious webpage, marking the URL of the unknown webpage as a blacklist URL, and saving that URL to the updated database.
5. The webpage detection method according to claim 1, characterized in that the step of extracting a feature vector from the script code fragments according to feature keywords comprises:
extracting the feature vector from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy.
6. A webpage detection device, characterized in that it is applied to a terminal device including a database, the device comprising:
an update module, configured to obtain malicious link addresses from the Internet to update the blacklist links in the database, thereby obtaining an updated database;
a first detection module, configured to match the web page link of a webpage to be detected against the blacklist links and the whitelist links in the updated database respectively, to mark the webpage to be detected as a malicious webpage if it matches a blacklist link, to mark the webpage to be detected as a safe webpage if it matches a whitelist link, and to mark it as an unknown webpage if it matches neither;
a classifier generation module, configured to generate a classifier according to the blacklist links and the whitelist links;
an extraction module, configured to extract script code fragments from the unknown webpage and to extract a feature vector from the script code fragments according to feature keywords;
a second detection module, configured to use the classifier to perform a classification judgment on the feature vector and a preset vector to determine whether the unknown webpage is a malicious webpage, and, when the unknown webpage is a malicious webpage, to obtain the URL of the unknown webpage, mark it as a blacklist URL, and save it to the updated database.
7. The webpage detection device according to claim 6, characterized in that the classifier generation module comprises:
a first acquisition submodule, configured to obtain multiple blacklist links and multiple whitelist links, whose total number equals a set value, as initial samples;
a first generation submodule, configured to assign the same weight to each code feature in the initial samples and to perform SVM classification learning multiple times to generate an initial classifier;
a second acquisition submodule, configured to obtain multiple unlabeled links from the Internet as candidate samples;
a classification submodule, configured to select a preset number of unlabeled links after the initial classifier has classified all unlabeled links in the candidate samples, wherein the preset number of unlabeled links are close to the classification hyperplane of the initial classifier;
a receiving submodule, configured to receive the labeling instructions sent by a user for each of the preset number of unlabeled links, and to label the corresponding unlabeled link according to each labeling instruction to obtain a labeled link, each labeled link being a blacklist link or a whitelist link;
a second generation submodule, configured to add the preset number of labeled links to the initial samples, again assign the same weight to each code feature in the initial samples, and perform SVM classification learning multiple times to obtain the classifier.
8. The webpage detection device according to claim 6, characterized in that the first detection module comprises:
a main domain name acquisition submodule, configured to use a regular expression to extract the main domain name of the address of the webpage to be detected, the main domain names of the blacklist addresses in the updated database, and the main domain names of the whitelist addresses;
a detection submodule, configured to obtain an MD5 value to be detected from the main domain name of the address of the webpage to be detected, to obtain blacklist MD5 values from the main domain names of the blacklist addresses in the updated database, and to obtain whitelist MD5 values from the main domain names of the whitelist links;
a matching submodule, configured to match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values respectively.
9. The webpage detection device according to claim 6, characterized in that malicious code features are pre-stored in the database, and the device further comprises:
a third detection module, configured to match the source code of the unknown webpage against the malicious code features and, if a match is found, to mark the unknown webpage as a malicious webpage, mark the URL of the unknown webpage as a blacklist URL, and save that URL to the updated database.
10. The webpage detection device according to claim 6, characterized in that the extraction module is further configured to extract the feature vector from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711230916.6A CN107948168A (en) | 2017-11-29 | 2017-11-29 | Page detection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107948168A true CN107948168A (en) | 2018-04-20 |
Family ID: 61946846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711230916.6A Pending CN107948168A (en) | 2017-11-29 | 2017-11-29 | Page detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107948168A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108683649A (en) * | 2018-04-28 | 2018-10-19 | 四川大学 | A kind of malice domain name detection method based on text feature |
CN108710797A (en) * | 2018-06-15 | 2018-10-26 | 四川大学 | A kind of malice document detection method based on entropy information distribution |
CN109190372A (en) * | 2018-07-09 | 2019-01-11 | 四川大学 | A kind of JavaScript Malicious Code Detection model based on bytecode |
CN109241462A (en) * | 2018-08-13 | 2019-01-18 | 中国联合网络通信集团有限公司 | The black word treatment method of webpage, device, equipment and storage medium |
CN109299346A (en) * | 2018-10-31 | 2019-02-01 | 长春理工大学 | A kind of recognition methods and system of invalid address webpage |
CN109413041A (en) * | 2018-09-25 | 2019-03-01 | 东莞海露欧信息科技有限公司 | Information Security verification method and device |
CN109784038A (en) * | 2018-12-29 | 2019-05-21 | 北京奇安信科技有限公司 | Detecting black chain method, apparatus, system and computer readable storage medium |
CN110661680A (en) * | 2019-09-11 | 2020-01-07 | 深圳市永达电子信息股份有限公司 | Method and system for detecting data stream white list based on regular expression |
CN110781497A (en) * | 2019-10-21 | 2020-02-11 | 新华三信息安全技术有限公司 | Method for detecting web page link and storage medium |
CN111988265A (en) * | 2019-05-23 | 2020-11-24 | 深信服科技股份有限公司 | Network traffic attack identification method, firewall system and related components |
CN112202807A (en) * | 2020-10-13 | 2021-01-08 | 北京明略昭辉科技有限公司 | Grayscale replacement method and device for IP (Internet protocol) blacklist, electronic equipment and storage medium |
CN112511525A (en) * | 2020-11-24 | 2021-03-16 | 山西三友和智慧信息技术股份有限公司 | Website malicious third-party content detection method and system |
CN115001789A (en) * | 2022-05-27 | 2022-09-02 | 绿盟科技集团股份有限公司 | Method, device, equipment and medium for detecting defect-losing equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101820366A (en) * | 2010-01-27 | 2010-09-01 | 南京邮电大学 | Pre-fetching-based phishing web page detection method |
CN101826105A (en) * | 2010-04-02 | 2010-09-08 | 南京邮电大学 | Phishing webpage detection method based on Hungary matching algorithm |
CN102801697A (en) * | 2011-12-20 | 2012-11-28 | 北京安天电子设备有限公司 | Malicious code detection method and system based on plurality of URLs (Uniform Resource Locator) |
CN103544436A (en) * | 2013-10-12 | 2014-01-29 | 深圳先进技术研究院 | System and method for distinguishing phishing websites |
CN103577755A (en) * | 2013-11-01 | 2014-02-12 | 浙江工业大学 | Malicious script static detection method based on SVM (support vector machine) |
CN104217160A (en) * | 2014-09-19 | 2014-12-17 | 中国科学院深圳先进技术研究院 | Method and system for detecting Chinese phishing website |
CN105824822A (en) * | 2015-01-05 | 2016-08-03 | 任子行网络技术股份有限公司 | Method clustering phishing page to locate target page |
WO2016173327A1 (en) * | 2015-04-28 | 2016-11-03 | 北京瀚思安信科技有限公司 | Method and device for detecting website attack |
CN106357618A (en) * | 2016-08-26 | 2017-01-25 | 北京奇虎科技有限公司 | Web abnormality detection method and device |
Non-Patent Citations (2)
Title |
---|
周辉著: "第三章 第3.2.1节 网络爬虫", 《互联网信息监测系统研究》 * |
许伟编著: "第六章 第五节 常见反钓鱼技术", 《网络营销实验指导》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108683649A (en) * | 2018-04-28 | 2018-10-19 | 四川大学 | A kind of malice domain name detection method based on text feature |
CN108710797A (en) * | 2018-06-15 | 2018-10-26 | 四川大学 | A kind of malice document detection method based on entropy information distribution |
CN109190372A (en) * | 2018-07-09 | 2019-01-11 | 四川大学 | A kind of JavaScript Malicious Code Detection model based on bytecode |
CN109241462B (en) * | 2018-08-13 | 2021-12-14 | 中国联合网络通信集团有限公司 | Webpage black word processing method, device, equipment and storage medium |
CN109241462A (en) * | 2018-08-13 | 2019-01-18 | 中国联合网络通信集团有限公司 | The black word treatment method of webpage, device, equipment and storage medium |
CN109413041A (en) * | 2018-09-25 | 2019-03-01 | 东莞海露欧信息科技有限公司 | Information Security verification method and device |
CN109299346A (en) * | 2018-10-31 | 2019-02-01 | 长春理工大学 | A kind of recognition methods and system of invalid address webpage |
CN109784038A (en) * | 2018-12-29 | 2019-05-21 | 北京奇安信科技有限公司 | Detecting black chain method, apparatus, system and computer readable storage medium |
CN111988265A (en) * | 2019-05-23 | 2020-11-24 | 深信服科技股份有限公司 | Network traffic attack identification method, firewall system and related components |
CN110661680A (en) * | 2019-09-11 | 2020-01-07 | 深圳市永达电子信息股份有限公司 | Method and system for detecting data stream white list based on regular expression |
CN110661680B (en) * | 2019-09-11 | 2023-03-14 | 深圳市永达电子信息股份有限公司 | Method and system for detecting data stream white list based on regular expression |
CN110781497A (en) * | 2019-10-21 | 2020-02-11 | 新华三信息安全技术有限公司 | Method for detecting web page link and storage medium |
CN112202807A (en) * | 2020-10-13 | 2021-01-08 | 北京明略昭辉科技有限公司 | Grayscale replacement method and device for IP (Internet protocol) blacklist, electronic equipment and storage medium |
CN112511525A (en) * | 2020-11-24 | 2021-03-16 | 山西三友和智慧信息技术股份有限公司 | Website malicious third-party content detection method and system |
CN115001789A (en) * | 2022-05-27 | 2022-09-02 | 绿盟科技集团股份有限公司 | Method, device, equipment and medium for detecting defect-losing equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107948168A (en) | Page detection method and device | |
CN104077396B (en) | Method and device for detecting phishing website | |
Blum et al. | Lexical feature based phishing URL detection using online learning | |
Azeez et al. | Identifying phishing attacks in communication networks using URL consistency features | |
Lakshmi et al. | Efficient prediction of phishing websites using supervised learning algorithms | |
CN103544436B (en) | System and method for distinguishing phishing websites | |
CN107659570A (en) | Webshell detection methods and system based on machine learning and static and dynamic analysis | |
Wardman et al. | High-performance content-based phishing attack detection | |
CN104899508A (en) | Multistage phishing website detecting method and system | |
Folorunso et al. | Ca-NIDS: A network intrusion detection system using combinatorial algorithm approach | |
Dadkhah et al. | An introduction to journal phishings and their detection approach | |
CN113098887A (en) | Phishing website detection method based on website joint characteristics | |
Liu et al. | An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment | |
Gu et al. | An efficient approach to detecting phishing web | |
CN108023868A (en) | Malice resource address detection method and device | |
CN107665164A (en) | Secure data detection method and device | |
CN105959328A (en) | Evidence graph and vulnerability reasoning combined network evidence collection method and system | |
Mythreya et al. | Prediction and prevention of malicious URL using ML and LR techniques for network security: machine learning | |
Akour et al. | Using classical machine learning for phishing websites detection from URLs | |
Salehi et al. | A novel approach for detecting DGA-based ransomwares | |
Zeydan et al. | Current state of anti-phishing approaches and revealing competencies | |
Alkawaz et al. | Identification and analysis of phishing website based on machine learning methods | |
CN109194605A (en) | A kind of suspected threat index Proactive authentication method and system based on open source information | |
WO2021229786A1 (en) | Learning device, detection device, learning method, detection method, learning program, and detection program | |
CN108694325A (en) | The condition discriminating apparatus of the discriminating conduct and specified type website of specified type website |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180420 |