CN107948168A - Page detection method and device - Google Patents

Page detection method and device

Info

Publication number
CN107948168A
Authority
CN
China
Prior art keywords
link
blacklist
webpage
database
white list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711230916.6A
Other languages
Chinese (zh)
Inventor
张磊
刘亮
陈航
陈纯
邹晓波
方勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Silent Information Technology Co Ltd
Sichuan University
Original Assignee
Sichuan Silent Information Technology Co Ltd
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Silent Information Technology Co Ltd and Sichuan University
Priority to CN201711230916.6A
Publication of CN107948168A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1466 Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks

Abstract

The present invention relates to the field of Internet technology, and in particular to a web page detection method and device applied to a terminal device that includes a database. The method comprises: obtaining malicious link addresses from the Internet to update the blacklist links in the database; matching the link of a web page to be detected against the blacklist links and the whitelist links in the updated database, and marking the page as an unknown web page if it matches neither; generating a classifier from the blacklist links and the whitelist links; extracting script code fragments from the unknown web page and extracting a feature vector from those fragments according to feature keywords; and using the classifier to classify the feature vector against a preset vector to judge whether the unknown web page is a malicious web page. When the unknown web page is malicious, its address is marked as a blacklist address and saved to the updated database. The method helps ensure accurate and efficient web page detection, and the database is updated in real time during detection.

Description

Page detection method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a web page detection method and device.
Background
With the development of network technology, the security of personal information on the network has drawn increasing attention. A user's account and password are an important part of that information. In the prior art, illegal websites often use account-fraud web pages to trick users into giving up their accounts and passwords and then use them for unlawful activity, which seriously harms users' rights and interests. Detecting whether a web page is an account-fraud page is therefore particularly important for network security.
At present, the common detection approaches are pattern matching, heuristic analysis, and honeypot techniques. Pattern matching cannot detect malicious web pages that contain obfuscated code or attacks of unknown type; heuristic analysis has high false-positive and missed-detection rates; and honeypot techniques have a long detection cycle.
Summary of the invention
In view of this, an object of the present invention is to provide a web page detection method and device that help ensure accurate and efficient web page detection and update the database in real time during detection.
To achieve the above object, a preferred embodiment of the present invention provides a web page detection method applied to a terminal device that includes a database, the database storing blacklist links and whitelist links. The method comprises:
obtaining malicious link addresses from the Internet to update the blacklist links in the database, yielding an updated database;
matching the link of a web page to be detected against the blacklist links and the whitelist links in the updated database; if it matches a blacklist link, marking the page as a malicious web page; if it matches a whitelist link, marking the page as a safe web page; and if it matches neither, marking the page as an unknown web page;
generating a classifier from the blacklist links and the whitelist links;
extracting script code fragments from the unknown web page, and extracting a feature vector from the script code fragments according to feature keywords;
using the classifier to classify the feature vector against a preset vector to judge whether the unknown web page is a malicious web page, and, when it is, obtaining the address of the unknown web page, marking it as a blacklist address, and saving it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection method, the step of generating a classifier from the blacklist links and the whitelist links comprises:
obtaining multiple blacklist links and multiple whitelist links, whose total number equals a set value, as initial samples;
assigning the same weight to each code feature in the initial samples, and performing multiple rounds of SVM classification learning to generate an initial classifier;
obtaining multiple unlabelled links from the Internet as candidate samples;
classifying all unlabelled links in the candidate samples with the initial classifier and then selecting a preset number of unlabelled links, the selected links being those close to the classification hyperplane of the initial classifier;
receiving labelling instructions sent by a user for each of the preset number of unlabelled links, and labelling the corresponding unlabelled links according to those instructions to obtain labelled links, each labelled link being a blacklist link or a whitelist link;
adding the preset number of labelled links to the initial samples, again assigning the same weight to each code feature in the initial samples, and performing multiple rounds of SVM classification learning to obtain the classifier.
In a preferred embodiment of the present invention, in the above web page detection method, the step of matching the link of the web page to be detected against the blacklist links and the whitelist links in the updated database comprises:
using a regular expression to extract the main domain of the address of the web page to be detected, the main domains of the blacklist addresses in the updated database, and the main domains of the whitelist links;
computing an MD5 value to be detected from the main domain of the address of the web page to be detected, computing blacklist MD5 values from the main domains of the blacklist addresses in the updated database, and computing whitelist MD5 values from the main domains of the whitelist links;
matching the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values.
In a preferred embodiment of the present invention, in the above web page detection method, malicious code features are stored in the database in advance, and after the link of the web page to be detected has been matched against the blacklist links and the whitelist links in the updated database, the method further comprises:
matching the source code of the unknown web page against the malicious code features; if they match, marking the unknown web page as a malicious web page, marking its address as a blacklist address, and saving it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection method, the step of extracting a feature vector from the script code fragments according to feature keywords comprises:
extracting the feature vector from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy.
The present invention also provides a web page detection device applied to a terminal device that includes a database. The device comprises:
an update module, configured to obtain malicious link addresses from the Internet to update the blacklist links in the database, yielding an updated database;
a first detection module, configured to match the link of a web page to be detected against the blacklist links and the whitelist links in the updated database, to mark the page as a malicious web page if it matches a blacklist link, to mark it as a safe web page if it matches a whitelist link, and to mark it as an unknown web page if it matches neither;
a classifier generation module, configured to generate a classifier from the blacklist links and the whitelist links;
an extraction module, configured to extract script code fragments from the unknown web page and to extract a feature vector from the script code fragments according to feature keywords;
a second detection module, configured to use the classifier to classify the feature vector against a preset vector to judge whether the unknown web page is a malicious web page, and, when it is, to obtain the address of the unknown web page, mark it as a blacklist address, and save it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection device, the classifier generation module comprises:
a first acquisition submodule, configured to obtain multiple blacklist links and multiple whitelist links, whose total number equals a set value, as initial samples;
a first generation submodule, configured to assign the same weight to each code feature in the initial samples and to perform multiple rounds of SVM classification learning to generate an initial classifier;
a second acquisition submodule, configured to obtain multiple unlabelled links from the Internet as candidate samples;
a classification submodule, configured to classify all unlabelled links in the candidate samples with the initial classifier and then select a preset number of unlabelled links, the selected links being those close to the classification hyperplane of the initial classifier;
a receiving submodule, configured to receive labelling instructions sent by a user for each of the preset number of unlabelled links and to label the corresponding unlabelled links according to those instructions to obtain labelled links, each labelled link being a blacklist link or a whitelist link;
a second generation submodule, configured to add the preset number of labelled links to the initial samples, to again assign the same weight to each code feature in the initial samples, and to perform multiple rounds of SVM classification learning to obtain the classifier.
In a preferred embodiment of the present invention, in the above web page detection device, the first detection module comprises:
a main domain acquisition submodule, configured to use a regular expression to extract the main domain of the address of the web page to be detected, the main domains of the blacklist addresses in the updated database, and the main domains of the whitelist links;
a detection submodule, configured to compute an MD5 value to be detected from the main domain of the address of the web page to be detected, to compute blacklist MD5 values from the main domains of the blacklist addresses in the updated database, and to compute whitelist MD5 values from the main domains of the whitelist links;
a matching submodule, configured to match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values.
In a preferred embodiment of the present invention, in the above web page detection device, malicious code features are stored in the database in advance, and the device further comprises:
a third detection module, configured to match the source code of the unknown web page against the malicious code features and, if they match, to mark the unknown web page as a malicious web page, mark its address as a blacklist address, and save it to the updated database.
In a preferred embodiment of the present invention, in the above web page detection device, the extraction module is further configured to extract the feature vector from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy.
In summary, the web page detection method and device provided by the present invention are applied to a terminal device that includes a database. The method obtains malicious link addresses from the Internet to update the blacklist links in the database; matches the link of a web page to be detected against the blacklist links and the whitelist links in the updated database and marks the page as an unknown web page if it matches neither; generates a classifier from the blacklist links and the whitelist links; extracts script code fragments from the unknown web page and extracts a feature vector from them according to feature keywords; and uses the classifier to classify the feature vector against a preset vector to judge whether the unknown web page is a malicious web page. When it is, its address is marked as a blacklist address and saved to the updated database. The method helps ensure accurate and efficient web page detection, and the database is updated in real time during detection.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.
Fig. 1 is a block diagram of a terminal device provided by an embodiment of the present invention.
Fig. 2 is a flowchart of a web page detection method provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of the sub-steps of step S120 in Fig. 2.
Fig. 4 is a schematic diagram of the sub-steps of step S130 in Fig. 2.
Fig. 5 is a block diagram of a web page detection device provided by an embodiment of the present invention.
Fig. 6 is a block diagram of the first detection module provided by an embodiment of the present invention.
Fig. 7 is a block diagram of the classifier generation module provided by an embodiment of the present invention.
Reference numerals: 10-terminal device; 12-memory; 14-processor; 100-web page detection device; 110-update module; 120-first detection module; 121-main domain acquisition submodule; 123-detection submodule; 125-matching submodule; 130-classifier generation module; 131-first acquisition submodule; 132-first generation submodule; 133-second acquisition submodule; 134-classification submodule; 135-receiving submodule; 136-second generation submodule; 140-extraction module; 150-second detection module.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings may be arranged and designed in a variety of configurations.
Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the invention, but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.
Fig. 1 is a block diagram of the terminal device 10 provided by a preferred embodiment of the present invention. The terminal device 10 may be a server, a computer, or another device with data processing capability, and it either includes a database or is associated with one. As shown in Fig. 1, the terminal device 10 includes a memory 12 and a processor 14.
The memory 12 and the processor 14 are electrically connected to each other, directly or indirectly, to transmit or exchange data. For example, these components may be electrically connected through one or more communication buses or signal lines. The memory 12 stores software function modules in the form of software or firmware. By running the software programs and modules stored in the memory 12, such as the web page detection device 100 of the embodiment of the present invention, the processor 14 executes various functional applications and data processing, that is, implements the web page detection method of the embodiment.
The memory 12 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The memory 12 is used to store a program, and the processor 14 executes the program after receiving an execution instruction.
The processor 14 may be an integrated circuit chip with signal processing capability. It may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor.
It will be appreciated that the structure shown in Fig. 1 is only illustrative; the terminal device 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software, or a combination of the two.
Referring to Fig. 2, the present invention provides a web page detection method applied to the terminal device 10 described above. The terminal device 10 includes a database in which blacklist links and whitelist links are stored. The method comprises five steps, S110 to S150.
Step S110: Obtain malicious link addresses from the Internet to update the blacklist links in the database, yielding an updated database.
The malicious link addresses on the Internet may be obtained by collecting those that appear within a preset period.
Specifically, one way is as follows: starting from a given initial link address, the page is crawled to obtain the web page source code at that address. The script code fragments in the source code are extracted and added to a script code fragment set. The link addresses in the source code are also extracted; if an extracted link address is the same as the address of an already crawled web page stored in the database, it is ignored; otherwise it is placed in the queue of link addresses to be crawled and waits for the next crawling operation. Another way is to use a regular expression to extract the main domains of all malicious link addresses from the source code of recently published malicious-website list pages on the Internet and to compute their MD5 values. The blacklist links in the database are then searched for an identical link address. If one exists, the address is treated as a duplicate malicious link address and ignored; otherwise it is treated as a new malicious link and saved to the database. A further way is to implement the web crawling submodule with Scrapy, a web crawling framework written in Python that is used to crawl websites quickly and at a high level and to extract the required data from the crawled pages. No specific limitation is made here; the choice depends on actual needs.
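The first acquisition mode described above (crawl from a seed address, collect script fragments, and queue only link addresses that have not been crawled) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name crawl, the fetch callback, the max_pages limit, and both regular expressions are assumptions introduced here, and an in-memory set stands in for the crawled-address records kept in the database.

```python
import re
from collections import deque
from typing import Callable, Iterable

# Hypothetical patterns for illustration; real pages need a proper HTML parser.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r"""href\s*=\s*["'](https?://[^"'\s>]+)["']""", re.IGNORECASE)


def crawl(seeds: Iterable[str], fetch: Callable[[str], str], max_pages: int = 100):
    """Breadth-first crawl that ignores link addresses already crawled.

    fetch(url) is assumed to return the page source as text.
    """
    queue = deque(seeds)      # queue of link addresses waiting to be crawled
    crawled = set()           # stands in for the crawled addresses in the database
    script_fragments = []     # the script code fragment set
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        if url in crawled:
            continue          # already crawled: ignore this address
        crawled.add(url)
        source = fetch(url)
        # Keep only non-empty inline script bodies.
        for body in SCRIPT_RE.findall(source):
            if body.strip():
                script_fragments.append(body.strip())
        # Queue newly seen link addresses for the next crawling round.
        for link in HREF_RE.findall(source):
            if link not in crawled:
                queue.append(link)
    return crawled, script_fragments
```

Passing fetch as a parameter keeps the sketch independent of any particular HTTP client; a Scrapy spider, as mentioned above, would replace the queue and the duplicate check with the framework's own scheduler.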
Step S120: Match the link of the web page to be detected against the blacklist links and the whitelist links in the updated database. If it matches a blacklist link, mark the web page to be detected as a malicious web page; if it matches a whitelist link, mark it as a safe web page; if it matches neither, mark it as an unknown web page.
Specifically, referring to Fig. 3, in this embodiment the matching of the link of the web page to be detected against the blacklist links and the whitelist links in the updated database optionally includes the following sub-steps:
S121: Use a regular expression to extract the main domain of the address of the web page to be detected, the main domains of the blacklist addresses in the updated database, and the main domains of the whitelist links.
S123: Compute an MD5 value to be detected from the main domain of the address of the web page to be detected, compute blacklist MD5 values from the main domains of the blacklist addresses in the updated database, and compute whitelist MD5 values from the main domains of the whitelist links.
S125: Match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values.
The above arrangement provides a fast preliminary check of the web page to be detected and marks it according to the result. It should be noted that when the web page to be detected is marked as a malicious web page or a safe web page, it is not examined further; when it is marked as an unknown web page, detection continues with the next step.
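Sub-steps S121 to S125 can be illustrated with the short sketch below. It makes a simplifying assumption that is not stated in the source: the main domain is taken to be the last two labels of the host name, which is wrong for multi-part suffixes such as .com.cn (a public-suffix list would be needed there). The names main_domain, md5_of, and first_stage_check are hypothetical.

```python
import hashlib
import re

# Optional scheme, optional userinfo, then the host up to the first / : ? or #.
HOST_RE = re.compile(r"^(?:[a-z][a-z0-9+.-]*://)?(?:[^@/]*@)?([^/:?#]+)", re.IGNORECASE)


def main_domain(url: str) -> str:
    """S121: extract the main domain with a regular expression.

    Assumption: the main domain is the last two labels of the host.
    """
    match = HOST_RE.match(url.strip())
    if not match:
        raise ValueError(f"cannot parse host from {url!r}")
    labels = match.group(1).lower().split(".")
    return ".".join(labels[-2:])


def md5_of(domain: str) -> str:
    """S123: MD5 value of a main domain, as a hex string."""
    return hashlib.md5(domain.encode("utf-8")).hexdigest()


def first_stage_check(url: str, blacklist_md5: set, whitelist_md5: set) -> str:
    """S125: compare the MD5 value to be detected with both lists."""
    digest = md5_of(main_domain(url))
    if digest in blacklist_md5:
        return "malicious"
    if digest in whitelist_md5:
        return "safe"
    return "unknown"   # handed on to the classifier stage
```

Storing the MD5 values in sets makes each lookup a constant-time membership test, which fits the role of this step as a fast preliminary check.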
Step S130: Generate a classifier from the blacklist links and the whitelist links.
Referring to Fig. 4, in this embodiment the step of generating a classifier from the blacklist addresses and the whitelist addresses optionally proceeds as follows:
S131: Obtain multiple blacklist links and multiple whitelist links, whose total number equals a set value, as initial samples.
The set value may be, but is not limited to, 10, 20, 50, or 100, and is chosen according to actual needs. The number of blacklist links may be the same as or different from the number of whitelist links; this is not specifically limited here.
S132: Assign the same weight to each code feature in the initial samples, and perform multiple rounds of SVM classification learning to generate an initial classifier.
For example, when the set value is n, the weight of each feature sample takes the same value 1/n. The i-th (i = 1, 2, 3, ...) round of SVM classification learning is then performed, and the samples in the initial samples that lie within the margin between the supporting planes are identified; these samples are given weights greater than those of the samples outside the supporting planes, so that the initial classifier is generated from the initial samples.
S133: Obtain multiple unlabelled links from the Internet as candidate samples.
The number of unlabelled links may be, but is not limited to, 20, 40, 100, or all of the candidate samples; it is not specifically limited here and is set according to actual needs.
S134: Classify all unlabelled links in the candidate samples with the initial classifier and then select a preset number of unlabelled links, the selected links being those close to the classification hyperplane of the initial classifier.
S135: Receive the labelling instructions sent by a user for each of the preset number of unlabelled links, and label the corresponding unlabelled links according to those instructions to obtain labelled links, each labelled link being a blacklist link or a whitelist link.
S136: Add the preset number of labelled links to the initial samples, again assign the same weight to each code feature in the initial samples, and perform multiple rounds of SVM classification learning to obtain the classifier.
With the above arrangement, using this classifier when detecting the unknown web page makes the detection result more accurate.
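The selection in S134 and the sample update in S135 and S136 can be sketched as below. SVM training itself is outside this sketch: it assumes that a trained linear classifier is available as a weight vector and a bias, so that a sample's distance to the hyperplane is proportional to |w·x + b|. The function names, the (url, features) tuple layout, and the ask_label callback that stands in for the user's labelling instruction are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def decision_value(weights: Vector, bias: float, x: Vector) -> float:
    """Signed score w.x + b of a sample under a linear classifier."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def select_uncertain(weights: Vector, bias: float,
                     candidates: List[Tuple[str, Vector]], k: int):
    """S134: pick the k candidates closest to the classification hyperplane.

    The norm of w is the same for every sample, so ranking by |w.x + b|
    gives the same order as ranking by geometric distance.
    """
    ranked = sorted(candidates,
                    key=lambda item: abs(decision_value(weights, bias, item[1])))
    return ranked[:k]


def extend_training_set(initial, selected, ask_label: Callable[[str], int]):
    """S135/S136: add user-labelled links and reset the weights to 1/n.

    Labels are assumed to be +1 (blacklist link) or -1 (whitelist link).
    """
    extended = list(initial)
    for url, features in selected:
        extended.append((url, features, ask_label(url)))
    uniform = 1.0 / len(extended)
    return extended, [uniform] * len(extended)
```

Asking the user to label only the samples nearest the hyperplane concentrates manual effort on the links the initial classifier is least sure about, which is the usual motivation for this kind of active learning.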
Step S140: Extract the script code fragments of the unknown web page, and extract a feature vector from the script code fragments according to feature keywords.
In this embodiment, the feature vector may be extracted from the script code fragments according to the number of occurrences of the feature keywords, the string length, and the information entropy.
The feature keywords may include, but are not limited to, frame-based trojan embedding, JavaScript encryption and obfuscation, DOM attributes or methods, JavaScript execution functions, string manipulation functions, CLSID identifiers, space characters, special strings, and the exe suffix. The string length refers to the overlong strings that attackers add to malicious web pages as interference; such overlong strings rarely appear in normal web pages, so the length of the longest string can be used as a detection feature. Information entropy is often used as a feature in machine learning: the information entropy of malicious web page samples and of normal web page samples can be computed separately, and because the entropy values of the two kinds of samples are distributed quite differently, information entropy can also serve as a classification feature for the machine learning algorithm.
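The three kinds of features named above can be combined into one vector as in the following sketch. The keyword tuple is a small hypothetical sample chosen for illustration, not the keyword list of the embodiment, and the string-literal pattern is a simplification that only recognises single- and double-quoted literals.

```python
import math
import re
from collections import Counter

# Hypothetical keyword sample; the embodiment's actual list is not given.
KEYWORDS = ("eval", "unescape", "document.write", "iframe",
            "fromCharCode", "clsid", ".exe")

# Double- or single-quoted literals, allowing backslash escapes.
STRING_RE = re.compile(r"\"([^\"\\]*(?:\\.[^\"\\]*)*)\"|'([^'\\]*(?:\\.[^'\\]*)*)'")


def shannon_entropy(text: str) -> float:
    """Information entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(text).values())


def longest_string_literal(script: str) -> int:
    """Length of the longest quoted string in the script, 0 if none."""
    return max((len(a or b) for a, b in STRING_RE.findall(script)), default=0)


def feature_vector(script: str) -> list:
    """Keyword counts, then longest string length, then entropy."""
    lowered = script.lower()
    counts = [float(lowered.count(k.lower())) for k in KEYWORDS]
    return counts + [float(longest_string_literal(script)),
                     shannon_entropy(script)]
```

In practice the three feature groups have very different scales, so some form of normalisation before SVM training would probably be needed; the source does not say how, or whether, this is done.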
Step S150: Use the classifier to classify the feature vector against a preset vector to judge whether the unknown web page is a malicious web page, and, when it is, obtain the address of the unknown web page, mark it as a blacklist address, and save it to the updated database.
The number of occurrences of the feature keywords, the string length, and the information entropy of the web page to be detected are each fed into the classifier built by active learning for comparison, and the final classification is tallied, giving either a malicious website or a normal website; the result is stored in the database.
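As a rough illustration of step S150, the sketch below applies a linear classifier to a feature vector and saves the address to a blacklist table when the page is judged malicious. Several things here are assumptions rather than details from the source: the classifier is represented by a weight vector and a bias, a positive score is taken to mean malicious, SQLite stands in for the database, and the table and column names are hypothetical. The source does not specify how the preset vector enters the decision, so it is not modelled.

```python
import sqlite3


def classify(weights, bias, features) -> bool:
    """Return True when the linear score w.x + b is positive.

    Assumption: a positive score means the page is malicious.
    """
    return sum(w * x for w, x in zip(weights, features)) + bias > 0


def record_if_malicious(conn: sqlite3.Connection, url: str,
                        weights, bias, features) -> bool:
    """Classify one unknown page and update the blacklist table if needed."""
    # Hypothetical schema: one table holding blacklist addresses.
    conn.execute("CREATE TABLE IF NOT EXISTS blacklist (url TEXT PRIMARY KEY)")
    malicious = classify(weights, bias, features)
    if malicious:
        # INSERT OR IGNORE avoids duplicate blacklist entries.
        conn.execute("INSERT OR IGNORE INTO blacklist (url) VALUES (?)", (url,))
        conn.commit()
    return malicious
```

Because the blacklist is written as soon as a page is classified, later pages from the same address are caught by the fast matching of step S120 and never reach the classifier.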
The above method helps ensure accurate and efficient web page detection, and the database is updated in real time during detection.
To make web page detection more efficient still, in this embodiment malicious code features are optionally stored in the database in advance, and after the link of the web page to be detected has been matched against the blacklist links and the whitelist links in the updated database, the method further includes:
matching the source code of the unknown web page against the malicious code features; if they match, marking the unknown web page as a malicious web page, marking its address as a blacklist address, and saving it to the updated database.
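Matching page source code against stored malicious code features can be sketched as a signature scan. The two patterns below are hypothetical examples of what such features might look like (a zero-sized iframe and an eval of unescaped data); the source does not list the actual features.

```python
import re

# Hypothetical malicious code features, for illustration only.
SIGNATURES = [
    re.compile(r"<iframe[^>]*\b(?:width|height)\s*=\s*[\"']?0", re.IGNORECASE),
    re.compile(r"eval\s*\(\s*unescape\s*\(", re.IGNORECASE),
]


def matches_known_signature(source: str, signatures=SIGNATURES) -> bool:
    """Return True if any stored feature pattern occurs in the page source."""
    return any(sig.search(source) is not None for sig in signatures)
```

A page that matches can be marked malicious immediately, so the more expensive feature extraction and classification are only needed for pages that match no known feature.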
Referring to Fig. 5, on the basis of the above, the present invention further provides a web page detection device 100. The web page detection device 100 includes an update module 110, a first detection module 120, a classifier generation module 130, an extraction module 140, and a second detection module 150.
The update module 110 is configured to obtain malicious link addresses on the Internet to update the blacklist links in the database, thereby obtaining an updated database. Specifically, the update module 110 may be used to perform step S110 shown in Fig. 2; for the specific operation, refer to the detailed description of step S110.
The first detection module 120 is configured to match the link of the web page to be detected against the blacklist links and the whitelist links in the updated database respectively; if it matches a blacklist link, the web page to be detected is marked as a malicious web page; if it matches a whitelist link, it is marked as a safe web page; if it matches neither, it is marked as an unknown web page. Specifically, the first detection module 120 may be used to perform step S120 shown in Fig. 2; for the specific operation, refer to the detailed description of step S120.
With reference to Fig. 6, optionally, in this embodiment, the first detection module 120 includes a main domain acquisition submodule 121, a detection submodule 123, and a matching submodule 125.
The main domain acquisition submodule 121 is configured to use a regular expression to extract the main domain of the address of the web page to be detected, the main domains of the blacklist addresses in the updated database, and the main domains of the whitelist links. Specifically, the main domain acquisition submodule 121 may be used to perform sub-step S121 shown in Fig. 3; for the specific operation, refer to the detailed description of sub-step S121.
The detection submodule 123 is configured to obtain an MD5 value to be detected from the main domain of the address of the web page to be detected, to obtain blacklist MD5 values from the main domains of the blacklist addresses in the updated database, and to obtain whitelist MD5 values from the main domains of the whitelist links. Specifically, the detection submodule 123 may be used to perform sub-step S123 shown in Fig. 3; for the specific operation, refer to the detailed description of sub-step S123.
The matching submodule 125 is configured to match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values respectively. Specifically, the matching submodule 125 may be used to perform sub-step S125 shown in Fig. 3; for the specific operation, refer to the detailed description of sub-step S125.
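The three submodules above can be sketched together as follows. The patent does not give its regular expression or define "main domain" precisely, so HOST_RE and the rule of keeping the last two host labels are assumptions (the rule is naive and would mishandle suffixes such as co.uk); the function names are hypothetical as well.

```python
import hashlib
import re

# Simplified, assumed pattern: captures the host part of an http(s) URL.
HOST_RE = re.compile(r"^(?:https?://)?([^/:?#]+)", re.IGNORECASE)


def main_domain(url):
    """Return the last two labels of the host, e.g. 'evil.example'."""
    match = HOST_RE.match(url.strip())
    if not match:
        return ""
    labels = match.group(1).lower().split(".")
    return ".".join(labels[-2:])


def domain_md5(url):
    """MD5 hex digest of the main domain of `url`."""
    return hashlib.md5(main_domain(url).encode("utf-8")).hexdigest()


def first_stage_label(url, blacklist_urls, whitelist_urls):
    """Compare the MD5 of the page's main domain with both lists."""
    black = {domain_md5(u) for u in blacklist_urls}
    white = {domain_md5(u) for u in whitelist_urls}
    digest = domain_md5(url)
    if digest in black:
        return "malicious"
    if digest in white:
        return "safe"
    return "unknown"
```

Hashing the main domain gives fixed-length keys that are cheap to compare; in practice the list digests would be computed once and stored rather than rebuilt on every call as they are here.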
The classifier generation module 130 is configured to generate a classifier according to the blacklist links and the whitelist links. Specifically, the classifier generation module 130 may be used to perform step S130 shown in Fig. 2; for the specific operation, refer to the detailed description of step S130.
With reference to Fig. 7, optionally, in this embodiment, the classifier generation module 130 includes a first acquisition submodule 131, a first generation submodule 132, a second acquisition submodule 133, a classification submodule 134, a receiving submodule 135, and a second generation submodule 136.
The first acquisition submodule 131 is configured to obtain multiple blacklist links and multiple whitelist links, totaling a set number, as an initial sample. Specifically, the first acquisition submodule 131 may be used to perform sub-step S131 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S131.
The first generation submodule 132 is configured to assign the same weight to each code feature in the initial sample and to perform multiple rounds of SVM classification learning to generate an initial classifier. Specifically, the first generation submodule 132 may be used to perform sub-step S132 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S132.
The second acquisition submodule 133 is configured to obtain multiple unlabeled links on the Internet as candidate samples. Specifically, the second acquisition submodule 133 may be used to perform sub-step S133 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S133.
The classification submodule 134 is configured to have the initial classifier classify all unlabeled links in the candidate samples and then to select a preset number of unlabeled links, where the selected unlabeled links lie close to the separating hyperplane of the initial classifier. Specifically, the classification submodule 134 may be used to perform sub-step S134 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S134.
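The selection of links close to the separating hyperplane could look like the sketch below. It assumes a linear classifier whose weight vector and bias are already available; the function names and the (link, features) tuple layout are hypothetical. The distance of a point to the hyperplane is |w·x + b| / ||w||, and since ||w|| is the same for every candidate, ranking by |w·x + b| gives the same order.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b of a linear classifier."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def select_uncertain(candidates, weights, bias, k):
    """Return the k candidates whose scores are closest to zero.

    `candidates` is a list of (link, feature_vector) pairs. A score near
    zero means the link lies near the separating hyperplane, i.e. the
    initial classifier is least certain about it.
    """
    ranked = sorted(
        candidates,
        key=lambda item: abs(decision_value(weights, bias, item[1])),
    )
    return ranked[:k]
```

These are the links handed to the user for labeling, on the usual active-learning reasoning that a label for an uncertain sample tells the classifier more than a label for one it already classifies confidently.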
The receiving submodule 135 is configured to receive the labeling instruction sent by a user for each unlabeled link among the preset number of unlabeled links, and to label the corresponding unlabeled link according to the labeling instruction to obtain a labeled link, each labeled link being a blacklist link or a whitelist link. Specifically, the receiving submodule 135 may be used to perform sub-step S135 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S135.
The second generation submodule 136 is configured to add the preset number of labeled links to the initial sample, to again assign the same weight to each code feature in the initial sample, and to perform multiple rounds of SVM classification learning to obtain the classifier. Specifically, the second generation submodule 136 may be used to perform sub-step S136 shown in Fig. 4; for the specific operation, refer to the detailed description of sub-step S136.
The extraction module 140 is configured to extract a script code fragment of the unknown web page and to extract a feature vector from the script code fragment according to feature keywords. Specifically, the extraction module 140 may be used to perform step S140 shown in Fig. 2; for the specific operation, refer to the detailed description of step S140.
Optionally, in this embodiment, the extraction module 140 is further configured to extract the feature vector from the script code fragment according to the number of occurrences of the feature keywords, the string length, and the information entropy value.
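A minimal sketch of such a feature extractor is given below. Several details are assumptions rather than the patent's definitions: the keyword list FEATURE_KEYWORDS is made up, "string length" is taken to mean the length of the longest quoted string literal in the script, and the information entropy is computed as the Shannon entropy of the script's characters.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list; the patent does not enumerate its keywords.
FEATURE_KEYWORDS = ("eval", "unescape", "document.write", "iframe")


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of `text`."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_string_literal(script):
    """Length of the longest single- or double-quoted literal (assumed meaning)."""
    literals = re.findall(r"\"([^\"]*)\"|'([^']*)'", script)
    return max((len(a or b) for a, b in literals), default=0)


def extract_features(script):
    """Feature vector: one count per keyword, then string length, then entropy."""
    counts = [script.count(keyword) for keyword in FEATURE_KEYWORDS]
    return counts + [longest_string_literal(script), shannon_entropy(script)]
```

Long string literals and unusual character entropy are commonly associated with obfuscated scripts, which is presumably why they sit alongside the keyword counts in the feature vector.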
The second detection module 150 is configured to use the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page, and, when it is, to obtain the URL of the unknown web page, mark it as a blacklist URL, and save it to the updated database. Specifically, the second detection module 150 may be used to perform step S150 shown in Fig. 2; for the specific operation, refer to the detailed description of step S150.
Optionally, malicious code features are pre-stored in the database, and the web page detection device 100 further includes a third detection module. The third detection module is configured to match the source code of the unknown web page against the malicious code features; if they match, the unknown web page is marked as a malicious web page, and the URL of the unknown web page is marked as a blacklist URL and saved to the updated database.
In summary, the web page detection method and device provided by the present invention are applied to a terminal device 10 that includes a database. The method includes: obtaining malicious link addresses on the Internet to update the blacklist links in the database; matching the link of the web page to be detected against the blacklist links and the whitelist links in the updated database respectively, and marking the page as an unknown web page if it matches neither; generating a classifier according to the blacklist links and the whitelist links; extracting a script code fragment of the unknown web page and extracting a feature vector from it according to feature keywords; using the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page; and, when it is, marking its URL as a blacklist URL and saving it to the updated database. This method effectively ensures the accuracy and efficiency of web page detection, and the database is updated in real time while web pages are being detected.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may also be implemented in other ways. The device and method embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architecture, functions, and operations of devices, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. It should further be noted that each block in the block diagrams and/or flowcharts, and any combination of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. In addition, the functional modules in the embodiments of the present invention may be integrated to form a single independent part, each module may exist separately, or two or more modules may be integrated to form a single independent part.
If the functions are implemented in the form of software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc. It should be noted that, herein, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may be modified and varied in various ways. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A web page detection method, applied to a terminal device including a database, characterized in that blacklist links and whitelist links are stored in the database, the method comprising:
obtaining malicious link addresses on the Internet to update the blacklist links in the database, thereby obtaining an updated database;
matching the link of a web page to be detected against the blacklist links and the whitelist links in the updated database respectively; if it matches a blacklist link, marking the web page to be detected as a malicious web page; if it matches a whitelist link, marking the web page to be detected as a safe web page; and if it matches neither, marking it as an unknown web page;
generating a classifier according to the blacklist links and the whitelist links;
extracting a script code fragment of the unknown web page, and extracting a feature vector from the script code fragment according to feature keywords;
using the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page, and, when the unknown web page is a malicious web page, obtaining the URL of the unknown web page, marking it as a blacklist URL, and saving it to the updated database.
2. The web page detection method according to claim 1, characterized in that the step of generating a classifier according to the blacklist links and the whitelist links comprises:
obtaining multiple blacklist links and multiple whitelist links, totaling a set number, as an initial sample;
assigning the same weight to each code feature in the initial sample, and performing multiple rounds of SVM classification learning to generate an initial classifier;
obtaining multiple unlabeled links on the Internet as candidate samples;
having the initial classifier classify all unlabeled links in the candidate samples and then selecting a preset number of unlabeled links, wherein the selected unlabeled links lie close to the separating hyperplane of the initial classifier;
receiving the labeling instruction sent by a user for each unlabeled link among the preset number of unlabeled links, and labeling the corresponding unlabeled link according to the labeling instruction to obtain a labeled link, each labeled link being a blacklist link or a whitelist link;
adding the preset number of labeled links to the initial sample, again assigning the same weight to each code feature in the initial sample, and performing multiple rounds of SVM classification learning to obtain the classifier.
3. The web page detection method according to claim 1, characterized in that the step of matching the link of the web page to be detected against the blacklist links and the whitelist links in the updated database respectively comprises:
using a regular expression to extract the main domain of the address of the web page to be detected, the main domains of the blacklist addresses in the updated database, and the main domains of the whitelist links;
obtaining an MD5 value to be detected from the main domain of the address of the web page to be detected, obtaining blacklist MD5 values from the main domains of the blacklist addresses in the updated database, and obtaining whitelist MD5 values from the main domains of the whitelist links;
matching the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values respectively.
4. The web page detection method according to claim 1, characterized in that malicious code features are pre-stored in the database, and after the link of the web page to be detected has been matched against the blacklist links and the whitelist links in the updated database respectively, the method further comprises:
matching the source code of the unknown web page against the malicious code features; if they match, marking the unknown web page as a malicious web page, marking the URL of the unknown web page as a blacklist URL, and saving it to the updated database.
5. The web page detection method according to claim 1, characterized in that the step of extracting a feature vector from the script code fragment according to feature keywords comprises:
extracting the feature vector from the script code fragment according to the number of occurrences of the feature keywords, the string length, and the information entropy value.
6. A web page detection device, characterized in that it is applied to a terminal device including a database, the device comprising:
an update module, configured to obtain malicious link addresses on the Internet to update the blacklist links in the database, thereby obtaining an updated database;
a first detection module, configured to match the link of a web page to be detected against the blacklist links and the whitelist links in the updated database respectively; if it matches a blacklist link, to mark the web page to be detected as a malicious web page; if it matches a whitelist link, to mark the web page to be detected as a safe web page; and if it matches neither, to mark it as an unknown web page;
a classifier generation module, configured to generate a classifier according to the blacklist links and the whitelist links;
an extraction module, configured to extract a script code fragment of the unknown web page and to extract a feature vector from the script code fragment according to feature keywords;
a second detection module, configured to use the classifier to classify the feature vector against a preset vector to determine whether the unknown web page is a malicious web page, and, when the unknown web page is a malicious web page, to obtain the URL of the unknown web page, mark it as a blacklist URL, and save it to the updated database.
7. The web page detection device according to claim 6, characterized in that the classifier generation module comprises:
a first acquisition submodule, configured to obtain multiple blacklist links and multiple whitelist links, totaling a set number, as an initial sample;
a first generation submodule, configured to assign the same weight to each code feature in the initial sample and to perform multiple rounds of SVM classification learning to generate an initial classifier;
a second acquisition submodule, configured to obtain multiple unlabeled links on the Internet as candidate samples;
a classification submodule, configured to have the initial classifier classify all unlabeled links in the candidate samples and then to select a preset number of unlabeled links, wherein the selected unlabeled links lie close to the separating hyperplane of the initial classifier;
a receiving submodule, configured to receive the labeling instruction sent by a user for each unlabeled link among the preset number of unlabeled links, and to label the corresponding unlabeled link according to the labeling instruction to obtain a labeled link, each labeled link being a blacklist link or a whitelist link;
a second generation submodule, configured to add the preset number of labeled links to the initial sample, to again assign the same weight to each code feature in the initial sample, and to perform multiple rounds of SVM classification learning to obtain the classifier.
8. The web page detection device according to claim 6, characterized in that the first detection module comprises:
a main domain acquisition submodule, configured to use a regular expression to extract the main domain of the address of the web page to be detected, the main domains of the blacklist addresses in the updated database, and the main domains of the whitelist links;
a detection submodule, configured to obtain an MD5 value to be detected from the main domain of the address of the web page to be detected, to obtain blacklist MD5 values from the main domains of the blacklist addresses in the updated database, and to obtain whitelist MD5 values from the main domains of the whitelist links;
a matching submodule, configured to match the MD5 value to be detected against the blacklist MD5 values and the whitelist MD5 values respectively.
9. The web page detection device according to claim 6, characterized in that malicious code features are pre-stored in the database, the device further comprising:
a third detection module, configured to match the source code of the unknown web page against the malicious code features; if they match, to mark the unknown web page as a malicious web page, mark the URL of the unknown web page as a blacklist URL, and save it to the updated database.
10. The web page detection device according to claim 6, characterized in that the extraction module is further configured to extract the feature vector from the script code fragment according to the number of occurrences of the feature keywords, the string length, and the information entropy value.
CN201711230916.6A 2017-11-29 2017-11-29 Page detection method and device Pending CN107948168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711230916.6A CN107948168A (en) 2017-11-29 2017-11-29 Page detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711230916.6A CN107948168A (en) 2017-11-29 2017-11-29 Page detection method and device

Publications (1)

Publication Number Publication Date
CN107948168A true CN107948168A (en) 2018-04-20

Family

ID=61946846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711230916.6A Pending CN107948168A (en) 2017-11-29 2017-11-29 Page detection method and device

Country Status (1)

Country Link
CN (1) CN107948168A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108683649A (en) * 2018-04-28 2018-10-19 四川大学 A kind of malice domain name detection method based on text feature
CN108710797A (en) * 2018-06-15 2018-10-26 四川大学 A kind of malice document detection method based on entropy information distribution
CN109190372A (en) * 2018-07-09 2019-01-11 四川大学 A kind of JavaScript Malicious Code Detection model based on bytecode
CN109241462A (en) * 2018-08-13 2019-01-18 中国联合网络通信集团有限公司 The black word treatment method of webpage, device, equipment and storage medium
CN109299346A (en) * 2018-10-31 2019-02-01 长春理工大学 A kind of recognition methods and system of invalid address webpage
CN109413041A (en) * 2018-09-25 2019-03-01 东莞海露欧信息科技有限公司 Information Security verification method and device
CN109784038A (en) * 2018-12-29 2019-05-21 北京奇安信科技有限公司 Detecting black chain method, apparatus, system and computer readable storage medium
CN110661680A (en) * 2019-09-11 2020-01-07 深圳市永达电子信息股份有限公司 Method and system for detecting data stream white list based on regular expression
CN110781497A (en) * 2019-10-21 2020-02-11 新华三信息安全技术有限公司 Method for detecting web page link and storage medium
CN111988265A (en) * 2019-05-23 2020-11-24 深信服科技股份有限公司 Network traffic attack identification method, firewall system and related components
CN112202807A (en) * 2020-10-13 2021-01-08 北京明略昭辉科技有限公司 Grayscale replacement method and device for IP (Internet protocol) blacklist, electronic equipment and storage medium
CN112511525A (en) * 2020-11-24 2021-03-16 山西三友和智慧信息技术股份有限公司 Website malicious third-party content detection method and system
CN115001789A (en) * 2022-05-27 2022-09-02 绿盟科技集团股份有限公司 Method, device, equipment and medium for detecting defect-losing equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101820366A (en) * 2010-01-27 2010-09-01 南京邮电大学 Pre-fetching-based phishing web page detection method
CN101826105A (en) * 2010-04-02 2010-09-08 南京邮电大学 Phishing webpage detection method based on Hungary matching algorithm
CN102801697A (en) * 2011-12-20 2012-11-28 北京安天电子设备有限公司 Malicious code detection method and system based on plurality of URLs (Uniform Resource Locator)
CN103544436A (en) * 2013-10-12 2014-01-29 深圳先进技术研究院 System and method for distinguishing phishing websites
CN103577755A (en) * 2013-11-01 2014-02-12 浙江工业大学 Malicious script static detection method based on SVM (support vector machine)
CN104217160A (en) * 2014-09-19 2014-12-17 中国科学院深圳先进技术研究院 Method and system for detecting Chinese phishing website
CN105824822A (en) * 2015-01-05 2016-08-03 任子行网络技术股份有限公司 Method clustering phishing page to locate target page
WO2016173327A1 (en) * 2015-04-28 2016-11-03 北京瀚思安信科技有限公司 Method and device for detecting website attack
CN106357618A (en) * 2016-08-26 2017-01-25 北京奇虎科技有限公司 Web abnormality detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
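Zhou Hui (周辉): "Chapter 3, Section 3.2.1: Web Crawlers", 《互联网信息监测系统研究》 (Research on Internet Information Monitoring Systems) *
Xu Wei (许伟), ed.: "Chapter 6, Section 5: Common Anti-Phishing Techniques", 《网络营销实验指导》 (Online Marketing Experiment Guide) *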

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108683649A (en) * 2018-04-28 2018-10-19 四川大学 A kind of malice domain name detection method based on text feature
CN108710797A (en) * 2018-06-15 2018-10-26 四川大学 A kind of malice document detection method based on entropy information distribution
CN109190372A (en) * 2018-07-09 2019-01-11 四川大学 A kind of JavaScript Malicious Code Detection model based on bytecode
CN109241462B (en) * 2018-08-13 2021-12-14 中国联合网络通信集团有限公司 Webpage black word processing method, device, equipment and storage medium
CN109241462A (en) * 2018-08-13 2019-01-18 中国联合网络通信集团有限公司 The black word treatment method of webpage, device, equipment and storage medium
CN109413041A (en) * 2018-09-25 2019-03-01 东莞海露欧信息科技有限公司 Information Security verification method and device
CN109299346A (en) * 2018-10-31 2019-02-01 长春理工大学 A kind of recognition methods and system of invalid address webpage
CN109784038A (en) * 2018-12-29 2019-05-21 北京奇安信科技有限公司 Detecting black chain method, apparatus, system and computer readable storage medium
CN111988265A (en) * 2019-05-23 2020-11-24 深信服科技股份有限公司 Network traffic attack identification method, firewall system and related components
CN110661680A (en) * 2019-09-11 2020-01-07 深圳市永达电子信息股份有限公司 Method and system for detecting data stream white list based on regular expression
CN110661680B (en) * 2019-09-11 2023-03-14 深圳市永达电子信息股份有限公司 Method and system for detecting data stream white list based on regular expression
CN110781497A (en) * 2019-10-21 2020-02-11 新华三信息安全技术有限公司 Method for detecting web page link and storage medium
CN112202807A (en) * 2020-10-13 2021-01-08 北京明略昭辉科技有限公司 Grayscale replacement method and device for IP (Internet protocol) blacklist, electronic equipment and storage medium
CN112511525A (en) * 2020-11-24 2021-03-16 山西三友和智慧信息技术股份有限公司 Website malicious third-party content detection method and system
CN115001789A (en) * 2022-05-27 2022-09-02 绿盟科技集团股份有限公司 Method, device, equipment and medium for detecting defect-losing equipment

Similar Documents

Publication Publication Date Title
CN107948168A (en) Page detection method and device
CN104077396B (en) Method and device for detecting phishing website
Blum et al. Lexical feature based phishing URL detection using online learning
Azeez et al. Identifying phishing attacks in communication networks using URL consistency features
Lakshmi et al. Efficient prediction of phishing websites using supervised learning algorithms
CN103544436B (en) System and method for distinguishing phishing websites
CN107659570A (en) Webshell detection methods and system based on machine learning and static and dynamic analysis
Wardman et al. High-performance content-based phishing attack detection
CN104899508A (en) Multistage phishing website detecting method and system
Folorunso et al. Ca-NIDS: A network intrusion detection system using combinatorial algorithm approach
Dadkhah et al. An introduction to journal phishings and their detection approach
CN113098887A (en) Phishing website detection method based on website joint characteristics
Liu et al. An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment
Gu et al. An efficient approach to detecting phishing web
CN108023868A (en) Malice resource address detection method and device
CN107665164A (en) Secure data detection method and device
CN105959328A (en) Evidence graph and vulnerability reasoning combined network evidence collection method and system
Mythreya et al. Prediction and prevention of malicious URL using ML and LR techniques for network security: machine learning
Akour et al. Using classical machine learning for phishing websites detection from URLs
Salehi et al. A novel approach for detecting DGA-based ransomwares
Zeydan et al. Current state of anti-phishing approaches and revealing competencies
Alkawaz et al. Identification and analysis of phishing website based on machine learning methods
CN109194605A (en) A kind of suspected threat index Proactive authentication method and system based on open source information
WO2021229786A1 (en) Learning device, detection device, learning method, detection method, learning program, and detection program
CN108694325A (en) The condition discriminating apparatus of the discriminating conduct and specified type website of specified type website

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180420