CN108683666A - Web page identification method and device - Google Patents

Web page identification method and device

Info

Publication number
CN108683666A
Authority
CN
China
Prior art keywords
page
target webpage
feature
preset
page data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810468614.0A
Other languages
Chinese (zh)
Other versions
CN108683666B (en)
Inventor
任方英
郝益壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Security Technologies Co Ltd
Original Assignee
New H3C Security Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Security Technologies Co Ltd filed Critical New H3C Security Technologies Co Ltd
Priority to CN201810468614.0A priority Critical patent/CN108683666B/en
Publication of CN108683666A publication Critical patent/CN108683666A/en
Application granted granted Critical
Publication of CN108683666B publication Critical patent/CN108683666B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0236 Filtering by address, protocol, port number or service, e.g. IP-address or URL

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the present invention provides a web page identification method and device, relating to the technical field of network security. The method is applied to a security device and includes: receiving a web page request sent by a user terminal, the web page request carrying the URL of a target web page to be visited; obtaining the page data of the target web page from the server of the target web page according to the URL of the target web page; determining the page feature of the target web page according to a preset feature extraction rule and the page data; and judging, according to the page feature and a preset page classification model, whether the target web page is an illegal web page, and, if it is, sending warning information to the user terminal. The present invention can improve the recognition rate of illegal websites.

Description

Web page identification method and device
Technical field
The present invention relates to the technical field of network security, and in particular to a web page identification method and device.
Background
With the explosive growth of Internet data, phishing websites that steal private information submitted by users, such as account numbers and passwords, are becoming increasingly common. The pages of a phishing website look identical to those of the real site, so as to trick users into entering sensitive personal information. For example, a phishing website uses a prize win as bait, asks visitors to submit private information such as accounts and passwords, and tricks users into filling in identity information, bank account details, and the like. As another example, a phishing website imitates online payment pages such as those of Taobao or the Industrial and Commercial Bank of China to obtain users' bank card information or Alipay accounts by fraud. The frequent appearance of phishing websites seriously endangers the privacy and property of Internet users.
To prevent phishing websites from stealing user information, technicians can set up a blacklist database in a security device. The blacklist database stores the URLs (Uniform Resource Locators) of phishing websites that technicians have collected in advance. When the security device detects that a user terminal is accessing a website, it first compares the URL of the website with the URLs in the blacklist database. If the URL is in the blacklist database, the security device judges that the website is a phishing website and returns warning information to the user terminal, to notify the user that the website is a phishing website. If the URL is not in the blacklist database, the security device judges that the website is not a phishing website, obtains the page data of the website according to the URL, and returns the page data to the user terminal so that the user can access the website.
However, the blacklist database is built manually, the URLs it stores are not comprehensive, and updates lag behind, so the recognition rate for phishing websites is low.
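The blacklist lookup described above can be sketched as follows. This is a minimal illustration, not the prior-art implementation: the function name `check_with_blacklist`, the return labels, and the placeholder URLs are all assumptions made for this example.

```python
# Hypothetical blacklist; the URLs below are placeholders, not real data.
BLACKLIST = {
    "http://phish.example/login",
    "http://fake-bank.example/pay",
}


def check_with_blacklist(url: str, blacklist: set = BLACKLIST) -> str:
    """Return 'warn' if the URL is in the blacklist, otherwise 'forward'."""
    return "warn" if url in blacklist else "forward"
```

An exact-match lookup like this only flags URLs that were collected in advance. A phishing page hosted at a URL that is not yet in the set is forwarded to the user, which is the weakness the background section points out.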
Summary of the invention
The purpose of the embodiments of the present invention is to provide a web page identification method and device to improve the recognition rate of illegal websites. The specific technical solutions are as follows:
In a first aspect, a web page identification method is provided. The method is applied to a security device and includes:
receiving a web page request sent by a user terminal, the web page request carrying the uniform resource locator (URL) of a target web page to be visited;
obtaining the page data of the target web page from the server of the target web page according to the URL of the target web page;
determining the page feature of the target web page according to a preset feature extraction rule and the page data;
judging, according to the page feature and a preset page classification model, whether the target web page is an illegal web page, and, if the target web page is an illegal web page, sending warning information to the user terminal.
In a second aspect, a web page identification device is provided. The device is applied to a security device and includes:
a receiving module, configured to receive a web page request sent by a user terminal, the web page request carrying the uniform resource locator (URL) of a target web page to be visited;
a first acquisition module, configured to obtain the page data of the target web page from the server of the target web page according to the URL of the target web page;
a first determining module, configured to determine the page feature of the target web page according to a preset feature extraction rule and the page data;
a first sending module, configured to judge, according to the page feature and a preset page classification model, whether the target web page is an illegal web page, and, if the target web page is an illegal web page, to send warning information to the user terminal.
In a third aspect, a security device is provided, including a processor and a machine-readable storage medium. The machine-readable storage medium stores machine-executable instructions that can be executed by the processor, and the machine-executable instructions cause the processor to implement the method steps of any one of claims 1-7.
In a fourth aspect, a machine-readable storage medium is provided, storing machine-executable instructions that, when called and executed by a processor, cause the processor to implement the method steps of any one of claims 1-6.
In the embodiments of the present invention, after receiving the web page request sent by the user terminal, the security device obtains the page data of the target web page from the server of the target web page according to the URL carried in the web page request, determines the page feature of the target web page according to the preset feature extraction rule and the page data, and judges whether the target web page is an illegal web page according to the page feature and the preset page classification model. If the target web page is an illegal web page, the security device sends warning information to the user terminal. With this processing, whether the target web page is an illegal page can be judged from its page feature and the page classification model, without manually building a blacklist database. This avoids failing to identify illegal websites because the data in a blacklist database is incomplete, and improves the recognition rate of illegal websites. Of course, a product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a system framework diagram provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a web page identification method provided by an embodiment of the present invention;
Fig. 3 is an example of a web page identification method provided by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of a web page identification device provided by an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of a web page identification device provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a web page identification device provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of a web page identification device provided by an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of a security device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a web page identification method. The method can be applied to a security device that is connected to user terminals and to servers in the network. Fig. 1 shows the system framework provided by an embodiment of the present invention, which includes a security device, multiple user terminals, and multiple servers.
As shown in Fig. 2, the method may include the following steps:
Step 201: receive a web page request sent by a user terminal.
The web page request carries the URL (Uniform Resource Locator) of the target web page to be visited.
In implementation, an application for browsing web pages, such as a browser, can be installed on the user terminal. When a user wants to visit a web page (referred to as the target web page), the user opens the application on the user terminal and then clicks the icon corresponding to the target web page or enters its address. The user terminal receives the corresponding access instruction for the target web page, obtains the preset URL of the target web page, and generates a web page request carrying that URL. The user terminal sends the web page request to the security device. After receiving the web page request, the security device parses it to obtain the URL. In the embodiments of the present invention, the target web page may be an HTML page.
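As an illustration of the parsing step, the sketch below rebuilds the requested URL from a raw request. The source does not specify the request format; this example assumes a plain-text HTTP/1.1 request, and the function name `extract_url` and the default `scheme` argument are hypothetical.

```python
def extract_url(raw_request: str, scheme: str = "http") -> str:
    """Rebuild the requested URL from a raw HTTP/1.1 request (assumed format)."""
    lines = raw_request.split("\r\n")
    # Request line: METHOD SP request-target SP HTTP-version
    _method, target, _version = lines[0].split(" ", 2)
    if target.startswith(("http://", "https://")):
        return target  # absolute-form target already contains the full URL
    host = ""
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
    return f"{scheme}://{host}{target}"
```

For a request whose first line is `GET /index.html HTTP/1.1` and whose `Host` header is `www.example.com`, the function returns `http://www.example.com/index.html`.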
Step 202: according to the URL of the target web page, obtain the page data of the target web page from the server of the target web page.
In implementation, after obtaining the URL of the target web page, the security device can obtain the page data of the target web page according to the URL. Specifically, the security device can generate a new web page request according to the URL and a preset message generation algorithm. The source address of this web page request is the address of the security device, so the security device can act in place of the user terminal and send the web page request to the server corresponding to the target web page. Alternatively, the security device can forward the original web page request to the server corresponding to the target web page according to the URL.
After receiving the web page request, the server responds to it by returning the page data of the target web page. After receiving the page data, the security device does not send it directly to the user terminal. Instead, it performs step 203 to judge whether the target web page is an illegal page.
Step 203: according to a preset feature extraction rule and the page data, determine the page feature of the target web page.
In implementation, a feature extraction rule can be stored in the security device in advance; the rule can be set by technicians. After receiving the page data, the security device determines the page feature of the target web page according to the preset feature extraction rule and the page data.
Optionally, the page feature can be determined as follows: extract the first target parameters from the page data of the target web page, and determine the page feature of the target web page according to the extracted first target parameters.
The first target parameters can be set by technicians based on experience. In the embodiments of the present invention, the first target parameters include at least the page URL of the target page, a preset field in a form, title information, web page element information, and a preset number of URLs contained in the page data of the target web page. The page URL of the target page is the URL of the target web page and can be denoted Page_URL. The title information is the <title> in the page data and can be denoted Page_Title. The web page element information is the <meta name> in the page data and can be denoted Page_Meta. The preset number of URLs contained in the page data of the target web page are URLs that appear in that page data; they can be denoted Content_URL_i (i = 1, ..., n), where n is the preset number, and [Content_URL_1, Content_URL_2, ..., Content_URL_n] is denoted Content_URL. The preset field in the form can be the content of the action field of a form in the target web page and can be denoted Page_Action. In addition, the first target parameters may include the JS scripts loaded in the <head> section, denoted Head_Js, and a paragraph selected at random from the page data, denoted Body_Segment. The first target parameters may also include other parameters in the page data, which is not limited in the embodiments of the present invention.
In implementation, page data usually contains many parameters. Technicians can configure in advance, in the security device, the parameters to be extracted (that is, the first target parameters). After receiving the page data, the security device can search the page data for the first target parameters according to a preset field extraction algorithm, extract them, and then determine the page feature of the target web page according to the extracted first target parameters. In one implementation, the security device can use the extracted first target parameters as the page feature of the target web page. In practice, the page data of the target web page may contain more URLs than the preset number. In that case, the security device can extract the first preset number of URLs in the order in which they appear in the page data.
For example, after extracting the first target parameters from the target web page, the security device can obtain a feature vector a and use it as the page feature of the target web page, where:
a = [Page_URL, Page_Action, Page_Title, Page_Meta, Content_URL, Head_Js, Body_Segment].
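A minimal sketch of how such a feature vector might be extracted from HTML page data, using Python's standard `html.parser`. The patent does not disclose its field extraction algorithm, so the class `FirstTargetParams`, the function `extract_feature_vector`, and the choice of tags are assumptions. The randomly chosen Body_Segment is left out so that the result is deterministic.

```python
from html.parser import HTMLParser


class FirstTargetParams(HTMLParser):
    """Collects the first target parameters named in the text (hypothetical helper)."""

    def __init__(self, page_url, n):
        super().__init__()
        self.n = n  # preset number of URLs to keep
        self.features = {"Page_URL": page_url, "Page_Action": "", "Page_Title": "",
                         "Page_Meta": [], "Content_URL": [], "Head_Js": []}
        self._in_title = False
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        f = self.features
        if tag == "head":
            self._in_head = True
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            f["Page_Meta"].append((attrs["name"], attrs.get("content") or ""))
        elif tag == "form" and not f["Page_Action"]:
            f["Page_Action"] = attrs.get("action") or ""
        elif tag == "a" and attrs.get("href") and len(f["Content_URL"]) < self.n:
            f["Content_URL"].append(attrs["href"])  # first n URLs, in page order
        elif tag == "script" and self._in_head and attrs.get("src"):
            f["Head_Js"].append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.features["Page_Title"] += data.strip()


def extract_feature_vector(page_url, html, n=5):
    """Return [Page_URL, Page_Action, Page_Title, Page_Meta, Content_URL, Head_Js]."""
    parser = FirstTargetParams(page_url, n)
    parser.feed(html)
    parser.close()
    f = parser.features
    return [f["Page_URL"], f["Page_Action"], f["Page_Title"],
            f["Page_Meta"], f["Content_URL"], f["Head_Js"]]
```

Only anchor `href` values are treated as contained URLs here, and only the first form's `action` is kept. Both are simplifications chosen for this example.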
Optionally, the security device can also determine the page feature of the target web page in combination with the page data of the URLs contained in the target web page. The specific process can be as follows: obtain the page data of the linked web pages corresponding to the preset number of URLs; for each linked web page, extract the second target parameters from the page data of that linked web page; and use the first target parameters together with the second target parameters extracted from the preset number of linked web pages as the page feature of the target web page.
The second target parameters include at least the page URL of the linked web page, a preset field in a form, title information, web page element information, and a preset number of URLs contained in the page data of the linked web page. The page URL of the linked web page is the URL of the linked web page and can be denoted A-Page_URL. The title information is the <title> in the page data and can be denoted A-Page_Title. The web page element information is the <meta name> in the page data and can be denoted A-Page_Meta. The preset number of URLs contained in the page data of the linked web page are URLs that appear in that page data; they can be denoted A-Content_URL_i (i = 1, ..., n), where n is the preset number, and [A-Content_URL_1, A-Content_URL_2, ..., A-Content_URL_n] is denoted A-Content_URL. The preset field in the form can be the content of the action field of a form in the linked web page and can be denoted A-Page_Action. In addition, the second target parameters may include the JS scripts loaded in the <head> section, denoted A-Head_Js, and a paragraph selected at random from the page data of the linked web page, denoted A-Body_Segment. The second target parameters may also include other parameters in the page data, which is not limited in the embodiments of the present invention.
In implementation, a target web page usually contains entry points to multiple linked web pages. A user can click the icon corresponding to a linked web page in the target web page to visit it. Accordingly, the page data of the target web page can contain the URLs of the linked web pages.
After extracting the preset number of URLs contained in the page data of the target web page, the security device can, for each URL, generate a web page request containing that URL according to the URL and the preset message generation algorithm, and send the web page request to the server. After receiving the web page request, the server responds to it by returning the page data of the linked web page corresponding to the URL.
For each linked web page, after receiving its page data, the security device can extract the second target parameters from that page data and use the extracted second target parameters (that is, a feature vector) as the page feature of the linked web page. The specific process is as described above and is not repeated here. In this way, the security device obtains the feature vector of the target web page and the feature vectors of the preset number of linked web pages contained in the target web page, and thus obtains a feature matrix. For example, if the preset number is 5, the security device obtains the feature matrix A = [a1, a2, a3, a4, a5, a6], where a1 is the feature vector of the target web page and a2 to a6 are the feature vectors of the 5 linked web pages. The security device can use this feature matrix as the page feature of the target web page.
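The assembly of the feature matrix can be sketched as below. The function `build_feature_matrix` and its parameters are hypothetical: `fetch` stands in for the request the security device sends for each linked URL, and `extract` stands in for the per-page feature extraction. Both are passed in as callables so the sketch makes no assumption about how they are implemented.

```python
def build_feature_matrix(target_url, target_html, content_urls, fetch, extract):
    """Stack the target page's feature vector with those of its linked pages.

    fetch(url) -> page data of the linked page (assumed callable)
    extract(url, page_data) -> feature vector (assumed callable)
    """
    matrix = [extract(target_url, target_html)]      # a1: the target web page
    for url in content_urls:                         # a2 onward: linked web pages
        matrix.append(extract(url, fetch(url)))
    return matrix
```

With a preset number of 5 linked URLs, the returned list has 6 rows, matching A = [a1, ..., a6] in the example above.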
Optionally, the security device can first judge whether the page data of the target web page is normal page data and only then determine the page feature. The specific process is: judge whether the validation value contained in the page data is the validation value corresponding to preset normal page data; if it is, perform the step of determining the page feature of the target web page according to the preset feature extraction rule and the page data; if it is not, send an error notification message to the user terminal.
In implementation, the page data returned by the server may be abnormal, for example because of a server failure. Therefore, page data usually has a preset validation field that stores the validation value of the page data, and the security device can use the validation value to identify whether the received page data is normal page data.
After receiving the page data, the security device can parse the validation field in the page data to obtain the validation value. If the validation value is the validation value corresponding to preset normal page data, the page data is normal page data; the security device can then cache the page data and perform the step of determining the page feature of the target web page according to the preset feature extraction rule and the page data. If the validation value is not the validation value corresponding to preset normal page data, the page data is abnormal data (for example, garbled or invalid data); the security device sends an error notification message to the user terminal and performs no further processing. After receiving the error notification message, the user terminal displays the corresponding error prompt, such as "The page cannot be accessed" or "The page you requested does not exist". For example, if the validation value is 200, the page data is judged to be normal page data; if the validation value is 404, the page data is judged to be abnormal data.
In this way, the security device performs step 203 only when it judges that the page data is normal page data. If the page data is abnormal data, the security device does not process it further, which saves the processing resources of the security device.
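The validation check reduces to a small gate in front of feature extraction. The sketch below uses the values from the example in the text (200 as normal); the constant name, the function name `handle_page_data`, and the returned labels are illustrative assumptions.

```python
NORMAL_VALIDATION_VALUE = 200  # the value the example above treats as normal


def handle_page_data(validation_value: int) -> str:
    """Decide the next action from the validation value in the page data."""
    if validation_value == NORMAL_VALIDATION_VALUE:
        return "extract_features"          # continue with step 203
    return "send_error_notification"       # abnormal data: notify the user terminal
```

Any value other than the preset normal one, including the 404 from the example, leads to the error notification and skips feature extraction.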
Step 204: according to the page feature and a preset page classification model, judge whether the target web page is an illegal web page, and, if the target web page is an illegal web page, send warning information to the user terminal.
In implementation, the security device can input the page feature of the target web page into a page classification model trained in advance. The page classification model outputs the classification result for the target web page, which indicates whether the target web page is an illegal web page. If the classification result indicates that the target web page is an illegal web page, the security device can send warning information to the user terminal to notify the user that the target web page is an illegal web page. This prevents the user from accessing the target web page and improves the security of user information.
If the classification result indicates that the target web page is a legal web page, the security device can send the page data of the target web page to the user terminal so that the user terminal displays the target web page according to the page data.
It should be noted that, after receiving the page data, the security device can first extract basic display data from the page data and send it to the user terminal, where the basic display data is the display data corresponding to the basic framework of the web page. The user terminal can then display the basic framework of the target web page based on the basic display data, so that the target web page appears to be loading, which improves the user experience. After judging that the target web page is a legal web page, the security device sends the specific content data of the target web page to the user terminal so that the user terminal displays the complete target web page.
Optionally, this embodiment also provides a training method for the page classification model. The specific process is as follows: the security device obtains multiple pre-stored training samples, which include page features of illegal web pages and page features of legal web pages; based on the training samples, a preset initial neural network model is trained to obtain the page classification model.
In implementation, an initial neural network model can be built in the security device. The model can be a BP (back propagation) neural network or a convolutional neural network. Technicians can initialize the initial neural network model, for example by setting the input-layer-to-hidden-layer weights ω_mn, the hidden-layer-to-output-layer weights v_lk, and the error threshold Δ_t. Multiple training samples can also be stored in the security device; the training samples include page features of illegal web pages and page features of legal web pages. Specifically, technicians can collect in advance the page data of multiple illegal web pages and of the same number of legal web pages, and have the security device determine the page feature of each web page. The process of extracting page features is as described above and is not repeated here.
The security device can train the preset initial neural network model based on the training samples to obtain the page classification model. The specific process is: input a training sample into the initial neural network model, which outputs the test feature vector corresponding to the training sample; then, using the back-propagation algorithm and the test feature vector corresponding to the training sample, adjust the model parameters of the initial neural network model to obtain the page classification model. The model parameters include at least the weights from the input layer to the hidden layer and the weights from the hidden layer to the output layer.
In implementation, the security device can input the training samples into the initial neural network model one at a time. For each training sample, the initial neural network model calculates the corresponding test feature vector according to a preset neural network algorithm, calculates the error between the test feature vector and a preset expected feature vector, and judges whether the error is less than the preset error threshold. If the error is greater than or equal to the preset error threshold, the input-layer-to-hidden-layer weights ω_mn and the hidden-layer-to-output-layer weights v_lk are adjusted according to the preset back-propagation algorithm; the security device then inputs the next training sample and repeats the above processing until the determined error is less than the preset error threshold. When the determined error is less than the preset error threshold, the security device can store the network's current input-layer-to-hidden-layer weights ω_mn, hidden-layer-to-output-layer weights v_lk, and error threshold Δ_t to determine the page classification model.
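A minimal sketch of this kind of training loop, written in plain Python for a network with one hidden layer and a single output. Several things here are assumptions that the source does not state: page features are taken to be already encoded as numeric vectors, the activation is a sigmoid, the error is half the squared difference, the output is a single value rather than a vector, and training stops once every sample in a pass is below the threshold (one reading of the stopping rule above). The names `train`, `forward`, `lr`, and `max_epochs` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w, v, x):
    """Return the hidden activations and the scalar output for input x."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    y = sigmoid(sum(vk * hk for vk, hk in zip(v, h)))
    return h, y


def train(samples, n_hidden=4, lr=0.5, err_threshold=0.01, max_epochs=5000, seed=0):
    """samples: list of (numeric feature vector, expected output in [0, 1])."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # w plays the role of the input-to-hidden weights, v of the hidden-to-output weights.
    w = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    v = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    for _ in range(max_epochs):
        worst = 0.0
        for x, target in samples:
            h, y = forward(w, v, x)
            err = 0.5 * (target - y) ** 2
            worst = max(worst, err)
            if err >= err_threshold:
                # Back-propagate: output delta first, then each hidden unit.
                delta_out = (y - target) * y * (1 - y)
                for k in range(n_hidden):
                    delta_h = delta_out * v[k] * h[k] * (1 - h[k])
                    v[k] -= lr * delta_out * h[k]
                    for m in range(n_in):
                        w[k][m] -= lr * delta_h * x[m]
        if worst < err_threshold:
            break  # every sample in this pass is below the threshold
    return w, v
```

The sketch has no bias terms, so a constant 1.0 can be included as the first input feature if an offset is needed. Whether training reaches the threshold within `max_epochs` depends on the data and the initial weights.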
Fig. 3 shows an example of the web page identification method provided by an embodiment of the present invention. The specific process is:
Step 301: receive a web page request sent by a user terminal.
Step 302: according to the URL of the target web page, obtain the page data of the target web page from the server of the target web page.
Step 303: judge whether the validation value contained in the page data is the validation value corresponding to preset normal page data. If not, perform step 304; otherwise, perform steps 305 and 306.
Step 304: send an error notification message to the user terminal.
Step 305: according to the preset feature extraction rule and the page data, determine the page feature of the target web page.
Step 306: according to the page feature and the preset page classification model, judge whether the target web page is a legal web page. If not, perform step 307; otherwise, perform step 308.
Step 307: send warning information to the user terminal.
Step 308: send the page data of the target web page to the user terminal, so that the user terminal displays the target web page according to the page data.
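Steps 301 to 308 can be sketched as a single function. Everything named here is hypothetical: `identify` and its three callables stand in for the security device's fetching, feature extraction, and classification logic, and the validation value 200 is taken from the earlier example.

```python
def identify(request_url, fetch, extract_features, classify):
    """Run steps 302-308 for a URL already parsed from the request (step 301).

    fetch(url) -> (validation_value, page_data)     (assumed callable)
    extract_features(url, page_data) -> features    (assumed callable)
    classify(features) -> True if the page is legal (assumed callable)
    """
    validation_value, page_data = fetch(request_url)        # step 302
    if validation_value != 200:                              # step 303
        return ("error", None)                               # step 304
    features = extract_features(request_url, page_data)      # step 305
    if not classify(features):                                # step 306
        return ("warning", None)                              # step 307
    return ("page", page_data)                               # step 308
```

The returned tag tells the caller which message to send to the user terminal. The page data is passed back only when the page is classified as legal.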
In the embodiments of the present invention, after receiving the web page request sent by the user terminal, the security device obtains the page data of the target web page from the server of the target web page according to the URL carried in the web page request, determines the page feature of the target web page according to the preset feature extraction rule and the page data, and judges whether the target web page is an illegal web page according to the page feature and the preset page classification model. If the target web page is an illegal web page, the security device sends warning information to the user terminal. With this processing, whether the target web page is an illegal page can be judged from its page feature and the page classification model, without manually building a blacklist database. This avoids failing to identify illegal websites because the data in a blacklist database is incomplete, and improves the recognition rate of illegal websites.
Based on the same technical concept, as shown in Fig. 4, an embodiment of the present invention also provides a web page identification device. The device is applied to a security device and includes:
a receiving module 410, configured to receive a web page request sent by a user terminal, the web page request carrying the URL of a target web page to be visited;
a first acquisition module 420, configured to obtain the page data of the target web page from the server of the target web page according to the URL of the target web page;
a first determining module 430, configured to determine the page feature of the target web page according to a preset feature extraction rule and the page data;
a first sending module 440, configured to judge, according to the page feature and a preset page classification model, whether the target web page is an illegal web page, and, if the target web page is an illegal web page, to send warning information to the user terminal.
Optionally, as shown in Fig. 5, the device further includes:
a second sending module 450, configured to send the page data of the target web page to the user terminal if the target web page is not an illegal web page, so that the user terminal displays the target web page according to the page data.
Optionally, the first determining module 430 is specifically configured to:
Extract first target parameters from the page data of the target webpage, where the first target parameters include: the page URL of the target webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the target webpage;
Determine the page feature of the target webpage according to the extracted first target parameters.
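As a rough illustration of this extraction step, the following Python sketch collects comparable parameters from HTML page data using the standard-library `html.parser` module. The class and function names, the choice of `username` and `password` as preset fields, the limit of three URLs, and the use of per-tag counts as "web page element information" are all assumptions made for the example; the embodiment does not fix these details. For brevity the sketch also does not check that an `input` element actually sits inside a form.

```python
from html.parser import HTMLParser
from typing import Dict, List, Sequence


class FirstTargetParameterParser(HTMLParser):
    """Collects title text, preset form fields, tag counts and the first few links."""

    def __init__(self, preset_fields: Sequence[str], max_urls: int) -> None:
        super().__init__()
        self.preset_fields = set(preset_fields)
        self.max_urls = max_urls
        self.title = ""
        self.form_fields: List[str] = []
        self.element_counts: Dict[str, int] = {}
        self.urls: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Count every element type as a simple form of element information.
        self.element_counts[tag] = self.element_counts.get(tag, 0) + 1
        if tag == "title":
            self._in_title = True
        elif tag == "input" and attr.get("name") in self.preset_fields:
            self.form_fields.append(attr["name"])
        elif tag == "a" and attr.get("href") and len(self.urls) < self.max_urls:
            self.urls.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def extract_first_target_parameters(
    page_url: str,
    page_data: str,
    preset_fields: Sequence[str] = ("username", "password"),  # assumed preset fields
    max_urls: int = 3,  # assumed preset number of URLs
) -> Dict[str, object]:
    parser = FirstTargetParameterParser(preset_fields, max_urls)
    parser.feed(page_data)
    parser.close()
    return {
        "page_url": page_url,
        "form_fields": parser.form_fields,
        "title": parser.title,
        "element_counts": parser.element_counts,
        "urls": parser.urls,
    }
```

A real feature extraction rule would still have to turn these parameters into a numeric vector before they can be fed to a classification model; that encoding is left out here.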
Optionally, the first determining module 430 is specifically configured to:
Obtain the page data of the linked webpages corresponding to the preset number of URLs;
For each linked webpage, extract second target parameters from the page data of the linked webpage, where the second target parameters include: the page URL of the linked webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the linked webpage;
Use the first target parameters, together with the second target parameters extracted from the preset number of linked webpages, as the page feature of the target webpage.
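The combination of first and second target parameters can be sketched as follows. This is a hedged illustration only: `build_page_feature`, `extract_parameters` and `fetch_page` are hypothetical names, the parameter extractor is assumed to return a dictionary with a `"urls"` entry, and relative links are resolved with the standard-library `urllib.parse.urljoin`. Only one level of links is followed, as in the steps above.

```python
from typing import Any, Callable, Dict, List
from urllib.parse import urljoin

Params = Dict[str, Any]


def build_page_feature(
    page_url: str,
    page_data: str,
    extract_parameters: Callable[[str, str], Params],
    fetch_page: Callable[[str], str],
) -> List[Params]:
    """Return the first target parameters followed by one entry per linked page."""
    # First target parameters: extracted from the target webpage itself.
    first = extract_parameters(page_url, page_data)
    feature = [first]
    # Second target parameters: one set per URL found in the target webpage.
    for link in first["urls"]:
        linked_url = urljoin(page_url, link)  # resolve relative links
        linked_data = fetch_page(linked_url)
        feature.append(extract_parameters(linked_url, linked_data))
    return feature
```

Including the linked pages gives the classifier some context about where the target webpage leads, at the cost of one extra fetch per URL.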
Optionally, the page data includes a validation value and, as shown in Fig. 6, the device further includes:
Judgment module 460, configured to judge whether the validation value included in the page data is the validation value corresponding to preset normal page data;
Second determining module 470, configured to trigger the first determining module 430, if the validation value is the validation value corresponding to the preset normal page data, to execute the step of determining the page feature of the target webpage according to the preset feature extraction rule and the page data;
Third sending module 480, configured to send error prompt information to the user terminal if the validation value is not the validation value corresponding to the preset normal page data.
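The gating performed by these three modules can be sketched in a few lines of Python. The concrete form of the validation value is not restated here, so the sketch assumes, purely for illustration, that it is a status-like integer and that `200` is the value corresponding to normal page data; `check_then_extract` and `NORMAL_VALIDATION_VALUES` are hypothetical names.

```python
from typing import Callable, FrozenSet, List, Tuple, Union

# Assumption for illustration only: 200 stands for "normal page data".
NORMAL_VALIDATION_VALUES: FrozenSet[int] = frozenset({200})


def check_then_extract(
    validation_value: int,
    page_data: str,
    extract_features: Callable[[str], List[float]],
    normal_values: FrozenSet[int] = NORMAL_VALIDATION_VALUES,
) -> Tuple[str, Union[List[float], str]]:
    """Extract features only when the validation value marks normal page data."""
    if validation_value in normal_values:
        # Normal page data: continue with page feature determination.
        return ("features", extract_features(page_data))
    # Abnormal page data: skip classification and report an error prompt.
    return ("error", f"Page could not be retrieved normally (validation value {validation_value})")
```

Checking the validation value first keeps abnormal responses from being classified at all.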
Optionally, as shown in Fig. 7, the device further includes:
Second acquisition module 490, configured to obtain multiple pre-stored training samples, where the training samples include page features of illegal webpages and page features of legitimate webpages;
Training module 4100, configured to train a preset initial neural network model based on the multiple training samples to obtain the page classification model.
Optionally, the training module 4100 is specifically configured to:
Input the training samples into the initial neural network model, and output the test feature vector corresponding to each training sample;
Adjust, by the back-propagation algorithm and using the test feature vector corresponding to each training sample, the model parameters of the initial neural network model to obtain the page classification model, where the model parameters include at least the input-layer-to-hidden-layer weights and the hidden-layer-to-output-layer weights.
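The training step can be illustrated with a small pure-Python network that has exactly the two groups of weights named above. This is a sketch under assumptions, not the model of the embodiment: the class name `TinyPageClassifier`, the sigmoid activations, the squared-error loss, the learning rate and the single scalar output (standing in for the test feature vector) are all choices made for the example.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyPageClassifier:
    """One hidden layer; the last weight in each row acts as a bias."""

    def __init__(self, n_in: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Input-layer-to-hidden-layer weights.
        self.w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                     for _ in range(n_hidden)]
        # Hidden-layer-to-output-layer weights.
        self.w_ho = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x: Sequence[float]) -> Tuple[List[float], float]:
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_ih]
        out = sigmoid(sum(w * v for w, v in zip(self.w_ho, hidden + [1.0])))
        return hidden, out

    def train_step(self, x: Sequence[float], label: float, lr: float = 0.5) -> float:
        hidden, out = self.forward(x)
        xb = list(x) + [1.0]
        hb = hidden + [1.0]
        # Gradient of 0.5 * (out - label)^2 through the output sigmoid.
        delta_out = (out - label) * out * (1.0 - out)
        # Propagate the error back to the hidden layer before updating w_ho.
        delta_hidden = [delta_out * self.w_ho[j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(len(hidden))]
        for j in range(len(hb)):
            self.w_ho[j] -= lr * delta_out * hb[j]
        for j, row in enumerate(self.w_ih):
            for i in range(len(xb)):
                row[i] -= lr * delta_hidden[j] * xb[i]
        return 0.5 * (out - label) ** 2


def train(model: TinyPageClassifier,
          samples: Sequence[Tuple[Sequence[float], float]],
          epochs: int = 2000, lr: float = 0.5) -> List[float]:
    """Run per-sample gradient descent and return the summed loss of each epoch."""
    losses = []
    for _ in range(epochs):
        losses.append(sum(model.train_step(x, y, lr) for x, y in samples))
    return losses
```

Here a label of 1.0 would mark the page feature of an illegal webpage and 0.0 that of a legitimate webpage; after training, an output above 0.5 can be read as "illegal".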
In this embodiment of the present invention, after receiving a web page request sent by a user terminal, the security device obtains the page data of the target webpage from the server of the target webpage according to the URL carried in the request. It then determines the page feature of the target webpage according to the preset feature extraction rule and the page data, and judges whether the target webpage is an illegal webpage according to the page feature and the preset page classification model. If the target webpage is an illegal webpage, the security device sends warning information to the user terminal. With this processing, whether the target webpage is an illegal page can be judged from its page feature and the page classification model, without manually building a blacklist database. This avoids the case in which illegal websites go unrecognized because the data in the blacklist database is incomplete, and so improves the recognition rate for illegal websites.
An embodiment of the present application further provides a security device which, as shown in Fig. 8, includes a processor 801, a communication interface 802, a memory 803 and a communication bus 804, where the processor 801, the communication interface 802 and the memory 803 communicate with one another through the communication bus 804.
Memory 803, configured to store a computer program;
Processor 801, configured to execute the program stored in the memory 803 so that the security device performs the following steps:
Receive a web page request sent by a user terminal, where the web page request carries the uniform resource locator (URL) of the target webpage to be visited;
Obtain the page data of the target webpage from the server of the target webpage according to the URL of the target webpage;
Determine the page feature of the target webpage according to the preset feature extraction rule and the page data;
Judge, according to the page feature and the preset page classification model, whether the target webpage is an illegal webpage, and send warning information to the user terminal if the target webpage is an illegal webpage.
Optionally, the method further includes:
If the target webpage is not an illegal webpage, sending the page data of the target webpage to the user terminal, so that the user terminal displays the target webpage according to the page data.
Optionally, determining the page feature of the target webpage according to the preset feature extraction rule and the page data includes:
Extracting first target parameters from the page data of the target webpage, where the first target parameters include: the page URL of the target webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the target webpage;
Determining the page feature of the target webpage according to the extracted first target parameters.
Optionally, determining the page feature of the target webpage according to the extracted first target parameters includes:
Obtaining the page data of the linked webpages corresponding to the preset number of URLs;
For each linked webpage, extracting second target parameters from the page data of the linked webpage, where the second target parameters include: the page URL of the linked webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the linked webpage;
Using the first target parameters, together with the second target parameters extracted from the preset number of linked webpages, as the page feature of the target webpage.
Optionally, the page data includes a validation value, and before the page feature of the target webpage is determined according to the preset feature extraction rule and the page data, the method further includes:
Judging whether the validation value included in the page data is the validation value corresponding to preset normal page data;
If the validation value is the validation value corresponding to the preset normal page data, executing the step of determining the page feature of the target webpage according to the preset feature extraction rule and the page data;
If the validation value is not the validation value corresponding to the preset normal page data, sending error prompt information to the user terminal.
Optionally, the method further includes:
Obtaining multiple pre-stored training samples, where the training samples include page features of illegal webpages and page features of legitimate webpages;
Training a preset initial neural network model based on the multiple training samples to obtain the page classification model.
Optionally, training the preset initial neural network model based on the multiple training samples to obtain the page classification model includes:
Inputting the training samples into the initial neural network model, and outputting the test feature vector corresponding to each training sample;
Adjusting, by the back-propagation algorithm and using the test feature vector corresponding to each training sample, the model parameters of the initial neural network model to obtain the page classification model, where the model parameters include at least the input-layer-to-hidden-layer weights and the hidden-layer-to-output-layer weights.
The machine-readable storage medium may include RAM (Random Access Memory) and may also include NVM (Non-Volatile Memory), for example at least one magnetic disk memory. In addition, the machine-readable storage medium may be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor) and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In this embodiment of the present invention, after receiving a web page request sent by a user terminal, the security device obtains the page data of the target webpage from the server of the target webpage according to the URL carried in the request. It then determines the page feature of the target webpage according to the preset feature extraction rule and the page data, and judges whether the target webpage is an illegal webpage according to the page feature and the preset page classification model. If the target webpage is an illegal webpage, the security device sends warning information to the user terminal. With this processing, whether the target webpage is an illegal page can be judged from its page feature and the page classification model, without manually building a blacklist database. This avoids the case in which illegal websites go unrecognized because the data in the blacklist database is incomplete, and so improves the recognition rate for illegal websites.
It should be noted that herein, relational terms such as first and second and the like are used merely to a reality Body or operation are distinguished with another entity or operation, are deposited without necessarily requiring or implying between these entities or operation In any actual relationship or order or sequence.Moreover, the terms "include", "comprise" or its any other variant are intended to Non-exclusive inclusion, so that the process, method, article or equipment including a series of elements is not only wanted including those Element, but also include other elements that are not explicitly listed, or further include for this process, method, article or equipment Intrinsic element.In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that There is also other identical elements in process, method, article or equipment including the element.
The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, see the description of the method embodiment.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims (16)

1. A webpage identification method, characterized in that the method is applied to a security device and comprises:
Receiving a web page request sent by a user terminal, where the web page request carries the uniform resource locator (URL) of the target webpage to be visited;
Obtaining the page data of the target webpage from the server of the target webpage according to the URL of the target webpage;
Determining the page feature of the target webpage according to a preset feature extraction rule and the page data;
Judging, according to the page feature and a preset page classification model, whether the target webpage is an illegal webpage, and sending warning information to the user terminal if the target webpage is an illegal webpage.
2. The method according to claim 1, characterized in that the method further comprises:
If the target webpage is not an illegal webpage, sending the page data of the target webpage to the user terminal, so that the user terminal displays the target webpage according to the page data.
3. The method according to claim 1, characterized in that determining the page feature of the target webpage according to the preset feature extraction rule and the page data comprises:
Extracting first target parameters from the page data of the target webpage, where the first target parameters include: the page URL of the target webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the target webpage;
Determining the page feature of the target webpage according to the extracted first target parameters.
4. The method according to claim 3, characterized in that determining the page feature of the target webpage according to the extracted first target parameters comprises:
Obtaining the page data of the linked webpages corresponding to the preset number of URLs;
For each linked webpage, extracting second target parameters from the page data of the linked webpage, where the second target parameters include: the page URL of the linked webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the linked webpage;
Using the first target parameters, together with the second target parameters extracted from the preset number of linked webpages, as the page feature of the target webpage.
5. The method according to claim 1, characterized in that the page data includes a validation value, and before the page feature of the target webpage is determined according to the preset feature extraction rule and the page data, the method further comprises:
Judging whether the validation value included in the page data is the validation value corresponding to preset normal page data;
If the validation value is the validation value corresponding to the preset normal page data, executing the step of determining the page feature of the target webpage according to the preset feature extraction rule and the page data;
If the validation value is not the validation value corresponding to the preset normal page data, sending error prompt information to the user terminal.
6. The method according to claim 1, characterized in that the method further comprises:
Obtaining multiple pre-stored training samples, where the training samples include page features of illegal webpages and page features of legitimate webpages;
Training a preset initial neural network model based on the multiple training samples to obtain the page classification model.
7. The method according to claim 6, characterized in that training the preset initial neural network model based on the multiple training samples to obtain the page classification model comprises:
Inputting the training samples into the initial neural network model, and outputting the test feature vector corresponding to each training sample;
Adjusting, by the back-propagation algorithm and using the test feature vector corresponding to each training sample, the model parameters of the initial neural network model to obtain the page classification model, where the model parameters include at least the input-layer-to-hidden-layer weights and the hidden-layer-to-output-layer weights.
8. A webpage identification device, characterized in that the device is applied to a security device and comprises:
A receiving module, configured to receive a web page request sent by a user terminal, where the web page request carries the uniform resource locator (URL) of the target webpage to be visited;
A first acquisition module, configured to obtain the page data of the target webpage from the server of the target webpage according to the URL of the target webpage;
A first determining module, configured to determine the page feature of the target webpage according to a preset feature extraction rule and the page data;
A first sending module, configured to judge, according to the page feature and a preset page classification model, whether the target webpage is an illegal webpage, and to send warning information to the user terminal if the target webpage is an illegal webpage.
9. The device according to claim 8, characterized in that the device further comprises:
A second sending module, configured to send the page data of the target webpage to the user terminal if the target webpage is not an illegal webpage, so that the user terminal displays the target webpage according to the page data.
10. The device according to claim 8, characterized in that the first determining module is specifically configured to:
Extract first target parameters from the page data of the target webpage, where the first target parameters include: the page URL of the target webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the target webpage;
Determine the page feature of the target webpage according to the extracted first target parameters.
11. The device according to claim 10, characterized in that the first determining module is specifically configured to:
Obtain the page data of the linked webpages corresponding to the preset number of URLs;
For each linked webpage, extract second target parameters from the page data of the linked webpage, where the second target parameters include: the page URL of the linked webpage, preset fields in forms, title information, web page element information, and a preset number of URLs contained in the page data of the linked webpage;
Use the first target parameters, together with the second target parameters extracted from the preset number of linked webpages, as the page feature of the target webpage.
12. The device according to claim 8, characterized in that the page data includes a validation value and the device further comprises:
A judgment module, configured to judge whether the validation value included in the page data is the validation value corresponding to preset normal page data;
A second determining module, configured to trigger the first determining module, if the validation value is the validation value corresponding to the preset normal page data, to execute the step of determining the page feature of the target webpage according to the preset feature extraction rule and the page data;
A third sending module, configured to send error prompt information to the user terminal if the validation value is not the validation value corresponding to the preset normal page data.
13. The device according to claim 8, characterized in that the device further comprises:
A second acquisition module, configured to obtain multiple pre-stored training samples, where the training samples include page features of illegal webpages and page features of legitimate webpages;
A training module, configured to train a preset initial neural network model based on the multiple training samples to obtain the page classification model.
14. The device according to claim 13, characterized in that the training module is specifically configured to:
Input the training samples into the initial neural network model, and output the test feature vector corresponding to each training sample;
Adjust, by the back-propagation algorithm and using the test feature vector corresponding to each training sample, the model parameters of the initial neural network model to obtain the page classification model, where the model parameters include at least the input-layer-to-hidden-layer weights and the hidden-layer-to-output-layer weights.
15. A security device, characterized by comprising a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions executable by the processor, and the machine-executable instructions cause the processor to implement the method steps of any one of claims 1-6.
16. A machine-readable storage medium, characterized in that it stores machine-executable instructions which, when called and executed by a processor, cause the processor to implement the method steps of any one of claims 1-6.
CN201810468614.0A 2018-05-16 2018-05-16 Webpage identification method and device Active CN108683666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810468614.0A CN108683666B (en) 2018-05-16 2018-05-16 Webpage identification method and device

Publications (2)

Publication Number Publication Date
CN108683666A true CN108683666A (en) 2018-10-19
CN108683666B CN108683666B (en) 2021-04-16

Family

ID=63806629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810468614.0A Active CN108683666B (en) 2018-05-16 2018-05-16 Webpage identification method and device

Country Status (1)

Country Link
CN (1) CN108683666B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950337A (en) * 2010-09-08 2011-01-19 乔永清 System and method for monitoring website truthful data
CN102592067A (en) * 2011-01-17 2012-07-18 腾讯科技(深圳)有限公司 Webpage recognition method, device and system
CN102571768A (en) * 2011-12-26 2012-07-11 北京大学 Detection method for phishing site
CN102663291A (en) * 2012-03-23 2012-09-12 奇智软件(北京)有限公司 Information prompting method and information prompting device for e-mails
CN102622553A (en) * 2012-04-24 2012-08-01 腾讯科技(深圳)有限公司 Method and device for detecting webpage safety
CN102819591A (en) * 2012-08-07 2012-12-12 北京网康科技有限公司 Content-based web page classification method and system
CN102932348A (en) * 2012-10-30 2013-02-13 常州大学 Real-time detection method and system of phishing website
CN103810193A (en) * 2012-11-08 2014-05-21 北京金山安全软件有限公司 Webpage element shielding method and device
CN103501306A (en) * 2013-10-23 2014-01-08 腾讯科技(武汉)有限公司 Web site identification method, server and system
CN103685308A (en) * 2013-12-25 2014-03-26 北京奇虎科技有限公司 Detection method and system of phishing web pages, client and server
CN104809125A (en) * 2014-01-24 2015-07-29 腾讯科技(深圳)有限公司 Method and device for identifying webpage categories
CN107018152A (en) * 2017-05-27 2017-08-04 北京奇虎科技有限公司 Message block method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
沙泓州: "面向大规模网络流量的URL实时分类关键技术研究", 《博士学位论文数据库,信息科技辑》 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059272A (en) * 2018-11-02 2019-07-26 阿里巴巴集团控股有限公司 A kind of page feature recognition methods and device
CN110059272B (en) * 2018-11-02 2023-08-15 创新先进技术有限公司 Page feature recognition method and device
CN111488452A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Webpage tampering detection method, detection system and related equipment
CN110401660A (en) * 2019-07-26 2019-11-01 秒针信息技术有限公司 Recognition methods, device, processing equipment and the storage medium of false flow
CN110401660B (en) * 2019-07-26 2022-03-01 秒针信息技术有限公司 False flow identification method and device, processing equipment and storage medium
CN110704867A (en) * 2019-09-06 2020-01-17 翼集分电子商务(上海)有限公司 Method, system, medium and apparatus for integral theft prevention
CN110704867B (en) * 2019-09-06 2023-06-16 翼集分(上海)数字科技有限公司 Integral anti-theft method, system, medium and device
CN110674442A (en) * 2019-09-17 2020-01-10 中国银联股份有限公司 Page monitoring method, device, equipment and computer readable storage medium
CN110674442B (en) * 2019-09-17 2023-08-18 中国银联股份有限公司 Page monitoring method, device, equipment and computer readable storage medium
CN112711723B (en) * 2019-10-25 2024-04-30 北京搜狗科技发展有限公司 Malicious website detection method and device and electronic equipment
CN112711723A (en) * 2019-10-25 2021-04-27 北京搜狗科技发展有限公司 Malicious website detection method and device and electronic equipment
CN111104618A (en) * 2019-12-19 2020-05-05 秒针信息技术有限公司 Webpage skipping method and device
CN111385293B (en) * 2020-03-04 2021-06-22 腾讯科技(深圳)有限公司 Network risk detection method and device
CN111385293A (en) * 2020-03-04 2020-07-07 腾讯科技(深圳)有限公司 Network risk detection method and device
CN111556065A (en) * 2020-05-08 2020-08-18 鹏城实验室 Phishing website detection method and device and computer readable storage medium
CN113709094A (en) * 2020-05-22 2021-11-26 辉达公司 User-perceptible marking for network address identifiers
WO2021253252A1 (en) * 2020-06-17 2021-12-23 深圳市欢太数字科技有限公司 Method and apparatus for testing webpage, and electronic device and storage medium
CN112347402A (en) * 2020-10-21 2021-02-09 上海淇玥信息技术有限公司 Illegal website/APP automatic identification method, system and electronic device
CN112866279A (en) * 2021-02-03 2021-05-28 恒安嘉新(北京)科技股份公司 Webpage security detection method, device, equipment and medium
CN112866279B (en) * 2021-02-03 2022-12-09 恒安嘉新(北京)科技股份公司 Webpage security detection method, device, equipment and medium
CN112989341A (en) * 2021-03-03 2021-06-18 中国信息通信研究院 Method, system and medium for determining fraud-related webpage
CN113221035A (en) * 2021-05-13 2021-08-06 北京百度网讯科技有限公司 Method, apparatus, device, medium, and program product for determining an abnormal web page
CN114463730A (en) * 2021-07-15 2022-05-10 荣耀终端有限公司 Page identification method and terminal equipment
CN114465811B (en) * 2022-03-09 2023-05-23 北京华云安信息技术有限公司 Website login determination method and device, electronic equipment and storage medium
CN114465811A (en) * 2022-03-09 2022-05-10 北京华云安信息技术有限公司 Website login determination method and device, electronic equipment and storage medium
CN114760124B (en) * 2022-04-07 2022-10-04 呀邦管理科技(北京)有限责任公司 Big data based computer network security intelligent analysis system and method
CN114760124A (en) * 2022-04-07 2022-07-15 黑龙江省敏动传感科技有限公司 Big data based computer network security intelligent analysis system and method

Also Published As

Publication number Publication date
CN108683666B (en) 2021-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant