CN107862050A

CN107862050A - Website content security detection system and method

Info

Publication number: CN107862050A
Application number: CN201711090519.3A
Authority: CN
Inventors: 王电钢; 龚艳; 母继元; 毛启均; 常健
Original assignee: State Grid Sichuan Electric Power Co Ltd
Current assignee: State Grid Sichuan Electric Power Co Ltd; Information and Telecommunication Branch of State Grid Sichuan Electric Power Co Ltd
Priority date: 2017-11-08
Filing date: 2017-11-08
Publication date: 2018-03-30

Abstract

The invention discloses a website content security detection system and method. The system includes: a front-end request module, which takes the URL to be detected as input and submits the request to a crawler module; the crawler module, which crawls the image information of the target URL; a feature extraction module, which extracts feature vectors from the images obtained by the crawler module and from the images of a sample picture module; a model trainer, which generates a classifier from the feature vectors of the sample pictures by supervised learning; an FPGA hardware accelerator, which provides hardware acceleration to the feature extraction module; and a security arbitration module, which calculates the safety coefficient of the target URL according to the classifier's classification of the image features. On this basis, the invention obtains the classifier by using sample image features as the input of the model trainer, and accelerates the feature extraction algorithm with the FPGA hardware accelerator to improve system response speed, thereby achieving fast, efficient and accurate website content security detection.

Description

Website content security detection system and method

Technical field

The present invention relates to the technical field of network security, and in particular to a website content security detection system and method.

Background art

With the development of Internet technology, web applications have brought great convenience to people's lives and greatly enriched the ways in which information spreads. However, some criminals seek profit by creating phishing, gambling and pornographic websites, which poses a serious threat to safe and healthy Internet use. The detection of malicious websites has therefore become a serious network security problem.

At present, malicious web pages are detected mainly by two methods: static feature detection and dynamic feature detection. Static feature detection analyzes a page's DNS information, WHOIS information, URL syntax features, HTML content, JavaScript code and so on; dynamic feature detection analyzes link redirection relationships, browser behavior, registry changes and so on. Classifying web pages with machine learning supplements both approaches. In addition, detecting malicious web pages with honeypot techniques is also a relatively mature practice.

In the paper "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs", Justin et al. use machine learning to identify malicious URLs from DNS information, WHOIS information and URL syntax features. This approach has the following shortcomings: (1) some malicious URLs show no explicit malicious features in their syntax or WHOIS registration information and closely resemble normal URLs, so the false-positive rate is high; (2) it does not analyze the page's JavaScript and HTML content, and judging a URL's security only from DNS, WHOIS and URL information is one-sided.

In the paper "Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages", Davide builds on Justin's research by adding analysis of the page's JavaScript and HTML features, improving the recognition accuracy for malicious websites by inspecting page content. In the paper "Design and Implementation of a Web Trojan Detection System Based on Data Mining and Machine Learning", Shi Yu extracts web page features and classifies pages using machine learning and a BP neural network in order to identify malicious websites. Both methods improve considerably on Justin's research, but both overlook several important problems: (1) when classifying web page content, especially images, SVM models and BP neural networks perform poorly on complex images and tend to produce large deviations; (2) classifying page content with machine learning or deep learning imposes a large overhead on the system, and neither work applies hardware acceleration, which is now a common way to improve system response speed.

Summary of the invention

The technical problem to be solved by the invention is to improve the response speed of website content security detection, to analyze web page content, and to reduce the false-positive rate. The object of the invention is to provide a website content security detection system and method that obtains a classifier by using sample image features as the input of a model trainer and accelerates the feature extraction algorithm with an FPGA hardware accelerator to improve system response speed, thereby achieving fast, efficient and accurate website content security detection.

The present invention is achieved through the following technical solution:

A website content security detection system, including:

Front-end request module: takes the URL to be detected as input and submits the request to the crawler module;

Crawler module: crawls the image information of the target URL;

Feature extraction module: extracts feature vectors from the image information obtained by the crawler module and from the image information of the sample picture module;

Model trainer: generates a classifier from the feature vectors of the sample pictures by supervised learning;

FPGA hardware accelerator: provides hardware acceleration to the feature extraction module;

Security arbitration module: calculates the safety coefficient of the target URL according to the classifier's classification of the image features;

Data storage module: stores the image information crawled by the crawler module and the detection result information for the target URL;

Responder: returns the safety coefficient of the target URL to the front-end request module.

This solution performs security detection on website content by means of machine learning: the feature extraction module extracts image features, the model trainer trains a classifier on the extracted sample image features, and the classifier classifies images according to their features. Judging by image classification avoids the misjudgments that occur when a malicious URL shows no explicit malicious features in its syntax or WHOIS registration information and is confused with normal URLs, so the judgment of this solution has a small deviation and a low false-positive rate. In addition, the feature extraction algorithm is accelerated by the FPGA hardware accelerator to improve system response speed, achieving fast, efficient and accurate website content security detection.

Preferably, the FPGA hardware accelerator uses the Xilinx Reconfigurable Acceleration Stack and is implemented in combination with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library.

Preferably, the Caffe machine learning framework is an integrated framework for deep learning with convolutional neural networks (CNNs). When the prior art uses SVM models or BP neural networks to classify complex images, large deviations easily arise. In this solution, image feature vectors are extracted from the crawled text and image content by CNN deep learning, and the classifier is obtained by using sample image features as the input of the model trainer. When analyzing complex images, it is less prone to deviation than SVM or BP neural network classification algorithms, so the website screening results are more accurate. The feature extraction module uses the FPGA hardware accelerator of the Xilinx Reconfigurable Acceleration Stack to accelerate the core algorithm, greatly improving the response speed of the system.

Preferably, the security arbitration module calculates the safety coefficient of the target website according to whether the number of pictures marked as unsafe exceeds a set threshold.

Preferably, the sample picture module includes normal pictures and abnormal pictures, where abnormal pictures are pictures with gambling or pornographic features.

A website content security detection method, comprising the following steps:

S1: the feature extraction module extracts the image information of the sample picture module in the form of feature vectors;

S2: taking the sample feature vectors obtained in S1 as input, the model trainer generates a classifier by supervised learning;

S3: the URL to be detected is entered in the front-end request module, its legality is checked, and it is submitted to the crawler module;

S4: the crawler module receives the URL sent by the front-end request module, crawls the image information of the target URL, and stores the crawled content in the data storage module;

S5: the feature extraction module extracts the feature vectors of the pictures crawled in S4;

S6: taking the image feature vectors extracted in S5 as input, the classifier classifies the crawled images;

S7: the security arbitration module calculates the safety coefficient of the target URL from the classification results of S6, and stores the target URL, the local path of the saved pictures of the target website, the detection time and the safety coefficient;

S8: the responder sends the detection result for the target URL to the front-end request module.

Preferably, the feature extraction module uses the FPGA hardware accelerator to accelerate the image feature extraction algorithm.

Preferably, the FPGA hardware accelerator uses the Xilinx Reconfigurable Acceleration Stack and is implemented in combination with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library; the Caffe machine learning framework is an integrated framework for deep learning with convolutional neural networks (CNNs).

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The invention obtains a classifier by using sample image features as the input of the model trainer, performs security detection on website content by means of machine learning, and accelerates the image feature extraction algorithm with an FPGA accelerator, achieving fast, efficient and accurate website content security detection.

2. The invention extracts image features from the crawled text and image content by CNN deep learning. When analyzing complex images, it is less prone to large deviations than SVM or BP neural network classification algorithms, and the extraction results are better.

3. The feature extraction module of the invention uses the FPGA hardware accelerator of the Xilinx Reconfigurable Acceleration Stack to accelerate the core algorithm, greatly improving the response speed of the system.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the embodiments of the invention and form a part of this application; they do not limit the embodiments of the invention. In the drawings:

Fig. 1 is a schematic structural diagram of the invention;

Fig. 2 is a schematic diagram of the Xilinx Reconfigurable Acceleration Stack.

Detailed description of the embodiments

To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings. The exemplary embodiments of the invention and their descriptions are only used to explain the invention and do not limit it.

Embodiment 1：

As shown in Figs. 1-2, the present invention includes a website content security detection system.

Existing systems for detecting malicious websites have a high false-positive rate for pages whose malicious URLs show no explicit malicious features in their syntax or WHOIS registration information and closely resemble normal URLs. They also lack analysis of the page's JavaScript and HTML content, and judging a URL's security only from DNS, WHOIS and URL information is very one-sided. When classifying web page content, especially complex images, they easily produce large deviations that affect the final judgment. When web page content is classified by machine learning or deep learning, the system responds slowly, which reduces efficiency.

Embodiment 2：

This embodiment builds on Embodiment 1 with the following preferences: the FPGA hardware accelerator uses the Xilinx Reconfigurable Acceleration Stack and is implemented in combination with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library.

The Caffe machine learning framework is an integrated framework for deep learning with convolutional neural networks (CNNs). When the prior art uses SVM models or BP neural networks to classify complex images, large deviations easily arise. In this solution, image feature vectors are extracted from the crawled text and image content by CNN deep learning, and the classifier is obtained by using sample image features as the input of the model trainer. When analyzing complex images, it is less prone to deviation than SVM or BP neural network classification algorithms, so the website screening results are more accurate. The feature extraction module uses the FPGA hardware accelerator of the Xilinx Reconfigurable Acceleration Stack to accelerate the core algorithm, greatly improving the response speed of the system.

The security arbitration module calculates the safety coefficient of the target website according to whether the number of pictures marked as unsafe exceeds a set threshold.
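The threshold rule above can be sketched in Python as follows. The text does not give a formula for the safety coefficient, so the ratio used here (the share of pictures not marked unsafe), the default threshold of 3 and the names `Verdict` and `arbitrate` are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    total: int          # number of pictures that were classified
    unsafe: int         # number of pictures marked as unsafe
    coefficient: float  # assumed formula: share of pictures not marked unsafe
    is_safe: bool       # True while the unsafe count does not exceed the threshold


def arbitrate(unsafe_flags, threshold=3):
    """Apply the threshold rule to per-picture unsafe marks.

    unsafe_flags: iterable of booleans, True for a picture marked unsafe.
    threshold: hypothetical default; the source only says "a set threshold".
    """
    flags = list(unsafe_flags)
    total = len(flags)
    unsafe = sum(1 for flag in flags if flag)
    # Assumption: a site with no pictures is given the maximum coefficient.
    coefficient = 1.0 if total == 0 else 1.0 - unsafe / total
    return Verdict(total, unsafe, coefficient, unsafe <= threshold)
```

For example, `arbitrate([True, False, False, False], threshold=1)` reports one unsafe picture out of four and still treats the site as safe, because the count does not exceed the threshold.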

The sample picture module includes normal pictures and abnormal pictures, where abnormal pictures are pictures with features such as gambling and pornography. The classifier generated from the sample picture module judges with high accuracy whether a picture from the URL is an abnormal picture.

Embodiment 3：

The feature extraction module uses an FPGA accelerator to accelerate the image feature extraction algorithm.

The FPGA hardware accelerator uses the Xilinx Reconfigurable Acceleration Stack and is implemented in combination with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library; the Caffe machine learning framework is an integrated framework for deep learning with convolutional neural networks (CNNs).

In the first step, the training-set sample pictures are converted with the convert_imageset tool of the Caffe framework into .leveldb files that Caffe can process. The -resize_width and -resize_height options are passed when calling the tool so that all training-set sample pictures have the same size; the resolution after resizing used in this method is 256*256. All training-set sample pictures are labeled in advance.
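A minimal sketch of how this conversion step might be driven from Python is given below. It only builds the listing text and the argument list for Caffe's `convert_imageset` tool. The directory and file names, the default location of the binary and the helper names `listing_text` and `convert_imageset_command` are assumptions, and the flags are written in double-dash form.

```python
def listing_text(samples):
    """Build the listing file content: one "relative/path label" pair per line.

    samples: iterable of (relative_image_path, integer_label) pairs,
    reflecting the labels assigned to the training pictures in advance.
    """
    return "".join(f"{path} {int(label)}\n" for path, label in samples)


def convert_imageset_command(image_root, listing_path, db_path,
                             binary="convert_imageset", size=256):
    """Return the argument list for resizing pictures and writing a LevelDB.

    binary is an assumed location of the tool; adjust it to the local build.
    """
    root = str(image_root)
    if not root.endswith("/"):
        root += "/"  # the tool prepends the root folder to each listed path
    return [
        binary,
        f"--resize_width={size}",
        f"--resize_height={size}",
        "--backend=leveldb",
        "--shuffle",
        root,
        str(listing_path),
        str(db_path),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Caffe is installed; building the list separately keeps the sketch testable without Caffe.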

In the second step, the extract_features tool of the Caffe framework is used to extract the sample image features from the .leveldb files generated above in the form of feature vectors, and the deep neural network (DNN) library of the Xilinx Reconfigurable Acceleration Stack is called to hardware-accelerate the process and increase the running speed of the module.

In the third step, the model trainer is started. With the files name.prototxt and name_solver.prototxt defined, the train command of the Caffe framework and its --solver parameter are used to train a model by supervised learning on the feature vectors obtained in step 2. The model is continually corrected by fine-tuning during this process, finally producing a classifier that has as many classes as there are labels and can classify sensitive (gambling, pornographic, etc.) pictures.
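As a rough illustration of this training step, the Python sketch below renders a small solver definition and the matching `caffe train` argument list. The hyperparameter values, the snapshot prefix and the optional pretrained weights file used for fine-tuning are all assumptions; the text specifies only the file names name.prototxt and name_solver.prototxt and the --solver parameter.

```python
def solver_prototxt(net="name.prototxt", base_lr=0.001, max_iter=10000,
                    snapshot_prefix="snapshots/name"):
    """Render solver settings as prototxt text (all values are illustrative)."""
    fields = [
        ("net", f'"{net}"'),
        ("base_lr", base_lr),
        ("lr_policy", '"step"'),
        ("gamma", 0.1),
        ("stepsize", 2000),
        ("momentum", 0.9),
        ("weight_decay", 0.0005),
        ("max_iter", max_iter),
        ("snapshot", 1000),
        ("snapshot_prefix", f'"{snapshot_prefix}"'),
        ("solver_mode", "CPU"),
    ]
    return "".join(f"{key}: {value}\n" for key, value in fields)


def train_command(solver="name_solver.prototxt", weights=None, binary="caffe"):
    """Return the argument list for training; weights enables fine-tuning."""
    cmd = [binary, "train", f"--solver={solver}"]
    if weights:
        # Hypothetical pretrained model file to start fine-tuning from.
        cmd.append(f"--weights={weights}")
    return cmd
```

Writing the rendered text to name_solver.prototxt and running the returned command would start training on a machine with Caffe installed.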

In the fourth step, the front-end interface is written in HTML, CSS and JavaScript. The target URL to be detected is entered in the front-end input box, and its legality is checked, for example whether the input could cause security vulnerabilities such as XSS or SQL injection. If the entered URL is legal, it is sent to the crawler module using the Ajax post() method of the jQuery library.
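The legality check described above could also be repeated on the receiving side before the URL reaches the crawler. The Python sketch below is one possible form of such a check. The allowed schemes, the set of rejected characters, the length limit and the name `is_legal_url` are assumptions rather than rules stated in the text.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
# Characters commonly seen in markup or quoting attacks (assumed deny-list).
FORBIDDEN_CHARS = set("<>\"'`\\{}|^ \t\r\n")


def is_legal_url(raw, max_length=2048):
    """Return True if raw looks like a plain http(s) URL with a host."""
    if not isinstance(raw, str) or not raw or len(raw) > max_length:
        return False
    if any(ch in FORBIDDEN_CHARS for ch in raw):
        return False
    try:
        parts = urlsplit(raw)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 brackets.
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    return bool(host)
```

A check of this kind only filters obviously malformed input; escaping output and using parameterized database queries remain necessary elsewhere in the system.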

In the fifth step, the crawler module receives the URL detection request from the front-end request module, crawls image information from the target URL using the Python Scrapy framework, and saves the crawled pictures as local files.
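The core of this crawling step is finding the picture addresses on a page. The sketch below shows that part only, and it deliberately uses Python's standard `html.parser` instead of Scrapy so that it runs without extra packages; it is a stand-in for illustration, not the Scrapy spider the text refers to. The names `ImageCollector` and `extract_image_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            absolute = urljoin(self.base_url, src)
            # Keep first-seen order and skip duplicates.
            if absolute not in self.image_urls:
                self.image_urls.append(absolute)


def extract_image_urls(html, base_url):
    """Return the de-duplicated picture URLs found in an HTML document."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

Downloading each returned URL and writing it to a local directory would complete the step; that part is omitted here because it requires network access.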

In the sixth step, as in step 1, the pictures crawled in step 5 are resized and converted into .leveldb files that Caffe can process. With the pictures crawled in step 5 as the test set, the feature extraction module extracts the feature vectors of the crawled images, the classifier generated in step 3 classifies the crawled images according to these feature vectors, and sensitive images are marked as unsafe.
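The marking part of this step can be sketched in Python once the classifier has produced one score per class for each picture. The label names, the choice of the highest-scoring class and the function name `mark_images` are assumptions; the text only states that sensitive images are marked as unsafe.

```python
def mark_images(scores_by_image, labels, sensitive=("gambling", "pornographic")):
    """Map each picture to (predicted_label, is_unsafe).

    scores_by_image: dict of picture path -> list of class scores.
    labels: class names in the same order as the scores (assumed names).
    sensitive: labels treated as unsafe (assumed from the categories in the text).
    """
    marks = {}
    for path, scores in scores_by_image.items():
        if len(scores) != len(labels):
            raise ValueError(f"expected {len(labels)} scores for {path}")
        # Pick the index of the highest score as the predicted class.
        best = max(range(len(scores)), key=scores.__getitem__)
        predicted = labels[best]
        marks[path] = (predicted, predicted in sensitive)
    return marks
```

The boolean in each result is the per-picture unsafe mark that the arbitration step counts against its threshold.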

In the seventh step, the security arbitration module calculates the safety coefficient of the target website according to whether the number of pictures marked as unsafe exceeds a set threshold, and stores the target URL, the local path of the saved pictures of the target website, the detection time, the safety coefficient and so on as fields in the data storage module.
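The stored record named above (target URL, local picture path, detection time, safety coefficient) can be illustrated with a small Python sketch. The text does not name a database, so the use of SQLite, the table and column names and the helper functions are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema: one row per detection run, with the four fields from the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS detection (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    target_url TEXT NOT NULL,
    picture_path TEXT NOT NULL,
    detected_at TEXT NOT NULL,
    safety_coefficient REAL NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the detection store; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_detection(conn, target_url, picture_path, safety_coefficient,
                   detected_at=None):
    """Insert one detection record and return its row id."""
    when = detected_at or datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO detection "
            "(target_url, picture_path, detected_at, safety_coefficient) "
            "VALUES (?, ?, ?, ?)",
            (target_url, picture_path, when, float(safety_coefficient)),
        )
    return cur.lastrowid


def latest_detection(conn, target_url):
    """Return the most recent record for a URL, or None if there is none."""
    return conn.execute(
        "SELECT target_url, picture_path, detected_at, safety_coefficient "
        "FROM detection WHERE target_url = ? ORDER BY id DESC LIMIT 1",
        (target_url,),
    ).fetchone()
```

The queries use placeholders instead of string formatting, so a URL submitted through the front end is never interpreted as SQL.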

In the eighth step, the responder sends the security detection data for this target URL to the front-end request module.

This method first crawls the image information of the website to be detected, classifies it with the classifier, calculates an accurate safety coefficient for the website, and then returns it to the front-end request module for display. This solution performs security detection on website content by means of machine learning: the feature extraction module extracts image features, the model trainer trains a classifier on the extracted sample image features, and the classifier classifies images according to their features. Judging by image classification gives a small deviation and a low false-positive rate, and accelerating the feature extraction algorithm with the FPGA hardware accelerator improves system response speed, achieving fast, efficient and accurate website content security detection.

The above embodiments describe the object, technical solution and beneficial effects of the invention in further detail. It should be understood that the foregoing are only embodiments of the invention and are not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A website content security detection system, characterized by including

Front-end request module: takes the URL to be detected as input and submits the request to the crawler module;

Crawler module: crawls the image information of the target URL;

Feature extraction module: extracts feature vectors from the image information obtained by the crawler module and from the image information of the sample picture module;

Model trainer: generates a classifier from the feature vectors of the sample pictures by supervised learning;

FPGA hardware accelerator: provides hardware acceleration to the feature extraction module;

Security arbitration module: calculates the safety coefficient of the target URL according to the classifier's classification of the image features;

Data storage module: stores the image information crawled by the crawler module and the detection result information for the target URL;

Responder: returns the safety coefficient of the target URL to the front-end request module.
2. The website content security detection system according to claim 1, characterized in that the FPGA hardware accelerator uses the Xilinx Reconfigurable Acceleration Stack and is implemented in combination with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library.
3. The website content security detection system according to claim 2, characterized in that the Caffe machine learning framework is an integrated framework for deep learning with convolutional neural networks (CNNs).
4. The website content security detection system according to claim 1, characterized in that the security arbitration module calculates the safety coefficient of the target website according to whether the number of pictures marked as unsafe exceeds a set threshold.
5. The website content security detection system according to claim 1, characterized in that the sample picture module includes normal pictures and abnormal pictures, where abnormal pictures are pictures with gambling or pornographic features.
6. A website content security detection method, characterized by comprising the following steps:

S1: the feature extraction module extracts the image information of the sample picture module in the form of feature vectors;

S2: taking the sample feature vectors obtained in S1 as input, the model trainer generates a classifier by supervised learning;

S3: the URL to be detected is entered in the front-end request module, its legality is checked, and it is submitted to the crawler module;

S4: the crawler module receives the URL sent by the front-end request module, crawls the image information of the target URL, and stores the crawled content in the data storage module;

S5: the feature extraction module extracts the feature vectors of the pictures crawled in S4;

S6: taking the image feature vectors extracted in S5 as input, the classifier classifies the crawled images;

S7: the security arbitration module calculates the safety coefficient of the target URL from the classification results of S6, and stores the target URL, the local path of the saved pictures of the target website, the detection time and the safety coefficient;

S8: the responder sends the detection result for the target URL to the front-end request module.
7. The website content security detection method according to claim 6, characterized in that the feature extraction module uses the FPGA hardware accelerator to accelerate the image feature extraction algorithm.
8. The website content security detection method according to claim 7, characterized in that the FPGA hardware accelerator uses the Xilinx Reconfigurable Acceleration Stack and is implemented in combination with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library, and the Caffe machine learning framework is an integrated framework for deep learning with convolutional neural networks (CNNs).