CN111008347A

CN111008347A - Website identification method, device and system and computer readable storage medium

Info

Publication number: CN111008347A
Application number: CN201911166434.8A
Authority: CN
Inventors: 温延龙; 范渊
Original assignee: DBAPPSecurity Co Ltd
Current assignee: DBAPPSecurity Co Ltd
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2020-04-14

Abstract

The application discloses a website identification method, which comprises the steps of determining a website to be identified according to a received website identification instruction; calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm; identifying the website to be identified by using the preset website identification model, and determining the website category of the website to be identified; the website identification method can be used for quickly and accurately identifying each application system website in the Internet. The application also discloses a website identification device, a website identification system and a computer readable storage medium, which have the beneficial effects.

Description

Website identification method, device and system and computer readable storage medium

Technical Field

The present application relates to the field of network security technologies, and in particular, to a website identification method, and further, to a website identification apparatus, system, and computer-readable storage medium.

Background

With the rapid development of the internet, a large number of internet websites have emerged, including a large number of application systems, most of which are websites developed for storing data and for managing enterprises or schools. However, these application system websites have become frequent targets of attack and penetration by hacker organizations, resulting in the theft of large numbers of important files and information.

Therefore, quickly identifying the application systems exposed on the internet is very important and is an effective way to strengthen their security supervision. In the prior art, although there are many websites on the internet, methods for identifying website types are lacking. Identification is generally performed manually, but the manual workload is huge: websites must first be identified and then matched, so identification efficiency is extremely low and careless mistakes occur easily.

Therefore, how to quickly and accurately identify an application system website in the internet is a problem to be solved by those skilled in the art.

Disclosure of Invention

An object of the present application is to provide a website identification method that can quickly and accurately identify each application system website on the Internet; another object of the present application is to provide a website identification apparatus, a website identification system, and a computer-readable storage medium, which also have the above advantages.

In order to solve the above technical problem, the present application provides a website identification method, where the website identification method includes:

determining a website to be identified according to the received website identification instruction;

calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;

and identifying the website to be identified by using the preset website identification model, and determining the website category of the website to be identified.

Preferably, the establishing the preset website identification model based on the random forest algorithm includes:

acquiring a sample website;

acquiring website information according to the sample website;

extracting the characteristics of the website information to obtain specified characteristic information;

and performing model training on the specified characteristic information by using the random forest algorithm to obtain the preset website identification model.

Preferably, the obtaining website information according to the sample website includes:

and crawling information of the sample website by using a crawler technology to obtain the website information.

Preferably, before performing model training on the specified feature information by using the random forest algorithm to obtain the preset website recognition model, the method further includes:

and carrying out standardization processing on the specified characteristic information by using the mean value and the variance to obtain the standardized specified characteristic information.

Preferably, the performing model training on the specified feature information by using the random forest algorithm to obtain the preset website identification model includes:

generating a sample matrix and a type matrix corresponding to the sample matrix according to the specified characteristic information;

and performing random forest calculation on the sample matrix and the type matrix based on python to obtain the preset website identification model.

Preferably, the website identification method further includes:

and verifying the preset website identification model by using a cross verification method to obtain a first optimized website identification model.

Preferably, the website identification method further includes:

and optimizing the preset website identification model according to the website category of the website to be identified to obtain a second optimized website identification model.

In order to solve the above technical problem, the present application further provides a website identification apparatus, including:

the instruction receiving module is used for determining the website to be identified according to the received website identification instruction;

the model calling module is used for calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;

and the website identification module is used for identifying the website to be identified by using the preset website identification model and determining the website category of the website to be identified.

In order to solve the above technical problem, the present application further provides a website identification system, where the website identification system includes:

a memory for storing a computer program;

and the processor is used for realizing the steps of any one of the website identification methods when the computer program is executed.

In order to solve the above technical problem, the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of any one of the above website identification methods.

The website identification method comprises the steps of determining a website to be identified according to a received website identification instruction; calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm; and identifying the website to be identified by using the preset website identification model, and determining the website category of the website to be identified.

Therefore, according to the website identification method, the website identification model is established in advance by using the random forest algorithm, so that the unknown website can be automatically identified through the website identification model.

The website identification device, the website identification system and the computer readable storage medium provided by the application all have the beneficial effects, and are not described herein again.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flowchart of a website identification method provided in the present application;

Fig. 2 is a schematic flowchart of a website identification model construction method provided in the present application;

Fig. 3 is a flowchart of a website identification method provided in the present application;

Fig. 4 is a schematic structural diagram of a website identification apparatus provided in the present application;

Fig. 5 is a schematic structural diagram of a website identification system provided in the present application.

Detailed Description

The core of the application is to provide a website identification method that can quickly and accurately identify each application system website on the Internet; another core of the present application is to provide a website identification apparatus, a website identification system, and a computer-readable storage medium, which also have the above-mentioned advantages.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a website identification method according to the present application, where the website identification method may include:

S101: determining a website to be identified according to the received website identification instruction;

This step determines the website to be identified, that is, a website whose type is unknown and needs to be identified. Specifically, when an unknown website needs to be identified, a website identification instruction designating that website as the website to be identified can be sent to the controller, so that the controller can determine the website to be identified from the instruction and identify its type.

S102: calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;

This step obtains the preset website identification model, which is a model established in advance based on a random forest algorithm and can be used to identify unknown website types. Once the website to be identified is determined, the preset website identification model can be called directly to perform website identification. After being established, the preset website identification model can be stored in a preset storage space, such as a memory or a magnetic disk, so that it can be called later.

S103: and identifying the website to be identified by using a preset website identification model, and determining the website category of the website to be identified.

This step performs website identification: the called preset website identification model is applied directly to the website to be identified to obtain the corresponding identification result, which determines the website category of the website to be identified. The specific content of the website category does not affect the implementation of the technical solution and is not limited in the present application.
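Steps S101 to S103 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the present application: the instruction format (a dict with a `url` key), the `MODEL_STORE` registry standing in for the preset storage space, the `featurize` callable, and the `StubModel` class are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class StubModel:
    """Hypothetical stand-in for a trained model; a real one would be a random forest."""

    def predict(self, rows: Sequence[Sequence[float]]) -> List[int]:
        # Toy rule: category 1 if the first feature is non-zero, otherwise 0.
        return [1 if row[0] else 0 for row in rows]


# Hypothetical "preset storage space" holding the model built in advance.
MODEL_STORE: Dict[str, StubModel] = {"preset": StubModel()}


def identify_website(
    instruction: Dict[str, str],
    featurize: Callable[[str], Sequence[float]],
) -> Tuple[str, int]:
    url = instruction["url"]                        # S101: website to be identified
    model = MODEL_STORE["preset"]                   # S102: call the preset model
    category = model.predict([featurize(url)])[0]   # S103: determine the category
    return url, category


# Usage with a hypothetical feature extractor that always reports a login keyword.
result = identify_website({"url": "http://example.com"}, lambda url: [1.0, 0.0])
```

Keeping the feature extractor as a parameter keeps the sketch independent of how website information is actually collected.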

As a preferred embodiment, the website identification method may further include: and optimizing the preset website identification model according to the website category of the website to be identified to obtain a second optimized website identification model.

The preferred embodiment aims to realize model optimization, and specifically, after the identification of the website to be identified is completed, the preset website identification model can be optimized by using the identification result to obtain the optimized website identification model, namely, the second optimized identification model, so that the model identification precision is improved, and convenience is provided for website identification again.
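One possible reading of this optimization is that newly identified websites, together with their predicted categories, are appended to the training data and the model is retrained. The sketch below illustrates that reading only. It assumes scikit-learn's `RandomForestClassifier` (the application names python but no particular library), and the two-column feature rows are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical labelled samples: [number of <a> tags, login-keyword flag].
X_train = [[68, 0], [139, 0], [98, 0], [1, 1], [6, 1], [2, 1]]
y_train = [0, 0, 0, 1, 1, 1]  # 0 = portal website, 1 = application system

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Newly identified websites (hypothetical feature rows) and their predicted categories.
X_new = [[120, 0], [3, 1]]
y_new = model.predict(X_new).tolist()

# Fold the identification results back into the training data and retrain.
X_aug = X_train + X_new
y_aug = y_train + y_new
second_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_aug, y_aug)
```

Because predicted categories can be wrong, it may be prudent to review them, or to keep only high-confidence predictions, before they are added to the training data.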

According to the website identification method, the website identification model is established by utilizing the random forest algorithm in advance, so that the unknown website can be automatically identified through the website identification model.

On the basis of the foregoing embodiments, the present application embodiment introduces details of the building process of the preset website identification model, please refer to fig. 2, and fig. 2 is a schematic flow diagram of a website identification model building method provided by the present application, where the website identification model building method includes:

S201: acquiring a sample website;

This step acquires sample websites. In contrast to a website to be identified, a sample website is a website whose type is known and is used for model training. The number of sample websites is not fixed; it can be understood that the more sample websites there are, the higher the accuracy of the resulting preset website identification model. The sample websites can be divided into positive samples and negative samples, or can be website samples of several different known categories.

S202: acquiring website information according to a sample website;

This step acquires website information, that is, the website information of each sample website is collected for model training. Any acquisition method in the prior art can be adopted, which is not limited in the present application.

Preferably, the obtaining of the website information according to the sample website may include: and crawling information of the sample website by using a crawler technology to obtain website information.

The preferred embodiment provides a specific website information obtaining method, which is realized based on a web crawler technology, and specifically, after obtaining each sample website, information crawling can be performed by using the web crawler to obtain website information of each sample website.
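As a rough illustration of the crawling step, the sketch below downloads a home page with Python's standard `urllib.request`. The function name `fetch_home_page`, the User-Agent string, the UTF-8 decoding, and the injectable `opener` parameter are assumptions made for this example; the application does not specify a crawler implementation.

```python
import io
import urllib.request
from typing import Callable


def fetch_home_page(url: str, timeout: float = 10.0,
                    opener: Callable = urllib.request.urlopen) -> str:
    """Download the raw HTML of a website home page."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "website-identification-sketch/0.1"}
    )
    with opener(request, timeout=timeout) as response:
        raw = response.read()
    # Assumption: decode as UTF-8 and replace undecodable bytes; real sites may
    # declare another charset (for example GBK) that should be honoured instead.
    return raw.decode("utf-8", errors="replace")


def fake_opener(request, timeout):
    """Offline stand-in for urlopen so the example runs without network access."""
    return io.BytesIO(b"<html><body>ok</body></html>")


page = fetch_home_page("http://example.com", opener=fake_opener)
```

Calling `fetch_home_page(url)` without the `opener` argument would issue a real HTTP request.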

S203: extracting the characteristics of the website information to obtain specified characteristic information;

This step performs feature extraction, that is, extracting from the website information the information relevant to the specified features, namely the specified feature information. The specified features are preset by a technician according to actual needs, and their specific content is not limited in the present application. For example, the feature dimensions to be extracted may include the number of <a> tags on the home page, the number of <form> forms contained in <body>, the number of type="password" attributes contained in <body>, whether the keyword "login" (or its Chinese equivalent) is included, the length of the home page content after HTML tags are removed, the number of <img> tags, the number of <li> tags in the page, the number of type="text" and type="password" attributes in the page body, and the like.

S204: and performing model training on the specified characteristic information by using a random forest algorithm to obtain a preset website identification model.

The step aims to realize model training, namely, the random forest algorithm is used for carrying out model training on each piece of extracted specified characteristic information, and the preset website recognition model can be obtained.

Preferably, before the model training is performed on the specified feature information by using the random forest algorithm to obtain the preset website recognition model, the method may further include: and carrying out standardization processing on the specified characteristic information by using the mean value and the variance to obtain the standardized specified characteristic information.

In order to further improve the recognition accuracy and the training speed of the preset website recognition model, before the model training is carried out by using the specified characteristic information, the further standardization processing can be carried out on each specified characteristic information, wherein the standardization process can be realized by mean value calculation and variance calculation, and further standardized specified characteristic information is obtained, so that the subsequent model training can be carried out based on the standardized specified characteristic information.
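Standardization by mean and variance is presumably the usual z-score, z = (x - mean) / sqrt(variance), applied to each feature column. The sketch below shows that computation with the Python standard library only; the function name `standardize` and the two-column sample rows are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def standardize(
    rows: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[float], List[float]]:
    """Scale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    # pstdev is the square root of the population variance; a constant
    # column has deviation 0, so fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(column) or 1.0 for column in columns]
    scaled = [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds


# Hypothetical rows: [number of <a> tags, login-keyword flag].
scaled, means, stds = standardize([[68, 0], [2, 1]])
```

The returned means and deviations would need to be kept so that a website to be identified can later be scaled in the same way as the training samples.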

Preferably, the performing model training on the specified feature information by using a random forest algorithm to obtain the preset website recognition model may include: generating a sample matrix and a type matrix corresponding to the sample matrix according to the specified characteristic information; and performing random forest calculation on the sample matrix and the type matrix based on python to obtain a preset website identification model.

The preferred embodiment provides a specific model training method based on a random forest algorithm, which comprises the steps of firstly, generating a sample matrix and a type matrix corresponding to the sample matrix by utilizing each piece of specified characteristic information, and further, executing the random forest algorithm on the sample matrix by utilizing python to obtain a corresponding preset website identification model.

Preferably, the website identification method may further include: and verifying the preset website identification model by using a cross verification method to obtain a first optimized website identification model.

In order to further improve the model identification precision and ensure the accuracy of the identification result, the preset website identification model can be verified after it is constructed. Specifically, this can be done by cross-validation, yielding the optimized website identification model, namely the first optimized website identification model. Cross-validation is generally applied during modeling: from the given modeling samples, most are used to build the model and a small part is held out; the newly built model then predicts the held-out samples, the prediction errors are computed, and the sum of their squares is recorded. This can effectively improve the stability of the model.
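The held-out prediction and squared-error bookkeeping described above could look roughly like the following. The sketch assumes scikit-learn and NumPy, which the application does not name, and uses hypothetical two-column feature rows; `cross_val_predict` returns, for every sample, the prediction of a model trained on the folds that exclude it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Hypothetical feature rows: [number of <a> tags, number of password inputs].
X = np.array([[68, 0], [21, 0], [139, 0], [98, 0], [16, 0], [97, 0],
              [1, 1], [6, 1], [2, 1], [7, 1], [11, 1], [4, 1]])
y = np.array([0] * 6 + [1] * 6)  # 0 = portal website, 1 = application system

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each sample is predicted by a model that never saw it during training.
held_out_pred = cross_val_predict(model, X, y, cv=3)

# Sum of squared prediction errors over all held-out samples.
squared_error_sum = int(((held_out_pred - y) ** 2).sum())
accuracy = float((held_out_pred == y).mean())
```

For 0/1 labels the sum of squared errors equals the number of misclassified held-out samples, so a smaller value suggests a more stable model.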

The embodiment of the application provides a specific website identification model construction method, so that automatic identification of a website to be identified is realized through the website identification model, and the website identification efficiency and the accuracy of an identification result are further improved.

On the basis of the foregoing embodiments, the present application embodiment provides a more specific website identification method by taking identification of an application system website as an example, and with reference to fig. 3, fig. 3 is a flowchart of the website identification method provided by the present application, and a specific implementation flow thereof is as follows:

First, home pages of internet websites are collected and their information is crawled. Specifically, a batch of application system websites is manually collected in advance as positive samples, and a batch of websites that are not application systems is collected as negative samples, recorded as portal websites; the home page body information of each website is then crawled according to the collected sample data.

Further, the feature dimensions to be acquired are set. There are 8 feature dimensions in total: the number of <a> tags on the home page, the number of <form> forms contained in <body>, the number of type="password" attributes contained in <body>, whether the keyword "login" (or its Chinese equivalent) is included, the length of the home page content after HTML tags are removed, the number of <img> tags, the number of <li> tags in the page, and the number of type="text" and type="password" attributes in the page body.

The 8 feature dimensions are acquired as follows:

(1) Based on the difference in completeness between the home pages of application system websites and portal websites, the first feature dimension is the number of <a> tags on the home page. Generally, a portal website page has many <a> tags, whereas an application system website page has few.

(2) Count the number of <form> forms contained in <body> of the website home page. Generally, an application system website contains at least one <form> form for submitting login information.

(3) Count the number of type="password" attributes in <body>. Generally, the home page of an application system website includes a password field; because the Chinese label for "password" may be worded in various ways on different websites, the type attribute of the tag is counted instead.

(4) Check whether the home page carries the keyword "login" (or its Chinese equivalent). If it does, the feature is recorded as 1, corresponding to an application system website; otherwise it is recorded as 0, corresponding to a portal website.

(5) Count the length of the home page content. Generally, a portal website page is rich in content and its page content is long, whereas an application system website page is sparse and its content is short.

(6) Count the number of <img> tags. Generally, an application system website page is simple and contains only a few pictures, whereas a portal website page is rich in pictures.

(7) Count the number of <li> tags in the page. Generally, a portal website has navigation bars, which are usually composed of <li> elements.

(8) Count the number of type="text" and type="password" attributes in the page body. Generally, if an application system website page has a login box, it contains at least two such input fields, whereas a portal website typically has only one. Therefore, if the sum of the number of type="text" and the number of type="password" attributes is greater than or equal to 2, the feature is recorded as 1, corresponding to an application system website; otherwise it is recorded as 0, corresponding to a portal website.
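The eight feature dimensions listed above could be computed along the following lines. This is a sketch using Python's standard `html.parser`, not the extraction code of the present application. The names `HomePageFeatures` and `extract_features` are hypothetical, tags are counted over the whole document rather than strictly inside <body>, the keyword is searched in the raw HTML, and the Chinese keyword "登录" is an assumed rendering of the Chinese equivalent of "login".

```python
from html.parser import HTMLParser
from typing import List, Sequence


class HomePageFeatures(HTMLParser):
    """Collects tag counts and visible text while parsing a home page."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = {"a": 0, "form": 0, "img": 0, "li": 0, "password": 0, "text": 0}
        self.text_parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag in ("a", "form", "img", "li"):
            self.counts[tag] += 1
        if tag == "input":
            input_type = (dict(attrs).get("type") or "").lower()
            if input_type in ("password", "text"):
                self.counts[input_type] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def extract_features(page_html: str,
                     keywords: Sequence[str] = ("login", "登录")) -> List[int]:
    parser = HomePageFeatures()
    parser.feed(page_html)
    parser.close()
    counts = parser.counts
    visible_text = " ".join("".join(parser.text_parts).split())
    lowered = page_html.lower()
    return [
        counts["a"],                                        # (1) <a> tags
        counts["form"],                                     # (2) <form> forms
        counts["password"],                                 # (3) type="password"
        int(any(keyword in lowered for keyword in keywords)),  # (4) login keyword
        len(visible_text),                                  # (5) content length
        counts["img"],                                      # (6) <img> tags
        counts["li"],                                       # (7) <li> tags
        int(counts["text"] + counts["password"] >= 2),      # (8) two or more inputs
    ]


sample = ('<html><body><a href="/help">Help</a>'
          '<form action="/login"><input type="text" name="u">'
          '<input type="password" name="p"></form>'
          '<img src="logo.png"></body></html>')
features = extract_features(sample)
```

The order of the returned list follows the numbering (1) to (8) above, which is a choice made for this sketch.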

Further, training data is obtained according to the determined feature dimension, for example, the following data are obtained as the training data:

Negative sample data: portal website (0)

68   27  12  0  2  0  0  1167  http://www.zifi.cn
21   11  0   1  0  0  0  1801  http://tdtri.org
139  24  90  1  2  0  0  2884  http://renzefoundation.org
98   0   57  0  1  0  0  893   http://47.105.132.13
16   16  9   0  0  0  0  446   http://www.vis-top.com
97   54  7   0  1  0  0  1382  http://m.focus.sinorusfocus.com
18   9   0   0  1  0  0  90    http://youzhi.sdm.net.cn
70   0   0   0  0  0  0  378   http://www.xcryedu.com

Positive sample data: application system (1)

1    1   5   0  1  1  0  93    http://jszb.nhfpc.gov.cn
6    2   1   1  1  1  1  41    http://eip.zgyj.org.cn
2    1   2   1  1  0  0  207   http://cx.bjmzdx.org:8089
7    2   0   0  1  1  1  115   http://caers.org.cn
11   1   0   1  1  0  1  621   http://taya1.anji.gov.cn
4    1   0   1  2  0  1  110   http://i.yyszx.com
1    4   0   1  1  1  1  43    http://47.104.233.200:85
0    0   0   1  1  1  0  33    http://47.104.242.253:8088
4    0   5   1  1  1  1  149   http://47.105.137.228
1    3   0   1  1  1  0  35    http://47.105.170.68:81

Further, a random forest algorithm is used for model training by using python, and the training process comprises the following steps:

(1) Load the training data and normalize it. Some feature dimensions take particularly large values, such as the length of the website content, while others take particularly small values, such as only 1 or 0; this difference in scale slows convergence, so the training data can be normalized by means of the mean and variance. Further, a sample matrix and a type matrix are generated from the normalized training data, for example: X matrix (sample matrix): [[-0.4909595, -0.44880872, -0.40717837, ..., 0.72868315, 1.10401228, -0.13842202], ...]; Y matrix (type matrix): the corresponding matrix of 0 and 1 labels for the negative and positive samples.

(2) Train with python to generate the website identification model. In testing, the output is 0 or 1, where 1 represents an application system website and 0 represents a portal website.
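The two training steps can be sketched with the sample rows listed above. The library choice is an assumption: the application only says python, and scikit-learn's `StandardScaler` and `RandomForestClassifier` are used here as one plausible realization. The columns are used in the order in which they are listed, and the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

negative = [  # portal websites (0)
    [68, 27, 12, 0, 2, 0, 0, 1167],
    [21, 11, 0, 1, 0, 0, 0, 1801],
    [139, 24, 90, 1, 2, 0, 0, 2884],
    [98, 0, 57, 0, 1, 0, 0, 893],
    [16, 16, 9, 0, 0, 0, 0, 446],
    [97, 54, 7, 0, 1, 0, 0, 1382],
    [18, 9, 0, 0, 1, 0, 0, 90],
    [70, 0, 0, 0, 0, 0, 0, 378],
]
positive = [  # application systems (1)
    [1, 1, 5, 0, 1, 1, 0, 93],
    [6, 2, 1, 1, 1, 1, 1, 41],
    [2, 1, 2, 1, 1, 0, 0, 207],
    [7, 2, 0, 0, 1, 1, 1, 115],
    [11, 1, 0, 1, 1, 0, 1, 621],
    [4, 1, 0, 1, 2, 0, 1, 110],
    [1, 4, 0, 1, 1, 1, 1, 43],
    [0, 0, 0, 1, 1, 1, 0, 33],
    [4, 0, 5, 1, 1, 1, 1, 149],
    [1, 3, 0, 1, 1, 1, 0, 35],
]

X = np.array(negative + positive, dtype=float)             # sample matrix
y = np.array([0] * len(negative) + [1] * len(positive))    # type matrix

# Step (1): mean/variance normalization; step (2): random forest training.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X, y)

predicted = model.predict(X)  # each output is 0 (portal) or 1 (application system)
```

Putting the scaler and the forest in one pipeline means a website to be identified is scaled with the training statistics automatically when `model.predict` is called.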

Further, the website identification model is optimized:

(1) Identify the website to be identified by using the website identification model, and add the identification result to the training data for retraining to obtain an optimized website identification model (the second optimized website identification model).

(2) Verify the website identification model by using cross-validation to obtain an optimized website identification model (the first optimized website identification model).

(3) Preset candidate model parameters and calculate the optimal parameters of the website identification model, then retrain on the training data with the optimal parameters to obtain an optimized website identification model.
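Item (3) resembles a hyperparameter search. The sketch below shows one way to do this, assuming scikit-learn's `GridSearchCV`; the candidate values in `param_grid` and the two-column feature rows are hypothetical, and the application does not say which parameters are tuned or how the optimum is calculated.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical feature rows: [number of <a> tags, number of password inputs].
X = [[68, 0], [21, 0], [139, 0], [98, 0], [16, 0], [97, 0],
     [1, 1], [6, 1], [2, 1], [7, 1], [11, 1], [4, 1]]
y = [0] * 6 + [1] * 6  # 0 = portal website, 1 = application system

# Preset candidate parameters (illustrative values only).
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_          # best-scoring parameter combination
optimized_model = search.best_estimator_   # retrained on all data with best_params
```

With the default `refit=True`, the search retrains the best configuration on the full training data, which matches the retraining described in item (3).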

Therefore, according to the website identification method provided by the embodiment of the application, the website identification model is established in advance by using the random forest algorithm, so that the unknown website can be automatically identified through the website identification model.

To solve the above problem, the present application further provides a website identification apparatus. Referring to fig. 4, which is a schematic structural diagram of the website identification apparatus provided in the present application, the website identification apparatus may include:

the instruction receiving module 10, configured to determine a website to be identified according to the received website identification instruction;

the model calling module 20, configured to call a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;

the website identification module 30, configured to identify the website to be identified by using the preset website identification model, and determine the website category of the website to be identified.

Therefore, the website recognition device provided by the embodiment of the application establishes the website recognition model by using the random forest algorithm in advance, so that the unknown website can be automatically recognized through the website recognition model.

As a preferred embodiment, the website identification apparatus may further include a model building module, and the model building module may include:

the sample acquisition unit, configured to acquire a sample website;

the information acquisition unit is used for acquiring website information according to the sample website;

the characteristic extraction unit is used for extracting the characteristics of the website information to obtain specified characteristic information;

and the model training unit is used for performing model training on the specified characteristic information by using a random forest algorithm to obtain a preset website identification model.

As a preferred embodiment, the information collecting unit may be specifically configured to perform information crawling on a sample website by using a crawler technology to obtain website information.

As a preferred embodiment, the model building module may further include a normalization unit, configured to perform normalization processing on the specified feature information by using the mean and the variance, so as to obtain normalized specified feature information.

As a preferred embodiment, the model training unit may be specifically configured to generate a sample matrix and a type matrix corresponding to the sample matrix according to the specified feature information; and performing random forest calculation on the sample matrix and the type matrix based on python to obtain a preset website identification model.

As a preferred embodiment, the website identification apparatus may further include a first model optimization module, configured to verify a preset website identification model by using a cross-validation method, so as to obtain a first optimized website identification model.

As a preferred embodiment, the website identification apparatus may further include a second model optimization module, configured to perform optimization processing on the preset website identification model according to the website category of the website to be identified, so as to obtain a second optimized website identification model.

For the introduction of the apparatus provided in the present application, please refer to the above method embodiments, which are not described herein again.

To solve the above problem, the present application further provides a website identification system. Referring to fig. 5, which is a schematic structural diagram of the website identification system provided in the present application, the website identification system may include:

a memory 1 for storing a computer program;

the processor 2 is configured to implement the steps of any one of the above website identification methods when executing the computer program.

For the introduction of the system provided in the present application, please refer to the above method embodiments, which are not described herein again.

In order to solve the above problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, can implement the steps of any one of the above website identification methods.

The computer-readable storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

For the introduction of the computer-readable storage medium provided in the present application, please refer to the above method embodiments, which are not described herein again.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The website identification method, apparatus, system, and computer-readable storage medium provided by the present application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims

1. A website identification method, comprising:

2. The website recognition method of claim 1, wherein building the preset website recognition model based on the random forest algorithm comprises:

acquiring a sample website;

acquiring website information according to the sample website;

3. The website identification method according to claim 2, wherein the obtaining website information according to the sample website comprises:

4. The website recognition method according to claim 2, wherein before performing model training on the specified feature information by using the random forest algorithm to obtain the preset website recognition model, the method further comprises:

5. The website recognition method according to claim 2, wherein the performing model training on the specified feature information by using the random forest algorithm to obtain the preset website recognition model comprises:

6. The website identification method of claim 2, further comprising:

7. The website identification method according to any one of claims 1 to 6, further comprising:

8. A website recognition apparatus, comprising:

9. A website identification system, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the website identification method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, carries out the steps of the website identification method according to any one of claims 1 to 7.