CN111008347A - Website identification method, device and system and computer readable storage medium - Google Patents

Info

Publication number
CN111008347A
Authority
CN
China
Prior art keywords
website
model
preset
identification
website identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911166434.8A
Other languages
Chinese (zh)
Inventor
温延龙
范渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DBAPPSecurity Co Ltd filed Critical DBAPPSecurity Co Ltd
Priority to CN201911166434.8A priority Critical patent/CN111008347A/en
Publication of CN111008347A publication Critical patent/CN111008347A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a website identification method, which comprises: determining a website to be identified according to a received website identification instruction; calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm; and identifying the website to be identified by using the preset website identification model to determine the website category of the website to be identified. The website identification method can quickly and accurately identify application system websites on the Internet. The application also discloses a website identification device, a website identification system and a computer-readable storage medium, which have the same beneficial effects.

Description

Website identification method, device and system and computer readable storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a website identification method, and further, to a website identification apparatus, system, and computer-readable storage medium.
Background
With the rapid development of the Internet, a large number of websites have emerged, among them many application systems, most of which are developed to store data and to manage enterprises or schools. These application system websites, however, have become frequent targets of attack and penetration by hacker organizations, resulting in the theft of large amounts of important files and information.
It is therefore important to quickly identify the application systems exposed on the Internet, which is an effective way to strengthen their security supervision. In the prior art, although there are many websites on the Internet, methods for identifying website types are lacking. Identification is mainly performed manually, but the manual workload is huge: websites must be identified and then matched one by one, so the identification efficiency is extremely low and mistakes are easily made in the process.
Therefore, how to quickly and accurately identify an application system website in the internet is a problem to be solved by those skilled in the art.
Disclosure of Invention
An object of the present application is to provide a website identification method that can quickly and accurately identify application system websites on the Internet; another object of the present application is to provide a website identification apparatus, system, communication server and computer-readable storage medium, which also have the above advantages.
In order to solve the above technical problem, the present application provides a website identification method, where the website identification method includes:
determining a website to be identified according to the received website identification instruction;
calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;
and identifying the website to be identified by using the preset website identification model, and determining the website category of the website to be identified.
Preferably, the establishing the preset website identification model based on the random forest algorithm includes:
acquiring a sample website;
acquiring website information according to the sample website;
extracting the characteristics of the website information to obtain specified characteristic information;
and performing model training on the specified characteristic information by using the random forest algorithm to obtain the preset website identification model.
Preferably, the obtaining website information according to the sample website includes:
and crawling information of the sample website by using a crawler technology to obtain the website information.
Preferably, before performing model training on the specified feature information by using the random forest algorithm to obtain the preset website recognition model, the method further includes:
and carrying out standardization processing on the specified characteristic information by using the mean value and the variance to obtain the standardized specified characteristic information.
Preferably, the performing model training on the specified feature information by using the random forest algorithm to obtain the preset website identification model includes:
generating a sample matrix and a type matrix corresponding to the sample matrix according to the specified characteristic information;
and performing random forest calculation on the sample matrix and the type matrix based on python to obtain the preset website identification model.
Preferably, the website identification method further includes:
and verifying the preset website identification model by using a cross verification method to obtain a first optimized website identification model.
Preferably, the website identification method further includes:
and optimizing the preset website identification model according to the website category of the website to be identified to obtain a second optimized website identification model.
In order to solve the above technical problem, the present application further provides a website identification apparatus, including:
the instruction receiving module is used for determining the website to be identified according to the received website identification instruction;
the model calling module is used for calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;
and the website identification module is used for identifying the website to be identified by using the preset website identification model and determining the website category of the website to be identified.
In order to solve the above technical problem, the present application further provides a website identification system, where the website identification system includes:
a memory for storing a computer program;
and the processor is used for realizing the steps of any one of the website identification methods when the computer program is executed.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of any one of the above website identification methods.
The website identification method comprises the steps of determining a website to be identified according to a received website identification instruction; calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm; and identifying the website to be identified by using the preset website identification model, and determining the website category of the website to be identified.
Therefore, according to the website identification method, the website identification model is established in advance by using the random forest algorithm, so that the unknown website can be automatically identified through the website identification model.
The website identification device, the website identification system and the computer readable storage medium provided by the application all have the beneficial effects, and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a website identification method provided in the present application;
Fig. 2 is a schematic flowchart of a website identification model construction method provided in the present application;
Fig. 3 is a flowchart of a website identification method provided in the present application;
Fig. 4 is a schematic structural diagram of a website identification apparatus provided in the present application;
Fig. 5 is a schematic structural diagram of a website identification system provided in the present application.
Detailed Description
The core of the application is to provide a website identification method, which can quickly and accurately identify each application system website in the Internet; another core of the present application is to provide a website identification apparatus, a device and a computer-readable storage medium, which also have the above-mentioned advantages.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a website identification method according to the present application, where the website identification method may include:
S101: determining a website to be identified according to the received website identification instruction;
This step determines the website to be identified, that is, a website whose type is unknown and needs to be identified. Specifically, when an unknown website needs to be identified, a website identification instruction naming that website as the website to be identified can be sent to the controller, so that the controller determines the website to be identified from the instruction and identifies its type.
S102: calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;
This step obtains the preset website identification model, which is a model established in advance based on a random forest algorithm and used to identify unknown website types. Once the website to be identified is determined, the preset website identification model can be called directly to perform website identification. After being established, the preset website identification model can be stored in a preset storage space, such as a memory or a magnetic disk, for later calls.
S103: identifying the website to be identified by using the preset website identification model, and determining the website category of the website to be identified.
This step performs the website identification: the called preset website identification model is applied directly to the website to be identified to obtain the identification result, which determines the website category of the website to be identified. The specific website categories do not affect the implementation of the technical solution and are not limited by the present application.
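Steps S101 to S103 can be sketched as a single function. This is a minimal illustration and not the applicant's implementation: the function name `identify_website`, the `fetch` and `extract` callables, and the scikit-learn style `predict()` interface of the model are all assumptions introduced here.

```python
from typing import Callable, Protocol, Sequence


class Classifier(Protocol):
    """Any object with a predict() method over feature rows (assumed interface)."""

    def predict(self, rows: Sequence[Sequence[float]]) -> Sequence[int]: ...


def identify_website(
    url: str,
    fetch: Callable[[str], str],
    extract: Callable[[str], Sequence[float]],
    model: Classifier,
) -> int:
    """Take the website named by the instruction, call the pre-built model,
    and return the predicted website category."""
    html = fetch(url)            # S101: obtain the page of the website to be identified
    features = extract(html)     # same feature dimensions the model was trained on
    # S102/S103: the preset model is passed in and applied to one feature row
    return int(model.predict([features])[0])
```

Passing the fetcher, the feature extractor and the model as arguments keeps the sketch independent of how each of them is actually built.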
As a preferred embodiment, the website identification method may further include: and optimizing the preset website identification model according to the website category of the website to be identified to obtain a second optimized website identification model.
This preferred embodiment aims to optimize the model. Specifically, after the website to be identified has been identified, the identification result can be used to optimize the preset website identification model and obtain an optimized website identification model, namely the second optimized website identification model, which improves the identification accuracy of the model and benefits subsequent website identification.
According to the website identification method, the website identification model is established by utilizing the random forest algorithm in advance, so that the unknown website can be automatically identified through the website identification model.
On the basis of the foregoing embodiments, the present application embodiment introduces details of the building process of the preset website identification model, please refer to fig. 2, and fig. 2 is a schematic flow diagram of a website identification model building method provided by the present application, where the website identification model building method includes:
S201: acquiring a sample website;
This step acquires the sample websites. In contrast to a website to be identified, a sample website is a website whose type is known, and it is used for model training. The number of sample websites is not fixed; it can be understood that the more sample websites there are, the higher the accuracy of the resulting preset website identification model. The sample websites can be divided into positive samples and negative samples, or can be website samples of several different known categories.
S202: acquiring website information according to a sample website;
This step acquires the website information, that is, the website information of each sample website is collected for model training. Any acquisition method in the prior art can be adopted, which is not limited in the present application.
Preferably, the obtaining of the website information according to the sample website may include: and crawling information of the sample website by using a crawler technology to obtain website information.
The preferred embodiment provides a specific website information obtaining method, which is realized based on a web crawler technology, and specifically, after obtaining each sample website, information crawling can be performed by using the web crawler to obtain website information of each sample website.
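A minimal sketch of such a crawler, using only the Python standard library, might look as follows. The source does not name a crawling tool, so `fetch_homepage`, `crawl`, the User-Agent string and the `opener` parameter are hypothetical names chosen for illustration.

```python
from urllib.request import Request, urlopen


def fetch_homepage(url, timeout=10.0, opener=urlopen):
    """Download the home page of one sample website and return it as text.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    request = Request(url, headers={"User-Agent": "sample-crawler/0.1"})  # hypothetical UA
    with opener(request, timeout=timeout) as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def crawl(urls, **kwargs):
    """Fetch every sample website, skipping the ones that cannot be fetched."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch_homepage(url, **kwargs)
        except (OSError, ValueError, LookupError):
            # OSError covers URLError and timeouts, ValueError a malformed URL,
            # LookupError an unknown charset name.
            continue
    return pages
```

Unreachable sites are skipped rather than aborting the run, on the assumption that a large sample set will always contain some dead hosts.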
S203: extracting the characteristics of the website information to obtain specified characteristic information;
This step performs feature extraction, that is, the information corresponding to the specified features, namely the specified feature information, is extracted from the website information. The specified features are preset by a technician according to actual needs, and their specific content is not limited in the present application. For example, the feature dimensions to be extracted may include the number of <a> tags on the home page, the number of <form> forms contained in <body>, the number of type="password" attributes in <body>, whether either of two "login" keyword variants is present, the length of the home page content after HTML tags are removed, the number of <img> tags, the number of <li> tags in the page, and the number of type="text" and type="password" attributes in the page body.
S204: and performing model training on the specified characteristic information by using a random forest algorithm to obtain a preset website identification model.
The step aims to realize model training, namely, the random forest algorithm is used for carrying out model training on each piece of extracted specified characteristic information, and the preset website recognition model can be obtained.
Preferably, before the model training is performed on the specified feature information by using the random forest algorithm to obtain the preset website recognition model, the method may further include: and carrying out standardization processing on the specified characteristic information by using the mean value and the variance to obtain the standardized specified characteristic information.
In order to further improve the recognition accuracy and the training speed of the preset website recognition model, before the model training is carried out by using the specified characteristic information, the further standardization processing can be carried out on each specified characteristic information, wherein the standardization process can be realized by mean value calculation and variance calculation, and further standardized specified characteristic information is obtained, so that the subsequent model training can be carried out based on the standardized specified characteristic information.
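The mean and variance standardization described here is commonly the z-score, in which each feature column has its mean subtracted and is divided by its standard deviation. The following is a minimal sketch under that assumption; the function name `standardize` and the handling of constant columns are choices made for illustration.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Return the z-scored rows plus the per-column means and standard deviations."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; divide by 1 instead to avoid an error.
    stds = [pstdev(col) or 1.0 for col in columns]
    scaled = [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds
```

The means and deviations are returned as well because a website identified later has to be scaled with the statistics of the training data, not with its own.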
Preferably, the performing model training on the specified feature information by using a random forest algorithm to obtain the preset website recognition model may include: generating a sample matrix and a type matrix corresponding to the sample matrix according to the specified characteristic information; and performing random forest calculation on the sample matrix and the type matrix based on python to obtain a preset website identification model.
The preferred embodiment provides a specific model training method based on a random forest algorithm, which comprises the steps of firstly, generating a sample matrix and a type matrix corresponding to the sample matrix by utilizing each piece of specified characteristic information, and further, executing the random forest algorithm on the sample matrix by utilizing python to obtain a corresponding preset website identification model.
Preferably, the website identification method may further include: and verifying the preset website identification model by using a cross verification method to obtain a first optimized website identification model.
In order to further improve the model identification precision and ensure the accuracy of the identification result, the preset website identification model can be verified after it is constructed. This can be done by a cross-validation method, which yields the optimized website identification model, namely the first optimized website identification model. Cross-validation is generally applied during modeling: from a given set of modeling samples, most samples are used to build the model and a small part is held out; the newly built model then predicts the held-out samples, and the sum of the squared prediction errors is recorded. This can effectively improve the stability of the model.
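The hold-out-and-predict procedure described above can be sketched as k-fold cross-validation in plain Python. The number of folds, the shuffling seed, and the `fit` and `predict` callables are assumptions made for this illustration; the source does not specify how the samples are partitioned.

```python
import random


def k_fold_squared_error(rows, labels, fit, predict, k=5, seed=0):
    """Sum of squared prediction errors over k held-out folds.

    `fit(train_rows, train_labels)` returns a model and
    `predict(model, row)` returns a label; both are supplied by the caller.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal parts
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Build the model on the larger part of the samples ...
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        # ... and predict the small reserved part with it.
        for i in held_out:
            total += (predict(model, rows[i]) - labels[i]) ** 2
    return total
```

Every sample is held out exactly once, so for 0/1 labels the returned sum equals the number of misclassified samples.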
The embodiment of the application provides a specific website identification model construction method, so that automatic identification of a website to be identified is realized through the website identification model, and the website identification efficiency and the accuracy of an identification result are further improved.
On the basis of the foregoing embodiments, the present application embodiment provides a more specific website identification method by taking identification of an application system website as an example, and with reference to fig. 3, fig. 3 is a flowchart of the website identification method provided by the present application, and a specific implementation flow thereof is as follows:
Firstly, website home pages on the Internet are collected and their home page information is crawled. Specifically, a batch of application system websites is manually collected in advance as positive samples, and a batch of websites that are not application systems is collected as negative samples and recorded as portal websites; the body of each website home page is then crawled according to the collected sample data.
Further, the feature dimensions to be acquired are set. There are 8 feature dimensions in total: the number of <a> tags on the home page; the number of <form> forms contained in <body>; the number of type="password" attributes contained in <body>; whether either of two "login" keyword variants is present; the length of the home page content after HTML tags are removed; the number of <img> tags; the number of <li> tags in the page; and the number of type="text" and type="password" attributes in the page body.
Wherein, the 8 feature dimensions are acquired as follows:
(1) Based on how complete the home pages of application system websites and portal websites are, the first feature dimension is the number of <a> tags on the home page. Generally, a portal website page has many <a> tags, and an application system website page has few.
(2) Counting the number of <form> forms contained in the <body> of the website home page. Generally, an application system website contains at least one <form> form for submitting login information.
(3) Counting the number of type="password" attributes in <body>. Generally, the home page of an application system website includes a password field. Because the Chinese word for "password" may be written in various ways on different websites, the type attribute of the tag is counted instead.
(4) Matching the "login" keywords. If the home page carries either keyword variant, the feature is recorded as 1, corresponding to an application system website; otherwise it is recorded as 0, corresponding to a portal website.
(5) Counting the length of the home page content. Generally, a portal website page is rich in content and its content is long, whereas an application system website page is not rich in content and its content is short.
(6) Counting the number of <img> tags. Generally, an application system website page is simple and contains only a few pictures, whereas a portal website page is rich in pictures.
(7) Counting the number of <li> tags in the page. Generally, a portal website has navigation bars, and navigation bars are usually composed of <li> elements.
(8) Counting the number of type="text" and type="password" attributes in the page body. Generally, if the page of an application system website has a login box, there are at least two such inputs, whereas a portal website has only one. Therefore, if the sum of the number of type="text" and the number of type="password" is greater than or equal to 2, the feature is recorded as 1, corresponding to an application system website; otherwise it is recorded as 0, corresponding to a portal website.
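The eight feature dimensions listed above could be computed with the standard-library HTML parser, as in the following sketch. It makes several assumptions that are not in the source: tags are counted over the whole document rather than only inside <body>, script and style text is excluded from the content length, whitespace is dropped before measuring the length, the keyword set defaults to the single English word "login", and the output order follows the numbered list.

```python
from html.parser import HTMLParser


class HomepageFeatures(HTMLParser):
    """Counts the tags and attributes behind the eight feature dimensions."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.counts = {"a": 0, "form": 0, "img": 0, "li": 0}
        self.password_inputs = 0
        self.text_inputs = 0
        self.text_parts = []
        self._skip = 0   # depth inside <script>/<style>, whose text is not page content

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        if tag in ("script", "style"):
            self._skip += 1
        # Count type="password" / type="text" on any tag, as the description does.
        input_type = (dict(attrs).get("type") or "").lower()
        if input_type == "password":
            self.password_inputs += 1
        elif input_type == "text":
            self.text_inputs += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def extract_features(html, keywords=("login",)):
    """Return the eight features of one home page in the order (1) to (8)."""
    parser = HomepageFeatures()
    parser.feed(html)
    parser.close()
    visible = "".join("".join(parser.text_parts).split())   # text without tags or whitespace
    has_keyword = int(any(k in html.lower() for k in keywords))
    login_box = int(parser.text_inputs + parser.password_inputs >= 2)
    return [
        parser.counts["a"], parser.counts["form"], parser.password_inputs,
        has_keyword, len(visible), parser.counts["img"], parser.counts["li"], login_box,
    ]
```

A production extractor would probably pass the site's own keyword variants through the `keywords` parameter and restrict the counts to the page body.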
Further, training data is obtained according to the determined feature dimension, for example, the following data are obtained as the training data:
negative sample data: portal website (0)
68 27 12 0 2 0 0 1167 http://www.zifi.cn
21 11 0 1 0 0 0 1801 http://tdtri.org
139 24 90 1 2 0 0 2884 http://renzefoundation.org
98 0 57 0 1 0 0 893 http://47.105.132.13
16 16 9 0 0 0 0 446 http://www.vis-top.com
97 54 7 0 1 0 0 1382 http://m.focus.sinorusfocus.com
18 9 0 0 1 0 0 90 http://youzhi.sdm.net.cn
70 0 0 0 0 0 0 378 http://www.xcryedu.com
Positive sample data: application system (1)
1 1 5 0 1 1 0 93 http://jszb.nhfpc.gov.cn
6 2 1 1 1 1 1 41 http://eip.zgyj.org.cn
2 1 2 1 1 0 0 207 http://cx.bjmzdx.org:8089
7 2 0 0 1 1 1 115 http://caers.org.cn
11 1 0 1 1 0 1 621 http://taya1.anji.gov.cn
4 1 0 1 2 0 1 110 http://i.yyszx.com
1 4 0 1 1 1 1 43 http://47.104.233.200:85
0 0 0 1 1 1 0 33 http://47.104.242.253:8088
4 0 5 1 1 1 1 149 http://47.105.137.228
1 3 0 1 1 1 0 35 http://47.105.170.68:81
Further, python is used to train the model with the random forest algorithm, and the training process comprises the following steps:
(1) Loading the training data and normalizing it. Some feature dimensions take particularly large values, such as the length of the website content, while others take particularly small values, such as only 1 or 0; this difference in scale leads to a slower convergence rate, so the training data can be normalized by means of the mean and variance. Further, a sample matrix and a type matrix are generated from the normalized training data, for example: X matrix (sample matrix): [[-0.4909595, -0.44880872, -0.40717837, ..., 0.72868315, 1.10401228, -0.13842202], ...]; Y matrix (type matrix): the matrix of labels 0 and 1 corresponding to the negative and positive samples.
(2) Training with python to generate the website identification model. Further, in the data test, the output result is 0 or 1, where 1 represents an application system website and 0 represents a portal website.
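The two training steps can be sketched with scikit-learn, which is an assumption: the source only says that python is used and does not name a library. The feature rows are copied from the sample data listed earlier (URLs omitted); the number of trees and the random seed are illustrative values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Feature rows copied from the negative and positive sample data above.
negative = [  # portal websites, label 0
    [68, 27, 12, 0, 2, 0, 0, 1167], [21, 11, 0, 1, 0, 0, 0, 1801],
    [139, 24, 90, 1, 2, 0, 0, 2884], [98, 0, 57, 0, 1, 0, 0, 893],
    [16, 16, 9, 0, 0, 0, 0, 446], [97, 54, 7, 0, 1, 0, 0, 1382],
    [18, 9, 0, 0, 1, 0, 0, 90], [70, 0, 0, 0, 0, 0, 0, 378],
]
positive = [  # application system websites, label 1
    [1, 1, 5, 0, 1, 1, 0, 93], [6, 2, 1, 1, 1, 1, 1, 41],
    [2, 1, 2, 1, 1, 0, 0, 207], [7, 2, 0, 0, 1, 1, 1, 115],
    [11, 1, 0, 1, 1, 0, 1, 621], [4, 1, 0, 1, 2, 0, 1, 110],
    [1, 4, 0, 1, 1, 1, 1, 43], [0, 0, 0, 1, 1, 1, 0, 33],
    [4, 0, 5, 1, 1, 1, 1, 149], [1, 3, 0, 1, 1, 1, 0, 35],
]
X = np.array(negative + positive, dtype=float)             # sample matrix
y = np.array([0] * len(negative) + [1] * len(positive))    # type matrix

model = make_pipeline(
    StandardScaler(),                                       # mean/variance standardization
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)                              # each entry is 0 or 1
```

Wrapping the scaler and the forest in one pipeline means a website identified later is scaled with the training statistics automatically. Predicting on the training rows, as done here, only shows the output format and says nothing about accuracy on unseen websites.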
Further, the website identification model is optimized:
(1) Identifying the website to be identified by using the website identification model, and adding the identification result to the training data for retraining, to obtain an optimized website identification model (the second optimized website identification model).
(2) Verifying the website identification model by using a cross-validation technology to obtain an optimized website identification model (the first optimized website identification model).
(3) Presetting model parameters and calculating the optimal parameters of the website identification model, so that the training data is retrained with the optimal parameters to obtain an optimized website identification model.
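One common way to realize item (3) is a grid search scored by cross-validation. The sketch below uses scikit-learn's `GridSearchCV`, which is an assumption since the source names no tool; the candidate parameter values are illustrative, and the data is synthetic stand-in data with 8 features rather than data from the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 8 features; not taken from the source.
X, y = make_classification(n_samples=60, n_features=8, random_state=0)

param_grid = {                      # preset candidate parameters (illustrative values)
    "n_estimators": [10, 50],
    "max_depth": [None, 3],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                           # each candidate is scored by 3-fold cross-validation
)
search.fit(X, y)                    # refits the best candidate on all data by default
best_model = search.best_estimator_
best_params = search.best_params_
```

Because the search refits the best candidate on the full training data, `best_model` can be used directly as the optimized website identification model.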
Therefore, according to the website identification method provided by the embodiment of the application, the website identification model is established in advance by using the random forest algorithm, so that the unknown website can be automatically identified through the website identification model.
To solve the above problem, please refer to fig. 4, fig. 4 is a schematic structural diagram of a website recognition apparatus provided in the present application, where the website recognition apparatus may include:
the instruction receiving module 10 is configured to determine a website to be identified according to the received website identification instruction;
the model calling module 20 is used for calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;
the website identification module 30 is configured to identify a website to be identified by using a preset website identification model, and determine a website category of the website to be identified.
Therefore, the website recognition device provided by the embodiment of the application establishes the website recognition model by using the random forest algorithm in advance, so that the unknown website can be automatically recognized through the website recognition model.
As a preferred embodiment, the website identification apparatus may further include a model building module, and the model building module may include:
the sample acquisition unit is used for acquiring a sample website;
the information acquisition unit is used for acquiring website information according to the sample website;
the characteristic extraction unit is used for extracting the characteristics of the website information to obtain specified characteristic information;
and the model training unit is used for performing model training on the specified characteristic information by using a random forest algorithm to obtain a preset website identification model.
As a preferred embodiment, the information collecting unit may be specifically configured to perform information crawling on a sample website by using a crawler technology to obtain website information.
As a preferred embodiment, the model building module may further include a normalization unit, configured to perform normalization processing on the specified feature information by using the mean and the variance, so as to obtain normalized specified feature information.
As a preferred embodiment, the model training unit may be specifically configured to generate a sample matrix and a type matrix corresponding to the sample matrix according to the specified feature information; and performing random forest calculation on the sample matrix and the type matrix based on python to obtain a preset website identification model.
As a preferred embodiment, the website identification apparatus may further include a first model optimization module, configured to verify a preset website identification model by using a cross-validation method, so as to obtain a first optimized website identification model.
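One plausible reading of this step is to compare candidate model settings by k-fold cross-validated accuracy and keep the best one. The sketch below does this with scikit-learn's `cross_val_score`; the library, the synthetic data, the five folds, and the candidate values of `n_estimators` are assumptions of the example rather than details from the application.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Two synthetic, well-separated clusters standing in for two website categories.
X = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 4)),
    rng.normal(3.0, 0.3, size=(30, 4)),
])
y = np.array([0] * 30 + [1] * 30)

# Mean 5-fold accuracy for each candidate number of trees.
candidates = [10, 50]
results = {
    n: cross_val_score(
        RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5
    ).mean()
    for n in candidates
}
best_n = max(results, key=results.get)

# Refit the selected configuration on all samples to obtain the optimized model.
optimized_model = RandomForestClassifier(n_estimators=best_n, random_state=0).fit(X, y)
```

Because every sample is used for validation exactly once, the averaged score is a less optimistic estimate of accuracy than the score on the training data itself.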
As a preferred embodiment, the website identification apparatus may further include a second model optimization module, configured to perform optimization processing on the preset website identification model according to the website category of the website to be identified, so as to obtain a second optimized website identification model.
For the introduction of the apparatus provided in the present application, please refer to the above method embodiments, which are not described herein again.
To solve the above problem, the present application further provides a website identification system. Referring to fig. 5, which is a schematic structural diagram of a website identification system provided by the present application, the website identification system may include:
a memory 1, configured to store a computer program;
a processor 2, configured to implement the steps of any one of the above website identification methods when executing the computer program.
For the introduction of the system provided in the present application, please refer to the above method embodiments, which are not described herein again.
In order to solve the above problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, can implement the steps of any one of the above website identification methods.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
For the introduction of the computer-readable storage medium provided in the present application, please refer to the above method embodiments, which are not described herein again.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The website identification method, apparatus, system, and computer-readable storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the description of the embodiments is intended only to help in understanding the method and core idea of the present application. It should be noted that those skilled in the art may make several improvements and modifications to the present application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims (10)

1. A website identification method, comprising:
determining a website to be identified according to the received website identification instruction;
calling a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;
and identifying the website to be identified by using the preset website identification model, and determining the website category of the website to be identified.
2. The website identification method according to claim 1, wherein establishing the preset website identification model based on the random forest algorithm comprises:
acquiring a sample website;
acquiring website information according to the sample website;
extracting features from the website information to obtain specified feature information;
and performing model training on the specified feature information by using the random forest algorithm to obtain the preset website identification model.
3. The website identification method according to claim 2, wherein the acquiring website information according to the sample website comprises:
crawling information from the sample website by using a crawler technology to obtain the website information.
4. The website identification method according to claim 2, wherein before performing model training on the specified feature information by using the random forest algorithm to obtain the preset website identification model, the method further comprises:
standardizing the specified feature information by using the mean and the variance to obtain standardized specified feature information.
5. The website identification method according to claim 2, wherein the performing model training on the specified feature information by using the random forest algorithm to obtain the preset website identification model comprises:
generating a sample matrix and a type matrix corresponding to the sample matrix according to the specified feature information;
and performing random forest calculation on the sample matrix and the type matrix based on Python to obtain the preset website identification model.
6. The website identification method of claim 2, further comprising:
verifying the preset website identification model by using a cross-validation method to obtain a first optimized website identification model.
7. The website identification method according to any one of claims 1 to 6, further comprising:
and optimizing the preset website identification model according to the website category of the website to be identified to obtain a second optimized website identification model.
8. A website identification apparatus, comprising:
an instruction receiving module, configured to determine a website to be identified according to a received website identification instruction;
a model calling module, configured to call a preset website identification model, wherein the preset website identification model is established based on a random forest algorithm;
and a website identification module, configured to identify the website to be identified by using the preset website identification model and determine the website category of the website to be identified.
9. A website identification system, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the website identification method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, carries out the steps of the website identification method according to any one of claims 1 to 7.
CN201911166434.8A 2019-11-25 2019-11-25 Website identification method, device and system and computer readable storage medium Pending CN111008347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911166434.8A CN111008347A (en) 2019-11-25 2019-11-25 Website identification method, device and system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111008347A (en) 2020-04-14

Family

ID=70112768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911166434.8A Pending CN111008347A (en) 2019-11-25 2019-11-25 Website identification method, device and system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111008347A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112468503A (en) * 2020-11-30 2021-03-09 苏州浪潮智能科技有限公司 Website authentication method, device, equipment and medium based on firewall

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521604A (en) * 2011-11-21 2012-06-27 上海交通大学 Device and method for estimating performance degradation of equipment based on inspection system
CN107341272A (en) * 2017-08-25 2017-11-10 北京奇艺世纪科技有限公司 A kind of method for pushing, device and electronic equipment
CN107566391A (en) * 2017-09-20 2018-01-09 上海斗象信息科技有限公司 Domain identification plus the method for the topic identification structure machine learning model detection dark chain of webpage
CN108875060A (en) * 2018-06-29 2018-11-23 成都市映潮科技股份有限公司 A kind of website identification method and identifying system
CN109146080A (en) * 2018-09-14 2019-01-04 苏州正载信息技术有限公司 The method of model realization framework based on supervision class machine learning algorithm
CN109165587A (en) * 2018-08-11 2019-01-08 石修英 intelligent image information extraction method
CN109710825A (en) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 Webpage harmful information identification method based on machine learning
CN110061975A (en) * 2019-03-29 2019-07-26 中国科学院计算技术研究所 A kind of counterfeit website identification method and system based on offline flow Packet analyzing
CN110334262A (en) * 2019-06-06 2019-10-15 阿里巴巴集团控股有限公司 A kind of model training method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200414