CN112328936A

CN112328936A - Website identification method, device and equipment and computer readable storage medium

Info

Publication number: CN112328936A
Application number: CN202011203397.6A
Authority: CN
Inventors: 宋建昌; 孙学军
Original assignee: Hangzhou Anheng Information Security Technology Co Ltd
Current assignee: Hangzhou Anheng Information Security Technology Co Ltd
Priority date: 2020-11-02
Filing date: 2020-11-02
Publication date: 2021-02-05

Abstract

The application discloses a website identification method, which comprises the steps of carrying out website detection according to an identification instruction to obtain a target website with preset keywords; extracting the features of each webpage in the target website to obtain target feature information; evaluating the target website according to the target characteristic information to obtain a website evaluation value; determining the website type of the target website according to the website evaluation value; the website identification method can realize website identification more quickly and accurately and ensure network safety. The application also discloses a website identification device, equipment and a computer readable storage medium, which have the beneficial effects.

Description

Website identification method, device and equipment and computer readable storage medium

Technical Field

The present application relates to the field of internet technologies, and in particular, to a website identification method, and further, to a website identification apparatus, a device, and a computer-readable storage medium.

Background

With the rapid development of internet technology, economic crimes have gradually combined with the internet and evolved into new forms of crime, among which pyramid selling (referred to below as network marketing) has a relatively large impact. Network marketing is concealed in operation, spreads widely, crosses regions, and can be quickly replicated online, so it is very difficult to discover and combat; the large number of websites makes identifying such websites on the internet even harder. In the related art, suspected network marketing targets are mostly obtained through internet public opinion monitoring, offline intelligence gathering, and similar means, but these approaches have a low hit rate and low accuracy, and can hardly guarantee network security or the security of user information.

Therefore, how to more quickly and accurately realize website identification and ensure network security is a problem to be urgently solved by technical personnel in the field.

Disclosure of Invention

An object of the present application is to provide a website identification method that can realize website identification more quickly and accurately and ensure network security; another object of the present application is to provide a website identification apparatus, a website identification device, and a computer-readable storage medium, all of which have the above advantages.

In a first aspect, the present application provides a website identification method, including:

detecting websites according to the identification instruction to obtain a target website with preset keywords;

extracting the features of each webpage in the target website to obtain target feature information;

evaluating the target website according to the target characteristic information to obtain a website evaluation value;

and determining the website type of the target website according to the website evaluation value.

Preferably, the website detecting according to the identification instruction to obtain the target website with the preset keyword includes:

and when the identification instruction is received, website detection is carried out by utilizing a web crawler technology, and the target website with the preset keywords is obtained.

Preferably, before performing feature extraction on each webpage in the target website to obtain target feature information, the method further includes:

and traversing the target website through the web crawler technology to obtain each webpage in the target website.

Preferably, the extracting the features of each webpage in the target website to obtain the target feature information includes:

extracting the features of the webpages by using a webpage structured parsing technology to obtain recommended placement relationship features; or

extracting the features of the webpages by using a preset feature extraction algorithm to obtain the target text features.

Preferably, the website identification method further includes:

obtaining ICP filing information of the target website;

determining a filing user of the target website according to the ICP filing information;

inquiring the network information of the filing user;

and extracting the characteristics of the network information to obtain the target characteristic information.

Preferably, the determining the website type of the target website according to the website evaluation value includes:

comparing the website evaluation value with a preset confidence coefficient to obtain a comparison result;

and determining the website type of the target website according to the comparison result.

In a second aspect, the present application further discloses a website identification apparatus, including:

the website detection module is used for detecting websites according to the identification instruction to obtain target websites with preset keywords;

the characteristic extraction module is used for extracting the characteristics of each webpage in the target website to obtain target characteristic information;

the website evaluation module is used for evaluating the target website according to the target characteristic information to obtain a website evaluation value;

and the website identification module is used for determining the website type of the target website according to the website evaluation value.

In a third aspect, the present application further discloses a website identification device, including:

a memory for storing a computer program;

a processor for implementing the steps of any of the above described website identification methods when executing the computer program.

In a fourth aspect, the present application further discloses a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of any of the website identification methods described above.

The website identification method comprises the steps of detecting a website according to an identification instruction to obtain a target website with preset keywords; extracting the features of each webpage in the target website to obtain target feature information; evaluating the target website according to the target characteristic information to obtain a website evaluation value; and determining the website type of the target website according to the website evaluation value.

Therefore, the website identification method provided by the application screens target websites through website detection based on preset keywords, evaluates each target website based on features extracted for the specified feature types, and thereby completes website type identification. Compared with manual identification in the prior art, this implementation is faster and more convenient: internet websites can be identified automatically without manual operation, the accuracy of the identification result is effectively ensured, and network security is further improved. In addition, the implementation is suitable for identifying various types of websites and has high applicability.

The website identification device, the equipment and the computer readable storage medium provided by the application all have the beneficial effects, and are not described herein again.

Drawings

In order to more clearly illustrate the technical solutions in the prior art and the embodiments of the present application, the drawings that are needed to be used in the description of the prior art and the embodiments of the present application will be briefly described below. Of course, the following description of the drawings related to the embodiments of the present application is only a part of the embodiments of the present application, and it will be obvious to those skilled in the art that other drawings can be obtained from the provided drawings without any creative effort, and the obtained other drawings also belong to the protection scope of the present application.

Fig. 1 is a schematic flowchart of a website identification method provided in the present application;

Fig. 2 is a schematic flowchart of another website identification method provided in the present application;

Fig. 3 is a schematic structural diagram of a website identification apparatus provided in the present application;

Fig. 4 is a schematic structural diagram of a website identification device provided in the present application.

Detailed Description

The core of the application is to provide a website identification method, which can realize website identification more quickly and accurately and ensure network safety; another core of the present application is to provide a website identification apparatus, a device and a computer-readable storage medium, which also have the above-mentioned advantages.

In order to more clearly and completely describe the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a website identification method according to the present application, where the website identification method includes:

s101: detecting websites according to the identification instruction to obtain a target website with preset keywords;

This step aims to obtain the target website by performing website detection based on preset keywords, where the target website refers to an internet website of a certain type that needs to be identified, such as a shopping website, a friend-making website, or a marketing website. Of course, the number of target websites is not limited to one: because website detection is performed on websites across the internet, multiple target websites are generally obtained.

Specifically, when the identification instruction is received, website detection of preset keywords can be performed on each website in the internet, and a target website with the preset keywords is obtained. The preset keywords correspond to target websites, different target websites may correspond to different preset keywords, and when a certain category of websites needs to be identified, the preset keywords corresponding to the websites are adopted. For example, the predetermined keyword corresponding to the identification of the shopping website may be "price", "coupon", "return" or the like, the predetermined keyword corresponding to the identification of the friend-making website may be "friend-making", "age", "interest" or the like, and the predetermined keyword corresponding to the identification of the marketing website may be "marketing", "fraud" or the like. In addition, the specific content of the preset keyword may be obtained by analyzing the identification instruction, or may be directly input and set by a technician, which is not limited in the present application.

As a preferred embodiment, the detecting the website according to the identification instruction to obtain the target website with the preset keyword may include: and when the identification instruction is received, performing website detection by using a web crawler technology to obtain a target website with preset keywords.

The preferred embodiment provides a specific method for acquiring target websites, implemented based on web crawler technology: when an identification instruction is received, a web crawler can be used directly to crawl the internet for target websites containing the preset keywords. A web crawler is a program or script that automatically captures World Wide Web information according to certain rules; it offers high efficiency, wide coverage, and high accuracy.
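The keyword screening in S101 can be illustrated with a minimal Python sketch. It assumes the crawler has already fetched each site's home page, and it uses the standard-library `html.parser` to look only at visible text. The function names `matched_keywords` and `detect_target_websites` and the example URLs are hypothetical and do not come from the application.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def matched_keywords(html, preset_keywords):
    """Return the preset keywords that occur in the page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return [kw for kw in preset_keywords if kw.lower() in text]


def detect_target_websites(pages, preset_keywords):
    """pages maps a site URL to its already-fetched home-page HTML.
    A site is a target website if at least one preset keyword appears."""
    return [url for url, html in pages.items()
            if matched_keywords(html, preset_keywords)]
```

Substring matching is the simplest possible criterion; a real deployment might require several keywords to match or apply word segmentation first.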

S102: extracting the characteristics of each webpage in the target website to obtain target characteristic information;

This step aims to realize feature extraction and obtain the target feature information of the target website, specifically by performing feature extraction on each webpage in the target website. As with the preset keywords, different types of websites require different feature information, and the features to be extracted are set according to the type of website to be identified. For example, for a shopping website the target feature information may be transaction ordering features, user evaluation features, and the like; for a marketing website it may be recommended placement relationship features, bonus system features, and the like. It should be noted that the specific implementation method of feature extraction does not affect the implementation of the present technical solution, and a technician may set it according to the actual situation, which is not limited in the present application.

As a preferred embodiment, before the performing the feature extraction on each webpage in the target website to obtain the target feature information, the method may further include: and traversing the target website through a web crawler technology to obtain each webpage in the target website.

Specifically, before feature extraction is performed on the webpage content, each webpage in the target website needs to be acquired, and this acquisition can be implemented based on web crawler technology. Of course, this webpage acquisition method is only one implementation provided in the preferred embodiment; acquisition may also be implemented by other technologies, such as regular expressions, which is not limited in the present application.
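One way to picture the traversal is a breadth-first walk over same-host links. The sketch below is only an illustration under stated assumptions: `fetch` is a hypothetical callable supplied by the caller (URL in, HTML string or `None` out), `max_pages` is an illustrative safety limit, and restricting the walk to the start URL's host is a design choice, not something the application specifies.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def traverse_site(start_url, fetch, max_pages=100):
    """Breadth-first traversal; returns {url: html} for same-host pages.
    fetch is a hypothetical callable: url -> HTML string, or None on failure."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        extractor = _LinkExtractor()
        extractor.feed(html)
        extractor.close()
        for href in extractor.links:
            # Resolve relative links and drop fragments before deduplicating.
            target = urljoin(url, href).split("#", 1)[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic independent of any particular HTTP client and makes it easy to exercise against an in-memory site.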

As a preferred embodiment, the extracting the features of the web pages in the target website to obtain the target feature information may include: and extracting the features of each webpage by utilizing a webpage structural analysis technology to obtain the recommended arrangement relationship features.

The preferred embodiment provides a specific type of feature information, namely recommended placement relationship features, which can be obtained from the target website through a webpage structured parsing technology. Specifically, on a marketing website, a user who registers generally needs to submit a recommendation code, a placement relationship person, or the like in a form input box, so the form input boxes can be identified through page structured parsing to obtain the recommended placement relationship features. Page structured parsing is a technology for structuring HTML text according to XML structure rules; common tools include Dom4j, Jsoup, and the like.
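The form-input check can be sketched as follows. The text names Dom4j and Jsoup as typical tools; this sketch swaps in Python's standard-library `html.parser` instead, so it is a stand-in rather than the tooling the application describes. The keyword list `PLACEMENT_KEYWORDS` and the choice to inspect the `name`, `id`, and `placeholder` attributes are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; a real deployment would supply its own.
PLACEMENT_KEYWORDS = ("referral", "recommender", "placement", "sponsor")


class _FormInputCounter(HTMLParser):
    """Counts <input> elements inside a <form> whose attributes mention a keyword."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = [k.lower() for k in keywords]
        self.form_depth = 0
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth:
            # Assumption: hints sit in the name, id, or placeholder attribute.
            fields = dict(attrs)
            hint = " ".join(
                (fields.get(k) or "") for k in ("name", "id", "placeholder")
            ).lower()
            if any(kw in hint for kw in self.keywords):
                self.count += 1

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth:
            self.form_depth -= 1


def count_placement_inputs(html, keywords=PLACEMENT_KEYWORDS):
    """Return how many form input boxes are associated with the keywords."""
    parser = _FormInputCounter(keywords)
    parser.feed(html)
    parser.close()
    return parser.count
```

The returned count is the kind of quantity a later scoring step could consume as the number of matching form input boxes.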

As a preferred embodiment, the extracting the features of the web pages in the target website to obtain the target feature information may include: and extracting the features of each webpage by using a preset feature extraction algorithm to obtain the target text features.

The preferred embodiment provides another specific type of feature information, namely target text features, which can be obtained from the target website through a preset feature extraction algorithm. Like the preset keywords, the target text features may be text information of a certain type of feature specified in advance. The preset feature extraction algorithm may be any algorithm that can be used to extract feature information, such as an article relevance algorithm, a support vector machine algorithm, a semantic analysis algorithm, a picture recognition algorithm, and the like.

As a preferred embodiment, the website identification method may further include: obtaining ICP filing information of a target website; determining a filing user of a target website according to ICP (Internet Content Provider) filing information; inquiring network information of a filing user; and extracting the characteristics of the network information to obtain target characteristic information.

The preferred embodiment provides another way of obtaining the target feature information, namely based on ICP filing information. An ICP (Internet Content Provider) is an operator that provides internet information services and value-added services to users, and is a formally operating enterprise or department approved by the competent national authority. The state applies a licensing system to commercial internet information services and a filing system to non-commercial internet information services; an entity that has neither obtained a license nor completed the filing procedure may not engage in internet information services. Enterprises, public institutions, and individuals can therefore be associated with internet websites through ICP filing information. Specifically, the filing user of the target website is first determined according to the ICP filing information, and feature extraction is then performed on the network information of the filing user to obtain the target feature information, where the network information of the filing user includes, but is not limited to, related online public opinion, promotional articles (soft texts), and the like.

S103: evaluating the target website according to the target characteristic information to obtain a website evaluation value;

This step aims to realize website evaluation: the target website is evaluated according to the extracted target feature information to obtain a website evaluation value, which is then used to identify the website type of the target website. Specifically, in the website evaluation process, the extracted target feature information may be digitized through a preset algorithm model to obtain the website evaluation value.

S104: and determining the website type of the target website according to the website evaluation value.

The step aims to determine the website type, that is, whether the target website belongs to a certain specific type of website is determined according to the website evaluation value, and the determination may be specifically realized by referring to a preset evaluation grade table, a standard evaluation value and the like.

As a preferred embodiment, the determining the website type of the target website according to the website evaluation value may include: comparing the website evaluation value with a preset confidence coefficient to obtain a comparison result; and determining the website type of the target website according to the comparison result.

The preferred embodiment provides a specific website type determination method, which is implemented based on preset confidence level, and determines the website type of a target website by comparing the website evaluation value with the preset confidence level. For example, when the website evaluation value exceeds the preset confidence level, the target website is considered to belong to a certain type of website, otherwise, the target website is judged not to belong to the certain type. Of course, the specific value of the preset confidence does not affect the implementation of the technical scheme, and the technical staff can set the value according to the actual situation, which is not limited in the present application.
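As a minimal sketch of this comparison, the function below returns the candidate type when the evaluation value exceeds the preset confidence and `None` otherwise. The function name, the `target_type` parameter, and the use of a strict "greater than" comparison are illustrative assumptions; the application leaves the concrete threshold value to the technician.

```python
def determine_website_type(evaluation_value, preset_confidence, target_type):
    """Return target_type if the evaluation value exceeds the preset
    confidence, otherwise None (the site is not judged to be of that type)."""
    exceeds = evaluation_value > preset_confidence  # the comparison result
    return target_type if exceeds else None
```

For example, `determine_website_type(7.2, 5.0, "shopping")` yields `"shopping"`, while a value equal to the threshold yields `None` under the strict comparison chosen here.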

The embodiment of the application provides another website identification method.

The website identification method provided in this embodiment of the present application is introduced by taking the identification of a marketing (pyramid-selling) website as an example. Referring to fig. 2, fig. 2 is a schematic flowchart of the website identification method provided in the present application, where a specific implementation flow of the website identification method may include:

Step one: detect websites on the internet through web crawler technology, and search for websites (target websites) containing marketing (pyramid-selling) keywords or recommended placement relationship keywords.

Step two: traverse the qualifying websites using deep crawling technology to obtain the content of all sub-pages of each website.

Step three: identify and acquire the recommended placement feature information from all page content under the website using webpage structured parsing technology:

(1) the page is subjected to structured processing through a webpage structured analysis technology, and a Jsoup tool can be used for finding the contents of the form and the input box in the page more quickly;

(2) identifying and obtaining recommended placement characteristic information: and analyzing an input box in each form in the page, judging whether the input box is associated with the recommended arrangement relation key words, and if so, determining that the input box accords with the recommended arrangement characteristics.

Step four: analyze the page content using the marketing bonus system identification technology and judge whether the website has a bonus system. This identification technology is an intelligent method for identifying the bonus system of a marketing project based on picture recognition and semantic analysis: suspected bonus system text can be obtained through new word discovery, an article relevance algorithm, feature extraction, a support vector machine, and the like, and the bonus system can then be accurately judged through picture recognition and a digital gradient matrix.

Step five: identify and obtain bonus system features using ICP filing information:

(1) obtain the ICP filing information of the website;

(2) obtain the operating entity (enterprise) through the ICP filing information and, if it can be obtained, query related internet public opinion and promotional soft texts according to the filing company;

(3) identify whether the public opinion and soft texts contain a marketing bonus system by using the marketing bonus system identification technology;

(4) for webpages that contain a bonus system, obtain and record the pagerank value (webpage ranking) of the webpage on the search engine.
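The four sub-steps of step five can be sketched as a small function that wires together externally supplied lookups. Everything passed in is hypothetical: `lookup_icp`, `search_mentions`, `has_bonus_system`, and `get_pagerank` stand for whatever filing database, search service, bonus-system classifier, and ranking source a deployment actually has. None of them is an API named in the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FilingEvidence:
    filing_entity: Optional[str]                  # operating entity from the ICP filing
    pageranks: List[float] = field(default_factory=list)  # one value per matching page


def collect_filing_evidence(domain, lookup_icp, search_mentions,
                            has_bonus_system, get_pagerank):
    """Hypothetical callables supplied by the caller:
    lookup_icp(domain) -> entity name or None
    search_mentions(entity) -> iterable of (url, text) public opinion / soft texts
    has_bonus_system(text) -> bool
    get_pagerank(url) -> float
    """
    entity = lookup_icp(domain)               # sub-steps (1) and (2)
    if entity is None:
        return FilingEvidence(None)
    evidence = FilingEvidence(entity)
    for url, text in search_mentions(entity):
        if has_bonus_system(text):            # sub-step (3)
            evidence.pageranks.append(get_pagerank(url))  # sub-step (4)
    return evidence
```

The recorded list has one pagerank value per matching page, so its length corresponds to the number of public opinion items or soft texts found to contain a bonus system.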

Step six: establish a marketing (pyramid-selling) algorithm model and digitize the analysis results of step three, step four, and step five to obtain an evaluation value of the website, assuming that:

(1) the score for the website having recommended placement relationship features is $f(x) = \ln(x)$, $(0 \le x)$, where x is the number of occurrences of form input boxes associated with the recommended placement relationship keywords;

(2) the score for the website having a bonus system is $g(n) = n$, where n is a constant;

(3) the single-webpage score for a bonus system found in the public opinion or soft texts of the enterprise corresponding to the website is $o(g) = g$, where g is the pagerank value; the total value of this item is

$$O(m) = \sum_{i=1}^{m} o(g_i) = \sum_{i=1}^{m} g_i$$

where m is the number of public opinion items or soft texts with a bonus system, and $g_i$ is the pagerank value of the i-th such webpage;

thus, the final evaluation result of the website is $F(x, n, m) = f(x) + g(n) + O(m)$.

Step seven: determine whether the website is a marketing website according to the website evaluation result F. The greater the value of F, the greater the probability that the website is a marketing website. A minimum confidence y can be preset, and when the website evaluation result F is greater than y, the website can be judged to be a marketing website.
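The scoring model of step six and the decision of step seven can be sketched in a few lines of Python. Several points are assumptions rather than statements from the application: the total for the third item is taken to be the sum of the single-page pagerank scores; a site with no matching input boxes (x = 0) is given a placement score of 0 because ln(0) is undefined; a site without a bonus system contributes 0 for the second item; and the default `n=5.0` is purely illustrative.

```python
import math


def placement_score(x):
    """f(x) = ln(x), where x is the number of matching form input boxes.
    Assumption: x == 0 scores 0.0, since ln(0) is undefined."""
    return math.log(x) if x > 0 else 0.0


def evaluate_website(x, has_bonus_system, pageranks, n=5.0):
    """F = f(x) + g(n) + (sum of pagerank values of the m matching pages).
    n is the constant bonus-system score; 5.0 is an illustrative value.
    Assumption: a site without a bonus system contributes 0 for g."""
    g = n if has_bonus_system else 0.0
    return placement_score(x) + g + sum(pageranks)


def is_marketing_website(score, y):
    """Step seven: judge the site a marketing website when F exceeds y."""
    return score > y
```

With three matching input boxes, a bonus system on the site, and two matching pages with pagerank values 2 and 4, the score is ln(3) + 5 + 6, which clears a minimum confidence of 10.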

Based on the above, the marketing website identification method can acquire websites having a recommended placement relationship from the internet through web crawler technology, page structured parsing technology, marketing website feature identification technology, and the like, deeply scan these websites, and mine evidence of a bonus system. At the same time, it obtains the website's operating entity by retrieving ICP (Internet Content Provider) filing information, crawls public opinion and soft texts related to the operating entity, and mines them for bonus system evidence, finally achieving the mining and identification of network marketing. By modeling the analysis results along three dimensions (recommended placement relationship identification, website bonus system identification, and public opinion analysis of the operating entity), the suspected marketing entity can be scored more accurately, which further ensures the accuracy of the identification result.

Therefore, the website identification method provided by the embodiment of the application screens target websites through website detection based on preset keywords, evaluates each target website based on features extracted for the specified feature types, and thereby completes website type identification. Compared with manual identification in the prior art, this implementation is faster and more convenient: internet websites can be identified automatically without manual operation, the accuracy of the identification result is effectively ensured, and network security is further improved. In addition, the implementation is suitable for identifying various types of websites and has high applicability.

To solve the above technical problem, the present application further provides a website identification apparatus. Referring to fig. 3, fig. 3 is a schematic structural diagram of the website identification apparatus provided in the present application, and the website identification apparatus may include:

the website detection module 1 is used for performing website detection according to the identification instruction to obtain a target website with preset keywords;

the feature extraction module 2 is used for extracting features of each webpage in the target website to obtain target feature information;

the website evaluation module 3 is used for evaluating a target website according to the target characteristic information to obtain a website evaluation value;

and the website identification module 4 is used for determining the website type of the target website according to the website evaluation value.
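The division into four modules can be pictured as a small pipeline object in which each module is a pluggable callable. This is only a structural sketch: the class name `WebsiteIdentifier`, the field names, and the type signatures are hypothetical, and the application does not prescribe how the modules are implemented or connected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class WebsiteIdentifier:
    detect: Callable[[List[str]], List[str]]        # website detection module 1
    extract: Callable[[str], Dict[str, float]]      # feature extraction module 2
    evaluate: Callable[[Dict[str, float]], float]   # website evaluation module 3
    identify: Callable[[float], Optional[str]]      # website identification module 4

    def run(self, preset_keywords):
        """Chain the four modules; returns {site: website type or None}."""
        results = {}
        for site in self.detect(preset_keywords):
            features = self.extract(site)
            value = self.evaluate(features)
            results[site] = self.identify(value)
        return results
```

Keeping each module behind a plain callable means a different extraction or evaluation strategy can be substituted per website type without touching the rest of the pipeline.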

Therefore, the website identification apparatus provided by the embodiment of the application screens target websites through website detection based on preset keywords, evaluates each target website based on features extracted for the specified feature types, and thereby completes website type identification. Compared with manual identification in the prior art, this implementation is faster and more convenient: internet websites can be identified automatically without manual operation, the accuracy of the identification result is effectively ensured, and network security is further improved. In addition, the implementation is suitable for identifying various types of websites and has high applicability.

As a preferred embodiment, the website detecting module 1 may be specifically configured to, when receiving the identification instruction, perform website detection by using a web crawler technology to obtain a target website with preset keywords.

As a preferred embodiment, the website identification apparatus may further include a web page obtaining module, configured to traverse the target website through a web crawler technology before performing feature extraction on each web page in the target website to obtain target feature information, so as to obtain each web page in the target website.

As a preferred embodiment, the feature extraction module 2 may be specifically configured to perform feature extraction on each webpage by using a webpage structural analysis technology to obtain recommended placement relationship features.

As a preferred embodiment, the feature extraction module 2 may be specifically configured to perform feature extraction on each webpage by using a preset feature extraction algorithm to obtain a target text feature.

As a preferred embodiment, the website identification apparatus may further include a filing feature extraction module, configured to obtain ICP filing information of the target website; determining a filing user of the target website according to the ICP filing information; inquiring network information of a filing user; and extracting the characteristics of the network information to obtain target characteristic information.

As a preferred embodiment, the website identification module 4 may be specifically configured to compare the website evaluation value with a preset confidence level to obtain a comparison result; and determining the website type of the target website according to the comparison result.

For the introduction of the apparatus provided in the present application, please refer to the above method embodiments, which are not described herein again.

To solve the above technical problem, the present application further provides a website identification device, please refer to fig. 4, where fig. 4 is a schematic structural diagram of the website identification device provided in the present application, and the website identification device may include:

a memory 10 for storing a computer program;

a processor 20 for implementing the steps of any of the above website identification methods when executing the computer program.

For the introduction of the device provided in the present application, please refer to the above method embodiment, which is not described herein again.

To solve the above problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, can implement the steps of any one of the above website identification methods.

The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

For the introduction of the computer-readable storage medium provided in the present application, please refer to the above method embodiments, which are not described herein again.

The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The technical solutions provided by the present application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, without departing from the principle of the present application, several improvements and modifications can be made to the present application, and these improvements and modifications also fall into the protection scope of the present application.

Claims

1. A website identification method, comprising:

2. The website identification method according to claim 1, wherein the website detecting according to the identification instruction to obtain the target website with the preset keyword comprises:

3. The website identification method according to claim 2, wherein before the extracting the features of each webpage in the target website to obtain the target feature information, the method further comprises:

4. The website identification method according to claim 1, wherein the extracting features of each webpage in the target website to obtain target feature information comprises:

5. The website identification method according to claim 1, wherein the extracting features of each webpage in the target website to obtain target feature information comprises:

6. The website identification method according to any one of claims 1 to 5, further comprising:

obtaining ICP filing information of the target website;

inquiring the network information of the filing user;

7. The website identification method according to claim 6, wherein the determining the website type of the target website according to the website evaluation value comprises:

8. A website recognition apparatus, comprising:

9. A website identification device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the website identification method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the website identification method according to any one of claims 1 to 7.