CN111831948B

CN111831948B - Webpage type detection method and device and computer equipment

Info

Publication number: CN111831948B
Application number: CN201910315871.5A
Authority: CN
Inventors: 庞玉
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-04-18
Filing date: 2019-04-18
Publication date: 2024-06-14
Anticipated expiration: 2039-04-18
Also published as: CN111831948A

Abstract

The application discloses a method and a device for detecting a webpage type, and computer equipment. The method comprises the following steps: acquiring the text density of tag content in a target webpage, wherein the tag content is the content corresponding to a hypertext markup language (HTML) tag; comparing the text density with a first threshold value to obtain a first comparison result; and identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result. The application solves the technical problem that the recall rate of risk webpages depends heavily on the sensitive word stock and is degraded as a result.

Description

Webpage type detection method and device and computer equipment

Technical Field

The present application relates to the field of computers, and in particular, to a method and apparatus for detecting a web page type, and a computer device.

Background

Existing methods for identifying risk webpages are basically based on a sensitive word stock: risk content is detected and identified according to the words in the stock. However, the sensitive word stock requires considerable labor cost to update regularly, and the recall rate depends strongly on it; if the sensitive word stock is not updated in time, the recall rate of risk webpages drops.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the application provides a method, a device and computer equipment for detecting webpage types, which are used for at least solving the technical problem that the recall rate of a risk webpage is influenced by large dependence on a sensitive word stock.

According to an aspect of an embodiment of the present application, there is provided a method for detecting a web page type, including: acquiring text density of tag content in a target webpage, wherein the tag content is corresponding to a hypertext markup language tag; comparing the text density with a first threshold value to obtain a first comparison result; and identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result.

According to another aspect of the embodiment of the present application, there is also provided a method for detecting a web page type, including: acquiring text density of tag content in a target webpage, wherein the tag content is corresponding to a hypertext markup language tag; comparing the text density with a first threshold value to obtain a first comparison result; and when the first comparison result indicates that the text density is larger than a first threshold value, generating prompt information for indicating that the target webpage is a webpage of a specified type, and displaying the prompt information.

According to another aspect of the embodiment of the present application, there is also provided a device for detecting a web page type, including: the acquisition module is used for acquiring the text density of tag content in the target webpage, wherein the tag content is corresponding to the hypertext markup language tag; the comparison module is used for comparing the text density with a first threshold value to obtain a first comparison result; and the identification module is used for identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result.

According to another aspect of the embodiment of the present application, there is further provided a nonvolatile storage medium, where the storage medium includes a stored program, and when the program runs, the device in which the storage medium is located is controlled to execute the above method for detecting a web page type.

According to another aspect of an embodiment of the present application, there is also provided a computer apparatus including: a processor; and a memory, coupled to the processor, for providing instructions to the processor to process the steps of: step 1, obtaining text density of tag content in a target webpage, wherein the tag content is corresponding to a hypertext markup language tag; step 2, comparing the text density with a first threshold value to obtain a first comparison result; and step 3, identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result.

In the embodiment of the application, the text density of the tag content in the target webpage is acquired, wherein the tag content is the content corresponding to a hypertext markup language tag; the text density is compared with a first threshold value to obtain a first comparison result; and the webpage type of the target webpage is identified at least according to the first comparison result to obtain a target identification result. Whether a webpage is a risk webpage is thus judged by acquiring and evaluating the text density in the target webpage. This achieves the technical effect of improving the detection amount of risk webpages based on analysis of the webpage structure, and solves the technical problem that the recall rate of risk webpages depends heavily on the sensitive word stock and is degraded as a result.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 shows a block diagram of a hardware architecture of a computer terminal (or mobile device) for implementing a method of detecting web page types;

FIG. 2 is a flowchart of a method for detecting a web page type according to embodiment 1 of the present application;

FIG. 3 is a flowchart of a method for detecting a web page type according to embodiment 2 of the present application;

FIG. 4 is a flowchart of a method for determining a first threshold according to embodiment 2 of the present application;

FIG. 5 is a flowchart of a method for detecting a web page type according to embodiment 3 of the present application;

FIG. 6 is a schematic diagram of a web page type detection device according to embodiment 4 of the present application.

Detailed Description

In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For better understanding of the above embodiments, technical terms related to the embodiments of the present application are explained below:

Text density: the ratio of the length of the body text to the length of the entire tag content (including the tag).

Neural network model: neural networks are complex network systems formed by a large number of simple processing units (called neurons) widely interconnected, reflecting many of the fundamental features of human brain function, and are highly complex nonlinear dynamic learning systems.

HTML tag: a keyword enclosed in angle brackets in an HTML document, such as <b>; tags typically appear in pairs.

Recall rate: the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document library; it measures how completely a retrieval system finds the relevant documents.

In the related art, risk web pages are usually detected by means of a sensitive word stock, but this approach depends strongly on that word stock. The embodiments of the application instead determine whether the current web page is a risk web page mainly by analyzing the text density in the web page. This compensates for the over-reliance on the keyword library that arises when web page risk is judged by keywords alone, increases the detection amount of risk web pages, and improves the accuracy of identifying risk web pages. Specific embodiments are described in detail below.

Example 1

In accordance with an embodiment of the present application, there is also provided a method embodiment of the detection of web page types, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.

The method embodiment provided in embodiment 1 of the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal (or mobile device) for implementing a method of detecting a web page type. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission means 106 for communication functions. In addition, the computer terminal 10 may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits." The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in embodiments of the application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination connected to the interface).

The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the method for detecting a web page type in the embodiment of the present application. The processor 102 executes the software programs and modules stored in the memory 104, thereby executing various functional applications and data processing, that is, implementing the above method for detecting a web page type. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

It should be noted here that, in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one example of a specific example, and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.

In the above-mentioned operation environment, the present application provides a method for detecting the type of web page as shown in fig. 2. Fig. 2 is a flowchart of a method for detecting a web page type according to embodiment 1 of the present application. As shown in fig. 2, the method includes:

Step S21, obtaining text density of tag content in a target webpage, wherein the tag content is corresponding to a hypertext markup language tag;

Specifically, a webpage mainly comprises three major elements: a navigation bar, columns, and body text content. The target webpage is the webpage currently to be identified, the tag content comprises a tag and the body text within the tag, and the text density is the ratio of the length of the body text to the length of the whole tag content (including the tag).

Step S23, comparing the text density with a first threshold value to obtain a first comparison result;

Step S25, identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result.

Through the above steps, the type of the web page is determined based on the text density in the target web page; for example, it may be determined whether the web page is a risk web page. This achieves the technical effect of improving the detection of risk web pages based on analysis of the web page structure, and solves the technical problem that the recall rate of risk web pages depends heavily on the sensitive word stock and is degraded as a result.
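The text density used in step S21 can be illustrated with a minimal sketch. It assumes that length is measured in characters and that the body text is the character data that remains once the markup is removed; the names `_TextCollector` and `text_density` are hypothetical and do not come from the application.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_density(tag_content: str) -> float:
    """Return len(body text) / len(tag content, tags included)."""
    if not tag_content:
        return 0.0
    collector = _TextCollector()
    collector.feed(tag_content)
    collector.close()
    # Assumption: leading and trailing whitespace is not counted as body text.
    text = "".join(collector.parts).strip()
    return len(text) / len(tag_content)
```

For the tag content `<p>hello</p>`, the body text has 5 characters and the whole tag content has 12, so this sketch yields a density of 5/12.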

Optionally, the above web page types include at least one of: legal web pages and risk web pages (e.g., web pages involving illegal activities).

In some optional embodiments of the present application, before obtaining the text density of the tag content in the target web page, the tag to be identified may be further determined, and the type of the web page may be determined according to the text density of the tag, which may be specifically implemented by the following steps:

Step S101, inputting the target webpage into a first model for analysis to obtain a target tag in the target webpage, wherein the first model is obtained through training on multiple groups of data, and each group of data in the multiple groups of data comprises: a risk web page and the web page tags in the risk web page whose text density is greater than the first threshold.

In some alternative embodiments of the present application, the first model includes, but is not limited to, a neural network model.

In one risk web page, there may be one or more web page tags whose text density is greater than the first threshold. Accordingly, when the target web page is input to the first model for analysis, one or more target tags may be obtained.

Step S102, extracting the text content corresponding to the target tag; in this step the text content may be crawled by a crawler tool.

Step S103, determining text density based on the text content.

Specifically, when the target webpage is input into the first model for analysis and a single target tag is obtained, the text content corresponding to that target tag is extracted directly, and the text density is determined based on the text content.

When the target web page is input to the first model for analysis, multiple target tags may also be obtained, in which case the text density of the target web page needs to take the text densities of all of these target tags into account. In some optional embodiments of the present application, after the target web page is input to the first model for analysis, the following steps may therefore be further performed: extracting the text contents corresponding to the plurality of target tags, and determining the text density in each target tag based on the text content corresponding to that target tag, to obtain a plurality of text densities; and determining the average value of the plurality of text densities, and taking the average value as the text density of the tag content in the target webpage.
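The averaging over several target tags can be sketched as follows. The sketch assumes an unweighted arithmetic mean, which is one reading of "average value"; the function name `page_text_density` is hypothetical.

```python
from statistics import fmean
from typing import Iterable


def page_text_density(tag_densities: Iterable[float]) -> float:
    """Average the per-tag text densities into one page-level density."""
    densities = list(tag_densities)
    if not densities:
        # Assumption: a page without any target tag is treated as an error.
        raise ValueError("at least one target tag density is required")
    return fmean(densities)
```

With a single target tag the result is simply that tag's own density, which matches the single-tag case described above.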

Optionally, the identifying the web page type of the target web page according to at least the first comparison result in the step S25 may be implemented as follows: when the first comparison result indicates that the text density is greater than a first threshold value, determining that the webpage type is a specified type; when the first comparison result indicates that the text density is less than the first threshold, it is determined that the web page type is not the specified type.

Wherein the specified type may be a risk web page.
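As a minimal sketch, the comparison with the first threshold amounts to a single test. The text does not say how a density exactly equal to the threshold is handled; the sketch assumes it is treated as not being the specified type. The function name and the returned labels are hypothetical.

```python
def classify_by_density(density: float, first_threshold: float) -> str:
    """Return "risk" when the density exceeds the first threshold, else "legal"."""
    # Assumption: equality with the threshold counts as "not the specified type".
    return "risk" if density > first_threshold else "legal"
```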

Specifically, identifying the web page type of the target web page at least according to the first comparison result, and obtaining the target identification result may include the following steps: determining a first identification result of the target webpage according to the first comparison result; determining whether to detect keywords in the target webpage according to the first recognition result; when the keywords in the target webpage are determined to be detected, the type of the target webpage is identified according to the keywords, a second identification result is obtained, and the target identification result is determined at least according to the second identification result.

When the first recognition result of the target webpage, determined according to the first comparison result, indicates that the current webpage is a webpage related to illegal activity, it is determined to detect keywords in the target webpage, for example by checking the target webpage against keywords related to illegal activities.

Optionally, if the first recognition result of the target webpage determined according to the first comparison result indicates that the target webpage is a risk webpage, it is determined to detect keywords in the target webpage; the type of the target webpage is then identified according to the keywords to obtain a second recognition result, and the target recognition result is determined according to the second recognition result. For example: when the first recognition result indicates that the target webpage is a risk webpage and the second recognition result indicates that the target webpage is a legal webpage, the target recognition result is determined to be a legal webpage; when the first recognition result indicates that the target webpage is a risk webpage and the second recognition result also indicates that the target webpage is a risk webpage, the target recognition result is determined to be a risk webpage.
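The two-stage decision described above can be sketched as follows. The sketch assumes that keyword detection is a plain substring match against a caller-supplied keyword list; the function name `detect_page_type` and the placeholder keyword used in the usage note are hypothetical.

```python
from typing import Iterable


def detect_page_type(density: float, first_threshold: float,
                     page_text: str, risk_keywords: Iterable[str]) -> str:
    """Density check first; keyword check only when the density check flags risk."""
    first_is_risk = density > first_threshold
    if not first_is_risk:
        # Keyword detection is skipped when the first result is not "risk".
        return "legal"
    # Assumption: a keyword "hit" is a simple substring match.
    second_is_risk = any(keyword in page_text for keyword in risk_keywords)
    # In this variant the second recognition result decides the final type.
    return "risk" if second_is_risk else "legal"
```

For instance, `detect_page_type(0.9, 0.5, "some forbidden-term here", ["forbidden-term"])` returns `"risk"`, while the same page text with a density of 0.1 returns `"legal"` without the keywords being examined.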

In some alternative embodiments of the present application, determining the target recognition result at least in accordance with the second recognition result may be achieved by: and jointly determining a target recognition result according to the first recognition result and the second recognition result.

Specifically, the target recognition result may be determined in steps S251 to S257:

Step S251, determining a first evaluation index of the first recognition result and a second evaluation index of the second recognition result, wherein the first evaluation index and the second evaluation index are used for indicating the probability that the target webpage is of the specified type;

Specifically, the first evaluation index of the first recognition result and the second evaluation index of the second recognition result may each be a probability that the target web page is a risk web page, but are not limited thereto.

Step S253 of determining a target evaluation index based on the first and second evaluation indexes, and the weights of the first and second evaluation indexes;

In some optional embodiments of the present application, the weight of the first evaluation index and the weight of the second evaluation index may be set in advance. When no update of the keyword library has been detected for a long time, the setting principle may be as follows: when the text density of the tag content in the target web page exceeds a third threshold, the weight of the first evaluation index is set to be larger than the weight of the second evaluation index; when the text density of the tag content in the target web page is smaller than the third threshold, the weight of the first evaluation index is set to be smaller than the weight of the second evaluation index. When the time elapsed since the keyword library was last updated is detected to be greater than a preset threshold, the weight of the first evaluation index is set to be greater than that of the second evaluation index.

In other optional embodiments of the present application, since keyword-based detection is accurate, when the keyword library is detected to be updated normally, the setting principle may be: the weight of the first evaluation index is set to be smaller than that of the second evaluation index.

Step S255, comparing the target evaluation index with a second threshold value to obtain a second comparison result;

Step S257, determining a target recognition result according to the second comparison result, wherein when the second comparison result indicates that the target evaluation index is greater than the second threshold value, the target recognition result is determined to be of the specified type; and when the second comparison result indicates that the target evaluation index is smaller than the second threshold value, it is determined that the target recognition result is not of the specified type.
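Steps S251 to S257 can be sketched as below. The text only says that the target evaluation index is determined from the two indexes and their weights; the normalized weighted average used here is an assumption, and the function names are hypothetical.

```python
def target_evaluation_index(p1: float, p2: float, w1: float, w2: float) -> float:
    """Combine the two evaluation indexes (probabilities) using their weights."""
    if w1 + w2 <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: the combination rule is a normalized weighted average.
    return (w1 * p1 + w2 * p2) / (w1 + w2)


def is_specified_type(p1: float, p2: float, w1: float, w2: float,
                      second_threshold: float) -> bool:
    """True when the combined index exceeds the second threshold (step S257)."""
    return target_evaluation_index(p1, p2, w1, w2) > second_threshold
```

With a density-based probability of 0.9 and a keyword-based probability of 0.3, weights of 0.7 and 0.3 give a combined index of 0.72, whereas swapping the weights gives 0.48; against a second threshold of 0.5 only the first setting classifies the page as the specified type.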

Specifically, determining whether to detect keywords in the target web page according to the first recognition result may be performed as follows: when the webpage type indicated by the first recognition result is the specified type, determining to detect keywords in the target webpage; and when the webpage type indicated by the first recognition result is not the specified type, refusing to detect keywords in the target webpage.

In some alternative embodiments of the present application, before the text density is compared with the first threshold value to obtain the first comparison result, the first threshold may be determined as follows: acquiring a plurality of sets of sample data, wherein each set of sample data comprises a legal web page labeled as legal and the text density of the tag content in that legal web page; and determining the first threshold based on the plurality of sets of sample data.

Specifically, when legal web pages and the text densities of the tag content in those legal web pages are used as the sample data, the first threshold is the upper density limit of the text densities of the tag content in all sample legal web pages, or a value larger than that upper limit.

Optionally, before comparing the text density with the first threshold value to obtain the first comparison result, the first threshold may instead be determined as follows: acquiring a plurality of sets of sample data, wherein each set of sample data comprises a risk web page labeled as risk and the text density of the tag content in that risk web page; and determining the first threshold based on the plurality of sets of sample data.

Specifically, when the first threshold is determined using the risk web page and the text density of the tag content in the risk web page as the sample data, the first threshold is a lower density limit of the text densities of the tag content in all the sample risk web pages, or a value smaller than the lower density limit.
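Both ways of deriving the first threshold can be sketched briefly. The optional `margin` parameter models "a value larger than the upper limit" and "a value smaller than the lower limit"; its name and the function names are hypothetical.

```python
from typing import Sequence


def threshold_from_legal(densities: Sequence[float], margin: float = 0.0) -> float:
    """Upper limit of the legal-page densities, optionally raised by a margin."""
    if not densities:
        raise ValueError("sample data must not be empty")
    return max(densities) + margin


def threshold_from_risk(densities: Sequence[float], margin: float = 0.0) -> float:
    """Lower limit of the risk-page densities, optionally lowered by a margin."""
    if not densities:
        raise ValueError("sample data must not be empty")
    return min(densities) - margin
```

A threshold derived from legal samples guarantees that none of those samples exceeds it, while one derived from risk samples guarantees that every risk sample is at or above it.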

In the embodiment of the application, the text density of the tag content in the target webpage is acquired, wherein the tag content is the content corresponding to a hypertext markup language tag; the text density is compared with a first threshold value to obtain a first comparison result; and the webpage type of the target webpage is identified at least according to the first comparison result to obtain a target identification result. Whether a webpage is a risk webpage is thus judged by acquiring and evaluating the text density in the target webpage. This achieves the technical effect of improving the detection amount of risk webpages based on analysis of the webpage structure, and solves the technical problem that the recall rate of risk webpages depends heavily on the sensitive word stock and is degraded as a result.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present application.

Example 2

In the above-mentioned operation environment, the present application provides a method for detecting the type of web page as shown in fig. 3. Fig. 3 is a flowchart of a method for detecting a web page type according to embodiment 2 of the present application. As shown in fig. 3, the method comprises the steps of:

Step S302, acquiring HTML content of a target webpage through a URL;

Step S304, preprocessing the HTML content;

Specifically, the target webpage is the webpage currently to be identified. The preprocessing may be processing of the characters in the webpage: to prevent garbled characters, the default encoding of the acquired webpage file is converted into UTF-8 character encoding in the preprocessing stage, and if the relevant encoding information cannot be obtained from the webpage, the content is forcibly converted into UTF-8 character encoding.

In addition, in accordance with the basic syntax requirements of an HTML document, three conditions are ensured: every opened tag is closed, all attribute values are enclosed in double quotation marks, and special characters are escaped.
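The encoding conversion in the preprocessing stage might look like the following sketch. It assumes that the encoding information is taken from a `charset` declaration in a `<meta>` tag near the start of the file, and that "forcibly converted" means decoding as UTF-8 with undecodable bytes replaced; both are assumptions, and the function name `to_utf8` is hypothetical.

```python
import re

# Matches charset declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
_META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)


def to_utf8(raw: bytes) -> bytes:
    """Re-encode raw page bytes as UTF-8."""
    match = _META_CHARSET.search(raw[:2048])
    declared = match.group(1).decode("ascii") if match else None
    if declared:
        try:
            return raw.decode(declared).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            pass  # Unknown or wrong declaration: fall through to forced conversion.
    # Assumption: forced conversion replaces undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

The sketch covers only the encoding step; closing tags, quoting attribute values, and escaping special characters would need a separate normalization pass.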

Step S306, dividing the webpage into blocks based on the preprocessed HTML content;

Specifically, the web page may be partitioned based on tag content;

Step S308, extracting the text of the segmented web page;

Step S310, calculating the text density of the web page;

In some alternative embodiments of the application, the tag content includes tag and body text in the tag, the text density being the ratio of body text over the length of the tag content (including the tag).

Optionally, before calculating the web page text density, the following steps may be further performed: inputting the webpage into a first model for analysis to obtain a target tag in the webpage, wherein the first model is obtained through training on multiple groups of data, and each group of data in the multiple groups of data comprises: a risk webpage and the webpage tags in the risk webpage whose text density is greater than the first threshold.

In particular, the risk web page may be a web page involving illegal activities.

Alternatively, in a risk web page, there may be one or more web page tags whose text density is greater than the first threshold. Accordingly, when the web page is input into the first model for analysis, one or more target tags may be obtained.

In addition, step S306 may be skipped, and the text density of the web page may be calculated directly based on the existing block structure of the web page, for example directly based on text-class tags and other-class tags.

Step S312, judging whether the text density value is larger than a first threshold value, if yes, determining that the target webpage contains violation information and is a risk webpage, and if not, determining that the target webpage does not contain violation information and is a legal webpage.
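The blocking of step S306 is not specified in detail. One possible sketch splits the preprocessed HTML into blocks at a fixed set of text-bearing tags; the choice of tags, the regular expression, and the function name `split_blocks` are all assumptions, and elements of the same kind nested inside one another (for example nested lists or tables) are not handled.

```python
import re
from typing import List

# Assumption: paragraphs, list items, table cells and headings act as blocks.
_BLOCK = re.compile(r"<(p|li|td|h[1-6])\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)


def split_blocks(html: str) -> List[str]:
    """Return each block's tag content, with its opening and closing tags."""
    return [match.group(0) for match in _BLOCK.finditer(html)]
```

Each returned string includes its own tags, so it can serve directly as the "tag content" over which a text density is computed in step S310.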

The present application provides a method of determining a first threshold as shown in fig. 4. Fig. 4 is a flowchart of a method of determining a first threshold according to embodiment 2 of the present application.

Step S402, acquiring a plurality of groups of sample web page HTML content through a Uniform Resource Locator (URL);

in particular, the multiple sets of sample web pages may all be legal web pages, or all be risk web pages.

Step S404, preprocessing the HTML content of a plurality of groups of sample web pages;

step S406, the webpage is segmented based on the preprocessed HTML content;

step S408, extracting the text of the segmented web page;

Step S410, calculating text density of a plurality of groups of sample web pages;

Step S412, determining a first threshold based on the plurality of sets of text densities obtained in step S410.

In some alternative embodiments of the present application, when the plurality of sets of sample web pages are all legal web pages, the first threshold is an upper density limit, or a value greater than the upper density limit, of text densities in all sample legal web pages.

In some alternative embodiments of the present application, when the plurality of sets of sample web pages are all risk web pages, the first threshold is a lower density limit, or a value less than the lower density limit, of text densities in all sample risk web pages.
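The two ways of deriving the first threshold described above may be sketched as follows. The function names and the `margin` parameter are assumptions introduced for illustration; the application only states that the threshold is the density limit of the samples or a value beyond that limit, without specifying how far beyond.

```python
from typing import Sequence


def threshold_from_legal(densities: Sequence[float], margin: float = 0.0) -> float:
    """Upper density limit of legal sample pages, optionally raised by a margin."""
    if not densities:
        raise ValueError("at least one sample density is required")
    return max(densities) + margin


def threshold_from_risk(densities: Sequence[float], margin: float = 0.0) -> float:
    """Lower density limit of risk sample pages, optionally lowered by a margin."""
    if not densities:
        raise ValueError("at least one sample density is required")
    return min(densities) - margin


# Hypothetical sample densities, used only to exercise the functions.
print(threshold_from_legal([0.20, 0.35, 0.30]))
print(threshold_from_risk([0.60, 0.75, 0.90], margin=0.05))
```

With all-legal samples, any page denser than every legal sample is flagged; with all-risk samples, any page denser than the sparsest risk sample is flagged, so the second choice would tend to flag more pages.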

Example 3

The application also provides a method for detecting the webpage type. Fig. 5 is a flowchart of a method for detecting a web page type according to embodiment 3 of the present application. As shown in fig. 5, the method at least comprises the following steps:

step S502, obtaining text density of label content in a target webpage, wherein the label content is corresponding to a hypertext markup language label;

specifically, the target webpage is the webpage to be identified currently, the tag content comprises a tag and text in the tag, and the text density is the ratio of the text in the tag content to the length (including the tag) of the whole tag content.

Step S504, comparing the text density with a first threshold value to obtain a first comparison result;

In step S506, when the first comparison result indicates that the text density is greater than the first threshold, a prompt message for indicating that the target web page is a web page of a specified type is generated, and the prompt message is displayed.

Specifically, the web page type includes at least one of: legal web pages, risk web pages (e.g., web pages involving illegal activities).

Before obtaining the text density of the tag content in the target web page, the following steps may be further performed: inputting a target webpage into a first model for analysis to obtain a target label in the target webpage, wherein the first model is obtained through training of multiple groups of data, and each group of data in the multiple groups of data comprises: a risk webpage and a webpage label with text density larger than a first threshold value in the risk webpage; extracting text content corresponding to the target label; text density is determined based on the text content.

Alternatively, in a risk web page, the web page tags with text density greater than the first threshold may be one or more. The target web page is input into the first model for analysis, and one or more target labels in the target web page can be obtained.

Optionally, inputting the target webpage into the first model for analysis may yield a plurality of target tags in the target webpage.

Optionally, there are a plurality of target tags; after the target webpage is input into the first model for analysis and the target tags in the target webpage are obtained, the following steps are executed: extracting the text content corresponding to each of the plurality of target tags, and determining the text density of each target tag based on its corresponding text content, so as to obtain a plurality of text densities; and determining an average value of the plurality of text densities, and taking the average value as the text density of the tag content in the target webpage.
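The averaging over a plurality of target tags may be sketched as below. The sketch assumes the per-tag densities have already been computed, and it uses an unweighted arithmetic mean, since the application does not mention any weighting; the function name `page_text_density` is hypothetical.

```python
from typing import Sequence


def page_text_density(tag_densities: Sequence[float]) -> float:
    """Average the per-tag text densities into one page-level text density."""
    if not tag_densities:
        raise ValueError("at least one target tag density is required")
    return sum(tag_densities) / len(tag_densities)


# Hypothetical densities of three target tags in one target webpage.
print(page_text_density([0.2, 0.4, 0.6]))
```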

Optionally, after generating the prompt information for indicating that the target webpage is a webpage of a specified type, the user may be prompted or warned in a popup window manner on a display screen of the intelligent terminal.

Example 4

According to an embodiment of the present application, there is also provided a device for detecting a web page type for implementing the method for detecting a web page type, as shown in fig. 6, where the device at least includes:

The obtaining module 62 is configured to obtain a text density of tag content in the target web page, where the tag content is a content corresponding to a hypertext markup language tag;

a comparison module 64, configured to compare the text density with a first threshold value to obtain a first comparison result;

the identifying module 66 is configured to identify a web page type of the target web page according to at least the first comparison result, and obtain a target identification result.

Optionally, the apparatus further comprises: the determining module is used for executing the following steps before acquiring the text density of the tag content in the target webpage: inputting a target webpage into a first model for analysis to obtain a target label in the target webpage, wherein the first model is obtained through training of multiple groups of data, and each group of data in the multiple groups of data comprises: a risk webpage and a webpage label with text density larger than a first threshold value in the risk webpage; extracting text content corresponding to the target label; text density is determined based on the text content.

Optionally, there are a plurality of target tags; after the target webpage is input into the first model for analysis and the target tags in the target webpage are obtained, the determining module is further configured to execute the following steps: extracting the text content corresponding to each of the plurality of target tags, and determining the text density of each target tag based on its corresponding text content, so as to obtain a plurality of text densities; and determining an average value of the plurality of text densities, and taking the average value as the text density of the tag content in the target webpage.

In addition, the recognition module 66 is further configured to determine that the web page type is a specified type when the first comparison result indicates that the text density is greater than the first threshold; when the first comparison result indicates that the text density is less than the first threshold, it is determined that the web page type is not the specified type.

Optionally, the identifying module 66 is further configured to determine a first identifying result of the target web page according to the first comparing result; determining whether to detect keywords in the target webpage according to the first recognition result; when the keywords in the target webpage are determined to be detected, the type of the target webpage is identified according to the keywords, a second identification result is obtained, and the target identification result is determined at least according to the second identification result.

In some alternative embodiments of the present application, the recognition module 66 is further configured to determine the target recognition result based on the first recognition result and the second recognition result.

Specifically, the recognition module 66 is further configured to determine a first evaluation index of the first recognition result and a second evaluation index of the second recognition result, where the first evaluation index and the second evaluation index are both used to indicate a probability that the target web page is of a specified type; determining a target evaluation index based on the first evaluation index and the second evaluation index, and the weight of the first evaluation index and the weight of the second evaluation index; comparing the target evaluation index with a second threshold value to obtain a second comparison result; determining a target recognition result according to the second comparison result, wherein when the second comparison result indicates that the target evaluation index is greater than a second threshold value, the target recognition result is determined to be of a specified type; and when the second comparison result indicates that the target evaluation index is smaller than the second threshold value, determining that the target recognition result is not of the specified type.
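The combination of the two evaluation indexes may be sketched as follows. The application only states that the target evaluation index is determined from the two indexes and their weights; computing it as a weighted sum is an assumption of this sketch, as are the function names and the example values.

```python
def target_evaluation_index(
    first_index: float,
    second_index: float,
    first_weight: float,
    second_weight: float,
) -> float:
    """Assumed weighted sum of the density-based and keyword-based indexes."""
    return first_index * first_weight + second_index * second_weight


def is_specified_type(target_index: float, second_threshold: float) -> bool:
    """The page is of the specified type when the index exceeds the second threshold."""
    return target_index > second_threshold


# Hypothetical probabilities and weights, chosen only for illustration.
index = target_evaluation_index(0.8, 0.4, first_weight=0.5, second_weight=0.5)
print(index, is_specified_type(index, second_threshold=0.5))
```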

Optionally, the recognition module 66 is further configured to determine to detect keywords in the target web page when the web page type indicated by the first recognition result is the specified type; and to refuse to detect keywords in the target web page when the web page type indicated by the first recognition result is not the specified type.
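The gating of keyword detection on the first recognition result may be sketched as below. Keyword detection is modeled here as a plain substring check against a caller-supplied word list, which is an assumption of this sketch; the function name `two_stage_identify` and the use of `None` to signal that detection was skipped are likewise hypothetical.

```python
from typing import Iterable, Optional, Tuple


def two_stage_identify(
    density: float,
    first_threshold: float,
    page_text: str,
    keywords: Iterable[str],
) -> Tuple[bool, Optional[bool]]:
    """Return (first_result, second_result).

    second_result is None when keyword detection is refused because the
    first result does not indicate the specified type.
    """
    first_result = density > first_threshold
    if not first_result:
        return first_result, None  # keyword detection is not performed
    second_result = any(keyword in page_text for keyword in keywords)
    return first_result, second_result


print(two_stage_identify(0.2, 0.5, "ordinary page", ["forbidden"]))
print(two_stage_identify(0.9, 0.5, "a forbidden offer", ["forbidden"]))
```

One consequence of this arrangement is that keyword matching runs only on pages that already passed the density check, so pages below the first threshold incur no keyword-matching cost.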

In some optional embodiments of the application, the determining module is further configured to determine the first threshold in the following manner: acquiring a plurality of sets of sample data, wherein each set of sample data in the plurality of sets of sample data comprises: a web page marked as legal, and the text density of the tag content in that legal web page; and determining the first threshold based on the plurality of sets of sample data.

Specifically, the above apparatus can also perform other steps in embodiment 1, which will not be described herein.

The acquisition module 62 corresponds to step S21, the comparison module 64 corresponds to step S23, and the identification module 66 corresponds to step S25.

The examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to the content disclosed in embodiment 1 above. It should be noted that the above modules may run in the computer terminal 10 provided in embodiment 1 as part of the apparatus.

Example 5

Embodiments of the present application may provide a computer device, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-mentioned computer device may be replaced with a terminal device such as a mobile terminal.

Alternatively, in this embodiment, the above-mentioned computer device may be located in at least one network device among a plurality of network devices of the computer network. In this embodiment, the computer device includes a processor and a memory, where the memory is connected to the processor and is configured to provide the processor with instructions for performing the following processing steps:

Step 1, obtaining text density of tag content in a target webpage, wherein the tag content is corresponding to a hypertext markup language tag;

step 2, comparing the text density with a first threshold value to obtain a first comparison result;

Step 3, identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result.

Optionally, the above processor is further configured to, before acquiring the text density of the tag content in the target web page, perform the following steps: inputting a target webpage into a first model for analysis to obtain a target label in the target webpage, wherein the first model is obtained through training of multiple groups of data, and each group of data in the multiple groups of data comprises: a risk webpage and a webpage label with text density larger than a first threshold value in the risk webpage; extracting text content corresponding to the target label; text density is determined based on the text content.

In some alternative embodiments of the application, there are a plurality of target tags; the processor is further configured to perform the following steps after inputting the target web page into the first model for analysis and obtaining the target tags in the target web page: extracting the text content corresponding to each of the plurality of target tags, and determining the text density of each target tag based on its corresponding text content, so as to obtain a plurality of text densities; and determining an average value of the plurality of text densities, and taking the average value as the text density of the tag content in the target webpage.

Optionally, the above processor is further configured to perform the following steps: when the first comparison result indicates that the text density is greater than a first threshold value, determining that the webpage type is a specified type; when the first comparison result indicates that the text density is less than the first threshold, it is determined that the web page type is not the specified type.

Optionally, the processor is further configured to perform the steps of: determining a first identification result of the target webpage according to the first comparison result; determining whether to detect keywords in the target webpage according to the first recognition result; when the keywords in the target webpage are determined to be detected, the type of the target webpage is identified according to the keywords, a second identification result is obtained, and the target identification result is determined at least according to the second identification result.

Optionally, the processor is further configured to perform the steps of: and jointly determining a target recognition result according to the first recognition result and the second recognition result.

Optionally, the processor is further configured to perform the steps of: determining a first evaluation index of a first recognition result and a second evaluation index of a second recognition result, wherein the first evaluation index and the second evaluation index are used for indicating the probability that the target webpage is of a specified type; determining a target evaluation index based on the first evaluation index and the second evaluation index, and the weight of the first evaluation index and the weight of the second evaluation index; comparing the target evaluation index with a second threshold value to obtain a second comparison result; determining a target recognition result according to the second comparison result, wherein when the second comparison result indicates that the target evaluation index is greater than a second threshold value, the target recognition result is determined to be of a specified type; and when the second comparison result indicates that the target evaluation index is smaller than the second threshold value, determining that the target recognition result is not of the specified type.

Optionally, the processor is further configured to perform the following steps: determining to detect keywords in the target webpage when the webpage type indicated by the first identification result is the specified type; and refusing to detect keywords in the target webpage when the webpage type indicated by the first identification result is not the specified type.

Optionally, before comparing the text density with the first threshold, the processor is further configured to determine the first threshold as follows: acquiring a plurality of sets of sample data, wherein each set of sample data in the plurality of sets of sample data comprises: a web page marked as legal, and the text density of the tag content in that legal web page; and determining the first threshold based on the plurality of sets of sample data.

For the structure of the computer device in embodiment 5, reference may be made to the structure of the computer terminal 10 in fig. 1.

The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for detecting a type of a web page in the embodiment of the present application, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the method for detecting a type of a web page. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, the remote memory being connectable to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

It will be appreciated by those skilled in the art that the structure shown in fig. 1 is only illustrative, and the computer device may also be a terminal device such as a smart phone (e.g. an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 1 does not limit the structure of the electronic device. For example, the computer device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in computer terminal 10 of FIG. 1, or have a different configuration than shown in FIG. 1.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device; the program may be stored in a computer readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, etc.

Embodiments of the present application also provide a nonvolatile storage medium. Alternatively, in this embodiment, the storage medium may be used to store program codes executed by the method for detecting a web page type provided in embodiment 1.

Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: acquiring text density of tag content in a target webpage, wherein the tag content is corresponding to a hypertext markup language tag; comparing the text density with a first threshold value to obtain a first comparison result; and identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result.

Optionally, in this embodiment, the storage medium is configured to store program code for, before obtaining the text density of the tag content in the target web page, performing the steps of: inputting a target webpage into a first model for analysis to obtain a target label in the target webpage, wherein the first model is obtained through training of multiple groups of data, and each group of data in the multiple groups of data comprises: a risk webpage and a webpage label with text density larger than a first threshold value in the risk webpage; extracting text content corresponding to the target label; text density is determined based on the text content.

Optionally, there are a plurality of target tags; the storage medium is arranged to store program code for performing the following steps after the target web page is input into the first model for analysis and the target tags in the target web page are obtained: extracting the text content corresponding to each of the plurality of target tags, and determining the text density of each target tag based on its corresponding text content, so as to obtain a plurality of text densities; and determining an average value of the plurality of text densities, and taking the average value as the text density of the tag content in the target webpage.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: when the first comparison result indicates that the text density is greater than a first threshold value, determining that the webpage type is a specified type; when the first comparison result indicates that the text density is less than the first threshold, it is determined that the web page type is not the specified type.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: determining a first identification result of the target webpage according to the first comparison result; determining whether to detect keywords in the target webpage according to the first recognition result; when the keywords in the target webpage are determined to be detected, the type of the target webpage is identified according to the keywords, a second identification result is obtained, and the target identification result is determined at least according to the second identification result.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: and jointly determining a target recognition result according to the first recognition result and the second recognition result.

Alternatively, in the present embodiment, the storage medium may be further configured to store program code for performing the steps of: determining a first evaluation index of a first recognition result and a second evaluation index of a second recognition result, wherein the first evaluation index and the second evaluation index are used for indicating the probability that the target webpage is of a specified type; determining a target evaluation index based on the first evaluation index and the second evaluation index, and the weight of the first evaluation index and the weight of the second evaluation index; comparing the target evaluation index with a second threshold value to obtain a second comparison result; determining a target recognition result according to the second comparison result, wherein when the second comparison result indicates that the target evaluation index is greater than a second threshold value, the target recognition result is determined to be of a specified type; and when the second comparison result indicates that the target evaluation index is smaller than the second threshold value, determining that the target recognition result is not of the specified type.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: determining to detect keywords in the target webpage when the webpage type indicated by the first identification result is the specified type; and refusing to detect keywords in the target webpage when the webpage type indicated by the first identification result is not the specified type.

Optionally, in this embodiment, the storage medium is arranged to store program code for determining the first threshold as follows, before the text density is compared with the first threshold to obtain the first comparison result: acquiring a plurality of sets of sample data, wherein each set of sample data in the plurality of sets of sample data comprises: a web page marked as legal, and the text density of the tag content in that legal web page; and determining the first threshold based on the plurality of sets of sample data.

The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and are merely a logical functional division, and there may be other manners of dividing the apparatus in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.

The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.

Claims

1. A method for detecting a webpage type, characterized by comprising the following steps:

acquiring text density of tag content in a target webpage, wherein the tag content is corresponding to a hypertext markup language tag;

Comparing the text density with a first threshold value to obtain a first comparison result;

Identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result;

wherein identifying the webpage type of the target webpage at least according to the first comparison result comprises: determining a first identification result of the target webpage according to the first comparison result; determining whether to detect keywords in the target webpage according to the first recognition result; when determining to detect the keywords in the target webpage, identifying the type of the target webpage according to the keywords to obtain a second identification result, and determining the target identification result at least according to the second identification result.

2. The method of claim 1, wherein prior to obtaining the text density of the tag content in the target web page, the method further comprises:

Inputting the target webpage into a first model for analysis to obtain a target label in the target webpage, wherein the first model is obtained through training of multiple groups of data, and each group of data in the multiple groups of data comprises: a risk webpage and a webpage label with text density larger than the first threshold value in the risk webpage;

extracting text content corresponding to the target label;

the text density is determined based on the text content.

3. The method of claim 2, wherein there are a plurality of target tags; after inputting the target webpage into the first model for analysis and obtaining the target tags in the target webpage, the method further comprises the following steps:

extracting text contents corresponding to the plurality of target tags, and determining text density in each target tag based on the text contents corresponding to each target tag in the plurality of target tags to obtain a plurality of text densities; and determining an average value of the text densities, and taking the average value as the text density of the label content in the target webpage.

4. The method of claim 1, wherein identifying the web page type of the target web page based at least on the first comparison result comprises:

Determining that the webpage type is a specified type when the first comparison result indicates that the text density is greater than the first threshold; and when the first comparison result indicates that the text density is smaller than the first threshold value, determining that the webpage type is not the specified type.

5. The method of claim 1, wherein determining the target recognition result based at least on the second recognition result comprises:

And jointly determining the target recognition result according to the first recognition result and the second recognition result.

6. The method of claim 5, wherein determining the target recognition result based on the first recognition result and the second recognition result together comprises:

determining a first evaluation index of the first recognition result and a second evaluation index of the second recognition result, wherein the first evaluation index and the second evaluation index are used for indicating the probability that the target webpage is of a specified type;

determining a target evaluation index based on the first and second evaluation indexes and the weights of the first and second evaluation indexes;

comparing the target evaluation index with a second threshold value to obtain a second comparison result;

determining the target recognition result according to the second comparison result, wherein when the second comparison result indicates that the target evaluation index is larger than the second threshold value, the target recognition result is determined to be of the appointed type; and when the second comparison result indicates that the target evaluation index is smaller than the second threshold value, determining that the target identification result is not the specified type.

7. The method of claim 1, wherein determining whether to detect keywords in the target web page based on the first recognition result comprises:

When the webpage type indicated by the first identification result is determined to be the appointed type, determining to detect keywords in the target webpage; and refusing to detect the keywords in the target webpage when the webpage type indicated by the first identification result is not the appointed type.

8. The method of any of claims 1 to 7, wherein before comparing the text density to a first threshold to obtain a first comparison result, the method further comprises:

the first threshold is determined as follows:

Acquiring a plurality of sets of sample data, wherein each set of sample data in the plurality of sets of sample data comprises: a web page marked as legal, and the text density of the tag content in that legal web page;

determining the first threshold based on the plurality of sets of sample data.
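
As a non-limiting illustration of claim 8, the sketch below derives a first threshold from the text densities of web pages marked as legal. The claim does not state how the threshold is computed from the sample data, so the rule used here (mean plus `k` population standard deviations), the function name `determine_first_threshold`, and the default `k = 2.0` are assumptions made only for this example.

```python
import statistics
from typing import Sequence


def determine_first_threshold(legal_densities: Sequence[float], k: float = 2.0) -> float:
    """Return a threshold lying above the typical density of legal pages.

    Assumption: pages of the specified type have a higher text density
    than legal pages, so the threshold is placed k standard deviations
    above the mean density of the legal samples.
    """
    if not legal_densities:
        raise ValueError("at least one sample is required")
    mean = statistics.mean(legal_densities)
    spread = statistics.pstdev(legal_densities)
    return mean + k * spread


# Hypothetical densities of three pages marked as legal.
first_threshold = determine_first_threshold([10.0, 12.0, 14.0])
```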

9. The method according to claim 1, wherein the method further comprises:

when the first comparison result indicates that the text density is larger than the first threshold value, generating prompt information for indicating that the target webpage is a webpage of the specified type, and displaying the prompt information;

the method further comprises: determining a first recognition result of the target webpage according to the first comparison result; determining whether to detect keywords in the target webpage according to the first recognition result; when determining to detect the keywords in the target webpage, identifying the type of the target webpage according to the keywords to obtain a second recognition result, and determining the target identification result at least according to the second recognition result.

10. A web page type detection device, comprising:

an acquisition module, configured to acquire the text density of tag content in a target webpage, wherein the tag content is content corresponding to a hypertext markup language tag;

a comparison module, configured to compare the text density with a first threshold value to obtain a first comparison result; and

an identification module, configured to identify the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result, wherein identifying the webpage type of the target webpage at least according to the first comparison result to obtain the target identification result comprises: determining a first recognition result of the target webpage according to the first comparison result; determining whether to detect keywords in the target webpage according to the first recognition result; when determining to detect the keywords in the target webpage, identifying the type of the target webpage according to the keywords to obtain a second recognition result, and determining the target identification result at least according to the second recognition result.
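
As a non-limiting illustration of the acquisition and comparison modules, the sketch below computes a text density for an HTML page and compares it with a first threshold. The claims in this section do not define the density formula, so the definition used here (non-whitespace text characters divided by the number of start tags) and the names `text_density` and `exceeds_first_threshold` are assumptions made only for this example. It uses the Python standard library `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser


class _TextDensityParser(HTMLParser):
    """Counts start tags and non-whitespace text characters."""

    def __init__(self) -> None:
        super().__init__()
        self.tag_count = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_data(self, data):
        # Whitespace is ignored so that indentation does not inflate the count.
        # Note: text inside <script> and <style> is also counted in this sketch.
        self.text_chars += len("".join(data.split()))


def text_density(html: str) -> float:
    """Assumed definition: text characters per HTML start tag."""
    parser = _TextDensityParser()
    parser.feed(html)
    parser.close()
    if parser.tag_count == 0:
        return 0.0
    return parser.text_chars / parser.tag_count


def exceeds_first_threshold(html: str, first_threshold: float) -> bool:
    """First comparison result: True when the density is larger than the threshold."""
    return text_density(html) > first_threshold
```

For example, `text_density("<div><p>hello world</p></div>")` counts 10 text characters and 2 tags under this assumed definition.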

11. A non-volatile storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of detecting a web page type according to any one of claims 1 to 9.

12. A computer device, comprising:

a processor; and

a memory, coupled to the processor, for providing instructions to the processor to perform the following processing steps:

Step 1, acquiring text density of tag content in a target webpage, wherein the tag content is content corresponding to a hypertext markup language tag;

Step 2, comparing the text density with a first threshold value to obtain a first comparison result;

Step 3, identifying the webpage type of the target webpage at least according to the first comparison result to obtain a target identification result, wherein identifying the webpage type of the target webpage at least according to the first comparison result to obtain the target identification result comprises: determining a first recognition result of the target webpage according to the first comparison result; determining whether to detect keywords in the target webpage according to the first recognition result; when determining to detect the keywords in the target webpage, identifying the type of the target webpage according to the keywords to obtain a second recognition result, and determining the target identification result at least according to the second recognition result.