CN114036367A

CN114036367A - Webpage risk identification method and device, electronic equipment and medium

Info

Publication number: CN114036367A
Application number: CN202111381872.3A
Authority: CN
Inventors: 李海涛; 袁瑞金; 马海娜; 裴佳; 王存玮
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Dushang Software Technology Co.,Ltd.
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-02-11

Abstract

The disclosure provides a webpage risk identification method, device, medium and product, and relates to the technical field of computers, in particular to the technical fields of big data, intelligent identification and the like. The webpage risk identification method comprises the following steps: determining at least one type of feature data associated with a target web page based on at least one of web page data of the target web page and operational data for the target web page; determining first-type index data and second-type index data based on the at least one type of feature data; and using the first-type index data to adjust initial risk data obtained from the second-type index data, so as to obtain target risk data, wherein the target risk data represents the risk condition of the target webpage.

Description

Webpage risk identification method and device, electronic equipment and medium

Technical Field

The present disclosure relates to the field of computer technologies, specifically to the technical fields of big data, intelligent identification, and the like, and more specifically, to a method, an apparatus, an electronic device, a medium, and a program product for identifying a risk of a web page.

Background

In the related art, it is generally required to identify a web page in order to determine whether the web page is at risk. For example, when a web page is identified as having an offending advertisement, it may be determined that the web page is at risk. The identification efficiency and the identification accuracy of the risk identification of the webpage by the related technology are low.

Disclosure of Invention

The disclosure provides a webpage risk identification method, a webpage risk identification device, an electronic device, a storage medium and a program product.

According to an aspect of the present disclosure, a method for identifying a risk of a webpage is provided, including: determining at least one type of feature data associated with a target web page based on at least one of web page data of the target web page and operational data for the target web page; determining a first type of index data and a second type of index data based on the at least one type of feature data; and using the first type of index data to adjust initial risk data obtained from the second type of index data, so as to obtain target risk data, wherein the target risk data represents the risk condition of the target webpage.

According to another aspect of the present disclosure, there is provided a web page risk identification apparatus including a first determination module, a second determination module and an adjustment module. The first determination module is configured to determine at least one type of feature data associated with a target web page based on at least one of web page data of the target web page and operation data for the target web page. The second determination module is configured to determine the first type of index data and the second type of index data based on the at least one type of feature data. The adjustment module is configured to use the first type of index data to adjust initial risk data obtained from the second type of index data, so as to obtain target risk data, wherein the target risk data represents the risk condition of the target webpage.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the web page risk identification method described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the above-described web page risk identification method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described web page risk identification method.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 schematically illustrates a system architecture for a web page risk identification method and apparatus according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow diagram of a web page risk identification method according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a schematic diagram of a web page risk identification method according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a schematic diagram of determining a first metric threshold value according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates a block diagram of a web page risk identification apparatus according to an embodiment of the present disclosure; and

FIG. 6 is a block diagram of an electronic device for performing web page risk identification used to implement an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).

The embodiment of the disclosure provides a webpage risk identification method, which includes: determining at least one type of feature data associated with the target web page based on at least one of web page data of the target web page and operational data for the target web page. Then, based on the at least one type of feature data, a first type of index data and a second type of index data are determined. Finally, the first type of index data is used to adjust initial risk data obtained from the second type of index data, so as to obtain target risk data, wherein the target risk data represents the risk condition of the target webpage.

FIG. 1 schematically illustrates a system architecture for a web page risk identification method and apparatus according to an embodiment of the present disclosure. It should be noted that FIG. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments or scenarios.

As shown in FIG. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between clients 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few.

A user may use clients 101, 102, 103 to interact with server 105 over network 104 to receive or send messages, etc. Various client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (examples only) may be installed on the clients 101, 102, 103.

Clients 101, 102, 103 may be a variety of electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablets, laptop and desktop computers, and the like. The clients 101, 102, 103 of the disclosed embodiments may run applications, for example.

The server 105 may be a server that provides various services, such as a background management server (for example only) that provides support for websites browsed by users using the clients 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the client. In addition, the server 105 may also be a cloud server, i.e., the server 105 has a cloud computing function.

It should be noted that the web page risk identification method provided by the embodiment of the present disclosure may be executed by the server 105. Accordingly, the web page risk identification device provided by the embodiment of the present disclosure may be disposed in the server 105. The web page risk identification method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster different from the server 105 and capable of communicating with the clients 101, 102, 103 and/or the server 105. Accordingly, the web page risk identification device provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the clients 101, 102, 103 and/or the server 105.

In one example, the server 105 may obtain web page data for a target web page and operational data for the target web page from the clients 101, 102, 103 via the network 104 and process the web page data and operational data to determine whether the target web page is at risk.

It should be understood that the number of clients, networks, and servers in FIG. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for an implementation.

The embodiment of the present disclosure provides a method for identifying a risk of a webpage. The method according to an exemplary embodiment of the present disclosure is described below with reference to FIGS. 2 to 4, in conjunction with the system architecture of FIG. 1. The method may be performed by, for example, the server shown in FIG. 1, which is, for example, the same as or similar to the electronic device described below.

FIG. 2 schematically shows a flowchart of a web page risk identification method according to an embodiment of the present disclosure.

As shown in fig. 2, the web page risk identification method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S230.

At operation S210, at least one type of feature data associated with the target web page is determined based on at least one of web page data of the target web page and operation data for the target web page.

In operation S220, a first type of index data and a second type of index data are determined based on at least one type of feature data.

In operation S230, the first-type index data is used to adjust the initial risk data obtained from the second-type index data, so as to obtain the target risk data.

Illustratively, the web page data of the target web page includes, for example, attributes of the web page, content categories of the web page, the amount of content of each content category, and the like. The content category includes, for example, a text category, a picture category, and the like, the content number of the text category includes, for example, the number of texts, the content number of the picture category includes, for example, the number of pictures, and the like.

The operation data for the target webpage includes, for example, data obtained by a user operating the target webpage, and the operation includes, for example, a search operation, an operation of clicking related content in the target webpage, and the like.

After obtaining the web page data and the operational data, at least one of the web page data and the operational data may be processed to obtain at least one type of feature data associated with the target web page. The at least one type of feature data may include multiple types of feature data.

Then, at least one type of feature data is processed to obtain a plurality of index data. The plurality of index data includes, for example, a first type of index data and a second type of index data. The first type of index data may be used to characterize the target web page as a safe web page, and the second type of index data may be used to characterize the target web page as a risky web page.

Next, initial risk data may be determined from the second type of index data; the initial risk data characterizes, to some extent, whether the target web page is at risk. The initial risk data is then adjusted by using the first type of index data to obtain target risk data, which finally represents whether the target webpage is at risk. In one example, when the initial risk data obtained from the second type of index data represents that the target webpage is a risk webpage, but the first type of index data represents that the target webpage is a safe webpage, the target risk data obtained by adjusting the initial risk data with the first type of index data may represent that the target webpage is a safe webpage. In another example, when the initial risk data represents that the target webpage is a risk webpage and the first type of index data cannot represent that the target webpage is a safe webpage, the target risk data obtained by the adjustment may represent that the target webpage is either a risk webpage or a safe webpage, depending on the value of the initial risk data.

According to the embodiment of the disclosure, the webpage data and the operation data are processed to obtain multi-class feature data, then the first class index data and the second class index data are determined based on the multi-class feature data, the initial risk data aiming at the target webpage are determined by the second class index data, and the initial risk data are adjusted by the first class index data to obtain the target risk data which finally represents whether the target webpage has risks. By the technical scheme, the identification efficiency and the identification accuracy for identifying whether the target webpage has risks are improved.

FIG. 3 schematically illustrates a schematic diagram of a web page risk identification method according to an embodiment of the present disclosure.

As shown in FIG. 3, for a target web page 310, web page data 321 and operational data 322 for the target web page 310 are determined.

Illustratively, based on the operational data 322 for the target web page 310, first type feature data 331 is determined, the first type feature data 331 characterizing search information associated with the target web page 310.

For example, the first-class feature data 331 is determined based on various raw data of the entire process of the user inputting a search term to the opening of the target web page 310. The raw data includes, for example, search terms, daily consumption data of the target web page, 7-day consumption data of the target web page, and the like.

For example, a classified word list of keywords is established, and each input search word is matched against the classified word list by using a fuzzy matching technique, so as to classify the search word as a risk word, a safe word or an unknown word. In principle, the smaller the proportion of unknown words, the better. Each search record may then be classified by the category of its search word as risk traffic (corresponding to risk words), security traffic (corresponding to safe words), or unknown traffic (corresponding to unknown words). The target web page 310 usually has a plurality of search records, so the total risk traffic and the total security traffic for the target web page 310 per day can be counted. One search record can be considered one unit of traffic.
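The classification and counting step described above can be sketched as follows. This is only an illustrative sketch: the word list contents, the similarity cutoff of 0.8, and the function names are assumptions, and Python's standard `difflib` is used as a stand-in for whatever fuzzy matching technique an implementation would choose.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical classified word list; the disclosure does not give its contents.
CLASSIFIED_WORDS = {
    "free loan": "risk",
    "guaranteed cure": "risk",
    "weather forecast": "safe",
    "train timetable": "safe",
}


def classify_search_term(term, cutoff=0.8):
    """Return 'risk', 'safe' or 'unknown' for a single search term.

    The term is fuzzily matched against the classified word list; if no
    entry is similar enough (assumed cutoff), the term is 'unknown'.
    """
    match = get_close_matches(term.lower(), list(CLASSIFIED_WORDS), n=1, cutoff=cutoff)
    return CLASSIFIED_WORDS[match[0]] if match else "unknown"


def count_traffic(search_terms):
    """Count one unit of traffic per search record, grouped by category."""
    return Counter(classify_search_term(term) for term in search_terms)


# One day's search records that led to the target web page (made-up data).
records = ["free loans", "weather forecast", "qwerty"]
traffic = count_traffic(records)
```

Here `traffic["risk"]` and `traffic["safe"]` play the role of the daily total risk traffic and total security traffic for the page.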

The target web page 310 typically contains a promotional advertisement, and the first-type feature data 331 reflects, for example, the promotion strategy of the advertiser, thereby indirectly reflecting the purpose of the advertiser's promotional advertisement.

Illustratively, based on the web page data 321 of the target web page 310, a second type of feature data 332 is determined, the second type of feature data 332 characterizing, for example, attribute information of the target web page 310.

The second type of feature data 332 reflects, for example, information other than the textual content of the target web page 310. Such information includes, for example, the advertiser's industry, the number of pictures in the web page, the number of characters, the number of lead-conversion components (a lead-conversion component is, for example, a component that collects user information and through which a user can interact with the back end, such as an information-submission component), the number of download components, the total number of business component categories (picture component category, video component category, payment component category, and the like), the number of web pages in the advertiser's account that passed review, the number of web pages rejected, and the like. The second type of feature data 332 may be obtained from a database associated with the target web page 310, or may be obtained by analyzing the structural information of the target web page 310 through a crawler technique.

The second type of feature data 332 reflects, for example, the sophistication and complexity of the target web page 310, and also reflects whether the target web page 310 is suspected to bypass the auditing system at the technical level.

Illustratively, based on the operation data 322 for the target web page 310, a third class of feature data 333 is determined, the third class of feature data 333 characterizing, for example, interaction information associated with the target web page 310.

The third type of feature data 333 is derived, for example, from data generated when a user interacts with the target web page 310. The third type of feature data 333 includes, for example, features such as the web page dwell time, whether the user submits a form, whether the user makes a call, whether the user copies a communication identifier, whether the user talks to customer service in a consultation tool, etc.

The third type of feature data 333 represents, for example, the user's reaction to the target web page 310, and indirectly represents the quality and credibility of the target web page 310.

According to the embodiment of the disclosure, multiple classes of feature data are obtained based on the webpage data and the operation data, and the first-type index data and the second-type index data are obtained based on the multiple classes of feature data, which improves the richness of the index data. In addition, webpage risk identification based on text content is vulnerable to interference, for example when the text content is replaced by pictures. By contrast, the embodiment of the disclosure obtains feature data from information other than the text content of the target webpage, which improves the effect of identifying webpage risk based on the feature data.

After the first type feature data 331, the second type feature data 332, and the third type feature data 333 are obtained, the first type index data 341 and the second type index data 342 may be obtained by abstracting the first type feature data 331, the second type feature data 332, and the third type feature data 333.

The first-type index data 341 is, for example, exemption-type index data, which is, for example, threshold-type index data; that is, the result of a comprehensive calculation over one or more feature data of the target web page 310 is taken as the first-type index data 341. When any index data in the first-type index data 341 falls within a certain threshold range, that index data takes effect. Once the index data takes effect, it indicates that the target webpage 310 is a safe, risk-free and legal webpage.

The first-type index data 341 includes, for example, a risk traffic proportion (calculated from the total risk traffic and the total security traffic in the first-type feature data), an account review page proportion, and the like.

The second type of index data 342 is, for example, risk-type index data, which may be threshold-type or Boolean-type index data. If a given second-type index data 342 is Boolean-type index data, it takes effect when the target web page 310 has a certain feature. Index data that has taken effect characterizes the target web page 310 as a risky web page.

The second type of index data 342 includes, for example, the daily conversion number of the web page (the number of interactions performed by users through the web page), the daily consumption, the 7-day promotional consumption variance, the number of service components used, the number of Chinese characters, the number of pictures, the average height of the pictures, the average dwell time on the page, and the like.

For the first index threshold value 351 corresponding to the first-type index data 341, when the first-type index data 341 includes a plurality of index data, one first index threshold value corresponds to each index data. Similarly, for the second index threshold 352 corresponding to the second type of index data 342, when the second type of index data 342 includes a plurality of index data, each index data corresponds to one second index threshold.

For example, the first weight 361 is determined based on the first-type index data 341 and the first index threshold 351 corresponding to the first-type index data 341. Taking the risk traffic proportion as an example of the first-type index data 341, the first index threshold 351 is, for example, 70%. When the risk traffic proportion for the target webpage 310 is smaller than the first index threshold 351 (70%), the target webpage 310 hits the exemption-type index, that is, the risk traffic proportion indicates that the target webpage 310 is a safe webpage, and the first weight 361 is, for example, 0. When the risk traffic proportion is greater than or equal to the first index threshold 351 (70%), the target web page 310 misses the exemption-type index, that is, the risk traffic proportion indicates that the target web page 310 is a risk web page, and the first weight 361 is, for example, 1.
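A minimal sketch of this exemption check is given below. The text only says that the risk traffic proportion is calculated from the total risk traffic and the total security traffic; computing it as risk / (risk + security) is an assumption, as are the function names.

```python
def risk_traffic_proportion(risk_traffic, security_traffic):
    """Share of classified traffic that is risk traffic.

    Assumption: proportion = risk / (risk + security); unknown traffic
    is ignored because the text does not say how it is treated.
    """
    total = risk_traffic + security_traffic
    if total == 0:
        raise ValueError("no classified traffic for this page")
    return risk_traffic / total


def first_weight(proportion, first_index_threshold=0.70):
    """Return 0 when the exemption-type index is hit, otherwise 1."""
    # Below the threshold: the page hits the exemption index (treated as safe).
    return 0 if proportion < first_index_threshold else 1


# A page with 30 risk searches and 70 safe searches hits the exemption.
weight_low = first_weight(risk_traffic_proportion(30, 70))
# A page with 80 risk searches and 20 safe searches does not.
weight_high = first_weight(risk_traffic_proportion(80, 20))
```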

Next, initial risk data 380 is determined based on the second type of metric data 342 and a second metric threshold 352 corresponding to the second type of metric data 342.

For example, the second type of index data 342 includes a plurality of index data. Target index data 362 is determined from the plurality of index data based on the plurality of index data and a plurality of second index thresholds 352 corresponding one-to-one to the plurality of index data, and the initial risk data 380 is determined based on the target index data 362 and the precision rate 371 for the target index data 362.

For example, suppose the second type of index data 342 includes 3 index data: the average web page dwell time, the 7-day promotional consumption variance, and the number of Chinese characters. The second index threshold 352 corresponding to the average web page dwell time is, for example, a seconds, the second index threshold 352 corresponding to the 7-day promotional consumption variance is, for example, b, and the second index threshold 352 corresponding to the number of Chinese characters is, for example, c.

For example, if the average web page dwell time for the target web page 310 is less than the second index threshold 352 (a seconds), the average dwell time index takes effect (i.e., it indicates that the target web page 310 is at risk). If the 7-day promotional consumption variance for the target web page 310 is greater than the second index threshold 352 (b), the 7-day promotional consumption variance index takes effect (i.e., it indicates that the target web page 310 is at risk). If the number of Chinese characters for the target web page 310 is less than the second index threshold 352 (c), the number-of-Chinese-characters index takes effect (i.e., it indicates that the target web page 310 is at risk).

Next, index data representing the risk of the target web page 310 is used as the target index data 362, for example, the target index data 362 includes the average dwell time of the web page and the 7-day popularization consumption variance. Each target index datum 362 corresponds to, for example, a precision 371, and the precision 371 is obtained from a plurality of historical web pages, which will be described later.
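The selection of target index data from threshold-type second-type index data could be sketched as below. The thresholds a, b and c are not given in the text, so the numeric values, the field names and the rule table are purely hypothetical.

```python
import operator

# (index name, comparison under which the index takes effect, threshold).
# The values standing in for a, b and c are made up for illustration.
RULES = [
    ("avg_dwell_seconds", operator.lt, 5.0),      # a: dwell time below a seconds
    ("spend_variance_7d", operator.gt, 1000.0),   # b: variance above b
    ("chinese_char_count", operator.lt, 50),      # c: fewer than c characters
]


def select_target_indexes(index_values, rules=RULES):
    """Return the names of the indexes that take effect for one page."""
    return [
        name
        for name, takes_effect, threshold in rules
        if takes_effect(index_values[name], threshold)
    ]


# Hypothetical page: short dwell time, volatile spend, plenty of text.
page = {
    "avg_dwell_seconds": 3.2,
    "spend_variance_7d": 2500.0,
    "chinese_char_count": 400,
}
target_indexes = select_target_indexes(page)
```

With these made-up values, the dwell time and spend variance indexes take effect while the character count does not, mirroring the example in the text.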

Based on the precision rate 371 for the target index data 362, a second weight 372 for the target index data 362 is determined. Then, based on the precision rate 371 and the second weight 372 for the target index data 362, the initial risk data 380 is determined.

For example, the target index data 362 (average web page dwell time, 7-day promotional consumption variance) are sorted by precision rate, with higher precision rates ranked first. The higher the target index data 362 is ranked, the greater its second weight. The initial risk data 380 is derived, for example, from equation (1) below.

Here, p_i is the precision rate of the i-th target index data; w_i is the second weight of the i-th target index data; n is the number of target index data; and p is the initial risk data 380 (also referred to as a risk coefficient).

The initial risk data 380 is then adjusted using the first weights 361 to obtain target risk data 390. For example, multiplying the first weight 361 by the initial risk data 380 (risk coefficient) results in the target risk data 390.

It can be understood that when the target web page 310 hits the first-type index data 341 (exemption-type index), the first-type index data 341 indicates that the target web page 310 is a safe web page. In this case the first weight 361 is, for example, 0, so the target risk data 390 obtained by multiplying the first weight 361 by the initial risk data 380 is also 0, which indicates that the target web page 310 is a safe web page. When the target web page 310 misses the first-type index data 341 (exemption-type index), the first-type index data 341 fails to characterize the target web page 310 as a safe web page. In this case the first weight 361 is, for example, 1, so the target risk data 390 obtained by multiplying the first weight 361 by the initial risk data 380 equals the initial risk data 380 (risk coefficient), and whether the target web page 310 is at risk is determined from the specific value of the initial risk data 380 (risk coefficient).
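The adjustment itself is a single multiplication, sketched here under the assumption that the first weight is restricted to the example values 0 and 1 and that the initial risk data is a plain number. The function name is hypothetical.

```python
def target_risk(first_weight, initial_risk):
    """Adjust the initial risk coefficient with the exemption weight."""
    if first_weight not in (0, 1):
        # The text only gives 0 and 1 as example weights.
        raise ValueError("first_weight is expected to be 0 or 1")
    return first_weight * initial_risk


# Exempted page: the risk coefficient is suppressed to 0.
exempted = target_risk(0, 0.83)
# Non-exempted page: the risk coefficient passes through unchanged.
not_exempted = target_risk(1, 0.83)
```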

According to the embodiment of the disclosure, the initial risk data for the target webpage is obtained from the second type of index data, and the initial risk data is then adjusted based on the first weight obtained from the first type of index data to obtain the target risk data. This improves the accuracy of the target risk data and, in turn, the accuracy of determining whether the target webpage is at risk based on the target risk data.

For each index data in the first type of index data, a recall rate and a precision rate for each index data are calculated, and a first index threshold for each index data is calculated based on the recall rate and the precision rate. The specific process is as follows.

For example, M historical web pages are obtained, and the M historical web pages include M₁ risk history web pages, where M and M₁ are both integers greater than 0. Then, P initial thresholds are set for the first-type index data, where P is an integer greater than 0. The first-type index data includes, for example, a plurality of index data; P initial thresholds may be set for each index data, and the number of initial thresholds set for each index data may be the same or different.

Based on the M₁ risk history web pages and the P initial thresholds, the recall rate of each index data in the first-type index data is determined. For example, each index data of the M₁ risk history web pages is compared with each of the P initial thresholds corresponding to that index data, so as to determine m₁ risk history web pages from among the M₁ risk history web pages, where m₁ is an integer greater than 0. Then, the ratio between m₁ and M₁ is determined as the recall rate for the index data.

Taking the risk traffic proportion as an example of the first type of index data, a plurality of first index thresholds of 60%, 65%, 70% and 75% are set for the index data. The M historical web pages are, for example, 50000 historical web pages, of which 30000 (M₁) are labeled as risk history web pages and the remaining 20000 (M₂) are labeled as security history web pages, where M₂ is an integer greater than 0. For the first index threshold of 60%, if, for example, the risk traffic proportion of 25000 (m₁) of the 30000 risk history web pages is greater than 60%, the recall rate is 25000/30000. For the first index threshold of 65%, if, for example, the risk traffic proportion of 20000 (m₁) of the 30000 risk history web pages is greater than 65%, the recall rate is 20000/30000. The first index thresholds of 70% and 75% are handled similarly. The recall rate is shown in equation (2).

$$\text{Recall} = \frac{m_1}{M_1} \qquad (2)$$

where M₁ represents the number of risk history web pages, and m₁ represents, for example, the number of risk history web pages that hit a certain index data in the first type of index data.
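The recall computation described for equation (2) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of the disclosure: the function name `recall_per_threshold`, the toy values, and the strict "greater than" comparison (taken from the 60% example above) are assumptions.

```python
from typing import Dict, Sequence


def recall_per_threshold(risk_page_values: Sequence[float],
                         thresholds: Sequence[float]) -> Dict[float, float]:
    """Return m1 / M1 for each candidate threshold.

    risk_page_values holds one index value (e.g. the risk traffic
    proportion) per labeled risk history web page, so M1 is its length.
    A page "hits" the index when its value is strictly greater than the
    threshold (assumption based on the 60% example).
    """
    big_m1 = len(risk_page_values)
    if big_m1 == 0:
        raise ValueError("at least one risk history web page is required")
    recalls = {}
    for t in thresholds:
        m1 = sum(1 for v in risk_page_values if v > t)
        recalls[t] = m1 / big_m1
    return recalls


# Hypothetical toy data: six risk history web pages instead of 30000.
toy_values = [0.62, 0.66, 0.71, 0.80, 0.55, 0.90]
toy_recalls = recall_per_threshold(toy_values, [0.60, 0.65, 0.70, 0.75])
```

With the counts from the text, the same ratio gives 25000/30000 for the 60% threshold and 20000/30000 for the 65% threshold.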

For each index data in the first type of index data, the precision rate for the index data is determined based on the M historical web pages and the P initial thresholds.

For example, the M historical web pages also include M₂ security history web pages. Each index data (first type of index data) of each of the M₂ security history web pages is compared with each of the P initial thresholds corresponding to that index data, so as to determine m₂ security history web pages from among the M₂ security history web pages, where m₂ is an integer greater than 0. Then, the ratio of the sum of m₁ and m₂ to M is determined as the precision rate for the index data.

Taking the risk traffic proportion as an example of the first type of index data, a plurality of first index thresholds of 60%, 65%, 70% and 75% are set for the index data. The M historical web pages are, for example, 50000 historical web pages, of which 30000 (M₁) are labeled as risk history web pages and the remaining 20000 (M₂) are labeled as security history web pages. For the first index threshold of 60%, if, for example, the risk traffic proportion of 10000 (m₁) of the 30000 risk history web pages is greater than or equal to 60%, and the risk traffic proportion of 15000 (m₂) of the 20000 security history web pages is less than 60%, the precision rate for the risk traffic proportion index data is (10000 + 15000)/50000. The first index thresholds of 65%, 70% and 75% are handled similarly. The precision rate is shown in equation (3).

$$\text{Precision} = \frac{m_1 + m_2}{M} \qquad (3)$$

where M represents the number of historical web pages; m₁ represents, for example, the number of risk history web pages that hit a certain index data in the first type of index data; and m₂ represents, for example, the number of security history web pages that miss a certain index data in the first type of index data.
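The precision rate described for equation (3) can be illustrated as follows. This is a hedged sketch: the function names are hypothetical, and the comparison directions (a risk page hits at "greater than or equal to" the threshold, a security page misses at "less than" the threshold) follow the 60% example above.

```python
from typing import Sequence


def precision_from_counts(m1: int, m2: int, total_pages: int) -> float:
    """Return (m1 + m2) / M as described for equation (3)."""
    if total_pages <= 0:
        raise ValueError("M must be greater than 0")
    return (m1 + m2) / total_pages


def precision_rate(risk_values: Sequence[float],
                   security_values: Sequence[float],
                   threshold: float) -> float:
    """Count hits and misses for one threshold, then apply the ratio."""
    # m1: risk history web pages that hit the index (value >= threshold).
    m1 = sum(1 for v in risk_values if v >= threshold)
    # m2: security history web pages that miss the index (value < threshold).
    m2 = sum(1 for v in security_values if v < threshold)
    return precision_from_counts(m1, m2,
                                 len(risk_values) + len(security_values))


# Worked example from the text: (10000 + 15000) / 50000.
example = precision_from_counts(10000, 15000, 50000)
```

For the numbers in the text, `example` evaluates to 0.5.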

After the recall rate and the precision rate for each index data in the first type of index data are obtained, a first index threshold value for each index data is determined based on the recall rate for the first type of index data, the precision rate for the first type of index data, and P initial threshold values corresponding to each index data, as specifically shown in fig. 4.

FIG. 4 schematically illustrates determining a first index threshold according to an embodiment of the present disclosure.

As shown in fig. 4, the abscissa is, for example, the recall rate (Recall), and the ordinate is, for example, the precision rate (Precision). For a certain index data in the first type of index data, a plurality of first index thresholds of 60%, 65%, 70% and 75% are set, and the recall rate and the precision rate corresponding to each first index threshold are calculated. Then, an ROC (receiver operating characteristic) curve is used to find a balance point and select the best first index threshold. For example, as shown in fig. 4, for a certain index data, 70% is determined as the final first index threshold for the index data based on the tangent of the curve. It can be understood that determining the final first index threshold by using the ROC curve is only an example, and the embodiments of the present disclosure do not specifically limit the manner of determining the final first index threshold. The final first index threshold is used to determine the first weight mentioned above.
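Since the manner of determining the final first index threshold is not limited, one simple stand-in for reading the balance point off the curve is to pick the threshold whose recall rate and precision rate have the largest harmonic mean. The sketch below uses that swapped-in criterion, which is an assumption and not the tangent-based procedure of fig. 4; the function name and the (recall, precision) values are hypothetical.

```python
from typing import Dict, Tuple


def pick_balanced_threshold(points: Dict[float, Tuple[float, float]]) -> float:
    """Return the threshold whose (recall, precision) pair has the
    largest harmonic mean (an assumed balance criterion)."""

    def harmonic_mean(pair: Tuple[float, float]) -> float:
        recall, precision = pair
        if recall + precision == 0:
            return 0.0
        return 2 * recall * precision / (recall + precision)

    return max(points, key=lambda t: harmonic_mean(points[t]))


# Hypothetical (recall, precision) pairs; they are not read from fig. 4.
candidate_points = {
    0.60: (0.83, 0.50),
    0.65: (0.67, 0.62),
    0.70: (0.60, 0.75),
    0.75: (0.40, 0.80),
}
best_threshold = pick_balanced_threshold(candidate_points)
```

Other balance criteria (for example, the point where recall and precision are closest) would fit the same interface.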

For each index data in the first type of index data, a higher recall rate reflects higher reliability of the index data, and a higher precision rate reflects better identification performance of the index data.

According to the embodiments of the present disclosure, the first index threshold for the first type of index data is determined based on the recall rate and the precision rate, and the first weight for the target web page is then determined based on the first index threshold, which improves the accuracy with which the first weight is determined.

For each index data in the second type of index data, a precision rate is calculated, and a second index threshold for the index data is determined based on the precision rate. The specific process is as follows.

For example, N historical web pages are obtained, where N is an integer greater than 0. Then, Q initial thresholds are set for the second type index data, Q being an integer greater than 0. The second type of index data includes, for example, a plurality of index data, Q initial thresholds may be set for each index data, and the number of initial thresholds set for each index data may be the same or different. Next, a precision rate for each index data in the second type of index data is determined based on the N historical web pages and the Q initial thresholds, and a second index threshold for each index data is determined based on the precision rate for each index data and the Q initial thresholds. The N history web pages may be, for example, the same as the M history web pages described above.

Illustratively, the N historical web pages include, for example, N₁ risk history web pages and N₂ security history web pages, where N₁ and N₂ are both integers greater than 0.

Each index data (second type of index data) of each of the N₁ risk history web pages is compared with each of the Q initial thresholds corresponding to that index data, so as to determine n₁ risk history web pages from among the N₁ risk history web pages, where n₁ is an integer greater than 0.

Each index data (second type of index data) of each of the N₂ security history web pages is compared with each of the Q initial thresholds corresponding to that index data, so as to determine n₂ security history web pages from among the N₂ security history web pages, where n₂ is an integer greater than 0.

The ratio of the sum of n₁ and n₂ to N is determined as the precision rate for the index data. After the precision rate of each index data (second type of index data) is obtained, the above-mentioned second weight may be determined based on the precision rate. The precision rate is shown in equation (4).

$$\text{Precision} = \frac{n_1 + n_2}{N} \qquad (4)$$

where N represents the number of historical web pages; n₁ represents, for example, the number of risk history web pages that hit a certain index data in the second type of index data; and n₂ represents, for example, the number of security history web pages that miss a certain index data in the second type of index data. It can be understood that the process of calculating the precision rate for a certain index data in the second type of index data is similar to the process, shown in equation (3), of calculating the precision rate for a certain index data in the first type of index data, and a detailed description is omitted here.

For a certain index data in the second type of index data, a plurality of second index thresholds are set for the index data, the precision rate corresponding to each second index threshold is calculated using equation (4), and the second index threshold corresponding to the maximum precision rate is determined as the final second index threshold for the index data. The higher the precision rate corresponding to the index data, the more important the index data is and the greater its contribution to identifying the risk of the web page.
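Selecting the second index threshold with the maximum precision rate can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the toy values, and the comparison directions (risk pages hit at "greater than or equal to", security pages miss at "less than", mirroring the example given for equation (3)) are not specified by the text for the second type of index data.

```python
from typing import Dict, Sequence, Tuple


def second_index_threshold(
        risk_values: Sequence[float],
        security_values: Sequence[float],
        thresholds: Sequence[float]) -> Tuple[float, Dict[float, float]]:
    """Return the threshold with the maximum (n1 + n2) / N, plus all rates."""
    total = len(risk_values) + len(security_values)  # N
    if total == 0:
        raise ValueError("N must be greater than 0")
    rates = {}
    for t in thresholds:
        n1 = sum(1 for v in risk_values if v >= t)      # risk pages that hit
        n2 = sum(1 for v in security_values if v < t)   # security pages that miss
        rates[t] = (n1 + n2) / total
    # On ties, max() keeps the first threshold in iteration order.
    best = max(rates, key=rates.get)
    return best, rates


# Hypothetical toy data: four risk pages and four security pages.
best, rates = second_index_threshold(
    risk_values=[0.9, 0.8, 0.7, 0.4],
    security_values=[0.1, 0.3, 0.5, 0.5],
    thresholds=[0.35, 0.55, 0.75],
)
```

For this toy data the middle threshold separates the two groups best, so it is returned as the final second index threshold.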

According to the embodiments of the present disclosure, the second index threshold for the second type of index data is determined based on the precision rate, and the initial risk data (risk coefficient) for the target web page is then determined based on the second index threshold, which improves the accuracy with which the initial risk data (risk coefficient) is determined.

In another embodiment of the present disclosure, the plurality of historical web pages used for determining the first index threshold and the second index threshold may be adjusted through manual verification, for example by adjusting their number (e.g., adding historical web pages) and their annotation information (which indicates whether a historical web page is a risk history web page or a security history web page). The first index threshold, the second index threshold, the first weight and the second weight are then iteratively determined based on the adjusted historical web pages, so as to improve the accuracy of risk web page identification.

Fig. 5 schematically shows a block diagram of a web page risk identification apparatus according to an embodiment of the present disclosure.

As shown in fig. 5, the web page risk identification apparatus 500 of the embodiment of the present disclosure includes, for example, a first determination module 510, a second determination module 520, and an adjustment module 530.

The first determination module 510 may be configured to determine at least one type of feature data associated with the target web page based on at least one of web page data of the target web page and operational data for the target web page. According to an embodiment of the present disclosure, the first determining module 510 may perform, for example, operation S210 described above with reference to fig. 2, which is not described herein again.

The second determination module 520 may be configured to determine the first type of metric data and the second type of metric data based on the at least one type of feature data. According to the embodiment of the present disclosure, the second determining module 520 may perform, for example, operation S220 described above with reference to fig. 2, which is not described herein again.

The adjusting module 530 may be configured to adjust the initial risk data obtained from the second type of index data by using the first type of index data to obtain target risk data, where the target risk data represents a risk condition existing in the target webpage. According to the embodiment of the present disclosure, the adjusting module 530 may, for example, perform the operation S230 described above with reference to fig. 2, which is not described herein again.

According to an embodiment of the present disclosure, the adjusting module 530 includes: a first determination submodule, a second determination submodule, and an adjustment submodule. The first determining submodule is used for determining a first weight based on the first type of index data and a first index threshold corresponding to the first type of index data; the second determining submodule is used for determining initial risk data based on the second type of index data and a second index threshold corresponding to the second type of index data; and the adjusting submodule is used for adjusting the initial risk data by using the first weight to obtain target risk data.

According to an embodiment of the present disclosure, the second type of index data includes a plurality of index data; the second determination submodule includes: a first determination unit and a second determination unit. The first determination unit is configured to determine target index data from the plurality of index data based on the plurality of index data and a plurality of second index thresholds in one-to-one correspondence with the plurality of index data; the second determination unit is configured to determine the initial risk data based on the target index data and the precision rate for the target index data.

According to an embodiment of the present disclosure, the second determination unit includes: a first determining subunit and a second determining subunit. A first determining subunit, configured to determine a second weight for the target index data based on the precision rate for the target index data; and the second determining subunit is used for determining the initial risk data based on the precision rate and the second weight aiming at the target index data.
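The data flow through the submodules and units above (target index data, second weight, initial risk data, first weight, target risk data) can be sketched as follows. The passage names these steps but not their formulas, so every formula in this sketch is an assumption chosen for illustration: target index data are taken to be the second-type indices that reach their thresholds, the second weight is a normalized precision rate, the first weight is 1 plus the share of first-type indices above their thresholds, and the adjustment is a multiplication. The function and index names are hypothetical.

```python
from typing import Dict


def target_risk(first_type: Dict[str, float],
                first_thresholds: Dict[str, float],
                second_type: Dict[str, float],
                second_thresholds: Dict[str, float],
                precisions: Dict[str, float]) -> float:
    """Illustrative pipeline; all formulas below are assumptions."""
    # Target index data: second-type indices that reach their threshold.
    targets = [name for name, value in second_type.items()
               if value >= second_thresholds[name]]
    precision_sum = sum(precisions[name] for name in targets)
    if not targets or precision_sum == 0:
        return 0.0
    # Second weight: precision rate normalized over the target indices.
    second_weights = {name: precisions[name] / precision_sum
                      for name in targets}
    # Initial risk data: sum of precision rate times second weight.
    initial_risk = sum(precisions[name] * second_weights[name]
                       for name in targets)
    # First weight: 1 plus the share of first-type indices above threshold.
    hits = sum(1 for name, value in first_type.items()
               if value > first_thresholds[name])
    first_weight = 1.0 + hits / len(first_type) if first_type else 1.0
    # Target risk data: initial risk data adjusted by the first weight.
    return initial_risk * first_weight


score = target_risk(
    first_type={"risk_traffic_proportion": 0.72},
    first_thresholds={"risk_traffic_proportion": 0.70},
    second_type={"ad_density": 0.8, "popup_rate": 0.2},
    second_thresholds={"ad_density": 0.5, "popup_rate": 0.4},
    precisions={"ad_density": 0.9, "popup_rate": 0.7},
)
```

In this toy call only `ad_density` qualifies as target index data, so the initial risk data is 0.9 and the first weight of 2.0 raises the target risk data to 1.8.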

According to an embodiment of the present disclosure, the first index threshold corresponding to the first type of index data is obtained by means of the following modules: a first obtaining module, configured to obtain M historical web pages, where the M historical web pages include M₁ risk history web pages, and M and M₁ are both integers greater than 0; a first setting module, configured to set P initial thresholds for the first type of index data, where P is an integer greater than 0; a third determination module, configured to determine the recall rate for the first type of index data based on the M₁ risk history web pages and the P initial thresholds; a fourth determination module, configured to determine the precision rate for the first type of index data based on the M historical web pages and the P initial thresholds; and a fifth determination module, configured to determine the first index threshold based on the recall rate for the first type of index data, the precision rate for the first type of index data, and the P initial thresholds.

According to an embodiment of the present disclosure, the third determination module includes: a first comparison submodule and a third determination submodule. The first comparison submodule is configured to compare the first type of index data of each of the M₁ risk history web pages with each of the P initial thresholds, so as to determine m₁ risk history web pages from among the M₁ risk history web pages, where m₁ is an integer greater than 0; the third determination submodule is configured to determine the ratio of m₁ to M₁ as the recall rate for the first type of index data.

According to an embodiment of the present disclosure, the M historical web pages further include M₂ security history web pages, where M₂ is an integer greater than 0; the fourth determination module includes: a second comparison submodule and a fourth determination submodule. The second comparison submodule is configured to compare the first type of index data of each of the M₂ security history web pages with each of the P initial thresholds, so as to determine m₂ security history web pages from among the M₂ security history web pages, where m₂ is an integer greater than 0; the fourth determination submodule is configured to determine the ratio of the sum of m₁ and m₂ to M as the precision rate for the first type of index data.

According to an embodiment of the present disclosure, the second index threshold corresponding to the second type of index data is obtained by means of the following modules: a second obtaining module, configured to obtain N historical web pages, where N is an integer greater than 0; a second setting module, configured to set Q initial thresholds for the second type of index data, where Q is an integer greater than 0; a sixth determination module, configured to determine the precision rate for the second type of index data based on the N historical web pages and the Q initial thresholds; and a seventh determination module, configured to determine the second index threshold based on the precision rate for the second type of index data and the Q initial thresholds.

According to an embodiment of the present disclosure, the N historical web pages include N₁ risk history web pages and N₂ security history web pages, where N₁ and N₂ are both integers greater than 0; the sixth determination module includes: a third comparison submodule, a fourth comparison submodule, and a fifth determination submodule. The third comparison submodule is configured to compare the second type of index data of each of the N₁ risk history web pages with each of the Q initial thresholds, so as to determine n₁ risk history web pages from among the N₁ risk history web pages, where n₁ is an integer greater than 0; the fourth comparison submodule is configured to compare the second type of index data of each of the N₂ security history web pages with each of the Q initial thresholds, so as to determine n₂ security history web pages from among the N₂ security history web pages, where n₂ is an integer greater than 0; the fifth determination submodule is configured to determine the ratio of the sum of n₁ and n₂ to N as the precision rate for the second type of index data.

According to an embodiment of the present disclosure, the first determining module 510 includes at least one of: a sixth determination submodule, a seventh determination submodule, and an eighth determination submodule. A sixth determining sub-module, configured to determine first-class feature data based on operation data for the target web page, where the first-class feature data represents search information associated with the target web page; the seventh determining submodule is used for determining second type of feature data based on the webpage data of the target webpage, wherein the second type of feature data represents attribute information of the target webpage; and the eighth determining submodule is used for determining the third type of feature data based on the operation data aiming at the target webpage, wherein the third type of feature data represents the interaction information associated with the target webpage.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. The electronic device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 601 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the web page risk identification method. For example, in some embodiments, the web page risk identification method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the web page risk identification method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the web page risk identification method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable web page risk identification device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A webpage risk identification method, comprising:

determining at least one type of feature data associated with a target web page based on at least one of web page data of the target web page and operational data for the target web page;

determining a first type of index data and a second type of index data based on the at least one type of feature data; and

adjusting, by using the first type of index data, initial risk data obtained from the second type of index data to obtain target risk data, wherein the target risk data represents a risk condition of the target web page.

2. The method of claim 1, wherein the adjusting, by using the first type of index data, initial risk data obtained from the second type of index data to obtain target risk data comprises:

determining a first weight based on the first type of indicator data and a first indicator threshold corresponding to the first type of indicator data;

determining the initial risk data based on the second type of indicator data and a second indicator threshold corresponding to the second type of indicator data; and

adjusting the initial risk data by using the first weight to obtain the target risk data.

3. The method of claim 2, wherein the second type of metric data comprises a plurality of metric data; the determining the initial risk data based on the second type of indicator data and a second indicator threshold corresponding to the second type of indicator data comprises:

determining target index data from the plurality of index data based on the plurality of index data and a plurality of second index threshold values in one-to-one correspondence with the plurality of index data; and

determining the initial risk data based on the target index data and a precision rate for the target index data.

4. The method of claim 3, wherein the determining the initial risk data based on the target metric data and a precision rate for the target metric data comprises:

determining a second weight for the target index data based on a precision rate for the target index data; and

determining the initial risk data based on the precision rate for the target index data and the second weight.

5. The method according to any one of claims 2-4, wherein the first metric threshold corresponding to the first type of metric data is obtained by:

obtaining M historical web pages, wherein the M historical web pages comprise M₁ risk history web pages, and M and M₁ are both integers greater than 0;

setting P initial threshold values aiming at the first type index data, wherein P is an integer larger than 0;

determining a recall rate for the first type of index data based on the M₁ risk history web pages and the P initial thresholds;

determining a precision rate for the first type of index data based on the M historical webpages and the P initial thresholds; and

determining the first metric threshold based on a recall rate for the first type of metric data, a precision rate for the first type of metric data, and the P initial thresholds.

6. The method of claim 5, wherein the determining a recall rate for the first type of index data based on the M₁ risk history web pages and the P initial thresholds comprises:

comparing the first type of index data of each of the M₁ risk history web pages with each of the P initial thresholds, so as to determine m₁ risk history web pages from among the M₁ risk history web pages, m₁ being an integer greater than 0; and

determining the ratio of the m₁ to the M₁ as the recall rate for the first type of index data.

7. The method of claim 6, wherein the M historical web pages further comprise M₂ security history web pages, M₂ being an integer greater than 0; the determining a precision rate for the first type of index data based on the M historical web pages and the P initial thresholds comprises:

comparing the first type of index data of each of the M₂ security history web pages with each of the P initial thresholds, so as to determine m₂ security history web pages from among the M₂ security history web pages, m₂ being an integer greater than 0; and

determining the ratio of the sum of the m₁ and the m₂ to the M as the precision rate for the first type of index data.

8. The method according to any one of claims 2-4, wherein the second indicator threshold corresponding to the second type of indicator data is obtained by:

acquiring N historical webpages, wherein N is an integer larger than 0;

setting Q initial threshold values aiming at the second type index data, wherein Q is an integer larger than 0;

determining a precision rate for the second type of index data based on the N historical web pages and the Q initial thresholds; and

determining the second index threshold based on the precision rate for the second type of index data and the Q initial thresholds.

9. The method of claim 8, wherein the N historical web pages comprise N₁ risk history web pages and N₂ security history web pages, N₁ and N₂ both being integers greater than 0; the determining a precision rate for the second type of index data based on the N historical web pages and the Q initial thresholds comprises:

will be directed to said N₁Comparing the second type of index data of each risk history webpage with each of the Q initial thresholds to obtain the index data of each risk history webpage from the N₁Determining n in individual risk history web page₁Individual risk history web page, n₁Is an integer greater than 0;

will be directed to said N₂Comparing the second type index data of each safety history webpage with each initial threshold value in the Q initial threshold values to obtain the index data from the N₂Determining n in security history web page₂A security history page, n₂Is an integer greater than 0; and

n is to be₁And said n₂And the ratio of the sum to the N is determined as the precision rate of the second type of index data.
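
By way of non-limiting illustration, the derivation of a second index threshold from the Q initial thresholds may be sketched in Python as follows. The claims state only that the threshold is determined based on the precision rate and the Q initial thresholds; the sketch assumes that the candidate with the highest rate (n₁ + n₂) / N is selected, that N equals N₁ + N₂, and that pages are counted using the comparison rules given in the comments. All names are hypothetical.

```python
from typing import Sequence, Tuple


def select_second_threshold(risk_values: Sequence[float],
                            safe_values: Sequence[float],
                            thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, rate) for the candidate with the highest rate."""
    n_total = len(risk_values) + len(safe_values)  # assumed N = N1 + N2
    if n_total == 0 or len(thresholds) == 0:
        raise ValueError("N and Q must both be greater than 0")
    best_threshold, best_rate = thresholds[0], -1.0
    for t in thresholds:
        # n1: risk pages at or above the threshold (assumed rule)
        n1 = sum(1 for v in risk_values if v >= t)
        # n2: security pages below the threshold (assumed rule)
        n2 = sum(1 for v in safe_values if v < t)
        rate = (n1 + n2) / n_total
        # Assumed selection rule: keep the first threshold with the top rate
        if rate > best_rate:
            best_threshold, best_rate = t, rate
    return best_threshold, best_rate
```

Where several candidates tie, this sketch keeps the earliest one; other tie-breaking rules would be equally consistent with the claim language.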

10. The method of any of claims 1-9, wherein determining at least one type of feature data associated with the target web page based on at least one of web page data of the target web page and operational data for the target web page comprises at least one of:

determining first type feature data based on the operation data for the target webpage, wherein the first type feature data characterizes search information associated with the target webpage;

determining second type feature data based on the webpage data of the target webpage, wherein the second type feature data represent attribute information of the target webpage; and

determining a third type of feature data based on the operation data for the target webpage, wherein the third type of feature data characterizes interaction information associated with the target webpage.

11. A web page risk identification device, comprising:

a first determination module, configured to determine at least one type of feature data associated with a target web page based on at least one of web page data of the target web page and operation data for the target web page;

a second determination module, configured to determine first type index data and second type index data based on the at least one type of feature data; and

an adjusting module, configured to adjust, by using the first type of index data, initial risk data obtained based on the second type of index data, to obtain target risk data, wherein the target risk data represents the risk condition of the target webpage.

12. The apparatus of claim 11, wherein the adjustment module comprises:

a first determining submodule, configured to determine a first weight based on the first type of index data and a first index threshold corresponding to the first type of index data;

a second determining submodule, configured to determine the initial risk data based on the second type of index data and a second index threshold corresponding to the second type of index data; and

an adjusting submodule, configured to adjust the initial risk data by using the first weight to obtain the target risk data.
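
By way of non-limiting illustration, the cooperation of the three submodules may be sketched in Python as follows. The claim does not specify how the first weight is derived or how it is applied. The sketch assumes a full weight of 1.0 when the first type index data reaches the first index threshold and a hypothetical damping weight otherwise, an initial risk equal to the second type index data when it reaches the second index threshold and 0.0 otherwise, and a multiplicative adjustment. All names and the default damping value are hypothetical.

```python
def first_weight(first_index: float, first_threshold: float,
                 low_weight: float = 0.5) -> float:
    """Assumed rule: full weight at or above the threshold, else damped."""
    return 1.0 if first_index >= first_threshold else low_weight


def initial_risk(second_index: float, second_threshold: float) -> float:
    """Assumed rule: the index contributes only when it reaches its threshold."""
    return second_index if second_index >= second_threshold else 0.0


def target_risk(first_index: float, first_threshold: float,
                second_index: float, second_threshold: float) -> float:
    """Adjust the initial risk data by the first weight (assumed: multiply)."""
    weight = first_weight(first_index, first_threshold)
    return weight * initial_risk(second_index, second_threshold)
```

Under these assumptions, a page whose first type index data falls below its threshold has its initial risk data scaled down rather than discarded.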

13. The apparatus of claim 12, wherein the second type of index data comprises a plurality of index data; the second determining submodule comprises:

a first determination unit configured to determine target index data from the plurality of index data based on the plurality of index data and a plurality of second index thresholds in one-to-one correspondence with the plurality of index data; and

a second determining unit configured to determine the initial risk data based on the target index data and a precision rate for the target index data.

14. The apparatus of claim 13, wherein the second determining unit comprises:

a first determining subunit configured to determine a second weight for the target index data based on a precision rate for the target index data; and

a second determining subunit, configured to determine the initial risk data based on the second weight and the precision rate for the target index data.
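
By way of non-limiting illustration, the selection of target index data and the weighting by precision rate may be sketched in Python as follows. The claims do not define the selection rule or the form of the second weight. The sketch assumes that an index datum becomes target index data when it reaches its own second index threshold, that the second weight is the precision rate normalised over the selected targets, and that the initial risk data is the weighted sum of the selected index data. All names are hypothetical.

```python
from typing import Dict


def initial_risk_from_indices(indices: Dict[str, float],
                              thresholds: Dict[str, float],
                              precisions: Dict[str, float]) -> float:
    """Combine the target index data into one initial risk value."""
    # Assumed selection rule: keep indices that reach their own threshold
    targets = {k: v for k, v in indices.items() if v >= thresholds[k]}
    total_precision = sum(precisions[k] for k in targets)
    if not targets or total_precision == 0:
        return 0.0
    # Assumed second weight: precision rate normalised over the targets
    return sum(precisions[k] / total_precision * v for k, v in targets.items())
```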

15. The apparatus of any of claims 12-14, wherein the first index threshold corresponding to the first type of index data is obtained by:

a first obtaining module, configured to obtain M historical web pages, where the M historical web pages include M₁ risk history web pages, and M and M₁ are both integers greater than 0;

a first setting module, configured to set P initial thresholds for the first type of index data, where P is an integer greater than 0;

a third determining module, configured to determine a recall rate for the first type of index data based on the M₁ risk history web pages and the P initial thresholds;

a fourth determining module, configured to determine a precision rate for the first type of index data based on the M historical webpages and the P initial thresholds; and

a fifth determining module, configured to determine the first index threshold based on the recall rate for the first type of index data, the precision rate for the first type of index data, and the P initial thresholds.
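
By way of non-limiting illustration, the operation of the fifth determining module may be sketched in Python as follows. The claim states only that the first index threshold is determined from the recall rate, the precision rate, and the P initial thresholds. The sketch assumes that the two rates are combined by their harmonic mean (an F1-style score) and that the candidate with the highest score is selected; this combination rule and all names are hypothetical.

```python
from typing import Sequence


def select_first_threshold(thresholds: Sequence[float],
                           recalls: Sequence[float],
                           precisions: Sequence[float]) -> float:
    """Pick one of the P initial thresholds from per-threshold rates."""
    if len(thresholds) == 0 or not (
            len(thresholds) == len(recalls) == len(precisions)):
        raise ValueError("expected P > 0 thresholds with one rate pair each")

    def harmonic_mean(r: float, p: float) -> float:
        # Assumed combination rule; returns 0.0 when both rates are zero
        return 0.0 if r + p == 0 else 2 * r * p / (r + p)

    scores = [harmonic_mean(r, p) for r, p in zip(recalls, precisions)]
    best = max(range(len(thresholds)), key=scores.__getitem__)
    return thresholds[best]
```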

16. The apparatus of claim 15, wherein the third determining module comprises:

a first comparison submodule, configured to compare the first type index data of each of the M₁ risk history web pages with each of the P initial thresholds, to determine m₁ risk history web pages from the M₁ risk history web pages, wherein m₁ is an integer greater than 0; and

a third determining submodule, configured to determine the ratio of m₁ to M₁ as the recall rate for the first type of index data.

17. The apparatus of claim 16, wherein the M historical web pages further comprise M₂ security history web pages, M₂ being an integer greater than 0; the fourth determining module comprises:

a second comparison submodule, configured to compare the first type index data of each of the M₂ security history web pages with each of the P initial thresholds, to determine m₂ security history web pages from the M₂ security history web pages, wherein m₂ is an integer greater than 0; and

a fourth determining submodule, configured to determine the ratio of the sum of m₁ and m₂ to M as the precision rate for the first type of index data.

18. The apparatus according to any one of claims 12-14, wherein the second index threshold corresponding to the second type of index data is obtained by:

a second acquisition module, configured to acquire N historical web pages, where N is an integer greater than 0;

a second setting module, configured to set Q initial thresholds for the second type of index data, where Q is an integer greater than 0;

a sixth determining module, configured to determine a precision rate for the second type of index data based on the N historical web pages and the Q initial thresholds; and

a seventh determining module, configured to determine the second index threshold based on the precision rate for the second type of index data and the Q initial thresholds.

19. The apparatus of claim 18, wherein the N historical web pages comprise N₁ risk history web pages and N₂ security history web pages, N₁ and N₂ both being integers greater than 0; the sixth determining module comprises:

a third comparison submodule, configured to compare the second type index data of each of the N₁ risk history web pages with each of the Q initial thresholds, to determine n₁ risk history web pages from the N₁ risk history web pages, wherein n₁ is an integer greater than 0;

a fourth comparison submodule, configured to compare the second type index data of each of the N₂ security history web pages with each of the Q initial thresholds, to determine n₂ security history web pages from the N₂ security history web pages, wherein n₂ is an integer greater than 0; and

a fifth determining submodule, configured to determine the ratio of the sum of n₁ and n₂ to N as the precision rate for the second type of index data.

20. The apparatus of any of claims 11-19, wherein the first determination module comprises at least one of:

a sixth determining sub-module, configured to determine first class feature data based on operation data for the target web page, where the first class feature data characterizes search information associated with the target web page;

a seventh determining sub-module, configured to determine second type feature data based on the web page data of the target web page, where the second type feature data represents attribute information of the target web page; and

an eighth determining sub-module, configured to determine a third type of feature data based on the operation data for the target web page, where the third type of feature data characterizes interaction information associated with the target web page.

21. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.

22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10.

23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.