WO2015165245A1

WO2015165245A1 - Webpage data processing method and device

Info

Publication number: WO2015165245A1
Application number: PCT/CN2014/090841
Authority: WO
Inventors: 王晓振; 田文
Original assignee: 广州市动景计算机科技有限公司; 优视科技有限公司
Priority date: 2014-04-30
Filing date: 2014-11-11
Publication date: 2015-11-05

Abstract

A webpage data processing method and device, the method comprising: acquiring a webpage to be tested; matching the webpage to be tested with a preset matching condition to obtain a matching result, the matching condition comprising keywords of an advertisement filtering rule and the advertisement filtering rule corresponding to the keywords, or the matching condition comprising an area with a first identifier preset in a preset webpage corresponding to the webpage address of the webpage to be tested; and determining the filtering condition of the webpage to be tested according to the matching result. Therefore, compared with a manual detection method, the method and device of the present invention quickly and timely detect the filtering problem of the webpage, thus improving detection efficiency, and being particularly suitable when there is a great number of webpages to be tested.

Description

Webpage data processing method and device

Technical field

The present invention relates to the field of mobile communication technologies, and in particular, to a webpage data processing method and apparatus.

Background technique

Website operators usually place data from certain businesses, such as advertisements, on their webpages to obtain sponsorship from those merchants, thereby ensuring the normal operation and profitability of the website. For the user, however, such embedded data is non-valid content, and its presence causes considerable inconvenience. For example, when browsing a new webpage, the user first has to distinguish non-valid content such as advertisements from valid content; or the advertisement occludes the valid content in the corresponding webpage area, making it difficult for the user to obtain that content. To provide users with a clean network environment, most browsers have a filtering function that filters out non-valid content embedded in webpages, for example advertisements. The filtering principle is generally as follows: a filtering rule is formulated according to features of the webpage to be filtered, such as its layout style and frame code; the rule identifies non-valid content (such as an advertisement) in the webpage, and the browser either blocks the loading of that content or hides it in the page so that it is not displayed.

However, in practice, the layout style of a webpage changes as the website is updated, or the website maintainer deliberately changes the layout style or frame code of the webpage to prevent the embedded data from being filtered. As a result, the preset filtering rules no longer apply to the updated webpage, which causes filtering problems such as filtering failure and erroneous filtering of valid content. It is therefore necessary to discover such filtering problems in time in order to optimize the filtering method and improve filtering accuracy.

In general, manual detection is used to determine whether a webpage has a filtering problem, which ensures the accuracy of the detection results. However, because the number of websites is huge and each website may be updated ten or more times a day, manual detection cannot guarantee that every filtering problem is detected in time, and its detection efficiency is extremely low.

In addition, adblock is a widely used advertisement filtering plug-in for web browsers. Its basic principle is to set a series of filtering rules: before the browser sends a request for a web resource, it checks whether the resource's Uniform Resource Locator (URL) hits a filtering rule. If a rule is hit, the requested resource is determined to be an advertisement and the browser does not need to request it.

To achieve better filtering results, more filtering rules usually have to be set; for example, adblock provides more than 20,000 filtering rules. The current browser advertisement filtering method is as follows: when a user enters a URL in the browser, the URL is matched against the filtering rules one by one; if a filtering rule matches, true is returned (indicating that advertisement filtering is required), otherwise false is returned (indicating that no advertisement filtering is required). Because a large number of advertisement filtering rules are set in the browser, every network request is matched against them one by one, so the performance overhead of advertisement filtering is large and each filtering operation takes a long time.

Summary of the invention

The embodiments of the invention provide a webpage data processing method and device that solve the problem that webpage filtering problems are detected neither in time nor efficiently, so that filtering problems are detected quickly and effectively.

To achieve the above object, according to one aspect of the present invention, an advertisement filtering method is provided. The browser advertisement filtering method according to the present invention includes: acquiring a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes an area provided with a first identifier preset in a preset webpage corresponding to the webpage address of the webpage to be tested; and determining the filtering condition of the webpage to be tested according to the matching result.

Further, the method further includes: acquiring the preset webpage corresponding to the webpage address of the webpage to be tested; and, before the webpage to be tested is matched against the preset matching condition to obtain the matching result, setting a first identifier in each area where actual content exists in the preset webpage and in the webpage to be tested. Matching the webpage to be tested against the preset matching condition to obtain the matching result includes: determining whether the areas provided with the first identifier in the preset webpage match those in the webpage to be tested. Determining the filtering condition of the webpage to be tested according to the matching result includes: if the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, determining that the webpage to be tested has no filtering problem; otherwise, determining that the webpage to be tested has a filtering problem.

Further, acquiring the webpage to be tested includes: acquiring the uniform resource locator of the webpage to be tested. Matching the webpage to be tested against the preset matching condition to obtain the matching result includes: matching the uniform resource locator against the keywords of the advertisement filtering rules; and, if the uniform resource locator matches a keyword, matching the uniform resource locator against the advertisement filtering rule corresponding to that keyword. Determining the filtering condition of the webpage to be tested according to the matching result includes: if the uniform resource locator matches the advertisement filtering rule corresponding to the keyword, performing advertisement filtering using that advertisement filtering rule.
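The two-stage matching described above (keyword first, then only the rules corresponding to that keyword) can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the rule table `RULES_BY_KEYWORD`, its example patterns, the use of regular expressions as rules, and the function name `find_matching_rule` are all assumptions made for the example.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rule table: each keyword maps to the filtering rules
# (here, regular expressions) that contain it. Not taken from adblock.
RULES_BY_KEYWORD: Dict[str, List[str]] = {
    "adserver": [r"^https?://adserver\.example\.com/"],
    "banner": [r"/banner/\d+\.gif$", r"/banner\.js$"],
}


def find_matching_rule(
    url: str, rules_by_keyword: Dict[str, List[str]]
) -> Optional[str]:
    """Return the first filtering rule that the URL hits, or None."""
    for keyword, rules in rules_by_keyword.items():
        # Stage 1: cheap keyword test; skip every rule of an absent keyword.
        if keyword not in url:
            continue
        # Stage 2: only now try the rules that correspond to the keyword.
        for rule in rules:
            if re.search(rule, url):
                return rule
    return None
```

A URL that contains none of the keywords is rejected without evaluating any rule, which is where the saving over one-by-one rule matching would come from in this sketch.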

To achieve the above object, according to another aspect of the present invention, a webpage data processing apparatus is provided. The webpage data processing apparatus according to the present invention includes a processor configured to execute the following program modules: a webpage obtaining unit, configured to acquire a webpage to be tested; a webpage matching unit, configured to match the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes an area provided with a first identifier preset in a preset webpage corresponding to the webpage address of the webpage to be tested; and a result determining unit, configured to determine the filtering condition of the webpage to be tested according to the matching result.

Further, the webpage obtaining unit is further configured to acquire the preset webpage corresponding to the webpage address of the webpage to be tested. The apparatus further includes a webpage marking unit, configured to set the first identifier in each area where actual content exists in the preset webpage and in the webpage to be tested. The webpage matching unit is further configured to determine whether the areas provided with the first identifier in the preset webpage match those in the webpage to be tested. The result determining unit is further configured to determine that the webpage to be tested has no filtering problem when those areas match, and otherwise to determine that the webpage to be tested has a filtering problem.

Further, the webpage obtaining unit includes a first acquiring unit, configured to acquire the uniform resource locator of the webpage to be tested. The webpage matching unit includes: a first matching unit, configured to match the uniform resource locator against the keywords of the advertisement filtering rules; and a second matching unit, configured to match the uniform resource locator against the advertisement filtering rule corresponding to a keyword when the uniform resource locator matches that keyword. The result determining unit includes a filtering unit, configured to perform advertisement filtering using the advertisement filtering rule when the uniform resource locator matches the advertisement filtering rule corresponding to the keyword.

To achieve the above object, according to another aspect of the present invention, a computer-readable medium is provided that has processor-executable program code for use in a webpage data processing apparatus, the program code causing the processor to perform the following steps: acquiring a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes an area provided with a first identifier preset in a preset webpage corresponding to the webpage address of the webpage to be tested; and determining the filtering condition of the webpage to be tested according to the matching result.

As can be seen from the above technical solution, the embodiments of the present invention obtain a matching result by matching the webpage to be tested against a preset matching condition, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes an area provided with a first identifier preset in a preset webpage corresponding to the webpage address of the webpage to be tested; the filtering condition of the webpage to be tested is then determined according to the matching result. Therefore, compared with manual detection, the embodiments detect webpage filtering problems (such as filtering failure and erroneous filtering) quickly and in time, improve detection efficiency, and are particularly suitable where the number of webpages to be tested is huge.

DRAWINGS

The accompanying drawings are briefly described below. In the drawings:

FIG. 1 is a schematic flowchart of a webpage data processing method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for implementing step S13 in FIG. 1 according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for determining a type of filtering problem based on the method shown in FIG. 2 according to an embodiment of the present invention;

FIG. 4(a) is a schematic diagram of a preset webpage processed by the embodiment of the present invention;

FIG. 4(b) is a schematic diagram of a webpage to be tested processed by the embodiment of the present invention;

FIG. 4(c) is a schematic diagram of another webpage to be tested processed by the embodiment of the present invention;

FIG. 4(d) is a schematic diagram of another webpage to be tested processed by the embodiment of the present invention;

FIG. 4(e) is a schematic diagram of another webpage to be tested processed by the embodiment of the present invention;

FIG. 5 is a schematic flowchart diagram of another webpage data processing method according to an embodiment of the present invention;

FIG. 6(a) is a schematic diagram of a webpage not processed by an embodiment of the present invention;

FIG. 6(b) is a schematic diagram of the webpage shown in FIG. 6(a) after step S22 shown in FIG. 5 is performed;

FIG. 6(c) is a schematic diagram of the webpage shown in FIG. 6(b) after further processing of its actual content;

FIG. 7 is a flowchart of a method for implementing step S23 in FIG. 5 according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of preset comparison points in the embodiment shown in FIG. 7;

FIG. 9 is a flowchart of another method for implementing step S23 in FIG. 5 according to an embodiment of the present invention;

FIG. 10 is a flowchart of a method for implementing steps S341-S342 of FIG. 9 based on webpage interlaced scanning according to an embodiment of the present invention;

FIG. 11 is a schematic flowchart diagram of another webpage data processing method according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of a webpage with a border as a first identifier according to an embodiment of the present invention;

FIG. 13 is a flowchart of a method for implementing step S33 in FIG. 11 according to an embodiment of the present invention;

FIG. 14 is a schematic flowchart diagram of another webpage data processing method according to an embodiment of the present invention;

FIG. 15 is a schematic diagram of a partitioning result of a preset webpage and a webpage to be tested according to an embodiment of the present invention;

FIG. 16 is a schematic structural diagram of a webpage data processing apparatus according to an embodiment of the present invention;

FIG. 17 is a schematic structural diagram of another webpage data processing apparatus according to an embodiment of the present invention;

FIG. 18 is a schematic diagram of a webpage data processing apparatus according to a first embodiment of the present invention;

FIG. 19 is a schematic diagram of a webpage data processing apparatus according to a second embodiment of the present invention;

FIG. 20 is a schematic diagram of a webpage data processing apparatus according to a third embodiment of the present invention;

FIG. 21 is a flowchart of a webpage data processing method according to a first embodiment of the present invention;

FIG. 22 is a flowchart of a webpage data processing method according to a second embodiment of the present invention;

FIG. 23 is a flowchart of a preferred webpage data processing method according to an embodiment of the present invention.

Detailed description

It should be noted that the embodiments in the present invention and the features in the embodiments may be combined with each other without conflict. The invention will be described in detail below with reference to the drawings in conjunction with the embodiments.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the scope of the present invention.

It is to be understood that the terms "first", "second" and the like in the specification and claims of the present invention are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It will be understood that data so used may be interchanged where appropriate to facilitate the embodiments of the invention described herein. In addition, the terms "comprise" and "include", and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units, but may include other steps or units not explicitly listed or inherent to such a process, method, product, or device.

The embodiment of the invention provides a webpage data processing method and device, which solves the problem that the detection of the webpage filtering problem is not timely and the efficiency is low.

To make the above objects, features, and advantages of the embodiments of the present invention more apparent and easier to understand, the embodiments are described in further detail below.

FIG. 1 is a flowchart of a method for processing webpage data according to an embodiment of the present invention. Referring to FIG. 1, a webpage data processing method provided by an embodiment of the present invention includes the following steps:

S11: Obtain a webpage to be tested, and a preset webpage corresponding to the webpage address of the webpage to be tested;

The preset webpage and the webpage to be tested are the webpages corresponding to the same webpage address at different times. The preset webpage may be the webpage corresponding to that address at a certain historical moment, namely a webpage under normal filtering, with no false filtering or filtering failure.

S12: setting a first identifier in an area where the actual content exists in the preset webpage and the webpage to be tested respectively;

The above actual content includes both valid content and non-valid content such as advertisements. The areas provided with the first identifier in the preset webpage constitute one form of the matching condition, and matching the webpage to be tested against this condition is performed by the determination in step S13 below. Optionally, the matching condition may instead include a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, which will be described later.

S13: determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested, if yes, step S14 is performed, otherwise step S15 is performed;

S14: determining that the webpage to be tested does not have a filtering problem;

S15: Determine that the webpage to be tested has a filtering problem.

According to the foregoing steps, the embodiment of the present invention obtains the preset webpage and the webpage to be tested corresponding to the same webpage address, sets the first identifier in each area where actual content exists in the two webpages, determines whether the areas provided with the first identifier in the webpage to be tested match those in the preset webpage, and determines from the result whether the webpage to be tested has a filtering problem. By setting a corresponding preset webpage for each webpage address, filtering problems can be detected automatically for webpages of multiple websites and multiple webpage addresses. After the layout style and/or frame code of the webpage at a given address changes, it is only necessary to replace the preset webpage for that address for automatic detection to remain accurate. Therefore, compared with manual detection, this embodiment detects webpage filtering problems (such as false filtering or filtering failure) quickly and in time, improves detection efficiency, and is particularly suitable where the number of webpages to be tested is huge.

In a possible embodiment of the present invention, the preset webpage and the webpage to be tested processed in step S12 may each be stored in picture format, and the determination described in step S13 is then performed on the resulting pictures of the preset webpage and the webpage to be tested.

In another possible embodiment of the present invention, the preset webpage and the webpage to be tested may not be imaged, but the determining step described in S13 may be implemented directly according to the result processed through step S12.

In the embodiments of the present invention, the areas provided with the first identifier in the webpage to be tested matching those in the preset webpage means the following: if the first identifier exists in a certain area of the preset webpage, the corresponding area of the webpage to be tested should also have the first identifier; and if the first identifier does not exist in a certain area of the preset webpage, the corresponding area of the webpage to be tested should also have no first identifier.

In actual applications, there are various ways to determine whether the areas provided with the first identifier in the webpage to be tested match those in the preset webpage; FIG. 2 illustrates one feasible implementation.

Referring to FIG. 2, in a webpage data processing method according to a possible embodiment of the present invention, determining whether an area in which a first identifier is set in a webpage to be tested matches an area in which a first identifier is set in a preset webpage includes the following steps:

S331. Calculate, respectively, a first total area of the area where the first identifier is set in the preset webpage, and a second total area of the area where the first identifier is set in the webpage to be tested.

S332. Calculate a third ratio between the first total area and the second total area.

S333, determining whether the third ratio is within a preset range, if yes, executing step S334, otherwise performing step S335;

S334: Determine that the preset webpage matches an area in the webpage to be tested that is provided with the first identifier.

S335. Determine that the preset webpage does not match an area where the first identifier is set in the webpage to be tested.

Strictly speaking, when the areas provided with the first identifier in the webpage to be tested and in the preset webpage match completely, the first total area equals the second total area, that is, the third ratio is 1; in other words, the preset range would reduce to a single threshold whose value is 1. However, to allow for calculation errors, or to avoid the workload of frequently modifying the filtering rules, the preset webpage may be considered to match the webpage to be tested in the areas provided with the first identifier as long as the third ratio lies within a preset range around 1. The maximum and minimum of the preset range may be determined according to actual detection requirements: the higher the required detection accuracy, the larger the minimum and the smaller the maximum. For example, where the accuracy requirement is not high, the preset range may be set to [0.75, 1.35]; where it is high, the preset range may be set to [0.95, 1.05]. Of course, these specific values are only one possible implementation based on the principles of the present invention and should not be construed as limiting its scope.
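Steps S331 to S335 can be sketched as follows, assuming the two total areas have already been computed. The function names `third_ratio` and `regions_match` are hypothetical, and the default range is simply one of the example ranges given above.

```python
from typing import Tuple


def third_ratio(first_total_area: float, second_total_area: float) -> float:
    """Ratio of the preset page's marked area to the tested page's marked area."""
    if second_total_area <= 0:
        raise ValueError("second total area must be positive")
    return first_total_area / second_total_area


def regions_match(
    first_total_area: float,
    second_total_area: float,
    preset_range: Tuple[float, float] = (0.95, 1.05),
) -> bool:
    """True when the third ratio lies inside the closed preset range (S333)."""
    low, high = preset_range
    return low <= third_ratio(first_total_area, second_total_area) <= high
```

With a first total area of 5.5 and a second total area of 6, the ratio is about 0.92, so the pages match under [0.75, 1.35] but not under the stricter [0.95, 1.05].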

In another possible embodiment of the present invention, when the embodiment shown in FIG. 2 determines that the preset webpage does not match the webpage to be tested in the areas provided with the first identifier, that is, the webpage to be tested has a filtering problem, the steps shown in FIG. 3 may further be performed to determine the specific type of the filtering problem:

S631, determining whether the third ratio is less than the minimum value of the preset range, if yes, proceeding to step S632, otherwise performing step S633;

S632. Determine that the webpage to be tested has a filtering failure.

S633: Determine whether the third ratio is greater than a maximum value of the preset range, and if yes, determine that the webpage to be tested has error filtering.

In the two example preset ranges given in the above embodiment, [0.75, 1.35] and [0.95, 1.05], the maximum and the minimum of each range lie at equal or similar distances from 1. Alternatively, the maximum and the minimum of the preset range may be set separately according to the different detection accuracy required for the two types of filtering problem. For example, if high accuracy is required for detecting filtering failure and lower accuracy is required for detecting false filtering, a larger minimum and a larger maximum may be set, for example [0.95, 1.35].

Embodiments of the present invention shown in Figs. 2 and 3 will be described below with reference to Figs. 4(a) to 4(e).

FIG. 4(a) is a schematic diagram of a preset webpage processed by step S12, in which four areas are provided with the first identifier; for ease of description they are labeled A1, B1, C1, and D1 in FIG. 4(a). The areas of A1, B1, C1, and D1 are 2, 1, 1, and 1.5, respectively, so the total area of the areas provided with the first identifier in the preset webpage, that is, the first total area, is S1 = A1 + B1 + C1 + D1 = 5.5.

Scenario 1: If the webpage to be tested processed by step S12 is as shown in FIG. 4(b), four areas in the webpage to be tested are also provided with the first identifier, labeled A2, B2, C2, and D2, and A1 and A2, B1 and B2, C1 and C2, and D1 and D2 match, respectively. The areas of A2, B2, C2, and D2 are 2, 1, 1, and 1.5, respectively, so the total area of the areas provided with the first identifier in the webpage to be tested shown in FIG. 4(b), that is, the second total area, is S2 = A2 + B2 + C2 + D2 = 5.5. The third ratio is then S1/S2 = 1, so in the case shown in FIG. 4(b) the third ratio is within the preset range and it can be determined that the webpage to be tested has no filtering problem, which is consistent with the result of directly comparing FIG. 4(a) and FIG. 4(b).

Scenario 2: If the webpage to be tested processed by step S12 is as shown in FIG. 4(c), only three areas in the webpage to be tested are provided with the first identifier, labeled A3, B3, and C3. The areas of A3, B3, and C3 are 2, 1, and 1, respectively, so the total area of the areas provided with the first identifier in the webpage to be tested shown in FIG. 4(c), that is, the second total area in step S331, is S3 = A3 + B3 + C3 = 4. The third ratio is then S1/S3 = 1.375. If the preset range is set to [0.75, 1.35], the third ratio in the case shown in FIG. 4(c) is not within the preset range, and it is determined that the webpage has a filtering problem. Further, since 1.375 > 1.35, that is, the third ratio is greater than the maximum of the preset range, it can be determined that the webpage to be tested shown in FIG. 4(c) has false filtering, which is consistent with the result of directly comparing FIG. 4(a) and FIG. 4(c).

Scenario 3: If the webpage to be tested processed by step S12 is as shown in FIG. 4(d), four areas in the webpage to be tested are provided with the first identifier, labeled A4, B4, C4, and D4. The areas of A4, B4, C4, and D4 are 2, 1, 1, and 2, respectively, so the total area of the areas provided with the first identifier in the webpage to be tested shown in FIG. 4(d), that is, the second total area, is S4 = A4 + B4 + C4 + D4 = 6. The third ratio is then S1/S4 ≈ 0.92. If the preset range is set to [0.75, 1.35], the third ratio in the case shown in FIG. 4(d) is within the preset range, and it can be determined that the webpage to be tested has no filtering problem. In this case the third ratio is not 1, that is, the preset webpage of FIG. 4(a) does not completely match the webpage to be tested of FIG. 4(d); but because the difference is small, where the detection accuracy requirement is not high the webpage to be tested in FIG. 4(d) can be considered to have no filtering problem.

Scenario 4: If the webpage to be tested processed by step S12 is as shown in FIG. 4(e), four areas in the webpage to be tested are also provided with the first identifier, labeled A5, B5, C5, and D5. The areas of A5, B5, C5, and D5 are 2, 1, 1, and 4, respectively, so the total area of the areas provided with the first identifier in the webpage to be tested shown in FIG. 4(e), that is, the second total area, is S5 = A5 + B5 + C5 + D5 = 8. The third ratio is then S1/S5 ≈ 0.69. If the preset range is set to [0.75, 1.35], the third ratio in the case shown in FIG. 4(e) is not within the preset range, and it is determined that the webpage has a filtering problem. Further, since 0.69 < 0.75, that is, the third ratio is smaller than the minimum of the preset range, it can be determined that the webpage to be tested shown in FIG. 4(e) has filtering failure, which is consistent with the result of directly comparing FIG. 4(a) and FIG. 4(e).
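The four scenarios can be reproduced with a short sketch of the classification in steps S631 to S633. The function name `classify`, the result strings, and the dictionary layout are assumptions made for illustration; the area values and the range [0.75, 1.35] are those stated in the scenarios.

```python
from typing import Dict, List, Tuple


def classify(
    first_total_area: float,
    second_total_area: float,
    preset_range: Tuple[float, float] = (0.75, 1.35),
) -> str:
    """Classify the tested page from the third ratio S_preset / S_tested."""
    ratio = first_total_area / second_total_area
    low, high = preset_range
    if ratio < low:
        # The tested page shows more marked area than the preset page.
        return "filtering failure"
    if ratio > high:
        # The tested page shows less marked area than the preset page.
        return "false filtering"
    return "no filtering problem"


# Areas of the marked regions, as given for FIG. 4(a) to FIG. 4(e).
PRESET_AREAS: List[float] = [2, 1, 1, 1.5]
TESTED_AREAS: Dict[str, List[float]] = {
    "FIG. 4(b)": [2, 1, 1, 1.5],
    "FIG. 4(c)": [2, 1, 1],
    "FIG. 4(d)": [2, 1, 1, 2],
    "FIG. 4(e)": [2, 1, 1, 4],
}

results = {
    name: classify(sum(PRESET_AREAS), sum(areas))
    for name, areas in TESTED_AREAS.items()
}
```

Here `results` maps FIG. 4(b) and FIG. 4(d) to "no filtering problem", FIG. 4(c) to "false filtering", and FIG. 4(e) to "filtering failure", in line with the scenarios above.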

Optionally, in another feasible embodiment of the present invention, after the first total area and the second total area are obtained, a fourth ratio may be calculated between the area difference (the first total area minus the second total area) and the first total area (or the second total area). If the absolute value of the fourth ratio is less than a preset threshold, it is determined that the webpage to be tested has no filtering problem; otherwise, it has a filtering problem. Further, if the absolute value of the fourth ratio is not less than (that is, greater than or equal to) the preset threshold and the fourth ratio is less than zero, it is determined that the webpage to be tested has a filtering failure; if the absolute value of the fourth ratio is not less than the preset threshold and the fourth ratio is greater than zero, it is determined that the webpage to be tested has false filtering.
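A minimal sketch of this difference-based variant follows. It divides by the first total area (the text also allows the second), and the function name `classify_by_difference` and the threshold value 0.25 are assumptions chosen only for illustration.

```python
def classify_by_difference(
    first_total_area: float,
    second_total_area: float,
    threshold: float = 0.25,  # hypothetical preset threshold
) -> str:
    """Classify the tested page from the fourth ratio (S1 - S2) / S1."""
    if first_total_area <= 0:
        raise ValueError("first total area must be positive")
    fourth_ratio = (first_total_area - second_total_area) / first_total_area
    if abs(fourth_ratio) < threshold:
        return "no filtering problem"
    # |ratio| >= threshold > 0, so the sign decides the type of problem.
    if fourth_ratio < 0:
        return "filtering failure"
    return "false filtering"
```

With the areas from FIG. 4, a second total area of 4 gives a fourth ratio of about 0.27 (false filtering) and a second total area of 8 gives about -0.45 (filtering failure) under this assumed threshold.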

FIG. 5 is a flowchart of a method for processing webpage data according to another embodiment of the present invention. Referring to FIG. 5, the webpage data processing method described in this embodiment includes the following steps:

S21: Obtain a webpage to be tested, and a preset webpage corresponding to the webpage address of the webpage to be tested;

S22: Set a background color of an area where the actual content exists in the preset webpage and the webpage to be tested, respectively, as a preset color;

S23: Determine whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested; if yes, step S24 is performed, otherwise step S25 is performed;

S24: determining that the webpage to be tested does not have a filtering problem;

S25: Determine that the webpage to be tested has a filtering problem.

Corresponding to the embodiment shown in FIG. 1 , the embodiment shown in FIG. 5 uses the preset color as the first identifier, and is used to mark an area in the webpage where the actual content exists.

In another possible embodiment of the present invention, when the background color of the areas where actual content exists in the preset webpage and the webpage to be tested is set to the preset color, the actual content in the two webpages may also be processed as follows: when the actual content is text, the color of the text is also set to the preset color; when the actual content is a picture, the picture is deleted.

Superimposing two different colors yields a third color different from both, and picture content in a webpage covers the background color of the corresponding area. The above processing of the actual content therefore eliminates the influence of the text color and the picture colors on the displayed color of the webpage and ensures that the color of each area where actual content exists is the same as the background color of that area. The color of the corresponding webpage area can then be obtained directly and used to determine whether the webpage to be tested matches the preset webpage, without having to determine whether the acquired color is the background color of the corresponding area or to obtain that background color by other, more complicated methods.

For example, with black as the preset color, performing step S22 on the webpage shown in FIG. 6(a) turns the background color of the areas where actual content exists to black, giving the webpage shown in FIG. 6(b). It can be seen from FIG. 6(b) that if the color of the text in the webpage differs from the preset color (black), the actual color of the area, obtained by superimposing the text color on the background color of the corresponding area, also differs from the preset color (black). Likewise, if there is a picture in the webpage, the picture completely covers the background color of the area, so the actual color of the area can only appear as the colors of the picture, which is inconvenient for color comparison. Therefore, on the basis of the processing result shown in FIG. 6(b), the embodiment of the present invention obtains the processing result shown in FIG. 6(c) by deleting the picture content in the webpage and setting the color of the text in the webpage to the preset color (black), the same as the background color. It can be seen from FIG. 6(c) that the areas where actual content exists in the finally processed webpage are uniformly displayed as pure black blocks, which facilitates the execution of the subsequent steps.
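The marking step can be pictured with a toy model. The `Element` class, its `kind` values and the function `mark_actual_content` are hypothetical and exist only for this sketch; a real implementation would operate on the rendered page rather than on such a list.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical page region: kind is 'text', 'image', or 'none'."""
    kind: str
    background: str = "white"
    text_color: Optional[str] = None


def mark_actual_content(elements: List[Element], preset_color: str = "black") -> List[Element]:
    """Return a copy of the page in which regions with actual content are marked."""
    marked = []
    for el in elements:
        if el.kind == "text":
            # Background and text both take the preset color.
            marked.append(replace(el, background=preset_color, text_color=preset_color))
        elif el.kind == "image":
            # The picture is deleted; the region keeps the preset background.
            marked.append(Element(kind="removed_image", background=preset_color))
        else:
            # Regions without actual content are left untouched.
            marked.append(el)
    return marked


page = [Element("text", "white", "blue"), Element("image"), Element("none")]
marked_page = mark_actual_content(page)
```

After this step every region that held text or a picture is a uniform block of the preset color, which is the property the later color comparisons rely on.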

In a possible embodiment of the present invention, the method shown in FIG. 2 may be used to determine whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested. That is, the total area M1 of the areas whose background color is the preset color in the preset webpage and the total area M2 of the areas whose background color is the preset color in the webpage to be tested are calculated, and the ratio M1/M2 is computed. If M1/M2 is within the preset range, it is determined that the areas whose background color is the preset color match between the preset webpage and the webpage to be tested; otherwise it is determined that these areas do not match and that there is a filtering problem. Correspondingly, after it is determined that the webpage to be tested has a filtering problem, the type of the filtering problem (filtering failure or false filtering) may be further determined by the method shown in FIG. 3.

In another possible embodiment of the present invention, the determination in step S23 of whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested may be performed by the process shown in FIG. 7:

S311: Compare whether the color of the area corresponding to the same preset comparison point in the preset webpage and the webpage to be tested is the same;

The preset comparison point refers to a pixel in the webpage whose coordinates are preset coordinate values. For example, referring to FIG. 8, an xy coordinate system can be established with the upper left corner of the webpage as the origin, the horizontal rightward direction as the x-axis direction, and the vertical downward direction as the y-axis direction. The pixel P1 with coordinates (3, 2) can be used as one preset comparison point, and the pixel P2 with coordinates (8, 4) can be used as another. Mapping the same preset comparison point onto the preset webpage and onto the webpage to be tested yields two areas (pixels) that form a pair of corresponding areas, and step S311 compares the colors of each pair of corresponding areas. If the colors of the areas corresponding to the same preset comparison point in the preset webpage and the webpage to be tested are the same, the two areas corresponding to that preset comparison point match, that is, either both contain valid content or neither does.

In order to ensure the accuracy of the detection, the total number of preset comparison points should not be too small, and the specific values can be set according to actual application requirements.

S312: Calculate a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points;

S313: determining whether the first ratio is smaller than the first preset ratio, if the first ratio is less than the first preset ratio, step S314 is performed, otherwise step S315 is performed;

S314: Determine that the preset webpage matches an area of the webpage to be tested whose background color is the preset color.

S315: Determine that the preset webpage does not match an area in the webpage to be tested whose background color is the preset color.

The larger the first ratio, the greater the number of preset comparison points whose color comparison result is "different" and, correspondingly, the larger the unmatched area between the preset webpage and the webpage to be tested. The first preset ratio may therefore be set according to the detection accuracy requirement (the maximum allowed proportion of the entire webpage that the unmatched area between the preset webpage and the webpage to be tested may occupy). When the first ratio is not less than the first preset ratio, the unmatched area between the preset webpage and the webpage to be tested is too large, so it may be determined that the webpage to be tested has a filtering problem; conversely, it may be determined that the webpage to be tested does not have a filtering problem.
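Steps S311 to S315 can be sketched as follows. Representing each page as a dictionary from comparison-point coordinates to a color name, the function name `compare_points`, and the first preset ratio of 0.1 are all assumptions made for illustration.

```python
def compare_points(preset_colors, tested_colors, first_preset_ratio=0.1):
    """Compare two pages at the same preset comparison points.

    preset_colors / tested_colors: dicts mapping (x, y) -> color,
    defined on the same set of comparison points.
    Returns (matched, first_ratio, differing_points).
    """
    points = list(preset_colors)
    # S311: find the points whose colors differ between the two pages.
    differing = [p for p in points if preset_colors[p] != tested_colors[p]]
    # S312: share of differing points among all comparison points.
    first_ratio = len(differing) / len(points)
    # S313-S315: the pages match only if the share is below the preset ratio.
    return first_ratio < first_preset_ratio, first_ratio, differing


preset = {(3, 2): "black", (8, 4): "white", (5, 5): "black", (1, 1): "white"}
tested = {(3, 2): "black", (8, 4): "black", (5, 5): "black", (1, 1): "white"}
matched, first_ratio, differing = compare_points(preset, tested)
```

With four comparison points and one difference the first ratio is 0.25, so under the assumed threshold the pages do not match. A realistic run would use far more points, as noted above.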

In a possible embodiment of the present invention, when the method shown in FIG. 7 determines that the areas provided with the first identifier do not match between the preset webpage and the webpage to be tested, that is, when the webpage to be tested has a filtering problem, the following steps may further be performed to determine the specific type of the filtering problem:

Determine whether, in the webpage to be tested, the color of a first area corresponding to a preset comparison point whose color comparison result is "different" is the same as the preset color;

If the color of the first area is the same as the preset color, it is determined that the first area has a filtering failure problem; otherwise it is determined that the first area has a false filtering problem.

For example, suppose the color comparison result of the preset comparison point P1 (3, 2) is "different", that is, the color of the pixel at coordinates (3, 2) in the webpage to be tested differs from the color of the pixel at coordinates (3, 2) in the preset webpage. On this premise, if the color of the pixel at (3, 2) in the webpage to be tested is the same as the preset color, then the color of the pixel at (3, 2) in the preset webpage is different from the preset color. This indicates that no actual content exists in that area of the preset webpage while actual content exists in the corresponding area of the webpage to be tested. It can therefore be determined that non-valid content exists in the area of the webpage to be tested corresponding to the preset comparison point, that is, filtering failure has occurred. Conversely, if the color of the pixel at (3, 2) in the webpage to be tested is different from the preset color, then the color of the pixel at (3, 2) in the preset webpage is the same as the preset color. This indicates that actual content exists in that area of the preset webpage but not in the corresponding area of the webpage to be tested. It can therefore be determined that valid content in the area of the webpage to be tested corresponding to the preset comparison point has been filtered out, that is, false filtering has occurred.
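The per-point decision just described can be written as a small sketch. The dictionary representation of the pages and the name `classify_differences` are assumptions; the rule itself (tested color equals the preset color means filtering failure, otherwise false filtering) follows the text.

```python
def classify_differences(preset_colors, tested_colors, preset_color="black"):
    """Label every comparison point whose colors differ between the pages.

    preset_colors / tested_colors: dicts mapping (x, y) -> color.
    """
    verdicts = {}
    for point, reference in preset_colors.items():
        observed = tested_colors[point]
        if observed == reference:
            continue  # the pair of areas matches, nothing to classify
        if observed == preset_color:
            # Content is present in the tested page but not in the reference.
            verdicts[point] = "filtering failure"
        else:
            # Content is present in the reference but missing from the tested page.
            verdicts[point] = "false filtering"
    return verdicts


preset = {(3, 2): "white", (8, 4): "black", (1, 1): "black"}
tested = {(3, 2): "black", (8, 4): "white", (1, 1): "black"}
verdicts = classify_differences(preset, tested)
```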

Optionally, in another possible embodiment of the present invention, based on the principle of the method shown in FIG. 2, the determination in step S23 of whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested may be implemented by the following method flow; referring to FIG. 9, the method includes the following steps:

S341. Compare, in the preset webpage and the webpage to be tested, a background color of an area corresponding to the same preset comparison point with the preset color;

S342, recording the number M1 of regions in the preset webpage with the same background color as the preset color, and the number M2 of the regions in the webpage to be tested that have the same background color as the preset color;

S343, calculating a ratio M1/M2 of the M1 and M2;

S344, it is determined whether M1/M2 is within a preset range, if yes, step S345 is performed, otherwise step S346 is performed;

S345. Determine that the preset webpage matches an area of the webpage to be tested whose background color is the preset color.

S346. Determine that the preset webpage does not match an area in which the background color of the webpage to be tested is the preset color.

Strictly speaking, when the areas whose background color is the preset color match completely between the preset webpage and the webpage to be tested, M1=M2, that is, M1/M2=1, so the preset range in step S344 would be the single value 1. However, according to the detection accuracy requirement in practical applications, the preset range may be set to a numerical interval that includes 1; the higher the detection accuracy requirement, the larger the minimum value and the smaller the maximum value of the preset range.

Further, when the method shown in FIG. 9 determines that the areas whose background color is the preset color do not match between the preset webpage and the webpage to be tested, that is, when the webpage to be tested has a filtering problem, the following step may further be performed to determine the specific type of the filtering problem:

If M1>M2, it is determined that the webpage to be tested has false filtering; if M1<M2, it is determined that the webpage to be tested has filtering failure.

To better automate the detection and to complete the comparison between the preset webpage and the webpage to be tested quickly, in a specific embodiment of the present invention the interlaced webpage scanning method shown in FIG. 10 is performed on the webpage to be tested and on the preset webpage respectively to acquire M1 and M2, thereby implementing steps S341 to S342 shown in FIG. 9.

Referring to Figure 10, the method includes the following steps:

S1: setting the scanning parameters by using the upper left corner of the webpage to be scanned as the coordinate origin, including: the abscissa X (initial value is 0), the ordinate Y (initial value is 0), the horizontal scanning step length ΔW, and the longitudinal scanning step length ΔH, the width W of the web page, and the height H of the web page;

S2: determining whether the color of the preset comparison point whose coordinates are (X, Y) is the same as the preset color, if yes, executing step S3, otherwise performing step S4;

S3: Record the comparison result corresponding to the preset comparison point (X, Y) as 1, and perform step S5;

S4: Record the comparison result corresponding to the preset comparison point (X, Y) as 0, and perform step S5;

S5: increasing the value of the ordinate Y by a longitudinal scanning step size ΔH;

That is, the assignment operation Y=Y+ΔH is performed.

S6: determining whether the ordinate Y is greater than H, if yes, proceeding to step S7, otherwise returning to step S2;

S7: increasing the value of the abscissa X by a horizontal scanning step size ΔW, and setting the value of the ordinate Y to 0;

That is, the assignment operation X=X+ΔW is performed, and Y=0.

S8: determining whether the abscissa X is greater than W, if yes, proceeding to step S9, otherwise returning to step S2;

S9: Count the number M of comparison results that are "1"; when the scanned webpage is the preset webpage, M=M1, and when the scanned webpage is the webpage to be tested, M=M2.

It can be seen that, in the method of FIG. 10, each scan point is a preset comparison point, and the total number of scan points, that is, the number of preset comparison points, can be adjusted simply and flexibly by adjusting the horizontal scanning step ΔW and/or the longitudinal scanning step ΔH. At the same time, the color of the area corresponding to each preset comparison point is automatically compared with the preset color during the scan, which improves processing efficiency.

Optionally, in another feasible embodiment of the present invention, the comparison results in the method shown in FIG. 10 may be stored in a numerical matrix. For example, if during the scanning process the abscissa X takes 20 values in total and the ordinate Y takes five values in total, a matrix of 5 rows and 20 columns is obtained, as shown below:

0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

0,0,0,0,1,1,1,0,0,1,1,1,1,1,0,0,0,1,1,1

In the above digital matrix, each number corresponds to one scanning point, that is, a preset comparison point.
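As a minimal sketch of this storage scheme, the matrix above can be held as a list of rows and the count M read off directly. Treating the row index as the Y step and the column index as the X step is an assumption consistent with the 5 rows and 20 columns described; the helper `result_at` is hypothetical.

```python
MATRIX_TEXT = """\
0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,1,1,0,0,1,1,1,1,1,0,0,0,1,1,1
"""

# One inner list per scanned Y value, one entry per scanned X value.
matrix = [[int(v) for v in line.split(",")] for line in MATRIX_TEXT.splitlines()]
rows, cols = len(matrix), len(matrix[0])

# M is the number of scan points whose comparison result is 1.
m = sum(sum(row) for row in matrix)


def result_at(matrix, x_index, y_index):
    """Comparison result of the scan point at the given X and Y step indices."""
    return matrix[y_index][x_index]
```

Storing the preset webpage and the webpage to be tested in two such matrices of the same shape also allows them to be compared entry by entry.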

FIG. 11 is a flowchart of a method for processing webpage data according to another embodiment of the present invention. Referring to FIG. 11, the webpage data processing method described in this embodiment includes the following steps:

S31: Obtain a webpage to be tested, and a preset webpage corresponding to the webpage address of the webpage to be tested;

S32. Set a border in an area where the actual content exists in the preset webpage and the webpage to be tested respectively;

The border coincides with the boundary of the area where the actual content exists. FIG. 12 is a schematic diagram of a webpage after a border is set around the "column" area of the webpage shown in FIG. 6(a); it should be noted that the border used in the embodiment of the present invention is not limited to the dashed box shown in that example.

S33: Determine whether the areas provided with the border match between the preset webpage and the webpage to be tested; if yes, step S34 is performed, otherwise step S35 is performed;

S34: determining that the webpage to be tested does not have a filtering problem;

S35: Determine that the webpage to be tested has a filtering problem.

Corresponding to the embodiment shown in FIG. 1, the embodiment shown in FIG. 11 uses a border as the first identifier, and is used to mark an area in the webpage where the actual content exists.

Optionally, the determination in the foregoing step S33 of whether the areas provided with the border match between the preset webpage and the webpage to be tested may be implemented by the following steps:

S321: Calculate a second ratio between the area of the portion in which the bordered areas of the preset webpage and of the webpage to be tested do not overlap and the total area of the bordered areas in the preset webpage;

S322, determining whether the second ratio is less than the second preset ratio, if yes, proceeding to step S323, otherwise performing step S324;

S323: determining that the preset webpage matches an area in the webpage to be tested that is provided with the border;

S324: Determine that the preset webpage does not match an area in the webpage to be tested that is provided with the border.

The larger the second ratio, the larger the non-overlapping portion and, correspondingly, the larger the unmatched area between the preset webpage and the webpage to be tested; the smaller the second ratio, the larger the overlapping portion and the larger the matching area between the preset webpage and the webpage to be tested.
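One way to picture steps S321 to S324 is to rasterize the bordered areas into sets of unit cells, so that the non-overlapping portion is the symmetric difference of the two sets. This cell representation, the helper names and the second preset ratio of 0.2 are assumptions made for the sketch, not part of the described method.

```python
def rect_cells(left, top, width, height):
    """Unit cells covered by an axis-aligned bordered rectangle."""
    return {(x, y) for x in range(left, left + width) for y in range(top, top + height)}


def second_ratio(preset_cells, tested_cells):
    """Non-overlapping bordered area divided by the preset page's bordered area."""
    non_overlapping = preset_cells ^ tested_cells  # cells bordered in only one page
    return len(non_overlapping) / len(preset_cells)


def borders_match(preset_cells, tested_cells, second_preset_ratio=0.2):
    """S322-S324: the pages match when the second ratio is below the preset ratio."""
    return second_ratio(preset_cells, tested_cells) < second_preset_ratio


preset_cells = rect_cells(0, 0, 4, 2)   # 8 cells bordered in the preset webpage
tested_cells = rect_cells(0, 0, 3, 2)   # 6 cells bordered in the webpage to be tested
ratio = second_ratio(preset_cells, tested_cells)
```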

It should be noted that the specific form of the first identifier used to mark the areas where actual content exists in the webpage is not limited to the preset color of the embodiment shown in FIG. 5 or to the polygonal border of the border-based embodiment; all other embodiments using other marking methods that those skilled in the art obtain without creative effort should fall within the protection scope of the present invention.

In a possible embodiment of the present invention, when the method shown in FIG. 13 determines that the areas provided with the first identifier do not match between the preset webpage and the webpage to be tested, that is, when the webpage to be tested has a filtering problem, the following steps may further be performed to determine the specific type of the filtering problem:

When, in the preset webpage, no border is set in the area corresponding to a first area of the webpage to be tested in which the border is set, it is determined that the first area has filtering failure;

When, in the preset webpage, the border is set in the area corresponding to a second area of the webpage to be tested in which no border is set, it is determined that the second area has false filtering.
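These two rules amount to two set differences if the bordered areas are again modelled as sets of unit cells. That representation and the function name `classify_border_regions` are assumptions for illustration only.

```python
def classify_border_regions(preset_cells, tested_cells):
    """Split the mismatching cells by the type of filtering problem.

    preset_cells / tested_cells: sets of (x, y) cells that carry a border
    in the preset webpage and in the webpage to be tested respectively.
    """
    return {
        # Bordered in the tested page only: content that should have been filtered.
        "filtering failure": tested_cells - preset_cells,
        # Bordered in the preset page only: valid content that was filtered out.
        "false filtering": preset_cells - tested_cells,
    }


preset_cells = {(0, 0), (1, 0), (2, 0)}
tested_cells = {(1, 0), (2, 0), (3, 0)}
regions = classify_border_regions(preset_cells, tested_cells)
```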

FIG. 14 is a flowchart of a method for processing webpage data according to another possible embodiment of the present invention, including the following steps:

S41. Obtain a webpage to be tested, and a preset webpage corresponding to the webpage address of the webpage to be tested;

S42. Set a first identifier in the areas where actual content exists in the preset webpage and in the webpage to be tested, respectively;

S43. Divide the preset webpage and the webpage to be tested each into a plurality of comparison areas that correspond to one another one-to-one;

FIG. 15 shows a partitioning result for a preset webpage and a webpage to be tested. The preset webpage is divided into four comparison areas, Q1, Q2, Q3 and Q4; correspondingly, the webpage to be tested is also divided into four areas: area Z1 corresponding to Q1, area Z2 corresponding to Q2, area Z3 corresponding to Q3, and area Z4 corresponding to Q4.

S44. For each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, determine whether the areas provided with the first identifier match; if yes, step S45 is performed, otherwise step S46 is performed;

Taking FIG. 15 as an example, it is determined separately whether the areas provided with the first identifier match in Q1 and Z1, in Q2 and Z2, in Q3 and Z3, and in Q4 and Z4.

S45. Determine that, in the pair of comparison areas, the comparison area belonging to the webpage to be tested has no filtering problem.

S46. Determine that, in the pair of comparison areas, the comparison area belonging to the webpage to be tested has a filtering problem.

In the foregoing technical solution, the preset webpage and the webpage to be tested are partitioned, and for each pair of corresponding areas it is determined whether the areas provided with the first identifier match; compared with comparing the entire webpages, this can reduce the detection error.
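The region-wise comparison can be sketched on 0/1 grids such as those produced by a scan. Splitting each page into four quadrants, the helper names, and the use of a count ratio with the range [0.75, 1.35] as the per-region matching test are assumptions; the layout of FIG. 15 and the per-region criterion may differ in practice.

```python
def split_into_quadrants(grid):
    """Split a 0/1 grid (list of rows) into four quadrants numbered 1 to 4."""
    mid_r, mid_c = len(grid) // 2, len(grid[0]) // 2
    return {
        1: [row[:mid_c] for row in grid[:mid_r]],
        2: [row[mid_c:] for row in grid[:mid_r]],
        3: [row[:mid_c] for row in grid[mid_r:]],
        4: [row[mid_c:] for row in grid[mid_r:]],
    }


def marked_count(region):
    """Number of marked (1) points in a region."""
    return sum(sum(row) for row in region)


def compare_by_region(preset_grid, tested_grid, low=0.75, high=1.35):
    """Return a verdict for each region Z1..Z4 of the webpage to be tested."""
    q = split_into_quadrants(preset_grid)
    z = split_into_quadrants(tested_grid)
    verdicts = {}
    for i in q:
        m_q, m_z = marked_count(q[i]), marked_count(z[i])
        if m_q == 0 and m_z == 0:
            matched = True          # neither region contains marked content
        elif m_z == 0:
            matched = False         # content only in the preset page's region
        else:
            matched = low <= m_q / m_z <= high
        verdicts[f"Z{i}"] = "no filtering problem" if matched else "filtering problem"
    return verdicts


preset_grid = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
tested_grid = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
verdicts = compare_by_region(preset_grid, tested_grid)
```

Here only Z4 is flagged, which localizes the problem to one comparison area instead of reporting it for the page as a whole.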

Through the description of the above method embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential or that contributes to the prior art may be embodied in the form of a software product that is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Corresponding to the embodiment of the webpage data processing method provided by the present invention, the present invention further provides a webpage data processing apparatus.

FIG. 16 is a schematic structural diagram of a webpage data processing apparatus according to a possible embodiment of the present invention. Referring to FIG. 16, the webpage data processing apparatus includes a webpage obtaining unit 810, a webpage marking unit 820, a webpage matching unit 830, and a result determining unit 840.

The webpage obtaining unit 810 is configured to obtain a webpage to be tested and a preset webpage corresponding to the webpage address of the webpage to be tested.

The webpage marking unit 820 is configured to set a first identifier in an area where the actual content exists in the preset webpage and the webpage to be tested.

The webpage matching unit 830 is configured to determine whether the preset webpage matches an area in the webpage to be tested in which the first identifier is disposed.

The result determining unit 840 is configured to determine that the webpage to be tested does not have a filtering problem when the areas provided with the first identifier match between the preset webpage and the webpage to be tested, and otherwise to determine that the webpage to be tested has a filtering problem.

It can be seen that, in the embodiment of the present invention, the preset webpage and the webpage to be tested corresponding to the same webpage address are obtained, and the first identifier is set in the areas where actual content exists in the preset webpage and in the webpage to be tested respectively. Using the preset webpage as a reference, it is determined whether the areas provided with the first identifier in the webpage to be tested match the areas provided with the first identifier in the preset webpage, and whether the webpage to be tested has a filtering problem is determined according to the result. By setting corresponding preset webpages for different webpage addresses, the embodiment of the present invention can automatically detect filtering problems of webpages corresponding to multiple websites and multiple webpage addresses; after the layout style and/or frame code of the webpage corresponding to a webpage address changes, only the preset webpage corresponding to that webpage address needs to be changed for accurate automatic detection to continue. Therefore, compared with manual detection, the embodiment can detect filtering problems quickly and in time and improves detection efficiency, and it is particularly suitable when the number of webpages to be tested is huge.

In a possible embodiment of the present invention, the webpage matching unit 830 may include:

An area calculating unit, configured to separately calculate a first total area of the area in which the first identifier is set in the preset webpage, and a second total area of the area in which the first identifier is disposed in the webpage to be tested ;

a third calculating unit, configured to calculate a third ratio between the first total area and the second total area;

a third determining unit, configured to determine whether the third ratio is within a preset range; if the third ratio is within the preset range, to determine that the areas provided with the first identifier match between the preset webpage and the webpage to be tested, and otherwise to determine that the areas provided with the first identifier do not match between the preset webpage and the webpage to be tested.

In addition, the webpage data processing apparatus may further include: a third sub-determining unit, configured to compare the third ratio with the minimum value and the maximum value of the preset range after the result determining unit determines that the webpage to be tested has a filtering problem, to determine that the webpage to be tested has filtering failure when the third ratio is less than the minimum value of the preset range, and to determine that the webpage to be tested has false filtering when the third ratio is greater than the maximum value of the preset range.

In another possible embodiment of the present invention, the webpage marking unit 820 may include:

a background setting unit, configured to respectively set a background color of an area where the actual content exists in the preset webpage and the webpage to be tested as a preset color;

a word processing unit, configured to set a color of the text to be the preset color when the actual content in the preset webpage and/or the webpage to be tested is a text;

The picture processing unit is configured to delete the picture when the actual content in the preset webpage and/or the webpage to be tested is a picture.

Correspondingly, in the foregoing embodiment, the webpage matching unit 830 may include:

a color comparison unit, configured to compare whether the color of the area corresponding to the same preset comparison point in the preset webpage and the webpage to be tested is the same;

a first calculating unit, configured to calculate a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points;

a first determining unit, configured to determine whether the first ratio is smaller than a first preset ratio, to determine that the areas provided with the first identifier match between the preset webpage and the webpage to be tested when the first ratio is smaller than the first preset ratio, and otherwise to determine that the areas provided with the first identifier do not match between the preset webpage and the webpage to be tested.

In addition, the webpage data processing apparatus provided in the foregoing embodiment may further include: a first sub-determining unit, configured to determine, after the result determining unit determines that the webpage to be tested has a filtering problem, whether the color of a first area in the webpage to be tested that corresponds to a preset comparison point whose color comparison result is "different" is the same as the preset color; when the color of the first area is the same as the preset color, it is determined that the first area has filtering failure, and otherwise it is determined that the first area has false filtering.

In the border-based embodiment, the webpage matching unit 830 may correspondingly include:

a second calculating unit, configured to calculate a second ratio between the area of the portion in which the polygonal borders of the preset webpage and of the webpage to be tested do not overlap and the total area of the polygonal borders in the preset webpage;

a second determining unit, configured to determine, when the second ratio is not less than the second preset ratio, that the areas provided with the first identifier do not match between the preset webpage and the webpage to be tested, and otherwise to determine that the areas provided with the first identifier match between the preset webpage and the webpage to be tested.

In addition, the webpage data processing apparatus provided in the foregoing embodiment may further include: a second sub-determining unit, configured to perform the following determination after the result determining unit determines that the webpage to be tested has a filtering problem:

If, in the preset webpage, no border is set in the area corresponding to a first area of the webpage to be tested in which the border is set, it is determined that the first area has filtering failure; if, in the preset webpage, the border is set in the area corresponding to a second area of the webpage to be tested in which no border is set, it is determined that the second area has false filtering.

Generally, the webpage matching unit 830 determines the match directly on the basis of the entire webpage. In another possible embodiment of the present invention, the webpage data processing apparatus may further include: an area dividing unit, configured to divide the preset webpage and the webpage to be tested each into a plurality of comparison areas that correspond to one another one-to-one; correspondingly, the webpage matching unit 830 includes: a first sub-matching unit, configured to determine, for each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, whether the areas provided with the first identifier match.

In the above embodiment, by partitioning the webpage to be tested and the preset webpage and determining for each area whether a match exists, the error caused by numerical calculation and other factors in the determination process can be reduced and the detection accuracy improved.

For convenience of description, the above apparatus is described as being divided by function into various units. Of course, when the invention is practiced, the functions of the various units may be implemented in one or more pieces of software and/or hardware.

Additionally, the present invention provides a computer readable medium having program code executable by a processor, which, when executed, causes the processor to perform the steps of:

Obtaining a webpage to be tested, and a preset webpage corresponding to the webpage address of the webpage to be tested;

Setting a first identifier in an area where the actual content exists in the preset webpage and the webpage to be tested respectively;

Determining whether the preset webpage matches an area in the webpage to be tested in which the first identifier is set;

If the preset webpage matches the area in which the first identifier is set in the webpage to be tested, it is determined that the webpage to be tested does not have a filtering problem, otherwise, it is determined that the webpage to be tested has a filtering problem.

In a possible embodiment of the present invention, determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: separately calculating a first total area of the areas in the preset webpage in which the first identifier is set, and a second total area of the areas in the webpage to be tested in which the first identifier is set; calculating a third ratio between the first total area and the second total area; determining whether the third ratio is within a preset range; if the third ratio is within the preset range, determining that the preset webpage matches the area of the webpage to be tested in which the first identifier is set, otherwise determining that the preset webpage does not match the area of the webpage to be tested in which the first identifier is set.

In addition, after determining that the webpage to be tested has a filtering problem, the following step may be performed: if the third ratio is less than a minimum value of the preset range, determining that the webpage to be tested has a filtering failure; If the third ratio is greater than the maximum value of the preset range, it is determined that the webpage to be tested has error filtering.
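The area-based judgment above can be illustrated with a minimal sketch. It assumes that the third ratio is the first total area divided by the second total area (the text only says "between"), that the marked areas are non-overlapping rectangles, and that the preset range is 0.95 to 1.05; the names `total_area` and `classify_by_area` are hypothetical.

```python
from typing import List, Tuple

# (x, y, width, height) of one area carrying the first identifier
Rect = Tuple[float, float, float, float]


def total_area(rects: List[Rect]) -> float:
    """Sum of the marked areas; assumes the rectangles do not overlap."""
    return sum(w * h for _, _, w, h in rects)


def classify_by_area(preset: List[Rect], tested: List[Rect],
                     low: float = 0.95, high: float = 1.05) -> str:
    """Compare the preset page with the page under test by total marked area."""
    first = total_area(preset)    # first total area (preset webpage)
    second = total_area(tested)   # second total area (webpage to be tested)
    if second == 0:
        # Nothing is left in the tested page: treat as an unbounded ratio.
        return "match" if first == 0 else "error_filtering"
    third_ratio = first / second  # assumed direction of the ratio
    if third_ratio < low:
        return "filtering_failure"  # tested page shows more content than expected
    if third_ratio > high:
        return "error_filtering"    # tested page lost content it should have kept
    return "match"
```

Under these assumptions, an advertisement that was not filtered adds marked area to the webpage to be tested, which pushes the ratio below the preset range.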

In another possible embodiment of the present invention, setting the first identifier in the areas where actual content exists in the preset webpage and the webpage to be tested, respectively, includes: setting the background color of the areas where actual content exists in the preset webpage and in the webpage to be tested to a preset color; when the actual content is text, setting the color of the text to the preset color; when the actual content is a picture, deleting the picture.

Correspondingly, determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: comparing whether the colors of the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested are the same; calculating a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points; determining whether the first ratio is smaller than a first preset ratio; if the first ratio is smaller than the first preset ratio, determining that the preset webpage matches the area of the webpage to be tested in which the first identifier is set, otherwise determining that the preset webpage does not match the area of the webpage to be tested in which the first identifier is set.

In addition, after determining that the webpage to be tested has a filtering problem, the following step may be performed: determining whether the color of a first area in the webpage to be tested, namely the area corresponding to a preset comparison point whose color comparison result is "different", is the same as the preset color; if the color of the first area is the same as the preset color, it is determined that the first area has a filtering failure problem, otherwise it is determined that the first area has a false filtering problem.
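The color-based comparison can be sketched as follows. This is only an illustration: each page is assumed to be already rendered into a mapping from comparison point to color, and the marker color, the threshold of 0.05 and the function name `compare_by_color` are hypothetical choices.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]
Color = Tuple[int, int, int]

PRESET_COLOR: Color = (255, 0, 255)  # hypothetical marker color for actual content


def compare_by_color(preset_px: Dict[Point, Color],
                     tested_px: Dict[Point, Color],
                     points: List[Point],
                     first_preset_ratio: float = 0.05):
    """Return (matched, first_ratio, problems) for the preset comparison points."""
    # Points where the two pages show different colors.
    differing = [p for p in points if preset_px[p] != tested_px[p]]
    first_ratio = len(differing) / len(points)
    matched = first_ratio < first_preset_ratio
    problems: Dict[Point, str] = {}
    if not matched:
        for p in differing:
            # Marker color present only in the tested page: content that should
            # have been filtered is still there. Otherwise content went missing.
            if tested_px[p] == PRESET_COLOR:
                problems[p] = "filtering_failure"
            else:
                problems[p] = "false_filtering"
    return matched, first_ratio, problems
```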

In another possible embodiment of the present invention, setting the first identifier in the areas where actual content exists in the preset webpage and the webpage to be tested, respectively, includes: setting a border on each area where actual content exists in the preset webpage and in the webpage to be tested, respectively; wherein the border coincides with the boundary of the area where the actual content exists.

Correspondingly, determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: calculating a second ratio between the area of the portion in which the bordered areas of the preset webpage and the bordered areas of the webpage to be tested do not overlap, and the total area of the bordered areas of the preset webpage; determining whether the second ratio is smaller than a second preset ratio; if the second ratio is smaller than the second preset ratio, determining that the preset webpage matches the area of the webpage to be tested in which the first identifier is set, otherwise determining that the preset webpage does not match the area of the webpage to be tested in which the first identifier is set.

In addition, after determining that the webpage to be tested has a filtering problem, the following step may be performed: when, in the preset webpage, the area corresponding to a first area that has the border in the webpage to be tested does not have the border, it is determined that the first area has a filtering failure; when, in the preset webpage, the area corresponding to a second area that has no border in the webpage to be tested does have the border, it is determined that the second area has false filtering.
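A minimal sketch of the second ratio follows, assuming that bordered areas are axis-aligned rectangles with integer pixel coordinates and that the non-overlapping portion is the symmetric difference of the two sets of covered pixels. The helper names `covered_cells` and `second_ratio` are hypothetical.

```python
from typing import List, Set, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in whole pixels


def covered_cells(rects: List[Rect]) -> Set[Tuple[int, int]]:
    """Set of pixel cells covered by at least one bordered area."""
    cells: Set[Tuple[int, int]] = set()
    for x, y, w, h in rects:
        for i in range(x, x + w):
            for j in range(y, y + h):
                cells.add((i, j))
    return cells


def second_ratio(preset_rects: List[Rect], tested_rects: List[Rect]) -> float:
    """Non-overlapping bordered area divided by the preset page's bordered area."""
    preset = covered_cells(preset_rects)
    tested = covered_cells(tested_rects)
    if not preset:
        raise ValueError("preset webpage has no bordered area")
    # Symmetric difference: cells bordered in exactly one of the two pages.
    return len(preset ^ tested) / len(preset)
```

The pages would then be judged to match when the returned value is smaller than the second preset ratio.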

In another possible embodiment of the present invention, before determining whether the preset webpage matches the area of the webpage to be tested in which the first identifier is set, the following step is further included: dividing the preset webpage and the webpage to be tested, respectively, into a plurality of comparison areas in one-to-one correspondence.

Correspondingly, determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: determining, for each pair of mutually corresponding comparison areas in the preset webpage and the webpage to be tested, whether the areas in which the first identifier is set match.

In addition, referring to FIG. 17, a webpage data processing apparatus according to another embodiment of the present invention includes a processor 101 and a computer readable medium 102. The computer readable medium 102 stores program code that can be executed by the processor 101, and the processor 101 reads the program code within the computer readable medium 102 to implement the steps or unit functions described above.

In addition, it should be understood that the computer readable medium (e.g., memory) described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example and not limitation, nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM) and direct Rambus RAM (DRRAM). Storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.

Figure 18 is a diagram showing a web page data processing apparatus according to a first embodiment of the present invention. As shown in FIG. 18, the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40.

The first obtaining unit 10 is configured to obtain a uniform resource locator of the webpage to be tested.

The browser can be a personal computer (PC) browser or a browser on a mobile terminal. The user can input a Uniform Resource Locator (URL) in the browser. The url is obtained in order to determine whether advertisement filtering is required.

The first matching unit 20 is configured to match the uniform resource locators by using keywords of the advertisement filtering rule.

After the input uniform resource locator is obtained, it can be matched by using the keywords of the advertisement filtering rules. The url may first be segmented, for example by passing the url into a segmenter in which a predetermined rule is set, so as to obtain a plurality of segment characters. The segment characters are then passed into a keyword matcher and matched against the preset keywords in the keyword matcher, judging one by one whether each segment character hits a keyword in the keyword matcher. A keyword may correspond to multiple advertisement filtering rules, so that when a keyword matches the url, only the advertisement filtering rules corresponding to that keyword need to be matched with the url, and there is no need to match every advertisement filtering rule.

The second matching unit 30 is configured to match the uniform resource locator with the advertisement filtering rule corresponding to the keyword when the uniform resource locator matches the keyword.

When the uniform resource locator matches the keyword, the uniform resource locator is matched only with the advertisement filtering rules corresponding to the keyword, and is not matched with all the advertisement filtering rules.

The uniform resource locator that is matched with the advertisement filtering rules corresponding to the keyword may be the uniform resource locator that matched the keyword. Specifically, the segment character of the url that matched the keyword may be passed into the rule matcher, which holds the correspondence between keywords and advertisement filtering rules. To match the segment character against the advertisement filtering rules in the rule matcher, the segment character may first be matched against the whitelist advertisement filtering rules and then against the blacklist advertisement filtering rules, where the whitelist is the list of advertisement filtering rules under which resources matching a rule are not filtered, and the blacklist is the list of advertisement filtering rules under which resources matching a rule are filtered. If a whitelist advertisement filtering rule is matched, the resource corresponding to the url of the segment character may be requested; if a blacklist advertisement filtering rule is matched, the resource corresponding to the url of the segment character need not be requested. If neither is matched, the next segment character can be matched in the same way.
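The whitelist-before-blacklist order can be sketched as below. This is an assumption-laden illustration: rules are taken to be already converted to regular expressions and are applied to the whole url rather than to a single segment character, and `RuleSet`, `decide` and the result strings are hypothetical names.

```python
import re
from typing import Dict, List, NamedTuple


class RuleSet(NamedTuple):
    whitelist: List[re.Pattern]  # rules whose matches must NOT be filtered
    blacklist: List[re.Pattern]  # rules whose matches are filtered


def decide(url: str, hit_keywords: List[str],
           rule_matcher: Dict[str, RuleSet]) -> str:
    """Check only the rules attached to the keywords the url has hit."""
    for keyword in hit_keywords:
        rules = rule_matcher.get(keyword)
        if rules is None:
            continue
        # Whitelist first: a whitelisted resource may be requested.
        if any(r.search(url) for r in rules.whitelist):
            return "request"
        # Then blacklist: a blacklisted resource need not be requested.
        if any(r.search(url) for r in rules.blacklist):
            return "block"
    return "no_match"  # none of the candidate rules matched
```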

The matching of the url in the rule matcher may first convert the corresponding advertisement filtering rule of the matched keyword into a regular expression, and then use the interface of the regular expression to query the advertisement filtering rule, so as to determine whether the url is related to the advertisement filtering rule. match.

The filtering unit 40 is configured to perform advertisement filtering by using an advertisement filtering rule when the uniform resource locator matches the advertisement filtering rule corresponding to the keyword.

After the uniform resource locator is matched with the advertisement filtering rules corresponding to the keyword, if the uniform resource locator that matched the keyword also matches an advertisement filtering rule corresponding to the keyword, the matched advertisement filtering rule may be output and used for advertisement filtering. That is, if it is determined that the resource requested by the url is an advertisement, the browser does not need to request the resource.

According to the embodiment of the present invention, the url is first matched against the keywords of the advertisement filtering rules, and the url that matched a keyword is then matched against the advertisement filtering rules corresponding to that keyword. This avoids matching the url against all the advertisement filtering rules one by one and reduces the number of advertisement filtering rules to be matched, thereby solving the problem that each advertisement filtering operation takes a long time because of the large number of filtering rules, ensuring effective filtering of advertisements, and achieving the effect of reducing the advertisement filtering time.

For example, if there are 20,000 advertisement filtering rules, existing advertisement filtering needs to match the url against the 20,000 advertisement filtering rules one by one, and performs advertisement filtering if a rule is matched. In the embodiment of the present invention, the url is first matched against the keywords of the advertisement filtering rules. If the matched keyword A corresponds to 100 advertisement filtering rules, the url only needs to be matched against those 100 advertisement filtering rules, which greatly reduces the matching time.

In the embodiment of the present invention, the webpage data processing apparatus may be used for advertisement filtering in a PC browser or in a browser on a mobile terminal, and its function may be implemented by the PC or the mobile terminal itself, or by a cloud server (such as middleware). The webpage data processing apparatus of the embodiment of the present invention is particularly effective when the number of advertisement filtering rules that can be supported on the mobile terminal is limited.

Preferably, the webpage data processing apparatus includes an incoming unit and a segmentation unit. The incoming unit is configured to pass the uniform resource locator into the segmenter after the uniform resource locator of the webpage to be tested is obtained in the browser. The segmentation unit is configured to segment the uniform resource locator in the segmenter to obtain a plurality of segment characters. The first matching unit includes a second matching module, which is configured to match the plurality of segment characters one by one against the keywords in the keyword matcher. The segmenter is used to segment the uniform resource locator.

In the segmenter, segmentation may be performed according to a preset segmentation rule. The preset segmentation rule may include: first, segmenting the url using "/" as a separator, where the first segment is the domain name segment and the remaining segments are path segments; then, further dividing the domain name segment using "." as a delimiter; finally, further dividing each non-domain-name segment according to special characters, where the special characters may include ".", "_", "-", "?", ":", "=", ";", "&", "+", etc. Segmenting the url according to predetermined rules further ensures the effect of advertisement filtering.
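The preset segmentation rule can be sketched as follows. One assumption is added that the text does not state: a scheme prefix such as "http://" is stripped first, so that the first "/"-separated segment really is the domain name. The function name `segment_url` is hypothetical.

```python
import re
from typing import List

# Special characters used to split non-domain-name segments.
SPECIAL = re.compile(r"[._\-?:=;&+]")


def segment_url(url: str) -> List[str]:
    """Split a url into segment characters following the preset rule."""
    # Assumption: drop the scheme so the first segment is the domain name.
    rest = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://", "", url)
    parts = rest.split("/")
    domain, paths = parts[0], parts[1:]
    # Domain name segment: split on "." only.
    segments = [s for s in domain.split(".") if s]
    # Path segments: split on the special characters.
    for path in paths:
        segments.extend(s for s in SPECIAL.split(path) if s)
    return segments
```

For instance, `segment_url("http://ads.example.com/banner/ad_300x250.js?id=7&pos=top")` yields the domain parts `ads`, `example`, `com` followed by the path pieces `banner`, `ad`, `300x250`, `js`, `id`, `7`, `pos`, `top`.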

Figure 19 is a diagram showing a web page data processing apparatus in accordance with a second embodiment of the present invention. This embodiment can be taken as a preferred embodiment of the above embodiment. As shown in FIG. 19, the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40. The webpage data processing apparatus further includes a second obtaining unit 50 and an establishing unit 60. The first matching unit 20 includes an obtaining module 201 and a first judging module 202.

The second obtaining unit 50 is configured to acquire a keyword corresponding to the advertisement filtering rule before the uniform resource locator is matched by using the keyword of the advertisement filtering rule.

The keywords in the keyword matcher may be initialized before the uniform resource locator is matched by using the keywords of the advertisement filtering rules. The specific initialization process may be: first obtaining the keywords corresponding to the advertisement filtering rules. For example, the keywords are extracted from the file of advertisement filtering rules, so that after the url matches a keyword, the advertisement filtering rules corresponding to that keyword can be queried.

The establishing unit 60 is configured to establish a dictionary tree of keywords corresponding to the advertisement filtering rules.

The dictionary tree, i.e. the trie, is a prefix-based query structure. The basic idea is to record the prefix information of all keywords in a table, so that the number of comparisons can be greatly reduced when querying. This method is especially useful when the number of keywords is large. The keywords are organized by establishing a dictionary tree of the keywords corresponding to the advertisement filtering rules, and the trie is used to further reduce the time consumed by advertisement filtering.

In order to achieve the fastest lookup through the trie, the keywords can be stored sequentially to improve search speed. Nodes in the trie may contain empty links (null pointers), which indicate that no keyword exists at the current position of the trie, so that a lookup can stop as early as possible.

The obtaining module 201 is configured to acquire keywords in the dictionary tree.

After the dictionary tree is built, matching the url with the keyword may first obtain the keywords in the dictionary tree to match the url with the keywords in the dictionary tree.

The first determining module 202 is configured to determine whether the uniform resource locator matches a keyword in the dictionary tree.

Determining whether the uniform resource locator matches a keyword in the dictionary tree means using the keyword dictionary tree to match the url. When the segment characters of the url are passed to the keyword matcher, the keyword matcher of the advertisement filtering rules searches the trie for a keyword matching each segment character passed in from the url segmenter, where a match may be an exact match or a partial match. An exact match means that the segment character is exactly the same as a keyword, and a partial match means that a keyword is a prefix of the segment character. For example, when searching the trie, if the keywords include "as", then a segment character of "as" or "ask" returns a successful match. When a keyword of an advertisement filtering rule is found in the trie, the corresponding advertisement filtering rules can be looked up by that keyword and used for advertisement filtering.
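The exact and partial (prefix) matching described above can be sketched with a small trie. The class and method names are hypothetical, and a dictionary per node stands in for whatever child storage an actual implementation uses; a missing child plays the role of the empty link.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.keyword: Optional[str] = None  # set when a keyword ends at this node


class KeywordTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, keyword: str) -> None:
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.keyword = keyword

    def match(self, segment: str) -> List[str]:
        """Return every keyword that equals the segment or is a prefix of it."""
        hits: List[str] = []
        node = self.root
        for ch in segment:
            nxt = node.children.get(ch)
            if nxt is None:  # empty link: no keyword continues along this path
                break
            node = nxt
            if node.keyword is not None:
                hits.append(node.keyword)
        return hits
```

With the keyword "as" inserted, `match("as")` is an exact match and `match("ask")` is a partial match; both return `["as"]`.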

According to the embodiment of the present invention, by using the dictionary tree of the keyword to match the url and the keyword, the time consumption of the url in matching the keyword is reduced, thereby further reducing the advertisement filtering time.

Preferably, the second obtaining unit 50 includes a reading module and an extraction module. The reading module is used to read the file of advertisement filtering rules. The extraction module is used to extract keywords from the file of advertisement filtering rules. The establishing unit 60 includes a first establishing module and a second establishing module. The first establishing module is used to establish the correspondence between keywords and advertisement filtering rules. The second establishing module is configured to build the dictionary tree based on the extracted keywords.

Specifically, the file of the advertisement filtering rule may be read into the memory from the disk in the PC or the mobile terminal or the cloud server. Then extract the keywords from the file of the ad filter rule and establish the corresponding relationship between the keyword and the ad filter rule. The rules for extracting keywords from the files of the advertisement filtering rule may include:

1) A keyword does not include characters reserved by adblock rule syntax, such as "@", "|", "*", etc.

2) A keyword does not include the option part of the advertisement filtering rule (the option is the part of a rule defined by adblock that indicates whether the rule is applied or not applied to certain domain names or types of resources).

3) A qualified keyword may contain the digits 0 to 9, the 26 English letters a to z, and the characters ".", "_", "-", "?", ":", "=", ";", "&", "+", etc.

4) When selecting a keyword from an advertisement filtering rule, the keyword either starts with a domain name or starts with a special character, the special characters including ".", "_", "-", "?", ";", "=", ":", "/", "&", "+", etc.

5) The character length of a keyword is greater than or equal to 3 and less than 32.

6) Strings that appear frequently in urls, such as http, https, .html, .jpg, etc., cannot be keywords.

7) Keywords are not extracted from rules that are regular expressions.

Specifically, the key (keyword) extraction process includes: traversing the character string of an advertisement filtering rule in the rule file until the first character belonging to the above extraction rule set is found, which is recorded as the starting position of the keyword; then continuing to traverse until the end of the string, or until the next character belonging to the above extraction rule set, which is recorded as the end position.

The characters between the start position and the end position are taken as a candidate keyword. It is checked whether the candidate keyword satisfies the above extraction conditions 4), 5), and 6), and if so, the candidate is returned as the final keyword.

If the candidate keyword does not satisfy the conditions, it is checked whether the string of the advertisement filtering rule has ended. If it has ended, no suitable keyword is returned; otherwise, extraction of a keyword continues.

When no appropriate keyword can be extracted from an advertisement filtering rule, the rule is added to a global queue; a rule in the global queue is an advertisement filtering rule that has no associated keyword. Every url needs to be matched against the advertisement filtering rules in the global queue. A check of the actual advertisement filtering rules in adblock shows that rules from which no qualifying keyword can be extracted are rare: currently, among 11285 rules, there are no more than 20 from which a keyword cannot be extracted.
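How the keyword correspondence and the global queue fit together can be sketched as follows. The extractor `toy_extract` is a deliberately simplified, hypothetical stand-in (first alphanumeric run that passes the length and common-string checks) and does not reproduce the full extraction conditions above; `build_index` and `candidate_rules` are likewise illustrative names.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

COMMON = {"http", "https", "html", "jpg"}  # strings too frequent to be keywords


def toy_extract(rule: str) -> Optional[str]:
    """Hypothetical, simplified keyword extractor for illustration only."""
    body = rule.split("$", 1)[0]  # drop the option part of the rule
    if len(body) > 1 and body.startswith("/") and body.endswith("/"):
        return None  # regular-expression rule: no keyword is extracted
    for cand in re.findall(r"[a-z0-9]+", body.lower()):
        if 3 <= len(cand) < 32 and cand not in COMMON:
            return cand
    return None


def build_index(rules: List[str],
                extract: Callable[[str], Optional[str]]
                ) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group rules by keyword; rules without a keyword go to the global queue."""
    by_keyword: Dict[str, List[str]] = defaultdict(list)
    global_queue: List[str] = []
    for rule in rules:
        keyword = extract(rule)
        if keyword is None:
            global_queue.append(rule)
        else:
            by_keyword[keyword].append(rule)
    return dict(by_keyword), global_queue


def candidate_rules(hit_keywords: List[str],
                    by_keyword: Dict[str, List[str]],
                    global_queue: List[str]) -> List[str]:
    """Rules a url must be checked against: keyword hits plus the global queue."""
    out: List[str] = []
    for keyword in hit_keywords:
        out.extend(by_keyword.get(keyword, []))
    out.extend(global_queue)
    return out
```

Because the global queue is appended for every url, keeping it short (no more than 20 rules in the figures above) is what keeps the per-url cost low.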

Figure 20 is a diagram showing a web page data processing apparatus in accordance with a third embodiment of the present invention. This embodiment can be taken as a preferred embodiment of the above embodiment. As shown in FIG. 20, the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40. The first matching unit 20 includes a second determining module 203, and the second matching unit 30 includes a first matching module 301.

The second judging module 203 is configured to determine whether the uniform resource locator matches a keyword of the advertisement filtering rules and, if it is determined that the uniform resource locator matches the keyword, to convert the advertisement filtering rules corresponding to the keyword into regular expressions.

The first matching module 301 is configured to match the uniform resource locator with the regular expression.

The filtering unit 40 is further configured to: when the uniform resource locator that matched the keyword matches the regular expression, output the advertisement filtering rule corresponding to the regular expression, and perform advertisement filtering using that advertisement filtering rule.

In the embodiment of the present invention, to match the url in the rule matcher, the advertisement filtering rules corresponding to the matched keyword may first be converted into regular expressions, and the regular expression interface is then used to query the advertisement filtering rules, so as to determine whether the url matches an advertisement filtering rule. Preferably, the embodiment of the present invention converts the advertisement filtering rules corresponding to a keyword into regular expressions only when it is determined that the url matches the keyword, and does not need to convert all the advertisement filtering rules into regular expressions when advertisement filtering starts.

In the embodiment of the present invention, only the advertisement filtering rules corresponding to the matched keyword need to be converted into regular expressions. Converting all the advertisement filtering rules consumes a certain amount of time; for example, in a mobile terminal browser it takes about 1.5 seconds at startup. Since the number of advertisement filtering rules corresponding to each keyword is small, usually no more than 2 and at most 10, the conversion time is short. If parsing 10,000 advertisement filtering rules takes 1.5 s, the average parsing time per rule is 0.15 ms, so the matching time increases by at most 1.5 ms (10 rules at 0.15 ms each). In addition, the embodiment of the present invention may cache the parsing result of an advertisement filtering rule after the rule is hit for the first time, so that there is no parsing overhead subsequently, thereby further reducing the time consumed.
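The on-demand conversion with caching can be sketched as below. The conversion shown is a deliberately reduced, hypothetical one that handles only the "*" wildcard and treats every other character literally; the function names are illustrative, and `functools.lru_cache` is used here simply as one way to keep the parsing result after the first hit.

```python
import re
from functools import lru_cache
from typing import List, Optional


@lru_cache(maxsize=None)
def rule_to_regex(rule: str) -> "re.Pattern":
    """Convert a rule to a regular expression once; later calls reuse the cache.

    Simplified assumption: only '*' is a wildcard, all else is literal text.
    """
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))


def first_matching_rule(url: str, candidate_rules: List[str]) -> Optional[str]:
    """Convert and test only the rules attached to the keyword that was hit."""
    for rule in candidate_rules:
        if rule_to_regex(rule).search(url):
            return rule
    return None
```

Rules that are never reached are never converted, which is the saving the paragraph above describes; `rule_to_regex.cache_info()` can be inspected to confirm that a repeated lookup does not parse the rule again.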

The embodiment of the invention further provides a webpage data processing method. It should be noted that the webpage data processing method of the embodiment of the present invention may be performed by the webpage data processing apparatus provided by the embodiment of the present invention, and the webpage data processing apparatus of the embodiment of the present invention may also be used to perform the webpage data processing method provided by the embodiment of the present invention.

Figure 21 is a flow chart of a webpage data processing method according to a first embodiment of the present invention. As shown in FIG. 21, the browser webpage data processing method includes the following steps:

Step S402, obtaining a uniform resource locator input in the browser.

The browser can be a browser on a personal computer (PC) or a browser on a mobile terminal. The user can input the Uniform Resource Locator (URL) of the webpage to be tested in the browser. The url is obtained in order to determine whether advertisement filtering is required.

Step S404, matching the uniform resource locator by using the keyword of the advertisement filtering rule.

After the input uniform resource locator is obtained, it can be matched by using the keywords of the advertisement filtering rules. The url may first be segmented, for example by passing the url into a segmenter in which a predetermined rule is set, so as to obtain a plurality of segment characters. The segment characters are then passed into a keyword matcher and matched against the preset keywords in the keyword matcher, judging one by one whether each segment character hits a keyword in the keyword matcher. A preset keyword may correspond to multiple advertisement filtering rules, so that when a keyword matches the url, only the advertisement filtering rules corresponding to that keyword need to be matched with the url, and there is no need to match every advertisement filtering rule.

Step S406: If the uniform resource locator matches the keyword, the uniform resource locator is matched with the advertisement filtering rule corresponding to the keyword.

When the uniform resource locator matches the keyword, the advertisement filtering rules corresponding to the keyword can be obtained, so that the uniform resource locator is matched only with the advertisement filtering rules corresponding to the keyword and does not need to be matched with all the advertisement filtering rules.

The uniform resource locator that is matched with the advertisement filtering rules corresponding to the keyword may be the uniform resource locator that matched the keyword. Specifically, the segment character of the url that matched the keyword may be passed into the rule matcher, which holds the correspondence between keywords and advertisement filtering rules. To match the segment character against the advertisement filtering rules in the rule matcher, the segment character may first be matched against the whitelist advertisement filtering rules and then against the blacklist advertisement filtering rules, where the whitelist is the list of advertisement filtering rules under which resources matched by a rule are not filtered, and the blacklist is the list of advertisement filtering rules under which resources matched by a rule are filtered. If a whitelist advertisement filtering rule is matched, the resource corresponding to the url of the segment character may be requested; if a blacklist advertisement filtering rule is matched, the resource corresponding to the url of the segment character need not be requested. If neither is matched, the next segment character can be matched in the same way.

To match the url in the rule matcher, the advertisement filtering rules corresponding to the matched keyword may first be converted into regular expressions, and the regular expression interface is then used to query the advertisement filtering rules to determine whether the url matches an advertisement filtering rule.

Step S408: If the uniform resource locator matches the advertisement filtering rule corresponding to the keyword, the advertisement filtering rule is used to perform advertisement filtering.

After the uniform resource locator is matched with the advertisement filtering rules corresponding to the keyword, if the uniform resource locator that matched the keyword also matches an advertisement filtering rule corresponding to the keyword, the matched advertisement filtering rule may be output and used for advertisement filtering. That is, if it is determined that the resource requested by the url is an advertisement, the browser does not need to request the resource.

According to the embodiment of the present invention, the url is first matched against the keywords of the advertisement filtering rules, and the url that matched a keyword is then matched against the advertisement filtering rules corresponding to that keyword. This avoids matching the url against all the advertisement filtering rules one by one and reduces the number of advertisement filtering rules to be matched, thereby solving the problem that each advertisement filtering operation takes a long time because of the large number of filtering rules, ensuring effective filtering of advertisements, and achieving the effect of reducing the advertisement filtering time.

For example, if there are 20,000 advertisement filtering rules, existing advertisement filtering needs to match the url against the 20,000 advertisement filtering rules one by one, and performs advertisement filtering if a rule is matched. In the embodiment of the present invention, the url is first matched against the keywords of the advertisement filtering rules, and if the matched keyword A corresponds to 100 advertisement filtering rules, the url only needs to be matched against those 100 advertisement filtering rules, which significantly reduces the matching time.

In the embodiment of the present invention, the webpage data processing method can be used for advertisement filtering in a PC browser or in a browser on a mobile terminal, and can be implemented by the PC or the mobile terminal itself, or by a cloud server (such as middleware). The webpage data processing method of the embodiment of the present invention is particularly effective when the number of advertisement filtering rules that can be supported on the mobile terminal is limited.

Preferably, after the uniform resource locator input in the browser is obtained, the browser webpage data processing method further comprises: transmitting the uniform resource locator to a segmenter; and segmenting the uniform resource locator in the segmenter to obtain a plurality of segment characters. In this case, matching the uniform resource locator by using the keywords of the advertisement filtering rules comprises: matching the plurality of segment characters one by one against the keywords in a keyword matcher. The segmenter is used to segment the uniform resource locator.

In the segmenter, segmentation may be performed according to a preset segmentation rule. The preset segmentation rule may include: first, segmenting the url using "/" as the separator, so that the first segment is the domain name segment and the remaining segments are path segments; then, further dividing the domain name segment using "." as the delimiter; finally, further dividing each non-domain-name segment at special characters, where the special characters can include ".", "_", "-", "?", ";", "=", ":", "/", "&", "+", etc. Segmenting the url according to predetermined rules further ensures the effectiveness of advertisement filtering.
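A minimal sketch of such a segmenter is given below. The function name `segment_url` is hypothetical, and stripping a leading scheme such as "http://" before splitting is an assumption, since the segmentation rule above does not discuss schemes.

```python
import re

# Special characters listed in the segmentation rule.
SPECIAL_CHARS = "._-?;=:/&+"

def segment_url(url):
    """Split a url into segment characters following the three-stage rule."""
    # Assumption: drop a scheme such as "http://" if one is present.
    rest = url.split("://", 1)[-1]
    # Stage 1: split on "/"; the first part is the domain name segment.
    parts = rest.split("/")
    domain, paths = parts[0], parts[1:]
    # Stage 2: split the domain name segment on ".".
    segments = [s for s in domain.split(".") if s]
    # Stage 3: split every path segment on the special characters.
    splitter = "[" + re.escape(SPECIAL_CHARS) + "]"
    for path in paths:
        segments.extend(s for s in re.split(splitter, path) if s)
    return segments

example = segment_url("http://ad.example.com/banner/ad_728x90.gif?id=12&t=1")
```

Empty strings produced by adjacent separators are discarded, so every returned segment character is non-empty.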

Figure 22 is a flowchart of a web page data processing method in accordance with a second embodiment of the present invention. The browser webpage data processing method of this embodiment may be a preferred embodiment of the browser webpage data processing method of the above embodiment. As shown in FIG. 22, the browser webpage data processing method includes the following steps:

Step S502 is the same as step S402 shown in FIG. 21, and details are not described herein.

Step S504, acquiring a keyword corresponding to the advertisement filtering rule.

Before the uniform resource locator is matched by using the preset keyword, the keyword in the keyword matcher may be initialized first. The specific initialization process may be: first obtaining a keyword corresponding to the advertisement filtering rule. For example, the keyword is extracted from the file of the advertisement filtering rule, so that after the url matches the keyword, the advertisement filtering rule corresponding to the keyword can be queried.

Step S506, a dictionary tree of keywords corresponding to the advertisement filtering rule is established.

To make lookups in the trie as fast as possible, the keywords can be stored in order, which improves search speed. A node in the trie may contain empty links (null pointers); an empty link indicates that the trie holds no keyword at the current position, so that a lookup can stop as early as possible.

Step S508, acquiring keywords in the dictionary tree.

Step S510, determining whether the uniform resource locator matches the keyword in the dictionary tree.

Determining whether the uniform resource locator matches a keyword in the dictionary tree means using the keyword dictionary tree to match the url. When the url segmenter passes a segment character to the keyword matcher of the advertisement filtering rules, the keyword matcher looks up the segment character in the trie to find whether it matches a keyword stored there, where a match may be an exact match or a partial match. An exact match means that the segment character is exactly the same as a keyword; a partial match means that a keyword is a prefix of the segment character. For example, if the keywords in the trie include "as", the lookup succeeds when the segment character is "as" (exact match) or "ask" (partial match). Once a keyword of an advertisement filtering rule is found in the trie, the corresponding advertisement filtering rules can be looked up by that keyword, and the rules found are used for advertisement filtering.
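The dictionary tree and its exact and partial (prefix) matching can be sketched as follows. The class and method names are hypothetical, and a missing dictionary key stands in for the empty link (null pointer) of a node; this is an illustrative sketch, not necessarily the data layout used by the embodiment.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # a missing key plays the role of an empty link
        self.keyword = None   # set when a keyword ends at this node

class KeywordTrie:
    def __init__(self, keywords=()):
        self.root = TrieNode()
        for keyword in keywords:
            self.insert(keyword)

    def insert(self, keyword):
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.keyword = keyword

    def match(self, segment):
        """Return every keyword that equals the segment or is a prefix of it."""
        hits = []
        node = self.root
        for ch in segment:
            node = node.children.get(ch)
            if node is None:          # empty link: no keyword continues this way
                break
            if node.keyword is not None:
                hits.append(node.keyword)
        return hits

# Hypothetical keywords for illustration.
trie = KeywordTrie(["as", "banner"])
```

With these keywords, `trie.match("as")` is an exact match and `trie.match("ask")` is a partial match, while `trie.match("a")` and `trie.match("bank")` find nothing because no stored keyword is a prefix of those segment characters.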

Steps S512 and S514 are the same as steps S406 and S408 shown in FIG. 21, and are not described again herein.

Preferably, acquiring a keyword corresponding to the advertisement filtering rule comprises: reading a file of the advertisement filtering rule; and extracting the keyword from a file of the advertisement filtering rule. Establishing a dictionary tree of keywords corresponding to the advertisement filtering rule includes: establishing a correspondence between the keyword and the advertisement filtering rule; and establishing the dictionary tree according to the extracted keyword.

Specifically, the file of the advertisement filtering rule may be read into the memory from the disk in the PC or the mobile terminal or the cloud server. Then extract the keywords from the ad filter rules file and create a correspondence between the keywords and the ad filter rules. The rules for extracting keywords from the files of the advertisement filtering rule may include:

1) A keyword does not include characters reserved by the adblock rule syntax, such as "@", "|", "*", etc.

2) A keyword does not include the options part of the advertisement filtering rule.

7) No keyword is extracted from rules that are regular expressions.

When no appropriate keyword can be extracted from an advertisement filtering rule, the rule is added to a global queue; a rule in the global queue is one that is not associated with any keyword. Every url must be matched against the advertisement filtering rules in the global queue. A check of the actual advertisement filtering rules in adblock shows that it is rare for no suitable keyword to be extractable from a rule: currently, among 11285 rules, there are no more than 20 from which no keyword can be extracted.
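The extraction rules stated above, together with the global queue, can be sketched as follows. Only the three stated rules are taken from the text; choosing the longest alphanumeric token of at least three characters as the keyword is purely an assumption of this sketch, as are the function names and the sample rules.

```python
import re

def is_regex_rule(text):
    """Adblock-style regular-expression rules are written as /.../ ."""
    return len(text) > 2 and text.startswith("/") and text.endswith("/")

def extract_keyword(rule, min_len=3):
    """Return a keyword for a rule, or None if none can be extracted."""
    body = rule.split("$", 1)[0]            # rule 2): ignore the options part
    if is_regex_rule(rule) or is_regex_rule(body):
        return None                          # rule 7): no keyword for regex rules
    # Rule 1): reserved characters such as "@", "|", "*" never enter a token.
    tokens = [t for t in re.findall(r"[a-z0-9]+", body.lower()) if len(t) >= min_len]
    if not tokens:
        return None
    return max(tokens, key=len)              # assumption: the longest token wins

def build_rule_table(rules):
    """Map keyword -> rules; rules without a keyword go to the global queue."""
    by_keyword, global_queue = {}, []
    for rule in rules:
        keyword = extract_keyword(rule)
        if keyword is None:
            global_queue.append(rule)
        else:
            by_keyword.setdefault(keyword, []).append(rule)
    return by_keyword, global_queue

# Hypothetical rules for illustration.
sample = ["||adserver.example^$third-party", "@@||example.com/banner*", "/ban[0-9]+/", "||*^"]
by_keyword, global_queue = build_rule_table(sample)
```

Rules that end up in `global_queue` have to be checked against every url, which is why keeping that queue short matters.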

Preferably, matching the uniform resource locator by using the preset keyword comprises: determining whether the uniform resource locator matches the preset keyword, wherein if it is determined that the uniform resource locator matches the preset keyword, the advertisement filtering rule corresponding to the keyword is converted into a regular expression. Matching the uniform resource locator that matched the keyword with the advertisement filtering rule corresponding to the keyword comprises: matching that uniform resource locator with the regular expression. If the uniform resource locator matches the regular expression, the advertisement filtering rule corresponding to the regular expression is output and used for advertisement filtering.
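The text does not specify how a filtering rule is converted into a regular expression. The sketch below assumes the commonly used adblock filter syntax ("*" as wildcard, "^" as separator, "|" as start or end anchor, "||" as domain anchor) and handles only that subset; the function name `rule_to_regex` is hypothetical.

```python
import re

def rule_to_regex(rule):
    """Convert a simplified adblock-style rule into a compiled regular expression."""
    body = rule.split("$", 1)[0]          # the options part is not part of the pattern
    prefix = suffix = ""
    if body.startswith("||"):             # domain anchor: scheme plus optional subdomains
        prefix = r"^[a-z][a-z0-9+.-]*://(?:[^/]+\.)?"
        body = body[2:]
    elif body.startswith("|"):            # anchor at the start of the url
        prefix = "^"
        body = body[1:]
    if body.endswith("|"):                # anchor at the end of the url
        suffix = "$"
        body = body[:-1]
    pieces = []
    for ch in body:
        if ch == "*":
            pieces.append(".*")               # wildcard
        elif ch == "^":
            pieces.append(r"(?:[^\w.%-]|$)")  # separator character or end of url
        else:
            pieces.append(re.escape(ch))      # literal character
    return re.compile(prefix + "".join(pieces) + suffix, re.IGNORECASE)

# Hypothetical rules for illustration.
domain_rule = rule_to_regex("||adserver.example^")
path_rule = rule_to_regex("/banner/*/img^")
```

For instance, `domain_rule.search("http://cdn.adserver.example/banner.gif")` finds a match, whereas `domain_rule.search("http://notadserver.example/")` returns `None` because the domain anchor only allows whole subdomain labels before the pattern.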

FIG. 23 is a flowchart of a preferred webpage data processing method in accordance with an embodiment of the present invention. As shown in FIG. 23, the browser webpage data processing method includes:

In step S601, a url is input in the browser.

In step S602, the url is input into the segmenter to segment the url. The url is segmented according to a predetermined rule in the segmenter; the segmentation characters obtained from all segments are saved.

The predetermined rule may be: first, segment the url using "/" as the separator, so that the first segment is the domain name segment and the remaining segments are path segments; then, further divide the domain name segment using "." as the separator; finally, further divide each non-domain-name segment at special characters, where the special characters can include ".", "_", "-", "?", ":", "=", ";", "&", "+", etc. Segmenting the url according to predetermined rules further ensures the effectiveness of advertisement filtering.

In step S603, the segmented url is input to the keyword matcher. The url segmenter in turn passes each segment character to the keyword matcher of the filter rule.

In step S604, it is determined, one segment character at a time, whether a keyword in the keyword matcher is hit. That is, in the keyword matcher of the filter rules, it is determined whether a keyword corresponding to a filter rule is hit. If there is no hit, step S605 is performed; if there is a hit, step S607 is performed.

In step S605, it is determined whether there are still segment characters not matched. If yes, go back to step S603; if no, go to step S606.

Step S606, returning False. This indicates that no filtering is required and the resource can be requested.

Step S607, the URL corresponding to the hit segment is passed to the rule matcher. Then step S608 is performed. The rule matcher stores a correspondence between the keyword and the filter rule.

In step S608, it is determined whether the URL hits the blacklist and does not hit the whitelist. Since the URLs that hit the blacklist also include some URLs that do not need to be filtered, the whitelist is set to match those URLs that do not need to be filtered. If the URL hits the blacklist and does not hit the whitelist, step S610 is performed. Otherwise, that is, if the URL misses the blacklist or hits the whitelist, step S609 is performed.

In step S609, False is returned.

Step S610, outputting a corresponding filtering rule.

In step S611, the advertisement filtering is performed by using the corresponding filtering rule.
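Steps S601 to S611 can be sketched end to end as follows. The data layout of `keyword_rules`, the function names, the one-pass splitting in `segment_url` and the use of prefix matching via `str.startswith` instead of a trie are simplifying assumptions. The sketch follows the flowchart literally in that the first keyword hit decides the result.

```python
import re

def segment_url(url):
    """Simplified segmenter: split on all special characters in one pass."""
    rest = url.split("://", 1)[-1]
    return [s for s in re.split(r"[._\-?;=:/&+]", rest) if s]

def find_filter_rule(url, keyword_rules):
    """Return the matching blacklist pattern, or False if the resource may be requested.

    keyword_rules maps a keyword to {"black": [regex, ...], "white": [regex, ...]}.
    """
    for segment in segment_url(url):                    # S602, S603
        for keyword, rules in keyword_rules.items():    # S604
            if not segment.startswith(keyword):
                continue
            # S607, S608: match the whole URL against the rules of the hit keyword.
            black = next((p for p in rules["black"] if re.search(p, url)), None)
            white = any(re.search(p, url) for p in rules["white"])
            if black is not None and not white:
                return black                            # S610: output the rule
            return False                                # S609
    return False                                        # S606: no keyword was hit

# Hypothetical rule table for illustration.
keyword_rules = {"ads": {"black": [r"/ads/"], "white": [r"/ads/allowed/"]}}
blocked = find_filter_rule("http://example.com/ads/banner.gif", keyword_rules)
allowed = find_filter_rule("http://example.com/ads/allowed/logo.gif", keyword_rules)
no_hit = find_filter_rule("http://example.com/news/index.html", keyword_rules)
```

A caller would skip the resource request when a rule is returned (S611) and request the resource when the result is `False`.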

Compared with the prior art advertisement filtering time, the following effects can be achieved by using the embodiments of the present invention:

Time spent on traditional ad filtering:

After using the invention, the advertisement filtering takes time:

It can be seen from the above table that the webpage data processing method proposed by the present invention can significantly reduce the time spent on advertisement filtering when accessing a webpage, and improve the user experience.

It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described action sequence, because certain steps may be performed in other sequences or concurrently in accordance with the present invention. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

In the above embodiments, the description of each embodiment has its own emphasis; for details not described in a certain embodiment, refer to the related descriptions of other embodiments.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the device embodiments described above are merely illustrative. The division of the units is only a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes a number of instructions to cause a computer device (which may be a personal computer, mobile terminal, server or network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes: a U disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, and the like.

The above description is only the preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes can be made to the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and scope of the present invention are intended to be included within the scope of the present invention.

The various embodiments in the specification are described in a progressive manner; the same or similar parts of the various embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since a device or system embodiment is substantially similar to the method embodiment, its description is relatively simple, and the relevant portions can be referred to the description of the method embodiment. The apparatus and system embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without any creative effort.

The above description is only a specific embodiment of the present application, so that those skilled in the art can understand or implement the present application. Various modifications to these embodiments are obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, the application is not limited to the embodiments shown herein, but is to be accorded the broadest scope of the principles and novel features disclosed herein.

Claims

A webpage data processing method, comprising:

Obtaining a webpage to be tested;

Matching the webpage to be tested with a preset matching condition to obtain a matching result, where the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes an area, provided with a first identifier, that is preset in a preset webpage corresponding to the webpage address of the webpage to be tested;

Determining, according to the matching result, a filtering situation of the webpage to be tested.
A web page data processing method according to claim 1, wherein

The method further includes: acquiring a preset webpage corresponding to the webpage address of the webpage to be tested;

The method further includes: before matching the webpage to be tested with the preset matching condition to obtain a matching result, setting a first identifier in the areas where actual content exists in the preset webpage and in the webpage to be tested, respectively;

Matching the webpage to be tested with the pre-set matching condition, and obtaining a matching result includes: determining whether the preset webpage matches an area in the webpage to be tested, where the first identifier is set,

Determining, according to the matching result, the filtering situation of the webpage to be tested includes: if the preset webpage matches the area in which the first identifier is set in the webpage to be tested, determining that the webpage to be tested does not have a filtering problem; otherwise, determining that the webpage to be tested has a filtering problem.
A web page data processing method according to claim 2, wherein

The first identifier is a preset color, and setting the first identifier in the areas where actual content exists in the preset webpage and in the webpage to be tested, respectively, includes: setting the background color of the areas where actual content exists in the preset webpage and in the webpage to be tested, respectively, to the preset color; when the actual content is text, setting the color of the text to the preset color; when the actual content is a picture, deleting the picture; or,

The first identifier is a border, and setting the first identifier in the areas where actual content exists in the preset webpage and in the webpage to be tested, respectively, includes: setting a border in the areas where actual content exists in the preset webpage and in the webpage to be tested, respectively; wherein the border coincides with the boundary of the area where the actual content exists.
The webpage data processing method according to claim 2, wherein determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes:

Calculating, respectively, a first total area of the area where the first identifier is set in the preset webpage, and a second total area of the area where the first identifier is set in the webpage to be tested;

Calculating a third ratio between the first total area and the second total area;

Determining whether the third ratio is within a preset range;

If the third ratio is within the preset range, determining that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determining that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
The webpage data processing method according to claim 4, wherein after determining that the webpage to be tested has a filtering problem, the method further comprises:

If the third ratio is less than the minimum value of the preset range, determining that the webpage to be tested has a filter failure;

If the third ratio is greater than the maximum value of the preset range, it is determined that the webpage to be tested has error filtering.
The webpage data processing method according to claim 3, wherein when the first identifier is a preset color, determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes:

Comparing whether the colors of the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested are the same;

Calculating a first ratio between the number of preset comparison points whose color comparison result is different and the total number of preset comparison points;

Determining whether the first ratio is smaller than a first preset ratio;

If the first ratio is smaller than the first preset ratio, determining that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determining that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
The webpage data processing method according to claim 6, wherein after determining that the webpage to be tested has a filtering problem, the method further includes:

Determining, in the webpage to be tested, whether the color of a first area corresponding to a preset comparison point whose color comparison result is different is the same as the preset color;

If the color of the first area is the same as the preset color, determining that the first area has a filter failure; otherwise, determining that the first area has false filtering.
The webpage data processing method according to claim 3, wherein when the first identifier is a border, determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes:

Calculating a second ratio between the area of the portion in which the bordered areas of the preset webpage and of the webpage to be tested do not overlap, and the total area of the bordered areas in the preset webpage;

Determining whether the second ratio is smaller than a second preset ratio;

If the second ratio is smaller than the second preset ratio, determining that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determining that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
The webpage data processing method according to claim 8, wherein after determining that the webpage to be tested has a filtering problem, the method further comprises:

When the preset webpage has no border corresponding to a first area in which the border is set in the webpage to be tested, determining that the first area has a filter failure;

When the preset webpage has a border corresponding to a second area in which no border is set in the webpage to be tested, determining that the second area has false filtering.
The webpage data processing method according to any one of claims 2 to 9, wherein before determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested, the webpage data processing method further includes:

Dividing the preset webpage and the webpage to be tested into a plurality of comparison areas in one-to-one correspondence;

Correspondingly, the determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes:

Determining, for each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, whether the areas in which the first identifier is set match.
The webpage data processing method according to claim 1, wherein

Obtaining a webpage to be tested includes: obtaining a uniform resource locator of the webpage to be tested,

Matching the webpage to be tested with a preset matching condition to obtain a matching result includes: matching the uniform resource locator by using a keyword of the advertisement filtering rule; and, if the uniform resource locator matches the keyword, matching the uniform resource locator with the advertisement filtering rule corresponding to the keyword,

Determining, according to the matching result, the filtering situation of the webpage to be tested includes: if the uniform resource locator matches an advertisement filtering rule corresponding to the keyword, performing advertisement filtering by using the advertisement filtering rule.
The web page data processing method according to claim 11, wherein

Before the matching of the uniform resource locator by using the keyword of the advertisement filtering rule, the method further includes: acquiring a keyword corresponding to the advertisement filtering rule; and establishing a dictionary tree of the keywords corresponding to the advertisement filtering rule;

The matching of the uniform resource locator by using the keyword of the advertisement filtering rule includes: acquiring the keywords in the dictionary tree; and determining whether the uniform resource locator matches a keyword in the dictionary tree.
The web page data processing method according to claim 12, characterized in that

The acquiring a keyword corresponding to the advertisement filtering rule includes: reading a file of the advertisement filtering rule; and extracting the keyword from a file of the advertisement filtering rule;

The dictionary tree for establishing a keyword corresponding to the advertisement filtering rule includes: establishing a correspondence between the keyword and the advertisement filtering rule; and establishing the dictionary tree according to the extracted keyword.
The web page data processing method according to claim 11, wherein

The matching of the uniform resource locator by using the keyword of the advertisement filtering rule includes: determining whether the uniform resource locator matches a keyword of the advertisement filtering rule, wherein if it is determined that the uniform resource locator matches the keyword of the advertisement filtering rule, the advertisement filtering rule corresponding to the keyword is converted into a regular expression;

Matching the uniform resource locator with the advertisement filtering rule corresponding to the keyword includes: matching the uniform resource locator with the regular expression;

If the uniform resource locator matches the regular expression, the advertisement filtering rule corresponding to the regular expression is output, and advertisement filtering is performed by using the output advertisement filtering rule.
The webpage data processing method according to claim 14, wherein after the obtaining the uniform resource locator of the webpage to be tested, the method further comprises:

Transmitting the uniform resource locator to the segmenter;

Segmenting the uniform resource locator in the segmenter to obtain a plurality of segment characters;

The matching the uniform resource locator by using the keyword of the advertisement filtering rule includes: matching the plurality of segment characters to keywords in the keyword matcher one by one.
A webpage data processing apparatus, comprising: a processor, wherein the processor is configured to execute the following program modules:

a webpage obtaining unit, configured to obtain a webpage to be tested;

a webpage matching unit, configured to match the webpage to be tested with a preset matching condition to obtain a matching result, where the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes an area, provided with a first identifier, that is preset in a preset webpage corresponding to the webpage address of the webpage to be tested;

a result determining unit, configured to determine, according to the matching result, a filtering situation of the webpage to be tested.
A web page data processing apparatus according to claim 16, wherein

The webpage obtaining unit is further configured to acquire a preset webpage corresponding to the webpage address of the webpage to be tested while acquiring the webpage to be tested;

The device further includes: a webpage marking unit, configured to respectively set a first identifier in an area where the actual content exists in the preset webpage and the webpage to be tested;

The webpage matching unit is further configured to determine whether the preset webpage matches an area in the webpage to be tested, where the first identifier is set;

The result determining unit is further configured to: when the preset webpage matches the area in which the first identifier is set in the webpage to be tested, determine that the webpage to be tested does not have a filtering problem; otherwise, determine that the webpage to be tested has a filtering problem.
A web page data processing apparatus according to claim 17, wherein:

The first identifier includes a preset color, and the webpage marking unit includes:

a background setting unit, configured to respectively set a background color of an area where the actual content exists in the preset webpage and the webpage to be tested as a preset color;

a word processing unit, configured to set a color of the text to be the preset color when the actual content in the preset webpage and/or the webpage to be tested is a text;

a picture processing unit, configured to delete the picture when the actual content in the preset webpage and/or the webpage to be tested is a picture; or

The first identifier includes a border, and the webpage marking unit includes:

a frame setting unit, configured to respectively set a border in an area where the actual content exists in the preset webpage and the webpage to be tested; wherein the border overlaps with a boundary of the area where the actual content exists.
The webpage data processing apparatus according to claim 17, wherein the webpage matching unit comprises:

an area calculating unit, configured to separately calculate a first total area of the area in which the first identifier is set in the preset webpage, and a second total area of the area in which the first identifier is set in the webpage to be tested;

a third calculating unit, configured to calculate a third ratio between the first total area and the second total area;

a third determining unit, configured to determine whether the third ratio is within a preset range; if the third ratio is within the preset range, to determine that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, to determine that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
The webpage data processing apparatus according to claim 19, further comprising:

a third sub-determination unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, compare the third ratio with the minimum value and the maximum value of the preset range; to determine that the webpage to be tested has a filter failure when the third ratio is less than the minimum value of the preset range; and to determine that the webpage to be tested has error filtering when the third ratio is greater than the maximum value of the preset range.
A web page data processing apparatus according to claim 18, wherein

When the first identifier is a preset color, the webpage matching unit includes:

a color comparison unit, configured to compare whether the color of the area corresponding to the same preset comparison point in the preset webpage and the webpage to be tested is the same;

a first calculating unit, configured to calculate a first ratio between the number of preset comparison points whose color comparison result is different and the total number of preset comparison points;

a first determining unit, configured to determine whether the first ratio is smaller than a first preset ratio; when the first ratio is greater than the first preset ratio, to determine that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested; otherwise, to determine that the preset webpage matches the area in which the first identifier is set in the webpage to be tested;

When the first identifier is a border, the webpage matching unit includes:

a second calculating unit, configured to calculate a second ratio between the area of the portion in which the polygon frames of the preset webpage and of the webpage to be tested do not overlap, and the total area of the polygon frames in the preset webpage;

a second determining unit, configured to determine, when the second ratio is not smaller than the second preset ratio, that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested; otherwise, to determine that the preset webpage matches the area in which the first identifier is set in the webpage to be tested.
A web page data processing apparatus according to claim 21, wherein

When the first identifier is a preset color, the webpage data processing apparatus further includes:

a first sub-determining unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, determine whether the color of a first area corresponding to a preset comparison point whose color comparison result is "different" is the same as the preset color, and, when the color of the first area is the same as the preset color, determine that the first area has a filtering failure, and otherwise determine that the first area has erroneous filtering;

When the first identifier is a border, the webpage data processing apparatus further includes:

The second sub-determining unit is configured to: after the result determining unit determines that the webpage to be tested has a filtering problem, perform the following determination:

If the preset webpage has no border in the area corresponding to a first area in which the border is disposed in the webpage to be tested, determining that the first area has a filtering failure;

If the preset webpage has the border in the area corresponding to a second area in which the border is not disposed in the webpage to be tested, determining that the second area has erroneous filtering.
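By way of non-limiting illustration, the color-based classification of this claim may be sketched as follows. The sketch assumes that the color examined for each differing comparison point is the color found in the webpage to be tested, which the claim does not state explicitly; the function name, the data representation and the string labels are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]
Color = Tuple[int, int, int]


def classify_differences(preset: Dict[Point, Color],
                         tested: Dict[Point, Color],
                         preset_color: Color) -> Dict[Point, str]:
    """Label every comparison point whose color differs between the pages."""
    labels: Dict[Point, str] = {}
    for point, expected in preset.items():
        actual = tested.get(point)
        if actual == expected:
            continue  # same color in both pages: nothing to classify
        # Assumption: the tested page's color decides the kind of problem.
        if actual == preset_color:
            labels[point] = "filtering failure"
        else:
            labels[point] = "erroneous filtering"
    return labels
```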
The webpage data processing apparatus according to any one of claims 17 to 22, further comprising:

a region dividing unit, configured to divide the preset webpage and the webpage to be tested, respectively, into a plurality of comparison regions in one-to-one correspondence;

Correspondingly, the webpage matching unit comprises:

a first sub-matching unit, configured to determine, for each pair of corresponding comparison regions of the preset webpage and the webpage to be tested, whether the areas provided with the first identifier in that pair match.
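By way of non-limiting illustration, the region division and per-region matching may be sketched as follows. The sketch assumes that each webpage is represented as a rectangular grid of booleans that are true where the first identifier is present, that both grids have the same size, and that a pair of regions matches only when the two grids are identical inside it; this exact-equality test is a simplification, and all names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom (right/bottom exclusive)
Mask = List[List[bool]]          # mask[y][x] is True where the identifier is set


def split_into_regions(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Cut a width x height page into rows x cols comparison regions."""
    boxes: List[Box] = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


def mismatched_regions(preset: Mask, tested: Mask, rows: int, cols: int) -> List[Box]:
    """Return the comparison regions in which the two masks differ."""
    height, width = len(preset), len(preset[0])
    bad: List[Box] = []
    # The same boxes are applied to both pages, giving one-to-one regions.
    for box in split_into_regions(width, height, rows, cols):
        left, top, right, bottom = box
        same = all(preset[y][left:right] == tested[y][left:right]
                   for y in range(top, bottom))
        if not same:
            bad.append(box)
    return bad
```

Dividing the pages in this way localizes a filtering problem to individual comparison regions instead of reporting it only for the page as a whole.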
A web page data processing apparatus according to claim 16, wherein

The webpage obtaining unit includes: a first acquiring unit, configured to acquire a uniform resource locator of the webpage to be tested,

The webpage matching unit includes: a first matching unit, configured to match the uniform resource locator against the keyword of the advertisement filtering rule; and a second matching unit, configured to, when the uniform resource locator matches the keyword, match the uniform resource locator against the advertisement filtering rule corresponding to the keyword;

The result determining unit includes: a filtering unit, configured to perform advertisement filtering by using the advertisement filtering rule when the uniform resource locator matches an advertisement filtering rule corresponding to the keyword.
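By way of non-limiting illustration, the two-stage matching of this claim (keyword first, rule second) may be sketched as follows. The application does not specify the syntax of an advertisement filtering rule; the sketch assumes glob-style rules evaluated with Python's `fnmatch.fnmatchcase`, and the function name and the sample rules are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional


def find_filter_rule(url: str,
                     rules_by_keyword: Dict[str, List[str]]) -> Optional[str]:
    """Return the first rule that matches the URL, or None if none applies."""
    for keyword, rules in rules_by_keyword.items():
        # Stage 1: cheap keyword test; rules of other keywords are skipped.
        if keyword not in url:
            continue
        # Stage 2: only rules associated with a matched keyword are evaluated.
        for rule in rules:
            if fnmatchcase(url, rule):  # assumed glob-style rule syntax
                return rule
    return None
```

The keyword stage acts only as a pre-filter: a uniform resource locator such as `http://example.com/roads/map.png` contains the keyword `ads` but is not matched by the assumed rule `*/ads/*`, so no filtering is performed for it.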
A web page data processing apparatus according to claim 24, wherein

The device further includes: a second acquiring unit, configured to acquire the keyword corresponding to the advertisement filtering rule before the uniform resource locator is matched by using the keyword of the advertisement filtering rule; and an establishing unit, configured to establish a dictionary tree of the keywords corresponding to the advertisement filtering rule;

The first matching unit includes: an obtaining module, configured to acquire a keyword in the dictionary tree; and a first determining module, configured to determine whether the uniform resource locator matches a keyword in the dictionary tree.
A web page data processing apparatus according to claim 25, wherein

The second acquiring unit includes: a reading module, configured to read a file of the advertisement filtering rule; and an extracting module, configured to extract the keyword from the file of the advertisement filtering rule;

The establishing unit includes: a first establishing module, configured to establish a correspondence between the keyword and the advertisement filtering rule; and a second establishing module, configured to establish the dictionary tree according to the extracted keyword.
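By way of non-limiting illustration, a dictionary tree (trie) that stores the correspondence between keywords and advertisement filtering rules may be sketched as follows. How keywords are extracted from the rule file is outside the sketch, so keyword and rule pairs are taken as given; the class names, method names and sample rules are hypothetical.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.rules: List[str] = []  # rules whose keyword ends at this node


class KeywordTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, keyword: str, rule: str) -> None:
        """Record the correspondence between a keyword and a rule."""
        if not keyword:
            raise ValueError("keyword must not be empty")
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.rules.append(rule)

    def candidate_rules(self, url: str) -> List[str]:
        """Collect the rules of every keyword that occurs anywhere in the URL."""
        found: List[str] = []
        for start in range(len(url)):
            node = self.root
            for ch in url[start:]:
                node = node.children.get(ch)
                if node is None:
                    break
                found.extend(node.rules)
        return found
```

Walking the tree from each position of the uniform resource locator tests all keywords together, so the cost of the keyword stage does not grow with one separate substring search per keyword.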
A web page data processing apparatus according to claim 24, wherein

The first matching unit includes: a second determining module, configured to determine whether the uniform resource locator matches a keyword of the advertisement filtering rule, wherein, if it is determined that the uniform resource locator matches the keyword of the advertisement filtering rule, the advertisement filtering rule corresponding to the keyword is converted into a regular expression;

The second matching unit includes: a first matching module, configured to match the uniform resource locator with the regular expression;

The filtering unit is further configured to: when the uniform resource locator matches the regular expression, output the advertisement filtering rule corresponding to the regular expression, and perform advertisement filtering by using the output advertisement filtering rule.
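By way of non-limiting illustration, the conversion of an advertisement filtering rule into a regular expression may be sketched as follows. The application does not define the rule syntax; the sketch assumes a syntax loosely modeled on common ad-blocking filter lists, in which `*` is a wildcard and `^` stands for a separator character or the end of the address. Both function names are hypothetical.

```python
import re
from typing import Optional


def rule_to_regex(rule: str) -> "re.Pattern":
    """Translate an assumed wildcard-style rule into a compiled regex."""
    parts = []
    for ch in rule:
        if ch == "*":
            parts.append(".*")                    # wildcard: any characters
        elif ch == "^":
            parts.append(r"(?:[^\w\-.%]|$)")      # separator or end of URL
        else:
            parts.append(re.escape(ch))           # everything else is literal
    return re.compile("".join(parts))


def filter_by_rule(url: str, keyword: str, rule: str) -> Optional[str]:
    """Output the rule when the URL matches its keyword and its regex."""
    if keyword not in url:
        return None   # keyword stage failed: the rule is never converted
    if rule_to_regex(rule).search(url):
        return rule   # the output rule is the one used for filtering
    return None
```

Converting a rule only after its keyword has matched keeps regular expression compilation and evaluation off the path taken by most addresses.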
The webpage data processing apparatus according to claim 27, wherein the apparatus further comprises:

an incoming unit, configured to pass the uniform resource locator into a segmenter after the uniform resource locator input in the browser is acquired;

a segmentation unit, configured to segment the uniform resource locator in the segmenter to obtain a plurality of segment characters;

The first matching unit includes: a second matching module, configured to match the plurality of segment characters, one by one, against keywords in a keyword matcher.
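By way of non-limiting illustration, the segmentation of the uniform resource locator and the one-by-one keyword matching may be sketched as follows. The sketch assumes that the segmenter splits on every run of characters other than ASCII letters and digits and that the keyword matcher is a plain set lookup; neither choice is taken from the application, and the function names are hypothetical.

```python
import re
from typing import List, Set


def segment_url(url: str) -> List[str]:
    """Split a URL into segments on runs of non-alphanumeric characters."""
    return [seg for seg in re.split(r"[^0-9A-Za-z]+", url) if seg]


def matched_keywords(url: str, keywords: Set[str]) -> List[str]:
    """Match each segment, one by one, against the known keywords."""
    return [seg for seg in segment_url(url) if seg in keywords]
```

Under these assumptions a keyword is matched only as a whole segment, so the address `http://example.com/roads/map.png` does not match the keyword `ads`.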
A computer readable medium having processor-executable program code for use in a web page data processing apparatus, wherein the program code causes the processor to perform the steps of:

Acquiring a webpage to be tested;

Matching the webpage to be tested with a preset matching condition to obtain a matching result, where the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes an area, provided with a preset first identifier, in a preset webpage corresponding to the webpage address of the webpage to be tested;

Determining, according to the matching result, a filtering situation of the webpage to be tested.
A computer readable medium according to claim 29, wherein

While acquiring the webpage to be tested, the program code further causes the processor to acquire a preset webpage corresponding to the webpage address of the webpage to be tested,

The program code further causes the processor to set the first identifier in the areas where actual content exists in the preset webpage and the webpage to be tested, before the webpage to be tested is matched with the preset matching condition to obtain a matching result;

Matching the webpage to be tested with the preset matching condition to obtain a matching result includes: determining whether the areas provided with the first identifier in the preset webpage and the webpage to be tested match;

Determining, according to the matching result, the filtering situation of the webpage to be tested includes: if the areas provided with the first identifier in the preset webpage and the webpage to be tested match, determining that the webpage to be tested does not have a filtering problem; otherwise, determining that the webpage to be tested has a filtering problem.
A computer program for performing the web page data processing method according to any one of claims 1 to 15.