WO2015165245A1 - Webpage data processing method and device - Google Patents
Webpage data processing method and device
- Publication number
- WO2015165245A1 (PCT/CN2014/090841)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- webpage
- tested
- preset
- area
- keyword
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Definitions
- The present invention relates to the field of mobile communication technologies, and in particular to a webpage data processing method and apparatus.
- Website operators usually place data from certain businesses, such as advertisements, on their webpages to obtain sponsorship from those merchants, thereby ensuring the normal operation and profitability of the website. For the user, however, such embedded data is non-valid content, and its presence causes considerable inconvenience. For example, when browsing a new webpage, the user first has to distinguish non-valid content such as advertisements from valid content; or the advertisement content occludes the valid content in the corresponding webpage area, making it difficult for the user to obtain the valid content.
- most browsers have a filtering function to filter out non-valid content embedded in webpages, such as filtering advertisements.
- The filtering principle is generally as follows: a corresponding filtering rule is formulated according to features of the webpage to be filtered, such as its layout style and framework code; the rule identifies non-valid content (such as an advertisement) in the webpage, and the browser either blocks the loading of that content or hides it in the page so that it is not displayed.
- Manual detection is used to determine whether a webpage has a filtering problem, which can ensure the accuracy of the detection results.
- However, manual detection cannot guarantee that every filtering problem is detected in time, and its detection efficiency is extremely low.
- The ad filter plugin adblock is widely used.
- Its basic principle is to set a series of filtering rules. Before the browser sends a request for a web resource, it checks whether the resource's Uniform Resource Locator (URL) hits a filtering rule. If a rule is hit, the requested resource is determined to be an advertisement, and the browser does not need to request it.
- adblock provides more than 20,000 filtering rules.
- The current browser advertisement filtering method is as follows: when a user enters a URL in the browser, the URL is matched against the filtering rules one by one; if a filtering rule is matched, true is returned (indicating that advertisement filtering is required), otherwise false is returned (indicating that no advertisement filtering is required). Because a large number of advertisement filtering rules are set in the browser, every network request is matched against them one by one, so advertisement filtering incurs a large performance overhead and each filtering operation takes a long time.
- The embodiment of the invention provides a webpage data processing method and device, which solves the problem that detection of webpage filtering problems is untimely and inefficient, so that filtering problems are detected quickly and effectively.
- An advertisement filtering method includes: acquiring a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage having an area in which a first identifier is preset; and determining the filtering condition of the webpage to be tested according to the matching result.
- the method further includes: acquiring the preset webpage corresponding to the webpage address of the webpage to be tested, and matching the webpage to be tested with a preset matching condition to obtain a matching result,
- the method further includes: setting a first identifier in an area where the actual content exists in the preset webpage and the webpage to be tested, and matching the webpage to be tested with a pre-set matching condition, and obtaining a matching result includes: determining the preset webpage.
- Acquiring the webpage to be tested includes: acquiring the uniform resource locator of the webpage to be tested. Matching the webpage to be tested against the preset matching condition to obtain the matching result includes: matching the uniform resource locator against the keyword of the advertisement filtering rule; and, if the uniform resource locator matches the keyword, matching the uniform resource locator against the advertisement filtering rule corresponding to the keyword. Determining the filtering condition of the webpage to be tested according to the matching result includes: if the uniform resource locator matches the advertisement filtering rule corresponding to the keyword, performing advertisement filtering using that rule.
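- The two-stage matching described above (keyword first, then the rule tied to that keyword) can be sketched as follows. This is a minimal illustration, not the patented implementation: the rule table `RULES_BY_KEYWORD`, the function name `match_ad_rule`, the example URLs, and the use of regular expressions as the rule format are all assumptions made for the example.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rule table: each keyword maps to the rules that contain it.
RULES_BY_KEYWORD: Dict[str, List[str]] = {
    "adserver": [r"//adserver\.example\.com/", r"/adserver/banner"],
    "popup": [r"/popup\.js$"],
}


def match_ad_rule(url: str,
                  rules_by_keyword: Dict[str, List[str]] = RULES_BY_KEYWORD
                  ) -> Optional[str]:
    """Return the first rule the URL hits, or None if no rule is hit."""
    for keyword, patterns in rules_by_keyword.items():
        # Stage 1: cheap substring test against the keyword only.
        if keyword not in url:
            continue
        # Stage 2: only URLs containing the keyword are tested
        # against the (more expensive) rules tied to that keyword.
        for pattern in patterns:
            if re.search(pattern, url):
                return pattern
    return None


# A URL that hits a rule would be filtered; a None result means no filtering.
print(match_ad_rule("http://adserver.example.com/x.png"))
print(match_ad_rule("http://news.example.com/index.html"))
```

- Under these assumptions, a URL that contains no keyword never reaches the rule comparison, which is where the saving over one-by-one rule matching would come from.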
- A webpage data processing apparatus includes a processor configured to execute the following program modules: a webpage obtaining unit, configured to acquire a webpage to be tested; a webpage matching unit, configured to match the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage having an area in which a first identifier is preset; and a result determining unit, configured to determine the filtering condition of the webpage to be tested according to the matching result.
- The webpage obtaining unit is further configured to acquire the preset webpage corresponding to the webpage address of the webpage to be tested. The device further includes a webpage marking unit, configured to set the first identifier in the areas where actual content exists in the preset webpage and in the webpage to be tested respectively. The webpage matching unit is further configured to determine whether the area in which the first identifier is set in the preset webpage matches that in the webpage to be tested. The result determining unit is further configured to determine that the webpage to be tested has no filtering problem when those areas match, and otherwise to determine that the webpage to be tested has a filtering problem.
- The webpage obtaining unit includes a first acquiring unit, configured to acquire the uniform resource locator of the webpage to be tested. The webpage matching unit includes: a first matching unit, configured to match the uniform resource locator against the keyword of the advertisement filtering rule; and a second matching unit, configured to match the uniform resource locator against the advertisement filtering rule corresponding to the keyword when the uniform resource locator matches the keyword. The result determining unit includes a filtering unit, configured to perform advertisement filtering using the advertisement filtering rule when the uniform resource locator matches the advertisement filtering rule corresponding to the keyword.
- A computer readable medium has program code executable by a processor for use in a webpage data processing apparatus, the program code causing the processor to perform the following steps: acquiring a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage having an area in which a first identifier is preset; and determining the filtering condition of the webpage to be tested according to the matching result.
- The embodiment of the present invention obtains a matching result by matching the webpage to be tested against a preset matching condition, wherein the matching condition includes a keyword of an advertisement filtering rule and the advertisement filtering rule corresponding to the keyword, or the matching condition includes a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage having an area in which a first identifier is preset; and determines the filtering condition of the webpage to be tested according to the matching result. Therefore, compared with manual detection, the embodiment can detect webpage filtering problems (such as filtering failure and false filtering) quickly and in time, improves detection efficiency, and is particularly suitable where the number of webpages to be tested is huge.
- FIG. 1 is a schematic flowchart of a webpage data processing method according to an embodiment of the present invention
- FIG. 2 is a flowchart of a method for implementing step S13 in FIG. 1 according to an embodiment of the present invention
- FIG. 3 is a flowchart of a method for determining a type of filtering problem based on the method shown in FIG. 2 according to an embodiment of the present invention
- FIG. 4(a) is a schematic diagram of a preset webpage processed by the embodiment of the present invention.
- FIG. 4(b) is a schematic diagram of a webpage to be tested processed by the embodiment of the present invention.
- FIG. 4(c) is a schematic diagram of another webpage to be tested processed by the embodiment of the present invention.
- FIG. 4(d) is a schematic diagram of another webpage to be tested processed by the embodiment of the present invention.
- FIG. 4(e) is a schematic diagram of another webpage to be tested processed by the embodiment of the present invention.
- FIG. 5 is a schematic flowchart diagram of another webpage data processing method according to an embodiment of the present invention.
- FIG. 6(a) is a schematic diagram of a webpage not processed by an embodiment of the present invention.
- FIG. 6(b) is a schematic diagram of the webpage shown in FIG. 6(a) after step S22 shown in FIG. 5 is performed on it;
- Figure 6 (c) is a schematic diagram of further processing the actual content in the web page shown in Figure 6 (b);
- FIG. 7 is a flowchart of a method for implementing step S23 in FIG. 5 according to an embodiment of the present invention.
- FIG. 8 is a schematic diagram of preset comparison points in the embodiment shown in FIG. 7;
- FIG. 9 is a flowchart of another method for implementing step S23 in FIG. 5 according to an embodiment of the present invention.
- FIG. 10 is a flowchart of a method for implementing steps S341-S342 of FIG. 9 based on webpage interlaced scanning according to an embodiment of the present invention
- FIG. 11 is a schematic flowchart diagram of another webpage data processing method according to an embodiment of the present invention.
- FIG. 12 is a schematic diagram of a webpage with a border as a first identifier according to an embodiment of the present invention.
- FIG. 13 is a flowchart of a method for implementing step S33 in FIG. 11 according to an embodiment of the present invention.
- FIG. 14 is a schematic flowchart diagram of another webpage data processing method according to an embodiment of the present invention.
- FIG. 15 is a schematic diagram of a partitioning result of a preset webpage and a webpage to be tested according to an embodiment of the present invention
- FIG. 16 is a schematic structural diagram of a webpage data processing apparatus according to an embodiment of the present invention.
- FIG. 17 is a schematic structural diagram of another webpage data processing apparatus according to an embodiment of the present invention.
- FIG. 18 is a schematic diagram of a webpage data processing apparatus according to a first embodiment of the present invention.
- FIG. 19 is a schematic diagram of a webpage data processing apparatus according to a second embodiment of the present invention.
- FIG. 20 is a schematic diagram of a web page data processing apparatus according to a third embodiment of the present invention.
- FIG. 21 is a flowchart of a web page data processing method according to a first embodiment of the present invention.
- FIG. 22 is a flowchart of a web page data processing method according to a second embodiment of the present invention.
- FIG. 23 is a flow chart of a preferred web page data processing method in accordance with an embodiment of the present invention.
- the embodiment of the invention provides a webpage data processing method and device, which solves the problem that the detection of the webpage filtering problem is not timely and the efficiency is low.
- FIG. 1 is a flowchart of a method for processing webpage data according to an embodiment of the present invention.
- a webpage data processing method provided by an embodiment of the present invention includes the following steps:
- S11 Obtain a webpage to be tested, and a preset webpage corresponding to the webpage address of the webpage to be tested;
- The preset webpage and the webpage to be tested are two webpages corresponding to the same webpage address at different times. The preset webpage may be the webpage corresponding to that address at a certain historical moment; that is, it is the webpage under normal filtering, with no false filtering or filtering failure.
- the above actual content includes both valid content and non-valid content such as advertisements.
- the area where the first identifier is set on the preset webpage is an aspect of the matching condition, and matching the webpage to be tested with the matching condition includes the determining manner of the following step S13.
- the matching condition may further include a keyword of the advertisement filtering rule and an advertisement filtering rule corresponding to the keyword, which will be described later.
- step S13 determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested, if yes, step S14 is performed, otherwise step S15 is performed;
- The embodiment of the present invention obtains the preset webpage and the webpage to be tested corresponding to the same webpage address, sets the first identifier in the areas where actual content exists in the preset webpage and in the webpage to be tested respectively, determines whether the area in which the first identifier is set in the webpage to be tested matches the area in which the first identifier is set in the preset webpage, and determines according to the result whether the webpage to be tested has a filtering problem;
- the embodiment can quickly and timely detect the webpage filtering problem (such as the problem of false filtering or filtering failure), and improve the detection efficiency, and
- the preset webpage and the webpage to be tested processed in step S12 may be stored as a picture format, and the determining step described in S13 is performed on the preset webpage and the webpage to be tested.
- the preset webpage and the webpage to be tested may not be imaged, but the determining step described in S13 may be implemented directly according to the result processed through step S12.
- Matching between the area in which the first identifier is set in the webpage to be tested and the area in which the first identifier is set in the preset webpage means the following: if the first identifier exists in a certain area of the preset webpage, the corresponding area in the webpage to be tested should also have the first identifier; and if a certain area of the preset webpage does not have the first identifier, the corresponding area in the webpage to be tested should not have it either.
- FIG. 2 illustrates a viable implementation.
- determining whether an area in which a first identifier is set in a webpage to be tested matches an area in which a first identifier is set in a preset webpage includes the following steps:
- step S333 determining whether the third ratio is within a preset range, if yes, executing step S334, otherwise performing step S335;
- S334 Determine that the preset webpage matches an area in the webpage to be tested that is provided with the first identifier.
- Ideally, the first total area should be equal to the second total area, that is, the third ratio should be 1, and the preset range should be the single threshold value 1. However, to allow for calculation error and to avoid the workload caused by frequently modifying the filtering rules, it may be specified that the preset webpage is considered to match the area in which the first identifier is set in the webpage to be tested as long as the third ratio lies within a preset range centered on 1.
- The maximum and minimum values of the preset range may be determined according to actual detection requirements.
- For example, the preset range can be set to [0.75, 1.35]; where higher detection accuracy is required, it can be set to [0.95, 1.05].
- the specific values of the above-mentioned preset ranges are only one possible implementation manner based on the principles of the present invention, and should not be construed as limiting the scope of the present invention.
- step S631 determining whether the third ratio is less than the minimum value of the preset range, if yes, proceeding to step S632, otherwise performing step S633;
- S633 Determine whether the third ratio is greater than a maximum value of the preset range, and if yes, determine that the webpage to be tested has error filtering.
- In the two example preset ranges listed in the above embodiment, [0.75, 1.35] and [0.95, 1.05], the maximum and minimum values are set roughly symmetrically about 1. Alternatively, the maximum and minimum values of the preset range can be set separately according to the detection precision required for each of the two types of filtering problem; for example, if high precision is required for detecting filtering failure and lower precision suffices for detecting false filtering, a larger minimum and a larger maximum can be set, such as [0.95, 1.35].
- FIG. 4(a) is a schematic diagram of a preset webpage processed by step S12, and four regions are provided with the first identifier, which are labeled as A1, B1, C1, and D1 in FIG. 4(a), respectively.
- Scenario 1: if the webpage to be tested processed in step S12 is as shown in FIG. 4(b), four areas in the webpage to be tested are also provided with the first identifier, labeled A2, B2, C2, and D2, and A1 and A2, B1 and B2, C1 and C2, and D1 and D2 match respectively. The areas of A2, B2, C2, and D2 are 2, 1, 1, and 1.5 respectively; then the total area of the area in which the first identifier is set in the webpage to be tested shown in FIG.
- The calculated third ratio is not within the preset range, so it is determined that the webpage has a filtering problem. Further, since 1.375 > 1.35, that is, the third ratio is greater than the maximum value of the preset range, it can be determined that the webpage to be tested shown in FIG. 4(c) has false filtering, which is consistent with the result obtained by directly comparing FIG. 4(a) and FIG. 4(c).
- The third ratio is within the preset range, and it can be determined that the webpage to be tested does not have a filtering problem.
- The calculated third ratio is not 1, that is, the preset webpage of FIG. 4(a) does not completely match the webpage to be tested of FIG. 4(d); but because the difference is small, if the detection accuracy requirement is not high, it can be considered that the webpage to be tested in FIG. 4(d) has no filtering problem.
- The calculated third ratio is not within the preset range, so it is determined that the webpage has a filtering problem. Further, since 0.69 < 0.75, that is, the third ratio is smaller than the minimum value of the preset range, it can be determined that the webpage to be tested shown in FIG. 4(e) has a filtering failure, which is consistent with the result obtained by directly comparing FIG. 4(a) and FIG. 4(e).
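- The range test applied in these scenarios can be sketched as below. It is a hedged illustration only: it assumes the third ratio is the marked area of the preset webpage divided by the marked area of the webpage to be tested (the direction implied by the 1.375 and 0.69 examples), and the function name and the area value 8.0 are hypothetical.

```python
from typing import Tuple


def classify_by_third_ratio(preset_area: float, tested_area: float,
                            low: float = 0.75,
                            high: float = 1.35) -> Tuple[float, str]:
    """Classify a tested page from the ratio of marked areas.

    Assumption: ratio = preset marked area / tested marked area.
    """
    if tested_area <= 0:
        raise ValueError("tested_area must be positive")
    ratio = preset_area / tested_area
    if ratio < low:
        # Tested page shows more content than expected: ads got through.
        return ratio, "filtering failure"
    if ratio > high:
        # Tested page shows less content than expected: too much removed.
        return ratio, "false filtering"
    return ratio, "no filtering problem"


# Marked areas 5.5 (preset) and 4.0 (tested) give the ratio 1.375 used above.
print(classify_by_third_ratio(5.5, 4.0))
# 8.0 is a made-up tested area that yields a ratio below 0.75.
print(classify_by_third_ratio(5.5, 8.0))
```

- Narrowing `low` and `high` toward 1 corresponds to the stricter range [0.95, 1.05] mentioned earlier.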
- Alternatively, the area difference between the two, and a fourth ratio of that difference to the first total area (or the second total area), may be calculated. If the absolute value of the fourth ratio is less than a preset threshold, it is determined that the webpage to be tested has no filtering problem; otherwise, it has a filtering problem. Further, if the absolute value of the fourth ratio is not less than (that is, greater than or equal to) the preset threshold and the fourth ratio is less than zero, it is determined that the webpage to be tested has a filtering failure; if the absolute value of the fourth ratio is not less than the preset threshold and the fourth ratio is greater than zero, it is determined that the webpage to be tested has false filtering.
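- A minimal sketch of this difference-based variant follows. It assumes the difference is taken as preset marked area minus tested marked area and is divided by the preset marked area, which is the direction consistent with the sign rule in the text; the threshold 0.25 and the function name are hypothetical.

```python
def classify_by_fourth_ratio(preset_area: float, tested_area: float,
                             threshold: float = 0.25) -> str:
    """Classify a tested page from the signed relative area difference."""
    if preset_area <= 0:
        raise ValueError("preset_area must be positive")
    # Assumption: difference = preset - tested, normalised by the preset area.
    fourth_ratio = (preset_area - tested_area) / preset_area
    if abs(fourth_ratio) < threshold:
        return "no filtering problem"
    # Negative: the tested page has more marked area than the preset page.
    return "filtering failure" if fourth_ratio < 0 else "false filtering"


print(classify_by_fourth_ratio(5.5, 4.0))   # difference is about +0.27
print(classify_by_fourth_ratio(5.5, 8.0))   # difference is about -0.45
```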
- FIG. 5 is a flowchart of a method for processing webpage data according to another embodiment of the present invention.
- the webpage data processing method described in this embodiment includes the following steps:
- step S23 determining whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested; if yes, step S24 is performed, otherwise step S25 is performed;
- the embodiment shown in FIG. 5 uses the preset color as the first identifier, and is used to mark an area in the webpage where the actual content exists.
- The actual content in the two webpages may also be processed as follows: when the actual content is text, the color of the text is also set to the above preset color; when the actual content is a picture, the picture is deleted.
- Taking black as the preset color, performing step S22 on the webpage shown in FIG. 6(a) turns the background color of the areas where actual content exists black, giving the webpage shown in FIG. 6(b);
- FIG. 6(b) if the color of the text in the webpage is different from the preset color (black), the actual color of the area obtained by superimposing the color of the text and the background color of the corresponding area is also the preset color (black).
- The picture will completely cover the background color of the area, so the actual color of the area can only appear as the colors in the picture, which is inconvenient for color comparison; therefore, the embodiment of the present invention is shown in FIG.
- The processing result shown in FIG. 6(c) is obtained by deleting the picture content in the webpage and setting the color of the text in the webpage to the preset color (black), which is the same as the background color. It can be seen from FIG. 6(c) that the areas where actual content exists in the final processed webpage are uniformly displayed as pure black blocks, which facilitates the execution of the subsequent steps.
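- The marking described above (preset background color on areas with actual content, text recolored, pictures deleted) can be sketched on a simplified page model. The `Region` class, its `kind` values, and `mark_actual_content` are hypothetical names introduced for illustration; a real page would be a DOM tree rather than a flat list.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

PRESET_COLOR = "black"  # the preset color used in the example of FIG. 6


@dataclass(frozen=True)
class Region:
    kind: str                      # "text", "image", or "empty"
    background: str = "white"
    text_color: Optional[str] = None


def mark_actual_content(regions: List[Region]) -> List[Region]:
    """Return a copy of the page in which areas with content are solid blocks."""
    marked = []
    for region in regions:
        if region.kind == "text":
            # Text: background and text both take the preset color.
            marked.append(replace(region, background=PRESET_COLOR,
                                  text_color=PRESET_COLOR))
        elif region.kind == "image":
            # Picture: deleted, leaving only the preset background color.
            marked.append(Region(kind="deleted-image",
                                 background=PRESET_COLOR))
        else:
            # No actual content: left unchanged.
            marked.append(region)
    return marked


page = [Region("text", "white", "blue"), Region("image"), Region("empty")]
for region in mark_actual_content(page):
    print(region)
```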
- The method shown in FIG. 2 may be used to determine whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested; that is, calculate the total area M1 of the areas whose background color is the preset color in the preset webpage and the total area M2 of the areas whose background color is the preset color in the webpage to be tested, and calculate the ratio M1/M2. If M1/M2 is within the preset range, it is determined that the areas whose background color is the preset color match between the preset webpage and the webpage to be tested; otherwise, it is determined that those areas do not match and that there is a filtering problem.
- the type of the filtering problem may be further determined by the method shown in FIG. 3.
- Step S23 may determine whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested by the process shown in FIG. 7:
- the preset comparison point refers to a pixel point in the webpage whose coordinates are preset coordinate values.
- An xy coordinate system can be established with the upper left corner of the webpage as the origin, the horizontal rightward direction as the x-axis direction, and the vertical downward direction as the y-axis direction; the pixel point P1 with coordinates (3, 2) can be used as a preset comparison point, and the pixel point P2 with coordinates (8, 4) can also be used as a preset comparison point.
- step S311 compares the colors of each pair of corresponding areas. If the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested have the same color, the two areas match; that is, either both contain valid content or neither does.
- the total number of preset comparison points should not be too small, and the specific values can be set according to actual application requirements.
- S312 Calculate a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points;
- step S313 determining whether the first ratio is smaller than the first preset ratio, if the first ratio is less than the first preset ratio, step S314 is performed, otherwise step S315 is performed;
- S314 Determine that the preset webpage matches an area of the webpage to be tested whose background color is the preset color.
- S315 Determine that the preset webpage does not match an area in the webpage to be tested whose background color is the preset color.
- The first preset ratio may be set according to the detection precision requirement (the maximum allowed ratio of the unmatched area between the preset webpage and the webpage to be tested to the entire webpage). When the first ratio is greater than the first preset ratio, the unmatched area between the preset webpage and the webpage to be tested is too large a share of the page, so it may be determined that the webpage to be tested has a filtering problem; conversely, it may be determined that the webpage to be tested has no filtering problem.
- If the color of the first area is the same as the preset color, it is determined that the first area has a filtering failure problem; otherwise, it is determined that the first area has a false filtering problem.
- For example, suppose the color comparison result of the preset comparison point P1 (3, 2) is "different", that is, the color of the pixel with coordinates (3, 2) in the webpage to be tested differs from the color of the pixel with coordinates (3, 2) in the preset webpage.
- If the pixel in the webpage to be tested has the preset color, the corresponding pixel in the preset webpage does not; that is, actual content is absent from the corresponding area of the preset webpage but present in the corresponding area of the webpage to be tested. It can therefore be determined that the webpage to be tested has a filtering failure in the area corresponding to the preset comparison point.
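- The comparison-point procedure of steps S311 to S315, together with the per-point classification just described, can be sketched as follows. Pages are modeled as nested lists of color names indexed as `page[y][x]`; this representation, the function names, and the sample grids are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Page = Sequence[Sequence[str]]   # page[y][x] is the color at pixel (x, y)
Point = Tuple[int, int]


def mismatched_points(preset: Page, tested: Page,
                      points: Sequence[Point]) -> List[Point]:
    """Preset comparison points whose colors differ between the two pages."""
    return [(x, y) for (x, y) in points if preset[y][x] != tested[y][x]]


def pages_match(preset: Page, tested: Page, points: Sequence[Point],
                first_preset_ratio: float) -> bool:
    """True when the share of differing points is below the allowed ratio."""
    if not points:
        raise ValueError("at least one preset comparison point is required")
    first_ratio = len(mismatched_points(preset, tested, points)) / len(points)
    return first_ratio < first_preset_ratio


def classify_point(tested: Page, point: Point,
                   preset_color: str = "black") -> str:
    """Classify a differing point by its color in the tested page."""
    x, y = point
    if tested[y][x] == preset_color:
        return "filtering failure"   # content present only in the tested page
    return "false filtering"         # content missing from the tested page


preset = [["black", "white"], ["white", "white"]]
tested = [["black", "black"], ["white", "white"]]
points = [(0, 0), (1, 0), (0, 1), (1, 1)]
for p in mismatched_points(preset, tested, points):
    print(p, classify_point(tested, p))
```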
- The determination in step S23 of whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested may also be implemented by the method flow shown in FIG. 9; referring to FIG. 9, the method includes the following steps:
- step S345 it is determined whether M1/M2 is within a preset range, if yes, step S345 is performed, otherwise step S346 is performed;
- Ideally, the preset range should be set to the single value 1.
- In practice, the preset range may be set to a numerical interval including "1"; the higher the detection accuracy requirement, the larger the minimum value of the preset range and the smaller the maximum value.
- If M1 > M2, it is determined that the webpage to be tested has false filtering; if M1 < M2, it is determined that the webpage to be tested has a filtering failure.
- The webpage interlaced scanning method shown in FIG. 10 is performed on the webpage to be tested and on the preset webpage respectively, thereby realizing steps S341 to S342 shown in FIG. 9.
- the method includes the following steps:
- S1 Set the scanning parameters with the upper left corner of the webpage to be scanned as the coordinate origin, including: the abscissa X (initial value 0), the ordinate Y (initial value 0), the horizontal scanning step ΔW, the vertical scanning step ΔH, the width W of the webpage, and the height H of the webpage;
- step S2 determining whether the color of the preset comparison point whose coordinates are (X, Y) is the same as the preset color, if yes, executing step S3, otherwise performing step S4;
- step S4 Record the comparison result corresponding to the preset comparison point (X, Y) as 0, and perform step S5;
- step S6 determining whether the ordinate Y is greater than H, if yes, proceeding to step S7, otherwise returning to step S2;
- step S8 determining whether the abscissa X is greater than W, if yes, proceeding to step S9, otherwise returning to step S2;
- the scan point is the preset comparison point
- The total number of scan points, that is, the number of preset comparison points, can be adjusted simply and flexibly by adjusting the horizontal scanning step ΔW and/or the vertical scanning step ΔH. At the same time, the color of the area corresponding to each preset comparison point is automatically compared with the preset color during scanning, which improves processing efficiency.
- the comparison result in the method shown in FIG. 10 may be stored by means of a digital matrix.
- If the abscissa X takes a total of 20 values and the ordinate Y takes a total of five values, a matrix of 5 rows and 20 columns is obtained, as shown below:
- each number corresponds to one scanning point, that is, a preset comparison point.
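- The scan of FIG. 10 and the resulting 0/1 matrix can be sketched as below. This is an illustrative reading of the partially listed steps, not the exact flow: the page is represented by a hypothetical `color_at(x, y)` callback, scan points are assumed to lie strictly inside the page, and the outer loop runs over X with the inner loop over Y, following the order of the Y > H and X > W checks.

```python
from typing import Callable, List


def interlaced_scan(color_at: Callable[[int, int], str],
                    width: int, height: int,
                    step_w: int, step_h: int,
                    preset_color: str = "black") -> List[List[int]]:
    """Sample the page on a grid and record 1 where the preset color is found.

    The result has one row per sampled Y and one column per sampled X.
    """
    if step_w <= 0 or step_h <= 0:
        raise ValueError("scan steps must be positive")
    xs = range(0, width, step_w)
    ys = range(0, height, step_h)
    matrix = [[0] * len(xs) for _ in ys]
    for col, x in enumerate(xs):          # advance X by step_w
        for row, y in enumerate(ys):      # advance Y by step_h
            matrix[row][col] = 1 if color_at(x, y) == preset_color else 0
    return matrix


def marked_point_count(matrix: List[List[int]]) -> int:
    """Number of scan points that fell on the preset color."""
    return sum(sum(row) for row in matrix)


# Hypothetical 100 x 50 page whose upper-left 40 x 20 block holds content.
def sample_page(x: int, y: int) -> str:
    return "black" if x < 40 and y < 20 else "white"


grid = interlaced_scan(sample_page, width=100, height=50, step_w=5, step_h=10)
for row in grid:
    print("".join(str(v) for v in row))
print(marked_point_count(grid))
```

- With these parameters X takes 20 values and Y takes five, so the matrix has 5 rows and 20 columns. Counting the ones for the preset webpage and for the webpage to be tested gives two quantities proportional to M1 and M2 when both pages are scanned with the same steps.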
- FIG. 11 is a flowchart of a method for processing webpage data according to another embodiment of the present invention.
- the webpage data processing method described in this embodiment includes the following steps:
- S31 Obtain a webpage to be tested, and a preset webpage corresponding to the webpage address of the webpage to be tested;
- FIG. 12 is a schematic diagram of a webpage after a border is set around the “column” area in the webpage shown in FIG. 6(a); it should be noted that the border used in the embodiment of the present invention is not limited to the dashed box shown as an example in FIG. 12.
- step S33 determining whether the preset webpage and the area of the webpage to be tested are matched with the border, if yes, step S34 is performed, otherwise step S35 is performed;
- the embodiment shown in FIG. 11 uses a border as the first identifier, and is used to mark an area in the webpage where the actual content exists.
- in the foregoing step S33, determining whether the preset webpage matches the area in which the border is set in the webpage to be tested may be implemented by using the method shown in FIG.
- S321 Calculate a second ratio between the area of the portion in which the bordered area of the preset webpage and the bordered area of the webpage to be tested do not overlap, and the total area of the bordered area of the preset webpage;
- step S322 determining whether the second ratio is less than the second preset ratio, if yes, proceeding to step S323, otherwise performing step S324;
- S324 Determine that the preset webpage does not match an area in the webpage to be tested that is provided with the border.
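- A minimal sketch of the second-ratio check in steps S321 to S324, assuming each bordered area is represented as a set of unit grid cells and that the "portion that does not overlap" means the symmetric difference of the two areas. The names `rect_cells`, `second_ratio` and `borders_match`, and the default threshold, are hypothetical.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]


def rect_cells(x: int, y: int, w: int, h: int) -> Set[Cell]:
    """Unit cells covered by a rectangle with top-left corner (x, y)."""
    return {(i, j) for i in range(x, x + w) for j in range(y, y + h)}


def second_ratio(preset_area: Set[Cell], tested_area: Set[Cell]) -> float:
    """Non-overlapping area divided by the bordered area of the preset page."""
    if not preset_area:
        raise ValueError("the preset webpage has no bordered area")
    # Assumption: cells covered by exactly one of the two pages do not overlap.
    return len(preset_area ^ tested_area) / len(preset_area)


def borders_match(preset_area: Set[Cell], tested_area: Set[Cell],
                  second_preset_ratio: float = 0.05) -> bool:
    """S322: the areas match when the ratio is below the preset ratio."""
    return second_ratio(preset_area, tested_area) < second_preset_ratio
```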
- the specific form of the first identifier used to mark the area where the actual content exists in the webpage according to the embodiment of the present invention is not limited to the preset color in the embodiment shown in FIG. 5, and FIG.
- FIG. 14 is a flowchart of a method for processing webpage data according to another possible embodiment of the present invention, including the following steps:
- the preset webpage and the webpage to be tested are each divided into a plurality of comparison areas in one-to-one correspondence;
- for example, the preset webpage is divided into four comparison areas: Q1, Q2, Q3, and Q4;
- the webpage to be tested is likewise divided into four areas: Z1 corresponding to Q1, Z2 corresponding to Q2, Z3 corresponding to Q3, and Z4 corresponding to Q4.
- step S44 determining, for each pair of corresponding comparison areas in the preset webpage and the webpage to be tested, whether the areas in which the first identifier is set match; if yes, step S45 is performed, otherwise step S46 is performed;
- compared with comparing the entire webpage, this scheme may reduce the detection error.
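- The region-wise comparison can be illustrated with the sketch below. It assumes each page has already been reduced to a non-empty matrix of 0/1 marks, that the four comparison areas are the four quadrants of that matrix, and that a pair of areas matches when the share of differing marks is below a threshold. The quadrant layout, the function names and the threshold are assumptions and are not taken from the figures.

```python
from typing import Dict, List

Matrix = List[List[int]]


def quadrants(m: Matrix) -> Dict[str, Matrix]:
    """Split a matrix into four comparison areas (assumed layout)."""
    r, c = len(m) // 2, len(m[0]) // 2
    return {
        "Q1": [row[:c] for row in m[:r]],   # top-left
        "Q2": [row[c:] for row in m[:r]],   # top-right
        "Q3": [row[:c] for row in m[r:]],   # bottom-left
        "Q4": [row[c:] for row in m[r:]],   # bottom-right
    }


def mismatch_ratio(a: Matrix, b: Matrix) -> float:
    """Share of positions at which two equally sized matrices differ."""
    total = sum(len(row) for row in a)
    diff = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / total


def compare_by_region(preset: Matrix, tested: Matrix,
                      threshold: float = 0.1) -> Dict[str, bool]:
    """Return, for each comparison area, whether the two pages match."""
    p, t = quadrants(preset), quadrants(tested)
    return {name: mismatch_ratio(p[name], t[name]) < threshold for name in p}
```

- In a 4 x 4 example where only the top-left quadrant differs, the whole-page mismatch ratio is 0.25 and would pass a threshold of 0.3, while the per-region check still reports a mismatch in Q1.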
- the present invention can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
- based on such understanding, the part of the technical solution of the present invention that is essential or that contributes to the prior art may be embodied in the form of a software product stored in a storage medium, including a plurality of instructions for causing a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
- the foregoing storage medium includes various types of media that can store program codes, such as a read only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
- the present invention further provides a webpage data processing apparatus.
- FIG. 16 is a schematic structural diagram of a webpage data processing apparatus according to a possible embodiment of the present invention.
- the webpage data processing apparatus includes a webpage obtaining unit 810, a webpage marking unit 820, a webpage matching unit 830, and a result determining unit 840.
- the webpage obtaining unit 810 is configured to obtain a webpage to be tested and a preset webpage corresponding to the webpage address of the webpage to be tested.
- the webpage marking unit 820 is configured to set a first identifier in an area where the actual content exists in the preset webpage and the webpage to be tested.
- the webpage matching unit 830 is configured to determine whether the preset webpage matches an area in the webpage to be tested in which the first identifier is disposed.
- the result determining unit 840 is configured to determine that the webpage to be tested does not have a filtering problem when the preset webpage matches the area in which the first identifier is set in the webpage to be tested, and otherwise to determine that the webpage to be tested has a filtering problem.
- the preset webpage corresponding to the same webpage address and the webpage to be tested are obtained, and the first identifier is set in the area where the actual content exists in the preset webpage and the webpage to be tested, respectively.
- the webpage matching unit 830 may include:
- an area calculating unit configured to separately calculate a first total area of the area in which the first identifier is set in the preset webpage, and a second total area of the area in which the first identifier is set in the webpage to be tested;
- a third calculating unit configured to calculate a third ratio between the first total area and the second total area;
- a third determining unit configured to determine whether the third ratio is within a preset range; if the third ratio is within the preset range, determine that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determine that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
- the webpage data processing apparatus may further include: a third sub-determining unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, compare the third ratio with the minimum value and the maximum value of the preset range; when the third ratio is less than the minimum value of the preset range, determine that the webpage to be tested has filtering failure; when the third ratio is greater than the maximum value of the preset range, determine that the webpage to be tested has error filtering.
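- The third-ratio check and the follow-up classification can be sketched as below, assuming the third ratio is the first total area (preset webpage) divided by the second total area (webpage to be tested) and that both areas are positive. The function name and the bounds of the preset range are hypothetical.

```python
def classify_by_area(first_total_area: float, second_total_area: float,
                     range_min: float = 0.95, range_max: float = 1.05) -> str:
    """Classify a tested page from the ratio of marked areas.

    Returns "match", "filtering failure" or "error filtering".
    """
    if second_total_area <= 0:
        raise ValueError("second total area must be positive in this sketch")
    third_ratio = first_total_area / second_total_area
    if third_ratio < range_min:
        # The tested page shows more marked content than the preset page.
        return "filtering failure"
    if third_ratio > range_max:
        # The tested page shows less marked content than the preset page.
        return "error filtering"
    return "match"
```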
- the webpage marking unit 820 may include:
- a background setting unit configured to respectively set a background color of an area where the actual content exists in the preset webpage and the webpage to be tested as a preset color;
- a word processing unit configured to set a color of the text to be the preset color when the actual content in the preset webpage and/or the webpage to be tested is a text;
- a picture processing unit configured to delete the picture when the actual content in the preset webpage and/or the webpage to be tested is a picture.
- the webpage matching unit 830 may include:
- a color comparison unit configured to compare whether the color of the area corresponding to the same preset comparison point in the preset webpage and the webpage to be tested is the same;
- a first calculating unit configured to calculate a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points;
- a first determining unit configured to determine whether the first ratio is smaller than a first preset ratio, and when the first ratio is greater than the first preset ratio, determine that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested; otherwise, determine that the preset webpage matches the area in which the first identifier is set in the webpage to be tested.
- the webpage data processing apparatus may further include: a first sub-determining unit, configured to determine, after the result determining unit determines that the webpage to be tested has a filtering problem, whether the color of a first area in the webpage to be tested, corresponding to a preset comparison point whose color comparison result is "different", is the same as the preset color; when the color of the first area is the same as the preset color, determine that filtering failure exists in the first area; otherwise, determine that false filtering exists in the first area.
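- The color comparison, the first ratio and the per-point classification can be sketched together as follows. The sketch assumes each page is given as a mapping from preset comparison point to the color sampled there, with the same points in both mappings; `compare_points` and the default first preset ratio are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]
Color = Tuple[int, int, int]


def compare_points(preset_page: Dict[Point, Color],
                   tested_page: Dict[Point, Color],
                   preset_color: Color,
                   first_preset_ratio: float = 0.05):
    """Return (matched, problems) for two pages sampled at the same points."""
    # Points whose color comparison result is "different".
    differing = [p for p in preset_page if preset_page[p] != tested_page[p]]
    first_ratio = len(differing) / len(preset_page)
    matched = first_ratio < first_preset_ratio
    problems: Dict[Point, str] = {}
    if not matched:
        for p in differing:
            # The tested page still shows marked content at this point:
            # something that should have been filtered was not.
            if tested_page[p] == preset_color:
                problems[p] = "filtering failure"
            else:
                problems[p] = "false filtering"
    return matched, problems
```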
- the webpage marking unit 820 may include:
- the webpage matching unit 830 can include:
- a second calculating unit configured to calculate a second ratio between the area of the portion in which the polygon frames in the preset webpage and in the webpage to be tested do not overlap, and the total area of the polygon frame in the preset webpage;
- a second determining unit configured to determine, when the second ratio is smaller than the second preset ratio, that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determine that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
- the webpage data processing apparatus may further include: a second sub-determining unit, configured to: after the result determining unit determines that the webpage to be tested has a filtering problem, perform the following determination:
- if, in the preset webpage, the border is not set in the area corresponding to a first area of the webpage to be tested in which the border is set, determining that the first area has filtering failure; if, in the preset webpage, the border is set in the area corresponding to a second area of the webpage to be tested in which the border is not set, determining that the second area has error filtering.
- the webpage matching unit 830 may directly use the entire webpage to determine whether the pages match.
- the webpage data processing apparatus may further include: an area dividing unit, configured to divide the preset webpage and the webpage to be tested, respectively, into a plurality of corresponding comparison areas; correspondingly, the webpage matching unit 830 includes: a first sub-matching unit, configured to determine, for each pair of corresponding comparison areas in the preset webpage and the webpage to be tested, whether the areas in which the first identifier is set match.
- the present invention provides a computer readable medium having program code executable by a processor, which, when executed, causes the processor to perform the steps of:
- when the preset webpage matches the area in which the first identifier is set in the webpage to be tested, it is determined that the webpage to be tested does not have a filtering problem; otherwise, it is determined that the webpage to be tested has a filtering problem.
- determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: separately calculating a first total area of the area in which the first identifier is set in the preset webpage, and a second total area of the area in which the first identifier is set in the webpage to be tested; calculating a third ratio between the first total area and the second total area; determining whether the third ratio is within a preset range; if the third ratio is within the preset range, determining that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determining that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
- the following step may be performed: if the third ratio is less than the minimum value of the preset range, determining that the webpage to be tested has filtering failure; if the third ratio is greater than the maximum value of the preset range, determining that the webpage to be tested has error filtering.
- setting the first identifier in the area where the actual content exists in the preset webpage and the webpage to be tested, respectively, includes: setting the background color of the area where the actual content exists in the preset webpage and in the webpage to be tested to a preset color; when the actual content is text, setting the color of the text to the preset color; when the actual content is a picture, deleting the picture.
- determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: comparing whether the colors of the areas corresponding to the same preset comparison point in the preset webpage and the webpage to be tested are the same; calculating a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points; determining whether the first ratio is smaller than a first preset ratio; if the first ratio is smaller than the first preset ratio, determining that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determining that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
- the following step may be performed: determining whether the color of a first area in the webpage to be tested, corresponding to a preset comparison point whose color comparison result is "different", is the same as the preset color; if the color of the first area is the same as the preset color, determining that the first area has a filtering failure problem; otherwise, determining that the first area has a false filtering problem.
- setting the first identifier in the area where the actual content exists in the preset webpage and the webpage to be tested, respectively, includes: setting a border around the area where the actual content exists in the preset webpage and in the webpage to be tested, respectively; wherein the border coincides with the boundary of the area where the actual content exists.
- determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: calculating a second ratio between the area of the portion in which the bordered area of the preset webpage and the bordered area of the webpage to be tested do not overlap, and the total area of the bordered area of the preset webpage; determining whether the second ratio is smaller than the second preset ratio; if the second ratio is smaller than the second preset ratio, determining that the preset webpage matches the area in which the first identifier is set in the webpage to be tested; otherwise, determining that the preset webpage does not match the area in which the first identifier is set in the webpage to be tested.
- the following step may be performed: when, in the preset webpage, the border is not set in the area corresponding to a first area of the webpage to be tested in which the border is set, determining that the first area has filtering failure; when, in the preset webpage, the border is set in the area corresponding to a second area of the webpage to be tested in which the border is not set, determining that the second area has false filtering.
- determining whether the preset webpage matches the area in which the first identifier is set in the webpage to be tested includes: determining, for each pair of corresponding comparison areas in the preset webpage and the webpage to be tested, whether the areas in which the first identifier is set match.
- a webpage data processing apparatus includes a processor 101 and a computer readable medium 102.
- the computer readable medium 102 stores program code that can be executed by the processor 101, and the processor 101 reads the program code within the computer readable medium 102 to implement the steps or unit functions described above.
- non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash.
- Volatile memory can include random access memory (RAM), which can act as external cache memory.
- RAM can be obtained in a variety of forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM) and direct Rambus RAM (DRRAM).
- Storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
- Figure 18 is a diagram showing a web page data processing apparatus according to a first embodiment of the present invention.
- the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40.
- the first obtaining unit 10 is configured to obtain a uniform resource locator of the webpage to be tested.
- the browser can be a personal computer (PC) browser, or a browser on the mobile terminal.
- the user can input a Uniform Resource Locator (URL) in the browser. The url is obtained in order to determine whether ad filtering is required.
- the first matching unit 20 is configured to match the uniform resource locators by using keywords of the advertisement filtering rule.
- the uniform resource locator can be matched by using the keyword of the advertisement filtering rule.
- the url may be segmented first, for example, by passing the url into the segmenter, which segments the url according to a predetermined rule to obtain a plurality of segment characters. Then the segment characters are passed into the keyword matcher and matched against the preset keywords in the keyword matcher, and each segment character is judged one by one to determine whether it hits a keyword in the keyword matcher.
- a keyword may correspond to multiple advertisement filtering rules, so that when a keyword matches the url, only the advertisement filtering rules corresponding to that keyword need to be matched against the url, and there is no need to match every advertisement filtering rule.
- the second matching unit 30 is configured to match the uniform resource locator with the advertisement filtering rule corresponding to the keyword when the uniform resource locator matches the keyword.
- when the uniform resource locator matches the keyword, the uniform resource locator is matched against the advertisement filtering rules corresponding to the keyword, rather than against all the advertisement filtering rules.
- the uniform resource locator is matched with the advertisement filtering rule corresponding to the keyword, wherein the uniform resource locator may be a uniform resource locator matching the keyword.
- the segment character of the url that matches the keyword may be passed into the rule matcher, where the rule matcher holds the correspondence between keywords and advertisement filtering rules.
- when matching the segment character of the url against the advertisement filtering rules in the rule matcher, the segment character may first be matched against the advertisement filtering rules of the whitelist, and then against the advertisement filtering rules of the blacklist, wherein the whitelist is a list of advertisement filtering rules under which resources matching a rule are not filtered, and the blacklist is a list of advertisement filtering rules under which resources matching a rule are filtered.
- if an advertisement filtering rule of the whitelist is matched, the resource of the url corresponding to the segment character may be requested; if an advertisement filtering rule of the blacklist is matched, the resource of the url corresponding to the segment character does not need to be requested. If none match, the next segment character can be matched in the same way.
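- The whitelist-first, blacklist-second order can be sketched as below. The sketch assumes the candidate rules for one segment character are already available as regular expression sources; the `Rule` type, its fields and the `decide` function are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    pattern: str             # regular expression source (assumed rule format)
    whitelist: bool = False  # True for a whitelist rule, False for blacklist


def decide(url: str, rules: List[Rule]) -> Optional[bool]:
    """Return True to request the resource, False to skip it,
    or None when no rule matches and the next segment should be tried."""
    for want_whitelist in (True, False):   # whitelist pass, then blacklist
        for rule in rules:
            if rule.whitelist == want_whitelist and re.search(rule.pattern, url):
                return want_whitelist
    return None
```

- Checking the whitelist in a separate first pass means an exception rule wins over a blocking rule regardless of the order in which the rules are stored.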
- the matching of the url in the rule matcher may first convert the corresponding advertisement filtering rule of the matched keyword into a regular expression, and then use the interface of the regular expression to query the advertisement filtering rule, so as to determine whether the url is related to the advertisement filtering rule. match.
- the filtering unit 40 is configured to perform advertisement filtering by using an advertisement filtering rule when the uniform resource locator matches the advertisement filtering rule corresponding to the keyword.
- the matching advertisement filtering rule may be output, and this advertisement filtering rule is used for advertisement filtering. That is, if it is determined that the resource requested by the url is an advertisement, the browser does not need to request the resource.
- the url is first matched by the keywords of the advertisement filtering rules, and then the url matching a keyword is matched against the advertisement filtering rules corresponding to that keyword, which avoids matching the url against all the advertisement filtering rules one by one and reduces the number of advertisement filtering rules to be matched. This solves the problem that each advertisement filtering takes a long time because of the large number of filtering rules, ensures effective filtering of advertisements, and achieves the effect of reducing the advertisement filtering time.
- for example, with 20,000 advertisement filtering rules, existing advertisement filtering needs to match the url against all 20,000 rules one by one; if an advertisement filtering rule is matched, advertisement filtering is performed.
- by contrast, here the url is first matched against the keywords of the advertisement filtering rules. If the matched keyword A corresponds to 100 advertisement filtering rules, the url only needs to be matched against those 100 rules, which greatly reduces the matching time.
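- The reduction in the number of rules to evaluate can be sketched with a keyword-to-rules mapping. This assumes the url has already been split into segment characters and uses exact keyword lookup only; `candidate_rules` and the example rule names are hypothetical.

```python
from typing import Dict, List


def candidate_rules(segments: List[str],
                    rules_by_keyword: Dict[str, List[str]]) -> List[str]:
    """Collect only the rules whose keyword is hit by some url segment."""
    candidates: List[str] = []
    for segment in segments:
        # A segment that is not a keyword contributes no rules at all.
        candidates.extend(rules_by_keyword.get(segment, []))
    return candidates
```

- Only the returned candidates are then matched against the url, instead of every rule in the rule set.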
- the webpage data processing apparatus may be used for advertisement filtering of a PC browser, or may be used on a browser on a mobile terminal, and may implement its function through a PC or a mobile terminal itself, or may be through a cloud server. (such as middleware) to achieve its function.
- the webpage data processing apparatus of the embodiment of the present invention can produce a better effect when the number of advertisement filtering rules that can be supported on the mobile terminal is limited.
- the webpage data processing apparatus includes an incoming unit and a segmentation unit.
- the incoming unit is configured to pass the uniform resource locator to the segmenter after obtaining the uniform resource locator of the web page to be tested in the browser.
- the segmentation unit is configured to segment the uniform resource locator in the segmenter to obtain a plurality of segmentation characters.
- the first matching unit includes a second matching module, and the second matching module is configured to match the plurality of segment characters to the keywords in the keyword matcher one by one.
- a segmenter is used to segment the uniform resource locator.
- segmentation may be performed according to a preset segmentation rule.
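- A minimal sketch of a url segmenter. The preset segmentation rule is not specified in the text, so splitting on every run of non-alphanumeric characters and lowercasing the url are assumptions made here for illustration; `segment_url` is a hypothetical name.

```python
import re
from typing import List


def segment_url(url: str) -> List[str]:
    """Split a url into segment characters (assumed rule: break on any
    run of characters that are not ASCII letters or digits)."""
    return [s for s in re.split(r"[^A-Za-z0-9]+", url.lower()) if s]
```

- Each returned segment can then be passed to the keyword matcher one by one.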
- FIG. 19 is a diagram showing a web page data processing apparatus in accordance with a second embodiment of the present invention. This embodiment can be taken as a preferred embodiment of the above embodiment.
- the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40.
- the webpage data processing apparatus further includes a second obtaining unit 50 and an establishing unit 60.
- the first matching unit 20 includes an obtaining module 201 and a first judging module 202.
- the second obtaining unit 50 is configured to acquire a keyword corresponding to the advertisement filtering rule before the uniform resource locator is matched by using the keyword of the advertisement filtering rule.
- the keywords in the keyword matcher may be initialized before the url is matched by using the keywords of the advertisement filtering rules.
- the specific initialization process may be: first obtaining a keyword corresponding to the advertisement filter rule. For example, the keyword is extracted from the file of the advertisement filtering rule, so that after the url matches the keyword, the advertisement filtering rule corresponding to the keyword can be queried.
- the establishing unit 60 is configured to establish a dictionary tree of keywords corresponding to the advertisement filtering rules.
- the dictionary tree (trie) is a prefix-based query structure.
- the basic idea is to record the prefix information of all keywords in the tree, so the number of comparisons can be greatly reduced when querying. This method is especially useful when the number of keywords is large.
- the keywords are organized by establishing a dictionary tree of the keywords corresponding to the advertisement filtering rules, and the trie is used to further reduce the time consumed by advertisement filtering.
- the keywords can be stored in order to improve the speed of the search.
- a node in the trie may contain empty links (null pointers), which indicate that no keyword exists at the current position in the trie, so that a lookup can finish as quickly as possible.
- the obtaining module 201 is configured to acquire keywords in the dictionary tree.
- matching the url with the keyword may first obtain the keywords in the dictionary tree to match the url with the keywords in the dictionary tree.
- the first judging module 202 is configured to determine whether the uniform resource locator matches a keyword in the dictionary tree.
- the keyword matcher of the advertisement filtering rules looks up, in the trie, the segment characters passed in from the url segmenter to perform keyword matching, where a match includes an exact match and a partial match.
- an exact match means that the segment character is exactly the same as a keyword, and a partial match means that a keyword is a prefix of the segment character. For example, when searching for a keyword in the trie, if the keyword "as" exists, a successful match can be returned when the segment character is "as" or "ask".
- once the keyword of an advertisement filtering rule is found in the trie, the corresponding advertisement filtering rule can be found by using the keyword, and the found advertisement filtering rule is used for advertisement filtering.
- by using the dictionary tree of keywords to match the url against the keywords, the time consumed in matching the url against the keywords is reduced, thereby further reducing the advertisement filtering time.
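- A minimal trie sketch supporting the exact and partial (keyword is a prefix of the segment) matches described above. The class and method names are hypothetical, and returning the shortest matching keyword is a simplification chosen for this sketch.

```python
from typing import Dict, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.keyword: Optional[str] = None   # set when a keyword ends here


class KeywordTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, keyword: str) -> None:
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.keyword = keyword

    def match(self, segment: str) -> Optional[str]:
        """Return a keyword equal to the segment or a prefix of it."""
        node = self.root
        for ch in segment:
            nxt = node.children.get(ch)
            if nxt is None:        # empty link: no keyword continues here
                return None
            node = nxt
            if node.keyword is not None:
                return node.keyword
        return None
```

- With the keyword "as" stored, both "as" and "ask" match, while "a" does not, because no keyword ends at that node.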
- the second acquisition unit 50 includes a reading module and an extraction module.
- the reading module is used to read the files of the advertisement filtering rules.
- the extraction module is used to extract keywords from the files of the advertisement filtering rules.
- the establishing unit 60 includes a first establishing module and a second establishing module.
- the first establishing module is used to establish a correspondence between keywords and advertisement filtering rules.
- the second establishing module is configured to build a dictionary tree based on the extracted keywords.
- the file of the advertisement filtering rule may be read into the memory from the disk in the PC or the mobile terminal or the cloud server. Then extract the keywords from the file of the ad filter rule and establish the corresponding relationship between the keyword and the ad filter rule.
- the rules for extracting keywords from the files of the advertisement filtering rule may include:
- the character length of the keyword is greater than or equal to 3 and less than 32.
- the keyword (Key) extraction process includes: traversing the character string in the advertisement filtering rule file until the first character in the above-mentioned extraction rule set is found, which is recorded as the start position of the keyword, and continuing to traverse until the end of the string, or until the next character in the above extraction rule set, which is recorded as the end position.
- the characters between the start position and the end position are used as a candidate keyword. It is checked whether the candidate keyword satisfies the above-mentioned extraction conditions 4), 5), and 6), and if so, the candidate is returned as the final keyword.
- when an appropriate keyword cannot be extracted from an advertisement filtering rule, the rule is added to a global queue; the advertisement filtering rules in the global queue are those that have no associated keyword. Every url needs to be matched against the advertisement filtering rules in the global queue. A check of the actual advertisement filtering rules in adblock shows that it is rare for no suitable keyword to be extractable from a rule: currently, among 11285 rules, no more than 20 are rules from which a keyword cannot be extracted.
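- The split between keyword-indexed rules and the global queue can be sketched as follows. Most of the extraction rules are not reproduced in the text, so the extractor is passed in by the caller; only the length condition (at least 3 and fewer than 32 characters) comes from the description. `build_index` and `toy_extract` are hypothetical, and `toy_extract` is a stand-in rather than the extraction procedure described above.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def build_index(rules: List[str],
                extract: Callable[[str], Optional[str]]
                ) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group rules by extracted keyword; rules without one go to a queue."""
    by_keyword: Dict[str, List[str]] = {}
    global_queue: List[str] = []
    for rule in rules:
        keyword = extract(rule)
        if keyword is not None and 3 <= len(keyword) < 32:
            by_keyword.setdefault(keyword, []).append(rule)
        else:
            # No usable keyword: every url must be checked against this rule.
            global_queue.append(rule)
    return by_keyword, global_queue


def toy_extract(rule: str) -> Optional[str]:
    """Stand-in extractor: the longest run of lowercase letters or digits."""
    runs = re.findall(r"[a-z0-9]+", rule)
    return max(runs, key=len) if runs else None
```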
- FIG. 20 is a diagram showing a web page data processing apparatus in accordance with a third embodiment of the present invention. This embodiment can be taken as a preferred embodiment of the above embodiment.
- the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40.
- the first matching unit 20 includes a second judging module 203;
- the second matching unit 30 includes a first matching module 301.
- the second judging module 203 is configured to determine whether the uniform resource locator matches the keyword of the advertisement filtering rule, and if it is determined that the uniform resource locator matches the keyword of the advertisement filtering rule, the advertisement filtering rule corresponding to the keyword is converted into a regular expression.
- the first matching module 301 is configured to match the uniform resource locator with the regular expression.
- the filtering unit 40 is further configured to: when the uniform resource locator matched by the keyword matches the regular expression, output an advertisement filter rule corresponding to the regular expression, and output the 'advertising filter rule corresponding to the regular expression' Ad filtering.
- To match the url in the rule matcher, the advertisement filtering rule corresponding to the matched keyword may first be converted into a regular expression; the regular expression interface is then used to query the advertisement filtering rule and determine whether the url matches it.
- The embodiment of the present invention converts the advertisement filtering rule corresponding to a keyword into a regular expression only once the url is determined to match that keyword, so it is not necessary to convert all advertisement filtering rules into regular expressions when advertisement filtering starts.
- Converting advertisement filtering rules into regular expressions takes time; for example, converting all rules at startup takes about 1.5 seconds in a mobile terminal browser. Because the number of advertisement filtering rules corresponding to each keyword is small (usually no more than 2, and at most 10), the on-demand conversion time is short. If parsing 10,000 advertisement filtering rules takes 1.5 s, the average parsing time per rule is 0.15 ms, so the matching time increases by at most 1.5 ms. The embodiment of the present invention may also cache the parsing result of an advertisement filtering rule after the rule is hit for the first time, so that later hits incur no parsing overhead, which further reduces the time consumed.
- The embodiment of the invention further provides a webpage data processing method. It should be noted that the method may be performed by the webpage data processing apparatus provided by the embodiment of the present invention, and that the apparatus may in turn be used to perform the method.
- FIG. 21 is a flow chart of a web page data processing method according to a first embodiment of the present invention. As shown in FIG. 21, the browser webpage data processing method includes the following steps:
- Step S402: obtain the uniform resource locator input in the browser.
- The browser can be a browser on a personal computer (PC) or a browser on a mobile terminal.
- The user can input the Uniform Resource Locator (URL) of the webpage to be tested in the browser. The url is obtained in order to determine whether advertisement filtering is required.
- Step S404: match the uniform resource locator by using the keywords of the advertisement filtering rules.
- The uniform resource locator can be matched by using the keywords of the advertisement filtering rules.
- The url may first be segmented, for example by passing it into a segmenter in which a predetermined rule splits the url into a plurality of segment characters. The segment characters are then passed into the keyword matcher, which compares them with its preset keywords and judges, one by one, whether each segment character hits a keyword.
- A preset keyword may correspond to multiple advertisement filtering rules, so when a keyword matches the url, only the advertisement filtering rules corresponding to that keyword need to be matched against the url, rather than every advertisement filtering rule.
- Step S406: if the uniform resource locator matches a keyword, match the uniform resource locator against the advertisement filtering rules corresponding to that keyword.
- Once the uniform resource locator matches a keyword, the advertisement filtering rules corresponding to that keyword can be obtained, so that the uniform resource locator is matched only against those rules and need not be matched against all advertisement filtering rules.
- The uniform resource locator that is matched against the advertisement filtering rules corresponding to the keyword may be the uniform resource locator that matched the keyword.
- The segment character of the url that matched the keyword may be passed into the rule matcher, which holds the correspondence between keywords and advertisement filtering rules. In the rule matcher, the segment character may first be matched against the whitelist advertisement filtering rules and then against the blacklist advertisement filtering rules, where the whitelist is the list of rules whose matched resources are not filtered and the blacklist is the list of rules whose matched resources are filtered. If a whitelist rule is matched, the resource of the url corresponding to the segment character may be requested; if a blacklist rule is matched, that resource need not be requested. If neither matches, the next segment character can be matched in the same way.
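- A minimal Python sketch of the whitelist-then-blacklist order described above. The rule lists and the function name should_block are hypothetical, and the rules are assumed to be already expressed as regular expressions.

```python
import re
from typing import List, Optional

# Hypothetical rule lists, assumed to be regular expressions already.
WHITELIST: List[str] = [r"/ads/allowed/"]
BLACKLIST: List[str] = [r"/ads/"]


def should_block(url: str) -> Optional[bool]:
    """False: whitelisted, request the resource.
    True: blacklisted, do not request the resource.
    None: no rule matched, so the caller moves on to the next segment."""
    if any(re.search(p, url) for p in WHITELIST):
        return False
    if any(re.search(p, url) for p in BLACKLIST):
        return True
    return None
```

- Checking the whitelist first means that a url matched by both lists is still requested, which is the purpose of the whitelist.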
- To match the url in the rule matcher, the advertisement filtering rule corresponding to the matched keyword may first be converted into a regular expression; the regular expression interface is then used to query the advertisement filtering rule and determine whether the url matches it.
- Step S408: if the uniform resource locator matches an advertisement filtering rule corresponding to the keyword, perform advertisement filtering by using that rule.
- After the uniform resource locator is matched against the advertisement filtering rules corresponding to the keyword, if the uniform resource locator that matched the keyword also matches one of those rules, the matched advertisement filtering rule may be output and used for advertisement filtering. That is, the resource requested by the url is determined to be an advertisement, and the browser does not need to request it.
- The url is matched against the keywords of the advertisement filtering rules, and a url that matches a keyword is then matched only against the advertisement filtering rules corresponding to that keyword. This avoids matching the url against all advertisement filtering rules one by one and reduces the number of rules to be matched, which solves the problem that each advertisement filtering pass takes a long time because of the large number of filtering rules, ensures that advertisements are filtered effectively, and reduces the advertisement filtering time.
- Existing advertisement filtering needs to match the url against 20,000 advertisement filtering rules one by one, and performs advertisement filtering if one of them matches.
- Here, the url is first matched against the keywords of the advertisement filtering rules; if the matched keyword A corresponds to 100 advertisement filtering rules, the url only needs to be matched against those 100 rules, which significantly reduces the matching time.
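- The reduction from 20,000 rules to 100 candidates can be illustrated with a small Python sketch. The data is synthetic (200 keywords with 100 rules each), and the names build_index and candidate_rules are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group rules under the keyword extracted from each of them."""
    index: Dict[str, List[str]] = defaultdict(list)
    for keyword, rule in pairs:
        index[keyword].append(rule)
    return index


def candidate_rules(index: Dict[str, List[str]], segments: Iterable[str]) -> List[str]:
    """Collect only the rules whose keyword equals one of the url segments."""
    candidates: List[str] = []
    for segment in segments:
        candidates.extend(index.get(segment, []))
    return candidates


# Synthetic data: 200 keywords with 100 rules each, 20,000 rules in total.
pairs = [(f"kw{k}", f"rule-{k}-{i}") for k in range(200) for i in range(100)]
index = build_index(pairs)
candidates = candidate_rules(index, ["example", "kw7", "index"])
print(len(pairs), len(candidates))  # prints: 20000 100
```

- Only the 100 rules stored under the hit keyword are handed to the rule matcher; the other 19,900 are never examined for this url.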
- The webpage data processing method can be used for advertisement filtering in a PC browser or in a browser on a mobile terminal. Its function can be implemented by the PC or mobile terminal itself, or by a cloud server (such as middleware).
- The webpage data processing method of the embodiment of the present invention is particularly effective when the number of advertisement rules that the mobile terminal can support is limited.
- The browser webpage data processing method further comprises: passing the uniform resource locator to the segmenter; and segmenting the uniform resource locator in the segmenter to obtain a plurality of segment characters. Matching the uniform resource locator by using the keywords of the advertisement filtering rules then comprises: matching the plurality of segment characters one by one against the keywords in the keyword matcher.
- a segmenter is used to segment the uniform resource locator.
- segmentation may be performed according to a preset segmentation rule.
- FIG. 22 is a flowchart of a web page data processing method in accordance with a second embodiment of the present invention.
- the browser webpage data processing method of this embodiment may be a preferred embodiment of the browser webpage data processing method of the above embodiment.
- the browser webpage data processing method includes the following steps:
- Step S502 is the same as step S402 shown in FIG. 21, and details are not described herein.
- Step S504: acquire the keywords corresponding to the advertisement filtering rules.
- The keywords in the keyword matcher may be initialized first.
- The specific initialization process may be: first obtain the keywords corresponding to the advertisement filtering rules.
- The keywords are extracted from the advertisement filtering rule file, so that after the url matches a keyword, the advertisement filtering rules corresponding to that keyword can be queried.
- Step S506: establish a dictionary tree of the keywords corresponding to the advertisement filtering rules.
- The dictionary tree (trie) is a prefix-based lookup structure.
- The basic idea is to record the prefix information of all keywords, so the number of comparisons during a query is greatly reduced. This method is especially useful when the number of keywords is large.
- The keywords are organized by establishing a dictionary tree of the keywords corresponding to the advertisement filtering rules; using the trie further reduces the time consumed by advertisement filtering.
- The keywords can be stored in order to improve search speed.
- Nodes in the trie may contain empty links (null pointers), which indicate that no keyword exists at that position in the trie, so a lookup can stop as early as possible.
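- A minimal Python sketch of such a trie node, assuming one child link per character. The class TrieNode and the helpers insert and child are hypothetical names; a missing dictionary entry plays the role of the empty link.

```python
from typing import Dict, Optional


class TrieNode:
    """One node per character; a missing child is the 'empty link'."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_keyword = False  # True if a keyword ends at this node


def insert(root: TrieNode, keyword: str) -> None:
    """Add a keyword, sharing nodes with keywords that have the same prefix."""
    node = root
    for ch in keyword:
        node = node.children.setdefault(ch, TrieNode())
    node.is_keyword = True


def child(node: TrieNode, ch: str) -> Optional[TrieNode]:
    """Return the child for `ch`, or None if no keyword continues with `ch`."""
    return node.children.get(ch)
```

- Keywords such as "ads" and "adv" share the nodes for "a" and "d", so the common prefix is stored and compared only once.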
- Step S508: acquire the keywords in the dictionary tree.
- To match the url against the keywords, the keywords in the dictionary tree may first be obtained so that the url can be matched against them.
- Step S510: determine whether the uniform resource locator matches a keyword in the dictionary tree.
- The keyword matcher of the advertisement filtering rules looks up each segment character passed in by the url segmenter and determines whether it matches a keyword in the trie, where a match may be an exact match or a partial match.
- An exact match means that the segment character is identical to a keyword; a partial match means that a keyword is a prefix of the segment character. For example, if the trie contains the keyword "as", a lookup returns a successful match when the segment character is "as" or "ask".
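- A minimal Python sketch of the exact and partial (prefix) match, assuming a trie built from nested dictionaries. The names build_trie, find_keyword, and the END marker are hypothetical, as are the example keywords.

```python
from typing import Dict, Iterable, Optional

END = "$end"  # marker key; multi-character, so it cannot clash with a single-character link


def build_trie(keywords: Iterable[str]) -> Dict:
    """Build a nested-dict trie; the END key stores the keyword that ends there."""
    root: Dict = {}
    for keyword in keywords:
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = keyword
    return root


def find_keyword(trie: Dict, segment: str) -> Optional[str]:
    """Return the shortest keyword that equals `segment` or is a prefix of it."""
    node = trie
    for ch in segment:
        node = node.get(ch)
        if node is None:      # empty link: no keyword continues this way
            return None
        if END in node:       # a keyword ends here: exact or partial match
            return node[END]
    return None
```

- With the keywords "as" and "banner" loaded, the segments "as" and "ask" both resolve to "as", while "a" and "bank" resolve to nothing.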
- If a keyword of an advertisement filtering rule is found in the trie, the corresponding advertisement filtering rules can be looked up through the keyword, and the rules found are used for advertisement filtering.
- Steps S512 and S514 are the same as steps S406 and S408 shown in FIG. 21, respectively, and are not described again here.
- By using the dictionary tree of keywords to match the url against the keywords, the time consumed in keyword matching is reduced, which further reduces the advertisement filtering time.
- acquiring a keyword corresponding to the advertisement filtering rule comprises: reading a file of the advertisement filtering rule; and extracting the keyword from a file of the advertisement filtering rule.
- Establishing a dictionary tree of keywords corresponding to the advertisement filtering rule includes: establishing a correspondence between the keyword and the advertisement filtering rule; and establishing the dictionary tree according to the extracted keyword.
- the file of the advertisement filtering rule may be read into the memory from the disk in the PC or the mobile terminal or the cloud server. Then extract the keywords from the ad filter rules file and create a correspondence between the keywords and the ad filter rules.
- the rules for extracting keywords from the files of the advertisement filtering rule may include:
- the character length of the keyword is greater than or equal to 3 and less than 32.
- The key (keyword) extraction process includes: traversing the character string in the advertisement filtering rule file until the first character in the above extraction rule set is found, which is recorded as the start position of the keyword; and continuing to traverse until the end of the string, or until the next character in the above extraction rules is found, which is recorded as the end position.
- The characters between the start position and the end position are taken as a candidate keyword. The candidate is checked against extraction conditions 4), 5), and 6) above; if it satisfies them, it is returned as the final keyword.
- When no suitable keyword can be extracted from an advertisement filtering rule, the rule is added to the global queue. Rules in the global queue have no associated keyword, so every url must be matched against them. A check of the actual advertisement filtering rules in adblock shows that this case is rare: currently, a keyword cannot be extracted for no more than 20 of the 11285 rules.
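- A rough Python sketch of keyword extraction with the global queue as a fallback. Only the length condition (at least 3 and fewer than 32 characters) comes from the text; the candidate alphabet (lowercase letters and digits) is an assumption, and the other extraction conditions referred to above are not reproduced. The names extract_keyword and build_tables and the example rules are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

CANDIDATE = re.compile(r"[a-z0-9]+")  # assumed candidate alphabet, not from the source


def extract_keyword(rule: str) -> Optional[str]:
    """Return the first run of candidate characters with 3 <= length < 32."""
    for run in CANDIDATE.findall(rule.lower()):
        if 3 <= len(run) < 32:
            return run
    return None


def build_tables(rules: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split rules into a keyword -> rules table and a keyword-less global queue."""
    by_keyword: Dict[str, List[str]] = {}
    global_queue: List[str] = []
    for rule in rules:
        keyword = extract_keyword(rule)
        if keyword is None:
            global_queue.append(rule)  # must be matched against every url
        else:
            by_keyword.setdefault(keyword, []).append(rule)
    return by_keyword, global_queue
```

- A rule such as "/ad/*" yields no run of at least three characters, so it lands in the global queue, while "/banner/*" is filed under "banner".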
- Matching the uniform resource locator by using the preset keywords comprises: determining whether the uniform resource locator matches a preset keyword, wherein, if the uniform resource locator is determined to match the preset keyword, the advertisement filtering rule corresponding to the keyword is converted into a regular expression.
- Matching the uniform resource locator that matched the keyword against the advertisement filtering rule corresponding to the keyword comprises: matching that uniform resource locator against the regular expression. If it matches the regular expression, the advertisement filtering rule corresponding to the regular expression is output and used for advertisement filtering.
- To match the url in the rule matcher, the advertisement filtering rule corresponding to the matched keyword may first be converted into a regular expression; the regular expression interface is then used to query the advertisement filtering rule and determine whether the url matches it.
- The embodiment of the present invention converts the advertisement filtering rule corresponding to a keyword into a regular expression only once the url is determined to match that keyword, so it is not necessary to convert all advertisement filtering rules into regular expressions when advertisement filtering starts.
- Converting advertisement filtering rules into regular expressions takes time; for example, converting all rules at startup takes about 1.5 seconds in a mobile terminal browser. Because the number of advertisement filtering rules corresponding to each keyword is small (usually no more than 2, and at most 10), the on-demand conversion time is short. If parsing 10,000 advertisement filtering rules takes 1.5 s, the average parsing time per rule is 0.15 ms, so the matching time increases by at most 1.5 ms. The embodiment of the present invention may also cache the parsing result of an advertisement filtering rule after the rule is hit for the first time, so that later hits incur no parsing overhead, which further reduces the time consumed.
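- The timing estimate above can be reproduced with a few lines of Python. The three constants are the figures quoted in the text (10,000 rules, written as "1w" in the source, parsed in 1.5 s, and at most 10 rules per keyword); the variable names are hypothetical.

```python
TOTAL_RULES = 10_000          # number of rules parsed at startup
FULL_PARSE_SECONDS = 1.5      # time to convert every rule at startup
MAX_RULES_PER_KEYWORD = 10    # upper bound on rules per keyword quoted in the text

# Average cost of converting one rule, in milliseconds.
per_rule_ms = FULL_PARSE_SECONDS / TOTAL_RULES * 1000

# Worst case added to one match: every rule of the hit keyword is converted.
worst_case_extra_ms = per_rule_ms * MAX_RULES_PER_KEYWORD

print(f"{per_rule_ms:.2f} ms per rule, at most {worst_case_extra_ms:.2f} ms extra per match")
```

- The worst case applies only to the first hit of a keyword; once the converted rules are cached, the extra cost disappears.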
- the browser webpage data processing method includes:
- Step S601: a url is input in the browser.
- Step S602: the url is passed into the segmenter, which segments it.
- The url is segmented according to a predetermined rule in the segmenter, and the segment characters obtained from all segments are saved.
- The predetermined rule may be: first, split the url using "/" as the separator, so that the first segment is the domain name and the remaining segments are the path segments; then split the domain name further into domain name segments using "." as the separator.
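- A minimal Python sketch of this segmentation rule. The function name segment_url is hypothetical, and stripping the "scheme://" prefix before splitting is an assumption made so that the first segment is the domain name; query strings are not treated specially.

```python
from typing import List


def segment_url(url: str) -> List[str]:
    """Split a url into domain name segments followed by path segments."""
    without_scheme = url.split("://", 1)[-1]       # assumption: drop "http://" etc.
    domain, *paths = without_scheme.split("/")     # first segment is the domain name
    segments = [s for s in domain.split(".") if s] # split the domain name on "."
    segments += [p for p in paths if p]            # keep non-empty path segments
    return segments
```

- For "http://ads.example.com/banner/top.gif" this yields the segments "ads", "example", "com", "banner", and "top.gif", which are then passed one by one to the keyword matcher.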
- Step S603: the segmented url is input to the keyword matcher.
- The url segmenter passes each segment character in turn to the keyword matcher of the filtering rules.
- Step S604: it is determined, segment by segment, whether a keyword in the keyword matcher is hit.
- In the keyword matcher of the filtering rules, it is determined whether a keyword corresponding to a filtering rule is hit. If there is no hit, step S605 is performed; if there is a hit, step S607 is performed.
- Step S605: it is determined whether any segment characters remain unmatched. If yes, return to step S603; if no, go to step S606.
- Step S606: False is returned, indicating that no filtering is required and the resource can be requested.
- Step S607: the URL corresponding to the hit segment is passed to the rule matcher, and step S608 is then performed.
- the rule matcher stores a correspondence between the keyword and the filter rule.
- Step S608: it is determined whether the URL hits the blacklist and does not hit the whitelist. Because the URLs that hit the blacklist include some that do not need to be filtered, the whitelist is set up to match those URLs. If the URL hits the blacklist and does not hit the whitelist, step S610 is performed; otherwise, that is, if the URL misses the blacklist or hits the whitelist, step S609 is performed.
- Step S609: False is returned.
- Step S610: the corresponding filtering rule is output.
- Step S611: advertisement filtering is performed by using the corresponding filtering rule.
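- The flow of steps S601 to S611 can be sketched end to end in Python. This is a simplified illustration under stated assumptions: the rule tables are hypothetical and already written as regular expressions, the keyword matcher is a plain dictionary lookup (exact match only) instead of a trie, and the names segment_url and filter_url are invented for the example.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rule tables: keyword -> rules written as regular expressions.
BLACKLIST: Dict[str, List[str]] = {"ads": [r"ads\.", r"/ads/"]}
WHITELIST: Dict[str, List[str]] = {"ads": [r"/ads/allowed/"]}


def segment_url(url: str) -> List[str]:
    """S602: split on "/" and then split the domain name on "."."""
    domain, *paths = url.split("://", 1)[-1].split("/")
    return [s for s in domain.split(".") + paths if s]


def filter_url(url: str) -> Optional[str]:
    """Return the rule to apply (S610), or None where the flow returns False."""
    for segment in segment_url(url):            # S603, S605: one segment at a time
        if segment not in BLACKLIST:            # S604: no keyword hit for this segment
            continue
        # S607, S608: the url reached the rule matcher for the hit keyword.
        if any(re.search(p, url) for p in WHITELIST.get(segment, [])):
            return None                         # S609: whitelisted, do not filter
        for pattern in BLACKLIST[segment]:
            if re.search(pattern, url):
                return pattern                  # S610: output the matching rule
        return None                             # S609: blacklist missed
    return None                                 # S606: no segment hit a keyword
```

- A caller would perform the filtering of step S611 whenever filter_url returns a rule, and request the resource normally whenever it returns None.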
- the advertisement filtering takes time:
- the webpage data processing method proposed by the present invention can significantly reduce the time spent on advertisement filtering when accessing a webpage, and improve the user experience.
- the disclosed apparatus may be implemented in other manners.
- the device embodiments described above are merely illustrative.
- The division into units is only a division by logical function; there may be other ways of dividing them. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be electrical or in another form.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
- the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
- The technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium. The software product includes a number of instructions that cause a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
- The foregoing storage medium includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and the like.
Abstract
A webpage data processing method and device, the method comprising: acquiring a webpage to be tested; matching the webpage to be tested with a preset matching condition to obtain a matching result, the matching condition comprising keywords of an advertisement filtering rule and the advertisement filtering rule corresponding to the keywords, or the matching condition comprising an area with a first identifier preset in a preset webpage corresponding to the webpage address of the webpage to be tested; and determining the filtering condition of the webpage to be tested according to the matching result. Compared with a manual detection method, the method and device of the present invention therefore detect webpage filtering problems quickly and promptly, which improves detection efficiency and is particularly suitable when there is a great number of webpages to be tested.
Description
The present invention relates to the field of mobile communication technologies, and in particular to a webpage data processing method and apparatus.
Website operators usually embed data from certain merchants, such as advertisements, in their webpages in exchange for those merchants' sponsorship, which keeps the website running and profitable. For users, however, this embedded data is non-valid content, and its presence causes considerable inconvenience. For example, when browsing a new webpage, a user first has to distinguish non-valid content such as advertisements from valid content; or an advertisement occludes the valid content in the corresponding area of the webpage, making that content hard to obtain. To give users a clean browsing environment, most browsers provide a filtering function that removes non-valid content embedded in webpages, for example by filtering advertisements. The filtering principle is generally as follows: filtering rules are formulated according to features of the webpage to be filtered, such as its layout style and frame code; the rules are used to identify non-valid content (such as advertisements) in the webpage, and the loading of that content is blocked, or the content is hidden in the page and not displayed.
In practice, however, the layout style of a webpage changes as the website is updated, or the website maintainer deliberately changes features such as the layout style or frame code so that the embedded data is not filtered. The preset filtering rules then no longer apply to the updated webpage, which causes filtering problems such as filtering failure and mistaken filtering of valid content. Such filtering problems therefore need to be discovered promptly so that the filtering method can be optimized and the filtering accuracy improved.
Generally, manual detection is used to determine whether a webpage has a filtering problem, which ensures accurate detection results. However, because the number of websites is huge and each website may be updated a dozen or more times a day, manual detection cannot guarantee that every filtering problem is detected in time, and its detection efficiency is extremely low.
In addition, adblock is a widely used advertisement filtering plug-in for web browsers. Its basic principle is to define a series of filtering rules: before the browser issues a request for a webpage resource, it checks whether the resource's Uniform Resource Locator (url) hits a filtering rule. If a rule is hit, the requested resource is determined to be an advertisement, and the browser does not need to request it.
To achieve a good filtering effect, a large number of filtering rules is usually needed; for example, adblock provides more than 20,000. The current browser advertisement filtering method is as follows: when the user inputs a url in the browser, the url is matched against the filtering rules one by one; if a rule matches, true is returned (advertisement filtering is required), otherwise false is returned (advertisement filtering is not required). Because the browser holds a large number of advertisement filtering rules and matches every network request against them one by one, advertisement filtering has a high performance overhead, and the large number of rules makes each filtering pass take a long time.
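The one-by-one matching described above can be sketched in Python as a baseline. The function name needs_filtering and the example rules are hypothetical, and the rules are assumed to be regular expressions.

```python
import re
from typing import List


def needs_filtering(url: str, rules: List[str]) -> bool:
    """Baseline: try every rule in turn, so the cost grows linearly with len(rules)."""
    for pattern in rules:
        if re.search(pattern, url):
            return True    # a rule matched: advertisement filtering is required
    return False           # no rule matched: advertisement filtering is not required
```

For a url that matches no rule, every one of the rules is evaluated before false is returned, which is the cost that grows with the size of the rule set.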
Summary of the invention
Embodiments of the present invention provide a webpage data processing method and apparatus to solve the problems of untimely detection and low efficiency in manual detection of webpage filtering problems, so that filtering problems are found quickly and effectively.
To achieve the above object, according to one aspect of the present invention, an advertisement filtering method is provided. The browser advertisement filtering method according to the present invention includes: acquiring a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition includes keywords of advertisement filtering rules and the advertisement filtering rules corresponding to those keywords, or the matching condition includes an area, preset with a first identifier, in a preset webpage corresponding to the webpage address of the webpage to be tested; and determining the filtering condition of the webpage to be tested according to the matching result.
Further, while acquiring the webpage to be tested, the method also includes: acquiring the preset webpage corresponding to the webpage address of the webpage to be tested. Before matching the webpage to be tested against the preset matching condition to obtain the matching result, the method also includes: setting a first identifier on the areas of the preset webpage and of the webpage to be tested in which actual content exists. Matching the webpage to be tested against the preset matching condition to obtain the matching result includes: judging whether the areas marked with the first identifier in the preset webpage and in the webpage to be tested match. Determining the filtering condition of the webpage to be tested according to the matching result includes: if the areas marked with the first identifier in the preset webpage and in the webpage to be tested match, determining that the webpage to be tested has no filtering problem; otherwise, determining that the webpage to be tested has a filtering problem.
Further, acquiring the webpage to be tested includes: acquiring the uniform resource locator of the webpage to be tested. Matching the webpage to be tested against the preset matching condition to obtain the matching result includes: matching the uniform resource locator by using the keywords of the advertisement filtering rules; and, if the uniform resource locator matches a keyword, matching the uniform resource locator against the advertisement filtering rules corresponding to that keyword. Determining the filtering condition of the webpage to be tested according to the matching result includes: if the uniform resource locator matches an advertisement filtering rule corresponding to the keyword, performing advertisement filtering by using that rule.
为了实现上述目的,根据本发明的另一方面,提供了一种网页数据处理装置。根据本发明的网页数据处理装置包括处理器,上述处理器用于执行以下程序模块:网页获取单元,用于获取待测网页;网页匹配单元,用于将上述待测网页与预先设置的匹配条件进行匹配,得到匹配结果,其中,上述匹配条件包括广告过滤规则的关键字和上述关键字对应的广告过滤规则,或者上述匹配条件包括与上述待测网页的网页地址对应的预设网页,上述预设网页中预先设置有第一标识的区域:以及结果确定单元,用于根据上述匹配结果确定上述待测网页的过滤情况。
In order to achieve the above object, according to another aspect of the present invention, a web page data processing apparatus is provided. The webpage data processing apparatus according to the present invention includes a processor, the processor is configured to execute the following program module: a webpage obtaining unit, configured to acquire a webpage to be tested, and a webpage matching unit, configured to perform the matching webpage with the preset matching condition Matching, the matching result is obtained, wherein the matching condition includes a keyword of the advertisement filtering rule and an advertisement filtering rule corresponding to the keyword, or the matching condition includes a preset webpage corresponding to the webpage address of the webpage to be tested, the preset The area of the webpage is pre-set with the first identifier: and the result determining unit is configured to determine, according to the foregoing matching result, the filtering condition of the webpage to be tested.
Further, the webpage obtaining unit is also configured to obtain, together with the webpage to be tested, the preset webpage corresponding to the webpage address of the webpage to be tested. The apparatus further includes a webpage marking unit, configured to set a first identifier in the areas of the preset webpage and of the webpage to be tested where actual content exists. The webpage matching unit is further configured to judge whether the areas in which the first identifier is set in the preset webpage match those in the webpage to be tested. The result determining unit is further configured to determine that the webpage to be tested has no filtering problem when those areas match, and otherwise to determine that the webpage to be tested has a filtering problem.
Further, the webpage obtaining unit includes a first obtaining unit, configured to obtain the uniform resource locator (URL) of the webpage to be tested. The webpage matching unit includes: a first matching unit, configured to match the URL against the keywords of the advertisement filtering rules; and a second matching unit, configured to match the URL against the advertisement filtering rule corresponding to a keyword when the URL matches that keyword. The result determining unit includes a filtering unit, configured to perform advertisement filtering using the advertisement filtering rule when the URL matches the advertisement filtering rule corresponding to the keyword.
To achieve the above object, according to another aspect of the present invention, a computer-readable medium carrying processor-executable program code is provided for use in a webpage data processing device. The program code causes the processor to perform the following steps: obtaining a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, where the matching condition includes keywords of advertisement filtering rules and the advertisement filtering rules corresponding to those keywords, or the matching condition includes a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage containing areas in which a first identifier has been set in advance; and determining the filtering status of the webpage to be tested according to the matching result.
As can be seen from the above technical solutions, the embodiments of the present invention obtain a webpage to be tested; match the webpage to be tested against a preset matching condition to obtain a matching result, where the matching condition includes keywords of advertisement filtering rules and the advertisement filtering rules corresponding to those keywords, or the matching condition includes a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage containing areas in which a first identifier has been set in advance; and determine the filtering status of the webpage to be tested according to the matching result. Compared with manual inspection, the embodiments can therefore detect filtering problems in a webpage (such as filter failure and false filtering) quickly and promptly, which improves detection efficiency and is particularly suitable where the number of webpages to be tested is very large.
The accompanying drawings, which form a part of the present invention, are provided for a further understanding of the invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not constitute an undue limitation of it. In the drawings:
FIG. 1 is a schematic flowchart of a webpage data processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for implementing step S13 in FIG. 1 according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method, based on the method shown in FIG. 2, for determining the type of filtering problem according to an embodiment of the present invention;
FIG. 4(a) is a schematic diagram of a preset webpage obtained through processing by an embodiment of the present invention;
FIG. 4(b) is a schematic diagram of a webpage to be tested obtained through processing by an embodiment of the present invention;
FIG. 4(c) is a schematic diagram of another webpage to be tested obtained through processing by an embodiment of the present invention;
FIG. 4(d) is a schematic diagram of another webpage to be tested obtained through processing by an embodiment of the present invention;
FIG. 4(e) is a schematic diagram of another webpage to be tested obtained through processing by an embodiment of the present invention;
FIG. 5 is a schematic flowchart of another webpage data processing method according to an embodiment of the present invention;
FIG. 6(a) is a schematic diagram of a webpage that has not been processed by an embodiment of the present invention;
FIG. 6(b) is a schematic diagram of the webpage shown in FIG. 6(a) after step S22 shown in FIG. 5 has been performed on it;
FIG. 6(c) is a schematic diagram of the webpage shown in FIG. 6(b) after its actual content has been further processed;
FIG. 7 is a flowchart of a method for implementing step S23 in FIG. 5 according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the preset comparison points in the embodiment shown in FIG. 7;
FIG. 9 is a flowchart of another method for implementing step S23 in FIG. 5 according to an embodiment of the present invention;
FIG. 10 is a flowchart of a method for implementing steps S341 to S342 in FIG. 9 based on interlaced scanning of the webpage according to an embodiment of the present invention;
FIG. 11 is a schematic flowchart of another webpage data processing method according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a webpage that uses a border as the first identifier according to an embodiment of the present invention;
FIG. 13 is a flowchart of a method for implementing step S33 in FIG. 11 according to an embodiment of the present invention;
FIG. 14 is a schematic flowchart of another webpage data processing method according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of the partitioning results of a preset webpage and a webpage to be tested according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of a webpage data processing apparatus according to an embodiment of the present invention;
FIG. 17 is a schematic structural diagram of another webpage data processing apparatus according to an embodiment of the present invention;
FIG. 18 is a schematic diagram of a webpage data processing apparatus according to a first embodiment of the present invention;
FIG. 19 is a schematic diagram of a webpage data processing apparatus according to a second embodiment of the present invention;
FIG. 20 is a schematic diagram of a webpage data processing apparatus according to a third embodiment of the present invention;
FIG. 21 is a flowchart of a webpage data processing method according to a first embodiment of the present invention;
FIG. 22 is a flowchart of a webpage data processing method according to a second embodiment of the present invention; and
FIG. 23 is a flowchart of a preferred webpage data processing method according to an embodiment of the present invention.
It should be noted that, provided they do not conflict, the embodiments of the present invention and the features of those embodiments may be combined with one another. The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of those embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings of the present invention are used to distinguish similar objects and not necessarily to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate for the purposes of the embodiments of the invention described herein. Furthermore, the terms "include" and "have" and any variants of them are intended to cover a non-exclusive inclusion: for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
The embodiments of the present invention provide a webpage data processing method and apparatus to solve the problems of untimely detection and low efficiency that arise when webpage filtering problems are detected manually.
To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features and advantages of the embodiments more apparent, the technical solutions in the embodiments are described in further detail below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a webpage data processing method according to an embodiment of the present invention. Referring to FIG. 1, the method includes the following steps:
S11: Obtain a webpage to be tested and the preset webpage corresponding to the webpage address of the webpage to be tested.
The preset webpage and the webpage to be tested are two webpages that correspond to the same webpage address at different times. The preset webpage may be a problem-free webpage corresponding to that address at some historical moment, that is, a webpage obtained under normal filtering, with neither false filtering nor filter failure.
S12: Set a first identifier in the areas of the preset webpage and of the webpage to be tested where actual content exists.
The actual content includes both valid content and non-valid content such as advertisements. The areas of the preset webpage in which the first identifier is set constitute one form of the matching condition, and matching the webpage to be tested against this condition includes the judgment of step S13 below. Optionally, the matching condition may also include keywords of advertisement filtering rules and the advertisement filtering rules corresponding to those keywords, as described later.
S13: Judge whether the areas in which the first identifier is set in the preset webpage match those in the webpage to be tested; if they match, perform step S14; otherwise perform step S15.
S14: Determine that the webpage to be tested has no filtering problem.
S15: Determine that the webpage to be tested has a filtering problem.
As the above steps show, the embodiment obtains the preset webpage and the webpage to be tested that correspond to the same webpage address, and sets a first identifier in the areas of each where actual content exists. Taking the preset webpage as the reference, it judges whether the areas in which the first identifier is set in the webpage to be tested match the areas in which the first identifier is set in the preset webpage, and determines from the result whether the webpage to be tested has a filtering problem. With this embodiment, it suffices to set a corresponding preset webpage for each webpage address in order to detect automatically the filtering problems of the webpages of multiple websites and multiple webpage addresses. After the layout style and/or frame code of the webpage at some address changes, automatic detection can continue to run accurately once the preset webpage for that address is updated accordingly. Compared with manual inspection, this embodiment can therefore detect webpage filtering problems (such as false filtering or filter failure) quickly and promptly, which improves detection efficiency and is particularly suitable where the number of webpages to be tested is very large.
In one feasible embodiment of the present invention, the preset webpage and the webpage to be tested, after being processed in step S12, may be stored in an image format, and the judgment of step S13 is then performed on the image-format preset webpage and webpage to be tested.
In another feasible embodiment of the present invention, the preset webpage and the webpage to be tested need not be converted to images; instead, the judgment of step S13 is carried out directly on the result of step S12.
In this embodiment, the areas in which the first identifier is set in the webpage to be tested are said to match those in the preset webpage when the following holds: if the first identifier is present in an area of the preset webpage, it should also be present in the corresponding area of the webpage to be tested; and if the first identifier is absent from an area of the preset webpage, it should also be absent from the corresponding area of the webpage to be tested.
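As a minimal sketch of this strict notion of matching together with the verdict of steps S13 to S15, each page may be represented as a mapping from a region name to a flag saying whether the first identifier is set there. That representation, and the names `Marks`, `regions_match` and `judge`, are assumptions of this sketch; the specification does not prescribe a data structure.

```python
from typing import Dict

# Hypothetical representation: region name -> True if the first identifier is set there.
Marks = Dict[str, bool]


def regions_match(preset: Marks, tested: Marks) -> bool:
    """Strict match: every corresponding region is marked in both pages or in neither."""
    all_regions = set(preset) | set(tested)
    # A region missing from one mapping is treated as unmarked in that page.
    return all(preset.get(r, False) == tested.get(r, False) for r in all_regions)


def judge(preset: Marks, tested: Marks) -> str:
    """Steps S13 to S15: compare the marked regions and report the verdict."""
    if regions_match(preset, tested):
        return "no filtering problem"  # S14
    return "filtering problem"  # S15
```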
In practice, the judgment in step S13 of whether the areas in which the first identifier is set in the webpage to be tested match those in the preset webpage can be implemented in several ways; FIG. 2 illustrates one feasible implementation.
Referring to FIG. 2, in the webpage data processing method provided by one feasible embodiment of the present invention, judging whether the areas in which the first identifier is set in the webpage to be tested match those in the preset webpage includes the following steps:
S331: Calculate a first total area, namely the total area of the areas in which the first identifier is set in the preset webpage, and a second total area, namely the total area of the areas in which the first identifier is set in the webpage to be tested.
S332: Calculate a third ratio of the first total area to the second total area.
S333: Judge whether the third ratio is within a preset range; if so, perform step S334; otherwise perform step S335.
S334: Determine that the areas in which the first identifier is set in the preset webpage match those in the webpage to be tested.
S335: Determine that the areas in which the first identifier is set in the preset webpage do not match those in the webpage to be tested.
Strictly speaking, when the areas in which the first identifier is set in the webpage to be tested and in the preset webpage match exactly, the first total area should equal the second total area; that is, the third ratio should be 1, and the preset range should reduce to a single threshold whose value is 1. However, to allow for calculation error, or to avoid the workload of frequently modifying the filtering rules, it may be provided that the preset webpage and the webpage to be tested are considered to match whenever the third ratio lies within a preset range centered on 1. The maximum and minimum of the preset range can be chosen according to the actual detection requirements: the higher the required detection accuracy, the larger the minimum and the smaller the maximum. For example, where the accuracy requirement is low, the preset range may be set to [0.75, 1.35]; where the accuracy requirement is high, it may be set to [0.95, 1.05]. These specific values are, of course, only one feasible implementation based on the principle of the present invention and should not be regarded as limiting its scope of protection.
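The area test of steps S331 to S335 can be sketched as follows, assuming the areas of the marked regions have already been measured. The function names and the sample areas are illustrative assumptions; the default range [0.75, 1.35] is the low-accuracy example given above.

```python
from typing import Iterable, Tuple


def third_ratio(preset_areas: Iterable[float], tested_areas: Iterable[float]) -> float:
    """S331 and S332: ratio of the first total area to the second total area."""
    first_total = sum(preset_areas)
    second_total = sum(tested_areas)
    if second_total == 0:
        # Not covered by the text; this sketch simply refuses to divide by zero.
        raise ValueError("the webpage to be tested has no marked area")
    return first_total / second_total


def areas_match(
    preset_areas: Iterable[float],
    tested_areas: Iterable[float],
    preset_range: Tuple[float, float] = (0.75, 1.35),
) -> bool:
    """S333 to S335: the pages match when the third ratio lies within the preset range."""
    low, high = preset_range
    return low <= third_ratio(preset_areas, tested_areas) <= high
```

Tightening `preset_range` to `(0.95, 1.05)` corresponds to the high-accuracy setting: a page whose marked area differs from the preset page by a few percent would then no longer be accepted.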
In another feasible embodiment of the present invention, when the embodiment shown in FIG. 2 determines that the areas in which the first identifier is set in the preset webpage do not match those in the webpage to be tested, that is, that the webpage to be tested has a filtering problem, the steps shown in FIG. 3 may additionally be performed to determine the specific type of filtering problem:
S631: Judge whether the third ratio is less than the minimum of the preset range; if so, perform step S632; otherwise perform step S633.
S632: Determine that the webpage to be tested exhibits filter failure.
S633: Judge whether the third ratio is greater than the maximum of the preset range; if so, determine that the webpage to be tested exhibits false filtering.
In the two example ranges given above, [0.75, 1.35] and [0.95, 1.05], the maximum and minimum lie at similar distances from 1 (exactly equal distances in the case of [0.95, 1.05]). Optionally, the maximum and minimum of the preset range may instead be set separately according to the different detection accuracies required for the two types of filtering problem. For example, if high accuracy is required for detecting filter failure and lower accuracy suffices for detecting false filtering, a larger minimum and a larger maximum are set; the range may, for instance, be set to [0.95, 1.35].
The embodiments of the present invention shown in FIG. 2 and FIG. 3 are explained below with reference to FIG. 4(a) to FIG. 4(e).
FIG. 4(a) is a schematic diagram of a preset webpage after processing in step S12. It contains four areas in which the first identifier is set, labeled A1, B1, C1 and D1 in FIG. 4(a) for ease of description. The areas of A1, B1, C1 and D1 are 2, 1, 1 and 1.5 respectively, so the total area of the areas in which the first identifier is set in the preset webpage, that is, the first total area, is S1 = A1 + B1 + C1 + D1 = 5.5.
Scenario 1: Suppose the webpage to be tested, after processing in step S12, is as shown in FIG. 4(b). It likewise contains four areas in which the first identifier is set, labeled A2, B2, C2 and D2, and A1 and A2, B1 and B2, C1 and C2, and D1 and D2 match respectively. The areas of A2, B2, C2 and D2 are 2, 1, 1 and 1.5 respectively, so the total area of the areas in which the first identifier is set in the webpage to be tested shown in FIG. 4(b), that is, the second total area, is S2 = A2 + B2 + C2 + D2 = 5.5. The third ratio is therefore S1/S2 = 1. In the case shown in FIG. 4(b) the third ratio is within the preset range, so it can be determined that the webpage to be tested has no filtering problem, which agrees with the result of comparing FIG. 4(a) and FIG. 4(b) directly.
Scenario 2: Suppose the webpage to be tested, after processing in step S12, is as shown in FIG. 4(c). It contains only three areas in which the first identifier is set, labeled A3, B3 and C3. The areas of A3, B3 and C3 are 2, 1 and 1 respectively, so the total area of the areas in which the first identifier is set in the webpage to be tested shown in FIG. 4(c), that is, the second total area of step S331, is S3 = A3 + B3 + C3 = 4. The third ratio is therefore S1/S3 = 1.375. If the preset range is set to [0.75, 1.35], the third ratio calculated for the case shown in FIG. 4(c) is outside the preset range, and the webpage is determined to have a filtering problem. Further, since 1.375 > 1.35, that is, the third ratio is greater than the maximum of the preset range, it can be determined that the webpage to be tested shown in FIG. 4(c) exhibits false filtering, which agrees with the result of comparing FIG. 4(a) and FIG. 4(c) directly.
Scenario 3: Suppose the webpage to be tested, after processing in step S12, is as shown in FIG. 4(d). It contains four areas in which the first identifier is set, labeled A4, B4, C4 and D4. The areas of A4, B4, C4 and D4 are 2, 1, 1 and 2 respectively, so the total area of the areas in which the first identifier is set in the webpage to be tested shown in FIG. 4(d), that is, the second total area, is S4 = A4 + B4 + C4 + D4 = 6. The third ratio is therefore S1/S4 ≈ 0.92. If the preset range is set to [0.75, 1.35], the third ratio in the case shown in FIG. 4(d) is within the preset range, and it can be determined that the webpage to be tested has no filtering problem. Although the calculated third ratio is not 1 in this case, that is, the preset webpage of FIG. 4(a) and the webpage to be tested of FIG. 4(d) do not match exactly, the difference is small, and where the accuracy requirement is low the webpage to be tested of FIG. 4(d) can still be considered free of filtering problems.
Scenario 4: Suppose the webpage to be tested, after processing in step S12, is as shown in FIG. 4(e). It likewise contains four areas in which the first identifier is set, labeled A5, B5, C5 and D5. The areas of A5, B5, C5 and D5 are 2, 1, 1 and 4 respectively, so the total area of the areas in which the first identifier is set in the webpage to be tested shown in FIG. 4(e), that is, the second total area, is S5 = A5 + B5 + C5 + D5 = 8. The third ratio is therefore S1/S5 ≈ 0.69. If the preset range is set to [0.75, 1.35], the third ratio calculated for the case shown in FIG. 4(e) is outside the preset range, and the webpage is determined to have a filtering problem. Further, since 0.69 < 0.75, that is, the third ratio is less than the minimum of the preset range, it can be determined that the webpage to be tested shown in FIG. 4(e) exhibits filter failure, which agrees with the result of comparing FIG. 4(a) and FIG. 4(e) directly.
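The four scenarios can be reproduced with a short script that combines the ratio test of FIG. 2 with the type determination of FIG. 3. The area values and the preset range [0.75, 1.35] are those of the scenarios; the function name `classify` and the verdict strings are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

PRESET_RANGE: Tuple[float, float] = (0.75, 1.35)
PRESET_AREAS: List[float] = [2, 1, 1, 1.5]  # A1, B1, C1, D1 of FIG. 4(a)

# Marked areas of the webpage to be tested in each scenario.
SCENARIOS: Dict[str, List[float]] = {
    "FIG. 4(b)": [2, 1, 1, 1.5],
    "FIG. 4(c)": [2, 1, 1],
    "FIG. 4(d)": [2, 1, 1, 2],
    "FIG. 4(e)": [2, 1, 1, 4],
}


def classify(
    preset_areas: List[float],
    tested_areas: List[float],
    preset_range: Tuple[float, float] = PRESET_RANGE,
) -> Tuple[float, str]:
    """Return the third ratio and the verdict for one webpage to be tested."""
    low, high = preset_range
    ratio = sum(preset_areas) / sum(tested_areas)  # first total area / second total area
    if ratio < low:
        verdict = "filter failure"  # S631 and S632
    elif ratio > high:
        verdict = "false filtering"  # S633
    else:
        verdict = "no filtering problem"
    return ratio, verdict


for name, areas in SCENARIOS.items():
    ratio, verdict = classify(PRESET_AREAS, areas)
    print(f"{name}: third ratio = {ratio:.3f}, {verdict}")
```

Passing `preset_range=(0.95, 1.05)` instead makes the check stricter, so the page of FIG. 4(d), which is accepted under the looser range, is then reported as a filter failure.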
Optionally, in another feasible embodiment of the present invention, after the first total area and the second total area are obtained, a fourth ratio may be calculated, namely the ratio of the area difference between the two (the first total area minus the second total area) to the first total area (or to the second total area). If the absolute value of the fourth ratio is less than a preset threshold, it is determined that the webpage to be tested has no filtering problem; otherwise it has a filtering problem. Further, if the absolute value of the fourth ratio is not less than (that is, greater than or equal to) the preset threshold and the fourth ratio is less than zero, it is determined that the webpage to be tested exhibits filter failure; if the absolute value of the fourth ratio is not less than the preset threshold and the fourth ratio is greater than zero, it is determined that the webpage to be tested exhibits false filtering.
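A minimal sketch of this alternative test follows. It assumes the difference is divided by the first total area, and the threshold value 0.25 is chosen purely for illustration; neither the threshold value nor the function name comes from the specification.

```python
def classify_by_fourth_ratio(first_total: float, second_total: float, threshold: float = 0.25) -> str:
    """Classify using (first total area - second total area) / first total area."""
    if first_total <= 0:
        raise ValueError("the first total area must be positive")
    fourth_ratio = (first_total - second_total) / first_total
    if abs(fourth_ratio) < threshold:
        return "no filtering problem"
    # The magnitude is at or above the threshold, so the sign decides the type.
    if fourth_ratio < 0:
        return "filter failure"  # the tested page has more marked area than the preset page
    return "false filtering"  # the tested page has less marked area than the preset page
```

With the totals of FIG. 4, for example, `classify_by_fourth_ratio(5.5, 8)` reports filter failure and `classify_by_fourth_ratio(5.5, 4)` reports false filtering under this illustrative threshold.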
FIG. 5 is a flowchart of a webpage data processing method according to another embodiment of the present invention. Referring to FIG. 5, the method of this embodiment includes the following steps:
S21: Obtain a webpage to be tested and the preset webpage corresponding to the webpage address of the webpage to be tested.
S22: Set the background color of the areas of the preset webpage and of the webpage to be tested where actual content exists to a preset color.
S23: Judge whether the areas whose background color is the preset color in the preset webpage match those in the webpage to be tested; if they match, perform step S24; otherwise perform step S25.
S24: Determine that the webpage to be tested has no filtering problem.
S25: Determine that the webpage to be tested has a filtering problem.
In relation to the embodiment shown in FIG. 1, the embodiment shown in FIG. 5 uses the preset color as the first identifier to mark the areas of a webpage where actual content exists.
In another feasible embodiment of the present invention, while the background color of the areas where actual content exists in the preset webpage and the webpage to be tested is set to the preset color, the actual content in the two webpages may also be processed as follows: when the actual content is text, the color of the text is also set to the preset color; when the actual content is an image, the image is deleted.
When two different colors are superimposed, a third color different from both is obtained, and image content in a webpage covers the background color of the corresponding area. The above processing of the actual content removes the influence of the text's own color and of image colors on the color of the webpage, and ensures that the color of an area where actual content exists is the same as the background color of that area. The color of the corresponding webpage can then be read directly, and whether the webpage to be tested matches the preset webpage can be judged from the colors read, without having to judge whether a color that was read is the background color of the corresponding area and without having to obtain that background color by some other, more complicated means.
例如,以黑色为上述预设颜色,对图6(a)所示的网页执行步骤S22,网页中存在实际内容的区域的背景颜色变成黑色,可以得到图6(b)所示网页;由图6(b)可见,若网页中的文字的颜色与预设颜色(黑色)不同,则文字的颜色与对应区域的背景颜色叠加后得到的该区域的实际颜色亦与预设颜色(黑色)不同,若网页中存在图片,则该图片会完全覆盖该区域的背景颜色,导致该区域的实际颜色只能表现为图片中的颜色不便于颜色对比;因此,本发明实施例在图6(b)所示处理结果的基础上,通过删除网页中的图片内容、将网页中的文字的颜色设置为与背景颜色相同的预设颜色(黑色),得到图6(c)所示的处理结果;由图6(c)可见,最终处理得到的网页中存在实际内容的区域统一显示为纯黑色块,利于后续步骤的执行。For example, the black color is the preset color, and the step S22 is performed on the webpage shown in FIG. 6(a), and the background color of the area where the actual content exists in the webpage becomes black, and the webpage shown in FIG. 6(b) can be obtained; It can be seen from FIG. 6(b) that if the color of the text in the webpage is different from the preset color (black), the actual color of the area obtained by superimposing the color of the text and the background color of the corresponding area is also the preset color (black). Differently, if there is a picture in the webpage, the picture will completely cover the background color of the area, and the actual color of the area can only be expressed as the color in the picture is not convenient for color comparison; therefore, the embodiment of the present invention is shown in FIG. 6(b). On the basis of the processing result shown in the figure, the processing result shown in FIG. 6(c) is obtained by deleting the picture content in the webpage and setting the color of the text in the webpage to the preset color (black) which is the same as the background color; It can be seen from FIG. 6(c) that the area where the actual content exists in the final processed webpage is uniformly displayed as a pure black block, which is advantageous for the execution of the subsequent steps.
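The color normalization described above can be sketched in a few lines. This is a minimal illustration only, assuming a hypothetical in-memory `Region` model (not part of the original disclosure) in which each content-bearing area records its background color, an optional text color, and an optional picture; the names `Region`, `mark_region`, and `mark_page` are invented for the sketch.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

PRESET_COLOR = "#000000"  # assumption: black, as in the FIG. 6 example


@dataclass(frozen=True)
class Region:
    """Hypothetical model of one area of a webpage that holds actual content."""
    background: str
    text_color: Optional[str] = None  # None when the area holds no text
    image: Optional[str] = None       # None when the area holds no picture


def mark_region(region: Region, preset: str = PRESET_COLOR) -> Region:
    """Return a copy of the area that renders as a solid block of the preset color."""
    return replace(
        region,
        background=preset,
        # Text takes the preset color so it cannot change the block's color.
        text_color=preset if region.text_color is not None else None,
        # Pictures are deleted so they cannot cover the background.
        image=None,
    )


def mark_page(regions: List[Region], preset: str = PRESET_COLOR) -> List[Region]:
    """Apply the marking to every content-bearing area of a page."""
    return [mark_region(region, preset) for region in regions]


page = [
    Region(background="#ffffff", text_color="#333333"),  # a text area
    Region(background="#ffffff", image="banner.png"),    # a picture area
]
marked = mark_page(page)
```

After marking, every area in `marked` has the preset background, text (where present) in the preset color, and no picture, which corresponds to the solid blocks of FIG. 6(c).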
In a feasible embodiment of the present invention, the method shown in FIG. 2 may be used to implement the determination in S23 of whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested. That is: calculate the total area M1 of the areas in the preset webpage whose background color is the preset color and the total area M2 of the areas in the webpage to be tested whose background color is the preset color, and calculate the ratio M1/M2. If M1/M2 is within a preset range, it is determined that these areas match between the two webpages; otherwise it is determined that they do not match and that a filtering problem exists. Correspondingly, after it is determined that the webpage to be tested has a filtering problem, the type of the filtering problem (filtering failure or false filtering) may be further determined by the method shown in FIG. 3.
In another feasible embodiment of the present invention, the determination in S23 of whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested may also be implemented by the flow shown in FIG. 7:
S311: Compare whether the colors of the areas corresponding to the same preset comparison point are the same in the preset webpage and the webpage to be tested;
A preset comparison point is a pixel of the webpage whose coordinates equal preset coordinate values. For example, referring to FIG. 8, an xy coordinate system may be established with the upper left corner of the webpage as the origin, the horizontal rightward direction as the x-axis direction, and the vertical downward direction as the y-axis direction. The pixel P1 at coordinates (3,2) may then serve as one preset comparison point, and the pixel P2 at coordinates (8,4) as another. The two areas (pixels) obtained by mapping the same preset comparison point onto the preset webpage and the webpage to be tested form a pair of corresponding areas, and step S311 compares the colors of each such pair. If the colors of the areas corresponding to the same preset comparison point are the same in the preset webpage and the webpage to be tested, the two areas corresponding to that point match; that is, either both contain valid content or neither does.
To ensure detection accuracy, the total number of preset comparison points should not be too small; the specific number may be set according to the requirements of the actual application.
S312: Calculate a first ratio of the number of preset comparison points whose color comparison result is "different" to the total number of preset comparison points;
S313: Determine whether the first ratio is less than a first preset ratio; if so, perform step S314, otherwise perform step S315;
S314: Determine that the areas whose background color is the preset color match between the preset webpage and the webpage to be tested;
S315: Determine that the areas whose background color is the preset color do not match between the preset webpage and the webpage to be tested.
The larger the first ratio, the more preset comparison points have differing colors and, correspondingly, the larger the mismatched area between the preset webpage and the webpage to be tested. The first preset ratio may therefore be set according to the required detection precision (the maximum permitted proportion of the whole webpage that the mismatched area between the preset webpage and the webpage to be tested may occupy). When the first ratio is greater than the first preset ratio, the mismatched area occupies too large a proportion, so it can be determined that the webpage to be tested has a filtering problem; otherwise, it can be determined that the webpage to be tested has no filtering problem.
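Steps S311 to S315 can be sketched as follows. This is an illustrative sketch, not the patented implementation: it assumes each page has already been sampled into a dictionary from comparison-point coordinates to a color string, and the names `first_ratio` and `areas_match`, the sample colors, and the thresholds are all hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]
Colors = Dict[Point, str]  # hypothetical: comparison point -> sampled color


def first_ratio(preset: Colors, tested: Colors) -> float:
    """S311-S312: share of preset comparison points whose colors differ."""
    differing = sum(1 for point in preset if preset[point] != tested[point])
    return differing / len(preset)


def areas_match(preset: Colors, tested: Colors, first_preset_ratio: float) -> bool:
    """S313-S315: match only when the first ratio is below the threshold."""
    return first_ratio(preset, tested) < first_preset_ratio


BLACK, WHITE = "#000000", "#ffffff"
preset_page = {(3, 2): BLACK, (8, 4): WHITE, (0, 0): WHITE, (5, 5): BLACK}
tested_page = {(3, 2): BLACK, (8, 4): BLACK, (0, 0): WHITE, (5, 5): BLACK}

print(first_ratio(preset_page, tested_page))        # 0.25
print(areas_match(preset_page, tested_page, 0.30))  # True
print(areas_match(preset_page, tested_page, 0.20))  # False
```

One of the four sample points differs, so the first ratio is 0.25; whether that counts as a filtering problem depends entirely on the chosen first preset ratio.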
In a feasible embodiment of the present invention, when the method shown in FIG. 7 determines that the areas provided with the first identifier do not match between the preset webpage and the webpage to be tested, that is, that the webpage to be tested has a filtering problem, the following steps may further be performed to determine the specific type of the filtering problem:
Determine whether, in the webpage to be tested, the color of a first area corresponding to a preset comparison point whose color comparison result is "different" is the same as the preset color;
If the color of the first area is the same as the preset color, determine that the first area has a filtering failure problem; otherwise determine that the first area has a false filtering problem.
For example, suppose the color comparison result for preset comparison point P1 (3,2) is "different", that is, the color of the pixel at coordinates (3,2) in the webpage to be tested differs from the color of the pixel at coordinates (3,2) in the preset webpage. On this premise, if the color of the pixel at (3,2) in the webpage to be tested is the same as the preset color, and the color of the corresponding pixel at (3,2) in the preset webpage accordingly differs from the preset color, then an area that contains no actual content in the preset webpage contains actual content in the corresponding area of the webpage to be tested. It can therefore be determined that non-valid content is present in the webpage to be tested at the area corresponding to this preset comparison point, that is, a filtering failure has occurred. Conversely, if the color of the pixel at (3,2) in the webpage to be tested differs from the preset color while the color of the pixel at (3,2) in the preset webpage is the same as the preset color, then an area that contains actual content in the preset webpage contains no actual content in the corresponding area of the webpage to be tested. It can therefore be determined that valid content of the webpage to be tested has been filtered out at the area corresponding to this preset comparison point, that is, false filtering has occurred.
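The per-point decision just described reduces to a small function. The sketch below is hypothetical (the name `classify_point` and the string labels are not from the original text); it takes the two sampled colors at one comparison point plus the preset marker color.

```python
from typing import Optional

BLACK, WHITE = "#000000", "#ffffff"


def classify_point(preset_page_color: str,
                   tested_page_color: str,
                   marker_color: str) -> Optional[str]:
    """Classify one comparison point; None means the two pages agree there."""
    if preset_page_color == tested_page_color:
        return None
    if tested_page_color == marker_color:
        # Content is present in the tested page but absent from the reference:
        # non-valid content was not filtered out.
        return "filtering failure"
    # Content is present in the reference but missing from the tested page:
    # valid content was filtered out by mistake.
    return "false filtering"


print(classify_point(WHITE, BLACK, marker_color=BLACK))  # filtering failure
print(classify_point(BLACK, WHITE, marker_color=BLACK))  # false filtering
print(classify_point(BLACK, BLACK, marker_color=BLACK))  # None
```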
Optionally, in another feasible embodiment of the present invention, based on the principle of the method shown in FIG. 2, the determination in step S23 of whether the areas whose background color is the preset color match between the preset webpage and the webpage to be tested may be implemented by the method flow shown in FIG. 9. Referring to FIG. 9, the method includes the following steps:
S341: For the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested, compare the background color of each area with the preset color;
S342: Record the number M1 of areas in the preset webpage whose background color is the same as the preset color, and the number M2 of areas in the webpage to be tested whose background color is the same as the preset color;
S343: Calculate the ratio M1/M2 of M1 to M2;
S344: Determine whether M1/M2 is within a preset range; if so, perform step S345, otherwise perform step S346;
S345: Determine that the areas whose background color is the preset color match between the preset webpage and the webpage to be tested;
S346: Determine that the areas whose background color is the preset color do not match between the preset webpage and the webpage to be tested.
Strictly speaking, when the areas whose background color is the preset color match exactly between the preset webpage and the webpage to be tested, M1=M2 should hold, that is, M1/M2=1; in other words, the preset range in step S344 should be a single threshold equal to 1. Depending on the detection precision required in practice, however, the preset range may be set to a numerical interval containing 1; the higher the required detection precision, the larger the minimum and the smaller the maximum of the preset range.
Further, when the method shown in FIG. 9 determines that the areas whose background color is the preset color do not match between the preset webpage and the webpage to be tested, that is, that the webpage to be tested has a filtering problem, the following step may also be performed to determine the specific type of the filtering problem:
If M1>M2, determine that the webpage to be tested has false filtering; if M1<M2, determine that the webpage to be tested has a filtering failure.
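The ratio test of steps S343 to S346 and the follow-up classification can be combined in one function. This is a sketch under stated assumptions: the preset range is treated as a closed interval `[low, high]` that contains 1, and the case M2 = 0 (which the text does not discuss) is handled by comparing the counts directly. The name `judge_counts` and the sample numbers are hypothetical.

```python
def judge_counts(m1: int, m2: int, low: float, high: float) -> str:
    """Compare the marked-point counts of the preset page (m1) and tested page (m2)."""
    if m1 == m2:
        return "match"  # M1/M2 = 1 lies in any range containing 1 (also covers 0/0)
    if m2 != 0 and low <= m1 / m2 <= high:
        return "match"  # S344-S345: ratio inside the preset range
    # S346 plus classification: more marked content in the reference page means
    # valid content was removed; less means non-valid content slipped through.
    return "false filtering" if m1 > m2 else "filtering failure"


print(judge_counts(100, 98, low=0.95, high=1.05))   # match
print(judge_counts(100, 60, low=0.95, high=1.05))   # false filtering
print(judge_counts(100, 130, low=0.95, high=1.05))  # filtering failure
```

Narrowing the interval toward 1 corresponds to the stricter detection precision mentioned above.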
To better automate the detection and quickly complete the comparison between the preset webpage and the webpage to be tested, a specific embodiment of the present invention performs the method flow shown in FIG. 10, which is based on interlaced scanning of the webpage, on the webpage to be tested and on the preset webpage respectively, so as to obtain M1 and M2 and thereby implement steps S341 to S342 shown in FIG. 9.
Referring to FIG. 10, the method includes the following steps:
S1: Taking the upper left corner of the webpage to be scanned as the coordinate origin, set the scanning parameters, including: the abscissa X (initial value 0), the ordinate Y (initial value 0), the horizontal scanning step ΔW, the vertical scanning step ΔH, the width W of the webpage, and the height H of the webpage;
S2: Determine whether the color of the preset comparison point at coordinates (X,Y) is the same as the preset color; if so, perform step S3, otherwise perform step S4;
S3: Record the comparison result for the preset comparison point (X,Y) as 1, and perform step S5;
S4: Record the comparison result for the preset comparison point (X,Y) as 0, and perform step S5;
S5: Increase the value of the ordinate Y by one vertical scanning step ΔH;
That is, perform the assignment Y=Y+ΔH.
S6: Determine whether the ordinate Y is greater than H; if so, perform step S7, otherwise return to step S2;
S7: Increase the value of the abscissa X by one horizontal scanning step ΔW, and set the value of the ordinate Y to 0;
That is, perform the assignments X=X+ΔW and Y=0.
S8: Determine whether the abscissa X is greater than W; if so, perform step S9, otherwise return to step S2;
S9: Count the number M of comparison results equal to 1; when the webpage to be scanned is the preset webpage, M=M1, and when the webpage to be scanned is the webpage to be tested, M=M2.
As can be seen, the method of FIG. 10 uses the scan points as the preset comparison points. Adjusting the horizontal scanning step ΔW and/or the vertical scanning step ΔH adjusts the total number of scan points, that is, the number of preset comparison points, which is simple and flexible. In addition, automatically comparing, during the scan, whether the color of the area corresponding to each preset comparison point is the same as the preset color improves processing efficiency.
Optionally, in another feasible embodiment of the present invention, the comparison results of the method shown in FIG. 10 may be stored as a numeric matrix. For example, if the abscissa X takes 20 values and the ordinate Y takes 5 values during the scan, the following numeric matrix of 5 rows and 20 columns is obtained:
0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,1,1,0,0,1,1,1,1,1,0,0,0,1,1,1
In the above numeric matrix, each number corresponds to one scan point, that is, one preset comparison point.
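Stored this way, the count M of step S9 is simply the sum of the matrix entries. The short sketch below parses the 5-row, 20-column example matrix and counts its 1s; the variable names are illustrative.

```python
MATRIX_TEXT = """\
0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,1,1,0,0,1,1,1,1,1,0,0,0,1,1,1
"""

# One inner list per row; each entry is the 0/1 result of one scan point.
matrix = [[int(value) for value in line.split(",")]
          for line in MATRIX_TEXT.splitlines()]

rows, cols = len(matrix), len(matrix[0])
m = sum(sum(row) for row in matrix)  # number of scan points showing the preset color

print(rows, cols, m)  # 5 20 31
```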
FIG. 11 is a flowchart of a webpage data processing method according to another embodiment of the present invention. Referring to FIG. 11, the webpage data processing method of this embodiment includes the following steps:
S31: Obtain the webpage to be tested and the preset webpage corresponding to the webpage address of the webpage to be tested;
S32: Set a border around each area in which actual content exists in the preset webpage and in the webpage to be tested;
The border coincides with the boundary of the area in which the actual content exists. FIG. 12 is a schematic diagram of the webpage of FIG. 6(a) after the border has been set around the area containing the "Column" section. It should be noted that the border used in the embodiments of the present invention is not limited to the dashed box used in FIG. 12.
S33: Determine whether the areas provided with the border match between the preset webpage and the webpage to be tested; if they match, perform step S34, otherwise perform step S35;
S34: Determine that the webpage to be tested has no filtering problem;
S35: Determine that the webpage to be tested has a filtering problem.
Corresponding to the embodiment shown in FIG. 1, the embodiment shown in FIG. 11 uses the border as the first identifier, which marks the areas of a webpage in which actual content exists.
Optionally, the determination in step S33 of whether the areas provided with the border match between the preset webpage and the webpage to be tested may be implemented by the method shown in FIG. 13:
S321: Calculate a second ratio of the area of the non-overlapping portion between the areas provided with the border in the preset webpage and the areas provided with the border in the webpage to be tested to the total area of the areas provided with the border in the preset webpage;
S322: Determine whether the second ratio is less than a second preset ratio; if so, perform step S323, otherwise perform step S324;
S323: Determine that the areas provided with the border match between the preset webpage and the webpage to be tested;
S324: Determine that the areas provided with the border do not match between the preset webpage and the webpage to be tested.
The larger the second ratio, the larger the non-overlapping portion and, correspondingly, the larger the mismatched area between the preset webpage and the webpage to be tested; conversely, the smaller the second ratio, the larger the overlapping portion and the larger the matching area between the preset webpage and the webpage to be tested.
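Steps S321 to S324 can be sketched with axis-aligned rectangles. Two assumptions are made here that the text does not state: bordered areas are rectangles with integer pixel coordinates, and the "non-overlapping portion" is taken to be the symmetric difference of the two sets of covered pixels. The names `covered_cells`, `second_ratio`, and `borders_match` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Rect = Tuple[int, int, int, int]  # hypothetical: (left, top, width, height)
Cell = Tuple[int, int]


def covered_cells(rects: Iterable[Rect]) -> Set[Cell]:
    """Return the unit cells covered by the union of the rectangles."""
    return {(x, y)
            for left, top, width, height in rects
            for x in range(left, left + width)
            for y in range(top, top + height)}


def second_ratio(preset_rects: Iterable[Rect], tested_rects: Iterable[Rect]) -> float:
    """S321: non-overlapping area divided by the bordered area of the preset page."""
    preset = covered_cells(preset_rects)
    tested = covered_cells(tested_rects)
    return len(preset ^ tested) / len(preset)


def borders_match(preset_rects: Iterable[Rect], tested_rects: Iterable[Rect],
                  second_preset_ratio: float) -> bool:
    """S322-S324: match only when the second ratio is below the threshold."""
    return second_ratio(preset_rects, tested_rects) < second_preset_ratio


preset_borders = [(0, 0, 10, 4)]                # one bordered area, 40 cells
tested_borders = [(0, 0, 10, 4), (0, 6, 5, 2)]  # same area plus an extra 10-cell block

print(second_ratio(preset_borders, tested_borders))        # 0.25
print(borders_match(preset_borders, tested_borders, 0.1))  # False
```

Enumerating cells is simple but scales with page area; a production version would more likely use rectangle arithmetic.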
It should be noted that the specific form of the first identifier used in the embodiments of the present invention to mark the areas of a webpage in which actual content exists is not limited to the preset color of the embodiment shown in FIG. 5 or the polygonal border of the embodiment shown in FIG. 11. All other embodiments that implement the marking in other ways and are obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In a feasible embodiment of the present invention, when the method shown in FIG. 13 determines that the areas provided with the first identifier do not match between the preset webpage and the webpage to be tested, that is, that the webpage to be tested has a filtering problem, the following steps may further be performed to determine the specific type of the filtering problem:
When the area of the preset webpage corresponding to a first area of the webpage to be tested that is provided with the border is not itself provided with the border, determine that the first area has a filtering failure;
When the area of the preset webpage corresponding to a second area of the webpage to be tested that is not provided with the border is provided with the border, determine that the second area has false filtering.
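With borders as the marker, the two cases above are the two one-sided set differences. The sketch assumes, hypothetically, that corresponding areas of the two pages share an identifier, so each page is summarized by the set of identifiers of its bordered areas; `classify_bordered_areas` and the sample identifiers are invented for illustration.

```python
from typing import Dict, Set


def classify_bordered_areas(preset_bordered: Set[str],
                            tested_bordered: Set[str]) -> Dict[str, Set[str]]:
    """Split the mismatching areas by type of filtering problem."""
    return {
        # Bordered in the tested page but not in the reference page.
        "filtering failure": tested_bordered - preset_bordered,
        # Bordered in the reference page but not in the tested page.
        "false filtering": preset_bordered - tested_bordered,
    }


preset_bordered = {"news", "column"}
tested_bordered = {"news", "ad-banner"}

problems = classify_bordered_areas(preset_bordered, tested_bordered)
print(problems["filtering failure"])  # {'ad-banner'}
print(problems["false filtering"])    # {'column'}
```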
FIG. 14 is a flowchart of a webpage data processing method according to another feasible embodiment of the present invention, including the following steps:
S41: Obtain the webpage to be tested and the preset webpage corresponding to the webpage address of the webpage to be tested;
S42: Set a first identifier in the areas in which actual content exists in the preset webpage and in the webpage to be tested respectively;
S43: Divide the preset webpage and the webpage to be tested into a plurality of comparison areas in one-to-one correspondence;
FIG. 15 is a schematic diagram of one partitioning result for a preset webpage and a webpage to be tested. The preset webpage is divided into four comparison areas Q1, Q2, Q3, and Q4; correspondingly, the webpage to be tested is also divided into four areas: area Z1 corresponding to Q1, area Z2 corresponding to Q2, area Z3 corresponding to Q3, and area Z4 corresponding to Q4.
S44: For each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, determine whether the areas provided with the first identifier match; if they match, perform step S45, otherwise perform step S46;
Taking FIG. 15 as an example, this means comparing whether the areas provided with the first identifier match in Q1 and Z1, in Q2 and Z2, in Q3 and Z3, and in Q4 and Z4.
S45: Determine that the comparison area belonging to the webpage to be tested has no filtering problem;
S46: Determine that the comparison area belonging to the webpage to be tested has a filtering problem.
In the above technical solution, the preset webpage and the webpage to be tested are partitioned correspondingly, and whether the areas provided with the first identifier match is determined separately for each pair of areas. Compared with taking the entire webpage as the object of comparison, this solution can reduce the detection error.
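A small example shows why per-area comparison can reduce detection error. The sketch assumes each page has already been reduced to a 0/1 grid of marked scan points, splits each grid into four quadrants (an assumed layout: top-left, top-right, bottom-left, bottom-right, numbered 1 to 4), and applies a mismatch-share threshold to each pair. All names and numbers are hypothetical.

```python
from typing import List

Grid = List[List[int]]


def mismatch_ratio(a: Grid, b: Grid) -> float:
    """Share of positions at which the two equally sized grids differ."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x != y for x, y in pairs) / len(pairs)


def quadrants(grid: Grid) -> List[Grid]:
    """Split a grid into four comparison areas (assumed quadrant layout)."""
    mid_r, mid_c = len(grid) // 2, len(grid[0]) // 2
    return [
        [row[:mid_c] for row in grid[:mid_r]],  # area 1: top-left
        [row[mid_c:] for row in grid[:mid_r]],  # area 2: top-right
        [row[:mid_c] for row in grid[mid_r:]],  # area 3: bottom-left
        [row[mid_c:] for row in grid[mid_r:]],  # area 4: bottom-right
    ]


def problem_areas(preset: Grid, tested: Grid, threshold: float) -> List[int]:
    """Return the 1-based numbers of the comparison areas that do not match."""
    return [index + 1
            for index, (q, z) in enumerate(zip(quadrants(preset), quadrants(tested)))
            if mismatch_ratio(q, z) >= threshold]


preset_grid = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
tested_grid = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

print(mismatch_ratio(preset_grid, tested_grid))      # 0.125, below a 0.2 threshold
print(problem_areas(preset_grid, tested_grid, 0.2))  # [4]
```

Over the whole page the mismatch share (0.125) stays under the threshold, yet the bottom-right area differs at half of its points, so only the partitioned comparison reports it.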
From the description of the above method embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Corresponding to the embodiments of the webpage data processing method provided by the present invention, the present invention further provides a webpage data processing apparatus.
FIG. 16 is a schematic structural diagram of a webpage data processing apparatus according to a feasible embodiment of the present invention. Referring to FIG. 16, the webpage data processing apparatus includes a webpage obtaining unit 810, a webpage marking unit 820, a webpage matching unit 830, and a result determining unit 840.
The webpage obtaining unit 810 is configured to obtain the webpage to be tested and the preset webpage corresponding to the webpage address of the webpage to be tested.
The webpage marking unit 820 is configured to set a first identifier in the areas in which actual content exists in the preset webpage and in the webpage to be tested respectively.
The webpage matching unit 830 is configured to determine whether the areas provided with the first identifier match between the preset webpage and the webpage to be tested.
The result determining unit 840 is configured to determine that the webpage to be tested has no filtering problem when the areas provided with the first identifier match between the preset webpage and the webpage to be tested, and otherwise to determine that the webpage to be tested has a filtering problem.
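The division into four units maps naturally onto a small pipeline. The sketch below is one possible arrangement, not the apparatus itself: each unit is a plain callable supplied by the caller, the class name `WebpageDataProcessor` is hypothetical, and the stub pages, the example.com addresses, and the set-based "marking" exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class WebpageDataProcessor:
    fetch: Callable[[str], Tuple[Any, Any]]  # unit 810: address -> (preset page, tested page)
    mark: Callable[[Any], Any]               # unit 820: page -> marked areas
    match: Callable[[Any, Any], bool]        # unit 830: do the marked areas match?

    def has_filtering_problem(self, address: str) -> bool:
        """Unit 840: report a filtering problem when the marked areas do not match."""
        preset_page, tested_page = self.fetch(address)
        return not self.match(self.mark(preset_page), self.mark(tested_page))


# Stub data: each page is just the set of content blocks it displays.
pages = {
    "http://example.com/a": ({"news", "column"}, {"news", "column"}),
    "http://example.com/b": ({"news"}, {"news", "ad"}),
}

processor = WebpageDataProcessor(
    fetch=pages.__getitem__,
    mark=frozenset,                          # stand-in for real marking
    match=lambda preset, tested: preset == tested,
)

print(processor.has_filtering_problem("http://example.com/a"))  # False
print(processor.has_filtering_problem("http://example.com/b"))  # True
```

Keeping the units as injected callables lets a color-based or border-based marker and matcher be swapped in without changing the decision logic.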
As the above embodiment shows, the embodiment of the present invention obtains the preset webpage and the webpage to be tested that correspond to the same webpage address, sets the first identifier in the areas in which actual content exists in each of them, and, taking the preset webpage as the reference, determines whether the areas provided with the first identifier in the webpage to be tested match those in the preset webpage, and decides from the result whether the webpage to be tested has a filtering problem. With this embodiment, it suffices to set a corresponding preset webpage for each webpage address in order to automatically detect filtering problems in the webpages of multiple websites and multiple webpage addresses. When the layout style and/or framework code of the webpage at some address changes, it suffices to update the preset webpage for that address accordingly in order to continue accurate automatic detection. Compared with manual inspection, this embodiment can therefore detect filtering problems quickly and promptly and improves detection efficiency, and it is particularly suitable where the number of webpages to be tested is very large.
In a feasible embodiment of the present invention, the webpage matching unit 830 may include:
An area calculating unit, configured to calculate a first total area of the areas provided with the first identifier in the preset webpage and a second total area of the areas provided with the first identifier in the webpage to be tested;
A third calculating unit, configured to calculate a third ratio of the first total area to the second total area;
A third determining unit, configured to determine whether the third ratio is within a preset range, to determine that the areas provided with the first identifier match between the preset webpage and the webpage to be tested if it is, and to determine that they do not match otherwise.
In addition, the webpage data processing apparatus may further include a third sub-determining unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, compare the third ratio with the minimum and the maximum of the preset range, determine that the webpage to be tested has a filtering failure when the third ratio is less than the minimum of the preset range, and determine that the webpage to be tested has false filtering when the third ratio is greater than the maximum of the preset range.
在本发明的另一个可行实施例中,网页标记单元820可以包括:In another possible embodiment of the present invention, the webpage marking unit 820 may include:
背景设置单元,用于分别将所述预设网页和待测网页中存在实际内容的区域的背景颜色设置为预设颜色;a background setting unit, configured to respectively set a background color of an area where the actual content exists in the preset webpage and the webpage to be tested as a preset color;
文字处理单元,用于当所述预设网页和/或待测网页中的实际内容为文字时,设置所述文字的颜色为所述预设颜色;a word processing unit, configured to set a color of the text to be the preset color when the actual content in the preset webpage and/or the webpage to be tested is a text;
图片处理单元,用于当所述预设网页和/或待测网页中的实际内容为图片时,删除所述图片。The picture processing unit is configured to delete the picture when the actual content in the preset webpage and/or the webpage to be tested is a picture.
Correspondingly, in the foregoing embodiment, the webpage matching unit 830 may include:
a color comparison unit, configured to compare whether the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested have the same color;
a first calculating unit, configured to calculate a first ratio of the number of preset comparison points whose color comparison result is "different" to the total number of preset comparison points;
a first determining unit, configured to determine whether the first ratio is less than a first preset ratio, determine that the areas provided with the first identifier in the preset webpage and in the webpage to be tested do not match when the first ratio is greater than the first preset ratio, and otherwise determine that those areas match.
In addition, the webpage data processing apparatus provided in the foregoing embodiment may further include: a first sub-determining unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, determine whether the color of a first area in the webpage to be tested, namely the area corresponding to a preset comparison point whose color comparison result is "different", is the same as the preset color; determine that the first area has a filtering failure when the color of the first area is the same as the preset color; and otherwise determine that the first area has false filtering.
In another possible embodiment of the present invention, the webpage marking unit 820 may include:
a border setting unit, configured to set a border (polygon frame) on each area where actual content exists in the preset webpage and in the webpage to be tested, where the border coincides with the boundary of the area where the actual content exists.
Correspondingly, the webpage matching unit 830 may include:
a second calculating unit, configured to calculate a second ratio of the area of the portion in which the polygon frames of the preset webpage and of the webpage to be tested do not overlap to the total area of the polygon frames in the preset webpage;
a second determining unit, configured to determine that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match when the second ratio is less than a second preset ratio, and otherwise determine that those areas do not match.
In addition, the webpage data processing apparatus provided in the foregoing embodiment may further include: a second sub-determining unit, configured to perform the following determination after the result determining unit determines that the webpage to be tested has a filtering problem:
if, in the preset webpage, the area corresponding to a first area of the webpage to be tested in which the border is set has no border, determine that the first area has a filtering failure; if, in the preset webpage, the area corresponding to a second area of the webpage to be tested in which no border is set has the border, determine that the second area has false filtering.
Generally, the webpage matching unit 830 takes the entire webpage as its object when determining whether there is a match. In another possible embodiment of the present invention, the webpage data processing apparatus may further include: an area dividing unit, configured to divide the preset webpage and the webpage to be tested into a plurality of comparison areas in one-to-one correspondence; correspondingly, the webpage matching unit 830 includes: a first sub-matching unit, configured to determine, for each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, whether the areas provided with the first identifier match.
In the foregoing embodiment, partitioning the webpage to be tested and the preset webpage and determining separately whether each area matches reduces the errors introduced by factors such as numerical calculation during the determination, and improves detection accuracy.
For convenience of description, the above apparatus is described as divided by function into various units. Of course, when the present invention is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
In addition, the present invention further provides a computer readable medium having processor-executable program code which, when executed, causes the processor to perform the following steps:
acquiring a webpage to be tested and a preset webpage corresponding to the webpage address of the webpage to be tested;
setting a first identifier in each area where actual content exists in the preset webpage and in the webpage to be tested;
determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match;
if the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, determining that the webpage to be tested has no filtering problem; otherwise, determining that the webpage to be tested has a filtering problem.
In a possible embodiment of the present invention, determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match includes: calculating a first total area of the areas provided with the first identifier in the preset webpage and a second total area of the areas provided with the first identifier in the webpage to be tested; calculating a third ratio between the first total area and the second total area; determining whether the third ratio is within a preset range; and, if the third ratio is within the preset range, determining that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, and otherwise determining that they do not match.
In addition, after it is determined that the webpage to be tested has a filtering problem, the following steps may also be performed: if the third ratio is less than the minimum value of the preset range, determining that the webpage to be tested has a filtering failure; if the third ratio is greater than the maximum value of the preset range, determining that the webpage to be tested has false filtering.
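As a non-limiting illustration, the area-ratio determination described above may be sketched in Python as follows. The function name, the default range bounds, and the handling of a zero area are assumptions made for the example. Taking the third ratio as the first total area divided by the second total area is also an assumption; it is chosen because it is consistent with a ratio below the range indicating a filtering failure (unfiltered content enlarges the second total area).

```python
def classify_by_area_ratio(first_total_area, second_total_area,
                           lower=0.95, upper=1.05):
    """Classify the webpage to be tested from the two marked areas.

    first_total_area:  marked area in the preset webpage
    second_total_area: marked area in the webpage to be tested
    lower, upper:      hypothetical bounds of the preset range
    """
    if second_total_area == 0:
        # Nothing is marked in the tested page: either both pages are empty,
        # or all content was removed (assumed here to be false filtering).
        return "ok" if first_total_area == 0 else "false_filtering"
    third_ratio = first_total_area / second_total_area
    if lower <= third_ratio <= upper:
        return "ok"  # the marked areas match, no filtering problem
    if third_ratio < lower:
        return "filter_failure"  # tested page shows more content than expected
    return "false_filtering"  # tested page shows less content than expected
```

For instance, with a preset webpage whose marked area is 1000 units, a tested page with 1500 marked units falls below the range and is reported as a filtering failure, while one with 600 marked units falls above it and is reported as false filtering.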
In another possible embodiment of the present invention, setting a first identifier in each area where actual content exists in the preset webpage and in the webpage to be tested includes: setting the background color of each area where actual content exists in the preset webpage and in the webpage to be tested to a preset color; when the actual content is text, setting the color of the text to the preset color; and when the actual content is a picture, deleting the picture.
Correspondingly, determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match includes: comparing whether the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested have the same color; calculating a first ratio of the number of preset comparison points whose color comparison result is "different" to the total number of preset comparison points; determining whether the first ratio is less than a first preset ratio; and, if the first ratio is less than the first preset ratio, determining that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, and otherwise determining that they do not match.
In addition, after it is determined that the webpage to be tested has a filtering problem, the following steps may also be performed: determining whether the color of a first area in the webpage to be tested, namely the area corresponding to a preset comparison point whose color comparison result is "different", is the same as the preset color; if the color of the first area is the same as the preset color, determining that the first area has a filtering failure, and otherwise determining that the first area has false filtering.
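The color-based comparison described above may be sketched, purely for illustration, as the following Python function. Representing each page as a mapping from a preset comparison point to the color found there, the name `compare_points`, the preset color value, and the default first preset ratio are all assumptions of the example.

```python
PRESET_COLOR = (255, 0, 0)  # hypothetical preset color used for marking

def compare_points(preset_colors, tested_colors, first_preset_ratio=0.05,
                   preset_color=PRESET_COLOR):
    """Compare two marked pages at the same preset comparison points.

    preset_colors / tested_colors: dict mapping a comparison point, e.g. an
    (x, y) tuple, to the color at that point in the respective page.
    Returns (matches, first_ratio, problems).
    """
    if not preset_colors:
        raise ValueError("at least one preset comparison point is required")
    # Points whose color differs between the two pages.
    different = [p for p in preset_colors
                 if preset_colors[p] != tested_colors[p]]
    first_ratio = len(different) / len(preset_colors)
    matches = first_ratio < first_preset_ratio
    problems = {}
    if not matches:
        for p in different:
            # Marked content in the tested page where the preset page has
            # none: content that should have been filtered is still there.
            if tested_colors[p] == preset_color:
                problems[p] = "filter_failure"
            else:
                problems[p] = "false_filtering"
    return matches, first_ratio, problems
```

The per-point classification is only computed when the pages do not match, which mirrors the order of the steps in the description.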
In another possible embodiment of the present invention, setting a first identifier in each area where actual content exists in the preset webpage and in the webpage to be tested includes: setting a border on each area where actual content exists in the preset webpage and in the webpage to be tested, where the border coincides with the boundary of the area where the actual content exists.
Correspondingly, determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match includes: calculating a second ratio of the area of the portion in which the bordered areas of the preset webpage and the bordered areas of the webpage to be tested do not overlap to the total area of the bordered areas of the preset webpage; determining whether the second ratio is less than a second preset ratio; and, if the second ratio is less than the second preset ratio, determining that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, and otherwise determining that they do not match.
In addition, after it is determined that the webpage to be tested has a filtering problem, the following steps may also be performed: when, in the preset webpage, the area corresponding to a first area of the webpage to be tested in which the border is set has no border, determining that the first area has a filtering failure; when, in the preset webpage, the area corresponding to a second area of the webpage to be tested in which no border is set has the border, determining that the second area has false filtering.
In another possible embodiment of the present invention, before determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, the following step may also be performed: dividing the preset webpage and the webpage to be tested into a plurality of comparison areas in one-to-one correspondence.
Correspondingly, determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match includes: determining, for each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, whether the areas provided with the first identifier match.
In addition, referring to FIG. 17, a webpage data processing apparatus provided by another embodiment of the present invention includes a processor 101 and a computer readable medium 102, where the computer readable medium 102 stores program code executable by the processor 101, and the processor 101 reads the program code in the computer readable medium 102 to implement the foregoing steps or unit functions.
In addition, it should be understood that the computer readable medium (for example, memory) described herein may be volatile memory or non-volatile memory, or may include both. By way of example and not limitation, non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which may act as external cache memory. By way of example and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to include, without being limited to, these and other suitable types of memory.
FIG. 18 is a schematic diagram of a webpage data processing apparatus according to a first embodiment of the present invention. As shown in FIG. 18, the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40.
The first obtaining unit 10 is configured to obtain the uniform resource locator of the webpage to be tested.
The browser may be a browser on a personal computer (PC) or a browser on a mobile terminal, and a user may enter a Uniform Resource Locator (URL) in the browser. The URL is obtained in order to determine whether advertisement filtering is required.
The first matching unit 20 is configured to match the URL against the keywords of the advertisement filtering rules.
After the input URL is obtained, it may be matched against the keywords of the advertisement filtering rules. The URL may first be segmented: for example, the URL is passed to a segmenter in which predetermined rules are set, and the segmenter splits the URL into a plurality of segment strings. The segment strings are then passed to a keyword matcher, which matches them against its preset keywords and determines, one by one, whether each segment string hits a keyword in the keyword matcher. A keyword may correspond to a plurality of advertisement filtering rules, so that when a keyword matches the URL, only the advertisement filtering rules corresponding to that keyword need to be matched against the URL, rather than every advertisement filtering rule.
The second matching unit 30 is configured to match the URL against the advertisement filtering rules corresponding to the keyword when the URL matches the keyword.
When the URL matches a keyword, the URL is matched against the advertisement filtering rules corresponding to that keyword, and need not be matched against all of the advertisement filtering rules.
When the URL is matched against the advertisement filtering rules corresponding to the keyword, the URL may be a URL that has matched the keyword. Specifically, the segment string of the URL that matched the keyword may be passed to a rule matcher, which stores the correspondence between keywords and advertisement filtering rules. In the rule matcher, the segment string may first be matched against the whitelist advertisement filtering rules and then against the blacklist advertisement filtering rules, where the whitelist is the list of advertisement filtering rules under which matching resources are not filtered, and the blacklist is the list of advertisement filtering rules under which matching resources are filtered. If a whitelist rule is matched, the resource at the URL to which the segment string belongs may be requested; if a blacklist rule is matched, that resource need not be requested. If neither is matched, the next segment string may be matched in the same manner.
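The whitelist-before-blacklist order, restricted to the rules associated with the matched keyword, may be illustrated by the following minimal Python sketch. The table `RULES_BY_KEYWORD`, its example patterns, the function name `decide`, and the choice to test the rules against the full URL are hypothetical; the description does not prescribe a particular data structure.

```python
import re

# Hypothetical correspondence between a keyword and its filtering rules,
# already compiled to regular expressions and split into the two lists.
RULES_BY_KEYWORD = {
    "ads": {
        "white": [re.compile(r"example\.com/ads/allowed")],
        "black": [re.compile(r"/ads/")],
    },
}

def decide(url, matched_keyword, rules_by_keyword=RULES_BY_KEYWORD):
    """Return 'request', 'block', or None when no rule for the keyword matches."""
    rules = rules_by_keyword.get(matched_keyword)
    if rules is None:
        return None
    # Whitelist rules are checked first: a hit means the resource is fetched.
    if any(pattern.search(url) for pattern in rules["white"]):
        return "request"
    # Blacklist rules are checked second: a hit means the request is skipped.
    if any(pattern.search(url) for pattern in rules["black"]):
        return "block"
    return None  # caller moves on to the next segment string
```

Only the rules stored under the matched keyword are evaluated, so the cost of one decision depends on the size of that keyword's rule list rather than on the total number of rules.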
Matching the URL in the rule matcher may consist of first converting the advertisement filtering rules corresponding to the matched keyword into regular expressions, and then querying the advertisement filtering rules through a regular expression interface to determine whether the URL matches an advertisement filtering rule.
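One way such a conversion could look is sketched below in Python. It is a simplified illustration, not the conversion used by any particular filter engine: it assumes the rules follow the common Adblock Plus conventions in which `*` is a wildcard, a leading `||` anchors the rule at the start of a domain name, and a leading or trailing `|` anchors it at the start or end of the address. Options, exception prefixes, and the `^` separator are not handled, and the function name `rule_to_regex` is hypothetical.

```python
import re

def rule_to_regex(rule):
    """Convert a simplified adblock-style rule into a compiled regex."""
    prefix, suffix = "", ""
    if rule.startswith("||"):
        # Domain anchor: a lowercase scheme, then optional subdomains.
        prefix = r"^[a-z][a-z0-9+.-]*://([^/]+\.)?"
        rule = rule[2:]
    elif rule.startswith("|"):
        prefix = "^"  # anchor at the start of the address
        rule = rule[1:]
    if rule.endswith("|"):
        suffix = "$"  # anchor at the end of the address
        rule = rule[:-1]
    # Escape the literal parts and turn each '*' into '.*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(prefix + body + suffix)

# Usage: query the compiled rule through the regular expression interface.
pattern = rule_to_regex("||ads.example.com/banner*")
is_ad = pattern.search("http://ads.example.com/banner1.png") is not None
```

Compiling each rule once and caching the compiled pattern per keyword would avoid repeating the conversion for every URL.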
The filtering unit 40 is configured to perform advertisement filtering using the advertisement filtering rule when the URL matches an advertisement filtering rule corresponding to the keyword.
After the URL is matched against the advertisement filtering rules corresponding to the keyword, if the URL that matched the keyword also matches one of those advertisement filtering rules, the matched advertisement filtering rule may be output and used for advertisement filtering. That is, if the resource requested by the URL is determined to be an advertisement, the browser does not need to request the resource.
According to this embodiment of the present invention, the URL is first matched against the keywords of the advertisement filtering rules, and the URL that matched a keyword is then matched only against the advertisement filtering rules corresponding to that keyword. This avoids matching the URL against every advertisement filtering rule one by one and reduces the number of rules to be matched, which solves the problem that each round of advertisement filtering takes a long time when the number of filtering rules is large, ensures that advertisements are filtered effectively, and reduces advertisement filtering time.
For example, suppose there are 20,000 advertisement filtering rules. Existing advertisement filtering has to match the URL against the 20,000 rules one by one, and performs advertisement filtering if some rule matches. In this embodiment of the present invention, the URL is first matched against the keywords of the advertisement filtering rules; if the matched keyword A corresponds to 100 advertisement filtering rules, the URL only needs to be matched against those 100 rules, which greatly reduces the matching time.
In the embodiments of the present invention, the webpage data processing apparatus may be used for advertisement filtering in a PC browser or in a browser on a mobile terminal; its functions may be implemented by the PC or the mobile terminal itself, or by a cloud server (such as middleware). The apparatus is particularly beneficial when the number of advertisement filtering rules that the mobile terminal can support is limited.
Preferably, the webpage data processing apparatus includes an incoming unit and a segmenting unit. The incoming unit is configured to pass the URL to the segmenter after the URL of the webpage to be tested entered in the browser is obtained. The segmenting unit is configured to segment the URL in the segmenter to obtain a plurality of segment strings. The first matching unit includes a second matching module, configured to match the plurality of segment strings one by one against the keywords in the keyword matcher. The segmenter is used to segment the URL.
Segmentation in the segmenter may follow preset segmentation rules, which may include: first, splitting the URL with "/" as the separator, so that the first segment is the domain name and the remaining segments are path segments; then, further splitting the domain name segment with "." as the separator into domain name segments of each level; and finally, further splitting each non-domain-name segment at special characters, where the special characters include ‘.’, ‘_’, ‘-’, ‘?’, ‘:’, ‘=’, ‘;’, ‘&’, ‘+’, and the like. Segmenting the URL according to predetermined rules further ensures the effectiveness of advertisement filtering.
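The three segmentation steps above may be sketched in Python as follows. Several points are assumptions of the example rather than statements of the description: the scheme (such as `http://`) is stripped before splitting so that the first segment is the domain name, each domain level becomes its own segment, and the separator characters are dropped rather than kept attached to a segment. A real segmenter would need to follow whichever convention the keyword extractor uses. The name `segment_url` is hypothetical.

```python
import re

SPECIAL_CHARS = "._-?:=;&+"  # special characters listed in the description
_SPLITTER = re.compile("[" + re.escape(SPECIAL_CHARS) + "]")

def segment_url(url):
    """Split a URL into segment strings following the three steps."""
    # Assumption: drop the scheme so the first "/"-separated part is the domain.
    rest = url.split("://", 1)[-1]
    # Step 1: split on "/"; the first part is the domain, the rest are paths.
    parts = [p for p in rest.split("/") if p]
    if not parts:
        return []
    domain, path_parts = parts[0], parts[1:]
    # Step 2: split the domain name on ".".
    segments = [s for s in domain.split(".") if s]
    # Step 3: split every non-domain part at the special characters.
    for part in path_parts:
        segments.extend(s for s in _SPLITTER.split(part) if s)
    return segments
```

For the address `http://www.example.com/ads/banner_1.jpg?x=1`, this sketch yields the segments `www`, `example`, `com`, `ads`, `banner`, `1`, `jpg`, `x`, and `1`.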
FIG. 19 is a schematic diagram of a webpage data processing apparatus according to a second embodiment of the present invention. This embodiment may serve as a preferred implementation of the foregoing embodiment. As shown in FIG. 19, the webpage data processing apparatus includes a first obtaining unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40. The webpage data processing apparatus further includes a second obtaining unit 50 and an establishing unit 60, and the first matching unit 20 includes an obtaining module 201 and a first judging module 202.
The second obtaining unit 50 is configured to obtain the keywords corresponding to the advertisement filtering rules before the URL is matched against the keywords of the advertisement filtering rules.
Before the URL is matched against the keywords of the advertisement filtering rules, the keywords in the keyword matcher may be initialized. Specifically, the keywords corresponding to the advertisement filtering rules are obtained first, for example by extracting them from the advertisement filtering rule file, so that once the URL matches a keyword, the advertisement filtering rules corresponding to that keyword can be looked up.
The establishing unit 60 is configured to build a dictionary tree of the keywords corresponding to the advertisement filtering rules.
The dictionary tree, that is, the trie, is a lookup method in which keys are stored in distributed form. Its basic idea is to record the prefix information of all keywords in a table, so that the number of comparisons during a query can be greatly reduced. This method is especially suitable when the number of keywords is large. Building a dictionary tree of the keywords corresponding to the advertisement filtering rules organizes the keywords, and the trie further reduces the time consumed by advertisement filtering.
To achieve the fastest lookup with the trie, the keywords may be stored sequentially to speed up the search. The nodes of the trie contain empty links (null pointers), which indicate that there is no keyword at the current position of the trie, so that the lookup can be performed as quickly as possible.
The obtaining module 201 is configured to obtain the keywords in the dictionary tree.
After the dictionary tree is built, matching the URL against the keywords may start by obtaining the keywords in the dictionary tree, so that the URL can be matched against them.
The first judging module 202 is configured to determine whether the URL matches a keyword in the dictionary tree.
Determining whether the URL matches a keyword in the dictionary tree means matching the URL using the dictionary tree of keywords. When a segment string of the URL is passed to the keyword matcher, the keyword matcher of the advertisement filtering rules looks up in the trie whether the segment string received from the URL segmenter matches a keyword, where a match may be an exact match or a partial match. An exact match means that the segment string is identical to a keyword; a partial match means that a keyword is a prefix of the segment string. For example, if the keywords in the trie include "as", the lookup reports a successful match when the segment string is "as" or "ask". When a keyword of the advertisement filtering rules is found in the trie, the corresponding advertisement filtering rules can be located through that keyword and used for advertisement filtering.
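The exact and partial (prefix) matching described above may be illustrated with the following minimal Python trie. It is a sketch under stated assumptions: nodes are plain dictionaries rather than the sequentially stored nodes with null links mentioned earlier, and the class name `KeywordTrie`, its methods, and the end marker are hypothetical.

```python
_END = ""  # marker key: a keyword ends at this node (never a real character)

class KeywordTrie:
    """Stores keywords and finds one that equals or prefixes a segment."""

    def __init__(self):
        self.root = {}

    def insert(self, keyword):
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[_END] = keyword

    def match(self, segment):
        """Return the shortest keyword that equals the segment or is a
        prefix of it, or None if there is no such keyword."""
        node = self.root
        for ch in segment:
            if _END in node:
                return node[_END]  # partial match: keyword is a prefix
            node = node.get(ch)
            if node is None:
                return None  # no keyword continues along this path
        return node.get(_END)  # exact match, if a keyword ends here

# Usage with the example from the description.
trie = KeywordTrie()
trie.insert("as")
```

With the keyword "as" inserted, `trie.match("as")` and `trie.match("ask")` both return "as", while `trie.match("a")` returns None because the segment is shorter than the keyword.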
According to this embodiment of the present invention, using the dictionary tree of keywords to match the URL against the keywords reduces the time the URL spends on keyword matching, which further reduces the advertisement filtering time.
Preferably, the second obtaining unit 50 includes a reading module and an extraction module. The reading module is configured to read the advertisement filtering rule file. The extraction module is configured to extract keywords from the advertisement filtering rule file. The establishing unit 60 includes a first establishing module and a second establishing module. The first establishing module is configured to establish the correspondence between keywords and advertisement filtering rules. The second establishing module is configured to build the dictionary tree from the extracted keywords.
Specifically, the advertisement filtering rule file may first be read from disk into memory on the PC, the mobile terminal, or the cloud server. Keywords are then extracted from the file, and the correspondence between keywords and advertisement filtering rules is established. The rules for extracting keywords from the advertisement filtering rule file may include:
1) A keyword does not include characters reserved by the adblock rule syntax, such as '@', '|', and '*'.
2) A keyword does not include the option part of an advertisement filtering rule (an option is the part of an adblock rule that specifies the domain names or resource types to which the rule does or does not apply).
3) The characters a keyword may contain are limited to the digits 0 to 9, the 26 letters a to z, and '.', '_', '-', '?', ':', '=', ';', '&', '+', and the like.
4) When a keyword is selected from an advertisement filtering rule, the rule begins either with a domain name or with a special character; the special characters include '.', '_', '-', '?', ';', '=', ':', '/', '&', '+', and the like.
5) A keyword is at least 3 and fewer than 32 characters long.
6) Strings that occur frequently in urls, such as http, https, .html, and .jpg, cannot serve as keywords on their own.
7) No keyword is extracted from rules written as regular expressions.
Specifically, the key (keyword) extraction procedure includes: traversing the string in the advertisement filtering rule file until the first character belonging to the above extraction rule set is found, which is recorded as the start position of the keyword; and continuing the traversal until the string ends or the next character named in the above extraction rules is reached, which is recorded as the end position.
The characters between the start position and the end position are taken as a candidate keyword. The candidate is checked against extraction conditions 4), 5), and 6) above; if it satisfies them, it is returned as the final keyword.
After a keyword has been returned, it can be checked whether the string in the advertisement filtering rule file has ended. If it has, the procedure reports that there is no suitable keyword; otherwise, keyword extraction continues.
When no suitable keyword can be extracted from an advertisement filtering rule, that rule is added to the global queue; the rules in the global queue are those not associated with any keyword. Every url must be matched against the rules in the global queue. An inspection of the actual advertisement filtering rules in adblock shows that rules from which no qualifying keyword can be extracted are very rare: of the 11285 rules examined so far, no more than 20 yield no keyword.
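A minimal Python sketch of keyword extraction and the global queue follows, under stated assumptions: the option part is assumed to begin at `$` and regular-expression rules are assumed to be written as `/.../`, as in Adblock Plus filter syntax; condition 4) is not enforced; and the names `extract_keyword` and `build_index` are hypothetical.

```python
import re

# Characters a keyword may contain (condition 3).
ALLOWED = re.compile(r"[0-9a-z._\-?:=;&+]+")
# Strings too common in urls to serve as keywords on their own (condition 6).
COMMON = {"http", "https", ".html", ".jpg"}


def extract_keyword(rule):
    """Return the first qualifying keyword of a rule, or None if there is none."""
    # Assumption: a rule of the form /.../ is a regular expression (condition 7).
    if len(rule) > 1 and rule.startswith("/") and rule.endswith("/"):
        return None
    # Assumption: everything after '$' is the option part (condition 2).
    body = rule.split("$", 1)[0].lower()
    for candidate in ALLOWED.findall(body):
        if 3 <= len(candidate) < 32 and candidate not in COMMON:  # conditions 5, 6
            return candidate
    return None


def build_index(rules):
    """Map keyword -> rules; rules without a keyword go to the global list."""
    index, global_rules = {}, []
    for rule in rules:
        keyword = extract_keyword(rule)
        if keyword is None:
            global_rules.append(rule)
        else:
            index.setdefault(keyword, []).append(rule)
    return index, global_rules
```

In this sketch a rule such as `*ad*` yields only a two-character candidate and therefore lands in the global list, which every url must be checked against.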
FIG. 20 is a schematic diagram of a webpage data processing apparatus according to a third embodiment of the present invention. This embodiment may serve as a preferred implementation of the above embodiments. As shown in FIG. 20, the webpage data processing apparatus includes a first acquisition unit 10, a first matching unit 20, a second matching unit 30, and a filtering unit 40. The first matching unit 20 includes a second determining module 203, and the second matching unit 30 includes a first matching module 301.
The second determining module 203 is configured to determine whether the uniform resource locator matches a keyword of the advertisement filtering rules; if it does, the advertisement filtering rule corresponding to the keyword is converted into a regular expression.
The first matching module 301 is configured to match the uniform resource locator that matched the keyword against the regular expression.
The filtering unit 40 is further configured to, when the uniform resource locator that matched the keyword also matches the regular expression, output the advertisement filtering rule corresponding to the regular expression and perform advertisement filtering with the output rule.
In this embodiment, matching the url in the rule matcher may consist of first converting the advertisement filtering rules corresponding to the matched keyword into regular expressions and then querying the rules through the regular expression interface to determine whether the url matches an advertisement filtering rule. Preferably, a rule is converted into a regular expression only once the url has been found to match the corresponding keyword, so there is no need to convert all advertisement filtering rules into regular expressions when advertisement filtering starts.
Converting all rules at startup takes a certain amount of time, for example about 1.5 s in a mobile terminal browser. In this embodiment, only the advertisement filtering rules corresponding to a matched keyword need to be converted into regular expressions, which avoids the roughly 1.5 s consumed at startup. Because the number of rules corresponding to each keyword is small on average, usually no more than 2 and at most 10, the time needed for conversion and parsing is short. If parsing 10,000 advertisement filtering rules takes 1.5 s, the average parsing time per rule is 0.15 ms, so converting at most 10 rules adds at most 1.5 ms to the matching time. In addition, the parsing result of a rule may be cached after the rule is hit for the first time, so that no parsing cost is incurred afterwards, which further reduces time consumption.
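The on-demand conversion and caching described above might look like the following Python sketch. The conversion is deliberately simplified and is an assumption of this example: only the `*` wildcard is translated, and the names `rule_to_regex` and `matching_rules` are hypothetical.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def rule_to_regex(rule):
    """Compile a simplified rule on first use; later calls reuse the cached result."""
    # Simplifying assumption: '*' is the only wildcard; everything else is literal.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern)


def matching_rules(url, keyword, index):
    """Convert and test only the rules filed under the keyword that was hit."""
    return [rule for rule in index.get(keyword, []) if rule_to_regex(rule).search(url)]


index = {"banner": ["/banner/*.gif", "banner.swf"]}
hits = matching_rules("http://x.example/banner/top.gif", "banner", index)
```

Here `hits` is `["/banner/*.gif"]`, and only the two rules under the hit keyword have been compiled; rules under other keywords are never converted unless a url reaches them.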
An embodiment of the present invention further provides a webpage data processing method. It should be noted that the webpage data processing method of the embodiment may be performed by the webpage data processing apparatus provided by the embodiment, and that apparatus may likewise be used to perform the webpage data processing method provided by the embodiment.
FIG. 21 is a flowchart of a webpage data processing method according to a first embodiment of the present invention. As shown in FIG. 21, the browser webpage data processing method includes the following steps:
Step S402: obtain the uniform resource locator entered in the browser.
The browser may be a browser on a personal computer (PC) or a browser on a mobile terminal. The user may enter the uniform resource locator (url) of the webpage to be tested in the browser. The url is obtained in order to determine whether advertisement filtering is needed.
Step S404: match the uniform resource locator against the keywords of the advertisement filtering rules.
After the entered uniform resource locator is obtained, it can be matched against the keywords of the advertisement filtering rules. The url may first be segmented: for example, the url is passed to a segmenter in which a predetermined rule is set, and the url is segmented to obtain multiple segment strings. The segment strings are then passed to the keyword matcher, which matches them against its preset keywords and determines, one by one, whether each segment string hits a keyword in the matcher. A preset keyword may correspond to multiple advertisement filtering rules, so that when a keyword matches the url, only the rules corresponding to that keyword need to be matched against the url, rather than every advertisement filtering rule.
Step S406: if the uniform resource locator matches a keyword, match the uniform resource locator against the advertisement filtering rules corresponding to the keyword.
When the uniform resource locator matches a keyword, the advertisement filtering rules corresponding to the keyword can be obtained, so that the uniform resource locator is matched only against those rules and need not be matched against all advertisement filtering rules.
The uniform resource locator is matched against the advertisement filtering rules corresponding to the keyword, where the uniform resource locator may be the one that matched the keyword. Specifically, the segment string of the url that matched the keyword may be passed to the rule matcher, which stores the correspondence between keywords and advertisement filtering rules. In the rule matcher, the segment string may first be matched against the whitelist rules and then against the blacklist rules. The whitelist is the list of advertisement filtering rules whose matching resources are not filtered, and the blacklist is the list of advertisement filtering rules whose matching resources are filtered. If a whitelist rule is matched, the resource at the url to which the segment string belongs may be requested; if a blacklist rule is matched, that resource need not be requested. If neither is matched, the next segment string is matched in the same way.
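The whitelist-then-blacklist order can be sketched in Python as below. For brevity the rules are treated as plain substrings, which is an assumption of this example rather than the rule format of the embodiment, and the function name `classify` is hypothetical.

```python
def classify(text, whitelist, blacklist):
    """Check the whitelist first, then the blacklist.

    Returns "allow" (request the resource), "block" (do not request it),
    or None when neither list matches and the next segment should be tried.
    """
    if any(rule in text for rule in whitelist):
        return "allow"
    if any(rule in text for rule in blacklist):
        return "block"
    return None


whitelist = ["example.com"]
blacklist = ["ads."]
```

Because the whitelist is consulted first, `classify("ads.example.com", whitelist, blacklist)` returns `"allow"` even though a blacklist rule would also match, whereas `classify("ads.tracker.net", whitelist, blacklist)` returns `"block"`.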
Matching the url in the rule matcher may consist of first converting the advertisement filtering rules corresponding to the matched keyword into regular expressions and then querying the rules through the regular expression interface to determine whether the url matches an advertisement filtering rule.
Step S408: if the uniform resource locator matches an advertisement filtering rule corresponding to the keyword, perform advertisement filtering using that rule.
After the uniform resource locator has been matched against the advertisement filtering rules corresponding to the keyword, if the url that matched the keyword also matches one of those rules, the matched rule can be output and used for advertisement filtering. That is, if the resource requested by the url is determined to be an advertisement, the browser does not need to request it.
According to this embodiment, the url is first matched against the keywords of the advertisement filtering rules, and the url that matched a keyword is then matched against the rules corresponding to that keyword. This avoids matching the url against every advertisement filtering rule one by one and reduces the number of rules to be matched. It thereby solves the problem that each round of advertisement filtering takes a long time because of the large number of filtering rules, ensures that advertisements are filtered effectively, and reduces the advertisement filtering time.
For example, suppose there are 20000 advertisement filtering rules. Existing advertisement filtering must match the url against the 20000 rules one by one and filters the advertisement if some rule is matched. In this embodiment, the url is first matched against the keywords of the advertisement filtering rules; if the matched keyword A corresponds to 100 rules, the url only needs to be matched against those 100 rules, which greatly reduces the matching time.
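The reduction from 20000 candidate rules to 100 can be illustrated with a small Python sketch. The rule strings are synthetic placeholders, and the even spread over 200 hypothetical keywords is an assumption chosen only so that one keyword maps to 100 rules, as in the example above.

```python
# Synthetic stand-in rules: 20000 rules spread evenly over 200 hypothetical keywords.
rules = [f"kw{i % 200}/ad{i}" for i in range(20000)]

# Index the rules by keyword once, at initialization time.
index = {}
for rule in rules:
    keyword = rule.split("/", 1)[0]
    index.setdefault(keyword, []).append(rule)

# After a url hits the keyword "kw7", only the rules under that keyword are candidates.
candidates = index["kw7"]
print(len(rules), len(candidates))  # prints "20000 100"
```

The url is then tested against `candidates` only, instead of against every entry in `rules`.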
In this embodiment, the webpage data processing method can be used for advertisement filtering in a PC browser or in a browser on a mobile terminal. Its function may be implemented by the PC or the mobile terminal itself, or by a cloud server (such as middleware). The method is particularly beneficial when the number of advertisement filtering rules that the mobile terminal can support is limited.
Preferably, after the uniform resource locator entered in the browser is obtained, the browser webpage data processing method includes: passing the uniform resource locator to a segmenter; and segmenting the uniform resource locator in the segmenter to obtain multiple segment strings. Matching the uniform resource locator against the keywords of the advertisement filtering rules then includes matching the segment strings one by one against the keywords in the keyword matcher. The segmenter is used to segment the uniform resource locator.
In the segmenter, segmentation may follow a preset segmentation rule, which may include: first, the url is split with "/" as the separator, so that the first segment is the domain name and the remaining segments are path segments; then, the domain name segment is further split with "." as the separator into domain-level segments; finally, each non-domain segment is further split at special characters, which include ".", "_", "-", "?", ";", "=", ":", "/", "&", "+", and the like. Segmenting the url according to the predetermined rule helps ensure the effectiveness of advertisement filtering.
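The three-stage segmentation rule can be sketched in Python as follows. The text states that the first segment is the domain name, so this example assumes the scheme (such as `http://`) is removed beforehand; that step and the function name `segment_url` are assumptions of the sketch.

```python
import re

# Special characters used to split the non-domain (path) segments.
SPECIAL = r"[._\-?;=:/&+]"


def segment_url(url):
    """Split a url into domain-level segments followed by path sub-segments."""
    rest = url.split("://", 1)[-1]        # assumption: the scheme is dropped first
    domain, *paths = rest.split("/")      # stage 1: split at "/"
    segments = [s for s in domain.split(".") if s]   # stage 2: split the domain at "."
    for path in paths:                    # stage 3: split each path segment
        segments.extend(s for s in re.split(SPECIAL, path) if s)
    return segments
```

For `http://ads.example.com/banner/top_ad.gif?id=1`, this sketch produces `ads`, `example`, `com`, `banner`, `top`, `ad`, `gif`, `id`, `1`, which are then passed to the keyword matcher in order.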
FIG. 22 is a flowchart of a webpage data processing method according to a second embodiment of the present invention. The browser webpage data processing method of this embodiment may be a preferred implementation of the method of the above embodiment. As shown in FIG. 22, the browser webpage data processing method includes the following steps:
Step S502 is the same as step S402 shown in FIG. 21 and is not described again here.
Step S504: obtain the keywords corresponding to the advertisement filtering rules.
Before the uniform resource locator is matched against the preset keywords, the keywords in the keyword matcher may first be initialized. The initialization may begin by obtaining the keywords corresponding to the advertisement filtering rules, for example by extracting them from the file of advertisement filtering rules, so that once the url matches a keyword, the advertisement filtering rules corresponding to that keyword can be looked up.
Step S506: build a dictionary tree of the keywords corresponding to the advertisement filtering rules.
A dictionary tree, or trie, is a lookup method built on a distributed concept: its basic idea is to record the prefix information of all keywords in a table, so that the number of comparisons during a lookup is greatly reduced. The method is especially suitable when there are many keywords. Building a dictionary tree of the keywords corresponding to the advertisement filtering rules organizes the keywords, and the trie further reduces the time consumed by advertisement filtering.
To make trie lookups as fast as possible, the keywords may be stored sequentially, which speeds up the lookup. The nodes of the trie contain empty links (null pointers); an empty link indicates that there is no keyword at the current position of the trie, which enables the fastest possible lookup.
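One way to read the sequential storage and empty links is an array-based trie node in which each permitted character has a fixed slot and an unused slot holds a null link. The Python sketch below follows that reading; it is an interpretation, not the layout of the embodiment, and the names `ALPHABET`, `SLOT`, `Node`, `insert`, and `has_prefix_keyword` are hypothetical.

```python
# One fixed slot per character a keyword may contain.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz._-?:=;&+"
SLOT = {ch: i for i, ch in enumerate(ALPHABET)}


class Node:
    __slots__ = ("children", "is_keyword")

    def __init__(self):
        self.children = [None] * len(ALPHABET)  # None plays the role of an empty link
        self.is_keyword = False


def insert(root, keyword):
    node = root
    for ch in keyword:
        i = SLOT[ch]
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.is_keyword = True


def has_prefix_keyword(root, segment):
    """True if some stored keyword equals the segment or is a prefix of it."""
    node = root
    for ch in segment:
        i = SLOT.get(ch)
        if i is None:
            return False              # character outside the keyword alphabet
        node = node.children[i]
        if node is None:
            return False              # empty link: no keyword along this path
        if node.is_keyword:
            return True
    return False


root = Node()
insert(root, "banner")
```

Each step is a direct index into a fixed-size list, and a lookup stops as soon as it meets an empty link.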
Step S508: obtain the keywords in the dictionary tree.
After the dictionary tree is built, matching the url against the keywords may begin by obtaining the keywords in the dictionary tree so that the url can be matched against them.
Step S510: determine whether the uniform resource locator matches a keyword in the dictionary tree.
Determining whether the uniform resource locator matches a keyword in the dictionary tree means matching the url against the keyword trie. When a segment string of the url is passed to the keyword matcher, the keyword matcher of the advertisement filtering rules looks up the trie to determine whether that segment string matches some keyword. A match may be an exact match or a partial match: an exact match means that the segment string is identical to a keyword, and a partial match means that a keyword is a prefix of the segment string. For example, if the trie contains the keyword "as", a lookup with the segment string "as" or "ask" returns a successful match. When a keyword is found in the trie, the advertisement filtering rules corresponding to it can be retrieved through the keyword and used for advertisement filtering.
Steps S512 and S514 are the same as steps S406 and S408 shown in FIG. 21, respectively, and are not described again here.
According to this embodiment of the present invention, matching the url against keywords by means of the keyword dictionary tree reduces the time spent on keyword matching and thereby further reduces the advertisement filtering time.
Preferably, obtaining the keywords corresponding to the advertisement filtering rules includes: reading the file of advertisement filtering rules; and extracting the keywords from that file. Building the dictionary tree of the keywords corresponding to the advertisement filtering rules includes: establishing the correspondence between the keywords and the advertisement filtering rules; and building the dictionary tree from the extracted keywords.
Specifically, the file of advertisement filtering rules may first be read from disk into memory on the PC, the mobile terminal, or the cloud server. Keywords are then extracted from the file, and the correspondence between keywords and advertisement filtering rules is created. The rules for extracting keywords from the file may include:
1) A keyword does not include characters reserved by the adblock rule syntax, such as "@", "|", and "*".
2) A keyword does not include the option part of an advertisement filtering rule.
3) The characters a keyword may contain are limited to the digits 0 to 9, the 26 letters a to z, and '.', '_', '-', '?', ':', '=', ';', '&', '+', and the like.
4) When a keyword is selected from an advertisement filtering rule, the rule begins either with a domain name or with a special character; the special characters include '.', '_', '-', '?', ';', '=', ':', '/', '&', '+', and the like.
5) A keyword is at least 3 and fewer than 32 characters long.
6) Strings that occur frequently in urls, such as http, https, .html, and .jpg, cannot serve as keywords on their own.
7) No keyword is extracted from rules written as regular expressions.
Specifically, the key (keyword) extraction procedure includes: traversing the string in the advertisement filtering rule file until the first character belonging to the above extraction rule set is found, which is recorded as the start position of the keyword; and continuing the traversal until the string ends or the next character named in the above extraction rules is reached, which is recorded as the end position.
The characters between the start position and the end position are taken as a candidate keyword. The candidate is checked against extraction conditions 4), 5), and 6) above; if it satisfies them, it is returned as the final keyword.
After a keyword has been returned, it can be checked whether the string in the advertisement filtering rule file has ended. If it has, the procedure reports that there is no suitable keyword; otherwise, keyword extraction continues.
When no suitable keyword can be extracted from an advertisement filtering rule, that rule is added to the global queue; the rules in the global queue are those not associated with any keyword. Every url must be matched against the rules in the global queue. An inspection of the actual advertisement filtering rules in adblock shows that rules from which no qualifying keyword can be extracted are very rare: of the 11285 rules examined so far, no more than 20 yield no keyword.
Preferably, matching the uniform resource locator against the preset keywords includes: determining whether the uniform resource locator matches a preset keyword, and, if it does, converting the advertisement filtering rule corresponding to the keyword into a regular expression. Matching the uniform resource locator that matched the keyword against the advertisement filtering rule corresponding to the keyword includes: matching that uniform resource locator against the regular expression. If the uniform resource locator that matched the keyword also matches the regular expression, the advertisement filtering rule corresponding to the regular expression is output and used for advertisement filtering.
In this embodiment, matching the url in the rule matcher may consist of first converting the advertisement filtering rules corresponding to the matched keyword into regular expressions and then querying the rules through the regular expression interface to determine whether the url matches an advertisement filtering rule. Preferably, a rule is converted into a regular expression only once the url has been found to match the corresponding keyword, so there is no need to convert all advertisement filtering rules into regular expressions when advertisement filtering starts.
Converting all rules at startup takes a certain amount of time, for example about 1.5 s in a mobile terminal browser. In this embodiment, only the advertisement filtering rules corresponding to a matched keyword need to be converted into regular expressions, which avoids the roughly 1.5 s consumed at startup. Because the number of rules corresponding to each keyword is small on average, usually no more than 2 and at most 10, the time needed for conversion and parsing is short. If parsing 10,000 advertisement filtering rules takes 1.5 s, the average parsing time per rule is 0.15 ms, so converting at most 10 rules adds at most 1.5 ms to the matching time. In addition, the parsing result of a rule may be cached after the rule is hit for the first time, so that no parsing cost is incurred afterwards, which further reduces time consumption.
FIG. 23 is a flowchart of a preferred webpage data processing method according to an embodiment of the present invention. As shown in FIG. 23, the browser webpage data processing method includes:
Step S601: enter a url in the browser.
Step S602: pass the url to the segmenter and segment it. In the segmenter, the url is segmented according to a predetermined rule, and all resulting segment strings are saved.
The predetermined rule may be: first, the url is split with "/" as the separator, so that the first segment is the domain name and the remaining segments are path segments; then, the domain name segment is further split with "." as the separator into domain-level segments; finally, each non-domain segment is further split at special characters, which include '.', '_', '-', '?', ':', '=', ';', '&', '+', and the like. Segmenting the url according to the predetermined rule helps ensure the effectiveness of advertisement filtering.
Step S603: pass the segmented url to the keyword matcher. The url segmenter passes the segment strings, in order, to the keyword matcher of the filtering rules.
Step S604: determine, segment by segment, whether a keyword in the keyword matcher is hit. The keyword matcher of the filtering rules determines whether the keyword corresponding to a filtering rule is hit. If there is no hit, step S605 is performed; if there is a hit, step S607 is performed.
Step S605: determine whether any segment string remains unmatched. If so, return to step S603; if not, perform step S606.
Step S606: return False. This indicates that no advertisement filtering is needed and that the resource can be requested.
Step S607: pass the URL corresponding to the hit segment to the rule matcher, and then perform step S608. The rule matcher stores the correspondence between keywords and filtering rules.
Step S608: determine whether the URL hits the blacklist and does not hit the whitelist. Because the URLs that hit the blacklist also include some URLs that do not need to be filtered, a whitelist is set up to match those URLs. If the URL hits the blacklist and does not hit the whitelist, step S610 is performed. Conversely, if the URL does not hit the blacklist and does not hit the whitelist, step S609 is performed.
Step S609: return False.
Step S610: output the corresponding filtering rule.
Step S611: perform advertisement filtering using the corresponding filtering rule.
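Steps S603 to S611 can be sketched in Python as follows. This is an illustrative simplification under stated assumptions, not the implementation of the embodiment: the `Rule` class, the `rules_by_keyword` dictionary standing in for the keyword matcher and the rule matcher, and the substring test used in place of real filtering-rule matching are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: str             # hypothetical: rule "matches" if pattern is a substring of the URL
    whitelist: bool = False  # True for whitelist rules, False for blacklist rules


def match_url(url, segments, rules_by_keyword):
    """Return the blacklist rules to apply, or False if no filtering is needed."""
    for segment in segments:                      # S603 / S605: feed segments one by one
        rules = rules_by_keyword.get(segment)     # S604: does this segment hit a keyword?
        if rules is None:
            continue                              # no hit: try the next segment
        # S607: hand the URL to the rules associated with the hit keyword.
        black = [r for r in rules if not r.whitelist and r.pattern in url]
        white = any(r.whitelist and r.pattern in url for r in rules)
        if black and not white:                   # S608: blacklist hit, whitelist miss
            return black                          # S610: output the filtering rules
        return False                              # S609
    return False                                  # S606: no segment hit any keyword


if __name__ == "__main__":
    rules = {"ads": [Rule("ads.example.com/banner")]}
    url = "ads.example.com/banner/1.js"
    print(match_url(url, ["ads", "example", "com", "banner", "1", "js"], rules))
```

The point of the keyword pre-check is that the more expensive rule matching in S607 and S608 runs only for URLs containing at least one rule keyword; all other URLs fall through to S606 after simple lookups.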
Compared with the advertisement filtering time of the prior art, the embodiments of the present invention can achieve the following effects:
Time spent on advertisement filtering with the traditional approach:
Time spent on advertisement filtering after applying the present invention:
As can be seen from the above tables, the webpage data processing method proposed by the present invention can significantly reduce the time spent on advertisement filtering when a webpage is accessed, improving the user experience.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in other orders or concurrently. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it; to those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The above are only specific implementations of the present application, provided to enable those skilled in the art to understand or implement the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (31)
- A webpage data processing method, comprising: acquiring a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition comprises keywords of advertisement filtering rules and the advertisement filtering rules corresponding to the keywords, or the matching condition comprises a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage having areas in which a first identifier is set in advance; and determining a filtering status of the webpage to be tested according to the matching result.
- The webpage data processing method according to claim 1, wherein: while the webpage to be tested is acquired, the method further comprises acquiring the preset webpage corresponding to the webpage address of the webpage to be tested; before the webpage to be tested is matched against the preset matching condition to obtain the matching result, the method further comprises setting the first identifier on the areas containing actual content in the preset webpage and in the webpage to be tested, respectively; matching the webpage to be tested against the preset matching condition to obtain the matching result comprises determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match; and determining the filtering status of the webpage to be tested according to the matching result comprises: if the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, determining that the webpage to be tested has no filtering problem, and otherwise determining that the webpage to be tested has a filtering problem.
- The webpage data processing method according to claim 2, wherein: the first identifier is a preset color, and setting the first identifier on the areas containing actual content in the preset webpage and in the webpage to be tested, respectively, comprises: setting the background color of the areas containing actual content in the preset webpage and in the webpage to be tested to the preset color; when the actual content is text, setting the color of the text to the preset color; and when the actual content is a picture, deleting the picture; or, the first identifier is a border, and setting the first identifier on the areas containing actual content in the preset webpage and in the webpage to be tested, respectively, comprises: setting a border on the areas containing actual content in the preset webpage and in the webpage to be tested, respectively, wherein the border coincides with the boundary of the area containing actual content.
- The webpage data processing method according to claim 2, wherein determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match comprises: calculating a first total area of the areas provided with the first identifier in the preset webpage and a second total area of the areas provided with the first identifier in the webpage to be tested; calculating a third ratio between the first total area and the second total area; determining whether the third ratio is within a preset range; and if the third ratio is within the preset range, determining that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, and otherwise determining that they do not match.
- The webpage data processing method according to claim 4, wherein, after it is determined that the webpage to be tested has a filtering problem, the method further comprises: if the third ratio is less than the minimum of the preset range, determining that the webpage to be tested has a filtering failure; and if the third ratio is greater than the maximum of the preset range, determining that the webpage to be tested has false filtering.
- The webpage data processing method according to claim 3, wherein, when the first identifier is the preset color, determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match comprises: comparing whether the colors of the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested are the same; calculating a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points; determining whether the first ratio is less than a first preset ratio; and if the first ratio is less than the first preset ratio, determining that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, and otherwise determining that they do not match.
- The webpage data processing method according to claim 6, wherein, after it is determined that the webpage to be tested has a filtering problem, the method further comprises: determining whether the color of a first area in the webpage to be tested, the first area corresponding to a preset comparison point whose color comparison result is "different", is the same as the preset color; and if the color of the first area is the same as the preset color, determining that the first area has a filtering failure, and otherwise determining that the first area has false filtering.
- The webpage data processing method according to claim 3, wherein, when the first identifier is the border, determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match comprises: calculating a second ratio between the area of the non-overlapping portion of the bordered areas in the preset webpage and the bordered areas in the webpage to be tested, and the total area of the bordered areas in the preset webpage; determining whether the second ratio is less than a second preset ratio; and if the second ratio is less than the second preset ratio, determining that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, and otherwise determining that they do not match.
- The webpage data processing method according to claim 8, wherein, after it is determined that the webpage to be tested has a filtering problem, the method further comprises: when the area in the preset webpage that corresponds to a first area provided with the border in the webpage to be tested is not provided with the border, determining that the first area has a filtering failure; and when the area in the preset webpage that corresponds to a second area not provided with the border in the webpage to be tested is provided with the border, determining that the second area has false filtering.
- The webpage data processing method according to any one of claims 2 to 9, wherein, before determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, the method further comprises: dividing the preset webpage and the webpage to be tested into a plurality of comparison areas in one-to-one correspondence; and correspondingly, determining whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match comprises: determining, for each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, whether the areas provided with the first identifier match.
- The webpage data processing method according to claim 1, wherein: acquiring the webpage to be tested comprises acquiring a uniform resource locator of the webpage to be tested; matching the webpage to be tested against the preset matching condition to obtain the matching result comprises: matching the uniform resource locator against the keywords of the advertisement filtering rules, and, if the uniform resource locator matches a keyword, matching the uniform resource locator against the advertisement filtering rule corresponding to the keyword; and determining the filtering status of the webpage to be tested according to the matching result comprises: if the uniform resource locator matches the advertisement filtering rule corresponding to the keyword, performing advertisement filtering using the advertisement filtering rule.
- The webpage data processing method according to claim 11, wherein: before the uniform resource locator is matched against the keywords of the advertisement filtering rules, the method further comprises: acquiring the keywords corresponding to the advertisement filtering rules; and building a dictionary tree of the keywords corresponding to the advertisement filtering rules; and matching the uniform resource locator against the keywords of the advertisement filtering rules comprises: acquiring the keywords in the dictionary tree; and determining whether the uniform resource locator matches a keyword in the dictionary tree.
- The webpage data processing method according to claim 12, wherein: acquiring the keywords corresponding to the advertisement filtering rules comprises: reading a file of the advertisement filtering rules; and extracting the keywords from the file of the advertisement filtering rules; and building the dictionary tree of the keywords corresponding to the advertisement filtering rules comprises: establishing the correspondence between the keywords and the advertisement filtering rules; and building the dictionary tree from the extracted keywords.
- The webpage data processing method according to claim 11, wherein: matching the uniform resource locator against the keywords of the advertisement filtering rules comprises: determining whether the uniform resource locator matches a keyword of the advertisement filtering rules, wherein, if it is determined that the uniform resource locator matches a keyword of the advertisement filtering rules, the advertisement filtering rule corresponding to the keyword is converted into a regular expression; matching the uniform resource locator against the advertisement filtering rule corresponding to the keyword comprises: matching the uniform resource locator against the regular expression; and, if the uniform resource locator matches the regular expression, the advertisement filtering rule corresponding to the regular expression is output, and advertisement filtering is performed using the output advertisement filtering rule.
- The webpage data processing method according to claim 14, wherein, after the uniform resource locator of the webpage to be tested is acquired, the method further comprises: passing the uniform resource locator to a segmenter; and segmenting the uniform resource locator in the segmenter to obtain a plurality of segment strings; wherein matching the uniform resource locator against the keywords of the advertisement filtering rules comprises: matching the plurality of segment strings one by one against the keywords in a keyword matcher.
- A webpage data processing apparatus, comprising a processor configured to execute the following program modules: a webpage acquiring unit, configured to acquire a webpage to be tested; a webpage matching unit, configured to match the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition comprises keywords of advertisement filtering rules and the advertisement filtering rules corresponding to the keywords, or the matching condition comprises a preset webpage corresponding to the webpage address of the webpage to be tested, the preset webpage having areas in which a first identifier is set in advance; and a result determining unit, configured to determine a filtering status of the webpage to be tested according to the matching result.
- The webpage data processing apparatus according to claim 16, wherein: the webpage acquiring unit is further configured to acquire, while acquiring the webpage to be tested, the preset webpage corresponding to the webpage address of the webpage to be tested; the apparatus further comprises a webpage marking unit, configured to set the first identifier on the areas containing actual content in the preset webpage and in the webpage to be tested, respectively; the webpage matching unit is further configured to determine whether the areas provided with the first identifier in the preset webpage and in the webpage to be tested match; and the result determining unit is further configured to determine that the webpage to be tested has no filtering problem when the areas provided with the first identifier in the preset webpage and in the webpage to be tested match, and otherwise to determine that the webpage to be tested has a filtering problem.
- The webpage data processing apparatus according to claim 17, wherein: the first identifier comprises a preset color, and the webpage marking unit comprises: a background setting unit, configured to set the background color of the areas containing actual content in the preset webpage and in the webpage to be tested to the preset color; a text processing unit, configured to set the color of the text to the preset color when the actual content in the preset webpage and/or the webpage to be tested is text; and a picture processing unit, configured to delete the picture when the actual content in the preset webpage and/or the webpage to be tested is a picture; or, the first identifier comprises a border, and the webpage marking unit comprises: a border setting unit, configured to set a border on the areas containing actual content in the preset webpage and in the webpage to be tested, respectively, wherein the border coincides with the boundary of the area containing actual content.
- The webpage data processing apparatus according to claim 17, wherein the webpage matching unit comprises: an area calculating unit, configured to calculate a first total area of the areas provided with the first identifier in the preset webpage and a second total area of the areas provided with the first identifier in the webpage to be tested; a third calculating unit, configured to calculate a third ratio between the first total area and the second total area; and a third determining unit, configured to determine whether the third ratio is within a preset range, to determine that the areas provided with the first identifier in the preset webpage and in the webpage to be tested match if the third ratio is within the preset range, and otherwise to determine that they do not match.
- The webpage data processing apparatus according to claim 19, further comprising: a third sub-determining unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, compare the third ratio with the minimum and the maximum of the preset range, determine that the webpage to be tested has a filtering failure when the third ratio is less than the minimum of the preset range, and determine that the webpage to be tested has false filtering when the third ratio is greater than the maximum of the preset range.
- The webpage data processing apparatus according to claim 18, wherein: when the first identifier is the preset color, the webpage matching unit comprises: a color comparing unit, configured to compare whether the colors of the areas corresponding to the same preset comparison point in the preset webpage and in the webpage to be tested are the same; a first calculating unit, configured to calculate a first ratio between the number of preset comparison points whose color comparison result is "different" and the total number of preset comparison points; and a first determining unit, configured to determine whether the first ratio is less than a first preset ratio, to determine that the areas provided with the first identifier in the preset webpage and in the webpage to be tested do not match when the first ratio is greater than the first preset ratio, and otherwise to determine that they match; and when the first identifier is the border, the webpage matching unit comprises: a second calculating unit, configured to calculate a second ratio between the area of the non-overlapping portion of the polygon frames in the preset webpage and in the webpage to be tested, and the total area of the polygon frames in the preset webpage; and a second determining unit, configured to determine that the areas provided with the first identifier in the preset webpage and in the webpage to be tested do not match when the second ratio is greater than a second preset ratio, and otherwise to determine that they match.
- The webpage data processing apparatus according to claim 21, wherein: when the first identifier is the preset color, the webpage data processing apparatus further comprises: a first sub-determining unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, determine whether the color of a first area in the webpage to be tested, the first area corresponding to a preset comparison point whose color comparison result is "different", is the same as the preset color, to determine that the first area has a filtering failure when the color of the first area is the same as the preset color, and otherwise to determine that the first area has false filtering; and when the first identifier is the border, the webpage data processing apparatus further comprises: a second sub-determining unit, configured to, after the result determining unit determines that the webpage to be tested has a filtering problem, perform the following determinations: if the area in the preset webpage that corresponds to a first area provided with the border in the webpage to be tested is not provided with the border, determining that the first area has a filtering failure; and if the area in the preset webpage that corresponds to a second area not provided with the border in the webpage to be tested is provided with the border, determining that the second area has false filtering.
- The webpage data processing apparatus according to any one of claims 17 to 22, further comprising: an area dividing unit, configured to divide the preset webpage and the webpage to be tested into a plurality of comparison areas in one-to-one correspondence; and correspondingly, the webpage matching unit comprises: a first sub-matching unit, configured to determine, for each pair of corresponding comparison areas of the preset webpage and the webpage to be tested, whether the areas provided with the first identifier match.
- 根据权利要求16所述的网页数据处理装置,其特征在于,A web page data processing apparatus according to claim 16, wherein所述网页获取单元包括:第一获取单元,用于获取所述待测网页的统一资源定位符,The webpage obtaining unit includes: a first acquiring unit, configured to acquire a uniform resource locator of the webpage to be tested,所述网页匹配单元包括:第一匹配单元,用于利用广告过滤规则的关键字对所述统一资源定位符进行匹配;第二匹配单元,用于当所述统一资源定位符与所述关键字匹配时,将所述统一资源定位符与所述关键字对应的广告过滤规则进行匹配;以及The webpage matching unit includes: a first matching unit, configured to use the keyword of the advertisement filtering rule to match the uniform resource locator; and a second matching unit, configured to: when the uniform resource locator and the keyword Matching, the uniform resource locator is matched with an advertisement filtering rule corresponding to the keyword;所述结果确定单元包括:过滤单元,用于当所述统一资源定位符与所述关键字对应的广告过滤规则匹配时,利用所述广告过滤规则进行广告过滤。The result determining unit includes: a filtering unit, configured to perform advertisement filtering by using the advertisement filtering rule when the uniform resource locator matches an advertisement filtering rule corresponding to the keyword.
- The webpage data processing apparatus according to claim 24, wherein the apparatus further comprises: a second acquiring unit configured to acquire the keywords corresponding to the advertisement filtering rules before the uniform resource locator is matched against the keywords of the advertisement filtering rules; and an establishing unit configured to build a dictionary tree (trie) of the keywords corresponding to the advertisement filtering rules; wherein the first matching unit comprises: an obtaining module configured to obtain the keywords in the dictionary tree; and a first determining module configured to determine whether the uniform resource locator matches a keyword in the dictionary tree.
- The webpage data processing apparatus according to claim 25, wherein: the second acquiring unit comprises a reading module configured to read a file of the advertisement filtering rules, and an extracting module configured to extract the keywords from the file of the advertisement filtering rules; and the establishing unit comprises a first establishing module configured to establish a correspondence between the keywords and the advertisement filtering rules, and a second establishing module configured to build the dictionary tree from the extracted keywords.
- The webpage data processing apparatus according to claim 24, wherein: the first matching unit comprises a second determining module configured to determine whether the uniform resource locator matches a keyword of the advertisement filtering rules, wherein, if the uniform resource locator is determined to match a keyword of the advertisement filtering rules, the advertisement filtering rule corresponding to the keyword is converted into a regular expression; the second matching unit comprises a first matching module configured to match the uniform resource locator against the regular expression; and the filtering unit is further configured to, when the uniform resource locator matches the regular expression, output the advertisement filtering rule corresponding to the regular expression and perform advertisement filtering using the output advertisement filtering rule.
- The webpage data processing apparatus according to claim 27, wherein the apparatus further comprises: a passing unit configured to pass the uniform resource locator to a segmenter after the uniform resource locator entered in the browser is acquired; and a segmenting unit configured to segment the uniform resource locator in the segmenter to obtain a plurality of segmented character strings; wherein the first matching unit comprises: a second matching module configured to match the plurality of segmented character strings, one by one, against the keywords in a keyword matcher.
- A computer-readable medium having processor-executable program code, applied to a webpage data processing device, wherein the program code causes the processor to perform the following steps: acquiring a webpage to be tested; matching the webpage to be tested against a preset matching condition to obtain a matching result, wherein the matching condition comprises keywords of advertisement filtering rules and the advertisement filtering rules corresponding to the keywords, or the matching condition comprises a preset webpage corresponding to the webpage address of the webpage to be tested, an area of the preset webpage being provided in advance with a first identifier; and determining the filtering status of the webpage to be tested according to the matching result.
- The computer-readable medium according to claim 29, wherein: while acquiring the webpage to be tested, the program code further causes the processor to acquire the preset webpage corresponding to the webpage address of the webpage to be tested; before the webpage to be tested is matched against the preset matching condition to obtain the matching result, the program code further causes the processor to set the first identifier in the areas of the preset webpage and of the webpage to be tested in which actual content exists; matching the webpage to be tested against the preset matching condition to obtain the matching result comprises: determining whether the areas in which the first identifier is set in the preset webpage and in the webpage to be tested match; and determining the filtering status of the webpage to be tested according to the matching result comprises: if the areas in which the first identifier is set in the preset webpage and in the webpage to be tested match, determining that the webpage to be tested has no filtering problem, and otherwise determining that the webpage to be tested has a filtering problem.
- A computer program for executing the webpage data processing method according to any one of claims 1 to 15.
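The color-based comparison recited in the claims above (comparison regions, filtering failure versus false filtering) can be illustrated with a minimal sketch. It is not the applicant's implementation: the bitmap representation, the function name `classify_regions`, the tile size, and the label strings are all assumptions made for illustration. Both pages are assumed to be rendered at the same size, with areas of actual content already painted in the marker color.

```python
from typing import Dict, List, Set, Tuple

Color = Tuple[int, int, int]
Bitmap = List[List[Color]]  # hypothetical rendering: rows of RGB pixels


def classify_regions(
    preset: Bitmap, tested: Bitmap, marker: Color, tile: int
) -> Dict[Tuple[int, int], Set[str]]:
    """Label every tile in which the tested page differs from the preset page.

    Each pixel is treated as a comparison point; tiles of `tile` x `tile`
    pixels play the role of the comparison regions (an assumption).
    """
    findings: Dict[Tuple[int, int], Set[str]] = {}
    for row, preset_row in enumerate(preset):
        for col, preset_color in enumerate(preset_row):
            tested_color = tested[row][col]
            if tested_color == preset_color:
                continue  # this comparison point agrees on both pages
            if tested_color == marker:
                # Marked content appears only in the tested page:
                # something that should have been filtered is still there.
                label = "filter_failure"
            else:
                # The point differs and the tested page is not marked here.
                label = "false_filter"
            findings.setdefault((row // tile, col // tile), set()).add(label)
    return findings


MARK, BLANK = (255, 0, 0), (255, 255, 255)
preset_page = [[MARK, MARK, BLANK, BLANK] for _ in range(4)]
tested_page = [row[:] for row in preset_page]
tested_page[0][3] = MARK   # leftover content in the top-right tile
tested_page[2][0] = BLANK  # missing content in the bottom-left tile
report = classify_regions(preset_page, tested_page, MARK, tile=2)
```

An empty result corresponds to "no filtering problem"; any entry names the comparison region in which a problem was found and which kind it is.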
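The reading, extracting, and establishing modules recited above can be sketched as follows. The claims do not say how a keyword is chosen from a rule, so this sketch assumes the longest run of letters and digits in each rule; the rule strings, the `!` comment convention, and the function names are likewise illustrative assumptions.

```python
import io
import re
from typing import Dict, Iterable, List


def extract_keyword(rule: str) -> str:
    """Pick one keyword per rule: the longest alphanumeric run (assumption)."""
    runs = re.findall(r"[a-z0-9]+", rule.lower())
    return max(runs, key=len) if runs else ""


def load_rules(rule_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Read rule lines and build the keyword -> rules correspondence."""
    rules_by_keyword: Dict[str, List[str]] = {}
    for line in rule_lines:
        rule = line.strip()
        if not rule or rule.startswith("!"):
            continue  # skip blank lines and comment lines
        keyword = extract_keyword(rule)
        if keyword:
            rules_by_keyword.setdefault(keyword, []).append(rule)
    return rules_by_keyword


# A file-like object stands in for the rule file on disk.
rule_file = io.StringIO("! sample list\n||adserver.test^\n/banner/*\n\n*banner.gif\n")
rules_by_keyword = load_rules(rule_file)
```

The keys of the resulting mapping are the keywords that would then be inserted into the dictionary tree.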
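Segmenting the uniform resource locator and matching the segments one by one against a dictionary tree of keywords, as recited above, might look like the sketch below. The class `KeywordTrie`, the choice of delimiters (any non-alphanumeric character), and the sample keywords are assumptions; the claims do not fix these details.

```python
import re
from typing import Dict, Iterable, List


class KeywordTrie:
    """A plain dictionary tree holding the rule keywords."""

    _END = "$end"  # marks a node at which a complete keyword ends

    def __init__(self, keywords: Iterable[str] = ()) -> None:
        self.root: Dict = {}
        for keyword in keywords:
            self.insert(keyword)

    def insert(self, keyword: str) -> None:
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def contains(self, word: str) -> bool:
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False  # no keyword continues with this character
        return self._END in node


def segment_url(url: str) -> List[str]:
    """Split a URL into lowercase alphanumeric segments (assumed delimiters)."""
    return [seg for seg in re.split(r"[^A-Za-z0-9]+", url.lower()) if seg]


def matched_keywords(url: str, trie: KeywordTrie) -> List[str]:
    """Look up each segment in the trie, one by one."""
    return [seg for seg in segment_url(url) if trie.contains(seg)]


trie = KeywordTrie(["banner", "ads", "popup"])
hits = matched_keywords("http://ads.example.com/banner/1.png", trie)
```

Each lookup costs time proportional to the segment length regardless of how many keywords are stored, which is the usual reason for choosing a trie here.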
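The two-stage match recited above (a cheap keyword check first, then conversion of the corresponding rule into a regular expression and a full match) can be sketched as follows. The rule syntax is an assumption: `*` is treated as a wildcard and every other character as a literal. The function names and sample rules are hypothetical.

```python
import re
from typing import Dict, List, Optional


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Convert a wildcard rule into a regular expression (assumed syntax)."""
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))


def find_matching_rule(url: str, rules_by_keyword: Dict[str, List[str]]) -> Optional[str]:
    """Return the first rule that matches the URL, or None."""
    for keyword, rules in rules_by_keyword.items():
        if keyword not in url:
            continue  # stage 1: keyword absent, so these rules are skipped
        for rule in rules:
            if rule_to_regex(rule).search(url):  # stage 2: full rule match
                return rule
    return None


sample_rules = {
    "banner": ["ads.example.com/banner/*"],
    "popup": ["*/popup.js"],
}
blocked = find_matching_rule("http://ads.example.com/banner/1.png", sample_rules)
allowed = find_matching_rule("http://news.example.com/banner-story.html", sample_rules)
```

The second URL contains the keyword `banner` but fails the full rule, so it is not filtered; the keyword stage only narrows the set of rules that need the more expensive regular-expression match.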
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410180750.1A CN105095236A (en) | 2014-04-30 | 2014-04-30 | Advertisement filtering method and device |
CN201410182175.9A CN104008131B (en) | 2014-04-30 | 2014-04-30 | A kind of web data processing method and processing device |
CN201410180750.1 | 2014-04-30 | ||
CN201410182175.9 | 2014-04-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015165245A1 true WO2015165245A1 (en) | 2015-11-05 |
Family
ID=54358118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/090841 WO2015165245A1 (en) | 2014-04-30 | 2014-11-11 | Webpage data processing method and device |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2015165245A1 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090132373A1 (en) * | 2007-11-20 | 2009-05-21 | Daniel Redlich | Revenue Sharing System that Optimizes Ad Revenue with Preformatted Page Generator and Preview Distribution System |
KR20100008935A (en) * | 2008-07-17 | 2010-01-27 | 주식회사 엔톰애드 | Method and apparatus for providing internet advertisement by using morphological analysis |
CN102054030A (en) * | 2010-12-17 | 2011-05-11 | 惠州Tcl移动通信有限公司 | Mobile terminal webpage display control method and device |
CN102521331A (en) * | 2011-12-06 | 2012-06-27 | 中国科学院计算机网络信息中心 | Webpage redirection cheating detection method and device |
CN102830958A (en) * | 2011-06-16 | 2012-12-19 | 奇智软件(北京)有限公司 | Method and system for obtaining interface control information |
CN102857493A (en) * | 2012-06-30 | 2013-01-02 | 华为技术有限公司 | Content filtering method and device |
CN103020266A (en) * | 2012-12-25 | 2013-04-03 | 北京奇虎科技有限公司 | Method and device for extracting webpage text content |
CN104008131A (en) * | 2014-04-30 | 2014-08-27 | 广州市动景计算机科技有限公司 | Processing method and device for web page data |
-
2014
- 2014-11-11 WO PCT/CN2014/090841 patent/WO2015165245A1/en active Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512282A (en) * | 2015-12-07 | 2016-04-20 | 小米科技有限责任公司 | Notification method and device and terminal |
CN108628817A (en) * | 2017-03-15 | 2018-10-09 | 腾讯科技(深圳)有限公司 | A kind of data processing method and device |
CN107193889A (en) * | 2017-05-02 | 2017-09-22 | 努比亚技术有限公司 | Ad blocking method, terminal and computer-readable recording medium |
CN110413866A (en) * | 2018-04-27 | 2019-11-05 | 北京搜狗科技发展有限公司 | Data processing method and device, the device for data processing |
CN110413866B (en) * | 2018-04-27 | 2024-02-02 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9448999B2 (en) | Method and device to detect similar documents | |
US10515142B2 (en) | Method and apparatus for extracting webpage information | |
US9304979B2 (en) | Authorized syndicated descriptions of linked web content displayed with links in user-generated content | |
JP2013541774A (en) | Website scanning device and website scanning method | |
WO2015165245A1 (en) | Webpage data processing method and device | |
CN106326091B (en) | Method and system for detecting browser webpage compatibility | |
CN104657423A (en) | Method and device thereof for sharing contents of applications | |
US9984486B2 (en) | Method and apparatus for voice information augmentation and displaying, picture categorization and retrieving | |
US20150347818A1 (en) | Method, system, and application for obtaining complete resource according to blob images | |
CN106599017B (en) | Scanning analytic method, device and the mobile terminal of installation kit | |
CN102880613A (en) | Identification method of porno pictures and equipment thereof | |
CN109040346B (en) | Method, device and equipment for screening effective domain names in extensive domain name resolution | |
CN108900554B (en) | HTTP asset detection method, system, device and computer medium | |
CN105138579A (en) | Method and device for obtaining keywords and recommending information based on keywords | |
US11080322B2 (en) | Search methods, servers, and systems | |
CN101807192A (en) | Webpage optical character recognition processing method used for mobile communication equipment terminal | |
CN108710860B (en) | Video news segmentation method and device | |
CN110851680A (en) | Web crawler identification method and device | |
CN102508892B (en) | System and method for quickly previewing pictures | |
CN104615770B (en) | A kind of recommendation method and device of mobile terminal favorites data | |
US20150261857A1 (en) | Method And Device For Accessing Websites Via Keywords | |
CN109902269A (en) | A kind of document display method, device, electronic equipment and readable storage medium storing program for executing | |
CN110955855B (en) | Information interception method, device and terminal | |
CN107169057B (en) | Method and device for detecting repeated pictures | |
US20190332859A1 (en) | Method for identifying main picture in web page |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 14890851; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.03.2017) |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 14890851; Country of ref document: EP; Kind code of ref document: A1 |