CN113905275B - Webpage filtering method and intelligent device - Google Patents



Publication number
CN113905275B
Authority
CN
China
Prior art keywords
filtering
filter
target link
triemap
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111113915.XA
Other languages
Chinese (zh)
Other versions
CN113905275A (en)
Inventor
李平安
陆兴
易舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Electronic Technology Shenzhen Co ltd
Original Assignee
Hisense Electronic Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Electronic Technology Shenzhen Co ltd filed Critical Hisense Electronic Technology Shenzhen Co ltd
Priority to CN202111113915.XA priority Critical patent/CN113905275B/en
Publication of CN113905275A publication Critical patent/CN113905275A/en
Application granted granted Critical
Publication of CN113905275B publication Critical patent/CN113905275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/454 Content or additional data filtering, e.g. blocking advertisements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/812 Monomedia components thereof involving advertisement data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/858 Linking data to content, e.g. by linking an URL to a video object, by creating a hotspot
    • H04N21/8586 Linking data to content, e.g. by linking an URL to a video object, by creating a hotspot by using a URL

Abstract

The application discloses a webpage filtering method and an intelligent device. When an operation instruction for starting a browser application is received, a filter list is obtained, and each filter included in the filter list is stored in a TrieMap or a TrieSet; when an operation instruction of a user for accessing a target link is received, link information associated with the target link is acquired; the target link is filtered according to the link information, the filter list, the TrieMap and/or the TrieSet; if the filtering result is to allow release, an access request for the target link is sent to the server; if the filtering result is interception, the access request for the target link is not sent to the server. The application not only ensures the security of browser use, but also speeds up web page filtering and access and reduces memory overhead.

Description

Webpage filtering method and intelligent device
Technical Field
The application relates to the field of intelligent televisions, in particular to a webpage filtering method and intelligent equipment.
Background
In some application scenarios, a browser may be installed on the intelligent device. When browsing a web page, a user often sees jump links inserted in the website, such as advertisement links; the user may actively request access to the advertisement page or may trigger the advertisement link by mistake. Accessing an advertisement page can be risky, because some advertisements link to illegal websites, virus websites, malicious websites, or the like. The browser may therefore filter the web page to be accessed before requesting the advertisement link, and intercept risky web pages in time.
The browser may obtain a filter list (filter list) provided by a third party, where the filter list includes any number of filters, each defining a filter condition, and web links satisfying the filter condition are to be intercepted or allowed to pass. Because the number of the filters in the filter list can be thousands to millions, on one hand, when the webpage links to be accessed are matched with the filter list in a traversing way, the browser filtering efficiency is low, the time consumption is long, and on the other hand, the intelligent device needs to consume a larger memory to store each filter in the list, and especially for devices such as an intelligent television with relatively low memory configuration, the running of other application programs or processes can be influenced by the insufficient memory.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a webpage filtering method and intelligent equipment, which can improve the efficiency of filtering and screening the webpage to be accessed by a browser and reduce the occupation of the memory of the intelligent equipment.
An embodiment of a first aspect provides a web page filtering method, including:
when an operation instruction for starting a browser application is received, a filter list is obtained, and each filter included in the filter list is stored in a TrieMap or a TrieSet;
when an operation instruction of a user for accessing a target link is received, acquiring link information associated with the target link;
filtering the target link according to the link information, the filter list, the TrieMap and/or the TrieSet;
if the filtering result is that the release is allowed, sending an access request for the target link to the server;
if the filtering result is interception, the access request of the target link is not sent to the server.
In a first exemplary implementation, the filtering the target link includes: if the source of the object to be accessed indicated in the link information is a Mainframe, outputting a filtering result of allowing release.
In a second exemplary implementation, the filtering the target link includes:
if the source of the object to be accessed indicated in the link information is not a Mainframe, calling a white list defined in the filter list from the TrieSet, wherein the white list comprises a plurality of authorized URLs, and any jump link configured in a webpage corresponding to an authorized URL is allowed to be released;
and if the referer indicated in the link information is in the white list, outputting a filtering result of allowing release.
In a third exemplary implementation, the filtering the target link includes:
if the referer indicated by the link information is not in the white list, extracting a first keyword of the referer;
if a first exception filter matched with the first keyword is queried in the TrieMap and the referer meets the filtering condition defined by the first exception filter, adding the referer to the white list, and outputting a filtering result of allowing release; wherein the filtering condition comprises a basic filtering rule and additional filtering options, and the filtering condition of an exception filter defines the condition that a target link must satisfy to be allowed to pass.
In a fourth exemplary implementation, the filtering the target link includes:
if the first exception filter matched with the first keyword is not queried in the TrieMap, or the referer does not meet the filtering condition of the first exception filter, matching the referer against the exception filters in the TrieSet;
and if the referer meets the filtering condition defined by a second exception filter in the TrieSet, adding the referer to the white list, and outputting a filtering result of allowing release.
In a fifth exemplary implementation, the filtering the target link includes:
if no second exception filter is matched from the TrieSet, the referer is not added to the white list, and a second keyword of the target link URL indicated by the link information is extracted;
if a first blocking filter matched with the second keyword is queried in the TrieMap and the target link URL meets the filtering condition defined by the first blocking filter, outputting a filtering result of interception; wherein the filtering condition of a blocking filter defines the condition under which the target link is intercepted.
In a sixth exemplary implementation, the filtering the target link includes:
if the first blocking filter matched with the second keyword is not queried in the TrieMap, or the target link URL does not meet the filtering condition defined by the first blocking filter, matching the target link URL against the blocking filters in the TrieSet;
if the target link URL meets the filtering condition defined by a second blocking filter in the TrieSet, outputting a filtering result of interception;
and if no second blocking filter is matched from the TrieSet, outputting a filtering result of allowing release.
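As a non-limiting illustration, the decision order of the first to sixth exemplary implementations can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: plain Python dicts and lists stand in for the TrieMap and the TrieSet, exception and blocking filters are kept in separate containers for readability, each filter is modeled as a predicate on a URL, and `extract_keyword`, `Engine`, and the source value "Subframe" are hypothetical names that do not appear in the application.

```python
ALLOW, INTERCEPT = "allow", "intercept"


def extract_keyword(url):
    """Hypothetical keyword extraction: use the host part of the URL."""
    return url.split("://", 1)[-1].split("/", 1)[0]


class Engine:
    def __init__(self):
        self.whitelist = set()       # authorized referers
        self.exception_map = {}      # stand-in for TrieMap: keyword -> exception filter
        self.exception_set = []      # stand-in for TrieSet: exception filters without a key
        self.blocking_map = {}       # stand-in for TrieMap: keyword -> blocking filter
        self.blocking_set = []       # stand-in for TrieSet: blocking filters without a key

    @staticmethod
    def _matches(url, keyed, unkeyed):
        # Try the keyword lookup first, then fall back to the unkeyed filters.
        candidate = keyed.get(extract_keyword(url))
        if candidate is not None and candidate(url):
            return True
        return any(f(url) for f in unkeyed)

    def filter_link(self, target_url, referer, source):
        # First implementation: a Mainframe source is always released.
        if source == "Mainframe":
            return ALLOW
        # Second implementation: a whitelisted referer is released.
        if referer in self.whitelist:
            return ALLOW
        # Third and fourth implementations: exception filters are matched
        # against the referer; a match adds the referer to the white list.
        if self._matches(referer, self.exception_map, self.exception_set):
            self.whitelist.add(referer)
            return ALLOW
        # Fifth and sixth implementations: blocking filters are matched
        # against the target link URL.
        if self._matches(target_url, self.blocking_map, self.blocking_set):
            return INTERCEPT
        return ALLOW


engine = Engine()
engine.blocking_map["ads.example.com"] = lambda url: "/banner" in url
engine.exception_set.append(lambda url: url.startswith("https://trusted.example.org"))

print(engine.filter_link("https://ads.example.com/banner.js",
                         "https://news.example.net/", "Subframe"))
print(engine.filter_link("https://ads.example.com/banner.js",
                         "https://trusted.example.org/page", "Subframe"))
```

In this sketch the first call is intercepted by the keyed blocking filter, while the second is released because the referer matches an exception filter and is then added to the white list.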
In a seventh exemplary implementation, after obtaining the filter list, the method further includes the steps of:
reading a filter included in the filter list, and preprocessing the filter;
if no keyword can be extracted from the filter, storing the filter in the TrieSet;
if at least one keyword can be extracted from the filter, determining whether the keyword is occupied in the TrieMap;
if all keywords are occupied in the TrieMap, storing the filter in the TrieSet;
and if a keyword is not occupied in the TrieMap, storing the keyword and the filter in the TrieMap in [key, value] form.
In an eighth exemplary implementation, the preprocessing the filter includes:
identifying and counting the type of the filter according to the type identifier of the filter; types of filters include exception filters and blocking filters;
the basic filtering rules and the additional filtering options in the filtering conditions defined by the filter are separated.
An embodiment of the second aspect provides an intelligent device, where a browser application is configured in the intelligent device, and a filter module is configured in the browser application, where the filter module is configured to execute:
when an operation instruction for starting a browser application is received, a filter list is obtained, and each filter included in the filter list is stored in a TrieMap or a TrieSet;
when an operation instruction of a user for accessing a target link is received, acquiring link information associated with the target link;
filtering the target link according to the link information, the filter list, the TrieMap and/or the TrieSet;
if the filtering result is that the release is allowed, sending an access request for the target link to the server;
if the filtering result is interception, the access request of the target link is not sent to the server.
In the technical scheme provided by the application, a Trie is introduced into the Map and the Set to store the filter data, yielding the TrieMap and the TrieSet. The Trie saves memory by sharing the common prefixes of the many filters in the filter list, which reduces the space complexity. The time complexity of the Trie during filtering depends only on the URL being looked up and is independent of the number of filters in the filter list; since a URL is generally tens to hundreds of characters long, the efficiency of filtering and matching is also improved. Based on the TrieMap and TrieSet filter storage structure, a filtering result is output with reference to the link information of the target link to be accessed and the related content defined in the filter list. If the filtering result is to allow release, the target link is considered safe and is not filtered out, and an http request for accessing the target link is sent to the server; if the filtering result is interception, the target link is considered risky and needs to be filtered out, and no http request for accessing the target link is sent to the server. Therefore, the application not only ensures the security of browser use, but also speeds up web page filtering and access, and reduces the memory overhead of storing filter data.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description will briefly explain the drawings required for the embodiments, and it is apparent that the drawings in the following description are only some embodiments of the present application and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
A schematic diagram of a usage scenario of a smart tv is exemplarily shown in fig. 1;
a flowchart of a first web page filtering method is exemplarily shown in fig. 2;
a flowchart of a second web page filtering method is exemplarily shown in fig. 3;
a performance comparison chart of Set memory overhead is exemplarily shown in fig. 4;
a performance comparison chart of Set traversal time is exemplarily shown in fig. 5;
a performance comparison chart of Map memory overhead is exemplarily shown in fig. 6;
a performance comparison chart of Map traversal time is exemplarily shown in fig. 7;
a performance comparison chart of the time taken to match a URL to a filter is exemplarily shown in fig. 8.
Detailed Description
For the purposes of making the objects and embodiments of the present application more apparent, an exemplary embodiment of the present application will be described in detail below with reference to the accompanying drawings in which exemplary embodiments of the present application are illustrated, it being apparent that the exemplary embodiments described are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second, third and the like in the description and in the claims and in the above-described figures are used for distinguishing between similar or similar objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
In an exemplary application scenario, a browser application may be configured in the smart device. After starting the browser application, the user may navigate to a corresponding web page by entering a keyword or website of interest, or by clicking a link pushed on the browser home page, and then browse the web page content. The smart device includes, but is not limited to, a smart TV, a smartphone, a tablet computer, a computer and the like; it only needs to have a display function, allow a browser application to be installed, and support browser operation and user interaction. Taking a smart TV as an example, as shown in fig. 1, a user may operate the smart TV 200 through the remote controller 100 or the terminal device 300. The smart TV 200 may act as a client and perform data communication with the server 400, and may be communicatively connected through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. After the user starts the browser application, any web page link can be accessed in the browser: the smart TV 200 generates an http request according to the web page link and sends it to the server 400, the server 400 responds to the http request and sends web page data to the smart TV 200, and the smart TV 200 navigates to the web page and loads and displays the received web page data.
While the user is using the browser, advertisements may be embedded in some web pages in forms such as popup windows, banners, and icons, and by clicking such an advertisement mark the user quickly jumps to the web page it links to. The term advertisement in the embodiments of the present application is broad and includes, but is not limited to, media materials, information, commodity promotion and trade, platform activity promotion, and the like pushed by an operator. However, accessing an advertisement page may be risky. For example, some websites use the form of an advertisement to induce users to visit, and these may be illegal websites, malicious websites, virus websites, or the like; once the user accesses such a website, the smart device may be infected with a virus, or personal private information may be stolen.
In an exemplary implementation, to avoid these risks and improve the security of browser use, the browser can use its own web page filtering engine to filter and check the web page to be accessed before requesting it from the server. If the check finds that the web page to be accessed is risky, the page is intercepted in time: no http request is sent to the server, and the smart device does not display the content data of the page. Optionally, the user is prompted that the page has been intercepted because of a security risk, so that the user does not retry the access. If the check finds no risk, the web page link is allowed to be released: an http request for accessing the web page is sent to the server, and the browser then jumps to the page and loads and displays the web page data.
In one exemplary implementation, the web page filtering engine performs filtering checking on the web page to be accessed based mainly on a filter list (filtering list) provided by a third party, where the filter list includes any number of filters, and each filter is manually maintained and defined with its filtering condition. The filters are distributed in the form of text files, which are combined and collectively referred to as a filter list.
In one exemplary implementation, the filters may be divided into two types, Blocking filters and Exception filters. The filtering condition of a Blocking filter defines the condition under which a web page link is intercepted; for example, if web page link A meets the filtering condition of Blocking filter7, web page link A is intercepted. The filtering condition of an Exception filter defines the condition under which a web page link is allowed to be released; for example, if web page link B meets the filtering condition of Exception filter32, web page link B is allowed to be released. Optionally, a type identifier may be added to the filtering condition of each filter to describe the specific type of the filter, so that the smart device can identify the filter type.
In an exemplary implementation, the filtering criteria necessarily include a base filtering rule (rule), which may be a regular expression. Optionally, the basic filtering rules may include one or more URLs (Uniform Resource Locator, uniform resource locators) that are used to describe a link address (i.e., a web address) of a network resource, and when a URL requested to be accessed by a user matches a URL included in a filter, the URL requested to be accessed satisfies the basic filtering rules of the filter. The basic filtering rules are not limited to directly specifying the URLs that are intercepted or allowed to pass, but may be customized to other rule grammars, such as when the URLs of the web pages to be accessed carry the characters or character strings specified by the rules, that is, the basic filtering rules are satisfied.
In an exemplary implementation, the filtering conditions optionally include additional filtering options (options), which are further added and defined by a filter maintainer on the basis of basic rules, the options not being necessary, and the options may or may not be set in the filter. For a filter provided with additional filtering options, the filtering conditions of the filter are only satisfied if the basic filtering rules and the additional filtering options are satisfied at the same time. The number and type of options involved in the additional filtering options are not limited, and additional filtering options include, but are not limited to, the source of the object to be accessed, the file type pointed to by the URL, and the like.
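As a non-limiting illustration, a filtering condition made of a basic filtering rule plus optional additional filtering options could be evaluated as sketched below. The sketch assumes the basic rule is a regular expression (which the description allows but does not require) and that each option is a key/value pair compared against the link information; `FilterCondition`, `matches`, and the option key `content_type` are hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FilterCondition:
    rule: str                                    # basic filtering rule (a regular expression here)
    options: dict = field(default_factory=dict)  # additional filtering options; may be empty

    def matches(self, url, link_info):
        # The basic filtering rule must always be satisfied.
        if re.search(self.rule, url) is None:
            return False
        # When options are set, every one of them must be satisfied as well.
        return all(link_info.get(key) == value for key, value in self.options.items())


with_option = FilterCondition(r"ads\.", {"content_type": "script"})
without_option = FilterCondition(r"ads\.")

url = "https://ads.example.com/a.js"
print(with_option.matches(url, {"content_type": "script"}))
print(with_option.matches(url, {"content_type": "image"}))
print(without_option.matches(url, {}))
```

A filter without options is satisfied by the basic rule alone, while a filter with options is satisfied only when the rule and all options hold at the same time.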
In an exemplary implementation, the filtering conditions of a Blocking filter and an Exception filter may specify the same URL. For example, the URL of web page link C may match both Blocking filter15 and Exception filter6, where Blocking filter15 intercepts the URL of web page link C and Exception filter6 allows it to be released. The filtering scheme provided by the embodiment of the application is as follows: if a single URL meets the filtering conditions of a Blocking filter and an Exception filter at the same time, that is, both types of filter are matched, the URL is intercepted; if a single URL matches only a Blocking filter and no Exception filter, the URL is intercepted; if a single URL matches only an Exception filter and no Blocking filter, the URL is allowed to be released; if a single URL does not match any type of filter, the URL is allowed to be released.
When an operation instruction for starting the browser application is received, the web page filtering engine reads the filter list issued by the third party into the memory, and because the filter list may comprise thousands to millions of filters, on one hand, after the filter list is loaded into the memory, a larger memory space is occupied, and particularly for devices such as an intelligent television with relatively low memory configuration, the running of other application programs or processes may be affected due to the insufficient memory; on the other hand, when a user accesses a certain webpage, the URL of the webpage is subjected to traversal matching with each filter in the filter list until a filtering result is finally output, which is certainly time-consuming, and results in low browser filtering efficiency and long waiting time when the user accesses the webpage. Therefore, the defects existing in the process of filtering the web page by the browser are mainly reflected in time complexity and memory space complexity.
In order to overcome the technical defects, in an exemplary implementation manner, firstly, the application introduces a Trie algorithm into maps and sets to improve the storage structure of filter data, thereby constructing a TrieMap and a TrieSet.
A Trie, also called a prefix tree or dictionary tree, provides efficient operations on character strings; all descendants of any node of the Trie other than the root node share a common prefix. During searching and matching, the common prefixes of URLs or of their keywords can therefore be fully exploited to eliminate a large number of meaningless operations. For example, when URLs share the common prefix "https://www", the common prefix part can be skipped and matching proceeds according to the characters or character strings formed by the nodes where the Trie branches, which reduces the time complexity and time cost of filtering a web page. The best-case time complexity of the Trie is Θ(1), and the average and worst-case time complexity is Θ(key_length).
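As a non-limiting illustration of prefix sharing and of lookup cost that depends only on the key length, a plain character-level Trie can be sketched as follows. This is a simplified teaching structure, not the HAT Trie preferred by the application, and the names `TrieNode`, `Trie`, and `node_count` are hypothetical.

```python
class TrieNode:
    __slots__ = ("children", "value", "terminal")

    def __init__(self):
        self.children = {}     # next character -> child node
        self.value = None      # payload stored for a complete key
        self.terminal = False  # True if a key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1    # counts the root

    def insert(self, key, value=None):
        node = self.root
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                # Only characters beyond a shared prefix create new nodes.
                child = TrieNode()
                node.children[ch] = child
                self.node_count += 1
            node = child
        node.terminal = True
        node.value = value

    def get(self, key):
        # At most len(key) steps, regardless of how many keys are stored.
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.terminal else None


trie = Trie()
trie.insert("https://www.a.com", "filter-1")
trie.insert("https://www.b.com", "filter-2")
print(trie.get("https://www.b.com"))
print(trie.node_count)
```

The two 17-character keys share the 12-character prefix "https://www.", so the sketch allocates 23 nodes (including the root) rather than one node per stored character.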
Common Tries include the Burst Trie and the HAT Trie. The Burst Trie is one of the fastest data structures for storing and retrieving variable-length strings and can greatly reduce memory consumption. It compresses nodes with common prefixes into a Bucket, but the Bucket is not a cache-friendly data structure, so the Burst Trie cannot achieve both cache friendliness and concurrency performance: ensuring its cache hit rate on a multi-core processor weakens its concurrency performance. The HAT Trie improves on the Burst Trie by replacing the Bucket with an Array Hash Table, which is a more cache-friendly data structure, so the HAT Trie has an advantage over the Burst Trie in access speed.
Set is a data structure called a collection, and Map is a data structure called a dictionary; both store non-repeated values. A Set stores elements in the form [value, value], and a Map stores elements in the form [key, value]. Preferably, the HAT Trie is used to store the filter data in the Map and the Set, thereby building the TrieSet and TrieMap data models. The improvements are as follows: first, the effective character scale of the filters is small, so common prefixes are very likely and sparseness is unlikely to arise; second, the time complexity of filtering a web page depends only on the URL being looked up and is independent of the number of filters in the filter list, and a URL is short, generally tens to hundreds of characters, so the efficiency of filtering and matching is improved and the time complexity and time cost are reduced; in addition, the TrieSet and the TrieMap simplify data storage and reduce the memory space complexity and memory overhead, optimizing both memory usage and matching speed.
In one exemplary implementation, the reading and initialization of the filter list must be completed before the user accesses any web page link. When an operation instruction for starting the browser application is received, the web page filtering engine acquires a filter list preset and issued by a third party, reads the filters included in the filter list into memory, and preprocesses each filter. The preprocessing includes, but is not limited to: identifying and counting the types of the filters according to their type identifiers, so that the Blocking filters and the Exception filters in the filter list can be distinguished; and separating the basic filtering rule and the additional filtering options from the text of each filter, because matching a filter requires determining separately whether the basic filtering rule and the additional filtering options in its filtering condition are satisfied.
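As a non-limiting illustration, the preprocessing described above can be sketched as follows. The application does not specify the filter text syntax, so the sketch assumes an Adblock-Plus-style convention in which a leading "@@" is the type identifier of an Exception filter and the first "$" separates the basic filtering rule from a comma-separated option list; `Filter` and `preprocess` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    kind: str       # "exception" or "blocking"
    rule: str       # basic filtering rule
    options: tuple  # additional filtering options, possibly empty


def preprocess(line):
    text = line.strip()
    kind = "blocking"
    # Assumed type identifier: a leading "@@" marks an Exception filter.
    if text.startswith("@@"):
        kind = "exception"
        text = text[2:]
    # Assumed separator: the first "$" splits the rule from the options.
    rule, separator, option_text = text.partition("$")
    options = tuple(o.strip() for o in option_text.split(",") if o.strip()) if separator else ()
    return Filter(kind, rule, options)


lines = [
    "||ads.example.com^$script,third-party",
    "@@||trusted.example.org^",
    "/banner/*",
]
filters = [preprocess(line) for line in lines]
counts = Counter(f.kind for f in filters)
print(filters[0])
print(dict(counts))
```

Keeping the rule and the options in separate fields lets the matching step evaluate each of them independently, as the description requires.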
In an exemplary implementation, for each filter in the filter list, if the length of the filter is too short, or the filter uses a native regular expression, it may not be possible to effectively extract keywords (keys) from the filter, where the keywords are substrings extracted from the filter's basic filtering rules, and one filter may generate one or more keywords. Therefore, whether the keyword can be extracted from the filter to be stored is firstly judged, and if the keyword of the filter cannot be extracted, the filter is directly stored in the TrieSet.
In an exemplary implementation, if at least one keyword can be extracted from the filter to be stored, the keyword may already be occupied in the TrieMap, either because the same filter has been stored in the TrieMap before or because a filter already in the TrieMap has the same keyword as the filter to be stored. To avoid storing the same keyword repeatedly and mapping it to several filters, after the keywords of the filter to be stored are extracted, it is determined whether each keyword is occupied in the TrieMap. If all keywords are occupied in the TrieMap, the filter is stored in the TrieSet; if one or more keywords are not occupied in the TrieMap, those keywords and the filter are stored in the TrieMap in the form [key, value]. After all the filters in the filter list have been preprocessed and stored in this manner, the initialization task of the web page filtering engine is essentially complete, and the filter list is held in memory in the form of the TrieMap and the TrieSet.
As an example, assume that keyword 1, keyword 2 and keyword 3 are extracted from a filter to be stored. If keywords 1 to 3 are all occupied in the TrieMap, the filter is stored directly in the TrieSet and not in the TrieMap. If keyword 1 and keyword 2 are occupied in the TrieMap but keyword 3 is not, the filter is stored in the TrieMap in the form [keyword 3, filter] and is not stored in the TrieSet. If keyword 3 is occupied in the TrieMap but keyword 1 and keyword 2 are not, the filter is stored in the TrieMap as [keyword 1, filter] and [keyword 2, filter] and is not stored in the TrieSet. That is, a filter is never stored in both the TrieMap and the TrieSet; whether it goes into the TrieMap or the TrieSet is determined by whether keywords can be extracted and whether they are occupied.
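As a non-limiting illustration, the storage rule of the above example can be sketched as follows. A Python dict and set stand in for the TrieMap and the TrieSet, keyword extraction is assumed to have happened already, and `store_filter` is a hypothetical name.

```python
def store_filter(filter_text, keywords, trie_map, trie_set):
    """Store a filter in exactly one of the two containers and report which one."""
    # Keywords that no earlier filter has claimed in the map.
    free = [k for k in keywords if k not in trie_map]
    if not free:
        # No keyword could be extracted, or all of them are occupied.
        trie_set.add(filter_text)
        return "TrieSet"
    for keyword in free:
        trie_map[keyword] = filter_text  # stored in [key, value] form
    return "TrieMap"


keywords = ["keyword1", "keyword2", "keyword3"]

# Case 1: keywords 1 to 3 are all occupied.
trie_map = {"keyword1": "other", "keyword2": "other", "keyword3": "other"}
trie_set = set()
print(store_filter("filter", keywords, trie_map, trie_set))

# Case 2: keywords 1 and 2 are occupied, keyword 3 is free.
trie_map = {"keyword1": "other", "keyword2": "other"}
trie_set = set()
print(store_filter("filter", keywords, trie_map, trie_set), trie_map["keyword3"])

# Case 3: keyword 3 is occupied, keywords 1 and 2 are free.
trie_map = {"keyword3": "other"}
trie_set = set()
print(store_filter("filter", keywords, trie_map, trie_set), sorted(trie_map))
```

The function also covers a filter from which no keyword can be extracted: an empty keyword list leaves no free keyword, so the filter goes to the set.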
After the web page filtering engine completes the initialization of the filter list, the filtering and matching flow can be executed whenever the user clicks a page link. In an exemplary implementation, as illustrated in fig. 2, a web page filtering method is provided, which is executed by a filtering module (corresponding to the web page filtering engine) configured in a browser application of a smart device, and includes the following program steps:
Step S01, when an operation instruction for starting the browser application is received, a filter list is obtained, and each filter included in the filter list is stored in a TrieMap or a TrieSet.
Step S01 is a summary of the reading and initializing process of the filter list, and a more detailed implementation of step S01 may refer to the description of the related embodiments, which is not repeated herein.
Step S02, when an operation instruction of accessing a target link by a user is received, link information associated with the target link is acquired.
In an exemplary implementation, the target link may be an advertisement link embedded in the current web page, a resource link pushed by the current web page, and so on; when the user clicks the target link, an access request to the target link is triggered. Alternatively, the target link may be a web address entered by the user in the address bar of the browser; when the user confirms after manually entering the address, an access request to the target link is triggered. It should be noted that the presentation and access forms of the target link are not limited to the examples of the present application.
In an exemplary implementation, the link information includes, but is not limited to, the target link URL, the Referer, the source of the object to be accessed, and the ContentType pointed to by the target link URL; any parameter that may be used in the subsequent filtering match is included in the link information. The target link URL is the URL of the web page that the user triggers access to. The Referer is the URL of the current web page: when the browser sends an http request for the target link to the web server, it generally carries the Referer to tell the server from which web page the jump to the target link originates. That is, when web page A triggers an access request to web page B, the URL of web page A is the Referer and the URL of web page B is the target link URL. The source of the object to be accessed is the data source of the target link URL, for example Mainframe. The ContentType specifies the HTTP content type of the response, defines the type of the network file and the encoding of the web page, and can be used to evaluate the options in the filtering condition.
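The link information described above can be pictured as a small record. The sketch below is only an assumption about how such a record might look in Python; the class name LinkInfo, the field names, the example URLs and the source value "Subframe" are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkInfo:
    """Parameters that the later filtering steps may consult."""
    url: str           # target link URL (web page B)
    referer: str       # URL of the current page that triggers the request (web page A)
    source: str        # source of the object to be accessed, e.g. "Mainframe"
    content_type: str  # ContentType pointed to by the target link URL


# Web page A triggers a request for a script hosted by web page B.
info = LinkInfo(
    url="https://b.example/script.js",
    referer="https://a.example/",
    source="Subframe",       # hypothetical non-Mainframe value
    content_type="SCRIPT",
)
print(info.referer, "->", info.url)
```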
Step S03, filtering the target link according to the link information, the filter list, the TrieMap and/or the TrieSet.
Step S04, if the filtering result is allow release, sending an access request for the target link to the server.
Step S05, if the filtering result is intercept, not sending the access request for the target link to the server.
After the link information is obtained, the target link is filtered and matched by combining the filtering conditions defined by the filters in the filter list with the TrieMap and/or TrieSet data storage model, and a final filtering result is output. If the output filtering result is "allow release", the target link is considered safe and accessible: the browser sends an http request for the target link to the server, receives the web page data that the server returns in response, jumps to the web page, and loads and displays the web page data, so that the user can browse the web page content provided at the target link.
If the output filtering result is "intercept", the target link is considered risky and inaccessible: the browser does not send an http request for the target link, that is, the user's page access request is terminated. Optionally, the browser interface prompts the user that the requested web page poses a security risk and has been intercepted, so as to discourage the user from retrying access to the target link.
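Steps S03 to S05 amount to a simple dispatch on the filtering result. The following Python sketch shows that dispatch under stated assumptions: the names Verdict, handle_navigation, fake_send and the callback parameters are hypothetical, and the decide callback stands in for the whole filtering match of step S03.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow release"
    INTERCEPT = "intercept"


def handle_navigation(url, decide, send_request, notify_user):
    """Send the request only when the filtering result allows release."""
    if decide(url) is Verdict.ALLOW:
        return send_request(url)   # S04: the http request is sent
    notify_user(url)               # S05: no request; optionally warn the user
    return None


sent, warned = [], []


def fake_send(url):
    # Stand-in for the real network request.
    sent.append(url)
    return "response for " + url


def decide(url):
    # Stand-in for step S03: intercept anything on a hypothetical ad host.
    return Verdict.INTERCEPT if "ads.example" in url else Verdict.ALLOW


first = handle_navigation("https://news.example/", decide, fake_send, warned.append)
second = handle_navigation("https://ads.example/banner", decide, fake_send, warned.append)
print(first, second, sent, warned)
```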
In an exemplary implementation, fig. 3 illustrates a more refined web page filtering method than the more general scheme shown in fig. 2. After the browser application is started and the reading and initialization of the filter list are completed, the method includes the following program steps:
Step S101, when an operation instruction of accessing a target link by a user is received, link information associated with the target link is acquired.
Step S102, judging whether the source of the object to be accessed is a Mainframe. If the source of the object to be accessed is a Mainframe, step S114 is executed, i.e. any web page link traffic with the source of Mainframe is allowed; if the source of the object to be accessed is not a Mainframe, further filtering and checking of the target link is required, and step S103 is performed.
Step S103, retrieving the whitelist defined in the filter list from the TrieSet.
In an exemplary implementation, the white list is preset by a third party and delivered to the intelligent device together with the filter list. The white list includes a plurality of authorized URLs, and any jump link configured in the web page of an authorized URL is allowed through; that is, when the user requests access to an advertisement link in the web page of any authorized URL in the white list, the advertisement link is treated as safe and accessible by default. The white list is equivalent to a set of URLs, so it can be stored in the form of a TrieSet, which reduces memory overhead and speeds up querying and matching against the white list.
Step S104, judging whether the Referer is in the white list. If the Referer is in the white list, that is, the Referer matches an authorized URL in the white list, step S114 is executed; if the Referer is not in the white list, step S105 is executed.
Step S105, extracting the first keyword of the Referer. For ease of distinction, the present application calls a keyword extracted from the Referer a first keyword and a keyword extracted from the target link URL a second keyword.
Step S106, judging whether a first Exception filter matching the first keyword is found in the TrieMap.
In an exemplary implementation, the type identifier carried in each filter makes it possible to identify which filters in the TrieMap are Blocking filters and which are Exception filters, and to screen out all Exception filters from the TrieMap. The first keyword is matched against each Exception filter in the TrieMap; when the keyword of an Exception filter in the TrieMap is the same as the first keyword, that Exception filter is the first Exception filter. If the first Exception filter is found in the TrieMap, step S107 is executed; otherwise step S108 is executed.
Step S107, judging whether the Referer satisfies the filtering condition of the first Exception filter.
In an exemplary implementation, the filtering condition always includes a basic filtering rule and optionally includes additional filtering options. If the filtering condition includes only a basic filtering rule, the Referer satisfies the filtering condition as soon as it satisfies the basic filtering rule; if the filtering condition includes both a basic filtering rule and additional filtering options, both must be satisfied at the same time. If the Referer does not satisfy the filtering condition of the first Exception filter, step S108 is executed; otherwise step S109 is executed.
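The two-part filtering condition can be sketched as follows. This is a minimal illustration, not the matching logic of the application: substring matching stands in for the basic filtering rule, the only additional option modelled is a set of permitted content types, and the names Filter and satisfies are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Filter:
    pattern: str                                   # basic filtering rule (always present)
    content_types: Optional[FrozenSet[str]] = None  # additional option (may be absent)


def satisfies(f, url, content_type):
    """True only if the basic rule and every present option are satisfied."""
    if f.pattern not in url:
        return False                 # basic rule failed
    if f.content_types is None:
        return True                  # no options: the basic rule alone decides
    return content_type in f.content_types


rule_only = Filter(pattern="weatherbug.com")
rule_and_option = Filter(pattern="weatherbug.com",
                         content_types=frozenset({"SCRIPT"}))

print(satisfies(rule_only, "https://www.weatherbug.com/", "IMAGE"))
print(satisfies(rule_and_option, "https://www.weatherbug.com/", "IMAGE"))
print(satisfies(rule_and_option, "https://www.weatherbug.com/", "SCRIPT"))
```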
Step S108, judging whether a second Exception filter is matched in the TrieSet.
In an exemplary implementation, all Exception filters are screened out from the TrieSet according to the type identifiers of the filters stored in the TrieSet. The Referer is matched against each Exception filter in the TrieSet; when the Referer satisfies the filtering condition of an Exception filter in the TrieSet, that Exception filter is the matched second Exception filter, and step S109 is executed; if the Referer does not satisfy the filtering condition of any Exception filter in the TrieSet, matching fails, and step S110 is executed.
Step S109, adding the Referer to the white list. After the Referer is added to the white list, step S114 is executed, and the filtering result is output as allow release.
Step S110, the Referer is not added to the white list, and the second keyword of the target link URL is extracted.
Step S111, judging whether a first Blocking filter matching the second keyword is found in the TrieMap.
In an exemplary implementation, all Blocking filters are screened out from the TrieMap according to the type identifiers of the filters stored in the TrieMap. The second keyword is matched against each Blocking filter in the TrieMap; when the keyword of a Blocking filter in the TrieMap is the same as the second keyword, that Blocking filter is the first Blocking filter. If the first Blocking filter is found in the TrieMap, step S112 is executed; otherwise step S113 is executed.
Step S112, judging whether the target link URL satisfies the filtering condition of the first Blocking filter. If the target link URL does not satisfy the filtering condition of the first Blocking filter, step S113 is executed; otherwise step S115 is executed.
Step S113, judging whether a second Blocking filter is matched in the TrieSet.
In an exemplary implementation, all Blocking filters are screened out from the TrieSet according to the type identifiers of the filters stored in the TrieSet. The target link URL is matched against each Blocking filter in the TrieSet; when the target link URL satisfies the filtering condition of a Blocking filter in the TrieSet, that Blocking filter is the matched second Blocking filter, and step S115 is executed; if the target link URL does not satisfy the filtering condition of any Blocking filter in the TrieSet, matching fails, and step S114 is executed.
Step S114, outputting the filtering result as allow release, and sending an access request for the target link to the server.
Step S115, outputting the filtering result as intercept, and not sending the access request for the target link to the server.
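Steps S102 to S115 can be condensed into the following Python sketch. It is an illustration under several assumptions rather than the engine of the application: a dict and a set stand in for the TrieMap and the TrieSet, each filter is a hypothetical (type, pattern) tuple, substring matching stands in for the filtering condition, white list membership is an exact string match, and extract_keywords is an invented tokenizer.

```python
import re


def extract_keywords(url):
    """Hypothetical tokenizer: lower-case alphanumeric runs of the URL."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def find_filter(kind, url, trie_map, trie_set):
    """Look for a matching filter of one type: keyword lookup first, then traversal."""
    for keyword in extract_keywords(url):               # S106 / S111
        hit = trie_map.get(keyword)
        if hit and hit[0] == kind and hit[1] in url:    # S107 / S112
            return hit
    for candidate in trie_set:                          # S108 / S113
        if candidate[0] == kind and candidate[1] in url:
            return candidate
    return None


def filter_link(url, referer, source, trie_map, trie_set, whitelist):
    if source == "Mainframe":                           # S102
        return "allow release"                          # S114
    if referer in whitelist:                            # S103, S104
        return "allow release"
    if find_filter("exception", referer, trie_map, trie_set):
        whitelist.add(referer)                          # S109
        return "allow release"
    if find_filter("blocking", url, trie_map, trie_set):  # S110 to S113
        return "intercept"                              # S115
    return "allow release"                              # S114


trie_map = {"aaxads": ("blocking", "aaxads.com")}
trie_set = {("exception", "trusted.example")}
whitelist = set()

blocked = filter_link("https://c.aaxads.com/aax.js", "https://www.weatherbug.com/",
                      "Subframe", trie_map, trie_set, whitelist)
trusted = filter_link("https://c.aaxads.com/aax.js", "https://trusted.example/",
                      "Subframe", trie_map, trie_set, whitelist)
print(blocked, trusted, whitelist)
```

Note that once a Referer has been added to the white list in S109, later requests from the same page return at S104 without touching either container.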
In the scheme illustrated in fig. 3, the source of the object to be accessed is determined first. When the source is not a Mainframe, the Referer is matched against the white list preset in the filter list. If the white list query shows that the Referer belongs to an authorized page, release is allowed directly, and the target link URL does not need to be matched against the TrieMap or the TrieSet, which improves the filtering speed to a certain extent.
When the Referer cannot be matched to the white list, the target link URL is matched against the Blocking filters. A Blocking filter directly defines an interception rule rather than a release rule, and when a URL satisfies the filtering conditions of both a Blocking filter and an Exception filter, Blocking takes priority over Exception by default, that is, the URL is intercepted. In short, as long as the target link URL satisfies the filtering condition of any Blocking filter in the filter list (an interception rule), it is intercepted directly, whether or not it satisfies a release rule; if the target link URL satisfies no interception rule, it is allowed through, whether or not it satisfies a release rule.
If the target link URL were instead matched against the Exception filters in the filter list first, the web page filtering engine could not release it directly even if it satisfied the filtering condition of some Exception filter, because it might also satisfy the filtering condition of a Blocking filter. The target link URL would have to be matched against the Blocking filters once more: if no Blocking filter matched, the target link URL satisfies a release rule and no interception rule, and is allowed through; if a Blocking filter matched, the target link URL satisfies both a release rule and an interception rule, and interception takes priority. Therefore, matching the target link URL against the Blocking filters first means it never needs to be matched against the Exception filters, which improves the filtering and access efficiency for the web page to be accessed.
Of course, as an alternative, when the Referer cannot be matched to the white list, the target link URL may be matched against the Exception filters in the filter list first and then against the Blocking filters; this is merely slower than the filtering logic of the scheme in fig. 3.
In the present application, when the Referer or the target link URL is matched against the filter list, the TrieMap is matched first, and the TrieSet is matched only if matching against the TrieMap fails. The TrieMap supports fast lookup, positioning and matching by keyword, whereas the TrieSet must be matched by traversal, so the TrieMap is clearly faster than the TrieSet for search and matching. Giving the TrieMap a higher matching priority than the TrieSet therefore improves the filtering and access efficiency for the web page to be accessed.
In an exemplary implementation, three representative filter lists are chosen as examples: easylist, energized unified protection, and adguard. easylist and adguard are widely used filter lists; energized unified protection is the largest known standard filter list, with a filter count that can reach the tens of millions, so it can be used to test the performance of the algorithm under extreme conditions. In the C++ STL, a map has two main implementations, std::map and std::unordered_map; the former is implemented with an RB-Tree and the latter with a Hash Table. Similarly, a set has two main implementations, std::set and std::unordered_set, implemented with an RB-Tree and a Hash Table respectively. In this embodiment, TreeSet and TreeMap refer to std::set and std::map, and HashSet and HashMap refer to std::unordered_set and std::unordered_map. This embodiment compares TrieSet and TrieMap, both implemented with a HAT Trie, against the data structures above.
First test: reading and initialization of the filters
(a) Performance comparison with respect to Set
The three filter lists are each read into a TrieSet, a TreeSet and a HashSet, and the memory overhead and traversal time of the three are measured.
The comparison of Set memory overhead is shown in fig. 4. The memory footprint of TrieSet is much smaller than that of TreeSet and HashSet, and the larger the filter list, the more pronounced the advantage of TrieSet. This indicates that TrieSet scales very well when handling filters, mainly because the more filters there are, the denser the filters with common prefixes become, and the more memory is saved.
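The prefix-sharing effect can be seen in a toy example. The sketch below is a plain character trie written for illustration; it is not the HAT Trie used in the embodiment, the class name CharTrieSet is hypothetical, and the three sample filters are invented. It compares the number of trie nodes with the number of characters that would be held if each filter were stored as a separate string.

```python
class CharTrieSet:
    """Toy character trie; each node is a dict from character to child node."""

    def __init__(self):
        self._root = {}
        self.node_count = 0

    def add(self, key):
        node = self._root
        for ch in key:
            if ch not in node:
                node[ch] = {}
                self.node_count += 1   # a new node only where the prefix diverges
            node = node[ch]
        node[None] = True              # end-of-key marker

    def __contains__(self, key):
        node = self._root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return False
        return None in node


filters = [
    "||ads.example.com/banner",
    "||ads.example.com/popup",
    "||ads.example.net/track",
]
trie = CharTrieSet()
for f in filters:
    trie.add(f)

total_chars = sum(len(f) for f in filters)
print("characters stored separately:", total_chars)
print("trie nodes:", trie.node_count)
```

Because the second and third filters reuse most of the path created by the first, the node count stays well below the total character count; node counts are only a rough proxy for memory, which in practice also depends on the node representation.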
The comparison of Set traversal time is shown in fig. 5. TrieSet is faster in traversal than TreeSet and HashSet and also scales well, mainly because TrieSet is cache-friendly.
(b) Comparison of Performance with respect to Map
After the three filter lists are read into memory, keywords are extracted from each filter in each list, the filters from which keywords can be effectively extracted are stored in key-value form in the three Maps respectively, and the memory overhead and traversal time of the three are measured.
The Map memory overhead results are shown in fig. 6. The memory consumption of TrieMap is far smaller than that of TreeMap and HashMap; TrieMap performs very well in memory overhead.
The Map traversal time results are shown in fig. 7. TrieMap still outperforms TreeMap and HashMap in traversal time. In summary, compared with the conventional Map and Set data structures, TrieMap and TrieSet bring very significant improvements in memory consumption and traversal time.
Second test: matching a URL against the filters
In this example, the same URL is matched against the filters stored in each of the three classes of data structures. URL is https://c.aaxads.com/aax.jsub=aaxlq225C&hst=www.weatherbug.com&ver=1.2;
Referer is https://www.weatherbug.com/;
ContentType is SCRIPT.
The matching time is shown in fig. 8. When the URL is matched against the filters in the different classes of data structures, the Trie-based structures are still the fastest.
The test results show that Maps and Sets implemented with a Trie significantly reduce memory consumption and traversal time in page filtering, and address the large memory overhead of existing filtering algorithms, which is their biggest shortcoming. In terms of matching time, Trie-based data structures also improve matching speed to some extent, but because the time bottleneck lies mainly in regular expression evaluation, the Trie's improvement in time overhead is less pronounced than its improvement in memory.
In the technical solution provided by the present application, a Trie is introduced into the Map and the Set to store the filter data, yielding the TrieMap and the TrieSet. The Trie saves memory by exploiting the common prefixes of the filters in the filter list, which reduces space complexity. The time complexity of a Trie lookup during filtering depends only on the URL being looked up and is independent of the number of filters in the filter list; since URLs are generally tens to hundreds of characters long, the efficiency of filter matching can be improved. Based on the TrieMap and TrieSet storage structures, the link information of the page to be accessed (the URL, the Referer and so on) and content such as the white list predefined in the filter list are used to filter the target link and output a filtering result, and whether to send an http request to the server is decided according to that result. If the filtering result is allow release, the target link is considered safe and is not filtered, and an http request for the target link is sent to the server; if the filtering result is intercept, the target link is considered risky and must be filtered, and no http request for the target link is sent to the server. The present application therefore ensures the security of the browser, improves the speed of web page filtering and access, and markedly reduces the memory overhead of storing filter data, which makes it especially suitable for intelligent devices with relatively little memory, such as smart TVs.
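The statement that lookup time depends on the key rather than on the number of stored filters can be checked with a toy trie-backed map. The sketch below is again a plain character trie, not the HAT Trie of the embodiment; the class name TrieMap is reused only for readability, and the step counter, keys and values are assumptions introduced for this example.

```python
class TrieMap:
    """Toy trie-backed map whose get() also reports how many characters it walked."""

    def __init__(self):
        self._root = {}

    def put(self, key, value):
        node = self._root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = value             # value stored under an end-of-key marker

    def get(self, key):
        node, steps = self._root, 0
        for ch in key:
            steps += 1
            node = node.get(ch)
            if node is None:
                return None, steps     # miss: stops as soon as the path ends
        return node.get(None), steps


small, large = TrieMap(), TrieMap()
small.put("aaxads", "filter-A")
for i in range(10_000):
    large.put("keyword%d" % i, "filter-%d" % i)
large.put("aaxads", "filter-A")

print(small.get("aaxads"))
print(large.get("aaxads"))
```

Both lookups walk the same six characters even though the second map holds ten thousand more entries, which is the behaviour the paragraph above relies on.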
In an exemplary implementation, the present application further provides a computer storage medium in which a program may be stored; when executed, the program may perform the program steps of the web page filtering method according to the embodiments of the present application. The computer storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A web page filtering method, characterized by comprising:
when an operation instruction for starting a browser application is received, obtaining a filter list, wherein the filter list comprises a filter;
if no keyword can be extracted from the filter, storing the filter in a TrieSet;
if at least one keyword is extracted from the filter and all of the keywords are occupied in a TrieMap, storing the filter in the TrieSet;
if at least one keyword is extracted from the filter and a keyword is not occupied in the TrieMap, storing the keyword and the filter in the TrieMap;
when an operation instruction of a user for accessing a target link is received, acquiring link information associated with the target link;
filtering the target link according to the link information, the filter list, the TrieMap and/or the TrieSet;
if the filtering result is allow release, sending an access request for the target link to a server;
if the filtering result is intercept, not sending the access request for the target link to the server.
2. The method of claim 1, wherein filtering the target link comprises:
if the source of the object to be accessed indicated in the link information is a Mainframe, outputting the filtering result as allow release.
3. The method of claim 2, wherein filtering the target link comprises:
if the source of the object to be accessed indicated in the link information is not a Mainframe, retrieving from the TrieSet a white list defined in the filter list, wherein the white list comprises a plurality of authorized URLs, and any jump link configured in a web page corresponding to an authorized URL is allowed to be released;
if the Referer indicated in the link information is in the white list, outputting the filtering result as allow release.
4. The method of claim 3, wherein filtering the target link comprises:
if the Referer indicated in the link information is not in the white list, extracting a first keyword of the Referer;
if a first exception filter matching the first keyword is found in the TrieMap and the Referer satisfies the filtering condition defined by the first exception filter, adding the Referer to the white list and outputting the filtering result as allow release; wherein the filtering condition comprises a basic filtering rule and additional filtering options, and the filtering condition of an exception filter defines the condition to be satisfied for releasing the target link.
5. The method of claim 4, wherein filtering the target link comprises:
if no first exception filter matching the first keyword is found in the TrieMap, or the Referer does not satisfy the filtering condition of the first exception filter, matching the Referer against the exception filters in the TrieSet;
if the Referer satisfies the filtering condition defined by a second exception filter in the TrieSet, adding the Referer to the white list and outputting the filtering result as allow release.
6. The method of claim 5, wherein filtering the target link comprises:
if no second exception filter is matched in the TrieSet, not adding the Referer to the white list, and extracting a second keyword of the target link URL indicated in the link information;
if a first blocking filter matching the second keyword is found in the TrieMap and the target link URL satisfies the filtering condition defined by the first blocking filter, outputting the filtering result as intercept; wherein the filtering condition of a blocking filter defines the condition to be satisfied for intercepting the target link.
7. The method of claim 6, wherein filtering the target link comprises:
if no first blocking filter matching the second keyword is found in the TrieMap, or the target link URL does not satisfy the filtering condition defined by the first blocking filter, matching the target link URL against the blocking filters in the TrieSet;
if the target link URL satisfies the filtering condition defined by a second blocking filter in the TrieSet, outputting the filtering result as intercept;
if no second blocking filter is matched in the TrieSet, outputting the filtering result as allow release.
8. The method of claim 1, wherein after obtaining the filter list, the method further comprises:
preprocessing the filter.
9. The method of claim 8, wherein preprocessing the filter comprises:
identifying and counting the type of the filter according to the type identifier of the filter, wherein the types of filters include exception filters and blocking filters;
separating the basic filtering rule and the additional filtering options in the filtering condition defined by the filter.
10. An intelligent device, wherein a browser application is configured in the intelligent device, a filtering module is configured in the browser application, and the filtering module is configured to execute:
when an operation instruction for starting the browser application is received, obtaining a filter list, wherein the filter list comprises a filter;
if no keyword can be extracted from the filter, storing the filter in a TrieSet;
if at least one keyword is extracted from the filter and all of the keywords are occupied in a TrieMap, storing the filter in the TrieSet;
if at least one keyword is extracted from the filter and a keyword is not occupied in the TrieMap, storing the keyword and the filter in the TrieMap;
when an operation instruction of a user for accessing a target link is received, acquiring link information associated with the target link;
filtering the target link according to the link information, the filter list, the TrieMap and/or the TrieSet;
if the filtering result is allow release, sending an access request for the target link to a server;
if the filtering result is intercept, not sending the access request for the target link to the server.
CN202111113915.XA 2021-09-23 2021-09-23 Webpage filtering method and intelligent device Active CN113905275B (en)

Publications (2)

Publication Number Publication Date
CN113905275A CN113905275A (en) 2022-01-07
CN113905275B true CN113905275B (en) 2023-09-15

Family

ID=79028958

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant