CN116502009A

CN116502009A - Webpage filtering method, device, equipment and storage medium

Info

Publication number: CN116502009A
Application number: CN202310751453.7A
Authority: CN
Inventors: 陈志丰
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2023-06-25
Filing date: 2023-06-25
Publication date: 2023-07-28
Anticipated expiration: 2043-06-25
Also published as: CN116502009B

Abstract

The invention relates to the field of Internet technology and discloses a webpage filtering method, device, equipment and storage medium. The method comprises the following steps: when a webpage to be filtered is received, performing word segmentation on the text content of the webpage to obtain word segmentation information corresponding to the text content; acquiring the word segmentation identifier corresponding to each piece of word segmentation information, and generating a word segmentation identifier sequence of the text content from those identifiers; matching the word segmentation identifier sequence against each filtering rule in a preset prefix tree, wherein the preset prefix tree stores the correspondence between identifier sequences and filtering rules; and filtering the webpage according to the matching result. Because the word segmentation identifier sequence of the text content is matched against the filtering rules in the preset prefix tree and the webpage is filtered according to the matching result, the invention solves the technical problem that the prior art cannot filter webpages at the trillion scale, which degrades the user's search experience.

Description

Webpage filtering method, device, equipment and storage medium

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for filtering web pages.

Background

With the rapid development of Internet technology, search engines have become one of the main ways for people to acquire information and have brought great convenience to daily life. However, a large amount of bad information exists in the web pages, videos and pictures on the Internet, which is harmful both to social stability and to the physical and mental health of Internet users. Therefore, when a search engine records the search information of a user, it needs to identify and filter bad information so as to provide the user with a clean search service.

In the existing scheme, bad webpages can be filtered in real time by means of instructions issued by the Ministry of Industry and Information Technology or the Cyberspace Administration, anti-cheating technology, offline mining and other rules. However, because of the very large volume of data on the Internet, the filtering approach of the prior art cannot filter web pages at the trillion scale, which degrades the user's search experience.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

The main object of the invention is to provide a webpage filtering method, device, equipment and storage medium that solve the technical problem that the prior art cannot filter webpages at the trillion scale, which degrades the user's search experience.

In order to achieve the above object, the present invention provides a web page filtering method, which includes the following steps:

when a webpage to be filtered is received, word segmentation processing is carried out on text content in the webpage to be filtered so as to obtain word segmentation information corresponding to the text content;

acquiring word segmentation identifiers corresponding to the word segmentation information respectively, and generating a word segmentation identifier sequence of the text content according to the word segmentation identifiers;

matching the word segmentation identification sequence with each filtering rule in a preset prefix tree, wherein the preset prefix tree stores the corresponding relation between the identification sequence and the filtering rules;

and filtering the webpage to be filtered according to the matching result.

Optionally, before the step of performing word segmentation processing on text content in the web page to be filtered to obtain word segmentation information corresponding to the text content when the web page to be filtered is received, the method further includes:

obtaining the identifications corresponding to all words in the text corresponding to the filtering rules according to the filtering rule word list;

and generating an identification sequence of the text corresponding to the filtering rule according to the identification, and establishing a preset prefix tree based on the identification sequence and the filtering rule.

Optionally, the step of generating the identification sequence of the text corresponding to the filtering rule according to the identification, and establishing a preset prefix tree based on the identification sequence and the filtering rule includes:

arranging the identifications in a preset arrangement mode, and generating an identification sequence of the text corresponding to the filtering rule according to an arrangement result;

and establishing the preset prefix tree by using the identifiers in the identification sequence as nodes of the preset prefix tree and the filtering rule as a leaf node of the preset prefix tree.

Optionally, the step of establishing the preset prefix tree by using the identifiers in the identification sequence as nodes of the preset prefix tree and the filtering rule as a leaf node of the preset prefix tree includes:

storing the marks in the mark sequence to the nodes of a preset prefix tree correspondingly according to a monotonically increasing arrangement mode;

and storing the filtering rules corresponding to the identification sequences into leaf nodes of the corresponding preset prefix trees respectively so as to establish the preset prefix trees based on the nodes and the leaf nodes.

Optionally, before the step of obtaining the identifiers corresponding to all the words in the text corresponding to the filtering rule according to the filtering rule vocabulary, the method further includes:

acquiring word identifiers corresponding to words to be filtered;

and establishing a filtering rule word list through a multimode matching algorithm based on the word to be filtered and the word identification.

Optionally, the step of obtaining word segmentation identifiers corresponding to the word segmentation information respectively and generating the word segmentation identifier sequence of the text content according to the word segmentation identifiers includes:

performing format conversion on word segmentation character strings in each word segmentation information to obtain word segmentation identifiers corresponding to the word segmentation information;

and arranging the word segmentation identifiers in the preset arrangement mode, and generating a word segmentation identifier sequence of the text content according to an arrangement result.

Optionally, before the step of matching the word segmentation identifier sequence with each filtering rule in the preset prefix tree, the method further includes:

recursively inquiring the word segmentation identification sequence in a preset prefix tree;

judging whether a filtering rule corresponding to the word segmentation identification sequence exists or not according to the query result;

and if so, executing the step of matching the word segmentation identification sequence with each filtering rule in a preset prefix tree.

In addition, in order to achieve the above object, the present invention also proposes a web page filtering apparatus, the apparatus comprising: a memory, a processor, and a web page filter program stored on the memory and executable on the processor, the web page filter program configured to implement the steps of the web page filtering method as described above.

In addition, to achieve the above object, the present invention also proposes a storage medium having stored thereon a web page filtering program which, when executed by a processor, implements the steps of the web page filtering method as described above.

In addition, in order to achieve the above object, the present invention also provides a web page filtering device, which comprises an information acquisition module, a sequence generation module, a rule matching module and a webpage filtering module;

the information acquisition module is used for carrying out word segmentation processing on text content in the webpage to be filtered when the webpage to be filtered is received so as to acquire word segmentation information corresponding to the text content;

the sequence generation module is used for acquiring word segmentation identifiers corresponding to the word segmentation information respectively and generating a word segmentation identifier sequence of the text content according to the word segmentation identifiers;

the rule matching module is used for matching the word segmentation identification sequence with each filtering rule in a preset prefix tree, and the preset prefix tree stores the corresponding relation between the identification sequence and the filtering rule;

and the webpage filtering module is used for filtering the webpage to be filtered according to the matching result.

In the present invention, when a webpage to be filtered is received, word segmentation is performed on the text content of the webpage to obtain word segmentation information corresponding to the text content; the word segmentation identifier corresponding to each piece of word segmentation information is acquired, and a word segmentation identifier sequence of the text content is generated from those identifiers; the word segmentation identifier sequence is matched against each filtering rule in a preset prefix tree, wherein the preset prefix tree stores the correspondence between identifier sequences and filtering rules; and the webpage is filtered according to the matching result. The prior art filters bad webpages by means of instructions issued by the Ministry of Industry and Information Technology or the Cyberspace Administration, anti-cheating technology, offline mining and other rules. By contrast, the present invention matches the word segmentation identifier sequence of the text content against each filtering rule in the preset prefix tree and filters the webpage according to the matching result, thereby solving the technical problem that the prior art cannot filter webpages at the trillion scale, which degrades the user's search experience.

Drawings

FIG. 1 is a schematic diagram of a configuration of a web page filtering device in a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart of a web page filtering method according to a first embodiment of the present invention;

FIG. 3 is a flowchart illustrating a method for generating a word segmentation identifier sequence according to a first embodiment of the web page filtering method of the present invention;

FIG. 4 is a diagram illustrating a filtering rule vocabulary according to a first embodiment of the web page filtering method of the present invention;

FIG. 5 is a flowchart illustrating rule matching in the first embodiment of the web page filtering method of the present invention;

FIG. 6 is a flowchart illustrating a web page filtering method according to a second embodiment of the present invention;

FIG. 7 is a flowchart of a method for constructing a preset prefix tree according to a second embodiment of the present invention;

FIG. 8 is a flowchart illustrating a third embodiment of a web page filtering method according to the present invention;

FIG. 9 is a block diagram of a first embodiment of a web page filtering apparatus according to the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a web page filtering device of a hardware running environment according to an embodiment of the present invention.

As shown in fig. 1, the web page filtering apparatus may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.

Those skilled in the art will appreciate that the structure shown in fig. 1 is not limiting of the web page filtering apparatus and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.

As shown in fig. 1, an operating system, a network communication module, a user interface module, and a web page filter program may be included in the memory 1005 as one type of storage medium.

In the web page filtering apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the web page filtering apparatus of the present invention may be disposed in the web page filtering apparatus, and the web page filtering apparatus calls the web page filtering program stored in the memory 1005 through the processor 1001 and executes the web page filtering method provided by the embodiment of the present invention.

An embodiment of the invention provides a web page filtering method, referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the web page filtering method of the invention.

In this embodiment, the web page filtering method includes the following steps:

step S10: when a webpage to be filtered is received, word segmentation processing is carried out on text content in the webpage to be filtered so as to obtain word segmentation information corresponding to the text content.

It should be noted that, the execution body of the method of this embodiment may be a web page filtering device that filters a web page with bad information when a user searches, or other web page filtering systems that can implement the same or similar functions and include the web page filtering device. The web page filtering method provided in this embodiment and the following embodiments will be specifically described with a web page filtering system (hereinafter referred to as a system).

It should be understood that, the above-mentioned web page to be filtered may be a web page that has bad information and needs to be filtered, and the specific web page type is not limited in this embodiment. In practical application, in order to provide a good search environment for users when the users search, the system needs to filter bad webpages in a data screening stage, an inverted index library building stage and a webpage online recall stage after webpage crawling is completed.

It may be appreciated that the text content may be text content in a web page to be filtered, where the text content may be text content in text in a web page, or text content in a video or a picture in a web page, which is not limited in this embodiment.

The word segmentation process may be a process of splitting text content into individual words. In this embodiment, word segmentation may be performed on the text content through a multi-pattern matching algorithm (such as an AC automaton). After word segmentation, the individual words corresponding to all the text content in the webpage to be filtered are obtained; words that do not appear in the preset filtering rule word list are discarded, and the remaining words form the word segmentation information corresponding to the text content. A multi-pattern matching algorithm is an algorithm for searching for a plurality of patterns in a text at once, and can also be used for detecting malicious code, attack behavior and the like in network traffic.
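As a rough illustration of the multi-pattern matching step, the following Python sketch builds a small Aho-Corasick (AC) automaton over a vocabulary and reports every vocabulary word found in a text. The class name `AhoCorasick`, the method names and the sample vocabulary are hypothetical, and English strings stand in for the Chinese content; the source does not specify an implementation.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-pattern search (illustrative only)."""

    def __init__(self, patterns):
        # Per state: transition dict, failure link, and patterns ending at that state.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep the root as failure target.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit the patterns that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find_all(self, text):
        """Return (end_index, pattern) pairs for every vocabulary hit in text."""
        hits = []
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pattern in self.out[state]:
                hits.append((i, pattern))
        return hits


# Hypothetical vocabulary and text; only vocabulary words survive the scan.
automaton = AhoCorasick(["bun", "shop"])
words = [pattern for _, pattern in automaton.find_all("old wang bun shop")]
```

Text that matches no vocabulary entry is simply skipped by the scan, which corresponds to discarding words that are not in the filtering rule word list.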

In a specific implementation, when a search request of a user is received, a corresponding webpage to be filtered can be determined according to search content corresponding to the search request, text content in characters, pictures and videos in the webpage to be filtered is obtained, then the text content is split, single characters corresponding to the text content are obtained, and word segmentation information corresponding to the text content is obtained based on the single characters.

Further, in order to improve the efficiency of web page filtering, the step S10 may include: when a webpage to be filtered is received, judging content attributes corresponding to text content in the webpage to be filtered; judging whether illegal words exist in a text title corresponding to the text content when the content attribute belongs to a target attribute; if the text content does not exist, word segmentation processing is carried out on the text content so as to obtain word segmentation information corresponding to the text content.

It should be appreciated that the content attribute may be a category attribute corresponding to text content, for example: politics, economy, entertainment, sports, etc., to which this embodiment is not limited.

It can be appreciated that, in this embodiment, the target attribute may be set to politics, or other content categories needing to be filtered with emphasis, so as to enhance detection of text content with the content attribute as the target attribute in the web page, thereby improving accuracy of web page filtering. In practical application, a violation vocabulary library can be preset for the text content of the target attribute, and when the text content in the webpage to be filtered is detected to be the target attribute, whether the violation vocabulary exists in the text title corresponding to the text content can be judged through the preset violation vocabulary library. If the text content exists, directly filtering the webpage to be filtered corresponding to the text content, so as to realize the primary filtering of the webpage to be filtered; if not, performing secondary filtering on the text content, namely continuously executing word segmentation processing on the text content to obtain word segmentation information corresponding to the text content. According to the embodiment, the first-stage filtering can be directly performed on the webpage to be filtered corresponding to the text content according to the text title of the text content, and the second-stage filtering can be continuously performed on the webpage to be filtered which is not filtered in the first-stage filtering, so that the accuracy of filtering the webpage is improved.

Further, in order to improve the data processing efficiency, after the step of determining the content attribute corresponding to the text content in the web page to be filtered when the web page to be filtered is received, the method may further include: judging whether illegal words exist in a text title corresponding to the text content when the content attribute does not belong to a target attribute; and if not, carrying out response processing on the webpage to be filtered.

In a specific implementation, if the content attribute corresponding to the text content of the webpage to be filtered does not belong to the target attribute, the text content is indicated to be non-key content, multiple times of inspection can be omitted from the text content, at this time, the text title corresponding to the text content can be inspected, and if the text title does not contain the vocabulary in the preset illegal vocabulary library, the webpage to be filtered corresponding to the text content can be directly responded, so that the data processing efficiency is improved.
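The title-based pre-check described above can be sketched as a small decision function. This is only a sketch under assumptions: the names `Decision` and `prefilter`, the substring test on the title and the sample data are hypothetical. The source does not say what happens when a non-target page has a violating title, so that branch is an explicitly labeled assumption.

```python
from enum import Enum


class Decision(Enum):
    FILTER = "filter"    # drop the page immediately (first-stage filtering)
    SEGMENT = "segment"  # continue with word segmentation (second-stage filtering)
    RESPOND = "respond"  # return the page without further checks


def prefilter(content_attribute, title, target_attributes, violation_words):
    """Decide how to handle a page from its content attribute and text title."""
    title_has_violation = any(word in title for word in violation_words)
    if content_attribute in target_attributes:
        # Target attribute: a violating title filters the page directly,
        # otherwise the text content goes on to word segmentation.
        return Decision.FILTER if title_has_violation else Decision.SEGMENT
    if not title_has_violation:
        # Non-target attribute with a clean title: respond directly.
        return Decision.RESPOND
    # ASSUMPTION: the source leaves this case unspecified; filtering the page
    # mirrors the target-attribute branch but is not stated in the text.
    return Decision.FILTER


# Hypothetical sample data.
targets = {"politics"}
violations = {"badword"}
decision = prefilter("sports", "match report", targets, violations)
```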

Step S20: and obtaining word segmentation identifiers corresponding to the word segmentation information respectively, and generating a word segmentation identifier sequence of the text content according to the word segmentation identifiers.

The word segmentation identifier may be an identifier corresponding to each word in the word segmentation information.

It should be understood that the word segmentation identification sequence may be a sequence obtained by arranging the identification numbers corresponding to the words in the word segmentation information. In this embodiment, the identifiers corresponding to the words may be arranged in ascending order of their values so as to generate an ordered word segmentation identification sequence corresponding to the text content.

It is to be understood that the step S20 may specifically include: performing format conversion on word segmentation character strings in each word segmentation information to obtain word segmentation identifiers corresponding to the word segmentation information; and arranging the word segmentation identifiers in a preset arrangement mode, and generating a word segmentation identifier sequence of the text content according to an arrangement result.

It should be noted that the word segmentation string may be a word in the word segmentation information. Correspondingly, the format conversion of the word segmentation character string can be that the word segmentation character string is converted into a corresponding word segmentation identifier.

It should be understood that the above-mentioned preset arrangement mode may be an ascending arrangement mode, that is, the segmentation is arranged according to the size of the segmentation identifier, and the arranged segmentation identifier forms a segmentation identifier sequence corresponding to the text content.

In a specific implementation, referring to fig. 3, fig. 3 is a schematic flow chart of generating a word segmentation identifier sequence in a first embodiment of the web page filtering method according to the present invention. As shown in fig. 3, suppose the text content (Input) contains the characters "Chen, fight, eat, face, package, son". Word segmentation is performed on the text content by an AC automaton, and characters that are not in the preset word list to be filtered are discarded, so that only the characters "Chen, fight, eat, face, package, son" in the word list are retained. Format conversion (Encode by word table) is then performed on the word segmentation string in each piece of word segmentation information according to the word list, yielding the word segmentation identifiers "1, 10, 6, 11, 2, 3" for these characters. Finally, the word segmentation identifiers are arranged in ascending order of their values (Sorted Encoding), and the word segmentation identifier sequence "1, 2, 3, 6, 10, 11" corresponding to the text content is generated from the arrangement result.
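The encode-then-sort step of fig. 3 can be sketched in a few lines of Python. The function name `build_id_sequence` is hypothetical, and the English tokens are stand-ins for the individual Chinese characters; the identifier values are the ones given in the example.

```python
def build_id_sequence(tokens, vocabulary):
    """Encode tokens with the word list, drop unknown tokens, sort ascending."""
    ids = [vocabulary[token] for token in tokens if token in vocabulary]
    return sorted(ids)


# Word list from the example: token -> identifier (tokens are English stand-ins).
VOCABULARY = {
    "Chen": 1, "package": 2, "son": 3, "small": 4, "king": 5, "eat": 6,
    "old": 7, "shop": 8, "inner": 9, "fight": 10, "face": 11,
}

# "and" is not in the word list, so it is discarded before encoding.
tokens = ["Chen", "and", "fight", "eat", "face", "package", "son"]
sequence = build_id_sequence(tokens, VOCABULARY)  # [1, 2, 3, 6, 10, 11]
```

Sorting gives every text a canonical order, so the same set of words always produces the same identifier sequence regardless of where the words appear in the page.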

Step S30: matching the word segmentation identification sequence with each filtering rule in a preset prefix tree, wherein the preset prefix tree stores the corresponding relation between the identification sequence and the filtering rules.

It should be noted that, the preset prefix tree may be a pre-constructed tree storing the identification sequence and the filtering rule, and the identification sequence and the filtering rule are stored correspondingly in the preset prefix tree.

It should be appreciated that the filtering rules described above may be rules that filter text content.

In a specific implementation, in this embodiment, the word segmentation identifier sequence may be matched with each filtering rule in the preset prefix tree in a preset matching manner, where the preset matching manner includes at least one of complete matching, fuzzy matching, prefix matching, suffix matching, inclusion matching and combination matching. Complete matching succeeds when the webpage content is exactly the same as the user query content. Fuzzy matching succeeds when the webpage content differs from the user query content only by content added before and after it and by spaces and common punctuation added in the middle. Prefix matching succeeds when content is added only after the user query content. Suffix matching succeeds when content is added only before the user query content. Inclusion matching succeeds when content is added before and after the user query content and nothing is added in the middle. In this embodiment, one or more of the above matching modes may be selected for matching.
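One loose reading of these matching modes, expressed on plain strings, is sketched below. All function names are hypothetical, and the interpretation is an assumption: each mode is treated non-strictly (for example, prefix matching also accepts an exact match), fuzzy matching is approximated by removing whitespace and ASCII punctuation before a substring test, and combination matching is omitted because the source does not define it.

```python
import string


def complete_match(page, query):
    """Page content is exactly the query content."""
    return page == query


def prefix_match(page, query):
    """Extra content may appear only after the query content."""
    return page.startswith(query)


def suffix_match(page, query):
    """Extra content may appear only before the query content."""
    return page.endswith(query)


def inclusion_match(page, query):
    """Extra content before and after, nothing inserted in the middle."""
    return query in page


def fuzzy_match(page, query):
    """Like inclusion, but spaces and common punctuation may sit in the middle.

    ASSUMPTION: "common punctuation" is approximated by string.punctuation.
    """
    stripped = "".join(
        ch for ch in page if not ch.isspace() and ch not in string.punctuation
    )
    return query in stripped


# Hypothetical sample: the query is broken up by spaces and a comma.
page_text = "x a, b c y"
fuzzy_hit = fuzzy_match(page_text, "abc")
plain_hit = inclusion_match(page_text, "abc")
```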

Step S40: and filtering the webpage to be filtered according to the matching result.

It should be understood that, in this embodiment, the rule identifier in the word segmentation identifier sequence may be obtained according to the matching result, and the web page to be filtered may be filtered when the rule identifier is obtained.

It can be understood that if the word segmentation identification sequence corresponding to the text content is matched with the filtering rule, the web page corresponding to the text content needs to be filtered; if the word segmentation identification sequence corresponding to the text content is not matched with the filtering rule, the webpage corresponding to the text does not need to be filtered.

In a specific implementation, referring to fig. 4, fig. 4 is a schematic diagram of a filtering rule vocabulary in a first embodiment of the web page filtering method of the present invention. As shown in fig. 4, each row in the filtering rule vocabulary (Filter Rules) represents a filtering rule, and each filtering rule is made up of one or more words (Word). For example, the filtering rule R1 is composed of the three words Word11, Word12 and Word13, and when the text content contains all three words Word11, Word12 and Word13, the text content is judged to match the filtering rule R1. Referring to fig. 5, fig. 5 is a schematic flow chart of rule matching in the first embodiment of the web page filtering method of the present invention. As shown in fig. 5, in the preset prefix tree (trie tree), the identification sequence corresponding to the filtering rule R1 is "1, 2, 3, 4, 4, 6"; the identification sequence corresponding to the filtering rule R2 is "1, 2, 3, 11"; the identification sequence corresponding to the filtering rule R3 is "2, 3, 5, 7, 8"; and the identification sequence corresponding to the filtering rule R4 is "1, 5, 9, 10". In this embodiment, the text content "Chen, fight, eat, face, package, son" corresponds to the word segmentation identifiers "1, 10, 6, 11, 2, 3", which include every identifier of the filtering rule R2. The text content therefore matches the filtering rule R2, the rule identifiers "1, 2, 3, 11" are obtained, and the web page to be filtered corresponding to the text content is then filtered.
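The matching criterion in this example (a text matches a rule when it contains every identifier of the rule) can be checked directly, without a prefix tree, as in the sketch below. The function name `matching_rules` is hypothetical. Treating a repeated identifier, such as the two 4s in R1, as requiring two occurrences in the text is an assumption; the source does not say how repeats are handled.

```python
from collections import Counter

# Identification sequences of the four filtering rules from the example.
RULES = {
    "R1": [1, 2, 3, 4, 4, 6],
    "R2": [1, 2, 3, 11],
    "R3": [2, 3, 5, 7, 8],
    "R4": [1, 5, 9, 10],
}


def matching_rules(doc_ids, rules):
    """Return the names of all rules whose identifiers all occur in doc_ids."""
    available = Counter(doc_ids)
    matched = []
    for name, rule_ids in rules.items():
        needed = Counter(rule_ids)
        # ASSUMPTION: a repeated identifier must occur at least as often in the text.
        if all(available[identifier] >= count for identifier, count in needed.items()):
            matched.append(name)
    return matched


result = matching_rules([1, 10, 6, 11, 2, 3], RULES)  # ["R2"]
```

This brute-force check visits every rule for every page, which is what the prefix tree is meant to avoid when the rule list is very large.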

In this embodiment, when a webpage to be filtered is received, word segmentation is performed on the text content of the webpage to obtain word segmentation information corresponding to the text content; the word segmentation identifier corresponding to each piece of word segmentation information is acquired, and a word segmentation identifier sequence of the text content is generated from those identifiers; the word segmentation identifier sequence is matched against each filtering rule in a preset prefix tree, wherein the preset prefix tree stores the correspondence between identifier sequences and filtering rules; and the webpage is filtered according to the matching result. The prior art filters bad webpages by means of instructions issued by the Ministry of Industry and Information Technology or the Cyberspace Administration, anti-cheating technology, offline mining and other rules. By contrast, this embodiment matches the word segmentation identifier sequence of the text content against each filtering rule in the preset prefix tree and filters the webpage according to the matching result, thereby solving the technical problem that the prior art cannot filter webpages at the trillion scale, which degrades the user's search experience.

Referring to fig. 6, fig. 6 is a flowchart illustrating a web page filtering method according to a second embodiment of the present invention.

Based on the first embodiment, in this embodiment, before step S10, the method further includes:

Step S01: and obtaining the identifications corresponding to all words in the text corresponding to the filtering rules according to the filtering rule word list.

It should be noted that, the filtering rule vocabulary may be a table storing the correspondence between the word to be filtered and the identifier corresponding to the word to be filtered.

It should be understood that, before the step S01, the method further includes: acquiring word identifiers corresponding to words to be filtered; and establishing a filtering rule word list through a multimode matching algorithm based on the word to be filtered and the word identification.

It can be understood that the word to be filtered may be a word that contains bad information and needs to be filtered, and the number and the kind of the word to be filtered are not limited in this embodiment.

In a specific implementation, the words to be filtered are determined first, for example "Chen, package, son, small, king, eat, old, shop, inner, fight, face"; the word identifiers corresponding to these words are set to 1 through 11 respectively; and finally a filtering rule word list is established through a multi-pattern matching algorithm. Meanwhile, referring to fig. 7, fig. 7 is a flowchart illustrating the construction of a preset prefix tree in the second embodiment of the web page filtering method according to the present invention. As shown in fig. 7, since the mapping between each word to be filtered and its word identifier is stored in the filtering rule word list, the word list can be queried to obtain the identifiers (word-ids) corresponding to all the words in the text corresponding to each filtering rule. For example: the text corresponding to the filtering rule R1 is "Chen, small, small, eat, package, son", and the identifiers corresponding to all words in this text are "1, 4, 4, 6, 2, 3"; the text corresponding to the filtering rule R2 is "Chen, package, son, face", and the identifiers are "1, 2, 3, 11"; the text corresponding to the filtering rule R3 is "old, king, package, son, shop", and the identifiers are "7, 5, 2, 3, 8"; the text corresponding to the filtering rule R4 is "Chen, king, inner, fight", and the identifiers are "1, 5, 9, 10".
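The word list lookup for the four rules can be reproduced with a plain dictionary, as in the following sketch. The names `WORDS`, `VOCAB`, `RULE_TEXTS` and `encode_rule` are hypothetical, and the English tokens stand in for the individual Chinese characters of the example; the identifier values 1 through 11 are those given above.

```python
# Words to be filtered, in identifier order 1..11 (English stand-ins).
WORDS = ["Chen", "package", "son", "small", "king", "eat",
         "old", "shop", "inner", "fight", "face"]

# Filtering rule word list: word -> word identifier.
VOCAB = {word: identifier for identifier, word in enumerate(WORDS, start=1)}

# Text of each filtering rule, split into its individual words.
RULE_TEXTS = {
    "R1": ["Chen", "small", "small", "eat", "package", "son"],
    "R2": ["Chen", "package", "son", "face"],
    "R3": ["old", "king", "package", "son", "shop"],
    "R4": ["Chen", "king", "inner", "fight"],
}


def encode_rule(tokens, vocab):
    """Look up the identifier of every word in a rule text, keeping text order."""
    return [vocab[token] for token in tokens]


rule_ids = {name: encode_rule(tokens, VOCAB) for name, tokens in RULE_TEXTS.items()}
```

At this stage the identifiers are still in text order; they are sorted only when the identification sequence is generated in step S02.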

Step S02: generating an identification sequence of the text corresponding to the filtering rule according to the identifiers, and establishing a preset prefix tree based on the identification sequence and the filtering rule.

It should be noted that step S02 may specifically include: arranging the identifiers in a preset arrangement mode, and generating an identification sequence of the text corresponding to the filtering rule according to the arrangement result; and establishing the preset prefix tree by taking the identifiers in the identification sequence as nodes of the preset prefix tree and taking the filtering rule as a leaf node of the preset prefix tree.

It should be understood that after the identifiers corresponding to each filtering rule are obtained, the identifiers may be sorted according to an ascending order arrangement, so as to obtain an identifier sequence of the text corresponding to each filtering rule.

It may be appreciated that the step of establishing the preset prefix tree, by taking the identifiers in the identification sequence as nodes of the preset prefix tree and taking the filtering rule as a leaf node of the preset prefix tree, may specifically include: storing the identifiers in the identification sequence into the nodes of the preset prefix tree in a monotonically increasing arrangement; and storing the filtering rule corresponding to each identification sequence into the corresponding leaf node of the preset prefix tree, so as to establish the preset prefix tree based on the nodes and the leaf nodes.

It should be noted that, in this embodiment, the multimode matching algorithm is combined with the prefix tree so as to support rapid filtering of a large-scale vocabulary of bad words. For an application scenario in which trillion-level web pages are matched against a rule word list at the tens-of-millions level, the matching time can be reduced from tens of minutes to tens of milliseconds, so that bad web pages that the search engine would otherwise display to the user are filtered in real time, the user experience is significantly improved, and the search environment is cleaned up.

In a specific implementation, after the identifiers corresponding to all the words in the text corresponding to each filtering rule are obtained, the identifiers can be sorted by value in ascending order to obtain the identification sequence of the text corresponding to each filtering rule. As shown in fig. 7, after the word-ids are arranged in ascending order, the identification sequence of the text corresponding to the filtering rule R1 is "1,2,3,4,4,6", that of the filtering rule R2 is "1,2,3,11", that of the filtering rule R3 is "2,3,5,7,8", and that of the filtering rule R4 is "1,5,9,10". The identifiers in each identification sequence are then stored, by value and in monotonically increasing order, into the nodes of the preset prefix tree, and the filtering rule corresponding to each identification sequence is stored in the corresponding leaf node, so as to establish the preset prefix tree.
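The construction just described can be sketched as below, using the identifiers of R1 to R4 from the example. This is a minimal sketch under assumptions: the class `TrieNode` and the helpers `build_prefix_tree` and `find_path` are hypothetical names, and the rule is recorded on the last node of its path, which is a leaf only when no other rule extends the same sequence.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One node of the prefix tree; children are keyed by word-id."""
    children: dict = field(default_factory=dict)
    rules: list = field(default_factory=list)  # rules whose sequence ends here


def build_prefix_tree(rule_ids):
    """Insert each rule's ascending id sequence as one path from the root."""
    root = TrieNode()
    for rule_name, ids in rule_ids.items():
        node = root
        for word_id in sorted(ids):  # ascending order, duplicates kept
            node = node.children.setdefault(word_id, TrieNode())
        node.rules.append(rule_name)
    return root


def find_path(root, ids):
    """Follow an exact id path; return the final node, or None if absent."""
    node = root
    for word_id in ids:
        node = node.children.get(word_id)
        if node is None:
            return None
    return node


# Unsorted identifiers of the four example rules, as given in the description.
RULE_IDS = {
    "R1": [1, 4, 4, 6, 2, 3],
    "R2": [1, 2, 3, 11],
    "R3": [7, 5, 2, 3, 8],
    "R4": [1, 5, 9, 10],
}

tree = build_prefix_tree(RULE_IDS)
```

In this sketch R1 and R2 share the path 1, 2, 3 and branch only afterwards, which is the prefix sharing that sorting the identifiers makes possible.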

According to the embodiment, the filtering rule word list is built based on the words to be filtered and word identifications corresponding to the words to be filtered, identifications corresponding to all the words in texts corresponding to the filtering rules are obtained according to the filtering rule word list, and identification sequences of the texts corresponding to the filtering rules are generated according to the identifications, so that a preset prefix tree is built based on the identification sequences and the filtering rules, and therefore word segmentation identification sequences of text contents in the webpages to be filtered can be directly matched with each filtering rule in the preset prefix tree, webpages to be filtered are filtered according to matching results, and webpage filtering efficiency is further improved.

Referring to fig. 8, fig. 8 is a flowchart illustrating a third embodiment of a web page filtering method according to the present invention.

Based on the foregoing embodiments, in this embodiment, after step S30, the method further includes:

Step S301: performing a recursive query on the word segmentation identification sequence in the preset prefix tree.

It should be noted that, when the preset prefix tree is constructed, the identifiers in the identifier sequence are stored according to a monotonically increasing arrangement mode (i.e. the identifiers are stored in sequence according to the size of the identifiers), and the word segmentation identifier sequences corresponding to the text content are arranged according to an ascending arrangement mode, i.e. the word segmentation identifier sequences are also arranged in sequence according to the size of the word segmentation identifiers, so that the recursive query can be performed on the word segmentation identifier sequences in the preset prefix tree at this time. In addition, the matching efficiency of the word segmentation identification sequence and the corresponding filtering rule can be improved by inquiring the word segmentation identification sequence in a recursive inquiry mode, so that the efficiency of filtering the webpage is improved.

Step S302: judging, according to the query result, whether a filtering rule corresponding to the word segmentation identification sequence exists.

Step S303: if so, executing the step of matching the word segmentation identification sequence with each filtering rule in the preset prefix tree.

It should be understood that if, after the recursive query, a filtering rule corresponding to the word segmentation identification sequence is found in the preset prefix tree, this indicates that the web page containing the text content corresponding to the word segmentation identification sequence needs to be filtered; if no such filtering rule exists in the preset prefix tree, the web page containing that text content does not need to be filtered and is not processed.

In a specific implementation, the word segmentation identification sequence can be queried in order against the identifiers stored in the nodes of the preset prefix tree, so as to judge whether a filtering rule matching the word segmentation identification sequence exists in the preset prefix tree. If so, the web page containing the text content corresponding to the word segmentation identification sequence is filtered; if not, that web page is not processed.
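One way to read the recursive query is sketched below. It rests on an assumption that the description does not state explicitly: a rule is taken to match when its ascending identification sequence is a subsequence of the page's ascending word segmentation identification sequence, so a rule containing an identifier twice needs that identifier twice on the page. The names `build_prefix_tree`, `query` and `should_filter` are hypothetical.

```python
def build_prefix_tree(rule_ids):
    """Nested-dict prefix tree: each node has 'children' and 'rules'."""
    root = {"children": {}, "rules": []}
    for rule_name, ids in rule_ids.items():
        node = root
        for word_id in sorted(ids):
            node = node["children"].setdefault(
                word_id, {"children": {}, "rules": []}
            )
        node["rules"].append(rule_name)
    return root


def query(node, seq, start=0, found=None):
    """Recursively collect every rule whose path is a subsequence of seq.

    seq must already be sorted ascending, like the paths in the tree.
    """
    if found is None:
        found = set()
    found.update(node["rules"])
    previous = None
    for i in range(start, len(seq)):
        word_id = seq[i]
        if word_id == previous:
            continue  # the first copy of a repeated id already covered this branch
        previous = word_id
        child = node["children"].get(word_id)
        if child is not None:
            query(child, seq, i + 1, found)
    return found


def should_filter(tree, page_ids):
    """True when at least one filtering rule matches the page's ids."""
    return bool(query(tree, sorted(page_ids)))


RULE_IDS = {
    "R1": [1, 4, 4, 6, 2, 3],
    "R2": [1, 2, 3, 11],
    "R3": [7, 5, 2, 3, 8],
    "R4": [1, 5, 9, 10],
}
tree = build_prefix_tree(RULE_IDS)
```

Because both sides are sorted, each recursive call only scans identifiers to the right of the one just consumed, so branches that cannot match are abandoned as soon as a child lookup fails.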

According to the embodiment, recursive query is carried out on the segmentation identification sequence in the preset prefix tree, and whether the filtering rule corresponding to the segmentation identification sequence exists or not is judged according to the query result, so that whether the webpage corresponding to the text content of the segmentation identification sequence needs to be filtered or not can be judged according to the judging result, and the webpage filtering efficiency is improved.

In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a webpage filtering program, and the webpage filtering program realizes the steps of the webpage filtering method when being executed by a processor.

Referring to fig. 9, fig. 9 is a block diagram illustrating a first embodiment of a web filtering apparatus according to the present invention.

As shown in fig. 9, a web page filtering apparatus according to an embodiment of the present invention includes: an information acquisition module 901, a sequence generation module 902, a rule matching module 903 and a web page filtering module 904;

the information acquisition module 901 is configured to perform word segmentation processing on text content in a webpage to be filtered when receiving the webpage to be filtered, so as to acquire word segmentation information corresponding to the text content;

the sequence generating module 902 is configured to obtain word segmentation identifiers corresponding to the word segmentation information respectively, and generate a word segmentation identifier sequence of the text content according to the word segmentation identifiers;

the rule matching module 903 is configured to match the word segmentation identifier sequence with each filtering rule in a preset prefix tree, where a correspondence between the identifier sequence and the filtering rule is stored in the preset prefix tree;

the web page filtering module 904 is configured to filter the web page to be filtered according to the matching result.

Further, the sequence generating module 902 is further configured to perform format conversion on the word segmentation character strings in each word segmentation information, so as to obtain a word segmentation identifier corresponding to each word segmentation information; and arranging the word segmentation identifiers in the preset arrangement mode, and generating a word segmentation identifier sequence of the text content according to an arrangement result.
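The work of the sequence generation module can be sketched as follows. The sketch makes several assumptions that are not taken from the description: whitespace splitting stands in for a real word segmenter, the "format conversion" is modelled as a dictionary lookup, words absent from the word list are simply dropped, and the names `segment` and `to_id_sequence` and the placeholder words `w1` to `w11` are hypothetical.

```python
def segment(text):
    """Stand-in word segmenter: splits on whitespace only (assumption)."""
    return text.split()


def to_id_sequence(tokens, vocabulary):
    """Convert tokens to word-ids and arrange them in ascending order.

    Tokens that are not in the vocabulary are ignored (assumption).
    """
    ids = [vocabulary[token] for token in tokens if token in vocabulary]
    return sorted(ids)


# Placeholder vocabulary: "w1".."w11" mapped to identifiers 1..11.
VOCAB = {f"w{i}": i for i in range(1, 12)}

page_text = "w3 hello w1 w2 world w11"
page_ids = to_id_sequence(segment(page_text), VOCAB)
```

For this sample text the resulting sequence is the same as the ascending identification sequence of rule R2 in the second method embodiment.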

The web page filtering device of this embodiment performs word segmentation processing on text content in a web page to be filtered when the web page is received, so as to obtain word segmentation information corresponding to the text content; acquires word segmentation identifiers corresponding to the word segmentation information respectively, and generates a word segmentation identification sequence of the text content according to the word segmentation identifiers; matches the word segmentation identification sequence with each filtering rule in a preset prefix tree, wherein the preset prefix tree stores the correspondence between identification sequences and filtering rules; and filters the web page to be filtered according to the matching result. Compared with the prior art, which filters bad web pages through rules such as instructions issued by the Ministry of Industry and Information Technology or the Cyberspace Administration office, anti-cheating techniques and offline mining, this embodiment matches the word segmentation identification sequence of the text content in the web page to be filtered with each filtering rule in the preset prefix tree and filters the web page according to the matching result, thereby solving the technical problem that the prior art cannot filter trillion-level web pages, which affects the user's search experience.

Based on the first embodiment of the web page filtering device of the present invention, a second embodiment of the web page filtering device of the present invention is provided.

In this embodiment, the information obtaining module 901 is further configured to obtain, according to a filtering rule vocabulary, identifiers corresponding to all words in a text corresponding to a filtering rule; and generating an identification sequence of the text corresponding to the filtering rule according to the identification, and establishing a preset prefix tree based on the identification sequence and the filtering rule.

Further, the information obtaining module 901 is further configured to arrange the identifiers in a preset arrangement mode, and generate an identification sequence of the text corresponding to the filtering rule according to the arrangement result; and to establish the preset prefix tree by taking the identifiers in the identification sequence as nodes of the preset prefix tree and taking the filtering rule as a leaf node of the preset prefix tree.

Further, the information obtaining module 901 is further configured to store the identifiers in the identifier sequence to nodes of a preset prefix tree correspondingly according to a monotonically increasing arrangement mode; and storing the filtering rules corresponding to the identification sequences into leaf nodes of the corresponding preset prefix trees respectively so as to establish the preset prefix trees based on the nodes and the leaf nodes.

Further, the information obtaining module 901 is further configured to obtain a word identifier corresponding to the word to be filtered; and establishing a filtering rule word list through a multimode matching algorithm based on the word to be filtered and the word identification.

Based on the above embodiments of the apparatus, a third embodiment of the web page filtering apparatus of the present invention is presented.

In this embodiment, the rule matching module 903 is further configured to recursively query the word segmentation identifier sequence in a preset prefix tree; judging whether a filtering rule corresponding to the word segmentation identification sequence exists or not according to the query result; and if so, executing the step of matching the word segmentation identification sequence with each filtering rule in a preset prefix tree.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process based on the disclosure herein, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

The invention discloses A1, a webpage filtering method, which comprises the following steps:

and filtering the webpage to be filtered according to the matching result.

A2, the web page filtering method as described in A1, wherein when receiving a web page to be filtered, the step of performing word segmentation on text content in the web page to be filtered to obtain word segmentation information corresponding to the text content is preceded by the steps of:

A3, the web page filtering method according to A2, the step of generating the identification sequence of the text corresponding to each filtering rule according to the identification, and establishing a preset prefix tree based on the identification sequence and the filtering rule, includes:

arranging the identifications in a preset arrangement mode, and generating an identification sequence of the text corresponding to each filtering rule according to an arrangement result;

A4, the webpage filtering method according to A3, wherein the step of establishing the preset prefix tree by taking the identifiers in the identification sequence as nodes of the preset prefix tree and taking the filtering rule as a leaf node of the preset prefix tree comprises the following steps:

A5, before the step of obtaining the identifiers corresponding to all words in the text corresponding to each filtering rule according to the filtering rule word list, the webpage filtering method according to A2 further comprises:

acquiring word identifiers corresponding to words to be filtered;

A6, the web page filtering method according to A3, wherein the step of obtaining the word segmentation identifications corresponding to the word segmentation information respectively and generating the word segmentation identification sequence of the text content according to the word segmentation identifications comprises the following steps:

A7, before the step of matching the word segmentation identification sequence with each filtering rule in a preset prefix tree, the webpage filtering method according to A1 further comprises:

A8, the web page filtering method as described in A1, wherein the step of matching the word segmentation identification sequence with each filtering rule in a preset prefix tree comprises the following steps:

matching the word segmentation identification sequence with each filtering rule in a preset prefix tree in a preset matching mode, wherein the preset matching mode comprises the following steps: at least one of complete matching, fuzzy matching, prefix matching, suffix matching, inclusion matching and combination matching.

A9, the webpage filtering method according to A1, wherein the step of filtering the webpage to be filtered according to the matching result comprises the following steps:

and acquiring rule identifiers in the word segmentation identifier sequence according to the matching result, and filtering the webpage to be filtered when the rule identifiers are acquired.

A10, the webpage filtering method as described in A3, wherein the preset arrangement mode comprises: an ascending order arrangement mode.

A11, the web page filtering method as described in A1, wherein when a web page to be filtered is received, performing word segmentation on text content in the web page to be filtered to obtain word segmentation information corresponding to the text content, and the step comprises the following steps:

when a webpage to be filtered is received, judging content attributes corresponding to text content in the webpage to be filtered;

judging whether illegal words exist in a text title corresponding to the text content when the content attribute belongs to a target attribute;

if no illegal words exist, word segmentation processing is carried out on the text content so as to obtain word segmentation information corresponding to the text content.

A12, the web page filtering method as described in A11, after the step of judging the content attribute corresponding to the text content in the web page to be filtered when the web page to be filtered is received, further includes:

judging whether illegal words exist in a text title corresponding to the text content when the content attribute does not belong to a target attribute;

and if not, carrying out response processing on the webpage to be filtered.

The invention also discloses B13, a webpage filtering device, the device including: a memory, a processor, and a web page filtering program stored on the memory and executable on the processor, the web page filtering program being configured to implement the steps of the web page filtering method as described above.

The invention also discloses C14, a storage medium, the storage medium storing a web page filtering program, the web page filtering program realizing the steps of the web page filtering method as described above when executed by a processor.

The invention also discloses D15, a webpage filtering device, wherein the webpage filtering device comprises: an information acquisition module, a sequence generation module, a rule matching module and a webpage filtering module;

The webpage filtering device as described in the step D16, wherein the information obtaining module is further configured to obtain, according to the filtering rule word list, identifiers corresponding to all words in the text corresponding to each filtering rule;

The information acquisition module is further used for generating an identification sequence of the text corresponding to each filtering rule according to the identification, and establishing a preset prefix tree based on the identification sequence and the filtering rules.

The webpage filtering device as described in the step D17, wherein the information obtaining module is further configured to arrange the identifiers in a preset arrangement manner, and generate an identifier sequence of a text corresponding to each filtering rule according to an arrangement result;

the information acquisition module is further configured to determine an identifier in the identifier sequence as a node of a preset prefix tree, and determine the filtering rule as a leaf node of the preset prefix tree, so as to establish the preset prefix tree based on the node and the leaf node.

D18, the web page filtering device as described in D17, where the information obtaining module is further configured to store the identifiers in the identifier sequence to nodes of a preset prefix tree correspondingly according to a monotonically increasing arrangement manner;

the information acquisition module is further configured to store the filtering rule corresponding to the identification sequence to leaf nodes of the preset prefix tree corresponding to the filtering rule respectively, so as to establish the preset prefix tree based on the nodes and the leaf nodes.

The webpage filtering device as described in D16, wherein the information obtaining module is further configured to obtain a word identifier corresponding to the word to be filtered;

the information acquisition module is further used for establishing a filtering rule word list through a multimode matching algorithm based on the word to be filtered and the word identification.

D20, the web page filtering device as described in D17, where the sequence generating module is further configured to perform format conversion on a word segmentation string in each word segmentation information, so as to obtain a word segmentation identifier corresponding to each word segmentation information;

the sequence generation module is further configured to arrange the word segmentation identifiers in the preset arrangement manner, and generate a word segmentation identifier sequence of the text content according to an arrangement result.

Claims

1. A web page filtering method, characterized in that the web page filtering method comprises:

And filtering the webpage to be filtered according to the matching result.

2. The method for filtering a web page according to claim 1, wherein, when receiving the web page to be filtered, before the step of performing word segmentation processing on text content in the web page to be filtered to obtain word segmentation information corresponding to the text content, the method further comprises:

3. The web page filtering method as claimed in claim 2, wherein the step of generating the identification sequence of the text corresponding to the filtering rule according to the identification, and establishing a preset prefix tree based on the identification sequence and the filtering rule comprises:

4. The web page filtering method as claimed in claim 3, wherein the step of establishing the preset prefix tree by taking the identifiers in the identification sequence as nodes of the preset prefix tree and taking the filtering rule as a leaf node of the preset prefix tree comprises:

5. The web page filtering method as recited in claim 2, wherein before the step of obtaining the identifiers corresponding to all words in the text corresponding to the filtering rule according to the filtering rule vocabulary, the method further comprises:

acquiring word identifiers corresponding to words to be filtered;

6. The web page filtering method as claimed in claim 3, wherein the step of obtaining word segmentation identifiers corresponding to the word segmentation information respectively, and generating the word segmentation identifier sequence of the text content according to the word segmentation identifiers comprises the steps of:

7. The web page filtering method as recited in claim 1, wherein before the step of matching the word segmentation identification sequence with each filtering rule in a preset prefix tree, the method further comprises:

8. A web page filtering apparatus, the apparatus comprising: a memory, a processor and a web page filter program stored on the memory and executable on the processor, the web page filter being configured to implement the steps of the web page filtering method of any one of claims 1 to 7.

9. A storage medium having stored thereon a web page filter program which when executed by a processor performs the steps of the web page filtering method of any of claims 1 to 7.

10. A web page filtering apparatus, the web page filtering apparatus comprising: the system comprises an information acquisition module, a sequence generation module, a rule matching module and a webpage filtering module;