
CN111680128A - Method and system for detecting web page sensitive words and related devices

Info

Publication number: CN111680128A
Application number: CN202010548352.6A
Authority: CN
Inventors: 徐凯熙; 范渊
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: DBAPPSecurity Co Ltd; Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2020-06-16
Filing date: 2020-06-16
Publication date: 2020-09-18

Abstract

The application provides a method for detecting web page sensitive words, which comprises the following steps: acquiring webpage data and detection requirements; extracting text from the webpage data to obtain text keywords; and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result. The application segments the webpage data effectively and detects the segmented words individually, which avoids the false alarms that occur under rule matching and reduces the false alarm rate. The application also provides a web page sensitive word detection system, a computer-readable storage medium and an electronic device, which have the same beneficial effects.

Description

Method and system for detecting web page sensitive words and related devices

Technical Field

The present application relates to the field of network security, and in particular, to a method, a system, and a device for detecting web page sensitive words.

Background

Web page sensitive words are improperly used words contained in web page content. They may appear because an administrator did not carefully review uploaded content, or because a hacker tampered with the website and added sensitive words to an originally normal web page.

In the prior art, sensitive words are detected by rule matching. Rule matching can mistakenly cut a sensitive word out of otherwise normal webpage content, so the detection result contains false alarms. How to avoid false detection of sensitive words is therefore a technical problem that those skilled in the art urgently need to solve.

Disclosure of Invention

The application aims to provide a method and a system for detecting web page sensitive words, a computer readable storage medium and electronic equipment, which can reduce the false detection rate of the sensitive words.

In order to solve the technical problem, the application provides a method for detecting web page sensitive words, which has the following specific technical scheme:

acquiring webpage data and detection requirements;

extracting texts from the webpage data to obtain text keywords;

and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.

Optionally, before performing sensitive word detection on the text keywords by using the AC automaton based on the detection requirements, the method further includes:

generating an AC automaton based on the detection requirements.

Optionally, generating an AC automaton based on the detection requirement includes:

determining a sensitive phrase according to the detection requirement, and generating a dictionary tree corresponding to the sensitive phrase;

mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording subscripts of the states in the double array;

and generating an AC automaton according to the double array dictionary tree, wherein the subscripts are stored in a fail table of the AC automaton.

Optionally, performing text extraction on the webpage data to obtain text keywords includes:

performing text segmentation on the webpage data to obtain a shortest word set;

and constructing a network through TextRank with the words of the shortest word set as nodes, iteratively calculating a rank value of each node in the network through PageRank, and sorting the rank values to obtain the text keywords.

Optionally, performing text segmentation on the webpage data to obtain a shortest term set includes:

and performing text segmentation on the webpage data by using a lexical analyzer based on a HanLP word segmentation algorithm to obtain a shortest word set.

Optionally, after obtaining the sensitive word detection result, the method further includes:

and filtering a false alarm result in the sensitive word detection result according to the category of the text keyword to obtain an accurate detection result.

The present application further provides a web page sensitive word detection system, including:

the acquisition module is used for acquiring webpage data and detection requirements;

the text extraction module is used for extracting texts from the webpage data to obtain text keywords;

and the detection module is used for performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.

Optionally, the system further includes:

and the AC automaton generating module is used for generating the AC automaton based on the detection requirement.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as set forth above.

The present application further provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method described above when calling the computer program in the memory.

The application provides a method for detecting web page sensitive words, which comprises the following steps: acquiring webpage data and detection requirements; extracting text from the webpage data to obtain text keywords; and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.

The application segments the webpage data effectively and detects the segmented words individually, which avoids the false alarms that occur under rule matching and reduces the false alarm rate. The application also provides a web page sensitive word detection system, a computer-readable storage medium and an electronic device, which have the same beneficial effects and are not described again here.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a method for detecting web page sensitive words according to an embodiment of the present application;

Fig. 2 is a schematic diagram of the goto table of an AC automaton provided in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of an AC automaton provided in an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a web page sensitive word detection system according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for detecting web page sensitive words according to an embodiment of the present application, where the method includes:

S101: acquiring webpage data and detection requirements;

This step acquires the webpage data to be detected and the corresponding detection requirements. Because each website emphasizes different sensitive words, the detection requirements differ accordingly. Suitable sensitive words can be selected according to the detection requirements to form a sensitive word library, from which the corresponding AC automaton can subsequently be generated.

S102: extracting texts from the webpage data to obtain text keywords;

This step extracts text to obtain the text keywords, which avoids the false alarms of rule matching, namely the misreadings and false alarms caused by splitting words apart.

The text extraction method is not specifically limited. This embodiment provides a preferred method: text segmentation is performed on the webpage data to obtain a shortest word set; TextRank then builds a network with the words of the shortest word set as nodes; the rank value of each node in the network is calculated iteratively by PageRank; and the rank values are sorted to obtain the text keywords. Text segmentation comes first: a lexical analyzer segments the crawled webpage content into a logically consistent shortest word set. The segmentation method is not specifically limited either. The webpage data may be segmented by a lexical analyzer based on the HanLP word segmentation algorithm, or by a lexical analyzer obtained from a CoreNLP tool, a final segmentation tool, or the like.

The purpose of text segmentation is to obtain the shortest word set. It should be understood that, in this set, the same character belongs to at most one word.
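To illustrate what a shortest word set looks like, the following minimal sketch segments a string so that every character ends up in exactly one word. It uses forward maximum matching against a toy vocabulary as a stand-in; it is not the HanLP algorithm named in this embodiment, and the function name, the vocabulary and the `max_len` parameter are assumptions made for illustration only.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy segmentation: at each position take the longest vocabulary word.

    Stand-in for a real lexical analyzer. Unknown characters fall back to
    single-character words, so every character lands in exactly one word.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy vocabulary: "web page", "sensitive word", "detection".
vocabulary = {"网页", "敏感词", "检测"}
segments = forward_max_match("网页敏感词检测", vocabulary)
```

Joining the segments reproduces the input exactly, which is the property the shortest word set relies on: no character is shared between two words.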

After the shortest word set is obtained, whether two words are adjacent is judged from their positions in the original webpage text. The words of the shortest word set are taken as nodes, and the rank value of each node is calculated iteratively by PageRank.

TextRank calculates rank values on a voting principle: each word votes for its neighboring nodes, and the weight of a vote depends on the number of votes the word itself holds (that is, its number of neighbors).

The formula for PageRank is as follows:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

wherein S(V_i) represents the rank value of node V_i, In(V_i) represents the set of predecessor nodes of node V_i, Out(V_j) represents the set of successor nodes of node V_j, and d is a damping coefficient used for smoothing.

TextRank considers the N words before a word and the N words after it to be adjacent to that word in the graph. In the concrete implementation, a sliding window of length N is set, and all words inside the sliding window are regarded as adjacent nodes of the word node.
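The sliding-window adjacency described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_cooccurrence_graph`, the undirected edges and the use of co-occurrence counts as edge weights are choices made here, not details given in this embodiment.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=2):
    """Link each word to the words within `window` positions after it.

    Edges are undirected, so every word ends up adjacent to up to `window`
    words before it and `window` words after it. The edge weight counts
    how often the pair co-occurs inside a window.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            other = words[j]
            if other == word:
                continue  # skip self-loops
            graph[word][other] += 1
            graph[other][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


words = ["web", "page", "sensitive", "word", "detection", "web", "page"]
graph = build_cooccurrence_graph(words, window=2)
```

With this input, "web" and "page" co-occur twice, so the edge between them carries weight 2, while every other pair that shares a window carries weight 1.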

The iterative calculation formula of TextRank is as follows:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

wherein WS(V_i) represents the score of word node V_i; w_ji represents the weight of the edge from word node V_j to word node V_i; WS(V_j) on the right-hand side is the score of word node V_j from the previous iteration; In(V_i) is the set of nodes pointing to word node V_i; Out(V_j) is the set of nodes that V_j points to; and d is the damping coefficient.

The purpose of this step is to obtain the text keywords: after the rank values are sorted, a preset number of top-ranked words from the shortest word set are taken as the text keywords.
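The iteration and the final top-k selection can be sketched as below. This is an illustrative sketch only: the hand-built graph, the fixed iteration count, the damping value 0.85 and the names `textrank` and `top_keywords` are assumptions, and the graph is taken to be undirected with symmetric weights, so that In(V_i) and Out(V_i) are both the neighbor set.

```python
def textrank(graph, d=0.85, iterations=50):
    """Iterate WS(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j)."""
    scores = {node: 1.0 for node in graph}
    # Total edge weight leaving each node (the denominator in the formula).
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            incoming = 0.0
            for nbr in graph[node]:
                # graph[nbr][node] is w_ji, the weight of the edge nbr -> node.
                incoming += graph[nbr][node] / out_weight[nbr] * scores[nbr]
            new_scores[node] = (1 - d) + d * incoming
        scores = new_scores
    return scores


def top_keywords(graph, k=2):
    """Sort nodes by score and keep the first k as text keywords."""
    scores = textrank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical word graph: "security" is adjacent to every other word.
graph = {
    "security": {"web": 1, "page": 1, "scan": 1},
    "web": {"security": 1, "page": 1},
    "page": {"security": 1, "web": 1},
    "scan": {"security": 1},
}
scores = textrank(graph)
keywords = top_keywords(graph, k=2)
```

In this toy graph the best-connected word, "security", receives the highest score, which matches the voting intuition: a word with more neighbors collects more votes.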

S103: performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.

After the text keywords are obtained in step S102, sensitive word detection is performed by using the AC automaton corresponding to the detection requirements. It is readily understood that this embodiment assumes an AC automaton has been generated based on the detection requirements before this step is performed. The method for generating the AC automaton is not limited, and this embodiment provides a preferred method:

S1031: determining a sensitive phrase according to the detection requirements, and generating a dictionary tree corresponding to the sensitive phrase;

S1032: mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording the subscripts of the states in the double array;

S1033: generating an AC automaton according to the double array dictionary tree, wherein the subscripts are stored in a fail table of the AC automaton.

In step S1031, the sensitive phrase may be determined from the sensitive word library according to the detection requirements, and the corresponding dictionary tree is generated from the sensitive phrase. Because a plain dictionary tree retrieves sensitive words inefficiently, the method generates a double array dictionary tree to improve retrieval efficiency. A person skilled in the art may also retrieve sensitive words without a double array dictionary tree, or with another tree structure derived from the dictionary tree, and this also falls within the scope of the present application. The double array dictionary tree reduces the space used by the original dictionary tree and shortens query time, because a state transition needs only a simple addition.

In step S1032, two integer arrays (the double array) are used: a base array and a check array. Each element of the base array represents a tree node, for which the Unicode code of the current node's character is recorded, and the check array records the predecessor state of each state. In the initial state, both base and check are set to 0. During construction, a state whose base and check values are both 0 is idle. If a state corresponds to a complete word, its base value is set to a negative number; if the state corresponds to a complete word that is not a prefix of any other word, its base value is set to the negative of the state's position.

The relationship of base and check satisfies the following condition:

base[s] + c = t

check[t] = s

where s is the subscript of the current state, t is the subscript of the state reached by the transition, and c is the numeric value of the input character. In this embodiment, the Unicode code of the character is used.
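The base/check relationship can be tried out with the following minimal sketch. It is a simplification under stated assumptions: the root is placed at subscript 1 so that the value 0 can mean "idle", word endings are tracked in a separate `terminal` set instead of the negative base values described above, and the first-fit search for a free base value is chosen for brevity rather than efficiency. All names are hypothetical.

```python
from collections import deque

ROOT = 1  # subscript 0 is left unused so that 0 can mean "idle"


def build_double_array(words):
    """Pack a trie into base/check so that base[s] + c == t and check[t] == s."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    base, check, terminal = [0, 0], [0, 0], set()
    queue = deque([(ROOT, trie)])
    while queue:
        s, node = queue.popleft()
        if "" in node:
            terminal.add(s)
        codes = sorted(ord(ch) for ch in node if ch)
        if not codes:
            continue
        b = 1  # first-fit: smallest base whose child slots are all idle
        while any(b + c < len(check) and check[b + c] != 0 for c in codes):
            b += 1
        base[s] = b
        grow = b + codes[-1] + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([0] * grow)
        for c in codes:
            t = b + c          # base[s] + c = t
            check[t] = s       # check[t] = s
            queue.append((t, node[chr(c)]))
    return base, check, terminal


def contains(base, check, terminal, word):
    """Follow one addition and one comparison per character."""
    s = ROOT
    for ch in word:
        t = base[s] + ord(ch)
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return s in terminal


base, check, terminal = build_double_array(["he", "she", "his", "hers"])
found = contains(base, check, terminal, "hers")
```

Each lookup step costs one addition and one array comparison, which is the "simple mathematical addition" that makes the double array faster to query than a pointer-based dictionary tree.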

On the basis of the double-array Trie constructed in step S1032, when each sensitive word is mapped to a double array, the subscripts of the sensitive word in the double array are recorded at the same time for recording in the fail table of the AC automaton.

The generation principle of the AC automaton is explained below using the sample word set {he, she, his, hers}:

The AC automaton is composed of a goto table, a fail table and an output table.

The goto table, also called the success table, is a prefix tree, as is the double-array Trie. Parsing the sample into the goto table gives the result shown in fig. 2, which is a schematic diagram of the goto table of the AC automaton provided in the embodiment of the present application. States 2, 5, 7 and 9 are success states, that is, each of them corresponds to a word in the sample.

The output table records the corresponding relationship between the state and the words in the sample, and the output table corresponding to the goto table is shown in the following table 1:

Table 1: Correspondence between goto table states and the output table

State number	Output
2	he
5	he, she
7	his
9	hers

The elements in the output table are of two types: one is the word in the sample that corresponds to the path from the initial state to the current state, as in state 2; the other is a word in the sample that corresponds to a suffix of that path, as in state 5.
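As an illustration, the goto table for the sample can be built by inserting the words in the order he, she, his, hers and numbering new states as they are created, which reproduces the success states 2, 5, 7 and 9. This is a minimal sketch with hypothetical names; it records only the first type of output (the word spelled by the path itself), so state 5 holds only "she" here, and "he" is added to it later when the fail table is built.

```python
def build_goto(words):
    """Build the goto table as a list of dicts; state 0 is the initial state."""
    goto = [{}]    # goto[state][char] -> next state
    output = {}    # state -> words spelled by the path from state 0
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})                  # allocate the next state number
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output.setdefault(state, []).append(word)
    return goto, output


goto, output = build_goto(["he", "she", "his", "hers"])
success_states = sorted(output)
```

The resulting table has ten states (0 to 9), with `h` and `s` leaving the initial state.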

The fail table stores a one-to-one mapping between states: for each state, it records the best state to fall back to when a state transition fails. The best state is the one that retains the longest suffix of the string matched so far. Take state 5, which corresponds to the word she: if goto fails after the next letter is read, he is the longest suffix of she that can still be matched, and it corresponds to state 2. State 2 is therefore the best fallback for state 5. After failing over to state 2, the AC automaton still remembers the prefix he and is ready to accept the letter r.

Building the fail table: denote the current state as S.

After the character c is accepted, S moves to the state given by the goto table. If the transition fails, a fallback point is found according to the longest suffix principle, and this fallback point is the entry recorded in the fail table.

(1) In the initial state, the goto table is full and never fails, so state 0 has no fail pointer, and all the states connected to the initial state have their fail pointers pointing to the initial state.

(2) Perform a breadth-first traversal from the initial state. If the current state S accepts the character c and reaches the state T, trace back along the fail pointers of S until the first predecessor state F is found such that F.goto(c) != null, then point the fail pointer of T to F.goto(c). Taking state 3 as an example: it accepts the character h and reaches state 4; tracing back from state 3 leads to the initial state 0, and 0.goto(h) is state 1, so the fail pointer of state 4 points to state 1.

(3) Since the path of F is a suffix of the path of T, that is, T must contain F, the output of T should also contain the output of F, so T.output += F.output is applied.

Following the above steps, the final AC automaton of the sample is shown in fig. 3, which is a schematic structural diagram of the AC automaton provided in the embodiment of the present application. The dashed arrows represent fail pointers and the solid arrows represent goto pointers.
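Steps (1) to (3) can be sketched as follows for the sample {he, she, his, hers}. This is a minimal sketch, not the double-array-backed implementation of this embodiment: the goto table is a plain list of dicts, and the names `build_automaton` and `search` are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Build goto, fail and output tables for an Aho-Corasick automaton."""
    goto, output = [{}], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    fail = [0] * len(goto)
    # Step (1): states directly connected to the initial state fail to state 0.
    queue = deque(goto[0].values())
    # Step (2): breadth-first traversal from the initial state.
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f != 0 and ch not in goto[f]:
                f = fail[f]                      # trace back along fail pointers
            fail[t] = goto[f].get(ch, 0)
            # Step (3): T.output += F.output
            output[t] = output[t] + output[fail[t]]
            queue.append(t)
    return goto, fail, output


def search(text, goto, fail, output):
    """Return (start_index, word) for every sample word found in text."""
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state != 0 and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits


goto, fail, output = build_automaton(["he", "she", "his", "hers"])
hits = search("ushers", goto, fail, output)
```

For the sample, the fail pointer of state 4 is state 1 and that of state 5 is state 2, as described above, and state 5 outputs both she and he. In the detection step, `search` would be called once per text keyword rather than on the raw page text.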

The word segmentation is firstly carried out on the webpage data by utilizing the word segmentation algorithm, and the words after word segmentation are respectively detected, so that the situation of false alarm when rules are matched is avoided, and the rate of false alarm is reduced.

On the basis of the above embodiment, as a preferred embodiment, after the sensitive word detection result is obtained, the false positive result in the sensitive word detection result may be filtered according to the category to which the text keyword belongs, so as to obtain an accurate detection result.

After the text keywords are determined, in order to prevent word segmentation failure or wrong word segmentation, false alarm filtering is performed on the sensitive word detection result, and false alarm is further avoided. In addition, other ways of reducing false alarms are possible, such as recording the position of the searched sensitive word in the web page, intercepting and recording the content of the web page, performing further review by a human, and the like. In the embodiment, the TextRank algorithm is used for extracting the key words of the webpage content, and the detection result with low sensitivity in the same category is filtered according to the category to which the key words belong, so that the false alarm rate is reduced.

In the following, a web page sensitive word detection system provided in an embodiment of the present application is introduced, and the web page sensitive word detection system described below and the web page sensitive word detection method described above may be referred to in a corresponding manner.

an obtaining module 100, configured to obtain webpage data and a detection requirement;

the text extraction module 200 is configured to perform text extraction on the webpage data to obtain text keywords;

and the detection module 300 is configured to perform sensitive word detection on the text keyword by using an AC automaton based on the detection requirement, so as to obtain a sensitive word detection result.

Based on the above embodiment, as a preferred embodiment, the system may further include:

an AC automaton generation module, configured to generate the AC automaton based on the detection requirement.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, may implement the steps provided by the above-described embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and when the processor calls the computer program in the memory, the steps provided in the foregoing embodiments may be implemented. Of course, the electronic device may also include various network interfaces, power supplies, and the like.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system provided by the embodiment, since the system corresponds to the method provided by the embodiment, the description is simple, and the relevant points can be referred to the description of the method part.

The principle and embodiments of the present application are explained herein by using specific examples, and the above descriptions of the embodiments are only used to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A method for detecting web page sensitive words is characterized by comprising the following steps:

acquiring webpage data and detection requirements;

extracting texts from the webpage data to obtain text keywords;

and performing sensitive word detection on the text keywords by using an AC automaton based on the detection requirements to obtain a sensitive word detection result.

2. The method for detecting web page sensitive words according to claim 1, wherein before the detecting the sensitive words of the text keywords by using an AC automaton based on the detection requirement, the method further comprises:

generating an AC automaton based on the detection requirements.

3. The method for detecting web page sensitive words according to claim 2, wherein generating an AC automaton based on the detection requirement comprises:

4. The method for detecting the web page sensitive words according to claim 1, wherein the step of extracting the text of the web page data to obtain the text keywords comprises the steps of:

performing text segmentation on the webpage data to obtain a shortest word set;

and constructing a network through TextRank with the words of the shortest word set as nodes, iteratively calculating a rank value of each node in the network through PageRank, and sorting the rank values to obtain the text keywords.

5. The method for detecting the web page sensitive words according to claim 4, wherein the step of performing text segmentation on the web page data to obtain a shortest word set comprises the following steps:

6. The method for detecting the web page sensitive word according to claim 1, after obtaining the sensitive word detection result, further comprising:

7. A web page sensitive word detection system, comprising:

and the detection module is used for detecting the sensitive words of the text keywords by utilizing the AC automaton based on the detection requirements to obtain a sensitive word detection result.

8. The web page sensitive word detection system of claim 7, further comprising:

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.

10. An electronic device, comprising a memory in which a computer program is stored and a processor which, when calling the computer program in the memory, implements the steps of the method according to any one of claims 1-6.