CN111680128A - Method and system for detecting web page sensitive words and related devices - Google Patents
- Publication number
- CN111680128A (application CN202010548352.6A)
- Authority
- CN
- China
- Prior art keywords
- sensitive
- web page
- detection
- text
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a method for detecting web page sensitive words, comprising: acquiring webpage data and detection requirements; performing text extraction on the webpage data to obtain text keywords; and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result. The method segments the webpage data into words and detects the segmented words separately, which avoids the false alarms that occur with rule matching and reduces the false alarm rate. The application also provides a web page sensitive word detection system, a computer-readable storage medium and an electronic device, which have the same beneficial effects.
Description
Technical Field
The present application relates to the field of network security, and in particular, to a method, a system, and a device for detecting web page sensitive words.
Background
Web page sensitive words are improperly used words contained in web page content. They may appear because an administrator did not carefully review uploaded content, or because the website was tampered with by a hacker who added sensitive words to an originally normal web page.
In the prior art, sensitive words are detected by rule matching. Rule matching can wrongly cut a sensitive word out of originally normal webpage content, so the detection result contains false alarms. How to avoid the false detection of sensitive words is therefore a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The application aims to provide a method and a system for detecting web page sensitive words, a computer readable storage medium and electronic equipment, which can reduce the false detection rate of the sensitive words.
In order to solve the technical problem, the application provides a method for detecting web page sensitive words, which has the following specific technical scheme:
acquiring webpage data and detection requirements;
extracting texts from the webpage data to obtain text keywords;
and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
Optionally, before the detecting the sensitive word of the text keyword by using the AC automaton based on the detection requirement, the method further includes:
generating an AC automaton based on the detection requirements.
Optionally, generating an AC automaton based on the detection requirement includes:
determining a sensitive phrase according to the detection requirement, and generating a dictionary tree corresponding to the sensitive phrase;
mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording subscripts of the states in the double array;
and generating an AC automaton from the double array dictionary tree, wherein the subscripts are stored in a fail table of the AC automaton.
Optionally, performing text extraction on the webpage data to obtain text keywords includes:
performing text segmentation on the webpage data to obtain a shortest word set;
and constructing a network with TextRank by taking the words of the shortest word set as nodes, iteratively calculating a rank value of each node in the network with PageRank, and sorting the rank values to obtain the text keywords.
Optionally, performing text segmentation on the webpage data to obtain a shortest term set includes:
and performing text segmentation on the webpage data by using a lexical analyzer based on a HanLP word segmentation algorithm to obtain a shortest word set.
Optionally, after obtaining the sensitive word detection result, the method further includes:
and filtering a false alarm result in the sensitive word detection result according to the category of the text keyword to obtain an accurate detection result.
The present application further provides a web page sensitive word detection system, including:
the acquisition module is used for acquiring webpage data and detection requirements;
the text extraction module is used for extracting texts from the webpage data to obtain text keywords;
and the detection module is used for detecting the sensitive words of the text keywords by utilizing an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
Optionally, the method further includes:
and the AC automaton generating module is used for generating the AC automaton based on the detection requirement.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as set forth above.
The present application further provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method described above when calling the computer program in the memory.
The application provides a method for detecting web page sensitive words, comprising: acquiring webpage data and detection requirements; performing text extraction on the webpage data to obtain text keywords; and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
The method segments the webpage data into words and detects the segmented words separately, which avoids the false alarms that occur with rule matching and reduces the false alarm rate. The application also provides a web page sensitive word detection system, a computer-readable storage medium and an electronic device, which have the same beneficial effects and are not described again here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for detecting web page sensitive words according to an embodiment of the present application;
fig. 2 is a schematic diagram of the goto table of an AC automaton provided in the embodiments of the present application;
fig. 3 is a schematic structural diagram of an AC automaton provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of a web page sensitive word detection system according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for detecting web page sensitive words according to an embodiment of the present application, where the method includes:
s101: acquiring webpage data and detection requirements;
This step acquires the webpage data to be detected and the corresponding detection requirements. Because each website emphasizes different sensitive words, the detection requirements differ accordingly. Suitable sensitive words can be selected according to the detection requirements to form a sensitive word library, from which the corresponding AC automaton is subsequently generated.
S102: extracting texts from the webpage data to obtain text keywords;
This step extracts the text keywords so that detection operates on whole words, avoiding the false alarms that rule matching produces when it splits words incorrectly.
The text extraction method is not specifically limited. This embodiment provides a preferred method: perform text segmentation on the webpage data to obtain a shortest word set; use TextRank to construct a network with the words of the shortest word set as nodes; iteratively calculate the rank value of each node in the network with PageRank; and sort the rank values to obtain the text keywords. Text segmentation comes first: a lexical analyzer segments the crawled webpage content into a logically consistent shortest word set. The segmentation method is likewise not limited; it may use a lexical analyzer based on the HanLP word segmentation algorithm, a lexical analyzer obtained from the CoreNLP tool, or other word segmentation tools.
The purpose of text segmentation is to obtain the shortest word set; it should be understood that each character of the text belongs to at most one word in the set.
After the shortest word set is obtained, whether two words are adjacent is judged from their positions in the original webpage text. The words of the shortest word set are regarded as nodes, and the rank value of each node is calculated iteratively with PageRank.
TextRank calculates rank values on a voting principle: each word votes for its neighboring nodes, and the weight of a vote depends on the number of votes the voting word itself holds (namely, its number of neighbors).
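The formula for PageRank is as follows:
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
where $S(V_i)$ represents the rank value of node $V_i$, $In(V_i)$ represents the set of predecessor nodes of $V_i$, $Out(V_j)$ represents the set of successor nodes of $V_j$, and $d$ is a damping coefficient used for smoothing.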
TextRank considers the N words before a word and the N words after it to be adjacent to that word in the graph. In practice, a sliding window of length N is set, and all words inside the window are regarded as adjacent nodes of the word node.
The iterative calculation formula of TextRank is as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$
where $WS(V_i)$ represents the score of word node $V_i$; $w_{ji}$ represents the weight of the edge from word node $V_j$ to word node $V_i$; $WS(V_j)$ on the right-hand side is the score of word node $V_j$ from the previous iteration; $In(V_i)$ is the set of nodes pointing to word node $V_i$; $Out(V_j)$ is the set of nodes that $V_j$ points to; and $d$ is the damping coefficient.
The purpose of this step is to obtain the text keywords: after the rank values are sorted, a preset number of top-ranked words from the shortest word set are taken as the text keywords.
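The ranking described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the embodiment: the function name `textrank_keywords`, the default window size, the damping value 0.85, the fixed iteration count and the unweighted (PageRank-style) edges are all choices made here for brevity, and the input is assumed to be an already segmented word list.

```python
def textrank_keywords(words, window=2, d=0.85, iterations=50, top_k=3):
    """Rank segmented words with an unweighted TextRank; return (top words, scores)."""
    # Build an undirected co-occurrence graph: two different words are
    # neighbours when they appear within `window` positions of each other.
    neighbours = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbours[w].add(words[j])
                neighbours[words[j]].add(w)

    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbours:
            # Every neighbour v splits its previous score equally among its own neighbours.
            incoming = sum(scores[v] / len(neighbours[v]) for v in neighbours[w])
            new_scores[w] = (1 - d) + d * incoming
        scores = new_scores

    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k], scores


top, scores = textrank_keywords(["a", "b", "a", "c", "a", "d"], window=1, top_k=1)
print(top)
```

In this toy input the word "a" is adjacent to every other word, so it collects the most votes and is returned as the single keyword.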
S103: performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
After the text keywords are obtained in step S102, sensitive word detection is performed with the AC automaton corresponding to the detection requirements. This embodiment assumes that the AC automaton has been generated from the detection requirements before this step is performed. The method for generating the AC automaton is not limited; this embodiment provides a preferred method:
s1031: determining a sensitive phrase according to the detection requirement, and generating a dictionary tree corresponding to the sensitive phrase;
s1032: mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording subscripts of the states in the double array;
s1033: generating an AC automaton from the double array dictionary tree, wherein the subscripts are stored in a fail table of the AC automaton.
In step S1031, the sensitive phrases may be selected from the sensitive word library according to the detection requirements, and the corresponding dictionary tree (trie) is generated from them. Because a plain dictionary tree retrieves sensitive words inefficiently, this embodiment improves retrieval efficiency by generating a double-array dictionary tree. A person skilled in the art may also omit the double-array dictionary tree, or use another tree structure derived from the dictionary tree, to retrieve sensitive words; these variants also fall within the scope of the present application. The double-array dictionary tree reduces the space used by the original dictionary tree and shortens query time to a simple addition, which improves the efficiency of sensitive word retrieval.
In step S1032, two integer arrays (the double array) are used: a base array and a check array. Each element of the base array represents a tree node, for which the Unicode code of the current node's character is recorded, and each element of the check array records the predecessor state of a state. Both base and check are set to 0 in the initial state. During construction, a state whose base and check values are both 0 is idle. If a state is a complete word, its base value is set to a negative number; if a state is a complete word and that word is not a prefix of any other word, its base value is the negative of the state's position.
The relationship of base and check satisfies the following condition:
base[s]+c=t
check[t]=s
where s is the subscript of the current state, t is the subscript of the state after the transition, and c is the numeric value of the input character; in this embodiment, the Unicode code of the character is used.
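The two relations base[s] + c = t and check[t] = s can be illustrated with a small Python sketch. It is a simplified sketch under assumptions made here, not the construction of the embodiment: the root is placed at subscript 1 so that check = 0 can mean "free", complete words are tracked in a separate set instead of negative base values, the free base is found by a naive linear scan, and the names `build_double_array` and `contains` are hypothetical.

```python
from collections import deque


def build_double_array(words):
    """Build (base, check, is_word) arrays for a list of words."""
    # Step 1: an ordinary trie; children[n] maps a character to a child node id.
    children, terminal = [{}], set()
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                children.append({})
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        terminal.add(node)

    ROOT = 1                          # assumption of this sketch: root at subscript 1
    base, check = [0, 0], [0, 0]      # subscript 0 unused; check[t] == 0 means slot t is free
    check[ROOT] = ROOT                # mark the root slot as occupied
    slot = {0: ROOT}                  # trie node id -> subscript in the double array
    is_word = set()

    # Step 2: place the children of every trie node, breadth first.
    queue = deque([0])
    while queue:
        node = queue.popleft()
        codes = [ord(ch) for ch in children[node]]
        if not codes:
            continue
        b = 1
        while True:                   # naive scan for a base where all child slots are free
            while len(check) <= b + max(codes):
                base.append(0)
                check.append(0)
            if all(check[b + c] == 0 for c in codes):
                break
            b += 1
        s = slot[node]
        base[s] = b
        for ch, child in children[node].items():
            t = b + ord(ch)           # base[s] + c = t
            check[t] = s              # check[t] = s
            slot[child] = t
            if child in terminal:
                is_word.add(t)
            queue.append(child)
    return base, check, is_word


def contains(base, check, is_word, word):
    """Exact lookup: one addition and one comparison per character."""
    s = 1
    for ch in word:
        t = base[s] + ord(ch)
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return s in is_word


base, check, is_word = build_double_array(["he", "she", "his", "hers"])
print(contains(base, check, is_word, "hers"), contains(base, check, is_word, "her"))
```

Each transition costs one addition and one array comparison, which is the "simple mathematical addition operation" the description refers to.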
While the double-array Trie is constructed in step S1032 and each sensitive word is mapped into the double array, the subscripts of the sensitive word in the double array are recorded at the same time, for later use in the fail table of the AC automaton.
The generation principle of the AC automaton is explained below using the sample set {he, she, his, hers}:
the AC automaton is composed of a goto table, a fail table and an output table.
The goto table, also called the success table, is a prefix tree, like the double-array Trie. Parsing the sample into a goto table gives fig. 2, a schematic diagram of the goto table of the AC automaton provided in the embodiment of the present application, in which states 2, 5, 7 and 9 are success states, that is, each corresponds to a word in the sample.
The output table records the corresponding relationship between the state and the words in the sample, and the output table corresponding to the goto table is shown in the following table 1:
Table 1. Correspondence between goto table states and the output table
State number | Output table |
---|---|
2 | he |
5 | he, she |
7 | his |
9 | hers |
The elements in the output table are of two types. One is the word in the sample corresponding to the path from the initial state to the current state, such as he in state 2; the other is a word in the sample corresponding to a suffix of that path, such as he in state 5.
The fail table stores a one-to-one mapping between states: for each state, the best state to fall back to after a state transition fails. The best state is the one that remembers the longest suffix of the string matched so far. Take state 5, which corresponds to the word she: if goto fails after another letter is read, he is the longest suffix of she that can still be matched, and it corresponds to state 2. State 2 is therefore the best choice when state 5 fails; after failing over to state 2, the AC automaton remembers the prefix he and is ready to accept the letter r.
Building the fail table: let the current state be S and the input character be c. If S.goto(c) succeeds, the automaton moves to the state given by the goto table; if it fails, a fallback point is found according to the longest-suffix principle, and this fallback point is the entry recorded in the fail table.
(1) The goto table of the initial state is complete and never fails, so state 0 has no fail pointer, and the fail pointers of all states directly connected to the initial state point to the initial state.
(2) Perform a breadth-first traversal from the initial state. If the current state S accepts character c and reaches state T, trace back along the fail pointers of S until the first state F is found such that F.goto(c) is not null, then point the fail pointer of T to F.goto(c). Taking state 3 as an example: it accepts character h and reaches state 4; tracing back from state 3 leads to the initial state 0, and 0.goto(h) is state 1, so the fail pointer of state 4 points to state 1.
(3) Let F now denote the state that the fail pointer of T points to. Since the path of F is a suffix of the path of T, that is, T necessarily contains F, the output of T should also contain the output of F, so T.output += F.output is applied.
Following the above steps, the final AC automaton for the sample is shown in fig. 3, a schematic structural diagram of the AC automaton provided in the embodiment of the present application, in which dashed arrows represent fail pointers and solid arrows represent goto pointers.
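The construction of the goto, fail and output tables can be sketched in Python for the same sample {he, she, his, hers}. This is a minimal sketch under assumptions made here: the goto table is held in plain dictionaries rather than in the double-array Trie of the embodiment, and the names `build_ac` and `search` are hypothetical. Inserting the words in the order he, she, his, hers happens to reproduce the state numbering used in Table 1.

```python
from collections import deque


def build_ac(words):
    """Return (goto, fail, output) tables of an Aho-Corasick automaton."""
    goto, fail, output = [{}], [0], [set()]
    # goto table: a prefix tree, states numbered in order of creation.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    # fail table: breadth-first traversal; states at depth 1 fail to state 0.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            # Follow fail pointers of S until a state with a goto on ch is found.
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # The fail target's path is a suffix of T's path, so inherit its output.
            output[t] |= output[fail[t]]
            queue.append(t)
    return goto, fail, output


def search(text, goto, fail, output):
    """Return (start index, word) for every sample word found in text."""
    hits, state = [], 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in sorted(output[state]):
            hits.append((i - len(word) + 1, word))
    return hits


goto, fail, output = build_ac(["he", "she", "his", "hers"])
print(fail[4], fail[5], sorted(output[5]))
print(search("ushers", goto, fail, output))
```

With this numbering, the fail pointer of state 4 is state 1 and that of state 5 is state 2, as in the description, and state 5 outputs both he and she.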
In this embodiment, the webpage data is first segmented by a word segmentation algorithm and the segmented words are then detected separately, which avoids the false alarms that occur with rule matching and reduces the false alarm rate.
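The effect of detecting segmented words separately can be illustrated with a toy Python example. It rests on assumptions made here for illustration only: English words concatenated without spaces stand in for text that has no word delimiters, Python's `in` operator stands in for the AC automaton, and the function names and the one-word sensitive list are hypothetical.

```python
def naive_match(text, sensitive_words):
    """Substring search over the raw, unsegmented text."""
    return [w for w in sensitive_words if w in text]


def per_word_match(words, sensitive_words):
    """Search each segmented word separately, so a match cannot span two words."""
    return [(word, w) for word in words for w in sensitive_words if w in word]


segmented = ["teacher", "says", "hello"]   # assumed output of the word segmenter
raw_text = "".join(segmented)              # "teachersayshello"
sensitive = ["hers"]

print(naive_match(raw_text, sensitive))      # matches across "teacher" + "says"
print(per_word_match(segmented, sensitive))  # no match inside any single word
```

The raw text contains "hers" only across the boundary between "teacher" and "says", so the unsegmented search reports a hit that the per-word search does not.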
On the basis of the above embodiment, as a preferred embodiment, after the sensitive word detection result is obtained, the false positive result in the sensitive word detection result may be filtered according to the category to which the text keyword belongs, so as to obtain an accurate detection result.
After the text keywords are determined, false alarm filtering is performed on the sensitive word detection result to guard against failed or incorrect word segmentation and further avoid false alarms. Other ways of reducing false alarms are also possible, such as recording the position of each detected sensitive word in the web page, capturing and recording the web page content, and having a human perform further review. In this embodiment, the TextRank algorithm extracts the keywords of the webpage content, and detection results of low sensitivity within the same category are filtered out according to the category to which the keywords belong, thereby reducing the false alarm rate.
In the following, a web page sensitive word detection system provided in an embodiment of the present application is introduced, and the web page sensitive word detection system described below and the web page sensitive word detection method described above may be referred to in a corresponding manner.
The present application further provides a web page sensitive word detection system, including:
an obtaining module 100, configured to obtain webpage data and a detection requirement;
the text extraction module 200 is configured to perform text extraction on the webpage data to obtain text keywords;
and the detection module 300 is configured to perform sensitive word detection on the text keyword by using an AC automaton based on the detection requirement, so as to obtain a sensitive word detection result.
Based on the above embodiment, as a preferred embodiment, the system may further include:
and the AC automaton generating module is used for generating the AC automaton based on the detection requirement.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, may implement the steps provided by the above-described embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and when the processor calls the computer program in the memory, the steps provided in the foregoing embodiments may be implemented. Of course, the electronic device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system provided by the embodiment, since the system corresponds to the method provided by the embodiment, the description is simple, and the relevant points can be referred to the description of the method part.
The principle and embodiments of the present application are explained herein by using specific examples, and the above descriptions of the embodiments are only used to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Claims (10)
1. A method for detecting web page sensitive words is characterized by comprising the following steps:
acquiring webpage data and detection requirements;
extracting texts from the webpage data to obtain text keywords;
and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
2. The method for detecting web page sensitive words according to claim 1, wherein before the detecting the sensitive words of the text keywords by using an AC automaton based on the detection requirement, the method further comprises:
generating an AC automaton based on the detection requirements.
3. The method for detecting web page sensitive words according to claim 2, wherein generating an AC automaton based on the detection requirement comprises:
determining a sensitive phrase according to the detection requirement, and generating a dictionary tree corresponding to the sensitive phrase;
mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording subscripts of the states in the double array;
and generating an AC automaton from the double array dictionary tree, wherein the subscripts are stored in a fail table of the AC automaton.
4. The method for detecting the web page sensitive words according to claim 1, wherein the step of extracting the text of the web page data to obtain the text keywords comprises the steps of:
performing text segmentation on the webpage data to obtain a shortest word set;
and constructing a network with TextRank by taking the words of the shortest word set as nodes, iteratively calculating a rank value of each node in the network with PageRank, and sorting the rank values to obtain the text keywords.
5. The method for detecting the web page sensitive words according to claim 4, wherein the step of performing text segmentation on the web page data to obtain a shortest word set comprises the following steps:
and performing text segmentation on the webpage data by using a lexical analyzer based on a HanLP word segmentation algorithm to obtain a shortest word set.
6. The method for detecting the web page sensitive word according to claim 1, after obtaining the sensitive word detection result, further comprising:
and filtering a false alarm result in the sensitive word detection result according to the category of the text keyword to obtain an accurate detection result.
7. A web page sensitive word detection system, comprising:
the acquisition module is used for acquiring webpage data and detection requirements;
the text extraction module is used for extracting texts from the webpage data to obtain text keywords;
and the detection module is used for detecting the sensitive words of the text keywords by utilizing the AC automaton based on the detection requirements to obtain a sensitive word detection result.
8. The web page sensitive word detection system of claim 7, further comprising:
and the AC automaton generating module is used for generating the AC automaton based on the detection requirement.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. An electronic device, comprising a memory in which a computer program is stored and a processor which, when calling the computer program in the memory, implements the steps of the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010548352.6A CN111680128A (en) | 2020-06-16 | 2020-06-16 | Method and system for detecting web page sensitive words and related devices |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010548352.6A CN111680128A (en) | 2020-06-16 | 2020-06-16 | Method and system for detecting web page sensitive words and related devices |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111680128A (en) | 2020-09-18 |
Family
ID=72455192
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010548352.6A (Pending), CN111680128A (en) | 2020-06-16 | 2020-06-16 | Method and system for detecting web page sensitive words and related devices |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680128A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112100361A (en) * | 2020-11-12 | 2020-12-18 | 南京中孚信息技术有限公司 | Character string multimode fuzzy matching method based on AC automaton |
CN112150251A (en) * | 2020-10-09 | 2020-12-29 | 北京明朝万达科技股份有限公司 | Article name management method and device |
CN114266247A (en) * | 2021-12-20 | 2022-04-01 | 中国农业银行股份有限公司 | Sensitive word filtering method and device, storage medium and electronic equipment |
CN116502009A (en) * | 2023-06-25 | 2023-07-28 | 北京奇虎科技有限公司 | Webpage filtering method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255118A (en) * | 2017-07-11 | 2019-01-22 | 普天信息技术有限公司 | A kind of keyword extracting method and device |
CN109684469A (en) * | 2018-12-13 | 2019-04-26 | 平安科技(深圳)有限公司 | Filtering sensitive words method, apparatus, computer equipment and storage medium |
CN109902290A (en) * | 2019-01-23 | 2019-06-18 | 广州杰赛科技股份有限公司 | A kind of term extraction method, system and equipment based on text information |
CN109918548A (en) * | 2019-04-08 | 2019-06-21 | 上海凡响网络科技有限公司 | A kind of methods and applications of automatic detection document sensitive information |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112150251A (en) * | 2020-10-09 | 2020-12-29 | 北京明朝万达科技股份有限公司 | Article name management method and device |
CN112100361A (en) * | 2020-11-12 | 2020-12-18 | 南京中孚信息技术有限公司 | Character string multimode fuzzy matching method based on AC automaton |
CN112100361B (en) * | 2020-11-12 | 2021-02-26 | 南京中孚信息技术有限公司 | Character string multimode fuzzy matching method based on AC automaton |
CN114266247A (en) * | 2021-12-20 | 2022-04-01 | 中国农业银行股份有限公司 | Sensitive word filtering method and device, storage medium and electronic equipment |
CN116502009A (en) * | 2023-06-25 | 2023-07-28 | 北京奇虎科技有限公司 | Webpage filtering method, device, equipment and storage medium |
CN116502009B (en) * | 2023-06-25 | 2023-10-31 | 北京奇虎科技有限公司 | Webpage filtering method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109885692B (en) | Knowledge data storage method, apparatus, computer device and storage medium | |
US8452763B1 (en) | Extracting and scoring class-instance pairs | |
CN111680128A (en) | Method and system for detecting web page sensitive words and related devices | |
US8073877B2 (en) | Scalable semi-structured named entity detection | |
US7191116B2 (en) | Methods and systems for determining a language of a document | |
US20140052688A1 (en) | System and Method for Matching Data Using Probabilistic Modeling Techniques | |
US8255405B2 (en) | Term extraction from service description documents | |
US20080147642A1 (en) | System for discovering data artifacts in an on-line data object | |
US20080147578A1 (en) | System for prioritizing search results retrieved in response to a computerized search query | |
US20130103662A1 (en) | Methods and apparatuses for generating search expressions from content, for applying search expressions to content collections, and/or for analyzing corresponding search results | |
CN101425071A (en) | Location expression detection device and computer readable medium | |
CN104063387A (en) | Device and method abstracting keywords in text | |
Yerra et al. | A sentence-based copy detection approach for web documents | |
CN112395881B (en) | Material label construction method and device, readable storage medium and electronic equipment | |
JP2008282366A (en) | Query response device, query response method, query response program, and recording medium with program recorded thereon | |
CN108345694B (en) | Document retrieval method and system based on theme database | |
CN111325018A (en) | Domain dictionary construction method based on web retrieval and new word discovery | |
Yang et al. | Ontology generation for large email collections. | |
US8862586B2 (en) | Document analysis system | |
Morbidoni et al. | Leveraging linked entities to estimate focus time of short texts | |
CN110309258B (en) | Input checking method, server and computer readable storage medium | |
Alsmadi et al. | Issues related to the detection of source code plagiarism in students assignments | |
Jeong et al. | Determining the titles of Web pages using anchor text and link analysis | |
JP2009205499A (en) | Web page specification apparatus, web page specification method, and program for specifying web page | |
CN101310274B (en) | A knowledge correlation search engine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200918 |