CN111680128A - Method and system for detecting web page sensitive words and related devices - Google Patents

Info

Publication number
CN111680128A
CN111680128A CN202010548352.6A CN202010548352A CN111680128A CN 111680128 A CN111680128 A CN 111680128A CN 202010548352 A CN202010548352 A CN 202010548352A CN 111680128 A CN111680128 A CN 111680128A
Authority
CN
China
Prior art keywords
sensitive
web page
detection
text
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010548352.6A
Other languages
Chinese (zh)
Inventor
徐凯熙
范渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Hangzhou Dbappsecurity Technology Co Ltd
Original Assignee
Hangzhou Dbappsecurity Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dbappsecurity Technology Co Ltd
Priority to CN202010548352.6A
Publication of CN111680128A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method for detecting web page sensitive words, which comprises the following steps: acquiring webpage data and detection requirements; extracting text from the webpage data to obtain text keywords; and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result. The method segments the webpage data into words and detects the segmented words individually, which avoids the false positives that arise with rule matching and thereby reduces the false alarm rate. The application also provides a web page sensitive word detection system, a computer-readable storage medium and an electronic device, which have the same beneficial effects.

Description

Method and system for detecting web page sensitive words and related devices
Technical Field
The present application relates to the field of network security, and in particular, to a method, a system, and a device for detecting web page sensitive words.
Background
Web page sensitive words are improperly used words contained in web page content. They may appear because an administrator did not carefully review uploaded content, or because a hacker tampered with the website and added sensitive words to an originally normal web page.
In the prior art, sensitive words are detected by rule matching. Rule matching can wrongly cut a sensitive word out of otherwise normal web page content, so the detection result contains false positives. How to avoid the false detection of sensitive words is therefore a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The application aims to provide a method and a system for detecting web page sensitive words, a computer readable storage medium and electronic equipment, which can reduce the false detection rate of the sensitive words.
In order to solve the technical problem, the application provides a method for detecting web page sensitive words, which has the following specific technical scheme:
acquiring webpage data and detection requirements;
extracting texts from the webpage data to obtain text keywords;
and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
Optionally, before the detecting the sensitive word of the text keyword by using the AC automaton based on the detection requirement, the method further includes:
generating an AC automaton based on the detection requirements.
Optionally, generating an AC automaton based on the detection requirement includes:
determining a sensitive phrase according to the detection requirement, and generating a dictionary tree corresponding to the sensitive phrase;
mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording subscripts of the states in the double array;
and generating an AC automaton according to the double array dictionary tree, wherein the subscripts are stored in a fail table in the AC automaton.
Optionally, performing text extraction on the webpage data to obtain text keywords includes:
performing text segmentation on the webpage data to obtain a shortest word set;
and constructing a network through TextRank with the words of the shortest word set as nodes, iteratively calculating a rank value of each node in the network through PageRank, and sorting the rank values to obtain the text keywords.
Optionally, performing text segmentation on the webpage data to obtain a shortest term set includes:
and performing text segmentation on the webpage data by using a lexical analyzer based on a HanLP word segmentation algorithm to obtain a shortest word set.
Optionally, after obtaining the sensitive word detection result, the method further includes:
and filtering a false alarm result in the sensitive word detection result according to the category of the text keyword to obtain an accurate detection result.
The present application further provides a web page sensitive word detection system, including:
the acquisition module is used for acquiring webpage data and detection requirements;
the text extraction module is used for extracting texts from the webpage data to obtain text keywords;
and the detection module is used for detecting the sensitive words of the text keywords by utilizing an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
Optionally, the system further includes:
and the AC automaton generating module is used for generating the AC automaton based on the detection requirement.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as set forth above.
The present application further provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method described above when calling the computer program in the memory.
The application provides a method for detecting web page sensitive words, which comprises the following steps: acquiring webpage data and detection requirements; extracting text from the webpage data to obtain text keywords; and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
The method segments the webpage data into words and detects the segmented words individually, which avoids the false positives that arise with rule matching and thereby reduces the false alarm rate. The application also provides a web page sensitive word detection system, a computer-readable storage medium and an electronic device, which have the same beneficial effects and are not described again here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for detecting web page sensitive words according to an embodiment of the present application;
fig. 2 is a schematic diagram of the goto table of an AC automaton provided in the embodiments of the present application;
fig. 3 is a schematic structural diagram of an AC automaton provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of a web page sensitive word detection system according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for detecting web page sensitive words according to an embodiment of the present application, where the method includes:
s101: acquiring webpage data and detection requirements;
This step acquires the webpage data to be detected and the corresponding detection requirement. Each website emphasizes different sensitive words, so the detection requirements differ as well. Suitable sensitive words can be selected according to the detection requirement to form a sensitive word library, from which the corresponding AC automaton is generated later.
S102: extracting texts from the webpage data to obtain text keywords;
This step extracts the text to obtain text keywords, which avoids the false positives of rule matching, that is, the misreadings and false alarms caused by splitting words apart.
The text extraction method is not specifically limited. This embodiment provides a preferred method: text segmentation is performed on the webpage data to obtain a shortest word set; TextRank then constructs a network with the words of the shortest word set as nodes; the rank value of each node is calculated iteratively using PageRank; and the rank values are sorted to obtain the text keywords. Text segmentation comes first: a lexical analyzer segments the crawled webpage content into a logically consistent shortest word set. The segmentation method is not particularly limited either; it may use a lexical analyzer based on the HanLP segmentation algorithm, or a lexical analyzer obtained from the CoreNLP tool, a final segmentation tool, or the like.
The purpose of text segmentation is to obtain the shortest word set, in which, it is understood, the same character belongs to at most one word.
After the shortest word set is obtained, whether two words are adjacent is determined from their positions in the original webpage text; the words of the shortest word set are treated as nodes, and the rank value of each node is calculated by PageRank iteration.
TextRank calculates rank values by a voting principle: each word votes for its neighboring nodes, and the weight of a vote depends on the number of votes the word itself holds (namely, the number of its neighbors).
The formula for PageRank is as follows:
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
wherein S(V_i) represents the rank value of node V_i, In(V_i) represents the set of predecessor nodes of node V_i, Out(V_j) represents the set of successor nodes of node V_j, and d is a damping coefficient used for smoothing.
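The PageRank iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the application: the function name `pagerank`, the damping value 0.85, the fixed iteration count and the three-node graph are all assumptions chosen for the example.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of S(Vj) / |Out(Vj)|."""
    nodes = list(out_links)
    # Derive In(Vi) from the out-link lists.
    in_links = {v: [] for v in nodes}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Every new score is computed from the scores of the previous iteration.
        score = {
            v: (1 - d) + d * sum(score[u] / len(out_links[u]) for u in in_links[v])
            for v in nodes
        }
    return score


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

In this toy graph node C receives votes from both A and B, so it ends up with a higher rank value than B, which is voted for by A only.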
TextRank treats the N words before a word and the N words after it as adjacent to that word in the graph. In practice, a sliding window of length N is set, and all words inside the window are regarded as adjacent nodes of the word node.
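The sliding-window adjacency can be sketched as follows. This is a sketch under stated assumptions: the function name `build_cooccurrence_graph` and the sample word list are hypothetical, and the window is interpreted as "up to `window` words before and after the current word".

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=2):
    """Link each word to the words within `window` positions before or after it."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            # Skip the word itself and repeated occurrences of the same word.
            if j != i and words[j] != word:
                neighbors[word].add(words[j])
    return neighbors


# Hypothetical output of the segmentation step.
words = ["web", "page", "sensitive", "word", "detection"]
graph = build_cooccurrence_graph(words, window=1)
```

With `window=1`, the word "page" is adjacent to "web" and "sensitive" only; a larger window adds more distant words as neighbors.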
The iterative calculation formula of TextRank is as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$
wherein WS(V_i) represents the score of word node V_i; w_{ji} represents the weight of the edge from word node V_j to word node V_i; WS(V_j) on the right-hand side represents the score of word node V_j from the previous iteration; In(V_i) is the set of nodes pointing to word node V_i; Out(V_j) is the set of nodes that V_j points to; and d is the damping coefficient.
The purpose of this step is to obtain the text keywords: after the rank values are sorted, a preset number of top-ranked words from the shortest word set are taken as the text keywords.
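The weighted TextRank iteration and the selection of the top-ranked words can be sketched as below. The names `textrank` and `top_keywords`, the damping value 0.85, the iteration count and the edge weights are illustrative assumptions, not values given in the application.

```python
def textrank(weights, d=0.85, iterations=50):
    """weights[j][i] is w_ji, the weight of the edge from word j to word i."""
    nodes = list(weights)
    # Denominator of the formula: total weight of the edges leaving each word.
    out_sum = {j: sum(weights[j].values()) for j in nodes}
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        score = {
            i: (1 - d) + d * sum(
                weights[j][i] / out_sum[j] * score[j]
                for j in nodes if i in weights[j]
            )
            for i in nodes
        }
    return score


def top_keywords(score, k):
    """Sort words by rank value and keep the first k as text keywords."""
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical co-occurrence weights between three segmented words.
weights = {
    "sensitive": {"word": 2.0, "detection": 1.0},
    "word": {"sensitive": 2.0, "detection": 1.0},
    "detection": {"sensitive": 1.0, "word": 1.0},
}
scores = textrank(weights)
keywords = top_keywords(scores, 2)
```

Here "sensitive" and "word" co-occur most strongly, so they outrank "detection" and are the two words kept as keywords.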
S103: performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
After the text keywords are obtained in step S102, sensitive word detection is performed by using an AC automaton corresponding to the detection requirement. It will be readily appreciated that the present embodiment defaults to the need to generate an AC automaton based on the detection requirements prior to performing the present step. The method for generating the AC automaton is not limited, and the present embodiment provides a preferred method for generating the AC automaton:
s1031: determining a sensitive phrase according to the detection requirement, and generating a dictionary tree corresponding to the sensitive phrase;
s1032: mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording subscripts of the states in the double array;
s1033: and generating an AC automaton according to the double array dictionary tree, wherein the subscripts are stored in a fail table in the AC automaton.
In step S1031, the sensitive phrase may be determined from the sensitive word library according to the detection requirement, and the corresponding dictionary tree is generated from it. Because a plain dictionary tree retrieves sensitive words inefficiently, the method generates a double-array dictionary tree to improve retrieval efficiency. A person skilled in the art may also omit the double-array dictionary tree, or retrieve sensitive words with another tree structure derived from the dictionary tree; both remain within the scope of the present application. The double-array dictionary tree reduces the space used by the original dictionary tree and shortens query time, since each transition is a simple addition, so sensitive word retrieval becomes more efficient.
In step S1032, two integer arrays (the double array) are used: a base array and a check array. Each element of the base array represents a node of the tree and records the Unicode code of the current node's character, while each element of the check array records the predecessor state of a state. In the initial state, all base and check values are set to 0. During construction, a state whose base and check values are both 0 is idle. If a state corresponds to a complete word, its base value is set to a negative number; if the state is a complete word and that word is not a prefix of any other word, its base value is set to the negative of the state's position.
The relationship of base and check satisfies the following condition:
base[s]+c=t
check[t]=s
where s is the subscript of the current state, t is the subscript of the state reached by the transition, and c is the numeric value of the input character; this embodiment uses the Unicode code of the character.
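The relationship base[s] + c = t, check[t] = s can be illustrated with a naive builder and lookup. This is a simplified sketch, not the construction described in the embodiment: it marks free slots with -1 instead of 0, does not encode word ends as negative base values, and the function names `build_double_array`, `transition` and `contains_path` are hypothetical.

```python
def build_double_array(words):
    """Naive double-array trie builder (simplified: -1 marks a free slot)."""
    trie = [{}]                      # trie[n] maps a character to a child node
    for word in words:
        n = 0
        for ch in word:
            if ch not in trie[n]:
                trie.append({})
                trie[n][ch] = len(trie) - 1
            n = trie[n][ch]
    base, check = [0], [0]           # index 0 is the root state
    queue = [(0, 0)]                 # (trie node, its index in the double array)
    while queue:
        node, s = queue.pop(0)
        codes = [ord(ch) for ch in trie[node]]
        if not codes:
            continue                 # leaf state: no transitions to place
        b = 1
        # Find the smallest offset b whose target slots b + c are all free.
        while any(b + c < len(check) and check[b + c] != -1 for c in codes):
            b += 1
        needed = b + max(codes) + 1
        base.extend([0] * (needed - len(base)))
        check.extend([-1] * (needed - len(check)))
        base[s] = b
        for ch, child in trie[node].items():
            t = b + ord(ch)          # base[s] + c = t
            check[t] = s             # check[t] = s
            queue.append((child, t))
    return base, check


def transition(base, check, s, ch):
    """Follow one character from state s; return None if the transition fails."""
    t = base[s] + ord(ch)
    return t if t < len(check) and check[t] == s else None


def contains_path(base, check, text):
    """Return True if text is a path from the root (a word or a word prefix)."""
    s = 0
    for ch in text:
        s = transition(base, check, s, ch)
        if s is None:
            return False
    return True


base, check = build_double_array(["he", "she", "his", "hers"])
```

Each lookup step costs one addition and one comparison, which is the "simple mathematical addition operation" the description refers to.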
On the basis of the double-array Trie constructed in step S1032, when each sensitive word is mapped to a double array, the subscripts of the sensitive word in the double array are recorded at the same time for recording in the fail table of the AC automaton.
The generation principle of the AC automaton is explained below using the sample set {he, she, his, hers}:
the AC automaton is composed of a goto table, a fail table and an output table.
The goto table, also called the success table, is a prefix tree, like the double-array Trie. Parsing the sample into a goto table gives the structure shown in fig. 2, a schematic diagram of the goto table of the AC automaton provided in the embodiment of the present application, in which states 2, 5, 7 and 9 are success states, meaning that each of them corresponds to a word in the sample.
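A goto table for the sample can be built as an ordinary prefix tree. The sketch below assumes that states are numbered in the order in which they are created while the words are inserted one after another; under that assumption the success states come out as 2, 5, 7 and 9, matching the description. The function name `build_goto` is hypothetical.

```python
def build_goto(words):
    goto = [{}]          # goto[state] maps a character to the next state
    terminal = {}        # terminal[state] is the word that ends at that state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                # Create a new state numbered after all existing ones.
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        terminal[state] = word
    return goto, terminal


goto, terminal = build_goto(["he", "she", "his", "hers"])
```

At this stage each success state carries only its own word; suffix words such as he at state 5 are added once the fail table has been built.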
The output table records the corresponding relationship between the state and the words in the sample, and the output table corresponding to the goto table is shown in the following table 1:
Table 1. Correspondence between goto table states and the output table
State number    Output
2               he
5               he, she
7               his
9               hers
The elements in the Output table have two types, one is a word in the sample corresponding to the path from the initial state to the current state, such as state 2; another is the word in the sample corresponding to the suffix of the path, as in state 5.
The fail table stores a one-to-one mapping between states: for each state, it records the best state to fall back to when a state transition fails. The best state is the one that retains the longest suffix of the string matched so far. Take state 5, which corresponds to the word she: if the goto transition fails after a further letter is read, he is the longest suffix of she that is also a successful path, and it corresponds to state 2. State 2 is therefore the best choice when state 5 fails; after falling back to state 2, the AC automaton still remembers the prefix he and is ready to accept the letter r.
Building a fail table: let the current state be S, and let S.goto(c) denote the state returned by the goto table after S accepts the character c. If the transition fails, a fallback point is found according to the longest-suffix principle, and that fallback point is the entry recorded in the fail table.
(1) The goto table of the initial state is complete and never fails, so state 0 has no fail pointer, and the fail pointers of all states directly connected to the initial state point to the initial state.
(2) A breadth-first traversal is performed from the initial state. If the current state S accepts the character c and reaches state T, the fail pointers are followed back from S until the first predecessor state F is found such that F.goto(c) != null; the fail pointer of T is then pointed to F.goto(c). Take state 3 as an example: it accepts the character h and reaches state 4. Backtracking from state 3 leads to the initial state 0, and 0.goto(h) is state 1, so the fail pointer of state 4 points to state 1.
(3) Since the path of F is a suffix of the path of T, that is, T must contain F, the output of T should also contain the output of F, and the update T.output += F.output is applied.
Following these steps, the final AC automaton for the sample is shown in fig. 3, a schematic structural diagram of the AC automaton provided in the embodiment of the present application, in which dashed arrows represent fail pointers and solid arrows represent goto pointers.
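Steps (1) to (3) can be sketched in Python as follows. This is an illustrative Aho-Corasick construction over plain dictionaries rather than the double-array structure of the embodiment, and the names `build_ac` and `search` are assumptions for the example.

```python
from collections import deque


def build_ac(words):
    """Build the goto, fail and output tables for a list of words."""
    goto, output = [{}], [set()]
    for word in words:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(word)
    fail = [0] * len(goto)
    # Step (1): states directly below the initial state fail back to state 0.
    queue = deque(goto[0].values())
    while queue:                       # step (2): breadth-first traversal
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            # Follow fail pointers until a state F with F.goto(ch) is found.
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] |= output[fail[t]]   # step (3): T.output += F.output
            queue.append(t)
    return goto, fail, output


def search(text, goto, fail, output):
    """Return (start index, word) for every occurrence of a word in text."""
    hits, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]                # fall back to the longest usable suffix
        s = goto[s].get(ch, 0)
        for word in output[s]:
            hits.append((i - len(word) + 1, word))
    return hits


goto, fail, output = build_ac(["he", "she", "his", "hers"])
hits = sorted(search("ushers", goto, fail, output))
```

For the sample, the fail pointer of state 4 is state 1 and that of state 5 is state 2, and the output of state 5 contains both he and she, in line with Table 1. Scanning the string "ushers" reports she, he and hers in a single pass.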
The embodiment first segments the webpage data with a word segmentation algorithm and then detects the segmented words individually, which avoids the false positives that arise with rule matching and reduces the false alarm rate.
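The effect of segmenting before detecting can be illustrated with a small comparison. This is a simplified sketch: a regular expression over Latin letters stands in for a real word segmenter such as HanLP, whole-word comparison stands in for running the AC automaton on each keyword, and the function names and sample sentences are hypothetical.

```python
import re


def substring_hits(text, sensitive_words):
    """Match on raw text: flags a word even when it sits inside another word."""
    return sorted(w for w in sensitive_words if w in text)


def token_hits(text, sensitive_words):
    """Segment first, then compare whole segmented words only."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & set(sensitive_words))


sensitive = {"he", "she", "his", "hers"}
clean_text = "ushers gather at the theatre"
flagged_text = "she left his hat"

raw_result = substring_hits(clean_text, sensitive)
segmented_result = token_hits(clean_text, sensitive)
true_result = token_hits(flagged_text, sensitive)
```

On the first sentence, raw matching reports he, hers and she because they occur inside "ushers" and "the", while matching on segmented words reports nothing; on the second sentence the segmented matcher still finds the words that genuinely appear.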
On the basis of the above embodiment, as a preferred embodiment, after the sensitive word detection result is obtained, the false positive result in the sensitive word detection result may be filtered according to the category to which the text keyword belongs, so as to obtain an accurate detection result.
After the text keywords are determined, false alarm filtering is applied to the sensitive word detection result to guard against failed or incorrect word segmentation and to further avoid false alarms. Other ways of reducing false alarms are also possible, such as recording the position of each detected sensitive word in the web page, capturing and recording the web page content, or passing the result to a human for further review. In this embodiment, the TextRank algorithm extracts the keywords of the webpage content, and detection results with low sensitivity within the same category are filtered out according to the category to which the keywords belong, which reduces the false alarm rate.
In the following, a web page sensitive word detection system provided in an embodiment of the present application is introduced, and the web page sensitive word detection system described below and the web page sensitive word detection method described above may be referred to in a corresponding manner.
The present application further provides a web page sensitive word detection system, including:
an obtaining module 100, configured to obtain webpage data and a detection requirement;
the text extraction module 200 is configured to perform text extraction on the webpage data to obtain text keywords;
and the detection module 300 is configured to perform sensitive word detection on the text keyword by using an AC automaton based on the detection requirement, so as to obtain a sensitive word detection result.
Based on the above embodiment, as a preferred embodiment, the system may further include:
and the AC automaton generating module is used for generating the AC automaton based on the detection requirement.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, may implement the steps provided by the above-described embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and when the processor calls the computer program in the memory, the steps provided in the foregoing embodiments may be implemented. Of course, the electronic device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system provided by the embodiment, since the system corresponds to the method provided by the embodiment, the description is simple, and the relevant points can be referred to the description of the method part.
The principle and embodiments of the present application are explained herein by using specific examples, and the above descriptions of the embodiments are only used to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method for detecting web page sensitive words is characterized by comprising the following steps:
acquiring webpage data and detection requirements;
extracting texts from the webpage data to obtain text keywords;
and performing sensitive word detection on the text keywords by using an AC (Aho-Corasick) automaton based on the detection requirements to obtain a sensitive word detection result.
2. The method for detecting web page sensitive words according to claim 1, wherein before the detecting the sensitive words of the text keywords by using an AC automaton based on the detection requirement, the method further comprises:
generating an AC automaton based on the detection requirements.
3. The method for detecting web page sensitive words according to claim 2, wherein generating an AC automaton based on the detection requirement comprises:
determining a sensitive phrase according to the detection requirement, and generating a dictionary tree corresponding to the sensitive phrase;
mapping each state in the dictionary tree to a double array by using a double array dictionary tree generation algorithm to generate a double array dictionary tree, and recording subscripts of the states in the double array;
and generating an AC automaton according to the double array dictionary tree, wherein the subscripts are stored in a fail table in the AC automaton.
4. The method for detecting the web page sensitive words according to claim 1, wherein the step of extracting the text of the web page data to obtain the text keywords comprises the steps of:
performing text segmentation on the webpage data to obtain a shortest word set;
and constructing a network through TextRank with the words of the shortest word set as nodes, iteratively calculating a rank value of each node in the network through PageRank, and sorting the rank values to obtain the text keywords.
5. The method for detecting the web page sensitive words according to claim 4, wherein the step of performing text segmentation on the web page data to obtain a shortest word set comprises the following steps:
and performing text segmentation on the webpage data by using a lexical analyzer based on a HanLP word segmentation algorithm to obtain a shortest word set.
6. The method for detecting the web page sensitive word according to claim 1, after obtaining the sensitive word detection result, further comprising:
and filtering a false alarm result in the sensitive word detection result according to the category of the text keyword to obtain an accurate detection result.
7. A web page sensitive word detection system, comprising:
the acquisition module is used for acquiring webpage data and detection requirements;
the text extraction module is used for extracting texts from the webpage data to obtain text keywords;
and the detection module is used for detecting the sensitive words of the text keywords by utilizing the AC automaton based on the detection requirements to obtain a sensitive word detection result.
8. The web page sensitive word detection system of claim 7, further comprising:
and the AC automaton generating module is used for generating the AC automaton based on the detection requirement.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. An electronic device, comprising a memory in which a computer program is stored and a processor which, when calling the computer program in the memory, implements the steps of the method according to any one of claims 1-6.
CN202010548352.6A 2020-06-16 2020-06-16 Method and system for detecting web page sensitive words and related devices Pending CN111680128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010548352.6A CN111680128A (en) 2020-06-16 2020-06-16 Method and system for detecting web page sensitive words and related devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010548352.6A CN111680128A (en) 2020-06-16 2020-06-16 Method and system for detecting web page sensitive words and related devices

Publications (1)

Publication Number Publication Date
CN111680128A true CN111680128A (en) 2020-09-18

Family

ID=72455192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010548352.6A Pending CN111680128A (en) 2020-06-16 2020-06-16 Method and system for detecting web page sensitive words and related devices

Country Status (1)

Country Link
CN (1) CN111680128A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255118A (en) * 2017-07-11 2019-01-22 普天信息技术有限公司 A kind of keyword extracting method and device
CN109684469A (en) * 2018-12-13 2019-04-26 平安科技(深圳)有限公司 Filtering sensitive words method, apparatus, computer equipment and storage medium
CN109902290A (en) * 2019-01-23 2019-06-18 广州杰赛科技股份有限公司 A kind of term extraction method, system and equipment based on text information
CN109918548A (en) * 2019-04-08 2019-06-21 上海凡响网络科技有限公司 A kind of methods and applications of automatic detection document sensitive information

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112150251A (en) * 2020-10-09 2020-12-29 北京明朝万达科技股份有限公司 Article name management method and device
CN112100361A (en) * 2020-11-12 2020-12-18 南京中孚信息技术有限公司 Character string multimode fuzzy matching method based on AC automaton
CN112100361B (en) * 2020-11-12 2021-02-26 南京中孚信息技术有限公司 Character string multimode fuzzy matching method based on AC automaton
CN114266247A (en) * 2021-12-20 2022-04-01 中国农业银行股份有限公司 Sensitive word filtering method and device, storage medium and electronic equipment
CN116502009A (en) * 2023-06-25 2023-07-28 北京奇虎科技有限公司 Webpage filtering method, device, equipment and storage medium
CN116502009B (en) * 2023-06-25 2023-10-31 北京奇虎科技有限公司 Webpage filtering method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
US8452763B1 (en) Extracting and scoring class-instance pairs
CN111680128A (en) Method and system for detecting web page sensitive words and related devices
US8073877B2 (en) Scalable semi-structured named entity detection
US7191116B2 (en) Methods and systems for determining a language of a document
US20140052688A1 (en) System and Method for Matching Data Using Probabilistic Modeling Techniques
US8255405B2 (en) Term extraction from service description documents
US20080147642A1 (en) System for discovering data artifacts in an on-line data object
US20080147578A1 (en) System for prioritizing search results retrieved in response to a computerized search query
US20130103662A1 (en) Methods and apparatuses for generating search expressions from content, for applying search expressions to content collections, and/or for analyzing corresponding search results
CN101425071A (en) Location expression detection device and computer readable medium
CN104063387A (en) Device and method abstracting keywords in text
Yerra et al. A sentence-based copy detection approach for web documents
CN112395881B (en) Material label construction method and device, readable storage medium and electronic equipment
JP2008282366A (en) Query response device, query response method, query response program, and recording medium with program recorded thereon
CN108345694B (en) Document retrieval method and system based on theme database
CN111325018A (en) Domain dictionary construction method based on web retrieval and new word discovery
Yang et al. Ontology generation for large email collections.
US8862586B2 (en) Document analysis system
Morbidoni et al. Leveraging linked entities to estimate focus time of short texts
CN110309258B (en) Input checking method, server and computer readable storage medium
Alsmadi et al. Issues related to the detection of source code plagiarism in students assignments
Jeong et al. Determining the titles of Web pages using anchor text and link analysis
JP2009205499A (en) Web page specification apparatus, web page specification method, and program for specifying web page
CN101310274B (en) A knowledge correlation search engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200918