CN113076390A - Forbidden word query method and device - Google Patents

Forbidden word query method and device

Info

Publication number
CN113076390A
Authority
CN
China
Prior art keywords
forbidden
word
words
detected
tree structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110424388.8A
Other languages
Chinese (zh)
Inventor
彭丽娥
吴小光
吴焕倪
唐晓玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen South China City Network Technology Co ltd
Original Assignee
Shenzhen South China City Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen South China City Network Technology Co ltd filed Critical Shenzhen South China City Network Technology Co ltd
Priority to CN202110424388.8A priority Critical patent/CN113076390A/en
Publication of CN113076390A publication Critical patent/CN113076390A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

To overcome the shortcomings of the prior art, the invention provides a forbidden word query method and device. The method comprises the following steps: constructing a forbidden word database by using a tree structure mode; loading the forbidden word database into a cache, and splitting the unit to be detected into single characters according to the word order; matching the first split character against the root directories of the forbidden word database and, after the match succeeds, matching each level of nodes of the tree structure against the subsequent split characters of the unit to be detected one by one, in order, to determine the forbidden words in the unit to be detected; and processing the detected forbidden words and sending a notice that prohibits sending them. The method does not need to traverse the entire forbidden word set; it only visits the child nodes under the root directory that matches the first character of a candidate word, so the retrieval range is greatly reduced and query performance is improved.

Description

Forbidden word query method and device
Technical Field
The invention relates to the technical field of computers, in particular to a forbidden word query method, a forbidden word query device, a forbidden word query system, an electronic device and a storage medium.
Background
With the popularization of internet technology, network services have become an essential part of daily life and bring people considerable convenience.
As networks have developed, some illegal speech continues to have a negative effect online, so forbidden words are generally blocked in content from broadcasters, in articles and the like. On some websites and platforms, regulating language through forbidden words has become mandatory: certain words, or certain classes of words, are designated as illegal vocabulary. Forbidden words on a website mainly appear in three places: the front-end code of the website, the text added in the back end (titles, descriptions and content), and the pictures on the website.
The conventional approach is either to check manually, through the front-end pages, which forbidden words each page contains and then enter the back end to edit the files of the corresponding column one by one, or to use the back end's built-in search-and-replace function, entering the word to find and its replacement and replacing all occurrences.
Back-end replacement is generally implemented as follows: all forbidden words are listed in a configuration file, read when the project starts, and placed in a cached set (a hashSet). When form data is submitted, the content is matched against the forbidden word set in a loop, one entry at a time; if forbidden words are present, they are counted and the front-end user is prompted.
However, because the forbidden words are stored in a hashSet, locating them requires traversing the set, that is, matching the unit to be detected against every entry in a loop. This approach does find the forbidden words, but it is inefficient.
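The looping baseline described above can be sketched as follows. This is a minimal illustration in Python rather than the hashSet-based back end the text refers to; the function name `find_forbidden_naive` and the sample words are assumptions made for the example, not part of the patent.

```python
def find_forbidden_naive(text, forbidden_words):
    """Return sorted (start_index, word) pairs by testing every stored word."""
    hits = []
    for word in forbidden_words:              # one search pass per stored word
        start = text.find(word)
        while start != -1:
            hits.append((start, word))
            start = text.find(word, start + 1)
    return sorted(hits)


# Hypothetical sample data, for illustration only.
forbidden = {"bad", "worse", "ugly"}
print(find_forbidden_naive("a bad and ugly bad idea", forbidden))
# → [(2, 'bad'), (10, 'ugly'), (15, 'bad')]
```

The work grows with the number of stored words, because each one is searched for even when the text contains nothing that starts like it.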
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a forbidden word query method and a forbidden word query device, which are used for solving at least one of the technical problems.
Specifically, the technical scheme is as follows:
a forbidden word query method comprises the following steps:
constructing a forbidden word database by using a tree structure mode;
loading the forbidden word database into a cache, and splitting the unit to be detected into single characters according to the word order;
matching the first split character against the root directories of the forbidden word database and, after the match succeeds, matching each level of nodes of the tree structure against the subsequent split characters of the unit to be detected one by one, in order, to determine the forbidden words in the unit to be detected;
and processing the detected forbidden words, and sending a notice to prohibit sending the forbidden words.
Constructing the forbidden word database by using the tree structure mode comprises:
dividing each forbidden word into single characters according to the word order;
and taking the first single character as a root directory and arranging the remaining single characters in word order to form a tree structure, thereby constructing the forbidden word database.
Constructing the forbidden word database by using the tree structure mode further comprises a step of updating the forbidden word database, which comprises:
dividing the forbidden word to be entered into single characters according to the word order;
matching the first split character against the root directories of the forbidden word database;
and, if that character does not exist as a root directory, establishing a tree structure with the character as its root directory.
Establishing the tree structure with the single character as the root directory comprises:
making the split single characters correspond, in word order, to the nodes at each level of the tree structure.
The step of processing the detected forbidden words comprises the following steps:
and deleting or replacing the forbidden words.
A root directory and N-level nodes exist in the tree structure mode; the first single character after each forbidden word is split corresponds to the root directory; and the N single characters after the first single character correspond to the N-level nodes one by one, and N is a positive integer.
An information auditing system, comprising:
a forbidden word database module, used for constructing a forbidden word database in a tree structure mode;
a front-end module, which exchanges data with the forbidden word database module and is used for acquiring an external unit to be detected;
a character acquisition module, which exchanges data with the front-end module and is used for splitting the unit to be detected according to the word order and matching the split single characters against the root directory and each level of nodes in the forbidden word database module, so as to determine the forbidden words in the unit to be detected;
and a processing module, which exchanges data with the front-end module and the character acquisition module and is used for deleting or replacing the forbidden words in the unit to be detected after the character acquisition module determines them.
The forbidden word database module adopts a DFA algorithm and comprises a forbidden word database;
the forbidden word database is loaded into a cached set when the information auditing system starts.
An electronic device for forbidden word detection, comprising:
a storage medium for storing a computer program;
a processing unit that exchanges data with the storage medium and is configured to execute the computer program when performing forbidden word detection, so as to perform the steps of the forbidden word query method according to any one of claims 1 to 6.
A computer-readable storage medium having a computer program stored therein;
the computer program, when run, performs the steps of the forbidden word query method according to any one of claims 1 to 6.
The invention has at least the following beneficial effects:
In the forbidden word query method, a forbidden word database is first built in a tree structure mode; the unit to be detected is then split into single characters according to the word order, and these characters are matched against the forbidden word database in order, which determines the forbidden words in the unit to be detected; finally, the detected forbidden words are processed and a notice is sent that prohibits sending them. The method does not need to traverse the entire forbidden word set; it only visits the child nodes under the root directory that matches the first character of a candidate word, so the retrieval range is greatly reduced and query performance is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of the system of the present invention.
Fig. 3 is an embodiment of the present invention.
Fig. 4 is a flow chart of the invention as applied to the DFA algorithm.
Reference numerals: 100, forbidden word database module; 200, front-end module; 300, character acquisition module; 400, processing module.
Detailed Description
Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
As shown in Fig. 1, a forbidden word query method includes the following steps:
S1, constructing a forbidden word database by using a tree structure mode; loading the forbidden word database into a cache, and splitting the unit to be detected into single characters according to the word order;
S2, matching the first split character against the root directories of the forbidden word database and, after the match succeeds, matching each level of nodes of the tree structure against the subsequent split characters of the unit to be detected one by one, in order, to determine the forbidden words in the unit to be detected;
S3, processing the detected forbidden words and sending a notice that prohibits sending them.
The tree structure described herein is a hierarchically nested structure: its outer and inner layers have similar structures, so it can be represented recursively. The trees of classic data structures are typical examples. A binary tree, for instance, can be described simply as a root, a left subtree and a right subtree, where each subtree in turn has subtrees of its own. More generally, a tree structure is a data structure with one-to-many relationships among its data elements, and it is an important nonlinear data structure. The root node of a tree has no predecessor node, and every other node has exactly one predecessor node. A leaf node has no successor nodes, and each of the remaining nodes may have one or more successors.
In the tree structure mode, the text in the unit to be detected is split into single characters according to the word order, and the first character is compared with the root directories of the forbidden word database. If a root directory matches, there is no need to loop over all the data in the forbidden word database; only the forbidden word data under that root directory is compared, character by character, which saves considerable time.
Further, constructing the forbidden word database by using the tree structure mode includes: dividing each forbidden word into single characters according to the word order; and taking the first single character as a root directory and arranging the remaining single characters in word order to form a tree structure, thereby constructing the forbidden word database. Establishing the tree structure with the single character as the root directory includes making the split single characters correspond, in word order, to the nodes at each level of the tree structure. The tree structure mode has a root directory and N levels of nodes: the first single character of each split forbidden word corresponds to the root directory, and the N single characters after it correspond one-to-one to the N levels of nodes, where N is a positive integer.
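The construction just described can be sketched with nested dictionaries, where each first character becomes a root entry and each later character becomes a child one level down. This is a minimal sketch under stated assumptions: the function name `build_forbidden_tree`, the `"end"` marker key and the sample words are hypothetical and do not come from the patent.

```python
def build_forbidden_tree(words):
    """Build a nested-dict tree: first character -> root, later characters -> child levels."""
    roots = {}
    for word in words:
        level = roots
        for ch in word:                       # split the word into single characters, in order
            level = level.setdefault(ch, {})  # reuse an existing node or create a new one
        level["end"] = True                   # assumed marker: a forbidden word ends at this node
    return roots


# Hypothetical sample words: "cat" and "car" share the root "c" and the first-level node "a".
tree = build_forbidden_tree(["cat", "car", "dog"])
print(sorted(tree))            # → ['c', 'd']
print(sorted(tree["c"]["a"]))  # → ['r', 't']
```

Because every key taken from a word is a single character, the three-character marker `"end"` cannot collide with a real node.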
For example, take the forbidden word "do you" (the quoted English terms here stand for single characters of the original-language words). In word order it is split into three single characters: "you", "good" and "do". When the forbidden word database is constructed, "you", which comes first, is the root directory; "good" is the first-level node; and "do" is the second-level node. Suppose the text to be queried is "hello stick". The process is as follows:
"hello stick" is split in word order into "you", "good" and "stick". The root directories are matched first, and the root directory "you" is found, so only the data under the root directory "you" needs to be compared. Next, the first-level nodes under that root are matched against "good" and a match is found, so only the data whose root directory is "you" and whose first-level node is "good" remains to be compared. Finally, the second-level nodes are compared: if "stick" does not exist there, the text is judged not to be a forbidden word; if "stick" does exist, it is a forbidden word.
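The walk-through above can be reproduced with a small sketch. Each word is modelled as a tuple of glossed characters so that the example matches the text; the helper names `build_tree` and `is_forbidden`, and the use of a `None` key as the end-of-word marker, are assumptions of this illustration.

```python
def build_tree(words):
    """Nested-dict tree; each word is a sequence of single characters."""
    roots = {}
    for chars in words:
        level = roots
        for ch in chars:
            level = level.setdefault(ch, {})
        level[None] = True        # assumed marker: a forbidden word ends here
    return roots


def is_forbidden(chars, roots):
    """Follow root -> level 1 -> level 2 ...; stop at the first missing character."""
    level = roots
    for ch in chars:
        if ch not in level:
            return False          # no such root or child node, so not a forbidden word
        level = level[ch]
    return None in level          # must end exactly where a stored word ends


tree = build_tree([("you", "good", "do")])
print(is_forbidden(("you", "good", "stick"), tree))  # → False
print(is_forbidden(("you", "good", "do"), tree))     # → True
```

The query for "you", "good", "stick" touches three nodes at most and never looks at words stored under other roots.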
As society changes, the number of forbidden words keeps growing, so an established forbidden word database inevitably becomes incomplete, and new forbidden words then cannot be blocked.
To solve this problem, constructing the forbidden word database by using the tree structure mode further includes a step of updating the forbidden word database: dividing the forbidden word to be entered into single characters according to the word order; matching the first split character against the root directories of the forbidden word database; and, if that character does not exist as a root directory, establishing a tree structure with the character as its root directory.
This update method keeps the forbidden word database current and helps avoid missed detections.
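The update step can be sketched as an insertion into an existing tree. The text only states what happens when the first character has no root directory yet; extending an existing root with the remaining characters is an assumption of this sketch, as are the name `add_forbidden_word` and the `"end"` marker.

```python
def add_forbidden_word(roots, word):
    """Insert one word into an existing nested-dict tree.

    Returns True when a new root directory had to be created for the
    word's first character, False when the root already existed.
    """
    if not word:
        return False
    created_root = word[0] not in roots
    level = roots
    for ch in word:                       # characters in word order
        level = level.setdefault(ch, {})  # assumed: existing branches are extended in place
    level["end"] = True                   # assumed end-of-word marker
    return created_root


roots = {}
print(add_forbidden_word(roots, "cat"))  # → True  (new root "c")
print(add_forbidden_word(roots, "car"))  # → False (root "c" already exists)
print(add_forbidden_word(roots, "dog"))  # → True  (new root "d")
```

Inserting the same word twice leaves the tree unchanged, so the update can be replayed safely when the cache is refreshed.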
The step of processing the detected forbidden words includes deleting or replacing the forbidden words. Replacing here means substituting a designated symbol or word for the forbidden word, for example a uniform character such as "+" or "#", or the text "forbidden word".
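Deletion and replacement can both be sketched with one function that rewrites every matched span. This is an illustrative sketch, not the patent's implementation: the names `build_tree` and `mask_forbidden`, the `"end"` marker, the `"*"` mask and the choice to prefer the longest match at each position are all assumptions.

```python
def build_tree(words):
    """Nested-dict tree of forbidden words, one character per level."""
    roots = {}
    for word in words:
        level = roots
        for ch in word:
            level = level.setdefault(ch, {})
        level["end"] = True               # assumed end-of-word marker
    return roots


def mask_forbidden(text, roots, mask="*"):
    """Replace each forbidden word with `mask` per character; mask="" deletes it."""
    pieces = []
    i = 0
    while i < len(text):
        level, j, last_end = roots, i, -1
        while j < len(text) and text[j] in level:
            level = level[text[j]]
            j += 1
            if "end" in level:
                last_end = j              # longest match so far covers text[i:j]
        if last_end != -1:
            pieces.append(mask * (last_end - i))
            i = last_end                  # skip past the matched word
        else:
            pieces.append(text[i])
            i += 1
    return "".join(pieces)


tree = build_tree(["bad", "badger"])
print(mask_forbidden("a bad badger", tree))  # → a *** ******
```

Passing `mask=""` removes the matched words instead of masking them, which corresponds to the deletion option.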
Referring to Fig. 2, an information auditing system includes a forbidden word database module 100, a front-end module 200, a character acquisition module 300 and a processing module 400. The forbidden word database module 100 is used for constructing a forbidden word database in a tree structure mode. The front-end module 200 exchanges data with the forbidden word database module 100 and is used for acquiring external units to be detected. The character acquisition module 300 exchanges data with the front-end module 200 and is configured to split the unit to be detected according to the word order and to match the split single characters against the root directory and each level of nodes in the forbidden word database module 100, so as to determine the forbidden words in the unit to be detected. The processing module 400 exchanges data with the front-end module 200 and the character acquisition module 300 and is configured to delete or replace the forbidden words in the unit to be detected after the character acquisition module 300 determines them.
Preferably, the forbidden word database module 100 adopts a DFA (deterministic finite automaton) algorithm and includes a forbidden word database, which is loaded into a cached set when the information auditing system starts.
Specific example I: referring to Figs. 3-4, suppose the forbidden words "army male soldier", "army female soldier", "arms dealer" and "arms depot" all fall under the same root directory, the character rendered here as "army". The procedure is as follows:
1. The forbidden word database is built from forbidden word data entered in the back end and is stored in a relational database; new forbidden words can be added at any time when the data is updated.
2. When the project starts, the forbidden word database entered in the back end is loaded and placed in a cache; when the back end updates the forbidden word data, a message mechanism can be used to notify the front end to update.
3. The forbidden word database adopts the tree structure mode, and the algorithm used is the DFA algorithm:
301. Query the hashMap for "army". If it is not present, no sensitive word beginning with "army" exists, and the tree is constructed directly and the database updated. If it is found in the hashMap, a sensitive word beginning with "army" exists.
302. Set hashMap = hashMap.get("army"), then match "team", "man", "soldier" and so on in turn.
303. Judge whether the character at each level of nodes is the last character of a forbidden word in the database: if it is, the forbidden word search is finished and the flag bit flagEnd is set to 1; otherwise flagEnd is set to 0 and the comparison continues.
304. Return the detected forbidden words to the front-end module.
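Steps 301 to 304 can be sketched with nested dictionaries standing in for the hashMap and an integer `flagEnd` on each node. This is a hedged illustration in Python, not the Java-style code the step names suggest; the function names `build_dfa` and `find_forbidden` and the English sample words are assumptions.

```python
def build_dfa(words):
    """Nested maps keyed by character; each node carries flagEnd (1 = a word ends here)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {"flagEnd": 0})
        node["flagEnd"] = 1
    return root


def find_forbidden(text, root):
    """Scan the text; from each start position descend the map one character at a time."""
    found = []
    for start in range(len(text)):
        node = root
        for pos in range(start, len(text)):
            node = node.get(text[pos])    # same idea as hashMap = hashMap.get(character)
            if node is None:
                break                     # no word continues with this character
            if node["flagEnd"] == 1:
                found.append(text[start:pos + 1])
    return found


# Hypothetical English stand-ins for words that share a common first character.
dfa = build_dfa(["army man", "army woman", "arms"])
print(find_forbidden("the army man sells arms", dfa))
# → ['army man', 'arms']
```

At each start position the scan stops as soon as a character has no matching child, so words stored under other roots are never examined.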
Therefore, the method of the invention does not need to traverse the entire forbidden word set; it only visits the child nodes under the root directory that matches the first character of a candidate word, which greatly improves query performance.
The invention provides an electronic device for detecting forbidden words, comprising a storage medium and a processing unit. The storage medium stores a computer program and is preferably a storage device such as a portable hard disk, a hard disk or a USB flash drive. The processing unit, preferably a CPU, exchanges data with the storage medium and is configured to execute the computer program when detecting forbidden words, so as to perform the steps of the forbidden word query method described above.
The CPU can execute various appropriate actions and processes according to the program stored in the storage medium. The electronic device also includes peripherals: an input part such as a keyboard and a mouse, and an output part such as a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker. In particular, according to the disclosed embodiments of the invention, the process described in Fig. 4 may be implemented as a computer software program.
The present invention provides an embodiment comprising a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing a method as illustrated in the flowchart depicted in fig. 4. The computer program may be downloaded and installed from a network. The computer program, when executed by the CPU, performs the above-described functions defined in the system of the present invention.
The present invention provides a computer-readable storage medium having a computer program stored therein; the computer program, when run, performs the steps of the forbidden word query method described above.
In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The serial numbers used above are for description only and do not indicate the relative merits of the implementation scenarios.
The above discloses only a few specific implementation scenarios of the present invention; the invention is not limited to them, and any variation conceivable by those skilled in the art falls within its scope of protection.

Claims (10)

1. A forbidden word query method is characterized by comprising the following steps:
constructing a forbidden word database by using a tree structure mode;
loading the forbidden word database into a cache, and splitting the unit to be detected into single characters according to the word order;
matching the first split character against the root directories of the forbidden word database and, after the match succeeds, matching each level of nodes of the tree structure against the subsequent split characters of the unit to be detected one by one, in order, to determine the forbidden words in the unit to be detected;
and processing the detected forbidden words, and sending a notice to prohibit sending the forbidden words.
2. The forbidden word query method of claim 1, wherein constructing the forbidden word database by using the tree structure mode comprises:
dividing each forbidden word into single characters according to the word order;
and taking the first single character as a root directory and arranging the remaining single characters in word order to form a tree structure, thereby constructing the forbidden word database.
3. The forbidden word query method of claim 1, wherein constructing the forbidden word database by using the tree structure mode further comprises a step of updating the forbidden word database, which comprises:
dividing the forbidden word to be entered into single characters according to the word order;
matching the first split character against the root directories of the forbidden word database;
and, if that character does not exist as a root directory, establishing a tree structure with the character as its root directory.
4. The forbidden word query method of claim 3, wherein establishing the tree structure with the single character as the root directory comprises:
making the split single characters correspond, in word order, to the nodes at each level of the tree structure.
5. The forbidden word query method of claim 3, wherein the step of processing the detected forbidden words comprises:
deleting or replacing the forbidden words.
6. The forbidden word query method of claim 1, wherein:
a root directory and N-level nodes exist in the tree structure mode; the first single character after each forbidden word is split corresponds to the root directory; and the N single characters after the first single character correspond to the N-level nodes one by one, and N is a positive integer.
7. An information auditing system, comprising:
a forbidden word database module, used for constructing a forbidden word database in a tree structure mode;
a front-end module, which exchanges data with the forbidden word database module and is used for acquiring an external unit to be detected;
a character acquisition module, which exchanges data with the front-end module and is used for splitting the unit to be detected according to the word order and matching the split single characters against the root directory and each level of nodes in the forbidden word database module, so as to determine the forbidden words in the unit to be detected;
and a processing module, which exchanges data with the front-end module and the character acquisition module and is used for deleting or replacing the forbidden words in the unit to be detected after the character acquisition module determines them.
8. The information auditing system of claim 7, wherein:
the forbidden word database module adopts a DFA algorithm and comprises a forbidden word database;
and the forbidden word database is loaded into a cached set when the information auditing system starts.
9. An electronic device for detecting forbidden words, comprising:
a storage medium for storing a computer program;
a processing unit that exchanges data with the storage medium and is configured to execute the computer program when performing forbidden word detection, so as to perform the steps of the forbidden word query method according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that:
the computer-readable storage medium has a computer program stored therein;
the computer program, when run, performs the steps of the forbidden word query method according to any one of claims 1 to 6.
CN202110424388.8A 2021-04-20 2021-04-20 Forbidden word query method and device Pending CN113076390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110424388.8A CN113076390A (en) 2021-04-20 2021-04-20 Forbidden word query method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110424388.8A CN113076390A (en) 2021-04-20 2021-04-20 Forbidden word query method and device

Publications (1)

Publication Number Publication Date
CN113076390A true CN113076390A (en) 2021-07-06

Family

ID=76618365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110424388.8A Pending CN113076390A (en) 2021-04-20 2021-04-20 Forbidden word query method and device

Country Status (1)

Country Link
CN (1) CN113076390A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014000517A1 (en) * 2012-06-26 2014-01-03 北京奇虎科技有限公司 Recommendation system and method for input searching
CN110874398A (en) * 2020-01-14 2020-03-10 广东博智林机器人有限公司 Forbidden word processing method and device, electronic equipment and storage medium
CN111914057A (en) * 2020-06-01 2020-11-10 杭州城市大数据运营有限公司 Method and device for detecting and filtering sensitive words of customer service system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 801, Huanan City headquarters building, No.1, Huanan Avenue, Hehua community, Pinghu street, Longgang District, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Huanan City Digital Technology Co.,Ltd.

Address before: 801, Huanan City headquarters building, No.1, Huanan Avenue, Hehua community, Pinghu street, Longgang District, Shenzhen, Guangdong 518000

Applicant before: Shenzhen South China City Network Technology Co.,Ltd.