CN117057344A - Sensitive word detection method, system, storage medium and electronic equipment - Google Patents

Publication number
CN117057344A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311009383.4A
Other languages
Chinese (zh)
Inventor
江磊
金鹏佳
陈彪
张璐
陶明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Renyimen Technology Co ltd
Original Assignee
Shanghai Renyimen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Renyimen Technology Co ltd
Priority to CN202311009383.4A
Publication of CN117057344A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

The application provides a sensitive word detection method, comprising: receiving a sensitive word matching task and determining the text data corresponding to the task; performing sensitive word matching on the text data in memory by using prefix trees containing the vocabulary of each scene, and determining the hit sensitive words; and generating a matching result for the sensitive word matching task according to the hit sensitive words. During detection, the text data is matched against prefix trees containing the vocabulary of each scene, and constructing multiple prefix trees allows the sensitive words to be stored in segments, which raises the number of sensitive words that can be configured and supports matching against millions of sensitive words. In addition, matching in memory significantly improves matching efficiency, the prefix trees can be loaded directly into memory, and system jitter during matching is reduced. The application also provides a sensitive word detection system, a storage medium and an electronic device, which have the same beneficial effects.

Description

Sensitive word detection method, system, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, a system, a storage medium, and an electronic device for detecting a sensitive word.
Background
Sensitive word matching quickly locates sensitive words in text, which makes it an effective tool for content reviewers fighting spam text. However, a variety of newly coined words and homophones are now used to bypass sensitive word detection. For example, some users write words such as " ticket" or "drift" to circumvent the sensitive word "lottery"; " ticket" is not only a homophone but also polyphonic, so conventional pattern matching algorithms can hardly guarantee that every such variant is hit.
Disclosure of Invention
The application aims to provide a sensitive word detection method, a system, a storage medium and electronic equipment, which can improve the detection precision of sensitive words.
In order to solve the technical problems, the application provides a sensitive word detection method, which comprises the following specific technical scheme:
receiving a sensitive word matching task, and determining text data corresponding to the sensitive word matching task;
performing sensitive word matching on the text data in a memory by utilizing a prefix tree containing each scene word list, and determining hit sensitive words;
and generating a matching result corresponding to the matching task of the sensitive word according to the hit sensitive word.
Optionally, performing the sensitive word matching on the text data in the memory by using a prefix tree including the vocabulary of each scene includes:
loading the vocabulary of each scene into the memory by using full refresh or real-time synchronous increment;
matching the text data by using a matching tree corresponding to the vocabulary, and determining hit sensitive word primitive words;
and carrying out combination word judgment, word validity filtering and white list filtering on the sensitive word original words to determine a hit word list.
Optionally, before performing the sensitive word matching on the text data by using the prefix tree including the vocabulary of each scene in the memory, the method further includes:
and carrying out character preprocessing on the words to be detected in the text data.
Optionally, performing character preprocessing on the word to be detected in the text data includes:
and executing any one or a combination of any several of case conversion, complex-form conversion and special character conversion on the word to be detected in the text data.
Optionally, matching the text data by using a matching tree corresponding to the vocabulary, and determining the hit sensitive word primitive word includes:
and respectively matching the words to be detected by using a common sensitive word tree, a pinyin tree, a special-shaped word tree and a white list tree, and determining the primary words of the hit sensitive words.
Optionally, the method further comprises:
when the target sensitive word is newly added or modified, reconstructing a matching tree corresponding to the target sensitive word by using incremental loading.
Optionally, the method further comprises:
maintaining the word list by utilizing a prefix tree according to scenes, and storing the prefix tree in a database; wherein each scene contains at least one vocabulary, and each vocabulary is used for maintaining sensitive words and attributes of the sensitive words.
The application also provides a sensitive word detection system, which comprises:
the task receiving module is used for receiving a sensitive word matching task and determining text data corresponding to the sensitive word matching task;
the sensitive word matching module is used for carrying out sensitive word matching on the text data by utilizing a prefix tree containing each scene word list in the memory, and determining hit sensitive words;
and the matching result generation module is used for generating a matching result corresponding to the matching task of the sensitive word according to the hit sensitive word.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method as described above.
The application also provides an electronic device comprising a memory in which a computer program is stored and a processor which when calling the computer program in the memory implements the steps of the method as described above.
The application provides a sensitive word detection method, which comprises the following steps: receiving a sensitive word matching task, and determining text data corresponding to the sensitive word matching task; performing sensitive word matching on the text data in a memory by utilizing a prefix tree containing each scene word list, and determining hit sensitive words; and generating a matching result corresponding to the matching task of the sensitive word according to the hit sensitive word.
During sensitive word detection, the text data is matched against prefix trees containing the vocabulary of each scene, and constructing multiple prefix trees allows the sensitive words to be stored in segments, which raises the number of sensitive words that can be configured and supports matching against millions of sensitive words. In addition, matching in memory significantly improves matching efficiency; the prefix trees lend themselves to both full loading and incremental loading in memory, which ensures that they can be loaded directly into memory and reduces system jitter during matching.
The application also provides a sensitive word detection system, a storage medium and electronic equipment, which have the beneficial effects and are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for detecting a sensitive word according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for detecting sensitive words according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a sensitive word detection system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of a method for detecting a sensitive word according to an embodiment of the present application, where the method includes:
S101: receiving a sensitive word matching task, and determining text data corresponding to the sensitive word matching task;
S102: performing sensitive word matching on the text data in a memory by utilizing a prefix tree containing each scene word list, and determining hit sensitive words;
S103: and generating a matching result corresponding to the matching task of the sensitive word according to the hit sensitive word.
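Steps S101 to S103 can be illustrated with a minimal sketch. It assumes a character-level prefix tree and a task represented as a dictionary with `id` and `text` keys; the names `TrieNode`, `PrefixTree` and `handle_task` are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    is_word: bool = False


class PrefixTree:
    """Minimal in-memory prefix tree (trie) over characters."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def find_all(self, text):
        """Return (start, word) for every listed word that occurs in text."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break
                if node.is_word:
                    hits.append((start, text[start:end + 1]))
        return hits


def handle_task(task, tree):
    # S101: determine the text data of the task (hypothetical task shape).
    text = task["text"]
    # S102: match the text against the prefix tree held in memory.
    hits = tree.find_all(text)
    # S103: generate the matching result from the hit words.
    words = sorted({word for _, word in hits})
    return {"task_id": task["id"], "hit": bool(hits), "words": words}
```

For example, `handle_task({"id": 7, "text": "buy lottery now"}, PrefixTree(["lottery", "lot"]))` reports both listed words, because the scan keeps walking the tree after the shorter word has already matched.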
The embodiment of the application does not limit how to receive the sensitive word matching task, and the text data corresponding to the sensitive word matching task can contain single sensitive words or batch sensitive words.
The prefix trees are then used to match sensitive words against the text data in memory and determine the hit sensitive words. Prefix trees containing the vocabularies can be constructed in advance. The number of sensitive words held by each prefix tree can be set when the trees are constructed, so that multiple prefix trees are built. When a sensitive word is newly added, it is judged whether the last prefix tree has reached the set number of sensitive words. If not, the sensitive word can be added to that tree incrementally. If so, a new prefix tree can be created. The constructed prefix trees may be stored in a database such as MySQL. Prefix trees may also be built according to the classification of the sensitive words, for example according to sensitive word type. Those skilled in the art may also adopt other prefix tree construction methods, which are not limited here.
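The capacity rule described above (append to the last prefix tree until it is full, then start a new one) can be sketched as follows. This is an illustration under stated assumptions: the class name `ShardedTries`, the dictionary-based tree layout and the `END` marker are hypothetical, and persistence to a database is left out.

```python
END = "$end"  # hypothetical marker key; cannot collide with a single character


class ShardedTries:
    """Several dict-based prefix trees, each holding at most `capacity` words."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.shards = []  # one nested dict per prefix tree
        self.counts = []  # number of words stored in each tree

    @staticmethod
    def _lookup(shard, word):
        node = shard
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return END in node

    def contains(self, word):
        return any(self._lookup(shard, word) for shard in self.shards)

    def add(self, word):
        if self.contains(word):
            return
        # Create a new tree only when the last one has reached its capacity.
        if not self.shards or self.counts[-1] >= self.capacity:
            self.shards.append({})
            self.counts.append(0)
        node = self.shards[-1]
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
        self.counts[-1] += 1

    def match(self, text):
        """Return the set of stored words that occur anywhere in text."""
        found = set()
        for shard in self.shards:
            for start in range(len(text)):
                node = shard
                for end in range(start, len(text)):
                    node = node.get(text[end])
                    if node is None:
                        break
                    if END in node:
                        found.add(text[start:end + 1])
        return found
```

With `capacity=2`, adding three distinct words yields two trees, and `match` scans every tree so that segmentation does not change which words are hit.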
Matching sensitive words in memory significantly reduces matching time and improves matching efficiency. It should be noted that the same word can have different sensitivity in different application scenes, that is, it may be a sensitive word in one scene but not in others. Therefore, the application scene of the sensitive word matching task can be determined first, and the prefix tree suited to that scene is then selected to detect sensitive words.
In a possible embodiment, performing sensitive word matching on the text data by using a prefix tree containing each scene vocabulary in the memory may include the following steps:
first, loading the vocabulary of each scene into the memory by full refresh or real-time incremental synchronization;
second, matching the text data by using the matching tree corresponding to the vocabulary, and determining the hit sensitive word original words;
third, performing combined word judgment, word validity filtering and whitelist filtering on the sensitive word original words to determine the hit word list.
First, the vocabularies are maintained per scene by using prefix trees, and the prefix trees are stored in a database. Each scene contains at least one vocabulary, and each vocabulary maintains sensitive words and their attributes; that is, multiple vocabularies may be added to each scene. The attributes of a sensitive word may include the sensitive word type, the sensitive word list, the validity period, the matching manner, and the like. The validity period means that the word is sensitive only during a specific period or during the period corresponding to a specific event.
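The scene, vocabulary and attribute structure described above might be modelled as in the sketch below. The field names, the example scenes and the example words are all hypothetical; the application only states which kinds of attributes may exist.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SensitiveWord:
    text: str
    word_type: str                       # hypothetical type label, e.g. "spam"
    match_mode: str = "exact"            # hypothetical matching manner
    valid_from: Optional[datetime] = None
    valid_until: Optional[datetime] = None

    def is_active(self, now):
        """A word without a validity period is always active."""
        if self.valid_from is not None and now < self.valid_from:
            return False
        if self.valid_until is not None and now > self.valid_until:
            return False
        return True


# scene -> vocabulary name -> sensitive words (illustrative data only)
SCENES = {
    "comments": {
        "spam": [SensitiveWord("lottery", "spam")],
        "events": [
            SensitiveWord(
                "flash-sale", "event",
                valid_from=datetime(2024, 1, 1),
                valid_until=datetime(2024, 3, 1),
            )
        ],
    },
    "profile": {"names": [SensitiveWord("admin", "impersonation")]},
}


def active_words(scenes, scene, now):
    """Collect the words of one scene that are within their validity period."""
    vocabularies = scenes.get(scene, {})
    return sorted(
        word.text
        for vocabulary in vocabularies.values()
        for word in vocabulary
        if word.is_active(now)
    )
```

Selecting the words by scene first, and by validity period second, mirrors the idea that the same word may be sensitive in one scene or period and harmless in another.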
The text data is then matched by using the matching tree to determine the hit sensitive word original words. The matching tree refers to a prefix tree used to perform sensitive word matching. A sensitive word original word is the underlying original word behind a disguised variant, where the variant is not written as the sensitive word itself but carries the same sensitive meaning. For example, when a character in a sensitive word is changed, one or more characters are replaced by pinyin or by similar-looking characters, or symbols are inserted into the word, the word from which the variant derives is the sensitive word original word.
Then, combined word judgment, word validity filtering and whitelist filtering are performed on the sensitive word original words to determine the hit sensitive words. It should be noted that a detected sensitive word original word can be judged together with its context, and it can be checked whether the current time falls within the period in which the word is sensitive. Meanwhile, whitelist filtering is applied to avoid false hits, for example when a sensitive word original word is hit only because of a word segmentation error, as in "a sea traffic police".
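Two of the filters named above can be sketched under explicit assumptions. The sketch assumes that whitelist filtering drops a hit whose span lies entirely inside a whitelisted phrase, and that a "combined word" is an entry that only counts when all of its parts are hit in the same text; both readings are interpretations, not definitions from the application. A plain string scan stands in for the tree match, and all names and example words are hypothetical.

```python
def find_spans(text, words):
    """Return (start, end, word) for each occurrence of each word in text."""
    spans = []
    for word in words:
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word), word))
            start = text.find(word, start + 1)
    return spans


def filter_hits(text, sensitive, whitelist, combos):
    """Apply whitelist filtering, then combined word judgment.

    sensitive: set of words to look for (including the parts of combinations)
    whitelist: set of harmless phrases that may contain a sensitive word
    combos:    list of tuples; a combination hits only if all parts are hit
    """
    white_spans = find_spans(text, whitelist)
    kept = set()
    for start, end, word in find_spans(text, sensitive):
        # Drop a hit that is fully covered by a whitelisted phrase.
        covered = any(ws <= start and end <= we for ws, we, _ in white_spans)
        if not covered:
            kept.add(word)

    result = set()
    for word in kept:
        # Parts of a combination never count on their own.
        if not any(word in combo for combo in combos):
            result.add(word)
    for combo in combos:
        if set(combo) <= kept:
            result.add("+".join(combo))
    return sorted(result)
```

In this sketch the text "the alphabet says bet" yields one hit for "bet", since the occurrence inside the whitelisted "alphabet" is discarded while the standalone occurrence is kept.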
When matching the text data with the matching tree corresponding to the vocabulary and determining the hit sensitive word, the common sensitive word tree, the pinyin tree, the special word tree and the white list tree can be utilized to match the word to be detected respectively so as to determine the hit sensitive word.
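Matching against the four kinds of trees can be sketched as follows. Each tree stores a lookup key that maps back to the original word, so a pinyin or look-alike hit is reported as the original word. The pinyin and look-alike tables are toy data chosen for illustration and are not taken from the application; a real system would need complete conversion data, and all function names are hypothetical.

```python
END = "$end"  # marker key; cannot collide with a single character


def build_tree(entries):
    """Build a dict-based prefix tree from {lookup key: original word}."""
    root = {}
    for key, original in entries.items():
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = original
    return root


def scan(tree, text):
    """Return the original words whose keys occur anywhere in text."""
    hits = set()
    for start in range(len(text)):
        node = tree
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break
            if END in node:
                hits.add(node[END])
    return hits


# Toy conversion tables (assumptions for illustration only).
PINYIN = {"彩": "cai", "票": "piao", "菜": "cai", "漂": "piao"}
LOOKALIKE = {"0": "o", "1": "l", "3": "e"}


def to_pinyin(text):
    return "".join(PINYIN.get(ch, ch) for ch in text)


def fold_lookalikes(text):
    return "".join(LOOKALIKE.get(ch, ch) for ch in text)


def match_all(text, common, pinyin, variant, whitelist):
    """Match one text against the common, pinyin, variant and whitelist trees."""
    hits = (
        scan(common, text)
        | scan(pinyin, to_pinyin(text))
        | scan(variant, fold_lookalikes(text))
    )
    return {"hits": hits, "whitelisted": scan(whitelist, text)}
```

The whitelist result is returned separately so that a later filtering stage can decide which hits to discard. Note that toneless pinyin matching is deliberately coarse and can over-match, which is one reason such a filtering stage is useful.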
In the embodiment of the present application, during sensitive word detection the text data is matched against prefix trees containing the vocabulary of each scene, and constructing multiple prefix trees allows the sensitive words to be stored in segments, which raises the number of sensitive words that can be configured and supports matching against millions of sensitive words. In addition, matching in memory significantly improves matching efficiency; the prefix trees lend themselves to both full loading and incremental loading in memory, which ensures that they can be loaded directly into memory and reduces system jitter during matching.
In a possible embodiment, before performing sensitive word matching on the text data by using a prefix tree containing the word list of each scene in the memory, character preprocessing can also be performed on the word to be detected in the text data. Here, how to perform character preprocessing is not limited, and any one or a combination of any several of case conversion, complex-form conversion, and special character conversion may be performed on the word to be detected in the text data.
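A minimal preprocessing sketch is given below. It reads "complex-form conversion" as traditional-to-simplified Chinese conversion, which is an interpretation of the translated text, and it uses small hypothetical tables where a real system would need complete mappings.

```python
import unicodedata

# Toy tables (assumptions for illustration only).
TRAD_TO_SIMP = {"詞": "词", "檢": "检"}
SPECIAL = {"@": "a", "*": ""}  # map or drop characters used as disguise


def preprocess(text):
    """Normalize text before it is matched against the prefix trees."""
    # Special character conversion: fold full-width forms to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Case conversion.
    text = text.lower()
    # Traditional-to-simplified conversion via the toy table.
    text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)
    # Further special character conversion via the toy table.
    text = "".join(SPECIAL.get(ch, ch) for ch in text)
    return text
```

Normalizing the text once up front keeps the trees small, since each disguise that is folded away here does not need its own entry in a variant vocabulary.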
In other embodiments, if a target sensitive word needs to be added or modified while sensitive word matching is in service, the matching tree corresponding to the target sensitive word is updated by incremental loading. In this case the matching tree does not need to be completely rebuilt; incremental loading alone is sufficient.
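Incremental loading can be illustrated by applying individual changes to a live tree instead of rebuilding it. The class `LiveTree`, the change tuples and the pruning strategy are assumptions made for this sketch.

```python
END = "$end"  # marker key; cannot collide with a single character


class LiveTree:
    """Dict-based prefix tree that accepts incremental changes."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return END in node

    def remove(self, word):
        path = [self.root]
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
            path.append(node)
        if END not in node:
            return False
        del node[END]
        # Prune nodes that became empty, walking back towards the root.
        for i in range(len(word), 0, -1):
            if path[i]:
                break
            del path[i - 1][word[i - 1]]
        return True

    def apply(self, change):
        """Apply ("add", w), ("delete", w) or ("modify", old, new)."""
        op = change[0]
        if op == "add":
            self.add(change[1])
        elif op == "delete":
            self.remove(change[1])
        elif op == "modify":
            self.remove(change[1])
            self.add(change[2])
        else:
            raise ValueError(f"unknown operation: {op}")
```

Each change touches only the nodes along one word, so its cost depends on the word length rather than on the size of the vocabulary.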
In addition, in other embodiments, after the hit sensitive words are determined, statistics can be collected on them, that is, the number of hits of each hit sensitive word is counted, so that a real-time or offline sensitive word report can be generated and a sensitive word that is hit in large numbers can be identified promptly. If a sensitive word that is hit many times turns out to be abnormal, its configuration can be recalled promptly, for example by deleting the sensitive word from the corresponding prefix tree.
Referring now to fig. 2, fig. 2 is a flowchart of another method for detecting a sensitive word according to an embodiment of the present application, where the method includes:
S201: receiving a sensitive word matching task, and determining text data corresponding to the sensitive word matching task;
S202: performing sensitive word matching on the text data in a memory by utilizing a prefix tree containing each scene word list, and determining hit sensitive words;
S203: generating a matching result corresponding to the matching task of the sensitive word according to the hit sensitive word;
S204: counting the hit sensitive words, and generating a sensitive word real-time report or an offline report; if the abnormal sensitive word hit for many times is confirmed, the abnormal sensitive word is recalled.
Compared with the previous embodiment, the embodiment of the application can further count the hit times of the hit sensitive words after confirming the hit sensitive words, avoid a large number of false positives caused by the setting errors of the sensitive words, and further improve the detection precision of the sensitive words.
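The counting and recall step can be sketched with a simple counter. The threshold only flags candidates; whether a heavily hit word is actually abnormal is assumed to be decided by a reviewer. The names `HitStats` and `recall` are hypothetical.

```python
from collections import Counter


class HitStats:
    """Counts hits per sensitive word and flags heavily hit words."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, hit_words):
        """Add the hit words of one matching result to the statistics."""
        self.counts.update(hit_words)

    def report(self, top=10):
        """Return the most frequently hit words with their hit counts."""
        return self.counts.most_common(top)

    def suspects(self):
        """Words hit at least `threshold` times, as candidates for review."""
        return sorted(w for w, n in self.counts.items() if n >= self.threshold)


def recall(active_words, word):
    """Recall a word that was confirmed as abnormal (here: drop it from a set)."""
    active_words.discard(word)
```

In a full system the recall would also remove the word from the corresponding prefix tree; the set used here only stands in for that store.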
When sensitive word matching is not being performed, the prefix trees may be stored in a database to form a word stock. The word stock supports exact and fuzzy queries of sensitive words, as well as queries by the sub-classification, scene, sensitivity and the like obtained by classifying the sensitive words. Sensitive word matching strategies can also be set in the word stock.
The test evaluation procedure in the word stock above is explained below:
For a single sensitive word, the sensitive word to be evaluated is input and the online hit data period and online hit scene are specified; the online trigger data can then be displayed, and a sensitive repair operation is performed on the sensitive word to be evaluated.
For batch sensitive words, a word list to be evaluated is input and the online hit data period and online hit scene are specified; the online trigger data is displayed, and the entries of the word list that have hits are sorted in descending order of hit count. A sensitive repair operation is then performed on individual words in the word list to be evaluated.
In addition, the word stock can support sensitive word importation, and can be imported one by one or imported in batches according to word classification. When the sensitive words are imported, the importing time is recorded, the batch importing is checked, and the operator and the operating time are recorded.
The word stock also supports sensitive word deletion, and for deleted sensitive words, approval can be performed. If the sensitive words are deleted in batches, parallel approval can be realized, so that the efficiency of deleting the sensitive words is improved.
The word stock can also manage sensitive word variants: the variant forms of a sensitive word can be set, selected and displayed. The variant forms include, but are not limited to, pinyin, similar-looking characters, and replacing Chinese characters with special symbols.
The word stock itself may contain version information so that word stock versions can be rolled back. The word stock can also export sensitive words; the export can be filtered by the attributes of the sensitive words, for example by the scene a sensitive word belongs to. Filtering can also be performed by the sequence number of the sensitive word, which may be the sequence number of its prefix tree.
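Version rollback of the word stock might look like the sketch below, which keeps every version as an immutable snapshot. The class `VersionedLexicon` and the snapshot approach are assumptions; the application does not describe how versions are stored.

```python
class VersionedLexicon:
    """Keeps each version of the word set as an immutable snapshot."""

    def __init__(self):
        self._versions = [frozenset()]  # version 0 is the empty word stock

    @property
    def words(self):
        return self._versions[-1]

    @property
    def version(self):
        return len(self._versions) - 1

    def commit(self, added=(), removed=()):
        """Create a new version from the current one and return its number."""
        snapshot = (self.words | set(added)) - set(removed)
        self._versions.append(snapshot)
        return self.version

    def rollback(self, version):
        """Restore an earlier version by appending it as the newest version."""
        if not 0 <= version < len(self._versions):
            raise ValueError(f"unknown version: {version}")
        self._versions.append(self._versions[version])
        return self.version
```

Rolling back by appending a copy, rather than truncating the list, keeps the full history so that a rollback can itself be undone.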
The word stock may also contain the policy configuration of the sensitive words, with a policy version. A policy can be used to select the sensitive words for a specific scene; for example, for ordinary posts, a sensitive word detection policy for posts can be set. When a policy is configured, sensitive word classification screening, scene classification screening, the effective duration of the sensitive words and the specific operation of the policy can be set. After configuration, a grayscale test can be run on the policy; if the test passes, the policy takes effect, and afterwards the configured policy is applied directly when sensitive word detection is performed in that scene.
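A policy with a grayscale phase can be sketched as follows. The sketch assumes that the grayscale test exposes the policy to a fixed percentage of requests, chosen by a stable hash of a request identifier; this mechanism and all field names are assumptions for illustration, not details from the application.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    version: int
    scenes: frozenset      # scenes the policy applies to
    word_types: frozenset  # sensitive word classifications it selects
    action: str            # hypothetical operation, e.g. "block" or "review"
    gray_percent: int = 0  # share of requests covered during the grayscale test
    active: bool = False   # set once the grayscale test has passed

    def applies(self, scene, word_type, request_id):
        """Decide whether the policy applies to one request."""
        if scene not in self.scenes or word_type not in self.word_types:
            return False
        if self.active:
            return True
        # Stable bucketing: the same request id always lands in the same bucket.
        bucket = zlib.crc32(request_id.encode("utf-8")) % 100
        return bucket < self.gray_percent
```

Using a stable hash rather than a random draw means a given request identifier is treated consistently for the whole duration of the test.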
It is readily understood that the policy configuration in the word stock may also contain version information in order to perform updating or rollback of policies.
In addition, the word stock can perform statistical analysis of sensitive words, such as the current share of the word stock under each category and the online hit rate, or compare the trigger fluctuations of the sensitive words to obtain trigger fluctuation comparison curves. Statistics on sensitive word detection accuracy can also be collected. An alarm mechanism is triggered when an anomaly is detected during the statistics.
The following describes the sensitive word detection system provided in the embodiment of the present application, and the sensitive word detection system described below and the sensitive word detection method described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a sensitive word detection system provided by an embodiment of the present application, and the present application further provides a sensitive word detection system, including:
the task receiving module is used for receiving a sensitive word matching task and determining text data corresponding to the sensitive word matching task;
the sensitive word matching module is used for carrying out sensitive word matching on the text data by utilizing a prefix tree containing each scene word list in the memory, and determining hit sensitive words;
and the matching result generation module is used for generating a matching result corresponding to the matching task of the sensitive word according to the hit sensitive word.
Based on the above embodiment, as a preferred embodiment, the sensitive word matching module includes:
the vocabulary loading unit is used for loading the vocabulary of each scene into the memory by utilizing full refreshing or real-time synchronous increment;
the sensitive word matching unit is used for matching the text data by utilizing a matching tree corresponding to the vocabulary, and determining hit sensitive word original words;
and the sensitive word filtering unit is used for carrying out combination word judgment, word validity filtering and white list filtering on the sensitive word original words to determine a hit word list.
Based on the above embodiment, as a preferred embodiment, further comprising:
and the preprocessing module is used for preprocessing characters of words to be detected in the text data.
Based on the above embodiments, as a preferred embodiment, the preprocessing module is configured to perform any one or a combination of any several of case conversion, complex-form conversion, and special character conversion on a word to be detected in the text data.
Based on the above embodiment, as a preferred embodiment, the sensitive word matching unit is a unit for respectively matching the to-be-detected words by using a common sensitive word tree, a pinyin tree, a special-shaped word tree and a whitelist tree, and determining the hit sensitive word original words.
Based on the above embodiment, as a preferred embodiment, further comprising:
and the increment loading module is used for reconstructing a matching tree corresponding to the target sensitive word by utilizing increment loading when the target sensitive word is newly added or modified.
Based on the above embodiment, as a preferred embodiment, further comprising:
the word stock generating module is used for maintaining the word list by utilizing a prefix tree according to scenes and storing the prefix tree in a database; wherein each scene contains at least one vocabulary, and each vocabulary is used for maintaining sensitive words and attributes of the sensitive words.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the steps provided by the above-described embodiments. The storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The present application also provides an electronic device. Referring to FIG. 4, which is a block diagram of an electronic device provided in an embodiment of the present application, the electronic device may include a processor 1410 and a memory 1420.
The processor 1410 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1410 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 1410 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1410 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1410 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 1420 may include one or more computer-readable storage media, which may be non-transitory. The memory 1420 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 1420 is used at least to store a computer program 1421 which, after being loaded and executed by the processor 1410, implements the relevant steps of the sensitive word detection method performed on the electronic device side as disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 1420 may include an operating system 1422, data 1423 and the like, and the storage may be transient or permanent. The operating system 1422 may include Windows, Linux, Android and the like.
In some embodiments, the electronic device may further include a display 1430, an input-output interface 1440, a communication interface 1450, a sensor 1460, a power supply 1470, and a communication bus 1480.
Of course, the structure of the electronic device shown in fig. 4 is not limited to the electronic device in the embodiment of the present application, and the electronic device may include more or fewer components than those shown in fig. 4 or may combine some components in practical applications.
In the description, the embodiments are described in a progressive manner, each focusing on its differences from the others, so that for parts that are the same or similar the embodiments may be referred to one another. Because the system provided by the embodiments corresponds to the method provided by the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.
It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A sensitive word detection method, comprising:
receiving a sensitive word matching task, and determining text data corresponding to the sensitive word matching task;
performing, in a memory, sensitive word matching on the text data by using a prefix tree containing the vocabulary of each scene, and determining hit sensitive words;
and generating a matching result corresponding to the sensitive word matching task according to the hit sensitive words.
2. The sensitive word detection method according to claim 1, wherein performing, in the memory, the sensitive word matching on the text data by using the prefix tree containing the vocabulary of each scene comprises:
loading the vocabulary of each scene into the memory by full refresh or by real-time incremental synchronization;
matching the text data by using a matching tree corresponding to the vocabulary, and determining the original words of the hit sensitive words;
and performing combined-word judgment, word validity filtering and whitelist filtering on the original words of the hit sensitive words to determine a hit word list.
3. The sensitive word detection method according to claim 2, wherein before performing, in the memory, the sensitive word matching on the text data by using the prefix tree containing the vocabulary of each scene, the method further comprises:
performing character preprocessing on the words to be detected in the text data.
4. The sensitive word detection method according to claim 3, wherein performing character preprocessing on the words to be detected in the text data comprises:
performing any one or any combination of case conversion, traditional-simplified character conversion and special character conversion on the words to be detected in the text data.
5. The sensitive word detection method according to claim 4, wherein matching the text data by using the matching tree corresponding to the vocabulary, and determining the original words of the hit sensitive words comprises:
matching the words to be detected by using a common sensitive word tree, a pinyin tree, a variant-character tree and a whitelist tree respectively, and determining the original words of the hit sensitive words.
6. The sensitive word detection method according to claim 2, further comprising:
when a target sensitive word is added or modified, rebuilding the matching tree corresponding to the target sensitive word by incremental loading.
7. The sensitive word detection method according to claim 2 or 6, further comprising:
maintaining the vocabularies by scene using prefix trees, and storing the prefix trees in a database; wherein each scene contains at least one vocabulary, and each vocabulary is used to maintain sensitive words and attributes of the sensitive words.
8. A sensitive word detection system, comprising:
a task receiving module, configured to receive a sensitive word matching task and determine text data corresponding to the sensitive word matching task;
a sensitive word matching module, configured to perform, in a memory, sensitive word matching on the text data by using a prefix tree containing the vocabulary of each scene, and determine hit sensitive words;
and a matching result generation module, configured to generate a matching result corresponding to the sensitive word matching task according to the hit sensitive words.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the sensitive word detection method according to any one of claims 1-7.
10. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when calling the computer program in the memory, implements the steps of the sensitive word detection method according to any one of claims 1-7.
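
As an illustration only, the in-memory matching recited in claims 1, 2 and 4 might be sketched in Python as follows. The names PrefixTree, preprocess and detect, the sample vocabularies, and the rule that a hit is discarded when it lies entirely inside a whitelist hit are all assumptions made for this sketch and are not taken from the application. Character preprocessing is reduced to lower-casing, and combined-word judgment, the pinyin tree and the variant-character tree are omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    word: Optional[str] = None  # set when a vocabulary entry ends at this node


class PrefixTree:
    """Minimal in-memory prefix tree over characters (hypothetical helper)."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.add(w)

    def add(self, word):
        # Incremental insert: only the path for this one word is touched.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def find_all(self, text):
        """Return (start, end, word) for every vocabulary entry found in text."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break
                if node.word is not None:
                    hits.append((start, end + 1, node.word))
        return hits


def preprocess(text):
    # Assumption: only case conversion is shown here.
    return text.lower()


def detect(text, sensitive_tree, whitelist_tree):
    """Match text against a sensitive tree, then apply a whitelist filter."""
    normalized = preprocess(text)
    allowed = whitelist_tree.find_all(normalized)
    hits = []
    for start, end, word in sensitive_tree.find_all(normalized):
        # Assumed whitelist rule: drop a hit fully covered by a whitelist hit.
        covered = any(a <= start and end <= b for a, b, _ in allowed)
        if not covered:
            hits.append(word)
    return hits


# Hypothetical scene vocabularies; entries are stored in lower case.
scene_trees = {"comment": PrefixTree(["scam", "bad"])}
whitelist = PrefixTree(["badminton"])
print(detect("A SCAM about badminton", scene_trees["comment"], whitelist))  # ['scam']
```

In this sketch a word added later through add() becomes matchable without rebuilding the whole tree, which is one possible reading of the incremental loading in claim 6; a production system would also need the attribute handling and database persistence described in claim 7.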
CN202311009383.4A 2023-08-10 2023-08-10 Sensitive word detection method, system, storage medium and electronic equipment Pending CN117057344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311009383.4A CN117057344A (en) 2023-08-10 2023-08-10 Sensitive word detection method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311009383.4A CN117057344A (en) 2023-08-10 2023-08-10 Sensitive word detection method, system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117057344A true CN117057344A (en) 2023-11-14

Family

ID=88665649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311009383.4A Pending CN117057344A (en) 2023-08-10 2023-08-10 Sensitive word detection method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117057344A (en)

Similar Documents

Publication Publication Date Title
CN104679646B (en) A kind of method and apparatus for detecting SQL code defect
CN107704512A (en) Financial product based on social data recommends method, electronic installation and medium
CN113965389B (en) Network security management method, device and medium based on firewall log
CN113449753B (en) Service risk prediction method, device and system
CN113538154A (en) Risk object identification method and device, storage medium and electronic equipment
CN115269981A (en) Abnormal behavior analysis method and system combined with artificial intelligence
CN110532773B (en) Malicious access behavior identification method, data processing method, device and equipment
CN105808602B (en) Method and device for detecting junk information
CN111861733B (en) Fraud prevention and control system and method based on address fuzzy matching
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN112613176A (en) Slow SQL statement prediction method and system
CN102467537A (en) Method and device for deleting vocabulary
CN117057344A (en) Sensitive word detection method, system, storage medium and electronic equipment
CN115599973A (en) User crowd label screening method, system, equipment and storage medium
CN115168509A (en) Processing method and device of wind control data, storage medium and computer equipment
CN110807082A (en) Quality spot check item determination method, system, electronic device and readable storage medium
CN112149743A (en) Access control method, device, equipment and medium
CN111144088A (en) Corpus management method, corpus management device and electronic equipment
CN117688564B (en) Detection method, device and storage medium for intelligent contract event log
CN113837802B (en) Secondhand mobile phone price prediction method integrating time sequence process and mobile phone defect feature depth
CN111881983B (en) Data processing method and device based on classification model, electronic equipment and medium
CN117077668A (en) Risk image display method, apparatus, computer device, and readable storage medium
CN112926301A (en) Sensitive word monitoring method and device based on sensitive word bank construction
CN114997159A (en) Text extraction method, device, server and computer readable storage medium
CN116126367A (en) Model updating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication