WO2021139268A1 - Sensitive word detection method, apparatus, computer device, and storage medium

Sensitive word detection method, apparatus, computer device, and storage medium

Info

Publication number
WO2021139268A1
Authority
WO
WIPO (PCT)
Prior art keywords
sensitive
word
text
sensitive word
homophonic
Prior art date
Application number
PCT/CN2020/118862
Other languages
English (en)
French (fr)
Inventor
程华东
李剑锋
汪伟
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139268A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Definitions

  • This application relates to the technical field of sensitive word filtering, and in particular to a sensitive word detection method, device, computer equipment and storage medium.
  • Sensitive word filtering refers to using artificial intelligence technology to accurately and efficiently identify non-compliant content in various scenarios, such as political, pornographic, abusive, prohibited, and spam advertising content, so as to guard against content risks in advance and improve user experience.
  • Currently, commonly used sensitive word filtering algorithms include finite automaton matching based on a sensitive word database, and classification and sequence labeling based on machine learning models.
  • The inventor realized that the existing sensitive word filtering methods above share a shortcoming: they can only identify the sensitive words themselves and cannot filter out variants of sensitive words, such as homophones and words with redundant insertions, which results in low accuracy of sensitive word recognition.
  • the embodiments of the present application provide a sensitive word detection method, device, computer equipment, and storage medium, aiming to solve the problem of low accuracy of the existing sensitive word filtering method for sensitive word recognition.
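  • In a first aspect, an embodiment of the present application provides a method for detecting sensitive words, which includes:
  • obtaining a sensitive word database from a preset sensitive word server;
  • constructing a homophone word database corresponding to the sensitive word database;
  • constructing a sensitive word indexer and a homophone word indexer according to the sensitive word database and the homophone word database, respectively;
  • if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
  • removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
  • filtering the text to be tested through the homophone word indexer to obtain a third sensitive word set;
  • filtering the de-redundant text through the homophone word indexer to obtain a fourth sensitive word set;
  • de-duplicating and merging the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.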
  • an embodiment of the present application also provides a sensitive word detection device, which includes:
  • the first obtaining unit is used to obtain a sensitive word database from a preset sensitive word server;
  • the first construction unit is used to construct a homophone word database corresponding to the sensitive word database;
  • the second construction unit is configured to construct a sensitive word indexer and a homophone word indexer respectively according to the sensitive word database and the homophone word database;
  • the first filtering unit is configured to, if the text to be tested is received, filter the text to be tested through the sensitive word indexer to obtain a first set of sensitive words;
  • a second filtering unit configured to remove non-Chinese characters in the text to be tested to obtain a de-redundant text, and filter the de-redundant text through the sensitive word indexer to obtain a second set of sensitive words;
  • a third filtering unit configured to filter the text to be tested through the homophone indexer to obtain a third set of sensitive words
  • a fourth filtering unit configured to filter the de-redundant text through the homophone word indexer to obtain a fourth set of sensitive words
  • the merging unit is used to de-duplicate and merge the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.
  • an embodiment of the present application also provides a computer device, the computer device includes a memory and a processor, the memory stores a computer program, and the processor is used to run the computer program to perform the following steps :
  • the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set are deduplicated and combined to obtain a total sensitive word set.
  • the embodiments of the present application also provide a computer-readable storage medium that stores a computer program, wherein when the computer program is executed by a processor, the processor executes the following steps :
  • the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set are deduplicated and combined to obtain a total sensitive word set.
  • FIG. 1 is a schematic diagram of an application scenario of a sensitive word detection method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a method for detecting sensitive words according to an embodiment of the application
  • FIG. 3 is a schematic diagram of a sub-flow of a method for detecting sensitive words according to an embodiment of this application;
  • FIG. 4 is a schematic diagram of a sub-flow of a method for detecting sensitive words according to an embodiment of this application;
  • FIG. 5 is a schematic diagram of a sub-process of a method for detecting sensitive words according to an embodiment of this application;
  • FIG. 6 is a schematic diagram of a sub-process of a method for detecting sensitive words according to an embodiment of the application
  • FIG. 7 is a schematic block diagram of a sensitive word detection device provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of a first construction unit of a sensitive word detection device provided by an embodiment of the application.
  • FIG. 9 is a schematic block diagram of a second construction unit of a sensitive word detection device provided by an embodiment of this application.
  • FIG. 10 is a schematic block diagram of a third filtering unit of a sensitive word detection device provided by an embodiment of the application.
  • FIG. 11 is a schematic block diagram of a fourth filtering unit of a sensitive word detection device provided by an embodiment of the application.
  • FIG. 12 is a schematic block diagram of a first acquiring unit of a sensitive word detection device provided by an embodiment of this application.
  • FIG. 13 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • As used in this specification and the appended claims, the term “if” can be interpreted as “when”, “once”, “in response to determining”, or “in response to detecting”, depending on the context.
  • Similarly, the phrase “if determined” or “if [described condition or event] is detected” can be interpreted, depending on the context, as “once determined”, “in response to determining”, “once [described condition or event] is detected”, or “in response to detecting [described condition or event]”.
  • FIG. 1 is a schematic diagram of an application scenario of a sensitive word detection method provided by an embodiment of the application.
  • FIG. 2 is a schematic flowchart of a method for detecting sensitive words provided by an embodiment of the application.
  • the sensitive word detection method is applied to the sensitive word detection server 10.
  • the sensitive word detection server 10 refers to a server for detecting sensitive words.
  • the sensitive word detection server 10 obtains a sensitive word database from a preset sensitive word server 20.
  • FIG. 2 is a schematic flowchart of a sensitive word detection method provided by an embodiment of the present application. As shown in the figure, the method includes the following steps S1-S8.
  • S1 Obtain a sensitive word database from a preset sensitive word server.
  • the sensitive word database is obtained from the preset sensitive word server.
  • the sensitive word server refers to a server used to provide a sensitive word database.
  • In an embodiment, the above step S1 specifically includes: if a sensitive word database update reminder message sent by the sensitive word server is received, obtaining the download address of the updated sensitive word database from the message, where the update reminder message includes the download address; and downloading the updated sensitive word database from the download address.
  • In a specific implementation, when the sensitive word database is updated, the sensitive word server sends a sensitive word database update reminder message to the sensitive word detection server; the message contains the download address of the updated sensitive word database.
  • On receiving the message, the sensitive word detection server obtains the download address of the updated sensitive word database from it and downloads the updated database from that address. Through the above steps, the sensitive word database can be updated.
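The update flow in step S1 can be sketched as follows. This is only an illustration under stated assumptions: the message is modeled as a dictionary, the field name `download_url` and the one-word-per-line file format are hypothetical, and the downloader is passed in as a function so that no real network call is needed.

```python
from typing import Callable, Optional, Set


def handle_update_message(
    message: dict,
    download: Callable[[str], str],
) -> Optional[Set[str]]:
    """Return the updated lexicon, or None if the message has no address."""
    # "download_url" is an assumed field name; the source only states that
    # the reminder message contains the download address.
    url = message.get("download_url")
    if not url:
        return None
    raw = download(url)
    # Assumed file format: one sensitive word per line, blank lines ignored.
    return {line.strip() for line in raw.splitlines() if line.strip()}


# Usage with a stand-in downloader backed by a dictionary.
fake_store = {"https://example.invalid/lexicon-v2.txt": "词甲\n词乙\n\n词甲\n"}
lexicon = handle_update_message(
    {"download_url": "https://example.invalid/lexicon-v2.txt"},
    fake_store.__getitem__,
)
print(sorted(lexicon))
```

Injecting the downloader keeps the message handling separate from transport, which the source does not specify.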
  • It should be noted that, in the embodiments of the present application, the sensitive word server obtains the sensitive word database in the following manner.
  • First, a training corpus is constructed: the corpus is automatically annotated according to the sensitive word database and the homophone word database, and sensitive words with redundant components are randomly generated according to redundancy regular expressions for corresponding text augmentation.
  • Second, the sensitive word discovery model is trained on the training corpus.
  • Finally, newly acquired corpora, including web corpora and business corpora, are periodically input into the sensitive word discovery model for prediction, and the newly predicted sensitive words are added to the sensitive word database after redundancy filtering.
  • In the embodiments of the present application, sensitive word detection and lexicon updating are separated from each other. Updating the lexicon is an offline task and therefore does not affect the speed of online retrieval and filtering, so the bert+bi-lstm+crf model is selected as the sensitive word discovery model for its higher accuracy.
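The text augmentation mentioned above (randomly generating sensitive words with redundant components) might look like the sketch below. The source does not give the redundancy patterns, so the noise pool `NOISE`, the `max_noise` parameter, and the insertion strategy are all assumptions made for illustration.

```python
import random
import re

# Assumed pool of redundant, non-Chinese symbols; not taken from the source.
NOISE = "*#@&-_ 0123456789abc"


def add_redundancy(word: str, rng: random.Random, max_noise: int = 2) -> str:
    """Insert 0..max_noise random noise symbols between adjacent characters."""
    pieces = []
    for i, ch in enumerate(word):
        pieces.append(ch)
        if i < len(word) - 1:
            count = rng.randint(0, max_noise)
            pieces.append("".join(rng.choice(NOISE) for _ in range(count)))
    return "".join(pieces)


def strip_non_chinese(text: str) -> str:
    """Remove every character outside the basic CJK Unified Ideographs block."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)


rng = random.Random(0)  # fixed seed so the run is repeatable
variants = [add_redundancy("敏感词", rng) for _ in range(5)]
for variant in variants:
    print(repr(variant))
```

Because only non-Chinese symbols are inserted, stripping them recovers the original word, which is the property the later de-redundancy step relies on.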
  • each word in the sensitive dictionary is converted into a corresponding pinyin, for example, Gaochunbing is converted into gaochunbing.
  • the pinyin tone can be removed.
  • the above step S2 specifically includes the following steps S21-S22.
  • the pinyin of the sensitive words in the sensitive word database is acquired, and the tone is removed.
  • the pinyin of the sensitive words in the sensitive word library is used as the homophonic sensitive words, and the obtained homophonic sensitive words are de-duplicated and then stored in a preset blank database to obtain the homophonic word library.
  • a blank database refers to a database without data.
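Steps S21-S22 can be sketched as below. The character-to-pinyin table `PINYIN` is a toy stand-in with tones already removed (an assumption to keep the example self-contained; a real system would use a complete pinyin dictionary or library), and the example words are invented.

```python
# Toy character-to-pinyin table, tones removed. Hypothetical and incomplete.
PINYIN = {
    "敏": "min", "感": "gan", "词": "ci",
    "民": "min", "赶": "gan", "此": "ci",
}


def to_pinyin(word: str) -> str:
    """Concatenate the toneless pinyin of each character in the word."""
    return "".join(PINYIN[ch] for ch in word)


def build_homophone_lexicon(sensitive_lexicon) -> set:
    """S21: convert each word to pinyin; S22: de-duplicate by using a set."""
    return {to_pinyin(word) for word in sensitive_lexicon}


sensitive_lexicon = ["敏感词", "民赶此", "敏感"]
homophone_lexicon = build_homophone_lexicon(sensitive_lexicon)
print(sorted(homophone_lexicon))
```

Here "敏感词" and "民赶此" collapse to the same pinyin entry, which is why the de-duplication in S22 matters.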
  • a sensitive word indexer and a homophone word indexer are constructed respectively according to the sensitive word database and the homophone word database.
  • the sensitive word indexer and the homophone word indexer can be constructed through a data structure such as a trie tree or a double array trie tree.
  • The trie tree is a tree structure and a variant of the hash tree. Its advantages are that it uses the common prefixes of strings to reduce query time, minimizes unnecessary string comparisons, and supports insertion and query operations. It is a data structure that trades space for time and is widely used in word frequency statistics and input statistics.
  • The double-array trie tree stores a trie, which would otherwise require multiple arrays, using only two arrays, which can greatly reduce the space complexity. Specifically:
  • the base array is responsible for recording the states;
  • the check array is responsible for checking whether each string is transferred from the same state;
  • when check[i] is a negative value, it means that the state is the end of a string.
  • In an embodiment, step S3 specifically includes the following steps S31-S32.
  • A double-array trie tree is used to construct a sensitive word indexer corresponding to the sensitive word database.
  • A double-array trie tree is used to construct a homophone word indexer corresponding to the homophone word database.
  • This embodiment uses a double-array trie tree, which is an upgrade of the trie tree structure.
  • The double-array trie tree has the advantages of high query efficiency and space savings, and can effectively reduce wasted space.
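To make the indexer idea concrete, the sketch below builds a plain dictionary-based trie and scans a text for every lexicon word. It is deliberately not a double-array trie: the base/check encoding is omitted for brevity, and the class name `TrieIndexer` and its methods are hypothetical.

```python
class TrieIndexer:
    """A plain nested-dict trie; a simplified stand-in for the indexer."""

    END = object()  # sentinel key marking the end of a stored word

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self.END] = True

    def search(self, text):
        """Return (start, end, word) for every stored word found in text."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.get(text[end])
                if node is None:
                    break  # no stored word continues with this character
                if self.END in node:
                    hits.append((start, end + 1, text[start:end + 1]))
        return hits


indexer = TrieIndexer(["敏感", "敏感词"])
hits = indexer.search("这是敏感词测试")
print(hits)
```

The same class can index the pinyin strings of the homophone word database, since it works on any sequence of characters.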
  • If the text to be tested is received, the text to be tested is filtered through the sensitive word indexer to obtain the first set of sensitive words.
  • the text to be tested is input into the sensitive word indexer, and the sensitive word indexer searches for sensitive words contained in the text to be tested, and adds the queried sensitive words to the first sensitive word set.
  • S5 Remove non-Chinese characters in the text to be tested to obtain a de-redundant text, and filter the de-redundant text through the sensitive word indexer to obtain a second set of sensitive words.
  • Non-Chinese characters in the text to be tested are removed to obtain the de-redundant text.
  • Non-Chinese characters include redundant elements such as Martian script, symbols and numbers. These redundant components will interfere with the retrieval of sensitive word indexers.
  • the de-redundant text is filtered by the sensitive word indexer to obtain a second set of sensitive words.
  • The de-redundant text is input into the sensitive word indexer, and the sensitive word indexer searches for sensitive words contained in the de-redundant text and adds the queried sensitive words to the second sensitive word set.
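Step S5 can be illustrated as below. Two simplifications are assumed: "Chinese characters" is approximated by the basic CJK Unified Ideographs range U+4E00 to U+9FFF, and a naive substring scan (`find_words`, a hypothetical helper) stands in for the sensitive word indexer.

```python
import re

# Matches any character outside the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def remove_non_chinese(text: str) -> str:
    """Drop symbols, digits, Latin letters, and whitespace from the text."""
    return NON_CHINESE.sub("", text)


def find_words(text: str, lexicon) -> set:
    """Naive stand-in for the indexer: keep lexicon words that occur in text."""
    return {word for word in lexicon if word in text}


lexicon = {"敏感词"}
text = "这是敏*感1词 ok"
deredundant = remove_non_chinese(text)
first_set = find_words(text, lexicon)          # S4 on the raw text
second_set = find_words(deredundant, lexicon)  # S5 on the de-redundant text
print(deredundant, first_set, second_set)
```

The raw text yields no hit because of the inserted symbols, while the de-redundant text does, which is the case S5 is meant to cover.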
  • In a specific implementation, the Chinese characters in the text to be tested are first converted to pinyin, and the converted text is then input into the homophone word indexer, which searches for the corresponding homophonic sensitive words.
  • step S6 specifically includes the following steps S61-S63.
  • S61 Convert Chinese in the text to be tested into pinyin to obtain the homophonic text to be tested.
  • the Chinese in the text to be tested is converted into pinyin to obtain the homophonic text to be tested.
  • the pinyin tone can be removed.
  • S62 Filter the to-be-tested homophonic text through the homophonic word indexer to obtain a first homophonic sensitive word set.
  • the homophonic text to be tested is filtered by the homophonic word indexer to obtain the first homophonic sensitive word set.
  • The homophonic text to be tested is input into the homophone word indexer, and the homophone word indexer searches for homophonic sensitive words contained in the homophonic text to be tested and adds the queried homophonic sensitive words to the first homophonic sensitive word set.
  • In a specific implementation, a mapping between Chinese and pinyin is established when the Chinese in the text to be tested is converted into pinyin. The words in the text to be tested that correspond to the homophonic sensitive words in the first homophonic sensitive word set are then looked up according to this mapping, and the words found are added as sensitive words to the third sensitive word set.
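The sub-steps of S6 (convert to pinyin, match against the homophone lexicon, map matches back to the original words) can be sketched as follows. The toy `PINYIN` table, the offset-based mapping, and the plain substring search that replaces the homophone word indexer are all assumptions for illustration.

```python
# Toy character-to-pinyin table, tones removed. Hypothetical and incomplete.
PINYIN = {"敏": "min", "感": "gan", "词": "ci", "民": "min", "赶": "gan", "此": "ci"}


def to_pinyin_with_map(text):
    """Return the pinyin string and, per source character, its start offset."""
    parts, starts, pos = [], [], 0
    for ch in text:
        py = PINYIN.get(ch, ch)  # characters not in the table pass through
        starts.append(pos)
        parts.append(py)
        pos += len(py)
    starts.append(pos)  # sentinel: end offset of the last character
    return "".join(parts), starts


def find_homophone_words(text, homophone_lexicon):
    """Find source substrings whose pinyin equals a homophone lexicon entry."""
    pinyin_text, starts = to_pinyin_with_map(text)
    boundary = {offset: index for index, offset in enumerate(starts)}
    found = set()
    for entry in homophone_lexicon:
        begin = pinyin_text.find(entry)
        while begin != -1:
            end = begin + len(entry)
            # Accept only matches aligned with source character boundaries.
            if begin in boundary and end in boundary:
                found.add(text[boundary[begin]:boundary[end]])
            begin = pinyin_text.find(entry, begin + 1)
    return found


third_set = find_homophone_words("他说民赶此不好", {"minganci"})
print(third_set)
```

Requiring matches to start and end on character boundaries avoids reporting a hit that begins in the middle of one character's pinyin.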
  • In a specific implementation, the Chinese characters in the de-redundant text are first converted to pinyin, and the converted de-redundant text is then input into the homophone word indexer, which searches for the corresponding homophonic sensitive words.
  • step S7 specifically includes the following steps S71-S73.
  • the Chinese in the de-redundant text is converted into pinyin to obtain the de-redundant homophonic text.
  • the pinyin tone can be removed.
  • the de-redundant homophonic text is filtered by the homophonic word indexer to obtain the second homophonic sensitive word set.
  • The de-redundant homophonic text is input into the homophone word indexer, and the homophone word indexer searches for the homophonic sensitive words contained in the de-redundant homophonic text and adds the queried homophonic sensitive words to the second homophonic sensitive word set.
  • In a specific implementation, a mapping between Chinese and pinyin is established when the Chinese in the de-redundant text is converted into pinyin. The words in the de-redundant text that correspond to the homophonic sensitive words in the second homophonic sensitive word set are then looked up according to this mapping, and the words found are added as sensitive words to the fourth sensitive word set.
  • S8 De-duplicate and merge the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.
  • first, the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set are deduplicated, that is, the repeated sensitive words are removed.
  • the total sensitive word set contains all the sensitive words contained in the text to be tested.
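Step S8 amounts to a set union. The short sketch below assumes each of the four results is already a Python set; the function name and the example words are hypothetical.

```python
def merge_sensitive_sets(*word_sets):
    """De-duplicate and merge any number of sensitive word sets."""
    total = set()
    for words in word_sets:
        total.update(words)  # set semantics drop repeated words
    return total


first_set = {"敏感词"}
second_set = {"敏感词", "敏感"}
third_set = {"民赶此"}
fourth_set = {"民赶此", "敏感词"}
total_set = merge_sensitive_sets(first_set, second_set, third_set, fourth_set)
print(sorted(total_set))
```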
  • The technical solution of the embodiments of the present application constructs a homophone word database corresponding to the sensitive word database, and constructs a sensitive word indexer and a homophone word indexer according to the sensitive word database and the homophone word database, respectively. If a text to be tested is received, the sensitive word indexer and the homophone word indexer are each used to filter sensitive words in both the text to be tested and the de-redundant text obtained by removing non-Chinese characters from it. In this way, not only the sensitive words themselves but also their homophones and redundant-insertion variants can be identified, which greatly improves recognition accuracy.
  • FIG. 7 is a schematic block diagram of a sensitive word detection device 60 provided by an embodiment of the present application. As shown in FIG. 7, corresponding to the above sensitive word detection method, the present application also provides a sensitive word detection device 60.
  • The sensitive word detection device 60 includes units for executing the above-mentioned sensitive word detection method, and the sensitive word detection device 60 can be configured in a server. Specifically, referring to FIG. 7, the sensitive word detection device 60 includes a first acquisition unit 61, a first construction unit 62, a second construction unit 63, a first filtering unit 64, a second filtering unit 65, a third filtering unit 66, a fourth filtering unit 67, and a merging unit 68.
  • the first obtaining unit 61 is configured to obtain a sensitive word database from a preset sensitive word server;
  • the first construction unit 62 is configured to construct a homophonic word database corresponding to the sensitive word database
  • the second construction unit 63 is configured to construct a sensitive word indexer and a homophone word indexer respectively according to the sensitive word database and the homophone word database;
  • the first filtering unit 64 is configured to, if the text to be tested is received, filter the text to be tested through the sensitive word indexer to obtain a first set of sensitive words;
  • the second filtering unit 65 is configured to remove non-Chinese characters in the text to be tested to obtain a de-redundant text, and filter the de-redundant text through the sensitive word indexer to obtain a second set of sensitive words ;
  • the third filtering unit 66 is configured to filter the text to be tested through the homophone word indexer to obtain a third set of sensitive words
  • the fourth filtering unit 67 is configured to filter the de-redundant text through the homophone indexer to obtain a fourth set of sensitive words
  • the merging unit 68 is configured to de-duplicate and merge the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.
  • the first construction unit 62 includes a second acquisition unit 621 and a storage unit 622.
  • the second acquiring unit 621 is configured to acquire the pinyin of the sensitive words in the sensitive word database.
  • the storage unit 622 is configured to use the pinyin of the sensitive words of the sensitive word library as the homophone-sensitive words, and store the homophone-sensitive words in a preset blank database to obtain the homophone word library.
  • the second construction unit 63 includes a third construction unit 631 and a fourth construction unit 632.
  • the third construction unit 631 is configured to use a double-array trie tree to construct a sensitive word indexer corresponding to the sensitive word database;
  • the fourth construction unit 632 is configured to use a double-array trie tree to construct a homophone word indexer corresponding to the homophone word database.
  • the third filter unit 66 includes a first conversion unit 661, a fifth filter unit 662 and a third acquisition unit 663.
  • the first conversion unit 661 is configured to convert Chinese in the text to be tested into pinyin to obtain the homophonic text to be tested;
  • the fifth filtering unit 662 is configured to filter the homophonic text to be tested through the homophonic word indexer to obtain a first homophonic sensitive word set;
  • the third acquiring unit 663 is configured to acquire the words corresponding to the homophonic sensitive words in the first homophonic sensitive word set in the text to be tested to obtain the third sensitive word set.
  • the fourth filter unit 67 includes a second conversion unit 671, a sixth filter unit 672, and a fourth acquisition unit 673.
  • the second conversion unit 671 is configured to convert Chinese in the de-redundant text into pinyin to obtain a de-redundant homophonic text
  • a sixth filtering unit 672 configured to filter the de-redundant homophonic text through the homophonic word indexer to obtain a second homophonic sensitive word set;
  • the fourth acquiring unit 673 is configured to acquire the words corresponding to the homophonic sensitive words in the second homophonic sensitive word set in the de-redundant text to obtain the fourth sensitive word set.
  • the first obtaining unit 61 includes a downloading unit 611.
  • The downloading unit 611 is configured to: if a sensitive word database update reminder message sent by the sensitive word server is received, obtain the download address of the updated sensitive word database from the message, where the update reminder message includes the download address; and download the updated sensitive word database from the download address.
  • the above-mentioned sensitive word detection device can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 13.
  • FIG. 13 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the processor 502 can execute a sensitive word detection method.
  • the processor 502 is used to provide calculation and control capabilities to support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • the network interface 505 is used for network communication with other devices.
  • the structure shown in FIG. 13 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in a memory to implement the sensitive word detection method of the present application.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the computer program may be stored in a storage medium, and the storage medium is a computer-readable storage medium.
  • the computer program is executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiment.
  • the storage medium may be a computer-readable storage medium.
  • the storage medium stores a computer program.
  • the processor executes the sensitive word detection method of the present application.
  • The storage medium is a physical, non-transitory storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of each unit is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the steps in the method in the embodiment of the present application can be adjusted, merged, and deleted in order according to actual needs.
  • the units in the devices in the embodiments of the present application may be combined, divided, and deleted according to actual needs.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

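A sensitive word detection method, apparatus, computer device, and storage medium. The method includes: obtaining a sensitive word database from a preset sensitive word server (S1); constructing a homophone word database corresponding to the sensitive word database (S2); constructing a sensitive word indexer and a homophone word indexer according to the sensitive word database and the homophone word database, respectively (S3); if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set (S4); removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set (S5); filtering the text to be tested through the homophone word indexer to obtain a third sensitive word set (S6); filtering the de-redundant text through the homophone word indexer to obtain a fourth sensitive word set (S7); and de-duplicating and merging the first, second, third, and fourth sensitive word sets to obtain a total sensitive word set (S8).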

Description

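Sensitive word detection method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 202010688343.7, filed with the China Patent Office on July 16, 2020 and entitled “Sensitive word detection method, apparatus, computer device, and storage medium”, the entire contents of which are incorporated herein by reference.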
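Technical Field
This application relates to the technical field of sensitive word filtering, and in particular to a sensitive word detection method, apparatus, computer device, and storage medium.
Background
Sensitive word filtering refers to using artificial intelligence technology to accurately and efficiently identify non-compliant content in various scenarios, such as political, pornographic, abusive, prohibited, and spam advertising content, so as to guard against content risks in advance and improve user experience. Currently, commonly used sensitive word filtering algorithms include finite automaton matching based on a sensitive word database, and classification and sequence labeling based on machine learning models.
The inventor realized that the existing sensitive word filtering methods above share a shortcoming: they can only identify the sensitive words themselves and cannot filter out variants of sensitive words, such as homophones and words with redundant insertions, which results in low accuracy of sensitive word recognition.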
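Summary
The embodiments of the present application provide a sensitive word detection method, apparatus, computer device, and storage medium, aiming to solve the problem that existing sensitive word filtering methods recognize sensitive words with low accuracy.
In a first aspect, an embodiment of the present application provides a sensitive word detection method, which includes:
obtaining a sensitive word database from a preset sensitive word server;
constructing a homophone word database corresponding to the sensitive word database;
constructing a sensitive word indexer and a homophone word indexer according to the sensitive word database and the homophone word database, respectively;
if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
filtering the text to be tested through the homophone word indexer to obtain a third sensitive word set;
filtering the de-redundant text through the homophone word indexer to obtain a fourth sensitive word set;
de-duplicating and merging the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.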
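In a second aspect, an embodiment of the present application further provides a sensitive word detection apparatus, which includes:
a first obtaining unit, configured to obtain a sensitive word database from a preset sensitive word server;
a first construction unit, configured to construct a homophone word database corresponding to the sensitive word database;
a second construction unit, configured to construct a sensitive word indexer and a homophone word indexer according to the sensitive word database and the homophone word database, respectively;
a first filtering unit, configured to, if a text to be tested is received, filter the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
a second filtering unit, configured to remove non-Chinese characters from the text to be tested to obtain a de-redundant text, and filter the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
a third filtering unit, configured to filter the text to be tested through the homophone word indexer to obtain a third sensitive word set;
a fourth filtering unit, configured to filter the de-redundant text through the homophone word indexer to obtain a fourth sensitive word set;
a merging unit, configured to de-duplicate and merge the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.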
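In a third aspect, an embodiment of the present application further provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor is configured to run the computer program to perform the following steps:
obtaining a sensitive word database from a preset sensitive word server;
constructing a homophone word database corresponding to the sensitive word database;
constructing a sensitive word indexer and a homophone word indexer according to the sensitive word database and the homophone word database, respectively;
if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
filtering the text to be tested through the homophone word indexer to obtain a third sensitive word set;
filtering the de-redundant text through the homophone word indexer to obtain a fourth sensitive word set;
de-duplicating and merging the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.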
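In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining a sensitive word database from a preset sensitive word server;
constructing a homophone word database corresponding to the sensitive word database;
constructing a sensitive word indexer and a homophone word indexer according to the sensitive word database and the homophone word database, respectively;
if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
filtering the text to be tested through the homophone word indexer to obtain a third sensitive word set;
filtering the de-redundant text through the homophone word indexer to obtain a fourth sensitive word set;
de-duplicating and merging the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.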
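Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of a sensitive word detection method provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of a sensitive word detection method provided by an embodiment of this application;
Fig. 3 is a schematic sub-flowchart of a sensitive word detection method provided by an embodiment of this application;
Fig. 4 is a schematic sub-flowchart of a sensitive word detection method provided by an embodiment of this application;
Fig. 5 is a schematic sub-flowchart of a sensitive word detection method provided by an embodiment of this application;
Fig. 6 is a schematic sub-flowchart of a sensitive word detection method provided by an embodiment of this application;
Fig. 7 is a schematic block diagram of a sensitive word detection apparatus provided by an embodiment of this application;
Fig. 8 is a schematic block diagram of the first construction unit of a sensitive word detection apparatus provided by an embodiment of this application;
Fig. 9 is a schematic block diagram of the second construction unit of a sensitive word detection apparatus provided by an embodiment of this application;
Fig. 10 is a schematic block diagram of the third filtering unit of a sensitive word detection apparatus provided by an embodiment of this application;
Fig. 11 is a schematic block diagram of the fourth filtering unit of a sensitive word detection apparatus provided by an embodiment of this application;
Fig. 12 is a schematic block diagram of the first obtaining unit of a sensitive word detection apparatus provided by an embodiment of this application;
Fig. 13 is a schematic block diagram of a computer device provided by an embodiment of this application.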
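Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are only for the purpose of describing particular embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".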
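Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of the sensitive word detection method provided by an embodiment of this application, and Fig. 2 is a schematic flowchart of the method. The method is applied in a sensitive word detection server 10, that is, a server used to detect sensitive words. The sensitive word detection server 10 obtains the sensitive word lexicon from a preset sensitive word server 20.
As shown in Fig. 2, the method includes the following steps S1-S8.
S1: obtain a sensitive word lexicon from a preset sensitive word server.
In a specific implementation, the sensitive word server is a server that provides the sensitive word lexicon.
In an embodiment, step S1 specifically includes: if a sensitive word lexicon update notification sent by the sensitive word server is received, obtaining the download address of the updated sensitive word lexicon from the notification, the notification containing the download address; and downloading the updated sensitive word lexicon from the download address.
In a specific implementation, whenever the sensitive word lexicon is updated, the sensitive word server sends an update notification to the sensitive word detection server, and the notification contains the download address of the updated lexicon.
On receiving the update notification, the sensitive word detection server reads the download address from it and downloads the updated sensitive word lexicon from that address. In this way the sensitive word lexicon is kept up to date.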
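The update flow of step S1 can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the JSON message layout, the field name `download_url`, the one-word-per-line file format, and the injected `fetch` callable are all assumptions made for the example.

```python
import json
from typing import Callable, List


def handle_update_notice(message: str, fetch: Callable[[str], str]) -> List[str]:
    """Read the download address from an update notice and fetch the lexicon.

    `message` is assumed to be a JSON object with a hypothetical
    "download_url" field; `fetch` is any callable that returns the body
    found at a URL (an HTTP GET in a real deployment).
    """
    notice = json.loads(message)
    url = notice["download_url"]  # hypothetical field name
    raw = fetch(url)
    words: List[str] = []
    seen = set()
    # Assumed file format: one sensitive word per line.
    for line in raw.splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words
```

Passing the downloader in as a parameter keeps the sketch testable without network access; a stub such as `lambda url: "高纯冰\n"` is enough to exercise it.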
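It should be noted that, in the embodiments of this application, the sensitive word server obtains the sensitive word lexicon as follows.
First, a training corpus is built: the corpus is labeled automatically according to the sensitive word lexicon and the homophonic lexicon, and sensitive words with redundant components are generated at random according to redundancy regular expressions to augment the text.
Second, a sensitive word discovery model is trained on the training corpus.
Finally, newly collected corpora, including web corpora and business corpora, are periodically fed into the sensitive word discovery model for prediction, and the newly predicted sensitive words are added to the sensitive word lexicon after their redundant components are filtered out.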
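The text augmentation in the first step might look like the sketch below. The source does not give the redundancy patterns or the labeling scheme, so the noise pool `NOISE`, the function names, and the BIO tags are assumptions chosen only to illustrate the idea.

```python
import random

# Hypothetical pool of redundant characters; the source does not list its patterns.
NOISE = "*#@&_-~ 0123456789"


def add_redundancy(word, rng, max_insert=2):
    """Insert 0..max_insert noise characters between the characters of word."""
    if not word:
        return word
    out = [word[0]]
    for ch in word[1:]:
        for _ in range(rng.randint(0, max_insert)):
            out.append(rng.choice(NOISE))
        out.append(ch)
    return "".join(out)


def label_sentence(prefix, variant, suffix):
    """Return (characters, tags) with the non-empty variant tagged B/I, context O.

    BIO tagging is an assumed scheme for sequence labeling, not taken
    from the source.
    """
    chars = list(prefix + variant + suffix)
    tags = (["O"] * len(prefix)
            + ["B"] + ["I"] * (len(variant) - 1)
            + ["O"] * len(suffix))
    return chars, tags
```

A seeded generator such as `random.Random(0)` makes the generated variants reproducible between training runs.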
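In the embodiments of this application, sensitive word detection and lexicon updating are decoupled. Because the lexicon update is an offline task, it does not affect the speed of online retrieval and filtering, so a heavier but more accurate BERT+Bi-LSTM+CRF model can be chosen as the sensitive word discovery model.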
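S2: construct a homophonic lexicon corresponding to the sensitive word lexicon.
In a specific implementation, each word in the sensitive word lexicon is first converted into its pinyin; for example, 高纯冰 is converted to gaochunbing. To widen the retrieval range, the tones of the pinyin may be dropped.
The pinyin strings of all sensitive words are then deduplicated to obtain the corresponding homophonic lexicon.
Referring to Fig. 3, in an embodiment, step S2 specifically includes the following steps S21-S22.
S21: obtain the pinyin of the sensitive words in the sensitive word lexicon.
In a specific implementation, the pinyin of each sensitive word in the lexicon is obtained and its tones are removed.
S22: take the pinyin of the sensitive words as homophonic sensitive words, and store the homophonic sensitive words in a preset blank database to obtain the homophonic lexicon.
In a specific implementation, the pinyin of the sensitive words is taken as the homophonic sensitive words, which are deduplicated and then stored in a preset blank database to obtain the homophonic lexicon. A blank database is a database that does not yet hold any data.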
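Steps S21-S22 can be sketched as follows. The character-to-pinyin table `TOY_PINYIN` is a hypothetical stand-in covering only the example characters; a real system would need a full pronunciation dictionary or a pinyin library, and a plain list stands in for the blank database.

```python
# Hypothetical toy table of toneless pinyin; not a complete dictionary.
TOY_PINYIN = {"高": "gao", "纯": "chun", "冰": "bing", "春": "chun", "兵": "bing"}


def to_pinyin(word, table=TOY_PINYIN):
    """Concatenate the toneless pinyin of each character (unknown characters kept)."""
    return "".join(table.get(ch, ch) for ch in word)


def build_homophonic_lexicon(sensitive_words, table=TOY_PINYIN):
    """Convert every sensitive word to pinyin and drop duplicates, keeping order."""
    lexicon = []
    seen = set()
    for word in sensitive_words:
        pinyin = to_pinyin(word, table)
        if pinyin not in seen:
            seen.add(pinyin)
            lexicon.append(pinyin)
    return lexicon
```

Because tones are dropped, words that differ only in tone or in the characters used collapse to one entry, which is what lets a single homophonic entry cover many written variants.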
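S3: construct a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively.
It should be noted that the sensitive word indexer and the homophonic word indexer can be built with data structures such as a trie or a double-array trie.
A trie is a tree structure, sometimes described as a variant of the hash tree. Its advantage is that it uses the common prefixes of strings to shorten query time and minimizes unnecessary string comparisons. It supports insertion and lookup, trades space for time, and is widely used in word-frequency statistics and input statistics.
A double-array trie stores a trie that would otherwise need many arrays in just two arrays, which greatly reduces space complexity. Specifically:
Two arrays, base and check, maintain the trie. The base array records the states, and the check array verifies whether a transition really originates from the expected state. When check[i] is negative, the state marks the end of a string.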
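To make the base/check idea concrete, here is a small sketch of a double-array trie with a naive builder. It follows the common formulation in which a transition from `state` on symbol code `c` goes to `t = base[state] + c` and is valid only if `check[t] == state`. Two details are assumptions of this sketch rather than the source's layout: free slots are marked with -1, and the end of a word is encoded as an ordinary transition on a hypothetical `END` symbol instead of a negative check value.

```python
from collections import deque

END = "\0"  # hypothetical end-of-word symbol, encoded like any other symbol


def build_double_array(words):
    """Build (base, check, code) arrays from a list of words (naive placement)."""
    # 1) Build an ordinary nested-dict trie first.
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    # 2) Give every symbol a small positive integer code.
    symbols = sorted({ch for word in words for ch in word} | {END})
    code = {ch: i + 1 for i, ch in enumerate(symbols)}
    # 3) Place each state's children so that child = base[state] + code.
    base, check = [0], [0]  # slot 0 is the root state
    queue = deque([(0, root)])
    while queue:
        state, node = queue.popleft()
        if not node:
            continue
        codes = [code[ch] for ch in node]
        b = 1
        # Find the smallest offset whose target slots are all free (-1).
        while any(b + c < len(check) and check[b + c] != -1 for c in codes):
            b += 1
        need = b + max(codes) + 1
        if need > len(check):
            base.extend([0] * (need - len(check)))
            check.extend([-1] * (need - len(check)))
        base[state] = b
        for ch, child in node.items():
            t = b + code[ch]
            check[t] = state  # slot t was reached from `state`
            queue.append((t, child))
    return base, check, code


def contains(double_array, word):
    """Return True if word was inserted, following base/check transitions."""
    base, check, code = double_array
    state = 0
    for ch in word + END:
        c = code.get(ch)
        if c is None:
            return False
        t = base[state] + c
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return True
```

The linear search for a free offset keeps the builder short; production implementations use smarter placement strategies to build large lexicons quickly.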
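Referring to Fig. 4, in an embodiment, step S3 specifically includes the following steps S31-S32.
S31: build the sensitive word indexer corresponding to the sensitive word lexicon using a double-array trie.
S32: build the homophonic word indexer corresponding to the homophonic lexicon using a double-array trie.
It should be noted that an ordinary trie is sparse and therefore wastes space. This embodiment uses the double-array trie, an improvement on the ordinary trie that offers high query efficiency and a compact layout, which effectively reduces the wasted space.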
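S4: if a text to be tested is received, filter the text to be tested through the sensitive word indexer to obtain a first sensitive word set.
Specifically, the text to be tested is input into the sensitive word indexer, the indexer looks up the sensitive words contained in the text, and the sensitive words found are added to the first sensitive word set.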
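The lookup in step S4 can be illustrated by scanning the text from every start position and walking an index as far as it matches. The sketch below uses a plain nested-dict trie as a simplified stand-in for the indexer; the helper names and the `"$end"` marker key are assumptions of this example.

```python
def build_trie(words):
    """Build a nested-dict trie; "$end" (hypothetical marker) flags a full word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def find_sensitive_words(text, trie):
    """Return every lexicon word occurring in text, in order of first appearance."""
    found = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no lexicon word continues with this character
            if "$end" in node and text[start:end + 1] not in found:
                found.append(text[start:end + 1])
    return found
```

Note that a text such as `高*纯*冰` yields no match here, which is the gap that the de-redundant pass of step S5 is meant to close.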
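S5: remove non-Chinese characters from the text to be tested to obtain a de-redundant text, and filter the de-redundant text through the sensitive word indexer to obtain a second sensitive word set.
In a specific implementation, the non-Chinese characters in the text to be tested are first removed to obtain the de-redundant text. Non-Chinese characters include redundant components such as Martian script (火星文), symbols, and digits, which interfere with retrieval by the sensitive word indexer.
The de-redundant text is then filtered through the sensitive word indexer to obtain the second sensitive word set.
Specifically, the de-redundant text is input into the sensitive word indexer, the indexer looks up the sensitive words contained in the de-redundant text, and the sensitive words found are added to the second sensitive word set.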
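Removing non-Chinese characters can be done with a single regular expression. The sketch below treats "Chinese character" as the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); that range is an assumption of this example, and extension blocks are not covered.

```python
import re

# Anything outside the basic CJK Unified Ideographs block is treated as redundant.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def strip_non_chinese(text):
    """Return text with every non-Chinese character removed."""
    return _NON_CHINESE.sub("", text)
```

For instance, `strip_non_chinese("高*纯1冰")` gives `"高纯冰"`, which the sensitive word indexer can then match directly.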
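S6: filter the text to be tested through the homophonic word indexer to obtain a third sensitive word set.
In a specific implementation, the Chinese characters in the text to be tested are first converted into pinyin, and the converted text is then input into the homophonic word indexer, which looks up the corresponding homophonic sensitive words.
Referring to Fig. 5, in an embodiment, step S6 specifically includes the following steps S61-S63.
S61: convert the Chinese characters in the text to be tested into pinyin to obtain a homophonic text to be tested.
In a specific implementation, the tones of the pinyin may be removed to widen the retrieval range.
S62: filter the homophonic text to be tested through the homophonic word indexer to obtain a first homophonic sensitive word set.
Specifically, the homophonic text to be tested is input into the homophonic word indexer, the indexer looks up the homophonic sensitive words contained in it, and the homophonic sensitive words found are added to the first homophonic sensitive word set.
S63: obtain the words in the text to be tested that correspond to the homophonic sensitive words in the first homophonic sensitive word set, to obtain the third sensitive word set.
In a specific implementation, a mapping between the Chinese characters and their pinyin is established when the text to be tested is converted into pinyin. The mapping is then used to locate the words in the text to be tested that correspond to the homophonic sensitive words in the first homophonic sensitive word set, and the words found are added to the third sensitive word set as sensitive words.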
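Steps S61-S63 can be sketched by keeping one pinyin syllable per input character, so that the shared index serves as the character-to-pinyin mapping. The table `TOY_PINYIN`, the `max_len` window, and the use of a plain set in place of the homophonic word indexer are all assumptions made to keep the example short.

```python
# Hypothetical toy table of toneless pinyin; not a complete dictionary.
TOY_PINYIN = {"高": "gao", "纯": "chun", "冰": "bing",
              "糕": "gao", "春": "chun", "兵": "bing"}


def find_homophones(text, homophonic_lexicon, table=TOY_PINYIN, max_len=8):
    """Return the spans of text whose pinyin equals a homophonic sensitive word."""
    # syllables[i] is the pinyin of text[i]; the shared index is the mapping
    # used to recover the original characters after a pinyin match.
    syllables = [table.get(ch, ch) for ch in text]
    lexicon = set(homophonic_lexicon)  # stand-in for the homophonic word indexer
    found = []
    for start in range(len(text)):
        last = min(len(text), start + max_len)
        for end in range(start + 1, last + 1):
            if "".join(syllables[start:end]) in lexicon:
                word = text[start:end]
                if word not in found:
                    found.append(word)
    return found
```

Matching on whole syllables rather than on one concatenated pinyin string avoids hits that start or end in the middle of a syllable, and it makes the mapping back to the original characters trivial.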
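S7: filter the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set.
In a specific implementation, the Chinese characters in the de-redundant text are first converted into pinyin, and the converted de-redundant text is then input into the homophonic word indexer, which looks up the corresponding homophonic sensitive words.
Referring to Fig. 6, in an embodiment, step S7 specifically includes the following steps S71-S73.
S71: convert the Chinese characters in the de-redundant text into pinyin to obtain a de-redundant homophonic text.
In a specific implementation, the tones of the pinyin may be removed to widen the retrieval range.
S72: filter the de-redundant homophonic text through the homophonic word indexer to obtain a second homophonic sensitive word set.
Specifically, the de-redundant homophonic text is input into the homophonic word indexer, the indexer looks up the homophonic sensitive words contained in it, and the homophonic sensitive words found are added to the second homophonic sensitive word set.
S73: obtain the words in the de-redundant text that correspond to the homophonic sensitive words in the second homophonic sensitive word set, to obtain the fourth sensitive word set.
In a specific implementation, a mapping between the Chinese characters and their pinyin is established when the de-redundant text is converted into pinyin. The mapping is then used to locate the words in the de-redundant text that correspond to the homophonic sensitive words in the second homophonic sensitive word set, and the words found are added to the fourth sensitive word set as sensitive words.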
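S8: deduplicate and merge the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.
In a specific implementation, the first, second, third, and fourth sensitive word sets are first deduplicated, that is, repeated sensitive words are removed.
The four sets are then merged to obtain the total sensitive word set, which contains all the sensitive words found in the text to be tested.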
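The deduplicate-and-merge step amounts to an order-preserving union. In the sketch below the function name `merge_sets` is hypothetical, and lists are used so that the order of first appearance is kept.

```python
def merge_sets(*collections):
    """Merge any number of word collections, keeping each word once."""
    merged = []
    seen = set()
    for collection in collections:
        for word in collection:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged
```

Deduplicating during the merge, rather than per set beforehand, also removes words that occur in more than one of the four sets.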
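In the technical solution of the embodiments of this application, a homophonic lexicon corresponding to the sensitive word lexicon is constructed, and a sensitive word indexer and a homophonic word indexer are built from the sensitive word lexicon and the homophonic lexicon, respectively. When a text to be tested is received, both the text itself and the de-redundant text obtained by removing its non-Chinese characters are filtered through the sensitive word indexer and the homophonic word indexer. As a result, the method recognizes not only the sensitive words themselves but also their homophonic variants and variants with redundant characters inserted, which improves recognition accuracy.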
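Fig. 7 is a schematic block diagram of a sensitive word detection apparatus 60 provided by an embodiment of this application. As shown in Fig. 7, corresponding to the sensitive word detection method above, this application further provides a sensitive word detection apparatus 60. The apparatus 60 includes units for performing the sensitive word detection method above and can be configured in a server. Specifically, referring to Fig. 7, the apparatus 60 includes a first obtaining unit 61, a first construction unit 62, a second construction unit 63, a first filtering unit 64, a second filtering unit 65, a third filtering unit 66, a fourth filtering unit 67, and a merging unit 68.
The first obtaining unit 61 is configured to obtain a sensitive word lexicon from a preset sensitive word server;
the first construction unit 62 is configured to construct a homophonic lexicon corresponding to the sensitive word lexicon;
the second construction unit 63 is configured to construct a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively;
the first filtering unit 64 is configured to, if a text to be tested is received, filter the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
the second filtering unit 65 is configured to remove non-Chinese characters from the text to be tested to obtain a de-redundant text, and to filter the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
the third filtering unit 66 is configured to filter the text to be tested through the homophonic word indexer to obtain a third sensitive word set;
the fourth filtering unit 67 is configured to filter the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set;
the merging unit 68 is configured to deduplicate and merge the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.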
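In an embodiment, as shown in Fig. 8, the first construction unit 62 includes a second obtaining unit 621 and a storage unit 622.
The second obtaining unit 621 is configured to obtain the pinyin of the sensitive words in the sensitive word lexicon.
The storage unit 622 is configured to take the pinyin of the sensitive words in the sensitive word lexicon as homophonic sensitive words, and to store the homophonic sensitive words in a preset blank database to obtain the homophonic lexicon.
In an embodiment, as shown in Fig. 9, the second construction unit 63 includes a third construction unit 631 and a fourth construction unit 632.
The third construction unit 631 is configured to build the sensitive word indexer corresponding to the sensitive word lexicon using a double-array trie;
the fourth construction unit 632 is configured to build the homophonic word indexer corresponding to the homophonic lexicon using a double-array trie.
In an embodiment, as shown in Fig. 10, the third filtering unit 66 includes a first conversion unit 661, a fifth filtering unit 662, and a third obtaining unit 663.
The first conversion unit 661 is configured to convert the Chinese characters in the text to be tested into pinyin to obtain a homophonic text to be tested;
the fifth filtering unit 662 is configured to filter the homophonic text to be tested through the homophonic word indexer to obtain a first homophonic sensitive word set;
the third obtaining unit 663 is configured to obtain the words in the text to be tested that correspond to the homophonic sensitive words in the first homophonic sensitive word set, to obtain the third sensitive word set.
In an embodiment, as shown in Fig. 11, the fourth filtering unit 67 includes a second conversion unit 671, a sixth filtering unit 672, and a fourth obtaining unit 673.
The second conversion unit 671 is configured to convert the Chinese characters in the de-redundant text into pinyin to obtain a de-redundant homophonic text;
the sixth filtering unit 672 is configured to filter the de-redundant homophonic text through the homophonic word indexer to obtain a second homophonic sensitive word set;
the fourth obtaining unit 673 is configured to obtain the words in the de-redundant text that correspond to the homophonic sensitive words in the second homophonic sensitive word set, to obtain the fourth sensitive word set.
In an embodiment, as shown in Fig. 12, the first obtaining unit 61 includes a download unit 611.
The download unit 611 is configured to, if a sensitive word lexicon update notification sent by the sensitive word server is received, obtain the download address of the updated sensitive word lexicon from the notification, the notification containing the download address, and to download the updated sensitive word lexicon from the download address.
It should be noted that a person skilled in the art can clearly understand that, for the specific implementation of the sensitive word detection apparatus 60 and its units, reference may be made to the corresponding descriptions in the method embodiments above; for convenience and brevity, the details are not repeated here.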
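The sensitive word detection apparatus above can be implemented in the form of a computer program, which can run on a computer device as shown in Fig. 13.
Please refer to Fig. 13, which is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 500 is a server, which may be a standalone server or a server cluster composed of multiple servers.
Referring to Fig. 13, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform a sensitive word detection method.
The processor 502 provides computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When executed by the processor 502, the computer program 5032 causes the processor 502 to perform a sensitive word detection method.
The network interface 505 is used for network communication with other devices. A person skilled in the art can understand that the structure shown in Fig. 13 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the sensitive word detection method of this application.
It should be understood that, in the embodiments of this application, the processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.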
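A person of ordinary skill in the art can understand that all or part of the processes in the methods of the embodiments above can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a storage medium, which is a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the method embodiments above.
Accordingly, this application further provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the sensitive word detection method of this application.
The storage medium is a tangible, non-transitory storage medium, for example a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other tangible medium that can store program code. The computer-readable storage medium may be non-volatile or volatile.
A person of ordinary skill in the art can appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The steps in the methods of the embodiments of this application may be reordered, combined, or deleted according to actual needs. The units in the apparatus of the embodiments of this application may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it can be stored in a storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of this application.
In the embodiments above, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Obviously, a person skilled in the art can make various changes and variations to this application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to include them as well.
The above are only specific embodiments of this application, but the protection scope of this application is not limited to them. Any person familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.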

Claims (20)

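  1. A sensitive word detection method, comprising:
    obtaining a sensitive word lexicon from a preset sensitive word server;
    constructing a homophonic lexicon corresponding to the sensitive word lexicon;
    constructing a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively;
    if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
    removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
    filtering the text to be tested through the homophonic word indexer to obtain a third sensitive word set;
    filtering the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set;
    deduplicating and merging the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.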
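  2. The sensitive word detection method according to claim 1, wherein the constructing a homophonic lexicon corresponding to the sensitive word lexicon comprises:
    obtaining the pinyin of the sensitive words in the sensitive word lexicon;
    taking the pinyin of the sensitive words in the sensitive word lexicon as homophonic sensitive words, and storing the homophonic sensitive words in a preset blank database to obtain the homophonic lexicon.
  3. The sensitive word detection method according to claim 1, wherein the constructing a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively, comprises:
    building the sensitive word indexer corresponding to the sensitive word lexicon using a double-array trie;
    building the homophonic word indexer corresponding to the homophonic lexicon using a double-array trie.
  4. The sensitive word detection method according to claim 1, wherein the filtering the text to be tested through the homophonic word indexer to obtain a third sensitive word set comprises:
    converting the Chinese characters in the text to be tested into pinyin to obtain a homophonic text to be tested;
    filtering the homophonic text to be tested through the homophonic word indexer to obtain a first homophonic sensitive word set;
    obtaining the words in the text to be tested that correspond to the homophonic sensitive words in the first homophonic sensitive word set, to obtain the third sensitive word set.
  5. The sensitive word detection method according to claim 1, wherein the filtering the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set comprises:
    converting the Chinese characters in the de-redundant text into pinyin to obtain a de-redundant homophonic text;
    filtering the de-redundant homophonic text through the homophonic word indexer to obtain a second homophonic sensitive word set;
    obtaining the words in the de-redundant text that correspond to the homophonic sensitive words in the second homophonic sensitive word set, to obtain the fourth sensitive word set.
  6. The sensitive word detection method according to claim 1, wherein the obtaining a sensitive word lexicon from a preset sensitive word server comprises:
    if a sensitive word lexicon update notification sent by the sensitive word server is received, obtaining the download address of the updated sensitive word lexicon from the notification, the notification containing the download address;
    downloading the updated sensitive word lexicon from the download address.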
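  7. A sensitive word detection apparatus, comprising:
    a first obtaining unit, configured to obtain a sensitive word lexicon from a preset sensitive word server;
    a first construction unit, configured to construct a homophonic lexicon corresponding to the sensitive word lexicon;
    a second construction unit, configured to construct a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively;
    a first filtering unit, configured to, if a text to be tested is received, filter the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
    a second filtering unit, configured to remove non-Chinese characters from the text to be tested to obtain a de-redundant text, and to filter the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
    a third filtering unit, configured to filter the text to be tested through the homophonic word indexer to obtain a third sensitive word set;
    a fourth filtering unit, configured to filter the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set;
    a merging unit, configured to deduplicate and merge the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.
  8. The sensitive word detection apparatus according to claim 7, wherein the first construction unit comprises:
    a second obtaining unit, configured to obtain the pinyin of the sensitive words in the sensitive word lexicon;
    a storage unit, configured to take the pinyin of the sensitive words in the sensitive word lexicon as homophonic sensitive words, and to store the homophonic sensitive words in a preset blank database to obtain the homophonic lexicon.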
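  9. A computer device, comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the following steps:
    obtaining a sensitive word lexicon from a preset sensitive word server;
    constructing a homophonic lexicon corresponding to the sensitive word lexicon;
    constructing a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively;
    if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
    removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
    filtering the text to be tested through the homophonic word indexer to obtain a third sensitive word set;
    filtering the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set;
    deduplicating and merging the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.
  10. The computer device according to claim 9, wherein the step of constructing a homophonic lexicon corresponding to the sensitive word lexicon comprises:
    obtaining the pinyin of the sensitive words in the sensitive word lexicon;
    taking the pinyin of the sensitive words in the sensitive word lexicon as homophonic sensitive words, and storing the homophonic sensitive words in a preset blank database to obtain the homophonic lexicon.
  11. The computer device according to claim 9, wherein the step of constructing a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively, comprises:
    building the sensitive word indexer corresponding to the sensitive word lexicon using a double-array trie;
    building the homophonic word indexer corresponding to the homophonic lexicon using a double-array trie.
  12. The computer device according to claim 9, wherein the step of filtering the text to be tested through the homophonic word indexer to obtain a third sensitive word set comprises:
    converting the Chinese characters in the text to be tested into pinyin to obtain a homophonic text to be tested;
    filtering the homophonic text to be tested through the homophonic word indexer to obtain a first homophonic sensitive word set;
    obtaining the words in the text to be tested that correspond to the homophonic sensitive words in the first homophonic sensitive word set, to obtain the third sensitive word set.
  13. The computer device according to claim 9, wherein the step of filtering the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set comprises:
    converting the Chinese characters in the de-redundant text into pinyin to obtain a de-redundant homophonic text;
    filtering the de-redundant homophonic text through the homophonic word indexer to obtain a second homophonic sensitive word set;
    obtaining the words in the de-redundant text that correspond to the homophonic sensitive words in the second homophonic sensitive word set, to obtain the fourth sensitive word set.
  14. The computer device according to claim 9, wherein the step of obtaining a sensitive word lexicon from a preset sensitive word server comprises:
    if a sensitive word lexicon update notification sent by the sensitive word server is received, obtaining the download address of the updated sensitive word lexicon from the notification, the notification containing the download address;
    downloading the updated sensitive word lexicon from the download address.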
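  15. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
    obtaining a sensitive word lexicon from a preset sensitive word server;
    constructing a homophonic lexicon corresponding to the sensitive word lexicon;
    constructing a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively;
    if a text to be tested is received, filtering the text to be tested through the sensitive word indexer to obtain a first sensitive word set;
    removing non-Chinese characters from the text to be tested to obtain a de-redundant text, and filtering the de-redundant text through the sensitive word indexer to obtain a second sensitive word set;
    filtering the text to be tested through the homophonic word indexer to obtain a third sensitive word set;
    filtering the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set;
    deduplicating and merging the first sensitive word set, the second sensitive word set, the third sensitive word set, and the fourth sensitive word set to obtain a total sensitive word set.
  16. The computer-readable storage medium according to claim 15, wherein the step of constructing a homophonic lexicon corresponding to the sensitive word lexicon comprises:
    obtaining the pinyin of the sensitive words in the sensitive word lexicon;
    taking the pinyin of the sensitive words in the sensitive word lexicon as homophonic sensitive words, and storing the homophonic sensitive words in a preset blank database to obtain the homophonic lexicon.
  17. The computer-readable storage medium according to claim 15, wherein the step of constructing a sensitive word indexer and a homophonic word indexer according to the sensitive word lexicon and the homophonic lexicon, respectively, comprises:
    building the sensitive word indexer corresponding to the sensitive word lexicon using a double-array trie;
    building the homophonic word indexer corresponding to the homophonic lexicon using a double-array trie.
  18. The computer-readable storage medium according to claim 15, wherein the step of filtering the text to be tested through the homophonic word indexer to obtain a third sensitive word set comprises:
    converting the Chinese characters in the text to be tested into pinyin to obtain a homophonic text to be tested;
    filtering the homophonic text to be tested through the homophonic word indexer to obtain a first homophonic sensitive word set;
    obtaining the words in the text to be tested that correspond to the homophonic sensitive words in the first homophonic sensitive word set, to obtain the third sensitive word set.
  19. The computer-readable storage medium according to claim 15, wherein the step of filtering the de-redundant text through the homophonic word indexer to obtain a fourth sensitive word set comprises:
    converting the Chinese characters in the de-redundant text into pinyin to obtain a de-redundant homophonic text;
    filtering the de-redundant homophonic text through the homophonic word indexer to obtain a second homophonic sensitive word set;
    obtaining the words in the de-redundant text that correspond to the homophonic sensitive words in the second homophonic sensitive word set, to obtain the fourth sensitive word set.
  20. The computer-readable storage medium according to claim 15, wherein the step of obtaining a sensitive word lexicon from a preset sensitive word server comprises:
    if a sensitive word lexicon update notification sent by the sensitive word server is received, obtaining the download address of the updated sensitive word lexicon from the notification, the notification containing the download address;
    downloading the updated sensitive word lexicon from the download address.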
PCT/CN2020/118862 2020-07-16 2020-09-29 Sensitive word detection method, apparatus, computer device and storage medium WO2021139268A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010688343.7A CN111831785A (zh) 2020-07-16 2020-07-16 Sensitive word detection method, apparatus, computer device and storage medium
CN202010688343.7 2020-07-16

Publications (1)

Publication Number Publication Date
WO2021139268A1 true WO2021139268A1 (zh) 2021-07-15

Family

ID=72924338

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118862 WO2021139268A1 (zh) 2020-07-16 2020-09-29 Sensitive word detection method, apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111831785A (zh)
WO (1) WO2021139268A1 (zh)

Also Published As

Publication number Publication date
CN111831785A (zh) 2020-10-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912472

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912472

Country of ref document: EP

Kind code of ref document: A1