WO2021151333A1 - Sensitive word recognition method and apparatus based on artificial intelligence, and computer device - Google Patents


Info

Publication number
WO2021151333A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
word
combination
target
word slot
Prior art date
Application number
PCT/CN2020/124684
Other languages
French (fr)
Chinese (zh)
Inventor
吕焕焕
姜国玮
张冬
李飞鹏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021151333A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer equipment for identifying sensitive words based on artificial intelligence.
  • This application provides an artificial intelligence-based method, device, and computer equipment for identifying sensitive words. Its main purpose is to solve the technical problem that current traditional sensitive word filtering methods identify sensitive words with low accuracy.
  • A method for identifying sensitive words based on artificial intelligence includes: acquiring text information to be recognized; identifying a target word slot combination contained in the text information, wherein the target word slot combination is composed of at least one preset word slot; determining, according to the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and, if the text information contains sensitive words, performing restriction processing on the text information.
  • An artificial intelligence-based sensitive word recognition device includes: an acquisition module for acquiring text information to be recognized; a recognition module for recognizing a target word slot combination contained in the text information, wherein the target word slot combination is composed of at least one preset word slot; a judgment module for judging, according to the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and a processing module for performing restriction processing on the text information if it is determined that the text information contains sensitive words.
  • A readable storage medium has computer-readable instructions stored thereon. When the computer-readable instructions are executed by a processor, the following method is implemented: acquiring text information to be recognized; identifying a target word slot combination contained in the text information, wherein the target word slot combination is composed of at least one preset word slot; determining, according to the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and, if it is determined that the text information contains sensitive words, performing restriction processing on the text information.
  • A computer device includes a readable storage medium, a processor, and computer-readable instructions stored on the readable storage medium and executable on the processor. When executing the computer-readable instructions, the processor implements the following method: acquiring text information to be recognized; identifying a target word slot combination contained in the text information, wherein the target word slot combination is composed of at least one preset word slot; determining, according to the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and, if it is determined that the text information contains sensitive words, performing restriction processing on the text information.
  • This application can accurately identify whether the text information contains sensitive words, which can improve the accuracy of sensitive word recognition and improve the processing efficiency of sensitive words.
  • FIG. 1 shows a schematic flowchart of an artificial intelligence-based sensitive word recognition method provided by an embodiment of the present application.
  • FIG. 2 shows a schematic flowchart of another method for identifying sensitive words based on artificial intelligence according to an embodiment of the present application.
  • FIG. 3 shows a schematic structural diagram of an artificial intelligence-based sensitive word recognition device provided by an embodiment of the present application.
  • the technical solution of this application can be applied to the fields of artificial intelligence, blockchain and/or big data technology to realize sensitive word recognition.
  • the data involved in this application such as text information, can be stored in a database, or can be stored in a blockchain, which is not limited in this application.
  • this embodiment provides an artificial intelligence-based sensitive word recognition method. As shown in FIG. 1, the method includes the following steps.
  • The text information to be recognized can be the text of a communication message that is about to be published, such as a message being sent in instant messaging software, the online conversation between platform customer service staff and a user, or text being posted on a public platform (such as web comments, product reviews, or video bullet comments).
  • The text information to be recognized can also be text within a specified range (such as a specified range of text in a publicly published electronic book or in a publicly issued notification message).
  • the execution subject of this embodiment may be a device or equipment for sensitive word recognition and processing, which may be deployed on the client or server, etc., which can improve the accuracy of sensitive word recognition.
  • the target word slot combination is composed of at least one preset word slot.
  • Word slots can be set in advance and determined according to different sensitive words. Specifically, they can include: sensitive-word slots (such as "reduction of principal", "reduction of rent", and "personal loan", as well as slots for bank card numbers, ID numbers, account password formats, and strings of digits or symbols); non-sensitive-word slots (such as "no" and "must", as well as single digits and single characters); slots for synonyms of sensitive words (words that are essentially synonymous with a sensitive word but do not themselves fall within the scope of the sensitive words); and slots for each segment obtained by splitting a sensitive word (for example, the three word slots "go to your", "unit", and "investigation" obtained by splitting the sensitive phrase "go to your unit to investigate"). These word slots are then combined according to the sensitive words to be recognized, yielding the word slot combinations.
  • The word slot combinations compiled in advance can be stored in a predetermined storage location (such as a database or a mapping table).
  • Each segment obtained by word segmentation of the text information can be matched against the word slot combinations in the predetermined storage location, and a matching word slot combination is taken as the target word slot combination contained in the text information.
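  • The matching step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text is assumed to be already segmented into tokens, and the names `SLOT_COMBINATIONS` and `find_target_combinations` are hypothetical, with a small in-memory dictionary standing in for the database or mapping table.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory store of pre-built word slot combinations
# (a stand-in for the database or mapping table mentioned in the text).
SLOT_COMBINATIONS: Dict[str, Tuple[str, ...]] = {
    "visit_workplace": ("go to your", "unit", "investigation"),
    "principal_reduction": ("will not", "reduction of principal"),
}


def find_target_combinations(tokens: List[str]) -> List[str]:
    """Return the names of stored combinations whose slots all occur in tokens."""
    present = set(tokens)
    return [
        name
        for name, slots in SLOT_COMBINATIONS.items()
        if all(slot in present for slot in slots)
    ]
```

  • A combination is reported only when every one of its slots is present; the order of the slots and the words between them are left to the later judgment step.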
  • The intermediate word information may be the words that appear in the text information between the word slots of the word slot combination.
  • For example, suppose the text message is "XX finds someone to go to your unit, and finds out XX after doing a background investigation on you", where XX stands for words omitted from the message. The target word slot combination contained in the message is "go to your" + "unit" + "investigation"; the "de" between "go to your" and "unit", and the "doing a background on you" between "unit" and "investigation", are the intermediate words.
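  • As a rough illustration of how intermediate words could be extracted, the sketch below collects the text found between consecutive slots. The function name `intermediate_words` is hypothetical, the first occurrence of each slot is used, and the English sample in the test is a stand-in for the original Chinese message, so its character counts are not those of the source example.

```python
from typing import List, Optional


def intermediate_words(text: str, slots: List[str]) -> Optional[List[str]]:
    """Return the text between consecutive slots, or None if the slots
    do not all appear in the given order."""
    gaps: List[str] = []
    pos = 0
    for i, slot in enumerate(slots):
        idx = text.find(slot, pos)  # first occurrence at or after pos
        if idx == -1:
            return None
        if i > 0:
            gaps.append(text[pos:idx])  # text since the end of the previous slot
        pos = idx + len(slot)
    return gaps
```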
  • The word slot combination corresponding to a sensitive word has, to a certain extent, the same meaning as the sensitive word. It can be a word slot combination composed of the sensitive word itself, or a combination that is not itself a sensitive word but carries the meaning of one.
  • In practice, published text that contains sensitive words is often interspersed with spaces or symbols, padded with extra words, or rewritten in other wording with the same meaning, all of which reduce the accuracy of judging whether the text information contains sensitive words. This embodiment judges not only the word slot combination but also the intermediate word information of the word slot combination in the text information, so it can accurately identify whether the text information contains sensitive words in these cases, which improves the accuracy of sensitive word recognition.
  • When it is determined that the text information contains sensitive words, the text information can be marked to flag the presence of sensitive words, for example by emphasizing the text part that contains the target word slot combination (highlighting, bolding, underlining, etc.), or the sending of communication messages containing the text information can be restricted.
  • With the method of this embodiment, the target word slot combination contained in the text information is identified first, wherein the word slot combination is composed of at least one preset word slot. Then, according to the target word slot combination and its intermediate word information in the text information, it is determined whether the text information contains sensitive words.
  • Because this embodiment judges both the word slot combination and the intermediate words between its slots, it can accurately identify whether the text information contains sensitive words even if symbols or spaces are inserted into the sensitive words, extra words are added, or the text is rewritten in other wording with the same meaning, which improves the accuracy of sensitive word recognition. If it is determined that the text information contains sensitive words, the text information can also be restricted in time. The entire process of sensitive word recognition plus restriction processing can be automated, which improves the efficiency of sensitive word processing.
  • Step 201 can specifically include: obtaining the text information to be recognized from the blockchain.
  • the text information to be recognized can be obtained from the target node of the blockchain, and then sensitive word recognition can be performed on the text information.
  • the blockchain referred to in this embodiment is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • The preset special symbols can be symbols such as "@", "#", "/", and "*".
  • the character spaces and preset special symbols in the text information are cleared, which can effectively reduce noise interference and improve the accurate matching of word slot combinations and corresponding detection rules.
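  • A minimal sketch of this cleaning step is given below. The symbol set `SPECIAL_SYMBOLS` and the function name `clean_text` are assumptions made for illustration; a real deployment would configure its own symbol list.

```python
import re

# Assumed symbol set; "@", "#", "/" and "*" are among the examples in the text.
SPECIAL_SYMBOLS = "@#/*"


def clean_text(text: str, symbols: str = SPECIAL_SYMBOLS) -> str:
    """Remove all whitespace and every preset special symbol from text."""
    # Build a character class such as [\s@\#/\*]; re.escape guards the symbols.
    pattern = "[\\s" + re.escape(symbols) + "]"
    return re.sub(pattern, "", text)
```

  • Cleaning before matching means that a sensitive word split by inserted spaces or symbols is restored to a contiguous string that the word slots can match.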
  • The sensitive word recognition rules can be configured first. The rule configuration can be divided into three layers: slot, rule, and model.
  • A word slot contains preset keywords such as sensitive words, non-sensitive words, and synonyms of sensitive words.
  • A rule is a combination of word slots (equivalent to a preset verification rule, that is, a criterion that the text information must meet for sensitive words to be deemed present).
  • A model is a combination of rules (equivalent to a combination of multiple verification rules).
  • rules and models can be freely combined, and a sensitive word filtering strategy that meets the requirements can be formulated according to the business scenario. For example, after establishing word slots, rules, and models, each participle in the text information after clearing character spaces and preset special symbols is matched with the word slot combination in the rule, and then the target word slot combination contained in it is found.
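  • One possible way to represent the three configuration layers is sketched below. The class and field names (`Slot`, `Rule`, `Model`, `criterion`, `gap_threshold`) are hypothetical and are not taken from the application; they only show how slots nest into rules and rules into models.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Slot:
    """Slot layer: a named group of preset keywords."""
    name: str
    keywords: List[str]


@dataclass
class Rule:
    """Rule layer: a combination of word slots plus a judgment criterion."""
    name: str
    slots: List[Slot]
    criterion: str = "and"   # hypothetical labels: "and" or "non"
    gap_threshold: int = 8   # hypothetical limit on intermediate characters


@dataclass
class Model:
    """Model layer: a combination of rules."""
    name: str
    rules: List[Rule] = field(default_factory=list)

    def slot_names(self) -> Set[str]:
        """Names of all slots referenced by this model's rules."""
        return {slot.name for rule in self.rules for slot in rule.slots}
```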
  • a single word slot combination may correspond to at least one target verification rule, and each verification rule is equivalent to a preset criterion for containing sensitive words.
  • a single word slot combination corresponds to at least two verification rules, it is equivalent to a combination of verification rules.
  • According to actual needs, this embodiment can determine in advance whether a word slot combination corresponds to a single verification rule or to a verification rule combination containing at least two verification rules. That is, a word slot combination can be limited to act at the rule layer or at the model layer.
  • By limiting the scope of action, the range of sensitive word detection, that is, the content to be detected, can be limited.
  • The scope of action is the rule layer or the model layer. Detection is performed within the specified scope, so the verification rules can be used flexibly for sensitive word recognition. Multiple verification rules can then be used to make accurate judgments from different angles, which can improve the accuracy of sensitive word recognition.
  • At least one target verification rule corresponding to each target word slot combination is combined to obtain a target verification rule combination, and the target verification rule combination includes at least one preset sensitive word determination criterion.
  • According to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, it is determined, for each of the multiple preset sensitive word determination criteria in the target verification rule combination, whether the text information meets that criterion.
  • Step 205 may specifically include the following. If the sensitive word determination criterion is that each word slot in the target word slot combination appears in the text information and the criterion is met when the number of intermediate words is within a limited range, then the text information is determined to contain sensitive words when the word slot arrangement information (such as the order in which the word slots appear in the text) conforms to the preset word slot sequence corresponding to the target word slot combination and the number of intermediate words is less than or equal to the preset number threshold. If the sensitive word determination criterion is that each word slot in the target word slot combination appears in the text information and the criterion is not met when the number of intermediate words is within a limited range, then the text information is determined to contain sensitive words when the word slot arrangement information conforms to the preset word slot sequence corresponding to the target word slot combination and the number of intermediate words is greater than or equal to the preset number threshold.
  • each word slot combination has its own corresponding preset word slot sequence, which is used to determine whether it has the meaning of a sensitive word.
  • the word slot combination can correspond to at least one preset word slot sequence according to actual conditions.
  • the middle word part is used to modify the semantics of the word slot combination, and the preset number threshold is used to determine whether the language-modified word slot combination still has the meaning of a sensitive word.
  • the threshold size can be preset according to the actual situation.
  • For example, for a word slot combination composed of the word slots "go to your", "unit", and "investigation", matching yields a preset verification rule with the [and] criterion, and the number of editable characters allowed among the three word slots is 8. If the message sent by the user contains all three words and there are fewer than 8 characters between them, the message is judged to meet the [and] criterion, that is, it is determined that the message contains sensitive words. If there are more than 8 characters among the three word slots, the message is judged not to meet the [and] criterion, that is, it is determined that the message does not contain sensitive words.
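  • The [and] criterion can be sketched as below. This is a simplified illustration, not the application's implementation: the function name `and_rule_hit` is hypothetical, the intermediate characters are summed over all gaps, the first occurrence of each slot is used, and the comparison is "less than or equal to", following the general statement of the threshold earlier in the text.

```python
from typing import List


def and_rule_hit(text: str, slots: List[str], max_gap_chars: int) -> bool:
    """True if all slots appear in order and the total number of
    characters between adjacent slots is at most max_gap_chars."""
    pos = 0
    total_gap = 0
    for i, slot in enumerate(slots):
        idx = text.find(slot, pos)  # first occurrence at or after pos
        if idx == -1:
            return False  # a slot is missing or out of order
        if i > 0:
            total_gap += idx - pos  # characters since the previous slot ended
        pos = idx + len(slot)
    return total_gap <= max_gap_chars
```

  • Whether the limit applies to each gap separately or to all gaps together is not fully specified in the text; switching to a per-gap check would only require comparing `idx - pos` against the threshold inside the loop.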
  • As another example, for a word slot combination consisting of the sensitive-word slot "reduction of principal" and the non-sensitive-word slot "will not", matching yields a preset verification rule with the [Non] criterion, and the number of editable characters allowed between the two word slots is 3. If the information sent by the user contains both words and there are fewer than 3 characters between them, it is judged to meet the [Non] criterion, that is, it is determined that the information does not contain sensitive words.
  • A [Non] verification rule sets one sensitive-word slot and one non-sensitive-word slot. If the two words appear together, the rule is not hit, and the number of intermediate characters allowed can be configured.
  • In one example text, the number of intermediate words between the two word slots "no" and "principal reduction" is 0, so the text is considered not to hit the corresponding verification rule, and it is determined that the text does not contain sensitive words.
  • In another example text, which contains "the principal will be reduced", the intermediate words between the two word slots "no" and "principal reduction" are ", you can rest assured, will definitely". Their number is greater than 3, so the text is considered to hit the corresponding verification rule, and it is determined that the text contains sensitive words.
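  • The [Non] criterion can be sketched in the same style. The function name `non_rule_hit` is hypothetical, and the sketch assumes the rule only applies when both slots appear, with the non-sensitive slot first; the English strings in the test are stand-ins, so their character counts differ from the original Chinese examples.

```python
def non_rule_hit(text: str, negation: str, sensitive: str, min_gap_chars: int) -> bool:
    """True if the negation slot is followed by the sensitive slot with at
    least min_gap_chars characters in between, so the negation is judged
    too far away to cancel the sensitive word."""
    neg_idx = text.find(negation)
    if neg_idx == -1:
        return False  # rule does not apply: negation slot absent
    start = neg_idx + len(negation)
    sens_idx = text.find(sensitive, start)
    if sens_idx == -1:
        return False  # rule does not apply: sensitive slot absent after negation
    return (sens_idx - start) >= min_gap_chars
```

  • With a threshold of 3, a negation directly adjacent to the sensitive word yields no hit, while a negation separated from it by a longer phrase yields a hit, mirroring the two example texts above.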
  • the combination of verification rules may include at least two verification rules, which is equivalent to identifying sensitive words through the model in step 203.
  • For example, the combination of verification rules contains three verification rules: the first is ID verification, the second is sensitive word + [and] verification, and the third is sensitive word + [Non] verification.
  • When verification rule 1 is used to identify sensitive words, it can be recognized whether the text information (after removing noise such as character spaces, preset special symbols, and rare characters) contains a word slot consisting of a string of digits. If it contains this type of word slot, it can be determined whether the string of digits corresponding to the word slot conforms to the ID card format.
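  • A rough sketch of such an ID check follows. The text does not define the ID card format, so the pattern below is an assumption: it accepts the 18-character form of mainland Chinese resident ID numbers (17 digits followed by a digit or "X") and does not validate the check digit. The names `ID_PATTERN` and `contains_id_like_number` are hypothetical.

```python
import re

# Assumed format: 17 digits followed by a digit or the letter X (18 characters).
ID_PATTERN = re.compile(r"\d{17}[\dXx]")


def contains_id_like_number(text: str) -> bool:
    """True if some maximal run of digits (optionally ending in X)
    has exactly the assumed ID card shape."""
    runs = re.findall(r"\d+[Xx]?", text)  # candidate digit-string slots
    return any(ID_PATTERN.fullmatch(run) for run in runs)
```

  • Matching on maximal digit runs keeps a longer number, such as a 19-digit card number, from being misread as an ID number embedded inside it.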
  • Step 205 may further specifically include: if the target verification rule combination includes at least one preset sensitive word determination criterion with different execution priorities, the text information is judged against each sensitive word determination criterion in the combination in turn, in order of execution priority from high to low. During this sequential judgment, once a sensitive word determination criterion that the text information meets is found, subsequent judgment of the text information is stopped, and the currently obtained result is used as the result of judging the text information with the target verification rule combination.
  • For example, the target verification rule combination contains five verification rules, each with a preset execution priority within the combination (for example, priorities ordered by sensitive word recognition success rate from high to low). The order of execution priority from high to low is: verification rule one > verification rule three > verification rule four > verification rule five > verification rule two. The text information is judged with the corresponding verification rules in this order. If verification rule three determines that the text information contains sensitive words, the subsequent checks with verification rule four, verification rule five, and verification rule two are skipped. With this optional method, there is no need to run every check, and the judgment result can be obtained as quickly as possible, which improves the efficiency of sensitive word recognition.
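  • The priority-ordered evaluation with early stopping can be sketched as follows. The tuple layout `(name, priority, check)` and the function name `first_hit` are assumptions for illustration; each check is any callable that returns True when its criterion is met.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical rule representation: (name, execution priority, check function).
PrioritizedRule = Tuple[str, int, Callable[[str], bool]]


def first_hit(text: str, rules: List[PrioritizedRule]) -> Optional[str]:
    """Run checks from highest to lowest priority and return the name of
    the first rule that is met; later rules are not evaluated."""
    for name, _priority, check in sorted(rules, key=lambda r: r[1], reverse=True):
        if check(text):
            return name
    return None  # no criterion met: no sensitive words found
```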
  • The evaluation priority range can also be limited. As with the four arithmetic operations, the verification rules that fall within the priority range are executed first. Where specific rule symbols in the verification rules carry different meanings, the verification rules that belong to the priority range can be placed in brackets "()". When the verification rules are executed, the rules inside the brackets are executed first, and then the other verification rules are executed.
  • If the text information meets at least one set of sensitive word determination criteria in the target verification rule combination, it is determined that the text information contains sensitive words.
  • A set of sensitive word determination criteria may include at least one criterion, that is, one, two, or more criteria, as determined by the actual accuracy requirements for sensitive word judgment.
  • In step 206b, which is parallel to step 206a, if the text information meets none of the sensitive word determination criteria in the target verification rule combination, it is determined that the text information does not contain sensitive words.
  • a plurality of preset sensitive word judgment criteria in the combination of inspection rules may be used to perform sensitive word recognition judgment on the text information respectively. If the text information is judged to meet at least one of the preset sensitive word judgment criteria, then It can be determined that the text information contains sensitive words, which can improve the accuracy of sensitive word recognition.
  • Restricting the text information may specifically include: preventing the publication of the text information; or replacing the text part of the text information that contains the target word slot combination with preset characters (such as "*" or "-", which has a desensitizing effect) before publishing; or sending the text information to the review module for review and publishing it if the review is passed.
  • the system will prevent the user from publishing sensitive words, or directly delete the content containing sensitive words sent by the user. For some less sensitive words, they will not be deleted immediately after they are sent out, and the reviewers need to conduct a second manual review.
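  • The three restriction options can be sketched as below. The `Action` enum, the `restrict` function, and its return convention (text to publish, or None when nothing is published yet, plus a flag for manual review) are hypothetical choices made for this illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    BLOCK = "block"    # prevent publication
    MASK = "mask"      # replace the matched part with preset characters
    REVIEW = "review"  # hold the text for the review module


def restrict(text: str, span: Tuple[int, int], action: Action,
             mask_char: str = "*") -> Tuple[Optional[str], bool]:
    """Return (text to publish or None, whether manual review is needed)."""
    start, end = span  # character range covering the target word slot combination
    if action is Action.BLOCK:
        return None, False
    if action is Action.MASK:
        masked = text[:start] + mask_char * (end - start) + text[end:]
        return masked, False
    return None, True  # REVIEW: nothing is published until the review passes
```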
  • The method of this embodiment may further include: recording the text part of the text information that contains the target word slot combination as sample data; periodically analyzing the recorded sample data to find word combinations that appear in the sample data more often than a preset frequency threshold and that differ from the existing word slot combinations; calculating the semantic similarity between the word combinations found and preset sensitive words and/or preset sensitive sentences; taking each target word combination whose semantic similarity is greater than a preset similarity threshold as a new word slot combination, and updating the verification rule corresponding to the new word slot combination according to the sample data containing it; and subsequently using the new word slot combination and its verification rule to determine whether other text information contains sensitive words.
  • the automatic update of the sensitive word recognition system can be realized, so as to further improve the accuracy of subsequent sensitive word recognition.
  • the entire sensitive word recognition system is equivalent to having the function of machine learning, which can realize the accurate recognition of sensitive words by artificial intelligence.
  • text data determined to contain sensitive words may sometimes include other word combinations with the meaning of sensitive words.
  • This embodiment collects these text data as sample data and analyzes them regularly. It finds word combinations that appear more often than a certain threshold and differ from the existing word slot combinations, calculates their semantic similarity to the preset sensitive words and/or preset sensitive sentences, and thereby finds new word slot combinations that were not discovered before but also carry the meaning of sensitive words, for which corresponding verification rules are then formulated.
  • the new word slot combination and corresponding inspection rules can be used to determine whether other text information contains sensitive words, so as to find more text data that actually has the meaning of sensitive words.
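  • The discovery of new word slot combinations can be sketched as below. The text does not say how semantic similarity is computed, so the default here is only a placeholder: `difflib.SequenceMatcher` measures string similarity, not semantics, and a real system would substitute a semantic model through the `similarity` parameter. All names (`string_similarity`, `discover_combinations`) are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Set


def string_similarity(a: str, b: str) -> float:
    """Placeholder for semantic similarity: plain string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def discover_combinations(
    samples: Iterable[List[str]],
    known: Set[str],
    sensitive_words: List[str],
    min_count: int,
    min_similarity: float,
    similarity: Callable[[str, str], float] = string_similarity,
) -> List[str]:
    """Return frequent, previously unknown phrases that are similar enough
    to at least one preset sensitive word."""
    # Count each phrase once per sample so repeats inside one sample do not inflate it.
    counts = Counter(phrase for sample in samples for phrase in set(sample))
    candidates = [p for p, c in counts.items() if c > min_count and p not in known]
    return sorted(
        p for p in candidates
        if any(similarity(p, w) > min_similarity for w in sensitive_words)
    )
```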
  • the method of this embodiment can also be applied to a system for intelligent sensitive word quality inspection.
  • Algorithms can be used to match entries, and specific rules and strategies can be set to reduce noise interference, so that accurate sensitive word filtering can be performed across text. After the sensitive vocabulary is constructed, the algorithm traverses the text and matches it against the sensitive word tree to identify and filter sensitive vocabulary. Intelligent strategies can be customized according to customer needs to efficiently filter prohibited messages, malicious promotion, vulgar abuse, low-quality spam posts, and other sensitive words and their prohibited variants.
  • the intelligent quality inspection system has a high accuracy of content review and recognition, which can quickly process text, greatly reduce the workload of manual review, eliminate online risks, improve content output quality, purify the network environment, and ensure a good user experience.
  • this embodiment provides an artificial intelligence-based sensitive word recognition device.
  • The device includes: an acquisition module 31, a recognition module 32, a judgment module 33, and a processing module 34.
  • The acquisition module 31 is used to obtain the text information to be recognized.
  • the recognition module 32 is configured to recognize a target word slot combination contained in the text information, wherein the target word slot combination is composed of at least one predetermined word slot.
  • the judgment module 33 is configured to judge whether the text information contains sensitive words according to the target word slot combination and the intermediate word information of the target word slot combination in the text information.
  • the processing module 34 is configured to perform filtering processing on the text information if it is determined that the text information contains sensitive words.
  • The judgment module 33 is specifically configured to: obtain a target verification rule combination according to at least one target verification rule corresponding to each target word slot combination; determine, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the multiple preset sensitive word determination criteria in the target verification rule combination; if the text information meets at least one set of sensitive word determination criteria in the target verification rule combination, determine that the text information contains sensitive words; and if the text information meets none of the sensitive word determination criteria in the target verification rule combination, determine that the text information does not contain sensitive words.
  • The judgment module 33 is specifically configured as follows. If the sensitive word determination criterion is that each word slot in the target word slot combination appears in the text information and the criterion is met when the number of intermediate words is within a limited range, the module determines that the text information contains sensitive words when the word slot arrangement information conforms to the preset word slot sequence corresponding to the target word slot combination and the number of intermediate words is less than or equal to the preset number threshold. If the sensitive word determination criterion is that each word slot in the target word slot combination appears in the text information and the criterion is not met when the number of intermediate words is within a limited range, the module determines that the text information contains sensitive words when the word slot arrangement information conforms to the preset word slot sequence corresponding to the target word slot combination and the number of intermediate words is greater than or equal to the preset number threshold.
  • The judgment module 33 is further specifically configured to: if the target verification rule combination includes at least one preset sensitive word determination criterion with different execution priorities, judge the text information against each sensitive word determination criterion in the combination in turn, in order of execution priority from high to low; and, during this sequential judgment, once a sensitive word determination criterion that the text information meets is found, stop the subsequent judgment of the text information and use the currently obtained result as the result of judging the text information with the target verification rule combination.
  • the device also includes: a recording module and an analysis module.
  • the recording module is configured to record the text part of the text information containing the target word slot combination as sample data after the restriction processing is performed on the text information.
  • the analysis module is configured to: periodically analyze the recorded sample data and count the word combinations in the sample data that occur more often than a preset frequency threshold and differ from the existing word slot combinations; compute the semantic similarity between these word combinations and preset sensitive words and/or preset sensitive sentences; take each target word combination whose semantic similarity exceeds a preset similarity threshold as a new word slot combination, and update the verification rule corresponding to the new word slot combination based on the sample data of that combination; the new word slot combination and its verification rule are then used to determine whether other text information contains sensitive words.
  • the processing module 34 is specifically configured to prevent the publication of the text information; or to replace the text part of the text information containing the target word slot combination with preset characters before publishing; or to send the text information to the review module for review and publish it if the review is passed.
  • the text information is pre-stored in the blockchain; correspondingly, the obtaining module 31 is specifically configured to obtain the text information from the blockchain; the recognition module 32 is specifically configured to clear the character spaces and preset special symbols in the text information, and to identify the target word slot combination contained in the text information after the character spaces and preset special symbols have been removed.
  • this embodiment also provides a readable storage medium on which computer-readable instructions are stored.
  • when the computer-readable instructions are executed by a processor, the artificial intelligence-based sensitive word recognition method shown in Figure 1 and Figure 2 is implemented.
  • the readable storage medium involved in this application may be a computer readable storage medium.
  • the storage medium involved in this application such as a readable storage medium, may be non-volatile, such as a non-volatile readable storage medium, or may be volatile, such as a volatile readable storage medium.
  • the technical solution of this application can be embodied in the form of a software product.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods in each implementation scenario of the present application.
  • this embodiment also provides a computer device, which may be a personal computer, a notebook computer, or a server.
  • the physical device includes a storage medium and a processor; the storage medium is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions to implement the aforementioned artificial intelligence-based sensitive word recognition method.
  • the storage medium may be a readable storage medium, such as a nonvolatile readable storage medium or a volatile readable storage medium.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and so on.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, and the like.
  • the optional network interface can include standard wired interface, wireless interface (such as Bluetooth interface, WI-FI interface), etc.
  • the computer device structure provided in this embodiment does not constitute a limitation on the physical device, and may include more or fewer components, or combine certain components, or arrange different components.
  • the storage medium may also include an operating system and a network communication module.
  • the operating system is a program that manages the hardware and software resources of the aforementioned physical devices, and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to realize the communication between the various components in the storage medium and the communication with other hardware and software in the physical device.
  • the target word slot combination contained in the text information can be identified, where the word slot combination is composed of at least one preset word slot; then, based on the target word slot combination and the intermediate word information of the target word slot combination in the text information, it is determined whether the text information contains sensitive words.
  • this embodiment judges both the word slot combination and the intermediate words between its word slots, so that even if symbols or spaces are inserted into a sensitive word, extra words are added, or the text is rewritten with the same meaning in other words, it can still accurately identify whether the text information contains sensitive words, which improves the accuracy of sensitive word recognition. If the text information is determined to contain sensitive words, it can also be restricted promptly. The whole process of sensitive word recognition plus restriction processing can be automated, which improves the efficiency of sensitive word processing.
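The priority-ordered judgment described above for the judging module can be sketched as follows. This is a minimal illustration only: the tuple layout, the function name `first_matching_criterion`, and the convention that a larger number means a higher priority are assumptions made here, not details given in the application.

```python
from typing import Callable, List, Optional, Tuple

# (priority, name, predicate); hypothetical layout, larger number = higher priority.
Criterion = Tuple[int, str, Callable[[str], bool]]


def first_matching_criterion(text: str, criteria: List[Criterion]) -> Optional[str]:
    """Check criteria from highest to lowest priority and stop at the first one met."""
    ordered = sorted(criteria, key=lambda c: c[0], reverse=True)
    for _, name, predicate in ordered:
        if predicate(text):
            # Remaining lower-priority criteria are not evaluated.
            return name
    return None
```

Returning the name of the first criterion that is met mirrors the idea that the currently obtained result becomes the result for the whole rule combination.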

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A sensitive word recognition method and apparatus based on artificial intelligence, and a computer device, relating to the technical field of artificial intelligence. The method comprises: firstly, acquiring text information to be subjected to recognition (101); then carrying out recognition on a target word slot combination included in the text information (102), wherein the target word slot combination is composed of at least one preset word slot; next, determining, according to the target word slot combination and intermediate character and word information of the target word slot combination in the text information, whether the text information includes a sensitive word (103); and if it is determined that the text information includes a sensitive word, performing limitation processing on the text information (104). The method can improve the accuracy of sensitive word recognition. In addition, the method also relates to blockchain technology, and text data can be stored in a blockchain, so as to ensure data privacy and security.

Description

Artificial intelligence-based sensitive word recognition method, apparatus, and computer device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 7, 2020, with application number 202010927419.7 and the invention title "Artificial Intelligence-based Sensitive Word Recognition Method, Apparatus, and Computer Device", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to an artificial intelligence-based sensitive word recognition method, apparatus, and computer device.
Background
With the development of social software, the ways users communicate with one another have become increasingly diverse. This also brings an unavoidable problem: the transmitted content cannot be effectively controlled, and harmful information containing sensitive words may reach users through various channels and cause adverse effects. This has created a need for sensitive word filtering: performing effective sensitive word quality checks when users publish content, so that the published content complies with the applicable standards.
The inventors realized that traditional sensitive word filtering usually works one-to-one: it matches sensitive words with relatively simple regular expressions, or maintains a sensitive word lexicon and looks up the input text in the corresponding table. For example, to filter out sensitive word A, the filter searches the string for everything that matches the given regular expression, or checks whether the sensitive word lexicon contains a corresponding word or content, and returns the corresponding result if it does.
However, the inventors found in their research that traditional filtering is severely limited: it can only match specific content and is easy to circumvent, for example by inserting symbols or spaces in the middle of a sensitive word to bypass the matching rules. As a result it fails to achieve the intended recognition effect, which reduces the accuracy of sensitive word recognition.
Technical problem
In view of this, this application provides an artificial intelligence-based sensitive word recognition method, apparatus, and computer device, whose main purpose is to address the technical problem that traditional sensitive word filtering yields low recognition accuracy.
Technical solution
According to one aspect of this application, an artificial intelligence-based sensitive word recognition method is provided. The method includes: acquiring text information to be recognized; identifying a target word slot combination contained in the text information, where the target word slot combination is composed of at least one preset word slot; determining, based on the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and, if the text information is determined to contain sensitive words, performing restriction processing on the text information.
According to another aspect of this application, an artificial intelligence-based sensitive word recognition apparatus is provided. The apparatus includes: an acquisition module, configured to acquire text information to be recognized; a recognition module, configured to identify a target word slot combination contained in the text information, where the target word slot combination is composed of at least one preset word slot; a judging module, configured to determine, based on the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and a processing module, configured to perform restriction processing on the text information if it is determined to contain sensitive words.
According to yet another aspect of this application, a readable storage medium is provided, on which computer-readable instructions are stored. When executed by a processor, the computer-readable instructions implement the following method: acquiring text information to be recognized; identifying a target word slot combination contained in the text information, where the target word slot combination is composed of at least one preset word slot; determining, based on the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and, if the text information is determined to contain sensitive words, performing restriction processing on the text information.
According to a further aspect of this application, a computer device is provided, including a readable storage medium, a processor, and computer-readable instructions stored on the readable storage medium and executable on the processor. When executing the computer-readable instructions, the processor implements the following method: acquiring text information to be recognized; identifying a target word slot combination contained in the text information, where the target word slot combination is composed of at least one preset word slot; determining, based on the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words; and, if the text information is determined to contain sensitive words, performing restriction processing on the text information.
Beneficial effects
This application can accurately identify whether text information contains sensitive words, which improves both the accuracy of sensitive word recognition and the efficiency of sensitive word processing.
The above is only an overview of the technical solution of this application. So that the technical means of this application can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of this application are more readily apparent, specific embodiments of this application are set out below.
Description of the drawings
The drawings described here are provided for a further understanding of this application and form a part of it. The illustrative embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application.
Figure 1 shows a schematic flowchart of an artificial intelligence-based sensitive word recognition method provided by an embodiment of this application.
Figure 2 shows a schematic flowchart of another artificial intelligence-based sensitive word recognition method provided by an embodiment of this application.
Figure 3 shows a schematic structural diagram of an artificial intelligence-based sensitive word recognition apparatus provided by an embodiment of this application.
Embodiments of the invention
This application is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features of those embodiments can be combined with one another.
The technical solution of this application can be applied in the fields of artificial intelligence, blockchain, and/or big data to implement sensitive word recognition. Optionally, the data involved in this application, such as text information, can be stored in a database or in a blockchain; this application does not limit this.
To address the technical problem that traditional sensitive word filtering yields low recognition accuracy, this embodiment provides an artificial intelligence-based sensitive word recognition method. As shown in Figure 1, the method includes the following steps.
101. Acquire the text information to be recognized.
The text information to be recognized may be the text of a communication message awaiting publication, such as a message being sent in instant messaging software, the online conversation between a platform's customer service staff and a user, or text being posted on a public platform (for example, a web page comment, a product review, or a video bullet comment). The text information to be recognized may also be text within a specified range, such as a specified passage of a publicly released electronic publication or of a publicly issued notice.
The executing entity of this embodiment may be an apparatus or device for sensitive word recognition and processing, deployed on the client side or the server side, which can improve the accuracy of sensitive word recognition.
102. Identify the target word slot combination contained in the text information.
The target word slot combination is composed of at least one preset word slot. In this embodiment, word slots can be set in advance and determined according to different sensitive words. Specifically, they may include word slots for sensitive words (such as “减免本金” (waive principal), “减免租金” (waive rent), and “个人贷款” (personal loan), as well as word slots for digit sequences matching the format of a bank card number, ID card number, or account password), word slots for non-sensitive words (such as “不会” (will not) and “必须” (must), as well as single digits and single characters), word slots for synonyms of sensitive words (words substantially synonymous with a sensitive word but outside the scope of the sensitive words themselves), and the segments obtained by splitting a sensitive word (for example, the sensitive phrase “去你单位调查” (go to your workplace to investigate) splits into the three word slots “去你”, “单位”, and “调查”). These word slots are then combined according to the sensitive words they are meant to recognize, yielding word slot combinations.
In this embodiment, the precompiled word slot combinations can be saved in a predetermined storage location (such as a database or a mapping table). Later, when identifying the word slot combinations contained in the text information, each segmented word of the text information is matched against the word slot combinations in the predetermined storage location, and any matching combination is taken as a target word slot combination contained in the text information.
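As a rough sketch of this lookup, assuming the text has already been segmented into tokens and that the stored combinations live in a simple in-memory mapping, the matching could look like the following. The store `SLOT_COMBINATIONS`, its keys, and the function name are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory store: combination name -> word slots it requires.
SLOT_COMBINATIONS: Dict[str, Tuple[str, ...]] = {
    "workplace_visit": ("去你", "单位", "调查"),
    "waive_principal": ("减免本金",),
}


def find_target_combinations(tokens: List[str]) -> List[str]:
    """Return the names of stored combinations whose slots all occur among the tokens."""
    token_set = set(tokens)
    return [
        name
        for name, slots in SLOT_COMBINATIONS.items()
        if all(slot in token_set for slot in slots)
    ]
```

This sketch only checks presence; the order of the slots and the words between them are examined in the later judgment step.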
103. Determine whether the text information contains sensitive words, based on the target word slot combination contained in the text information and the intermediate word information of the target word slot combination in the text information.
The intermediate word information is the characters and words that appear in the text information between the word slots of the word slot combination. For example, suppose the text information is “XX找人去你的单位,对你做个背景调查后发现XX” (roughly, "XX sends someone to your workplace and, after running a background investigation on you, finds XX"), where XX stands for words omitted from the example. The target word slot combination contained in this text is “去你” + “单位” + “调查”. The “的” between “去你” and “单位”, and the “,对你做个背景” between “单位” and “调查”, are the intermediate words.
In this embodiment, the word slot combination corresponding to a sensitive word carries, to some extent, the same meaning as that sensitive word. It may be a combination of word slots made up of the sensitive word itself, or a combination of word slots that are not sensitive individually but carry a sensitive meaning when combined. In practice, published text that actually contains sensitive words is sometimes interspersed with spaces or symbols, padded with extra words, or rewritten with the same meaning in other words, all of which reduces the accuracy of determining whether the text contains sensitive words. This embodiment judges not only the word slot combination but also the intermediate word information of the word slot combination in the text information, so it can accurately identify whether the text information contains sensitive words in all of these cases, which improves the accuracy of sensitive word recognition.
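The example above can be reproduced with a short sketch that walks through the text and collects whatever lies between consecutive word slots. The function name `intermediate_text` and the choice to take the first occurrence of each slot are assumptions made for illustration; the application does not prescribe a search strategy.

```python
from typing import List, Optional, Sequence


def intermediate_text(text: str, slots: Sequence[str]) -> Optional[List[str]]:
    """Return the text between consecutive slots, or None if the slots
    do not all occur in the given order."""
    gaps: List[str] = []
    cursor = 0
    for index, slot in enumerate(slots):
        position = text.find(slot, cursor)
        if position == -1:
            return None
        if index > 0:
            # Everything between the previous slot and this one is intermediate text.
            gaps.append(text[cursor:position])
        cursor = position + len(slot)
    return gaps


text = "XX找人去你的单位,对你做个背景调查后发现XX"
print(intermediate_text(text, ["去你", "单位", "调查"]))
# → ['的', ',对你做个背景']
```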
104. If the text information is determined to contain sensitive words, perform restriction processing on the text information.
For example, when the text information is determined to contain sensitive words, it can be flagged to warn that sensitive words are present, for instance by emphasizing the part of the text that contains the target word slot combination (highlighting, bolding, underlining, and so on), or the communication message containing the text information can be blocked from being sent.
With the artificial intelligence-based sensitive word recognition method of this embodiment, the target word slot combination contained in the text information can be identified, where the word slot combination is composed of at least one preset word slot; then, based on the target word slot combination and the intermediate word information of the target word slot combination in the text information, it is determined whether the text information contains sensitive words. Compared with existing traditional sensitive word filtering, this embodiment judges both the word slot combination and the intermediate words between its word slots, so that even if symbols or spaces are inserted into a sensitive word, extra words are added, or the text is rewritten with the same meaning in other words, it can still accurately identify whether the text information contains sensitive words, which improves the accuracy of sensitive word recognition. If the text information is determined to contain sensitive words, it can also be restricted promptly. The whole process of sensitive word recognition plus restriction processing can be automated, which improves the efficiency of sensitive word processing.
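One way the emphasis option might look in plain text is sketched below. The marker strings `[[` and `]]` and the function name `highlight` are placeholders chosen here; a real front end would more likely apply bold or colour styling, and this sketch assumes the slots do not overlap one another or the markers.

```python
from typing import Sequence


def highlight(text: str, slots: Sequence[str],
              open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap every occurrence of each matched word slot in visible markers."""
    for slot in slots:
        text = text.replace(slot, f"{open_mark}{slot}{close_mark}")
    return text
```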
Further, as a refinement and extension of the above embodiment, and in order to describe the implementation process fully, another artificial intelligence-based sensitive word recognition method is provided. As shown in Figure 2, the method includes the following steps.
201. Acquire the text information to be recognized.
To protect the security and privacy of the text information before sensitive word recognition is performed on it, the text information can optionally be stored in a blockchain in advance. Correspondingly, step 201 may specifically include: acquiring the text information to be recognized from the blockchain. For example, the text information can be obtained from a target node of the blockchain and then subjected to sensitive word recognition. It should be noted that the blockchain referred to in this embodiment is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
202. Clear the character spaces and preset special symbols from the text information.
The preset special symbols may be symbols such as “@”, “#”, “¥”, “\”, “/”, and “*”. In this embodiment, clearing the character spaces and preset special symbols from the text information before word slot combination recognition effectively reduces noise and improves the precision with which word slot combinations and their corresponding verification rules are matched.
Further, besides clearing character spaces and preset special symbols, consecutively repeated words and symbols can also be removed from the text information, as can rare characters that would interfere with matching, which reduces noise further.
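A minimal sketch of this cleaning step is shown below. The symbol set is taken from the examples in the text; collapsing only runs of three or more identical characters is an assumption made here to stand in for "consecutively repeated" content, and rare-character removal is not covered.

```python
import re

# Symbol set taken from the examples in the text; extend as needed.
SPECIAL_SYMBOLS = "@#¥\\/*"

# Whitespace (including full-width spaces) plus the preset special symbols.
_STRIP = re.compile(r"[\s" + re.escape(SPECIAL_SYMBOLS) + r"]+")
# Assumed policy: a run of three or more identical characters counts as repetition.
_REPEATS = re.compile(r"(.)\1{2,}")


def clean_text(text: str) -> str:
    """Remove whitespace and preset symbols, then collapse long character runs."""
    stripped = _STRIP.sub("", text)
    return _REPEATS.sub(r"\1", stripped)
```

After cleaning, an input such as "去 你@单#位" becomes "去你单位", so the word slots can be matched as if no noise had been inserted.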
203. Identify the target word slot combination contained in the text information after the character spaces and preset special symbols have been cleared.
In this embodiment, the sensitive word recognition rules are configured first. The configuration has three layers: slot, rule, and model. A slot holds preset keywords such as sensitive words, non-sensitive words, and synonyms of sensitive words. A rule is a combination of slots (equivalent to one preset verification rule, that is, a criterion the text information must meet to be considered to contain sensitive words). A model is a combination of rules (equivalent to several verification rules used together). Once the slots are established, rules and models can be combined freely, so that a sensitive word filtering strategy can be tailored to the business scenario. For example, after the slots, rules, and models are established, each segmented word of the text information, with character spaces and preset special symbols cleared, is matched against the word slot combinations in the rules to find the target word slot combinations it contains.
It should be noted that recognition and matching may find at least one target word slot combination in the text information, that is, several different target word slot combinations may be present. A combined judgment is then made based on these target word slot combinations, following the process shown in steps 204 to 206a and 206b.
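The three-layer configuration could be represented with a few small data structures, as in the sketch below. All class names, field names, and example entries (including the `max_gap` field and the extra keyword “公司”) are hypothetical; the application describes the layers but not a concrete schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    """Layer 2: an ordered combination of slot names plus an assumed gap limit."""
    name: str
    slot_names: Tuple[str, ...]
    max_gap: int  # assumed: most characters allowed between the slots


@dataclass
class Model:
    """Layer 3: a named combination of rules for one business scenario."""
    name: str
    rules: List[Rule] = field(default_factory=list)


# Layer 1: each slot maps a slot name to the keywords that can fill it.
SLOTS: Dict[str, List[str]] = {
    "go_to": ["去你"],
    "workplace": ["单位", "公司"],  # "公司" is an illustrative synonym
    "investigate": ["调查"],
}

visit_rule = Rule("workplace_visit", ("go_to", "workplace", "investigate"), max_gap=10)
collection_model = Model("collection_compliance", [visit_rule])
```

Keeping slots separate from rules means a keyword added to a slot is picked up by every rule that references it, which is one plausible reading of the "free combination" the text describes.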
204、根据各个目标词槽组合对应的至少一目标校验规则,获取目标校验规则组合。204. Obtain a target verification rule combination according to at least one target verification rule corresponding to each target word slot combination.
在本实施例中,单个词槽组合可对应至少一目标校验规则,每一校验规则中相当于预设了包含敏感词的判定标准。当单个词槽组合对应至少两个校验规则时,相当于其对应的是校验规则组合。需要说明的是,本实施例可根据实际需求预先限定词槽组合对应的是单个校验规则,还是包含至少两个校验规则的校验规则组合,即限定词槽组合可用于规则层或模型层。通过这种方式可限定敏感词出现的范围,也就是检出的内容,作用范围为规则层或模型层,在指定范围进行检测,进而可灵活地使用校验规则进行敏感词识别,在语义双关的情况下能够采用多种校验规则从不同的角度进行准确判断,可提高敏感词识别的精确度。In this embodiment, a single word slot combination may correspond to at least one target verification rule, and each verification rule is equivalent to a preset criterion for containing sensitive words. When a single word slot combination corresponds to at least two verification rules, it is equivalent to a combination of verification rules. It should be noted that this embodiment can pre-determine whether the word slot combination corresponds to a single check rule or a check rule combination containing at least two check rules according to actual needs. That is, the qualifier slot combination can be used in the rule layer or the model. Floor. In this way, the range of sensitive words can be limited, that is, the content of detection. The scope of action is the rule layer or the model layer. The detection is performed in the specified range, and then the verification rules can be flexibly used for sensitive word recognition. In the case of, a variety of verification rules can be used to make accurate judgments from different angles, which can improve the accuracy of sensitive word recognition.
In this embodiment, the at least one target verification rule corresponding to each target word slot combination is combined to obtain the target verification rule combination, which contains at least one preset sensitive word determination criterion.
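Step 204 can be pictured as a lookup followed by a merge. The sketch below is a minimal illustration under assumptions: the table `RULES_BY_COMBINATION`, the rule names, and the function `build_target_rule_combination` are hypothetical and are not taken from the application.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical rule table: each word slot combination maps to one or more
# verification rule names. The names are illustrative only.
RULES_BY_COMBINATION: Dict[Tuple[str, ...], List[str]] = {
    ("去你", "单位", "调查"): ["and_within_8"],
    ("不会", "减免本金"): ["not_within_3"],
}


def build_target_rule_combination(
    target_combinations: Iterable[Tuple[str, ...]],
) -> List[str]:
    """Merge the rules of every matched combination, keeping first-seen order."""
    merged: List[str] = []
    for combination in target_combinations:
        for rule in RULES_BY_COMBINATION.get(tuple(combination), []):
            if rule not in merged:  # avoid checking the same rule twice
                merged.append(rule)
    return merged
```

A combination that has no entry in the table simply contributes no rules; a real system might treat that case as a configuration error instead.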
205. According to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, determine separately whether the text information meets each of the multiple preset sensitive word determination criteria in the target verification rule combination.
In the specific determination process, each preset sensitive word determination criterion in the target verification rule combination is applied, using the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, to determine whether the text information meets one or more of these criteria.
By way of example, several sensitive word determination criteria are given here. Step 205 may specifically include the following. If the criterion is one that is met when every word slot in the target word slot combination appears in the text information and the number of intermediate words is within the limited range, then the text information is determined to contain sensitive words when the word slot arrangement information (such as the order in which the word slots appear in the text) conforms to the preset word slot order corresponding to the target word slot combination and the number of intermediate words is less than or equal to the preset number threshold. If the criterion is one that is not met when every word slot in the target word slot combination appears in the text information and the number of intermediate words is within the limited range, then the text information is determined to contain sensitive words when the word slot arrangement information conforms to the preset word slot order corresponding to the target word slot combination and the number of intermediate words is greater than or equal to the preset number threshold.
Because changing the order in which words are combined sometimes does not change the resulting meaning, each word slot combination has its own preset word slot order, which is used to determine whether the combination carries the meaning of a sensitive word, and a single word slot combination may correspond to one or more preset word slot orders according to actual conditions. The intermediate words modify the meaning of the word slot combination, and the preset number threshold is used to determine whether the modified word slot combination still carries the meaning of a sensitive word. The threshold can be set in advance according to actual conditions.
For example, for a word slot combination consisting of the word slots "去你" ("go to your"), "单位" ("workplace"), and "调查" ("investigation"), matching yields a corresponding preset verification rule whose criterion is [AND], with 8 intermediate characters allowed among the three word slots. If a message sent by the user contains all three words and there are no more than 8 intermediate characters, the message is determined to meet the [AND] criterion, that is, it is determined to contain sensitive words. If there are more than 8 intermediate characters among the three word slots, the message is determined not to meet the [AND] criterion, that is, it is determined not to contain sensitive words. [AND] means that the rule is hit only when all of the word slots it joins appear, and a limit on intermediate words can be set. For instance, in the text "找人去你的单位,对你做个背景调查。" ("Get someone to go to your workplace and run a background check on you."), the intermediate characters are "的" between "去你" and "单位", and ",对你做个背景" between "单位" and "调查". Counting the comma, these total 8 characters, so the text is considered to hit the corresponding verification rule and is determined to contain sensitive words.
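The [AND] check in this example can be sketched as follows. This is a minimal illustration, assuming that the slots are located greedily from left to right with plain substring search and that every character between slots, punctuation included, counts as an intermediate character. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def find_slots_in_order(text: str, slots: Sequence[str]) -> Optional[List[Tuple[int, int]]]:
    """Return the (start, end) span of each slot, searched left to right.

    Returns None if any slot is missing or appears out of the preset order.
    """
    spans: List[Tuple[int, int]] = []
    cursor = 0
    for slot in slots:
        start = text.find(slot, cursor)
        if start == -1:
            return None
        end = start + len(slot)
        spans.append((start, end))
        cursor = end
    return spans


def count_intermediate(spans: Sequence[Tuple[int, int]]) -> int:
    """Total number of characters lying between consecutive slots."""
    return sum(nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:]))


def hits_and_rule(text: str, slots: Sequence[str], max_gap: int) -> bool:
    """[AND] rule: all slots present in order, with at most max_gap characters between them."""
    spans = find_slots_in_order(text, slots)
    return spans is not None and count_intermediate(spans) <= max_gap
```

For the sample sentence, `hits_and_rule("找人去你的单位,对你做个背景调查。", ["去你", "单位", "调查"], 8)` evaluates to `True`, because the gaps are 1 and 7 characters. Greedy search takes the first occurrence of each slot, so a text in which a slot repeats may need a more careful search.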
As another example, for a word slot combination consisting of the sensitive word slot "减免本金" ("waive the principal") and the non-sensitive word slot "不会" ("will not"), matching yields a corresponding preset verification rule whose criterion is [NOT], with 3 intermediate characters allowed between the two word slots. If a message sent by the user contains both words and there are fewer than 3 intermediate characters, the message is determined to meet the [NOT] criterion, that is, it is determined not to contain sensitive words. A [NOT] rule pairs one sensitive word slot with one non-sensitive word slot: if the two appear together, the rule is not hit, and a limit on intermediate words can be set. For instance, in the text "不会减免本金" ("the principal will not be waived"), the number of intermediate characters between the word slots "不会" and "减免本金" is 0, so the text is considered not to hit the corresponding verification rule and is determined not to contain sensitive words. In the text "不会的,您放心,肯定会减免本金" ("No, rest assured, the principal will definitely be waived"), the intermediate text "的,您放心,肯定会" between "不会" and "减免本金" is longer than 3 characters, so the text is considered to hit the corresponding verification rule and is determined to contain sensitive words.
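The [NOT] rule can be sketched in the same style. This is an illustration under assumptions: the function name is hypothetical, the negation slot is expected before the sensitive slot, and the boundary case follows step 205, where a count greater than or equal to the threshold is a hit. A text that lacks either slot does not match this word slot combination at all, so the sketch returns `False` and leaves such text to other rules.

```python
def hits_not_rule(text: str, negation_slot: str, sensitive_slot: str, min_gap: int) -> bool:
    """[NOT] rule: a nearby negation cancels the sensitive slot.

    Returns True (sensitive) only when both slots appear in order and at
    least min_gap characters separate them, so the negation no longer
    applies to the sensitive slot.
    """
    neg_start = text.find(negation_slot)
    if neg_start == -1:
        return False  # combination not matched; other rules decide
    neg_end = neg_start + len(negation_slot)
    sens_start = text.find(sensitive_slot, neg_end)
    if sens_start == -1:
        return False  # combination not matched; other rules decide
    return sens_start - neg_end >= min_gap
```

With a threshold of 3, "不会减免本金" has a gap of 0 and is not a hit, while "不会的,您放心,肯定会减免本金" has a gap of 9 and is a hit.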
In practical applications, a verification rule combination may contain at least two verification rules, which is equivalent to performing sensitive word recognition through the model in step 203. For example, a verification rule combination contains three verification rules: rule 1 is an identity card check, rule 2 is a sensitive word + [AND] check, and rule 3 is a sensitive word + [NOT] check. When rule 1 is used, the system checks whether the text information (after noise such as spaces, preset special symbols, and rare characters has been removed) contains a word slot made up of a string of digits. If it does, the system checks whether that string of digits conforms to the identity card format. If so, rule 1 is met, and the text information is considered to contain sensitive words. For sensitive word recognition with rules 2 and 3, refer to the two examples above; the details are not repeated here.
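The application does not say what "conforms to the identity card format" means in detail. The sketch below assumes it refers to the 18-character PRC resident identity number, whose last character is a check code computed from the first 17 digits with fixed weights (GB 11643-1999). The function names and the set of noise characters removed are assumptions made for illustration.

```python
import re

# Weights and check codes of the 18-character resident identity number.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CODES = "10X98765432"


def looks_like_id_number(candidate: str) -> bool:
    """Check the shape and the check code of an 18-character identity number."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", candidate):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(candidate[:17], ID_WEIGHTS))
    return ID_CHECK_CODES[total % 11] == candidate[17].upper()


def contains_id_number(text: str) -> bool:
    """Strip simple noise, then look for a digit run that passes the check."""
    cleaned = re.sub(r"[\s\-*#]", "", text)  # assumed noise characters
    candidates = re.findall(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])", cleaned)
    return any(looks_like_id_number(candidate) for candidate in candidates)
```

The sketch validates only the shape and the check code. It does not validate the region code or the embedded birth date, which a stricter rule could add.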
In actual verification, the target verification rule combination may contain a large number of verification rules, and checking them one by one would hurt time efficiency. Therefore, to improve the efficiency of sensitive word recognition, step 205 may optionally further include: if the target verification rule combination contains at least one preset sensitive word determination criterion with differing execution priorities, judging the text information against the criteria in descending order of execution priority; and, during this sequential judgment, if a criterion that the text information meets is found, stopping further judgment of the text information and taking the result obtained so far as the result of judging the text information with the target verification rule combination.
For example, the target verification rule combination contains five verification rules with preset execution priorities (for example, ranked from high to low by sensitive word recognition success rate). In descending order of execution priority they are: rule 1 > rule 3 > rule 4 > rule 5 > rule 2. The rules are then applied to the text information in that order. If rule 3 determines that the text information contains sensitive words, the checks for rules 4, 5, and 2 are skipped. With this option, a result can be obtained as quickly as possible without running every verification rule, which can improve the efficiency of sensitive word recognition.
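The priority-ordered, early-exit evaluation described here can be sketched as below. This is a minimal illustration; representing each rule as a `(priority, name, predicate)` tuple and the function name `first_matching_rule` are assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple

Rule = Tuple[int, str, Callable[[str], bool]]  # (priority, name, predicate)


def first_matching_rule(text: str, rules: Iterable[Rule]) -> Optional[str]:
    """Apply rules from highest to lowest priority and stop at the first hit.

    Returns the name of the rule that matched, or None if no rule matched.
    """
    for _priority, name, predicate in sorted(rules, key=lambda rule: rule[0], reverse=True):
        if predicate(text):
            return name  # later (lower-priority) rules are never evaluated
    return None
```

Returning the rule name rather than a bare boolean keeps a record of which criterion fired, which may help when rule priorities are tuned later.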
In a specific implementation of this embodiment, a priority scope for evaluation can be defined, similar to operator precedence in arithmetic: verification rules that appear inside the priority scope are executed first. For example, verification rules contain specific regular-expression-style symbols that carry different meanings. A rule that should be matched with priority can be placed in parentheses "()", and when the verification rules are executed the content in parentheses is executed first, followed by the remaining rules.
206a. If the text information meets at least one group of sensitive word determination criteria in the target verification rule combination, determine that the text information contains sensitive words.
In this embodiment, a group of sensitive word determination criteria may contain at least one criterion, that is, one, two, or more criteria, depending on the accuracy actually required for sensitive word determination.
Step 206b, parallel to step 206a: if the text information meets none of the sensitive word determination criteria in the target verification rule combination, determine that the text information does not contain sensitive words.
In this embodiment, the multiple preset sensitive word determination criteria in the verification rule combination can each be used to judge the text information. If the text information meets at least one of them, it can be determined to contain sensitive words, which can improve the accuracy of sensitive word recognition.
207. If it is determined that the text information contains sensitive words, perform restriction processing on the text information.
Optionally, performing restriction processing on the text information may specifically include: blocking publication of the text information; or replacing the part of the text information that contains the target word slot combination with preset characters (such as "*" or "-", which desensitize the text) before publishing it; or sending the text information to a review module for review and publishing it only if it passes. For example, after determining that text information contains sensitive words, the system prevents the user from publishing the sensitive words, or directly deletes the content containing sensitive words that the user has sent. Words of lower sensitivity are not deleted immediately after being sent; instead, reviewers perform a second, manual review.
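The character-replacement option can be sketched as follows. This is one possible reading, in which only the word slots themselves are masked and the intermediate characters are left visible; the function name `mask_slots` and the default mask character are assumptions.

```python
from typing import Iterable


def mask_slots(text: str, slots: Iterable[str], mask_char: str = "*") -> str:
    """Replace every occurrence of each word slot with mask characters of equal length."""
    for slot in slots:
        text = text.replace(slot, mask_char * len(slot))
    return text
```

Keeping the masked text the same length as the original means character offsets recorded during recognition remain valid afterwards. A stricter variant could mask the whole span from the first slot to the last.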
Further optionally, after step 207, the method of this embodiment may also include: recording the part of the text information that contains the target word slot combination as sample data; periodically analyzing the recorded sample data and identifying word combinations that appear in the sample data more often than a preset frequency threshold and that differ from the existing word slot combinations; calculating the semantic similarity between each such word combination and the preset sensitive words and/or preset sensitive sentences; taking each target word combination whose semantic similarity exceeds a preset similarity threshold as a new word slot combination, and updating the verification rule corresponding to the new word slot combination according to the sample data that contains it. The new word slot combination and its corresponding verification rule can subsequently be used to determine whether other text information contains sensitive words. Updating the word slot combinations and their verification rules automatically on a schedule in this way updates the sensitive word recognition system automatically, which can further improve the accuracy of subsequent recognition. The sensitive word recognition system as a whole in effect has a machine learning capability and can recognize sensitive words accurately by means of artificial intelligence.
For example, an article on a sensitive topic usually contains more than one word slot combination with sensitive meaning, and uses a variety of different words to express the topic. Text data that has been determined to contain sensitive words using the existing word slot combinations and their verification rules will therefore sometimes also contain other word combinations with sensitive meaning. This embodiment collects such text data as sample data and analyzes it periodically to find word combinations that appear more often than a given threshold and that differ from the existing word slot combinations. It calculates the semantic similarity between these combinations and the preset sensitive words and/or preset sensitive sentences, thereby finding new word slot combinations that were not discovered before but also carry sensitive meaning, and formulates verification rules for them. The new word slot combinations and their verification rules can then be used to determine whether other text information contains sensitive words, so that more text data that actually carries sensitive meaning can be found.
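The periodic analysis can be sketched as frequency counting followed by a similarity filter. This is a minimal illustration under several assumptions: samples are already tokenized, candidate combinations are unordered word pairs, and `similarity` is a caller-supplied placeholder for whatever semantic similarity model is used, which the application does not specify. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Iterable, List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def mine_new_combinations(
    samples: Iterable[Sequence[str]],
    known: Set[Pair],
    min_count: int,
    similarity: Callable[[Pair], float],
    min_similarity: float,
) -> List[Pair]:
    """Return frequent, previously unknown word pairs that look sensitive.

    A pair is counted once per sample in which both words occur.
    """
    counts: Counter = Counter()
    for tokens in samples:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return [
        pair
        for pair, count in counts.items()
        if count > min_count and pair not in known and similarity(pair) > min_similarity
    ]
```

Pairs are stored in sorted order, so entries in `known` need to use the same ordering. Deriving the verification rule for each new combination (for example, its gap threshold) would be a separate step.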
The method of this embodiment can also be applied in an intelligent sensitive word quality inspection system. Such a system can use algorithms to match entries and, by setting specific rules and strategies, can reduce interference from noise and filter sensitive words accurately across the text. After a sensitive word lexicon is built, an algorithm traverses the text and matches it against a sensitive word tree in order to identify and filter sensitive words. Strategies can be customized to customer needs to efficiently filter many kinds of sensitive words and prohibited variants, such as prohibited messages, malicious promotion, vulgar abuse, and low-quality spam. The intelligent quality inspection system reviews content with high recognition accuracy and can process text quickly, which greatly reduces the manual review workload, helps prevent online risks, improves the quality of published content, cleans up the network environment, and supports a good user experience.
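Matching text against a sensitive word tree is commonly done with a trie. The sketch below is a minimal illustration, assuming a dictionary-based trie and a scan from every start position; the application does not specify the tree structure or the traversal algorithm, and the names here are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

END = "$end"  # marker key; longer than one character, so it cannot collide with a text character


def build_trie(words: Iterable[str]) -> Dict:
    """Build a nested-dict trie with one level per character."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def find_sensitive_words(text: str, trie: Dict) -> List[Tuple[int, str]]:
    """Return (start_index, word) for every lexicon word found in the text."""
    hits: List[Tuple[int, str]] = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                hits.append((start, node[END]))
    return hits
```

This scan costs time proportional to the text length times the longest lexicon word. For large lexicons, an Aho-Corasick automaton would avoid rescanning from each position.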
Further, as a specific implementation of the methods shown in FIG. 1 and FIG. 2, this embodiment provides an artificial intelligence-based sensitive word recognition apparatus. As shown in FIG. 3, the apparatus includes an acquisition module 31, a recognition module 32, a judgment module 33, and a processing module 34.
The acquisition module 31 is configured to acquire the text information to be recognized.
The recognition module 32 is configured to recognize a target word slot combination contained in the text information, where the target word slot combination consists of at least one preset word slot.
The judgment module 33 is configured to determine whether the text information contains sensitive words according to the target word slot combination and the intermediate word information of the target word slot combination in the text information.
The processing module 34 is configured to perform filtering processing on the text information if it is determined that the text information contains sensitive words.
In a specific application scenario, the judgment module 33 is specifically configured to: obtain a target verification rule combination according to the at least one target verification rule corresponding to each target word slot combination; determine separately, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the multiple preset sensitive word determination criteria in the target verification rule combination; if the text information meets at least one group of sensitive word determination criteria in the target verification rule combination, determine that the text information contains sensitive words; and if the text information meets none of the sensitive word determination criteria in the target verification rule combination, determine that the text information does not contain sensitive words.
In a specific application scenario, the judgment module 33 is further specifically configured to: if the criterion is one that is met when every word slot in the target word slot combination appears in the text information and the number of intermediate words is within the limited range, determine that the text information contains sensitive words when the word slot arrangement information conforms to the preset word slot order corresponding to the target word slot combination and the number of intermediate words is less than or equal to the preset number threshold; and if the criterion is one that is not met when every word slot in the target word slot combination appears in the text information and the number of intermediate words is within the limited range, determine that the text information contains sensitive words when the word slot arrangement information conforms to the preset word slot order corresponding to the target word slot combination and the number of intermediate words is greater than or equal to the preset number threshold.
In a specific application scenario, the judgment module 33 is further specifically configured to: if the target verification rule combination contains at least one preset sensitive word determination criterion with differing execution priorities, judge the text information against the criteria in descending order of execution priority; and, during this sequential judgment, if a criterion that the text information meets is found, stop further judgment of the text information and take the result obtained so far as the result of judging the text information with the target verification rule combination.
In a specific application scenario, the apparatus further includes a recording module and an analysis module.
The recording module is configured to record, after the restriction processing is performed on the text information, the part of the text information that contains the target word slot combination as sample data.
The analysis module is configured to: periodically analyze the recorded sample data and identify word combinations that appear in the sample data more often than a preset frequency threshold and that differ from the existing word slot combinations; calculate the semantic similarity between each such word combination and the preset sensitive words and/or preset sensitive sentences; take each target word combination whose semantic similarity exceeds a preset similarity threshold as a new word slot combination, and update the verification rule corresponding to the new word slot combination according to the sample data that contains it; and use the new word slot combination and its corresponding verification rule to determine whether other text information contains sensitive words.
In a specific application scenario, the processing module 34 is specifically configured to: block publication of the text information; or replace the part of the text information that contains the target word slot combination with preset characters before publishing it; or send the text information to a review module for review and publish it if it passes.
In a specific application scenario, the text information is optionally stored in a blockchain in advance. Correspondingly, the acquisition module 31 is specifically configured to acquire the text information from the blockchain, and the recognition module 32 is specifically configured to remove spaces and preset special symbols from the text information and to recognize the target word slot combination contained in the text information after the spaces and preset special symbols have been removed.
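The cleaning step performed by the recognition module can be sketched with a regular expression. This is a minimal illustration; the particular set of "preset special symbols" below is an assumption, since the application leaves that set to configuration.

```python
import re

# Whitespace plus an assumed set of preset special symbols.
NOISE_PATTERN = re.compile(r"[\s*#@_\-~]+")


def strip_noise(text: str) -> str:
    """Remove whitespace and the assumed special symbols before slot matching."""
    return NOISE_PATTERN.sub("", text)
```

For example, `strip_noise("减 免*本-金")` yields "减免本金", so a word slot split up by spaces or symbols can still be matched.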
It should be noted that, for other descriptions of the functional units of the artificial intelligence-based sensitive word recognition apparatus provided in this embodiment, reference may be made to the corresponding descriptions of FIG. 1 and FIG. 2; the details are not repeated here.
Based on the methods shown in FIG. 1 and FIG. 2, this embodiment correspondingly also provides a readable storage medium storing computer-readable instructions that, when executed by a processor, implement the artificial intelligence-based sensitive word recognition method shown in FIG. 1 and FIG. 2.
Optionally, the readable storage medium involved in this application may be a computer-readable storage medium. Further optionally, the storage medium involved in this application, such as a readable storage medium, may be non-volatile (a non-volatile readable storage medium) or volatile (a volatile readable storage medium).
Based on this understanding, the technical solution of this application can be embodied as a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) and includes several instructions that cause a computer device (such as a personal computer, server, or network device) to execute the methods of the implementation scenarios of this application.
Based on the methods shown in FIG. 1 and FIG. 2 and the virtual apparatus embodiment shown in FIG. 3, and in order to achieve the above purpose, this embodiment also provides a computer device, which may specifically be a personal computer, notebook computer, server, network device, or the like. The physical device includes a storage medium and a processor. The storage medium is configured to store computer-readable instructions, and the processor is configured to execute the computer-readable instructions to implement the artificial intelligence-based sensitive word recognition method shown in FIG. 1 and FIG. 2. Optionally, the storage medium may be a readable storage medium, such as a non-volatile readable storage medium or a volatile readable storage medium.
Optionally, the computer device may also include a user interface, a network interface, a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or a Wi-Fi interface), and the like.
Those skilled in the art will understand that the computer device structure provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the physical device and supports the running of the information processing program and other software and/or programs. The network communication module is used for communication among the components within the storage medium, and for communication with other hardware and software in the physical device.
From the description of the above implementations, those skilled in the art will clearly understand that this application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. By applying the technical solution of this embodiment, the target word slot combination contained in the text information can be recognized, where the word slot combination consists of at least one preset word slot, and whether the text information contains sensitive words is then determined according to the target word slot combination and the intermediate word information of the target word slot combination in the text information. Compared with existing traditional sensitive word filtering, this embodiment judges by word slot combination plus the intermediate words between the word slots. Even if symbols or spaces are inserted into a sensitive word, extra words are added, or the same meaning is rewritten with other text, it can accurately recognize whether the text information contains sensitive words, which can improve the accuracy of sensitive word recognition. If the text information is determined to contain sensitive words, restriction processing can be applied to it promptly. The whole process of sensitive word recognition plus restriction processing can be automated, which improves the efficiency of sensitive word handling.
Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required for implementing this application. Those skilled in the art will also understand that the modules of the apparatus in an implementation scenario may be distributed in that apparatus as described, or may be changed accordingly and located in one or more apparatuses different from the one in this implementation scenario. The modules of the above implementation scenario may be combined into one module or further split into multiple sub-modules.
The serial numbers used in this application are for description only and do not indicate the relative merits of the implementation scenarios. The above discloses only a few specific implementation scenarios of this application; however, this application is not limited to them, and any variation that a person skilled in the art can conceive shall fall within the scope of protection of this application.

Claims (20)

  1. An artificial intelligence-based method for identifying sensitive words, comprising:
    acquiring text information to be recognized;
    identifying a target word slot combination contained in the text information, wherein the target word slot combination consists of at least one preset word slot;
    determining, according to the target word slot combination and intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words;
    if it is determined that the text information contains sensitive words, performing restriction processing on the text information.
  2. The method according to claim 1, wherein the determining, according to the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words specifically comprises:
    obtaining a target verification rule combination according to at least one target verification rule corresponding to each target word slot combination;
    determining, according to word slot arrangement information of the target word slot combination in the text information and intermediate word information between the word slots, whether the text information meets each of a plurality of preset sensitive-word determination criteria in the target verification rule combination;
    if the text information meets at least one set of sensitive-word determination criteria in the target verification rule combination, determining that the text information contains sensitive words;
    if the text information meets none of the sensitive-word determination criteria in the target verification rule combination, determining that the text information does not contain sensitive words.
  3. The method according to claim 2, wherein the determining, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the plurality of preset sensitive-word determination criteria in the target verification rule combination specifically comprises:
    if the sensitive-word determination criterion is that every word slot in the target word slot combination appears in the text information and an intermediate word count within a limited range counts as meeting the criterion, determining that the text information contains sensitive words when the word slot arrangement information matches the preset word slot order corresponding to the target word slot combination and the number of intermediate words is less than or equal to a preset count threshold;
    if the sensitive-word determination criterion is that every word slot in the target word slot combination appears in the text information and an intermediate word count within the limited range counts as not meeting the criterion, determining that the text information contains sensitive words when the word slot arrangement information matches the preset word slot order corresponding to the target word slot combination and the number of intermediate words is greater than or equal to the preset count threshold.
  4. The method according to claim 2, wherein, if the target verification rule combination includes at least one preset sensitive-word determination criterion with different execution priorities, the determining, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the plurality of preset sensitive-word determination criteria in the target verification rule combination specifically comprises:
    evaluating the text information against the sensitive-word determination criteria in the target verification rule combination one by one, in descending order of execution priority;
    during this sequential evaluation, if a sensitive-word determination criterion that the text information meets is found, stopping further evaluation of the text information and taking the result obtained so far as the result of evaluating the text information with the target verification rule combination.
  5. The method according to claim 2, wherein, after the restriction processing is performed on the text information, the method further comprises:
    recording, as sample data, the text portion of the text information that contains the target word slot combination;
    periodically analyzing the recorded sample data and counting word combinations that occur in the sample data with a frequency greater than a preset frequency threshold and that differ from the existing word slot combinations;
    computing the semantic similarity between the counted word combinations and preset sensitive words and/or preset sensitive sentences;
    taking a target word combination whose semantic similarity is greater than a preset similarity threshold as a new word slot combination, and updating the verification rule corresponding to the new word slot combination according to the sample data containing the new word slot combination;
    using the new word slot combination and its corresponding verification rule to determine whether other text information contains sensitive words.
  6. The method according to claim 1, wherein the performing restriction processing on the text information specifically comprises:
    blocking publication of the text information; or,
    replacing the text portion of the text information that contains the target word slot combination with preset characters before publication; or,
    sending the text information to a review module for review, and publishing it if the review is passed.
  7. The method according to claim 1, wherein the text information is stored in a blockchain in advance;
    the acquiring text information to be recognized specifically comprises:
    acquiring the text information from the blockchain;
    the identifying a target word slot combination contained in the text information specifically comprises:
    removing character spaces and preset special symbols from the text information;
    identifying the target word slot combination contained in the text information after the character spaces and preset special symbols have been removed.
  8. An artificial intelligence-based apparatus for identifying sensitive words, comprising:
    an acquisition module, configured to acquire text information to be recognized;
    a recognition module, configured to identify a target word slot combination contained in the text information, wherein the target word slot combination consists of at least one preset word slot;
    a determination module, configured to determine, according to the target word slot combination and intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words;
    a processing module, configured to perform restriction processing on the text information if it is determined that the text information contains sensitive words.
  9. A readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following method:
    acquiring text information to be recognized;
    identifying a target word slot combination contained in the text information, wherein the target word slot combination consists of at least one preset word slot;
    determining, according to the target word slot combination and intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words;
    if it is determined that the text information contains sensitive words, performing restriction processing on the text information.
  10. The readable storage medium according to claim 9, wherein, when determining, according to the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words, the following is specifically implemented:
    obtaining a target verification rule combination according to at least one target verification rule corresponding to each target word slot combination;
    determining, according to word slot arrangement information of the target word slot combination in the text information and intermediate word information between the word slots, whether the text information meets each of a plurality of preset sensitive-word determination criteria in the target verification rule combination;
    if the text information meets at least one set of sensitive-word determination criteria in the target verification rule combination, determining that the text information contains sensitive words;
    if the text information meets none of the sensitive-word determination criteria in the target verification rule combination, determining that the text information does not contain sensitive words.
  11. The readable storage medium according to claim 10, wherein, when determining, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the plurality of preset sensitive-word determination criteria in the target verification rule combination, the following is specifically implemented:
    if the sensitive-word determination criterion is that every word slot in the target word slot combination appears in the text information and an intermediate word count within a limited range counts as meeting the criterion, determining that the text information contains sensitive words when the word slot arrangement information matches the preset word slot order corresponding to the target word slot combination and the number of intermediate words is less than or equal to a preset count threshold;
    if the sensitive-word determination criterion is that every word slot in the target word slot combination appears in the text information and an intermediate word count within the limited range counts as not meeting the criterion, determining that the text information contains sensitive words when the word slot arrangement information matches the preset word slot order corresponding to the target word slot combination and the number of intermediate words is greater than or equal to the preset count threshold.
  12. The readable storage medium according to claim 10, wherein, if the target verification rule combination includes at least one preset sensitive-word determination criterion with different execution priorities, when determining, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the plurality of preset sensitive-word determination criteria in the target verification rule combination, the following is specifically implemented:
    evaluating the text information against the sensitive-word determination criteria in the target verification rule combination one by one, in descending order of execution priority;
    during this sequential evaluation, if a sensitive-word determination criterion that the text information meets is found, stopping further evaluation of the text information and taking the result obtained so far as the result of evaluating the text information with the target verification rule combination.
  13. The readable storage medium according to claim 10, wherein, after the restriction processing is performed on the text information, the computer-readable instructions, when executed by the processor, further implement:
    recording, as sample data, the text portion of the text information that contains the target word slot combination;
    periodically analyzing the recorded sample data and counting word combinations that occur in the sample data with a frequency greater than a preset frequency threshold and that differ from the existing word slot combinations;
    computing the semantic similarity between the counted word combinations and preset sensitive words and/or preset sensitive sentences;
    taking a target word combination whose semantic similarity is greater than a preset similarity threshold as a new word slot combination, and updating the verification rule corresponding to the new word slot combination according to the sample data containing the new word slot combination;
    using the new word slot combination and its corresponding verification rule to determine whether other text information contains sensitive words.
  14. The readable storage medium according to claim 9, wherein the text information is stored in a blockchain in advance;
    when acquiring the text information to be recognized, the following is specifically implemented:
    acquiring the text information from the blockchain;
    when identifying the target word slot combination contained in the text information, the following is specifically implemented:
    removing character spaces and preset special symbols from the text information;
    identifying the target word slot combination contained in the text information after the character spaces and preset special symbols have been removed.
  15. A computer device, comprising a readable storage medium, a processor, and computer-readable instructions stored on the readable storage medium and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following method:
    acquiring text information to be recognized;
    identifying a target word slot combination contained in the text information, wherein the target word slot combination consists of at least one preset word slot;
    determining, according to the target word slot combination and intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words;
    if it is determined that the text information contains sensitive words, performing restriction processing on the text information.
  16. The computer device according to claim 15, wherein, when determining, according to the target word slot combination and the intermediate word information of the target word slot combination in the text information, whether the text information contains sensitive words, the following is specifically implemented:
    obtaining a target verification rule combination according to at least one target verification rule corresponding to each target word slot combination;
    determining, according to word slot arrangement information of the target word slot combination in the text information and intermediate word information between the word slots, whether the text information meets each of a plurality of preset sensitive-word determination criteria in the target verification rule combination;
    if the text information meets at least one set of sensitive-word determination criteria in the target verification rule combination, determining that the text information contains sensitive words;
    if the text information meets none of the sensitive-word determination criteria in the target verification rule combination, determining that the text information does not contain sensitive words.
  17. The computer device according to claim 16, wherein, when determining, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the plurality of preset sensitive-word determination criteria in the target verification rule combination, the following is specifically implemented:
    if the sensitive-word determination criterion is that every word slot in the target word slot combination appears in the text information and an intermediate word count within a limited range counts as meeting the criterion, determining that the text information contains sensitive words when the word slot arrangement information matches the preset word slot order corresponding to the target word slot combination and the number of intermediate words is less than or equal to a preset count threshold;
    if the sensitive-word determination criterion is that every word slot in the target word slot combination appears in the text information and an intermediate word count within the limited range counts as not meeting the criterion, determining that the text information contains sensitive words when the word slot arrangement information matches the preset word slot order corresponding to the target word slot combination and the number of intermediate words is greater than or equal to the preset count threshold.
  18. The computer device according to claim 16, wherein, if the target verification rule combination includes at least one preset sensitive-word determination criterion with different execution priorities, when determining, according to the word slot arrangement information of the target word slot combination in the text information and the intermediate word information between the word slots, whether the text information meets each of the plurality of preset sensitive-word determination criteria in the target verification rule combination, the following is specifically implemented:
    evaluating the text information against the sensitive-word determination criteria in the target verification rule combination one by one, in descending order of execution priority;
    during this sequential evaluation, if a sensitive-word determination criterion that the text information meets is found, stopping further evaluation of the text information and taking the result obtained so far as the result of evaluating the text information with the target verification rule combination.
  19. The computer device according to claim 16, wherein, after the restriction processing is performed on the text information, the processor, when executing the computer-readable instructions, further implements:
    recording, as sample data, the text portion of the text information that contains the target word slot combination;
    periodically analyzing the recorded sample data and counting word combinations that occur in the sample data with a frequency greater than a preset frequency threshold and that differ from the existing word slot combinations;
    computing the semantic similarity between the counted word combinations and preset sensitive words and/or preset sensitive sentences;
    taking a target word combination whose semantic similarity is greater than a preset similarity threshold as a new word slot combination, and updating the verification rule corresponding to the new word slot combination according to the sample data containing the new word slot combination;
    using the new word slot combination and its corresponding verification rule to determine whether other text information contains sensitive words.
  20. The computer device according to claim 15, wherein the text information is stored in a blockchain in advance;
    when acquiring the text information to be recognized, the following is specifically implemented:
    acquiring the text information from the blockchain;
    when identifying the target word slot combination contained in the text information, the following is specifically implemented:
    removing character spaces and preset special symbols from the text information;
    identifying the target word slot combination contained in the text information after the character spaces and preset special symbols have been removed.
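
The priority-ordered evaluation of determination criteria and the three restriction options recited in the claims can be sketched as follows. This is an illustrative reading under stated assumptions, not the applicant's implementation: the names `Criterion`, `evaluate`, `Action`, and `restrict`, the convention that a larger `priority` value runs first, and the `approve` callback standing in for the review module are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple

@dataclass(frozen=True)
class Criterion:
    name: str
    priority: int        # assumed convention: larger value is evaluated first
    threshold: int       # preset count threshold for intermediate words
    match_within: bool   # True: count <= threshold is a hit; False: count >= threshold is a hit

    def is_hit(self, order_ok: bool, gap: int) -> bool:
        """A hit needs the preset slot order plus the count condition for this criterion."""
        if not order_ok:
            return False
        return gap <= self.threshold if self.match_within else gap >= self.threshold

def evaluate(criteria: Sequence[Criterion], order_ok: bool, gap: int) -> Optional[str]:
    """Check criteria from highest to lowest priority and stop at the first hit."""
    for criterion in sorted(criteria, key=lambda c: c.priority, reverse=True):
        if criterion.is_hit(order_ok, gap):
            return criterion.name
    return None

class Action(Enum):
    BLOCK = "block"
    MASK = "mask"
    REVIEW = "review"

def restrict(text: str, span: Tuple[int, int], action: Action,
             approve: Callable[[str], bool] = lambda t: False,
             fill: str = "*") -> Optional[str]:
    """Return the text that may be published, or None when publication is withheld."""
    if action is Action.MASK:
        start, end = span
        return text[:start] + fill * (end - start) + text[end:]
    if action is Action.REVIEW:
        return text if approve(text) else None
    return None  # Action.BLOCK: nothing is published
```

With a high-priority criterion that hits when the intermediate count is at most 3 and a lower-priority one that hits at 10 or more, `evaluate` returns the name of the first criterion that hits and skips the rest, which mirrors the early stop described in the claims; a count between the two thresholds hits neither.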
PCT/CN2020/124684 2020-09-07 2020-10-29 Sensitive word recognition method and apparatus based on artificial intelligence, and computer device WO2021151333A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010927419.7 2020-09-07
CN202010927419.7A CN112016317A (en) 2020-09-07 2020-09-07 Sensitive word recognition method and device based on artificial intelligence and computer equipment

Publications (1)

Publication Number Publication Date
WO2021151333A1 true WO2021151333A1 (en) 2021-08-05

Family

ID=73515434

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124684 WO2021151333A1 (en) 2020-09-07 2020-10-29 Sensitive word recognition method and apparatus based on artificial intelligence, and computer device

Country Status (2)

Country Link
CN (1) CN112016317A (en)
WO (1) WO2021151333A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705211B (en) * 2021-10-29 2022-01-18 云账户技术(天津)有限公司 Method and device for automatically generating license character number and readable storage medium
CN114357511A (en) * 2021-12-30 2022-04-15 北京鼎普科技股份有限公司 Method and device for marking key content of document and user terminal
CN117436437A (en) * 2022-07-11 2024-01-23 华为云计算技术有限公司 Method, device, equipment and cluster for detecting combination sensitive words

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992471A (en) * 2017-11-10 2018-05-04 北京光年无限科技有限公司 Information filtering method and device in a kind of interactive process
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text
CN110096585A (en) * 2019-03-26 2019-08-06 珠海鹏游网络科技有限公司 A kind of intelligence filtering sensitive words system
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN111339760A (en) * 2018-12-18 2020-06-26 北京京东尚科信息技术有限公司 Method and device for training lexical analysis model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112016317A (en) 2020-12-01

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 20916505; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 20916505; Country of ref document: EP; Kind code of ref document: A1