WO2019153605A1 - Method for identifying sensitive information in text, electronic device, and readable storage medium - Google Patents

Method for identifying sensitive information in text, electronic device, and readable storage medium

Publication number
WO2019153605A1
WO2019153605A1 PCT/CN2018/089717 CN2018089717W WO2019153605A1 WO 2019153605 A1 WO2019153605 A1 WO 2019153605A1 CN 2018089717 W CN2018089717 W CN 2018089717W WO 2019153605 A1 WO2019153605 A1 WO 2019153605A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
preset
sensitive
paragraph
identified
Application number
PCT/CN2018/089717
Other languages
English (en)
French (fr)
Inventor
赵骏
郑佳
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019153605A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present application relates to the field of computer technology, and in particular to a method for identifying sensitive information in text, an electronic device, and a readable storage medium.
  • The purpose of the present application is to provide a method for identifying sensitive information in text, an electronic device, and a readable storage medium, so that text containing sensitive information can be identified automatically and effectively.
  • A first aspect of the present application provides an electronic device including a memory, a processor, and a system for identifying sensitive information in text that is stored in the memory and executable on the processor.
  • When executed by the processor, the system for identifying sensitive information in text implements the following steps:
  • the text to be identified is divided into individual paragraphs by using a preset paragraph analysis rule;
  • a preset rule is used to determine whether the text to be identified contains sensitive information.
  • A second aspect of the present application further provides a method for identifying sensitive information in text, the method including:
  • the text to be identified is divided into individual paragraphs by using a preset paragraph analysis rule;
  • a preset rule is used to determine whether the text to be identified contains sensitive information.
  • A third aspect of the present application further provides a computer-readable storage medium storing a system for identifying sensitive information in text, the system being executable by at least one processor to cause the at least one processor to perform the steps of the above method for identifying sensitive information in text.
  • In the method, system, and readable storage medium proposed by the present application, the text to be identified is split into paragraphs, clauses, and word segments; each word segment is matched against each sensitive word in a pre-established sensitive lexicon to obtain the word segments that match a sensitive word; and a preset rule then determines whether the text to be identified contains sensitive information, based on the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of that word segment.
  • The present application matches each word segment in the text to be identified against each sensitive word in the pre-established sensitive lexicon, assigns a preset sensitive-word matching weight according to the match, and sets a paragraph weight according to the position of the matched word segment in the text. Combining the preset sensitive-word matching weight with the preset paragraph weight makes it possible to determine more accurately and effectively whether the text to be identified contains sensitive information.
  • Moreover, the identification of sensitive information in text is performed automatically, which effectively improves detection efficiency.
  • FIG. 1 is a schematic diagram of the operating environment of a preferred embodiment of the identification system 10 for sensitive information in text of the present application;
  • FIG. 2 is a schematic flowchart of an embodiment of the method for identifying sensitive information in text of the present application.
  • Refer to FIG. 1, which shows the operating environment of a preferred embodiment of the identification system 10 for sensitive information in text.
  • The identification system 10 is installed and runs in the electronic device 1.
  • The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • FIG. 1 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • The memory 11 is at least one type of computer-readable storage medium. In some embodiments it may be an internal storage unit of the electronic device 1, such as a hard disk or internal memory of the electronic device 1.
  • In other embodiments the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 1.
  • The memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1.
  • The memory 11 is used to store application software installed on the electronic device 1 and various types of data, such as the program code of the identification system 10.
  • The memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • The processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run program code or process data stored in the memory 11, for example to execute the identification system 10.
  • In some embodiments the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
  • The display 13 is used to display information processed in the electronic device 1 and to display a visual user interface, for example the paragraph-splitting and word-segmentation results of the text to be identified, the (marked) word segments in the text that match sensitive words in the sensitive lexicon, and the final result of whether the text to be identified contains sensitive information.
  • The components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • The identification system 10 includes at least one computer-readable instruction stored in the memory 11, and the at least one computer-readable instruction is executable by the processor 12 to implement the embodiments of the present application.
  • When executed by the processor 12, the identification system 10 implements the following steps:
  • Step S1: after the text to be identified is received, it is divided into individual paragraphs by using a preset paragraph analysis rule.
  • Step S2: the individual paragraphs are divided into clauses, and the resulting sentences are processed by word segmentation.
  • The identification system receives a sensitive-information identification request sent by a user, for example a request sent from a terminal such as a mobile phone, a tablet computer, or a self-service terminal device, whether through a client pre-installed on the terminal or through a browser on the terminal.
  • After receiving the request, the identification system first performs a series of processing steps on the text to be identified carried in the request, so that the sensitive information in the text can subsequently be judged accurately. For example, the following processing can be performed:
  • Pre-processing, such as noise removal, is applied to the text to be identified: if the text contains distorted or variant words, they are corrected first; garbled characters and runs of repeated special characters of the same type are removed; and traditional Chinese characters may be converted to simplified Chinese characters.
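  • As an illustration only, the noise-removal step might be sketched as follows. The function name `preprocess` and the specific rules (Unicode NFKC normalisation, dropping control characters and the replacement character, collapsing repeated punctuation) are assumptions chosen for this sketch; the application does not specify them, and traditional-to-simplified conversion is omitted because it needs an external mapping.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Light noise removal before paragraph analysis (illustrative only)."""
    # Normalise full-width and compatibility forms so variants compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Drop the Unicode replacement character and control characters,
    # but keep newline and TAB because paragraph analysis relies on them.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or (ch != "\ufffd" and unicodedata.category(ch) != "Cc")
    )
    # Collapse runs of the same special character, e.g. "!!!!" becomes "!".
    text = re.sub(r"([^\w\s])\1+", r"\1", text)
    return text
```

  • Newlines and TAB characters are deliberately preserved here, since the paragraph rule described next depends on them.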
  • Paragraph analysis is performed on the text to be identified by using the preset paragraph analysis rule, dividing the text into separate paragraphs.
  • For example, the text to be identified is divided into paragraphs directly at line breaks; where there is no line break but a TAB character follows a period, the subsequent text can be treated as a new paragraph.
  • A weight X1 is set for each paragraph.
  • The first and last paragraphs of the text to be identified can be regarded as core paragraphs, so the weights set for them are higher than those of the other paragraphs. For example, a higher weight of 90% is set for the first and last paragraphs, and a weight of 70% for the middle paragraphs.
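  • The paragraph rule and the example weights above can be sketched as follows. The function names and the regular expression are hypothetical; only the line-break rule, the TAB-after-period rule, and the 90%/70% example values come from the text.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split at line breaks, or at a TAB that directly follows a period."""
    parts = re.split(r"\n+|(?<=[.。])\t", text)
    return [p.strip() for p in parts if p.strip()]


def paragraph_weights(paragraphs: list[str], core: float = 0.9,
                      other: float = 0.7) -> list[float]:
    """Weight X1: first and last paragraphs are core, the rest are not."""
    last = len(paragraphs) - 1
    return [core if i in (0, last) else other for i in range(len(paragraphs))]
```

  • For a text with four paragraphs this yields the weights 0.9, 0.7, 0.7, 0.9; a single-paragraph text gets 0.9.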
  • Each paragraph of the text to be identified is then split into clauses, for example by dividing it into sentences at punctuation marks, and a weight X2 is set for each sentence.
  • Core sentences of a paragraph can be given a higher weight, for example 90% for sentences at the beginning of the paragraph and 70% for intermediate sentences.
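  • A minimal sketch of the clause split and the sentence weight X2, assuming that sentences end at Chinese or Western full stops, question marks, and exclamation marks, and that only the first sentence of a paragraph counts as a core sentence. Both assumptions, and the function names, are illustrative rather than taken from the application.

```python
import re

# Sentence-ending punctuation assumed for this sketch.
_END = "。！？.!?"


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences, keeping the closing punctuation."""
    found = re.findall(rf"[^{_END}]+[{_END}]*", paragraph)
    return [s.strip() for s in found if s.strip()]


def sentence_weights(sentences: list[str], first: float = 0.9,
                     other: float = 0.7) -> list[float]:
    """Weight X2: the opening sentence is weighted higher than the rest."""
    return [first if i == 0 else other for i in range(len(sentences))]
```

  • A production splitter would also need to handle decimals and abbreviations, which this pattern treats as sentence ends.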
  • Each clause in the text to be identified is then processed further: each sentence is segmented into words so that keyword matching can subsequently be performed against each sensitive word in the sensitive lexicon.
  • An N-gram model, a Hidden Markov Model (HMM), or a Maximum Entropy Model may be used for word segmentation, and the word segmentation algorithm may be forward maximum matching, reverse maximum matching, two-way maximum matching, or a shortest-path algorithm.
  • For example, in the N-gram model, assume that the text T is composed of the word sequence W1, W2, W3, ..., Wn.
  • The Bi-Gram (bigram) method is used for word segmentation. Under the bigram assumption, the appearance of a word depends only on the one word immediately before it, and the formula is as follows: $P(T) = P(W_1 W_2 \cdots W_n) \approx P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_2) \cdots P(W_n \mid W_{n-1})$
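  • Of the algorithms listed above, forward maximum matching is the simplest to show. The sketch below is a generic version of that algorithm, not the application's implementation; the vocabulary and the `max_len` limit are hypothetical.

```python
def forward_max_match(sentence: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Greedy left-to-right segmentation against a dictionary."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a dictionary
        # word is found; a single character is always accepted.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens
```

  • With the vocabulary {"文本", "敏感", "信息", "敏感信息"}, the string "文本中敏感信息" is segmented as "文本", "中", "敏感信息": the longest dictionary entry wins at each position.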
  • Step S3: each word segment is matched against each sensitive word in the pre-established sensitive lexicon to obtain the word segments that match a sensitive word in the lexicon.
  • The sensitive lexicon, that is, the sensitive keyword message library, may be established by sensitive type, including: a message library of reactionary, anti-human, and other socially harmful content; a message library of sensitive topics such as religion, politics, and events; a message library of advertisements, scams, and other spam; and a message library of content wholly unrelated to financial activities, such as pornography and gambling.
  • The sensitive keyword message library includes direct forbidden words, that is, forbidden words that must be blocked outright.
  • The sensitive keywords in the sensitive keyword message library can be divided by part of speech into general nouns, auxiliary verbs, auxiliary negative words, auxiliary derogatory words, and so on. Sensitive keywords can also be graded, according to their influence, their frequency of occurrence, or definitions from national publications, company regulations, or system customization. For example, sensitive keywords can be divided into three levels. First-level sensitive keywords are the most serious, for example words that directly express reactionary content or sensitive information that harms people's safety, and must be filtered directly.
  • Second-level sensitive keywords are serious: the information they carry is sensitive but does not necessarily cause direct harm. For such information, an early warning can be issued and the information marked for administrators to review and act on.
  • Third-level sensitive keywords are special vocabulary often associated with sensitive information, such as special terms about politics or the military, and are mostly nouns. Their meaning often has to be judged from the context, and their appearance in a piece of text does not necessarily mean that the text carries harmful sensitive information. Such words therefore also need to be marked, so that it can subsequently be judged from the context whether the text has a negative impact.
  • Each word segment of the text to be identified is matched against the sensitive words in the established sensitive keyword message library, and a corresponding matching weight X3 is assigned according to the matching result. Specifically, the following cases may be included:
  • A direct forbidden word is hit, that is, a word segment of the text to be identified is itself a direct forbidden word in the sensitive keyword message library; the matching weight X3 is then 100%.
  • On a direct-forbidden-word hit, the text to be identified may be determined directly to be bad-information text, and the direct forbidden word in the text is marked.
  • Further cases involve hits on auxiliary verbs and auxiliary derogatory/negative words, each with its own hit weight. That is, in this embodiment sensitive words are divided by part of speech, and when a word segment hits a sensitive word, sensitive words of the other parts of speech are also checked so that bad information is identified more accurately. For example, if a word segment of the text to be identified hits the general noun "government" in the sensitive keyword message library, it is also necessary to check whether a related negative word, such as "down with", appears in the context of "government" or "people" (the previous sentence, the same sentence, or the next sentence), so that bad information can be identified more accurately from the contextual meaning of the text.
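  • The matching cases above might be sketched as follows. Only the 100% weight for a direct forbidden word and the three-sentence context window come from the text; the lexicon entries and the 0.5 and 0.8 weights are placeholders invented for this sketch.

```python
# Placeholder lexicons; real entries would come from the sensitive
# keyword message library.
DIRECT_FORBIDDEN = {"forbidden_word"}
GENERAL_NOUNS = {"government"}
AUX_NEGATIVE = {"overthrow"}


def match_weight(sentences: list[list[str]], idx: int, token: str) -> float:
    """Return X3 for `token`, found in the tokenised sentence sentences[idx]."""
    if token in DIRECT_FORBIDDEN:
        return 1.0  # direct forbidden word: weight 100%
    if token in GENERAL_NOUNS:
        # Look for an auxiliary negative word in the previous, same,
        # or next sentence.
        window = sentences[max(0, idx - 1): idx + 2]
        context = {t for sent in window for t in sent}
        return 0.8 if context & AUX_NEGATIVE else 0.5  # assumed values
    return 0.0  # no sensitive word matched
```

  • Keeping the context check separate from the noun lookup means that a general noun on its own scores lower than the same noun next to a negative word, which is the behaviour the "government" example describes.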
  • Step S4: a preset rule is used to determine whether the text to be identified contains sensitive information, based on the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of that word segment.
  • When the preset rule is used to determine whether the text to be identified contains sensitive information, a P value may be calculated by a formula in which:
  • X1 is the preset paragraph weight of the paragraph in which the matched word segment is located;
  • X2 is the preset sentence weight of the sentence in which the matched word segment is located;
  • X3 is the preset sensitive-word matching weight of the matched word segment.
  • An early warning threshold is set in advance, and the calculated P value is compared with it. If the P value is greater than the preset early warning threshold, the text to be identified is determined to contain sensitive information and an early warning is issued.
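  • The decision rule can be illustrated as below. The formula for P is not reproduced in this text, so the product P = X1 · X2 · X3 used here is an assumption, as are the function names and the 0.5 threshold; the shortcut for a 100% matching weight follows the direct-forbidden-word case described earlier.

```python
def risk_score(x1: float, x2: float, x3: float) -> float:
    """P value for one matched word segment (product form is assumed)."""
    return x1 * x2 * x3


def contains_sensitive(hits: list[tuple[float, float, float]],
                       threshold: float = 0.5) -> bool:
    """Flag the text if any hit is a direct forbidden word or exceeds the threshold.

    Each hit is a tuple (X1, X2, X3) for one matched word segment.
    """
    return any(
        x3 >= 1.0 or risk_score(x1, x2, x3) > threshold
        for x1, x2, x3 in hits
    )
```

  • With these assumed values, a hit in a core paragraph and core sentence with X3 = 0.8 gives P = 0.9 × 0.9 × 0.8 = 0.648 and triggers a warning, while a hit with X3 = 0.5 in a middle paragraph and sentence gives 0.7 × 0.7 × 0.5 = 0.245 and does not.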
  • This embodiment splits the text to be identified into paragraphs, clauses, and word segments, matches each word segment against each sensitive word in the pre-established sensitive lexicon to obtain the matching word segments, and then uses a preset rule to determine whether the text contains sensitive information, based on the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of that word segment.
  • Because the probability that sensitive information appears differs by position in the text, such as in different paragraphs or sentences, each word segment is matched against the sensitive lexicon, a preset sensitive-word matching weight is assigned according to the match, and a preset paragraph weight and sentence weight are set according to the position of the matched word segment in the text. Combining the preset sensitive-word matching weight with the preset paragraph weight and the preset sentence weight makes it possible to determine more accurately and effectively whether the text to be identified contains sensitive information.
  • Moreover, the identification of sensitive information in text is performed automatically, which effectively improves detection efficiency.
  • Further, the method also includes:
  • A system-defined keyword library can also be used, so that screening draws on sensitive lexicons tailored to different business characteristics. That is, for different business systems, keyword matching can compare the word segments of the text to be identified not only with the sensitive keywords in the established common sensitive keyword message library, but also with the sensitive keywords in lexicons that the system itself defines for its business. An early warning is issued when the common-library early warning threshold is reached; text that does not reach the common-library threshold but reaches the system-defined library threshold can also trigger an early warning, which is more flexible and practical.
  • Further, when the identification system 10 is executed by the processor 12, the following is also included:
  • The sensitive keyword message library file can also be exported from the database to a specified path.
  • The system periodically reads the sensitive keyword message library data in the specified path, so that the latest sensitive keywords are updated into the sensitive keyword message library in time.
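  • One way to pick up an updated library file is to reload it whenever its modification time changes. The class below is a sketch under that assumption; the class name, the one-keyword-per-line file format, and the polling approach are not specified by the application.

```python
import os


class LexiconStore:
    """Keeps an in-memory keyword set in sync with an exported file."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.words: set[str] = set()
        self._mtime_ns: int | None = None

    def refresh(self) -> bool:
        """Reload the file if it changed since the last read; return True if reloaded."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns == self._mtime_ns:
            return False
        with open(self.path, encoding="utf-8") as fh:
            # One keyword per line; blank lines are ignored.
            self.words = {line.strip() for line in fh if line.strip()}
        self._mtime_ns = mtime_ns
        return True
```

  • Calling `refresh()` from a periodic task gives the timed update described above without re-reading an unchanged file.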
  • Refer to FIG. 2, a schematic flowchart of an embodiment of the method for identifying sensitive information in text of the present application.
  • The method for identifying sensitive information includes the following steps:
  • Step S10: after the text to be identified is received, it is divided into individual paragraphs by using a preset paragraph analysis rule.
  • Step S20: the individual paragraphs are divided into clauses, and the resulting sentences are processed by word segmentation.
  • The identification system receives a sensitive-information identification request sent by a user, for example a request sent from a terminal such as a mobile phone, a tablet computer, or a self-service terminal device, whether through a client pre-installed on the terminal or through a browser on the terminal.
  • After receiving the request, the identification system first performs a series of processing steps on the text to be identified carried in the request, so that the sensitive information in the text can subsequently be judged accurately. For example, the following processing can be performed:
  • Pre-processing, such as noise removal, is applied to the text to be identified: if the text contains distorted or variant words, they are corrected first; garbled characters and runs of repeated special characters of the same type are removed; and traditional Chinese characters may be converted to simplified Chinese characters.
  • Paragraph analysis is performed on the text to be identified by using the preset paragraph analysis rule, dividing the text into separate paragraphs.
  • For example, the text to be identified is divided into paragraphs directly at line breaks; where there is no line break but a TAB character follows a period, the subsequent text can be treated as a new paragraph.
  • A weight X1 is set for each paragraph.
  • The first and last paragraphs of the text to be identified can be regarded as core paragraphs, so the weights set for them are higher than those of the other paragraphs. For example, a higher weight of 90% is set for the first and last paragraphs, and a weight of 70% for the middle paragraphs.
  • Each paragraph of the text to be identified is then split into clauses, for example by dividing it into sentences at punctuation marks, and a weight X2 is set for each sentence.
  • Core sentences of a paragraph can be given a higher weight, for example 90% for sentences at the beginning of the paragraph and 70% for intermediate sentences.
  • Each clause in the text to be identified is then processed further: each sentence is segmented into words so that keyword matching can subsequently be performed against each sensitive word in the sensitive lexicon.
  • An N-gram model, a Hidden Markov Model (HMM), or a Maximum Entropy Model may be used for word segmentation, and the word segmentation algorithm may be forward maximum matching, reverse maximum matching, two-way maximum matching, or a shortest-path algorithm.
  • For example, in the N-gram model, assume that the text T is composed of the word sequence W1, W2, W3, ..., Wn.
  • The Bi-Gram (bigram) method is used for word segmentation. Under the bigram assumption, the appearance of a word depends only on the one word immediately before it, and the formula is as follows: $P(T) = P(W_1 W_2 \cdots W_n) \approx P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_2) \cdots P(W_n \mid W_{n-1})$
  • Step S30: each word segment is matched against each sensitive word in the pre-established sensitive lexicon to obtain the word segments that match a sensitive word in the lexicon.
  • The sensitive lexicon, that is, the sensitive keyword message library, may be established by sensitive type, including: a message library of reactionary, anti-human, and other socially harmful content; a message library of sensitive topics such as religion, politics, and events; a message library of advertisements, scams, and other spam; and a message library of content wholly unrelated to financial activities, such as pornography and gambling.
  • The sensitive keyword message library includes direct forbidden words, that is, forbidden words that must be blocked outright.
  • The sensitive keywords in the sensitive keyword message library can be divided by part of speech into general nouns, auxiliary verbs, auxiliary negative words, auxiliary derogatory words, and so on. Sensitive keywords can also be graded, according to their influence, their frequency of occurrence, or definitions from national publications, company regulations, or system customization. For example, sensitive keywords can be divided into three levels. First-level sensitive keywords are the most serious, for example words that directly express reactionary content or sensitive information that harms people's safety, and must be filtered directly.
  • Second-level sensitive keywords are serious: the information they carry is sensitive but does not necessarily cause direct harm. For such information, an early warning can be issued and the information marked for administrators to review and act on.
  • Third-level sensitive keywords are special vocabulary often associated with sensitive information, such as special terms about politics or the military, and are mostly nouns. Their meaning often has to be judged from the context, and their appearance in a piece of text does not necessarily mean that the text carries harmful sensitive information. Such words therefore also need to be marked, so that it can subsequently be judged from the context whether the text has a negative impact.
  • Each word segment of the text to be identified is matched against the sensitive words in the established sensitive keyword message library, and a corresponding matching weight X3 is assigned according to the matching result. Specifically, the following cases may be included:
  • A direct forbidden word is hit, that is, a word segment of the text to be identified is itself a direct forbidden word in the sensitive keyword message library; the matching weight X3 is then 100%.
  • On a direct-forbidden-word hit, the text to be identified may be determined directly to be bad-information text, and the direct forbidden word in the text is marked.
  • Further cases involve hits on auxiliary verbs and auxiliary derogatory/negative words, each with its own hit weight. That is, in this embodiment sensitive words are divided by part of speech, and when a word segment hits a sensitive word, sensitive words of the other parts of speech are also checked so that bad information is identified more accurately. For example, if a word segment of the text to be identified hits the general noun "government" in the sensitive keyword message library, it is also necessary to check whether a related negative word, such as "down with", appears in the context of "government" or "people" (the previous sentence, the same sentence, or the next sentence), so that bad information can be identified more accurately from the contextual meaning of the text.
  • Step S40: a preset rule is used to determine whether the text to be identified contains sensitive information, based on the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of that word segment.
  • When the preset rule is used to determine whether the text to be identified contains sensitive information, a P value may be calculated by a formula in which:
  • X1 is the preset paragraph weight of the paragraph in which the matched word segment is located;
  • X2 is the preset sentence weight of the sentence in which the matched word segment is located;
  • X3 is the preset sensitive-word matching weight of the matched word segment.
  • An early warning threshold is set in advance, and the calculated P value is compared with it. If the P value is greater than the preset early warning threshold, the text to be identified is determined to contain sensitive information and an early warning is issued.
  • This embodiment splits the text to be identified into paragraphs, clauses, and word segments, matches each word segment against each sensitive word in the pre-established sensitive lexicon to obtain the matching word segments, and then uses a preset rule to determine whether the text contains sensitive information, based on the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of that word segment.
  • Because the probability that sensitive information appears differs by position in the text, such as in different paragraphs or sentences, each word segment is matched against the sensitive lexicon, a preset sensitive-word matching weight is assigned according to the match, and a preset paragraph weight and sentence weight are set according to the position of the matched word segment in the text. Combining the preset sensitive-word matching weight with the preset paragraph weight and the preset sentence weight makes it possible to determine more accurately and effectively whether the text to be identified contains sensitive information.
  • Moreover, the identification of sensitive information in text is performed automatically, which effectively improves detection efficiency.
  • the method further includes:
  • a system-defined custom keyword library can also be used to filter against sensitive-word lexicons tailored to different business characteristics. That is, for different business systems, keyword matching can compare each word segment of the text to be identified not only with the sensitive keywords in the established sensitive-keyword message library, but also with the sensitive keywords in the system's own custom lexicon for those business characteristics. A warning is issued when the public-library warning threshold is reached; a warning can also be issued when the public-library threshold is not reached but the custom-library threshold is, which is more flexible and practical.
  • the method further includes:
  • the sensitive-keyword message library file can also be exported from the database to a specified path.
  • the system periodically refreshes the sensitive-keyword message library data in the specified path, so that the latest sensitive keywords enter the library in time.
  • the present application also provides a computer-readable storage medium storing a system for identifying sensitive information in text, the system being executable by at least one processor so that the at least one processor performs the steps of the method for identifying sensitive information in text in the foregoing embodiment; the specific implementation of steps S10, S20, S30, and so on of the method is as described above and is not repeated here.
  • the method of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can also be implemented by hardware, but in many cases the former is the better implementation.
  • the part of the technical solution of the present application that is essential or that contributes over the prior art may be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

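The present application relates to a method for identifying sensitive information in text, an electronic device, and a readable storage medium. The method includes: after receiving a text to be identified, dividing the text into independent paragraphs using a preset paragraph analysis rule; splitting each independent paragraph into sentences and performing word segmentation on each sentence; matching each word segment against each sensitive word in a pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon; and determining, using a preset rule, whether the text contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment. The present application can accurately and effectively determine whether the text to be identified contains sensitive information. Moreover, no manual inspection is required: sensitive information in text is identified automatically, which effectively improves detection efficiency.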

Description

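Method for identifying sensitive information in text, electronic device, and readable storage medium
Priority claim
This application claims, under the Paris Convention, priority to Chinese patent application No. CN 201810114518.6, filed on February 6, 2018 and entitled "Method for identifying sensitive information in text, electronic device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, and in particular to a method for identifying sensitive information in text, an electronic device, and a readable storage medium.
Background
At present, the business processes of large Internet finance companies involve large volumes of text, and that text may be mixed with sensitive information of various kinds (such as undesirable content involving pornography, political sensitivity, violence, or terrorism) that must be effectively identified and removed. The traditional approach is manual review of each text to screen out those containing sensitive information; such manual inspection is costly, time-consuming, and inefficient.
Summary
The object of the present application is to provide a method for identifying sensitive information in text, an electronic device, and a readable storage medium, so as to identify text containing sensitive information automatically and effectively.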
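To achieve the above object, a first aspect of the present application provides an electronic device comprising a memory and a processor, the memory storing a system for identifying sensitive information in text that is executable on the processor, the system implementing the following steps when executed by the processor:
after receiving a text to be identified, dividing the text into independent paragraphs using a preset paragraph analysis rule;
splitting each independent paragraph into sentences, and performing word segmentation on each sentence;
matching each word segment against each sensitive word in a pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon;
determining, using a preset rule, whether the text to be identified contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment.
In addition, to achieve the above object, a second aspect of the present application provides a method for identifying sensitive information in text, the method comprising:
after receiving a text to be identified, dividing the text into independent paragraphs using a preset paragraph analysis rule;
splitting each independent paragraph into sentences, and performing word segmentation on each sentence;
matching each word segment against each sensitive word in a pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon;
determining, using a preset rule, whether the text to be identified contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment.
Further, to achieve the above object, a third aspect of the present application provides a computer-readable storage medium storing a system for identifying sensitive information in text, the system being executable by at least one processor so that the at least one processor performs the steps of the method for identifying sensitive information in text described above.
The method, system, and readable storage medium proposed by the present application split the text to be identified into paragraphs, sentences, and word segments, match each word segment against each sensitive word in the pre-established sensitive-word lexicon to obtain the word segments in the text that match sensitive words in the lexicon, and determine, using a preset rule, whether the text contains sensitive information according to the preset paragraph weight set for the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment. Because sensitive information generally appears with different probability at different positions in a text, such as different paragraphs, the present application matches each word segment in the text against each sensitive word in the lexicon, assigns a preset sensitive-word matching weight according to the match, sets a preset paragraph weight according to the position of the matched word segment in the text (the paragraph in which it is located), and combines the sensitive-word matching weight with the paragraph weight for a comprehensive identification, which determines more accurately and effectively whether the text contains sensitive information. Moreover, no manual inspection is required: sensitive information in text is identified automatically, which effectively improves detection efficiency.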
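Brief description of the drawings
Figure 1 is a schematic diagram of the operating environment of a preferred embodiment of the system 10 for identifying sensitive information in text of the present application;
Figure 2 is a schematic flowchart of an embodiment of the method for identifying sensitive information in text of the present application.
Detailed description
To make the objects, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and do not limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of this application without creative effort fall within the scope of protection of this application.
It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only insofar as a person of ordinary skill in the art can implement the combination; where a combination is contradictory or cannot be implemented, it should be regarded as nonexistent and outside the scope of protection claimed by this application.
The present application provides a system for identifying sensitive information in text. Refer to Figure 1, a schematic diagram of the operating environment of a preferred embodiment of the system 10 for identifying sensitive information in text.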
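In this embodiment, the system 10 for identifying sensitive information in text is installed and runs in an electronic device 1. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13. Figure 1 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
The memory 11 is at least one type of computer-readable storage medium. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as its hard disk or main memory. In other embodiments the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 1. The memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 stores the application software installed on the electronic device 1 and various kinds of data, such as the program code of the system 10 for identifying sensitive information in text. The memory 11 may also temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip, used to run the program code or process the data stored in the memory 11, for example to execute the system 10 for identifying sensitive information in text.
In some embodiments the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display 13 shows the information processed in the electronic device 1 and a visual user interface, for example the paragraph segmentation result and the word segmentation result of the text to be identified, the (marked) word segments in the text that match sensitive words in the lexicon, and the final result of whether the text contains sensitive information. The components 11-13 of the electronic device 1 communicate with one another over a system bus.
The system 10 for identifying sensitive information in text includes at least one computer-readable instruction stored in the memory 11, executable by the processor 12 to implement the embodiments of the present application.
When executed by the processor 12, the system 10 for identifying sensitive information in text implements the following steps:
Step S1: after receiving a text to be identified, divide the text into independent paragraphs using a preset paragraph analysis rule.
Step S2: split each independent paragraph into sentences, and perform word segmentation on each sentence.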
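In this embodiment, the system for identifying sensitive information in text receives a sensitive-information identification request from a user that contains the text to be identified. For example, it receives a request sent from a terminal such as a mobile phone, tablet, or self-service terminal, whether from a client pre-installed on the terminal or from a browser on the terminal.
After receiving the request, the system first applies a series of processing steps to the text to be identified, so that the sensitive information in it can subsequently be judged accurately. For example, the following processing can be performed:
The text to be identified is preprocessed to remove textual noise: distorted or variant characters are corrected first; garbled characters and runs of multiple special characters of the same type are removed; and traditional Chinese characters can be converted to simplified characters.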
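A minimal sketch of part of this preprocessing, under stated assumptions: the function name `denoise` is hypothetical, garbled text is approximated by the Unicode replacement character U+FFFD, and "multiple special characters of the same type" is read as a run of three or more identical punctuation or symbol characters, which is collapsed to a single one. Correction of variant characters and traditional-to-simplified conversion need mapping tables and are not shown.

```python
import re


def denoise(text):
    """Remove simple textual noise before paragraph analysis (illustrative only)."""
    # Assumption: decoding failures show up as the replacement character U+FFFD.
    text = text.replace("\ufffd", "")
    # Collapse runs of 3 or more identical non-word, non-space characters into one.
    return re.sub(r"([^\w\s])\1{2,}", r"\1", text)
```

The threshold of three repeats is arbitrary; a production filter would tune it to the noise actually observed.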
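After preprocessing, a preset paragraph analysis rule can be used to analyze the text to be identified and divide it into independent paragraphs. For example, if the text contains line breaks, it is divided into paragraphs directly at the line breaks; if there are no line breaks but a TAB character follows a full stop, the text that follows can be treated as a new paragraph. A weight X1 is set for each paragraph. Experience shows that, to achieve an eye-catching publicity effect, undesirable information in a text most probably appears in the first and last paragraphs and is less likely to appear in the body, that is, the middle paragraphs that carry most of the content. In this embodiment, the first and last paragraphs of the text can therefore be treated as core paragraphs and given a higher weight than the other paragraphs: for example, 90% for the first and last paragraphs and 70% for the middle paragraphs.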
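The paragraph rule and the X1 weights can be sketched as follows. The function names are hypothetical; the weights 0.9 and 0.7 are the example values from the text, and treating both the Chinese and the ASCII full stop as paragraph-final is an assumption.

```python
import re


def split_paragraphs(text):
    """Split at line breaks; if there are none, split where a TAB follows a full stop."""
    if "\n" in text:
        parts = text.split("\n")
    else:
        # The full stop stays with the preceding paragraph; the TAB is dropped.
        parts = re.split(r"(?<=[。.])\t", text)
    return [p.strip() for p in parts if p.strip()]


def paragraph_weights(paragraphs, core=0.9, other=0.7):
    """X1 per paragraph: the first and last paragraphs are core paragraphs."""
    last = len(paragraphs) - 1
    return [core if i in (0, last) else other for i in range(len(paragraphs))]
```

For a three-paragraph text this yields the weights 0.9, 0.7, 0.9; a single-paragraph text is both first and last and gets 0.9.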
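Each paragraph of the text to be identified is split into sentences, for example by punctuation, and a weight X2 is set for each sentence. For example, core-sentence analysis can be applied within a paragraph: the sentence at the start of the paragraph can be given a higher weight of 90% and the middle sentences a weight of 70%.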
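A small sketch of sentence splitting and the X2 weights. The set of sentence-ending punctuation marks and the function names are assumptions; the weights follow the example above, where only the paragraph-initial sentence is raised.

```python
import re

# Assumption: these marks end a sentence; the text only says "by punctuation".
_SENTENCE = re.compile(r"[^。！？!?；;]+[。！？!?；;]*")


def split_sentences(paragraph):
    """Return the sentences of a paragraph, each keeping its closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


def sentence_weights(sentences, first=0.9, other=0.7):
    """X2 per sentence: the paragraph-initial sentence gets the higher weight."""
    return [first if i == 0 else other for i in range(len(sentences))]
```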
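Word segmentation is then performed on each sentence of the text to be identified, so that the word segments can subsequently be keyword-matched against the sensitive words in the lexicon. In this embodiment, an N-gram statistical model (N-gram Model), a hidden Markov model (Hidden Markov Model, HMM), or a maximum entropy model (Maximum Entropy Model) can be used for segmentation, and the segmentation algorithm may include forward maximum matching, reverse maximum matching, bidirectional maximum matching, and the shortest-path algorithm. For example, in the N-gram model, if T consists of the word sequence W1, W2, W3, …, Wn, then P(T)=P(W1W2W3…Wn)=P(W1)P(W2|W1)P(W3|W1W2)…P(Wn|W1W2…Wn-1). Specifically, in an optional implementation, the Bi-Gram segmentation method is used. Under the bigram strategy, the occurrence of a word depends only on the one word immediately preceding it, which gives:
P(T)=P(W1W2W3)=P(W1)P(W2|W1)P(W3|W1W2)≈
P(W1)P(W2|W1)P(W3|W2).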
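To make the bigram approximation concrete, the sketch below scores one candidate segmentation with P(T) ≈ P(W1)·P(W2|W1)·…·P(Wn|Wn-1). It assumes maximum-likelihood estimates from a toy, already-segmented corpus with no smoothing; the function name and the corpus are illustrative and not from the source.

```python
from collections import Counter


def bigram_probability(words, corpus):
    """Score a word sequence under the bigram model using raw corpus counts."""
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total              # P(W1)
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0                          # unseen history, no smoothing
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(Wi | Wi-1)
    return p
```

A segmenter built on this would enumerate candidate segmentations of a sentence and keep the one with the highest score; a practical system would also smooth the counts so unseen pairs do not zero the product.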
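Step S3: match each word segment against each sensitive word in the pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon.
In this embodiment, a sensitive-word lexicon, that is, a sensitive-keyword message library, can be established in advance. For example, libraries can be established by sensitivity type, including: a library of content on reactionary, anti-humanity, and other socially harmful activities; a library of sensitive content on religion, politics, events, and the like; a library of spam such as advertising and fraud; and a library of content on pornography, gambling, and drugs that is entirely unrelated to financial activity.
The sensitive-keyword message library includes directly forbidden words, that is, words that must be blocked outright. The sensitive keywords in the library can also be divided by part of speech into general nouns, auxiliary verbs, auxiliary negative words, auxiliary commendatory words, and so on, and labeled accordingly. Sensitive keywords can further be graded, for example by the impact of the word, by how often it occurs, or by definitions published by the state, set by the company, or customized in the system. For example, sensitive keywords can be divided into three levels. Level-one keywords are the most serious, such as those directly expressing reactionary content or content endangering public safety, and must be filtered outright. Level-two keywords are serious: the information is sensitive but does not necessarily cause direct harm, so a warning prompt can be given and the information marked for an administrator to review and act on. Level-three keywords are special terms often associated with sensitive information, such as terms about politics or the military; they are mostly referential nouns whose meaning usually has to be judged from context, and their appearance in a text does not necessarily mean the text is harmful sensitive information. Such information is therefore also marked, so that the meaning expressed can later be checked in context for possible adverse effects.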
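After the text to be identified has gone through text preprocessing, paragraph analysis, word segmentation, and so on, each of its word segments can be matched against the sensitive words in the established sensitive-keyword message library, and a matching weight X3 is assigned according to the result. Specifically, the following cases are possible:
(1) Exact match. A directly forbidden word is hit, that is, a word segment of the text to be identified directly contains a directly forbidden word from the sensitive-keyword message library; the matching weight X3 is then set to 100%. Alternatively, the text can be judged to be undesirable-information text on the strength of this hit alone, and the directly forbidden word in the text is marked.
(2) Fuzzy match. Where a word segment of the text to be identified partially hits a forbidden word, or contains a synonym or related word of a forbidden word, the weight is X3=x, where x is computed with the string similarity algorithm Jaro-Winkler Distance, a method for computing the similarity between two strings. x is the Jaro distance, given by x=1/3(m/s1+m/s2+1-t/m), where s1 and s2 are the string lengths of the word segment and of the similar forbidden word in the library, m is the length of the common string between the two, that is, the number of matching characters, and t is the difference-removal length (the number of transpositions in the Jaro measure).
(3) Multiple-keyword match. If a word segment of the text to be identified matches a sensitive word in the library, but the matched word has a preset first part of speech, the matching of associated sensitive words with a second or third part of speech associated with the first must also be determined, so as to judge comprehensively whether the text is undesirable-information text. For example, when a word segment hits a general noun in the library, the auxiliary verbs or auxiliary commendatory or negative words associated with that noun must also be checked. The following formula:
w=w1(1+(1-w1)w2*sig1)(1+(1-w1(1+(1-w1)w2*sig))w3*sig2)
is used to compute the final sensitive-word matching weight X3, where w1 and w2 are the hit weights of the auxiliary verb and of the auxiliary commendatory/negative word. That is, in this embodiment sensitive words are divided by part of speech in advance, and when a sensitive word of one part of speech is hit, sensitive words of the associated parts of speech are also checked, so that undesirable information is identified more accurately. For example, if a word segment of the text hits the general noun "government" in the library, it must also be determined whether related negative words such as "overthrow" or "fall from power" are matched within the range around "government" or "people" (for example in the preceding sentence, the same sentence, or the following sentence), so that undesirable information can be identified more accurately from the contextual meaning of the text.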
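Cases (2) and (3) can be sketched as follows. `jaro` implements the plain Jaro similarity that matches the formula x=1/3(m/s1+m/s2+1-t/m), with the standard matching window and t taken as half the number of out-of-order matched characters; the Winkler prefix bonus is not applied, since the formula in the text does not include it. `combined_weight` is one reading of the formula for w, resting on assumptions the source does not confirm: the inner "sig" is taken to be sig1, w1 is taken as the base weight of the matched noun, w2 and w3 as the weights of the two associated word types, and sig1 and sig2 as 0/1 indicators of whether those associated words were found nearby.

```python
def jaro(a, b):
    """Jaro similarity in [0, 1] between strings a and b."""
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    m = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2  # transpositions
    return (m / len(a) + m / len(b) + 1 - t / m) / 3


def boost(u, w, sig):
    """One step u*(1+(1-u)*w*sig); the result stays in [0, 1] for inputs in [0, 1]."""
    return u * (1 + (1 - u) * w * sig)


def combined_weight(w1, w2, w3, sig1, sig2):
    """Assumed reading of the formula for w: apply the boost twice in sequence."""
    return boost(boost(w1, w2, sig1), w3, sig2)
```

Under this reading, a noun weight of 0.5 stays at 0.5 when no associated word is found, rises to 0.7 when only the first associated word (weight 0.8) is found, and to 0.826 when the second (weight 0.6) is found as well.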
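Step S4: determine, using a preset rule, whether the text to be identified contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment.
In this embodiment, when the preset rule is used to determine whether the text to be identified contains sensitive information, a value P can be computed with the following formula:
P=a1*X1+a2*X2+a3*X3
where X1 is the preset paragraph weight of the paragraph in which the matched word segment is located, X2 is the preset sentence weight of the sentence in which the matched word segment is located, and X3 is the preset sensitive-word matching weight of the matched word segment; a1, a2, and a3 are parameter weights set in advance for the paragraph weight X1, the sentence weight X2, and the sensitive-word matching weight X3, for example a1=0.2, a2=0.1, a3=0.7.
A warning threshold is set in advance, and the computed P value is compared with it. If P is greater than the preset warning threshold, the text to be identified is judged to contain sensitive information and a warning is issued.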
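The scoring rule can be sketched directly from the formula. The coefficients are the example values given above; the threshold value 0.8 and the function names are assumptions, since the text does not state a threshold.

```python
def sensitivity_score(x1, x2, x3, a1=0.2, a2=0.1, a3=0.7):
    """P = a1*X1 + a2*X2 + a3*X3 for one matched word segment."""
    return a1 * x1 + a2 * x2 + a3 * x3


def contains_sensitive(x1, x2, x3, threshold=0.8):
    """Warn when P is strictly greater than the preset warning threshold."""
    return sensitivity_score(x1, x2, x3) > threshold
```

For a directly forbidden word (X3 = 1.0) in the first sentence of the first paragraph (X1 = X2 = 0.9), P = 0.18 + 0.09 + 0.7 = 0.97, which exceeds the assumed threshold.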
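Compared with the prior art, this embodiment splits the text to be identified into paragraphs, sentences, and word segments, matches each word segment against each sensitive word in the pre-established sensitive-word lexicon to obtain the word segments in the text that match sensitive words in the lexicon, and determines, using a preset rule, whether the text contains sensitive information according to the preset paragraph weight set for the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment. Because sensitive information generally appears with different probability at different positions in a text, such as different paragraphs or sentences, this embodiment matches each word segment against each sensitive word in the lexicon, assigns a preset sensitive-word matching weight according to the match, sets a preset paragraph weight and a preset sentence weight according to the paragraph and the sentence in which the matched word segment is located, and combines the three weights for a comprehensive identification, which determines more accurately and effectively whether the text contains sensitive information. Moreover, no manual inspection is required: sensitive information in text is identified automatically, which effectively improves detection efficiency.
In an optional embodiment, on the basis of the embodiment of Figure 1 above, the system 10 for identifying sensitive information in text, when executed by the processor 12, further includes:
For different business systems, in addition to matching and filtering with the public sensitive-keyword message library, a system-defined custom keyword library can be used to filter against sensitive-word lexicons tailored to different business characteristics. That is, during keyword matching each word segment of the text to be identified can be matched not only against the sensitive keywords in the established sensitive-keyword message library but also against the sensitive keywords in the system's own custom lexicon for those business characteristics. A warning is issued when the public-library warning threshold is reached; a warning can also be issued when the public-library threshold is not reached but the custom-library threshold is, which is more flexible and practical.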
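The two-library decision can be sketched as below. The function name, both threshold values, and the use of a strict comparison (consistent with the rule for P in step S4) are assumptions.

```python
def warning_source(public_score, custom_score,
                   public_threshold=0.8, custom_threshold=0.6):
    """Return which library triggers the warning, or None if neither does."""
    if public_score > public_threshold:
        return "public"
    if custom_score > custom_threshold:
        return "custom"
    return None
```

Here `public_score` and `custom_score` stand for the P values obtained by matching against the public library and the business-specific library respectively.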
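In an optional embodiment, the system 10 for identifying sensitive information in text, when executed by the processor 12, further includes:
An update policy is applied to the sensitive-keyword message library, for example synchronizing different message libraries into the sensitive-keyword message library online, in real time or on a schedule. The library file can also be exported from the database to a specified path, and the system periodically refreshes the library data in that path, so that the latest sensitive keywords enter the library in time.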
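Figure 2 is a schematic flowchart of an embodiment of the method for identifying sensitive information in text of the present application. The method includes the following steps:
Step S10: after receiving a text to be identified, divide the text into independent paragraphs using a preset paragraph analysis rule.
Step S20: split each independent paragraph into sentences, and perform word segmentation on each sentence.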
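In this embodiment, the system for identifying sensitive information in text receives a sensitive-information identification request from a user that contains the text to be identified. For example, it receives a request sent from a terminal such as a mobile phone, tablet, or self-service terminal, whether from a client pre-installed on the terminal or from a browser on the terminal.
After receiving the request, the system first applies a series of processing steps to the text to be identified, so that the sensitive information in it can subsequently be judged accurately. For example, the following processing can be performed:
The text to be identified is preprocessed to remove textual noise: distorted or variant characters are corrected first; garbled characters and runs of multiple special characters of the same type are removed; and traditional Chinese characters can be converted to simplified characters.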
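After preprocessing, a preset paragraph analysis rule can be used to analyze the text to be identified and divide it into independent paragraphs. For example, if the text contains line breaks, it is divided into paragraphs directly at the line breaks; if there are no line breaks but a TAB character follows a full stop, the text that follows can be treated as a new paragraph. A weight X1 is set for each paragraph. Experience shows that, to achieve an eye-catching publicity effect, undesirable information in a text most probably appears in the first and last paragraphs and is less likely to appear in the body, that is, the middle paragraphs that carry most of the content. In this embodiment, the first and last paragraphs of the text can therefore be treated as core paragraphs and given a higher weight than the other paragraphs: for example, 90% for the first and last paragraphs and 70% for the middle paragraphs.
Each paragraph of the text to be identified is split into sentences, for example by punctuation, and a weight X2 is set for each sentence. For example, core-sentence analysis can be applied within a paragraph: the sentence at the start of the paragraph can be given a higher weight of 90% and the middle sentences a weight of 70%.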
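Word segmentation is then performed on each sentence of the text to be identified, so that the word segments can subsequently be keyword-matched against the sensitive words in the lexicon. In this embodiment, an N-gram statistical model (N-gram Model), a hidden Markov model (Hidden Markov Model, HMM), or a maximum entropy model (Maximum Entropy Model) can be used for segmentation, and the segmentation algorithm may include forward maximum matching, reverse maximum matching, bidirectional maximum matching, and the shortest-path algorithm. For example, in the N-gram model, if T consists of the word sequence W1, W2, W3, …, Wn, then P(T)=P(W1W2W3…Wn)=P(W1)P(W2|W1)P(W3|W1W2)…P(Wn|W1W2…Wn-1). Specifically, in an optional implementation, the Bi-Gram segmentation method is used. Under the bigram strategy, the occurrence of a word depends only on the one word immediately preceding it, which gives:
P(T)=P(W1W2W3)=P(W1)P(W2|W1)P(W3|W1W2)≈
P(W1)P(W2|W1)P(W3|W2).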
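Step S30: match each word segment against each sensitive word in the pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon.
In this embodiment, a sensitive-word lexicon, that is, a sensitive-keyword message library, can be established in advance. For example, libraries can be established by sensitivity type, including: a library of content on reactionary, anti-humanity, and other socially harmful activities; a library of sensitive content on religion, politics, events, and the like; a library of spam such as advertising and fraud; and a library of content on pornography, gambling, and drugs that is entirely unrelated to financial activity.
The sensitive-keyword message library includes directly forbidden words, that is, words that must be blocked outright. The sensitive keywords in the library can also be divided by part of speech into general nouns, auxiliary verbs, auxiliary negative words, auxiliary commendatory words, and so on, and labeled accordingly. Sensitive keywords can further be graded, for example by the impact of the word, by how often it occurs, or by definitions published by the state, set by the company, or customized in the system. For example, sensitive keywords can be divided into three levels. Level-one keywords are the most serious, such as those directly expressing reactionary content or content endangering public safety, and must be filtered outright. Level-two keywords are serious: the information is sensitive but does not necessarily cause direct harm, so a warning prompt can be given and the information marked for an administrator to review and act on. Level-three keywords are special terms often associated with sensitive information, such as terms about politics or the military; they are mostly referential nouns whose meaning usually has to be judged from context, and their appearance in a text does not necessarily mean the text is harmful sensitive information. Such information is therefore also marked, so that the meaning expressed can later be checked in context for possible adverse effects.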
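After the text to be identified has gone through text preprocessing, paragraph analysis, word segmentation, and so on, each of its word segments can be matched against the sensitive words in the established sensitive-keyword message library, and a matching weight X3 is assigned according to the result. Specifically, the following cases are possible:
(1) Exact match. A directly forbidden word is hit, that is, a word segment of the text to be identified directly contains a directly forbidden word from the sensitive-keyword message library; the matching weight X3 is then set to 100%. Alternatively, the text can be judged to be undesirable-information text on the strength of this hit alone, and the directly forbidden word in the text is marked.
(2) Fuzzy match. Where a word segment of the text to be identified partially hits a forbidden word, or contains a synonym or related word of a forbidden word, the weight is X3=x, where x is computed with the string similarity algorithm Jaro-Winkler Distance, a method for computing the similarity between two strings. x is the Jaro distance, given by x=1/3(m/s1+m/s2+1-t/m), where s1 and s2 are the string lengths of the word segment and of the similar forbidden word in the library, m is the length of the common string between the two, that is, the number of matching characters, and t is the difference-removal length (the number of transpositions in the Jaro measure).
(3) Multiple-keyword match. If a word segment of the text to be identified matches a sensitive word in the library, but the matched word has a preset first part of speech, the matching of associated sensitive words with a second or third part of speech associated with the first must also be determined, so as to judge comprehensively whether the text is undesirable-information text. For example, when a word segment hits a general noun in the library, the auxiliary verbs or auxiliary commendatory or negative words associated with that noun must also be checked. The following formula:
w=w1(1+(1-w1)w2*sig1)(1+(1-w1(1+(1-w1)w2*sig))w3*sig2)
is used to compute the final sensitive-word matching weight X3, where w1 and w2 are the hit weights of the auxiliary verb and of the auxiliary commendatory/negative word. That is, in this embodiment sensitive words are divided by part of speech in advance, and when a sensitive word of one part of speech is hit, sensitive words of the associated parts of speech are also checked, so that undesirable information is identified more accurately. For example, if a word segment of the text hits the general noun "government" in the library, it must also be determined whether related negative words such as "overthrow" or "fall from power" are matched within the range around "government" or "people" (for example in the preceding sentence, the same sentence, or the following sentence), so that undesirable information can be identified more accurately from the contextual meaning of the text.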
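Step S40: determine, using a preset rule, whether the text to be identified contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment.
In this embodiment, when the preset rule is used to determine whether the text to be identified contains sensitive information, a value P can be computed with the following formula:
P=a1*X1+a2*X2+a3*X3
where X1 is the preset paragraph weight of the paragraph in which the matched word segment is located, X2 is the preset sentence weight of the sentence in which the matched word segment is located, and X3 is the preset sensitive-word matching weight of the matched word segment; a1, a2, and a3 are parameter weights set in advance for the paragraph weight X1, the sentence weight X2, and the sensitive-word matching weight X3, for example a1=0.2, a2=0.1, a3=0.7.
A warning threshold is set in advance, and the computed P value is compared with it. If P is greater than the preset warning threshold, the text to be identified is judged to contain sensitive information and a warning is issued.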
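Compared with the prior art, this embodiment splits the text to be identified into paragraphs, sentences, and word segments, matches each word segment against each sensitive word in the pre-established sensitive-word lexicon to obtain the word segments in the text that match sensitive words in the lexicon, and determines, using a preset rule, whether the text contains sensitive information according to the preset paragraph weight set for the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment. Because sensitive information generally appears with different probability at different positions in a text, such as different paragraphs or sentences, this embodiment matches each word segment against each sensitive word in the lexicon, assigns a preset sensitive-word matching weight according to the match, sets a preset paragraph weight and a preset sentence weight according to the paragraph and the sentence in which the matched word segment is located, and combines the three weights for a comprehensive identification, which determines more accurately and effectively whether the text contains sensitive information. Moreover, no manual inspection is required: sensitive information in text is identified automatically, which effectively improves detection efficiency.
In an optional embodiment, on the basis of the above embodiment, the method further includes:
For different business systems, in addition to matching and filtering with the public sensitive-keyword message library, a system-defined custom keyword library can be used to filter against sensitive-word lexicons tailored to different business characteristics. That is, during keyword matching each word segment of the text to be identified can be matched not only against the sensitive keywords in the established sensitive-keyword message library but also against the sensitive keywords in the system's own custom lexicon for those business characteristics. A warning is issued when the public-library warning threshold is reached; a warning can also be issued when the public-library threshold is not reached but the custom-library threshold is, which is more flexible and practical.
In an optional embodiment, the method further includes:
An update policy is applied to the sensitive-keyword message library, for example synchronizing different message libraries into the sensitive-keyword message library online, in real time or on a schedule. The library file can also be exported from the database to a specified path, and the system periodically refreshes the library data in that path, so that the latest sensitive keywords enter the library in time.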
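In addition, the present application provides a computer-readable storage medium storing a system for identifying sensitive information in text, the system being executable by at least one processor so that the at least one processor performs the steps of the method for identifying sensitive information in text in the above embodiments. The specific implementation of steps S10, S20, S30, and so on of the method is as described above and is not repeated here.
It should be noted that, in this document, the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, though in many cases the former is the better implementation. On this understanding, the part of the technical solution of the present application that is essential or that contributes over the prior art may be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The preferred embodiments of the present application have been described above with reference to the drawings, which does not limit the scope of rights of the application. The numbering of the embodiments above is for description only and does not indicate their relative merit. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
Those skilled in the art can implement the present application in many variants without departing from its scope and essence; for example, a feature of one embodiment can be used in another embodiment to obtain a further embodiment. Any modification, equivalent replacement, or improvement made within the technical concept of the present application falls within its scope of rights.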

Claims (20)

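  1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a system for identifying sensitive information in text that is executable on the processor, the system implementing the following steps when executed by the processor:
    after receiving a text to be identified, dividing the text into independent paragraphs using a preset paragraph analysis rule;
    splitting each independent paragraph into sentences, and performing word segmentation on each sentence;
    matching each word segment against each sensitive word in a pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon;
    determining, using a preset rule, whether the text to be identified contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment.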
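  2. The electronic device according to claim 1, characterized in that the preset paragraph analysis rule comprises:
    detecting whether the text to be identified contains line breaks; if so, dividing the text directly into independent paragraphs at the detected line breaks; if not, segmenting wherever a TAB character follows a full stop in the text, treating the text after the TAB character as a new paragraph, so that the text is divided in turn into independent paragraphs;
    the step of splitting each independent paragraph into sentences and performing word segmentation on each sentence comprises:
    dividing each independent paragraph into sentences by punctuation, and performing word segmentation on each sentence using the Bi-Gram word segmentation method.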
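  3. The electronic device according to claim 1, characterized in that the system for identifying sensitive information in text, when executed by the processor, further implements:
    setting a corresponding preset paragraph weight for each independent paragraph of the text to be identified, wherein the weight of the first paragraph and/or the last paragraph is higher than the weight of the other paragraphs;
    setting a corresponding preset sentence weight for each sentence of the text to be identified, wherein, within an independent paragraph, the weight of the first and/or last sentence of the paragraph is higher than the weight of the other sentences.
  4. The electronic device according to claim 2, characterized in that the system for identifying sensitive information in text, when executed by the processor, further implements:
    setting a corresponding preset paragraph weight for each independent paragraph of the text to be identified, wherein the weight of the first paragraph and/or the last paragraph is higher than the weight of the other paragraphs;
    setting a corresponding preset sentence weight for each sentence of the text to be identified, wherein, within an independent paragraph, the weight of the first and/or last sentence of the paragraph is higher than the weight of the other sentences.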
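  5. The electronic device according to claim 3, characterized in that the system for identifying sensitive information in text, when executed by the processor, further implements:
    if the sensitive word in the pre-established sensitive-word lexicon matched by a word segment is a preset directly forbidden word, directly determining that the text to be identified contains sensitive information;
    if a word segment of the text to be identified is partially identical to a preset directly forbidden word in the pre-established sensitive-word lexicon, or contains a related synonym of the preset directly forbidden word, computing a corresponding first preset sensitive-word matching weight x for the word segment using a preset string similarity algorithm, with the following formula:
    x=1/3(m/s1+m/s2+1-t/m)
    where s1 and s2 are the string lengths of the word segment and of the corresponding preset directly forbidden word, m is the length of the common string between the word segment and the corresponding preset directly forbidden word, and t is the difference-removal length;
    if a word segment of the text to be identified matches a sensitive word in the pre-established sensitive-word lexicon and the matched sensitive word has a preset first part of speech, detecting the matching of related sensitive words in the lexicon that have a second and/or third part of speech associated with the sensitive word of the first part of speech, and setting a corresponding second preset sensitive-word matching weight for the word segment according to the matching and a preset calculation rule.
  6. The electronic device according to claim 4, characterized in that the system for identifying sensitive information in text, when executed by the processor, further implements:
    if the sensitive word in the pre-established sensitive-word lexicon matched by a word segment is a preset directly forbidden word, directly determining that the text to be identified contains sensitive information;
    if a word segment of the text to be identified is partially identical to a preset directly forbidden word in the pre-established sensitive-word lexicon, or contains a related synonym of the preset directly forbidden word, computing a corresponding first preset sensitive-word matching weight x for the word segment using a preset string similarity algorithm, with the following formula:
    x=1/3(m/s1+m/s2+1-t/m)
    where s1 and s2 are the string lengths of the word segment and of the corresponding preset directly forbidden word, m is the length of the common string between the word segment and the corresponding preset directly forbidden word, and t is the difference-removal length;
    if a word segment of the text to be identified matches a sensitive word in the pre-established sensitive-word lexicon and the matched sensitive word has a preset first part of speech, detecting the matching of related sensitive words in the lexicon that have a second and/or third part of speech associated with the sensitive word of the first part of speech, and setting a corresponding second preset sensitive-word matching weight for the word segment according to the matching and a preset calculation rule.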
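  7. The electronic device according to claim 5, characterized in that determining, using a preset rule, whether the text to be identified contains sensitive information comprises:
    computing a value P with the following formula:
    P=a1*X1+a2*X2+a3*X3
    where X1 is the preset paragraph weight of the paragraph in which the matched word segment is located, X2 is the preset sentence weight of the sentence in which the matched word segment is located, and X3 is the preset sensitive-word matching weight of the matched word segment; a1, a2, and a3 are parameter weights set in advance for the preset paragraph weight X1, the preset sentence weight X2, and the preset sensitive-word matching weight X3;
    comparing the computed P value with a preset warning threshold, and if P is greater than the preset warning threshold, determining that the text to be identified contains sensitive information.
  8. The electronic device according to claim 6, characterized in that determining, using a preset rule, whether the text to be identified contains sensitive information comprises:
    computing a value P with the following formula:
    P=a1*X1+a2*X2+a3*X3
    where X1 is the preset paragraph weight of the paragraph in which the matched word segment is located, X2 is the preset sentence weight of the sentence in which the matched word segment is located, and X3 is the preset sensitive-word matching weight of the matched word segment; a1, a2, and a3 are parameter weights set in advance for the preset paragraph weight X1, the preset sentence weight X2, and the preset sensitive-word matching weight X3;
    comparing the computed P value with a preset warning threshold, and if P is greater than the preset warning threshold, determining that the text to be identified contains sensitive information.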
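  9. A method for identifying sensitive information in text, characterized in that the method comprises:
    after receiving a text to be identified, dividing the text into independent paragraphs using a preset paragraph analysis rule;
    splitting each independent paragraph into sentences, and performing word segmentation on each sentence;
    matching each word segment against each sensitive word in a pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon;
    determining, using a preset rule, whether the text to be identified contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment.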
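  10. The method for identifying sensitive information in text according to claim 9, characterized in that the preset paragraph analysis rule comprises:
    detecting whether the text to be identified contains line breaks; if so, dividing the text directly into independent paragraphs at the detected line breaks; if not, segmenting wherever a TAB character follows a full stop in the text, treating the text after the TAB character as a new paragraph, so that the text is divided in turn into independent paragraphs;
    the step of splitting each independent paragraph into sentences and performing word segmentation on each sentence comprises:
    dividing each independent paragraph into sentences by punctuation, and performing word segmentation on each sentence using the Bi-Gram word segmentation method.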
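  11. The method for identifying sensitive information in text according to claim 9, characterized in that
    the method further comprises:
    setting a corresponding preset paragraph weight for each independent paragraph of the text to be identified, wherein the weight of the first paragraph and/or the last paragraph is higher than the weight of the other paragraphs;
    setting a corresponding preset sentence weight for each sentence of the text to be identified, wherein, within an independent paragraph, the weight of the first and/or last sentence of the paragraph is higher than the weight of the other sentences.
  12. The method for identifying sensitive information in text according to claim 10, characterized in that
    the method further comprises:
    setting a corresponding preset paragraph weight for each independent paragraph of the text to be identified, wherein the weight of the first paragraph and/or the last paragraph is higher than the weight of the other paragraphs;
    setting a corresponding preset sentence weight for each sentence of the text to be identified, wherein, within an independent paragraph, the weight of the first and/or last sentence of the paragraph is higher than the weight of the other sentences.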
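  13. The method for identifying sensitive information in text according to claim 11, characterized in that the method further comprises:
    if the sensitive word in the pre-established sensitive-word lexicon matched by a word segment is a preset directly forbidden word, directly determining that the text to be identified contains sensitive information;
    if a word segment of the text to be identified is partially identical to a preset directly forbidden word in the pre-established sensitive-word lexicon, or contains a related synonym of the preset directly forbidden word, computing a corresponding first preset sensitive-word matching weight x for the word segment using a preset string similarity algorithm, with the following formula:
    x=1/3(m/s1+m/s2+1-t/m)
    where s1 and s2 are the string lengths of the word segment and of the corresponding preset directly forbidden word, m is the length of the common string between the word segment and the corresponding preset directly forbidden word, and t is the difference-removal length;
    if a word segment of the text to be identified matches a sensitive word in the pre-established sensitive-word lexicon and the matched sensitive word has a preset first part of speech, detecting the matching of related sensitive words in the lexicon that have a second and/or third part of speech associated with the sensitive word of the first part of speech, and setting a corresponding second preset sensitive-word matching weight for the word segment according to the matching and a preset calculation rule.
  14. The method for identifying sensitive information in text according to claim 12, characterized in that the method further comprises:
    if the sensitive word in the pre-established sensitive-word lexicon matched by a word segment is a preset directly forbidden word, directly determining that the text to be identified contains sensitive information;
    if a word segment of the text to be identified is partially identical to a preset directly forbidden word in the pre-established sensitive-word lexicon, or contains a related synonym of the preset directly forbidden word, computing a corresponding first preset sensitive-word matching weight x for the word segment using a preset string similarity algorithm, with the following formula:
    x=1/3(m/s1+m/s2+1-t/m)
    where s1 and s2 are the string lengths of the word segment and of the corresponding preset directly forbidden word, m is the length of the common string between the word segment and the corresponding preset directly forbidden word, and t is the difference-removal length;
    if a word segment of the text to be identified matches a sensitive word in the pre-established sensitive-word lexicon and the matched sensitive word has a preset first part of speech, detecting the matching of related sensitive words in the lexicon that have a second and/or third part of speech associated with the sensitive word of the first part of speech, and setting a corresponding second preset sensitive-word matching weight for the word segment according to the matching and a preset calculation rule.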
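  15. The method for identifying sensitive information in text according to claim 13, characterized in that determining, using a preset rule, whether the text to be identified contains sensitive information comprises:
    computing a value P with the following formula:
    P=a1*X1+a2*X2+a3*X3
    where X1 is the preset paragraph weight of the paragraph in which the matched word segment is located, X2 is the preset sentence weight of the sentence in which the matched word segment is located, and X3 is the preset sensitive-word matching weight of the matched word segment; a1, a2, and a3 are parameter weights set in advance for the preset paragraph weight X1, the preset sentence weight X2, and the preset sensitive-word matching weight X3;
    comparing the computed P value with a preset warning threshold, and if P is greater than the preset warning threshold, determining that the text to be identified contains sensitive information.
  16. The method for identifying sensitive information in text according to claim 14, characterized in that determining, using a preset rule, whether the text to be identified contains sensitive information comprises:
    computing a value P with the following formula:
    P=a1*X1+a2*X2+a3*X3
    where X1 is the preset paragraph weight of the paragraph in which the matched word segment is located, X2 is the preset sentence weight of the sentence in which the matched word segment is located, and X3 is the preset sensitive-word matching weight of the matched word segment; a1, a2, and a3 are parameter weights set in advance for the preset paragraph weight X1, the preset sentence weight X2, and the preset sensitive-word matching weight X3;
    comparing the computed P value with a preset warning threshold, and if P is greater than the preset warning threshold, determining that the text to be identified contains sensitive information.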
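  17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a system for identifying sensitive information in text, the system implementing the following steps when executed by a processor:
    after receiving a text to be identified, dividing the text into independent paragraphs using a preset paragraph analysis rule;
    splitting each independent paragraph into sentences, and performing word segmentation on each sentence;
    matching each word segment against each sensitive word in a pre-established sensitive-word lexicon to obtain the word segments that match sensitive words in the lexicon;
    determining, using a preset rule, whether the text to be identified contains sensitive information, according to the preset paragraph weight of the paragraph in which each matched word segment is located and the preset sensitive-word matching weight of the matched word segment.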
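  18. The computer-readable storage medium according to claim 17, characterized in that the preset paragraph analysis rule comprises:
    detecting whether the text to be identified contains line breaks; if so, dividing the text directly into independent paragraphs at the detected line breaks; if not, segmenting wherever a TAB character follows a full stop in the text, treating the text after the TAB character as a new paragraph, so that the text is divided in turn into independent paragraphs;
    the step of splitting each independent paragraph into sentences and performing word segmentation on each sentence comprises:
    dividing each independent paragraph into sentences by punctuation, and performing word segmentation on each sentence using the Bi-Gram word segmentation method.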
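  19. The computer-readable storage medium according to claim 17, characterized in that the system for identifying sensitive information in text, when executed by the processor, further implements:
    setting a corresponding preset paragraph weight for each independent paragraph of the text to be identified, wherein the weight of the first paragraph and/or the last paragraph is higher than the weight of the other paragraphs;
    setting a corresponding preset sentence weight for each sentence of the text to be identified, wherein, within an independent paragraph, the weight of the first and/or last sentence of the paragraph is higher than the weight of the other sentences.
  20. The computer-readable storage medium according to claim 18, characterized in that the system for identifying sensitive information in text, when executed by the processor, further implements:
    setting a corresponding preset paragraph weight for each independent paragraph of the text to be identified, wherein the weight of the first paragraph and/or the last paragraph is higher than the weight of the other paragraphs;
    setting a corresponding preset sentence weight for each sentence of the text to be identified, wherein, within an independent paragraph, the weight of the first and/or last sentence of the paragraph is higher than the weight of the other sentences.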
PCT/CN2018/089717 2018-02-06 2018-06-03 文本中敏感信息的鉴定方法、电子装置及可读存储介质 WO2019153605A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810114518.6 2018-02-06
CN201810114518.6A CN108519970B (zh) 2018-02-06 2018-02-06 文本中敏感信息的鉴定方法、电子装置及可读存储介质

Publications (1)

Publication Number Publication Date
WO2019153605A1 true WO2019153605A1 (zh) 2019-08-15

Family

ID=63432818

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/089717 WO2019153605A1 (zh) 2018-02-06 2018-06-03 文本中敏感信息的鉴定方法、电子装置及可读存储介质

Country Status (2)

Country Link
CN (1) CN108519970B (zh)
WO (1) WO2019153605A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737398A (zh) * 2020-05-26 2020-10-02 北京百度网讯科技有限公司 文本中的敏感词的检索方法、装置、电子设备及存储介质
CN113010637A (zh) * 2021-02-24 2021-06-22 世纪龙信息网络有限责任公司 一种文本审核方法及装置

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446288A (zh) * 2018-10-18 2019-03-08 重庆邮电大学 一种基于Spark互联网涉密地图检测算法
CN109284503B (zh) * 2018-10-22 2023-08-18 传神语联网网络科技股份有限公司 翻译语句结束判断方法与系统
CN109614608A (zh) * 2018-10-26 2019-04-12 平安科技(深圳)有限公司 电子装置、文本信息检测方法及存储介质
CN109657228B (zh) * 2018-10-31 2023-06-06 北京三快在线科技有限公司 一种敏感文本确定方法及装置
CN109815395B (zh) * 2018-12-26 2021-06-08 北京中科闻歌科技股份有限公司 网页垃圾信息过滤方法、装置及存储介质
CN111882371A (zh) * 2019-04-15 2020-11-03 阿里巴巴集团控股有限公司 内容信息处理、图文内容处理方法、计算机设备、介质
CN110209796B (zh) * 2019-04-29 2022-02-08 北京印刷学院 一种敏感词检测过滤方法、装置与电子设备
CN110110715A (zh) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 文本检测模型训练方法、文本区域、内容确定方法和装置
CN110457428B (zh) * 2019-06-26 2023-07-04 北京印刷学院 一种敏感词检测过滤方法、装置与电子设备
CN110516255A (zh) * 2019-08-30 2019-11-29 广州华多网络科技有限公司 一种角色姓名提取方法及系统
CN110674247A (zh) * 2019-09-23 2020-01-10 广州虎牙科技有限公司 弹幕信息的拦截方法、装置、存储介质及设备
CN111062208B (zh) * 2019-12-13 2023-05-12 建信金融科技有限责任公司 一种文件审核的方法、装置、设备及存储介质
CN111147465A (zh) * 2019-12-18 2020-05-12 深圳市任子行科技开发有限公司 对https内容进行审计的方法及代理服务器
CN111191443A (zh) * 2019-12-19 2020-05-22 深圳壹账通智能科技有限公司 基于区块链的敏感词检测方法、装置、计算机设备和存储介质
CN111079029B (zh) * 2019-12-20 2023-11-21 珠海格力电器股份有限公司 敏感账号的检测方法、存储介质和计算机设备
CN111460814A (zh) * 2020-03-10 2020-07-28 中国平安人寿保险股份有限公司 敏感信息检测方法、装置、终端及介质
CN113536765A (zh) * 2020-04-16 2021-10-22 北京有限元科技有限公司 对话术文本信息进行检测的方法、装置以及存储介质
CN111783447B (zh) * 2020-05-28 2023-02-03 中国平安财产保险股份有限公司 基于ngram距离的敏感词检测方法、装置、设备及存储介质
CN111797214A (zh) * 2020-06-24 2020-10-20 深圳壹账通智能科技有限公司 基于faq数据库的问题筛选方法、装置、计算机设备及介质
CN111881667B (zh) * 2020-07-24 2023-09-29 上海烽烁科技有限公司 一种敏感文本审核方法
CN112016317A (zh) * 2020-09-07 2020-12-01 平安科技(深圳)有限公司 基于人工智能的敏感词识别方法、装置及计算机设备
CN112100655A (zh) * 2020-09-09 2020-12-18 北京明朝万达科技股份有限公司 一种数据检测方法、装置、电子设备及可读存储介质
CN112183053A (zh) * 2020-10-10 2021-01-05 湖南快乐阳光互动娱乐传媒有限公司 一种数据处理方法及装置
CN112949285B (zh) * 2020-10-13 2024-04-05 广州市百果园网络科技有限公司 语句文本检测方法、系统、电子设备及存储介质
CN112417103A (zh) * 2020-12-02 2021-02-26 百度国际科技(深圳)有限公司 用于检测敏感词的方法、装置、设备和存储介质
CN112905743B (zh) * 2021-02-20 2023-08-01 北京百度网讯科技有限公司 文本对象检测的方法、装置、电子设备和存储介质
CN113221554A (zh) * 2021-04-27 2021-08-06 北京字跳网络技术有限公司 文本处理方法、装置、电子设备和存储介质
CN114140077A (zh) * 2021-11-30 2022-03-04 宁波帮企一把企业服务平台有限公司 一种政府政策解构方法、装置、计算机设备和存储介质
CN115408490A (zh) * 2022-11-01 2022-11-29 广东省信息工程有限公司 一种基于知识库的官文校对方法、系统、设备及存储介质
CN116701614B (zh) * 2023-08-02 2024-07-19 南京壹行科技有限公司 一种用于文本智能采集的敏感数据模型建立方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154174A1 (en) * 2010-05-13 2015-06-04 Grammarly, Inc. Systems and methods for advanced grammar checking
CN105574090A (zh) * 2015-12-10 2016-05-11 北京中科汇联科技股份有限公司 一种敏感词过滤方法及系统
CN106445998A (zh) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 一种基于敏感词的文本内容审核方法及系统
CN107633380A (zh) * 2017-08-30 2018-01-26 北京明朝万达科技股份有限公司 一种数据防泄漏系统的任务审批方法和系统

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8321204B2 (en) * 2008-08-26 2012-11-27 Saraansh Software Solutions Pvt. Ltd. Automatic lexicon generation system for detection of suspicious e-mails from a mail archive
CN104731797B (zh) * 2013-12-19 2018-09-18 北京新媒传信科技有限公司 一种提取关键词的方法及装置
CN104866465B (zh) * 2014-02-25 2017-11-03 腾讯科技(深圳)有限公司 敏感文本检测方法及装置
KR101741509B1 (ko) * 2015-07-01 2017-06-15 지속가능발전소 주식회사 뉴스의 데이터마이닝을 통한 기업 평판 분석 장치 및 방법, 그 방법을 수행하기 위한 기록 매체
CN107357777B (zh) * 2017-06-16 2020-07-07 中科鼎富(北京)科技发展有限公司 提取标签信息的方法和装置

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737398A (zh) * 2020-05-26 2020-10-02 北京百度网讯科技有限公司 文本中的敏感词的检索方法、装置、电子设备及存储介质
CN111737398B (zh) * 2020-05-26 2023-06-23 北京百度网讯科技有限公司 文本中的敏感词的检索方法、装置、电子设备及存储介质
CN113010637A (zh) * 2021-02-24 2021-06-22 世纪龙信息网络有限责任公司 一种文本审核方法及装置

Also Published As

Publication number Publication date
CN108519970A (zh) 2018-09-11
CN108519970B (zh) 2021-08-31

Similar Documents

Publication Publication Date Title
WO2019153605A1 (zh) 文本中敏感信息的鉴定方法、电子装置及可读存储介质
WO2019184217A1 (zh) 热点事件分类方法、装置及存储介质
US11544459B2 (en) Method and apparatus for determining feature words and server
WO2019169769A1 (zh) 广告图片鉴定方法、电子装置及可读存储介质
KR101627592B1 (ko) 비밀 정보의 검출
CN103336766B (zh) 短文本垃圾识别以及建模方法和装置
US10430610B2 (en) Adaptive data obfuscation
US8380488B1 (en) Identifying a property of a document
US11429790B2 (en) Automated detection of personal information in free text
US20070067280A1 (en) System for recognising and classifying named entities
JP2008539476A (ja) スペル提示の生成方法およびシステム
US10558755B2 (en) Automated document analysis comprising company name recognition
CN107357824B (zh) 信息处理方法、服务平台及计算机存储介质
CN109241523B (zh) 变体作弊字段的识别方法、装置及设备
US9692771B2 (en) System and method for estimating typicality of names and textual data
CN108763202B (zh) 识别敏感文本的方法、装置、设备及可读存储介质
CN111858894A (zh) 语义缺失的识别方法及装置、电子设备、存储介质
CN112395866B (zh) 报关单数据匹配方法及装置
US20220164796A1 (en) System, method, and computer program product for generating enhanced n-gram models
WO2018201599A1 (zh) 基于社交信息的风险事件的识别系统、方法、电子装置及存储介质
JP5849960B2 (ja) 含意判定装置、方法、およびプログラム
US20160078072A1 (en) Term variant discernment system and method therefor
CN110019816B (zh) 一种文本审核中的规则提取方法及系统
CN115935962A (zh) 文本信息的处理方法、装置、电子设备及介质
CN114064847A (zh) 一种文本检测方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18905883

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18905883

Country of ref document: EP

Kind code of ref document: A1