CN111859032A

CN111859032A - Method and device for detecting character-breaking sensitive words of short message and computer storage medium

Info

Publication number: CN111859032A
Application number: CN202010699131.9A
Authority: CN
Inventors: 刘超; 刘霖雯
Original assignee: Beijing Beidou Tianxun Technology Co Ltd
Current assignee: Beijing Beidou Tianxun Technology Co Ltd
Priority date: 2020-07-20
Filing date: 2020-07-20
Publication date: 2020-10-30

Abstract

The invention discloses a method and a device for detecting character-splitting sensitive words in short messages, and a computer storage medium, and mainly relates to information auditing technology. The structural composition analysis of sensitive words is integrated into sensitive-word analysis and detection, so that sensitive words formed by splitting characters can be detected and are not missed. In addition, when a message contains only one split character, the contextual association between the sensitive word and its surrounding words is analyzed to judge its sensitivity. This avoids detection errors caused by over-recognition of split characters, allows a message containing a single split character to be accurately judged as normal or not, and improves judgment accuracy. The invention also re-segments messages that do not directly contain split characters, in order to analyze whether they contain hidden split characters, and judges whether these form sensitive words according to mutual information (MI). This further improves the detection accuracy for character-splitting sensitive words and prevents both missed detections caused by under-recognition and false detections caused by over-recognition.

Description

Method and device for detecting character-breaking sensitive words of short message and computer storage medium

Technical Field

The invention relates to the technical field of information auditing, and in particular to a method and a device for detecting character-splitting sensitive words in short messages, and a computer storage medium.

Background

The handling of spam information relies mainly on technical means. The mature mainstream techniques at present are blacklist and whitelist techniques, rule-based filtering, and methods based on probabilistic statistical analysis. The first two are simple but labor-intensive: blacklists and whitelists have difficulty handling neutral words, and rule-based filtering is accurate but requires considerable expertise and labor to write the audit and filtering rules. Among the methods based on probabilistic statistical analysis, the current mainstream is a filtering system based on a Bayesian classification algorithm. It was first applied to English; because Chinese has no spaces separating words, processing text character by character greatly reduces the reliability of spam filtering. The reason is that the basic unit of meaning in Chinese is the word, so processing depends on a lexicon and, when the lexicon is limited, on a word segmentation system; however, a word segmentation system has difficulty handling the many variations of sensitive words. Many sensitive words in spam are deformed or split so that the word segmentation system cannot identify them. Sensitive-word recognition is therefore particularly important in spam identification.

At present, mainstream approaches are mainly keyword-based, dictionary-based or database-based, and do not consider the structural composition of sensitive words or the contextual associations of sensitive words. As a result, a sensitive word is difficult to identify once its characters have been split.

It must also be noted, however, that over-recognition has to be prevented, because some split-character sequences are meaningful in their own right: words such as "women's do things", "women" and "tall moon" are also common words that occur frequently.

Therefore, how to identify split characters without over-recognition or missed recognition has become a major problem in this technical field.

Disclosure of Invention

The invention aims to provide a method and a device for detecting character-splitting sensitive words in short messages, and a computer storage medium, so as to solve the technical problems in the prior art that deformed sensitive words of the character-splitting type are poorly audited, with a low recognition rate or with over-recognition. To address error-prone cases, the invention applies a context window to common words that are simple but easily misjudged, and then processes the message according to the result from the context window; in particular, the semantic distance calculated with a word2vec model is used as an influence factor on the subsequent audit result.

The technical scheme adopted by the invention is as follows:

A method for detecting character-splitting sensitive words in short messages comprises the following steps:

step one, establishing a character-splitting sensitive word information base;

step two, establishing a retrieval tree of character-splitting sensitive words for the character-splitting sensitive word information base;

step three, model training: preparing training corpora, and training an interception word2vec model on the interception corpus and a release word2vec model on the release corpus;

step three, acquiring information sent by a user;

step four, preprocessing each piece of acquired information sent by the user;

step five, performing character-by-character retrieval analysis of sensitive words on the preprocessed information through the retrieval tree, in a character-stream manner; entering step six if the information does not contain character-splitting sensitive word content, and entering step seven if it does;

step six, dividing the sentence into N segments, each segment being a window word, and counting the occurrence frequency P of the relevant words;

calculating the mutual information MI between the current sensitive word and the words in the preceding and following windows, wherein the calculation formula is as follows:

wherein x is a sensitive word, y is a word co-occurring with x, P(x) is the probability of occurrence of the character-splitting sensitive word, P(y) is the probability of occurrence of the word co-occurring with the sensitive word, and P(x, y) is the probability that the sensitive word and the co-occurring word occur together;
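The formula referred to above is not reproduced in this text. Assuming that the standard pointwise mutual information is intended, the symbols defined above would combine as follows. The base-2 logarithm is an assumption, and the original may use a different base or an additional weighting:

```latex
\mathrm{MI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
```

Under this form, MI is positive when x and y co-occur more often than independence would predict, and negative when they co-occur less often.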

comparing the obtained mutual information MI with a preset threshold; when MI is smaller than the preset threshold, segmenting the preprocessed data again, until MI is greater than or equal to the threshold, whereupon the data is sent to step seven; if MI never reaches the threshold, entering step eight;

step seven, when one message contains three or more split-character words, the message fails the audit and is intercepted;

when one message contains one or two split-character words and more than two ordinary (non-split) sensitive words, the message fails the audit and is intercepted;

when one message contains one split-character word, the message is fed to the word2vec model; the n words before and after the split-character word, which serves as the core, are recorded, and each word is given a weight value Y according to its distance d from the split-character word, Y = n-d, so that the weights run n, n-1, ..., 1 from the words nearest the split-character word to the words farthest from it; the following is then calculated:

result = A*Y_{Interception word} - Y_{Released word},

wherein A is a number not less than 1, Y_{Interception word} denotes the weight values of the interception words, and Y_{Released word} denotes the weight values of the released words; if result is greater than or equal to 0, the message fails the audit and is intercepted; if result is less than 0, the data is sent to step eight;

and step eight, auditing the information: intercepting it if it contains sensitive words, and displaying it if it does not.
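The three rules of step seven can be sketched as a small routing function. This is a minimal sketch under stated assumptions, not the claimed implementation: the name `route_step_seven`, its parameters and the labels "intercept" and "audit" are hypothetical. The text does not say what happens to a message with two split-character words and two or fewer ordinary sensitive words; sending it on to step eight is an assumption.

```python
def route_step_seven(split_word_count, plain_sensitive_count, result_fn):
    """Return "intercept" or "audit" (i.e. go on to step eight).

    result_fn is a callable returning A*Y_intercept - Y_release; it is
    only evaluated in the single split-character-word case.
    """
    # Rule 1: three or more split-character words.
    if split_word_count >= 3:
        return "intercept"
    # Rule 2: one or two split-character words plus more than two
    # ordinary sensitive words.
    if 1 <= split_word_count <= 2 and plain_sensitive_count > 2:
        return "intercept"
    # Rule 3: exactly one split-character word, decided by context.
    if split_word_count == 1:
        return "intercept" if result_fn() >= 0 else "audit"
    # Assumption: every remaining case is passed to step eight.
    return "audit"


decision = route_step_seven(1, 0, lambda: -0.5)
```

Passing the context calculation as a callable keeps the comparatively expensive window analysis out of the cases that the counting rules already decide.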

A device for detecting character-splitting sensitive words in short messages comprises: an information receiving module, a content preprocessing module, a sensitive word character analysis module, a sensitive word mutual information MI calculation module, a character-splitting processing module, an information auditing module, an information display module and an information interception module, wherein,

the information receiving module is used for receiving the information content sent by each user and sending the information content to the content preprocessing module;

the content preprocessing module is used for unifying the character formats in the information from the information receiving module, removing meaningless characters and redundant spaces, and sending the processed information to the sensitive word character analysis module;

the sensitive word character analysis module is used for analyzing whether the information from the content preprocessing module contains split characters, and for sending the information to the sensitive word mutual information MI calculation module or the character-splitting processing module;

the sensitive word mutual information MI calculation module is used for calculating the mutual information of the sensitive words in the information, and for sending the information to the character-splitting processing module or the information auditing module;

the character-splitting processing module is used for sending the information to the information interception module or the information auditing module according to rules;

the information auditing module is used for auditing whether the information contains sensitive words or not and sending the information to the information display module or the information intercepting module;

the information display module is used for displaying the information which passes the audit;

the information interception module is used for inputting the intercepted information into an interception information base.
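The hand-offs between the modules listed above can be illustrated by tracing the path of a single message. This is only a sketch: the function `module_path`, its boolean parameters and the module labels are hypothetical names, and each boolean stands in for the decision that the corresponding module would actually compute.

```python
def module_path(has_split_chars, mi_reaches_threshold,
                rules_intercept, audit_finds_sensitive):
    """Return the ordered list of modules a message passes through."""
    path = ["receiving", "preprocessing", "character_analysis"]

    def audit(path):
        # The auditing module ends in either interception or display.
        path.append("auditing")
        path.append("interception" if audit_finds_sensitive else "display")
        return path

    if not has_split_chars:
        # No visible split characters: look for hidden ones via MI.
        path.append("mi_calculation")
        if not mi_reaches_threshold:
            return audit(path)

    path.append("splitting_processing")
    if rules_intercept:
        path.append("interception")
        return path
    return audit(path)


example = module_path(False, True, False, True)
```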

A computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

step one, establishing a character-splitting sensitive word information base;

step three, acquiring information sent by a user;

step four, preprocessing each piece of acquired information sent by the user;

result = A*Y_{Interception word} - Y_{Released word},

wherein A is a number not less than 1, Y_{Interception word} denotes the weight values of the interception words, and Y_{Released word} denotes the weight values of the released words; if result is greater than or equal to 0, the message fails the audit and is intercepted; if result is less than 0, the data is sent to step eight;

By adopting the above technical solution, the invention has the following beneficial effects:

the method, the device and the computer storage medium for detecting character-splitting sensitive words in short messages integrate the structural composition analysis of sensitive words into sensitive-word analysis and detection, so that sensitive words formed by splitting characters can be detected and are not missed. In addition, when a message contains only one split character, the contextual association between the sensitive word and its surrounding words is analyzed to judge its sensitivity. This avoids detection errors caused by over-recognition of split characters, allows a message containing a single split character to be accurately judged as normal or not, and improves judgment accuracy. The invention also re-segments messages that do not directly contain split characters, in order to analyze whether they contain hidden split characters, and judges whether these form sensitive words according to mutual information (MI). This further improves the detection accuracy for character-splitting sensitive words and prevents both missed detections caused by under-recognition and false detections caused by over-recognition.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the proportions of the components in the drawings of this specification do not represent the proportions in an actual design; the drawings are only schematic diagrams of structure or position, in which:

FIG. 1 is a schematic diagram of the present invention;

FIG. 2 is a schematic flow diagram of the present invention;

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

The present invention will be described in detail with reference to fig. 1 and 2.

Example 1

step one, establishing a character-splitting sensitive word information base;

Chinese characters are structurally divided into single-component characters and compound characters. A single-component character consists of only one component rather than two or more. Among the commonly used first-level Chinese characters there are 280 single-component characters, most of which are simple pictographs and ideographs. Because these characters evolved from drawings, each forms an indivisible whole, such as the characters for day, month, mountain, water and cow. A compound character is a Chinese character composed of two or more single-component characters. According to their composition, compound characters have the following eight structures: (1) top-bottom; (2) top-middle-bottom; (3) left-right; (4) left-middle-right; (5) fully enclosed; (6) semi-enclosed; (7) interpenetrating; (8) triangular.

In current communication software and on all major platforms, message text is laid out horizontally. Splitting a character with a top-bottom structure and arranging the parts in horizontal order greatly reduces intelligibility, whereas a character split left and right still clearly conveys the original meaning. Actual analysis likewise shows that the split characters used in sensitive words are mostly characters with a left-right structure, with a small number of left-middle-right characters split into three parts; for example, "tide" is split into "tall and tall month", and "do" is split into "ancient character".

In addition, the entries of the various sensitive-word lexicons issued by the network supervision authorities are collected for statistical analysis. Single-component characters are removed, the remaining characters are split according to their structure, and the split forms are combined with the unsplit forms to serve as keyword combinations. Because message content may contain phrases in which some characters are split and others are not, the filtering keyword list is preprocessed in order to improve the recall rate of auditing: for each phrase containing splittable characters, all possible split arrangements are enumerated, and the results form a new keyword list. For example, the lexicon will contain various combinations such as "soil-Zeng gangster-standing grain blending invoice" and "soil-Zeng-people-standing grain blending invoice".
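The enumeration of split arrangements described above can be illustrated as follows. This is a minimal sketch: the table `SPLIT_TABLE` and the function `enumerate_variants` are hypothetical names, and the two table entries are illustrative left-right decompositions chosen for the example, not entries taken from the lexicons mentioned in the text.

```python
from itertools import product

# Hypothetical split table: whole character -> its left/right components.
# A real table would be derived from a character-structure dictionary.
SPLIT_TABLE = {
    "好": "女子",
    "胡": "古月",
}


def enumerate_variants(phrase):
    """Return every variant of `phrase` in which each splittable character
    is either kept whole or replaced by its components."""
    options = []
    for ch in phrase:
        if ch in SPLIT_TABLE:
            options.append((ch, SPLIT_TABLE[ch]))  # whole form or split form
        else:
            options.append((ch,))                  # no alternative form
    return ["".join(combo) for combo in product(*options)]


variants = enumerate_variants("好胡")
```

A phrase with k splittable characters yields 2^k variants, which is why the expanded keyword list grows quickly and a compact retrieval structure becomes relevant.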

Step two, establishing a retrieval tree of character-splitting sensitive words for the character-splitting sensitive word information base. Building the information base as in step one considerably enlarges the vocabulary, so the retrieval tree is preferably a double-array Trie (DoubleArrayTrie). The double-array Trie is a dictionary storage form of the Trie with low space complexity, and it is used in word segmentation for languages with a large character range (such as Chinese and Japanese). Even if the dictionary grows greatly, segmentation efficiency is hardly affected, because the time complexity of a dictionary query is O(1): the "O" denotes the time cost and the "1" indicates that the result is obtained in constant time, independent of the size of the dictionary, which means that queries are fast;
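The character-stream lookup can be sketched as below. For brevity this sketch uses a plain nested-dictionary Trie in place of the double-array Trie preferred above; only the lookup behavior is illustrated, not the compact storage layout. The functions `build_trie` and `scan` and the sample keywords are hypothetical.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def scan(text, trie):
    """Walk `text` as a character stream and return (start, word) for
    every dictionary word found, including overlapping matches."""
    hits = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:      # no keyword continues with this character
                break
            if None in node:      # a complete keyword ends here
                hits.append((start, node[None]))
    return hits


trie = build_trie(["女子", "古月", "女子古月"])
hits = scan("x女子古月y", trie)
```

Each lookup step costs one dictionary access per character, so the time per match depends on the keyword length rather than on the number of keywords stored.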

Step three, model training: preparing training corpora, and training an interception word2vec model on the interception corpus and a release word2vec model on the release corpus. The invention uses word2vec to find the words most related to a sensitive word. Taking "network loan" as an example, the models trained on released messages and on intercepted messages respectively yield the following results:

In released messages, the most related words, ranked by relevance, are: away, educate, correct, guide, process, remind, report, etc.

In intercepted messages, the most related words, ranked by relevance, are: network, bare, photo, video, good, free, etc.

For a split character (taking "network loan" as an example; in the training and test corpora, preprocessing merges the split parts and replaces them with the restored whole character), if words such as "objection, rejection" appear in the context window, the context indicates released information, and the weight toward passing the normal audit is strengthened.

If the context window contains words such as "photo, video, naked, good", the interception probability is strengthened.

If both release-related words and interception-related words are present, the position information of the related words is taken into account. The closer a related word is to the sensitive word "network loan", the larger its weight; the farther away, the smaller its weight. The weights are accumulated to obtain a value result; if result is greater than 0 the interception weight is increased, and if it is less than 0 the passing weight is increased. Specifically, n words are taken before and after the sensitive word "network loan", and the weight of each word is calculated from its distance to the sensitive word as "n-d", where d is the window distance between the sensitive word and the word: a word adjacent to the sensitive word has weight n, and the outermost word has weight 1. The weight values of the two classes of words are then calculated. Considering that wrongly releasing a message has more serious consequences than wrongly intercepting one, the interception total is multiplied by 1.2 in the comparison, so that result = 1.2*Y_{Interception word} - Y_{Released word}. If result > 0 the interception weight is increased, and if result < 0 the passing weight is increased. The interception weight refers to the interception probability that the auditing system computes for the message: when a split character occurs, the original probability is adjusted. For example, if result is greater than 0 the interception probability is raised, for instance to 1.5 times its original value; otherwise the release probability is raised.
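The position-weighted comparison described above can be sketched as follows. The function `context_result`, its parameters and the sample word sets are hypothetical. The sketch assumes that d counts the words lying between a context word and the sensitive word (d = 0 for an adjacent word), which is the reading under which the adjacent weight is n and the outermost weight is 1.

```python
def context_result(tokens, core_index, intercept_words, release_words,
                   n=3, a=1.2):
    """Compute a * sum(Y_intercept) - sum(Y_release) over the n words on
    each side of tokens[core_index], with weight n - d per word."""
    y_intercept = 0
    y_release = 0
    for offset in range(1, n + 1):
        weight = n - (offset - 1)          # n for adjacent words, 1 at the edge
        for idx in (core_index - offset, core_index + offset):
            if 0 <= idx < len(tokens):     # ignore positions outside the text
                if tokens[idx] in intercept_words:
                    y_intercept += weight
                elif tokens[idx] in release_words:
                    y_release += weight
    return a * y_intercept - y_release


tokens = ["report", "network loan", "free", "video"]
result = context_result(tokens, 1,
                        intercept_words={"free", "video", "photo"},
                        release_words={"report", "remind"})
```

In this example "report" and "free" are adjacent to the sensitive word (weight 3 each) and "video" is one word further out (weight 2), so the interception total is 5, the release total is 3, and result is 1.2 × 5 − 3 = 3.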

Step three, acquiring information sent by a user;

step four, preprocessing each piece of acquired information sent by the user;

step five, performing character-by-character retrieval analysis of sensitive words on the preprocessed information through the retrieval tree, in a character-stream manner; entering step six if the information does not contain character-splitting sensitive word content, and entering step seven if it does;

wherein x is a sensitive word, y is a word co-occurring with x, P(x) is the probability of occurrence of the character-splitting sensitive word, P(y) is the probability of occurrence of the co-occurring word, and P(x, y) is the probability that the two occur together. The larger the mutual information of the pair (x, y), the more strongly the two words influence each other, that is, the more likely it is that they belong together. Therefore, the invention sets a threshold: when the mutual information is below the threshold, the text is re-segmented, until either some segmentation yields a mutual information value that reaches the threshold, in which case a character-splitting sensitive word is judged to be present, or all segmentations have been tried without reaching the threshold, in which case no character-splitting sensitive word is judged to be present;

The obtained mutual information MI is compared with a preset threshold (in the present invention, the threshold is preferably 4.7 × 10^-6). When MI is smaller than the preset threshold, the preprocessed data is segmented again, until MI is greater than or equal to the threshold, whereupon the data is sent to step seven; if MI never reaches the threshold, step eight is entered;

In actual use, the absence of directly visible split characters in a message does not mean that the message contains no split characters. For example, for the text "the door that the interest-inducing girl is convenient to open", if it is wrongly segmented as "the door that the interest-inducing girl is convenient to open/work/convenient/door", it is judged to be ordinary text without split characters; the sentence then goes undetected and detection accuracy drops. The judgment of step six is therefore required even when no split characters are directly found in the message.

Continuing the above example: only when the mutual information between the sensitive word and the words in the preceding and following windows is greater than or equal to the given threshold is the text segmented according to the sensitive-word segmentation. Otherwise the candidate is not counted as a hit in the sensitive-word lexicon and the content is segmented again, until all segmentations have been tried; if no segmentation yields mutual information greater than or equal to the given threshold, the message is determined not to contain split characters.

Based on the mutual information calculation, after the above content is re-segmented as "interest/girl doing/yes/it/open/work/convenience/gate", the mutual information MI is greater than or equal to the threshold, so the message is judged to contain character-splitting sensitive word content, and step seven is entered.
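The re-segmentation loop of step six can be sketched as follows. The mutual-information formula itself is not reproduced in this text, so the sketch assumes the standard pointwise mutual information with a base-2 logarithm. The functions `pmi` and `first_segmentation_over_threshold`, the counts and the candidate segmentations are all hypothetical.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information log2(P(x,y) / (P(x) * P(y))),
    estimated from raw occurrence counts over `total` windows."""
    if min(count_xy, count_x, count_y) <= 0:
        return float("-inf")     # never observed together (or at all)
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def first_segmentation_over_threshold(segmentations, score, threshold):
    """Try candidate segmentations in order and return the first whose
    score reaches the threshold, or None if none does."""
    for seg in segmentations:
        if score(seg) >= threshold:
            return seg
    return None


# Hypothetical scores for two candidate segmentations of one message.
scores = {("a", "bc"): -1.0, ("ab", "c"): 2.0}
chosen = first_segmentation_over_threshold(list(scores), scores.get, 4.7e-6)
```

A return value of None corresponds to the case in which all segmentations have been tried without reaching the threshold, so the message goes to step eight.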

result = 1.2*Y_{Interception word} - Y_{Released word},

wherein Y_{Interception word} denotes the weight values of the interception words and Y_{Released word} denotes the weight values of the released words; if result is greater than or equal to 0, the message fails the audit and is intercepted; if result is less than 0, the information, together with the influence factor of the sensitive word, is sent to step eight;

Step eight, auditing the information: intercepting it if it contains sensitive words, and displaying it if it does not. For information auditing, the invention preferably uses NLP natural language analysis and the fastText algorithm as the auditing module. The auditing module uses fastText classification to audit messages from forums and similar sources, combined with the recognition result for character-splitting sensitive words: if such a sensitive word is present, the interception probability of the message as a whole is scaled up or down according to the character-splitting processing result, and this addition improves the accuracy of the original classifier. Finally, the auditing module sends the message to the information display module if it passes, and otherwise to the information interception module.
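How the character-splitting result might scale the classifier output can be sketched as follows. The function `adjust_intercept_probability` is hypothetical, and `base_prob` stands for an interception probability assumed to come from the classifier, which is not invoked here. The factor 1.5 follows the example given in the description; applying the same factor to the release probability is an assumption.

```python
def adjust_intercept_probability(base_prob, result, factor=1.5):
    """Scale the interception probability up when result > 0 and the
    release probability up when result < 0; leave it unchanged at 0."""
    if result > 0:
        return min(1.0, base_prob * factor)
    if result < 0:
        release_prob = min(1.0, (1.0 - base_prob) * factor)
        return 1.0 - release_prob
    return base_prob


raised = adjust_intercept_probability(0.4, 3.0)
lowered = adjust_intercept_probability(0.4, -2.0)
```

Clamping at 1.0 keeps the adjusted value a valid probability when the classifier is already confident.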

In step four, the preprocessing of the data includes, but is not limited to: converting traditional characters to simplified characters, converting full-width characters to half-width characters, removing meaningless characters, and replacing consecutive spaces with a single space.
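The preprocessing operations listed above can be sketched as follows. This is a minimal sketch: the two-entry table `T2S` merely stands in for a full traditional-to-simplified conversion dictionary, and the set of characters treated as meaningless in `NOISE` is an assumption made for illustration.

```python
import re

# Tiny illustrative traditional -> simplified table (assumption); a real
# system would use a complete conversion dictionary.
T2S = {"貸": "贷", "網": "网"}

# Characters treated as meaningless here are chosen only for illustration.
NOISE = re.compile(r"[*#~^|]+")


def to_half_width(text):
    """Map full-width ASCII variants and the ideographic space to ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII block
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(text):
    text = "".join(T2S.get(ch, ch) for ch in text)  # traditional -> simplified
    text = to_half_width(text)                       # full-width -> half-width
    text = NOISE.sub("", text)                       # remove noise characters
    return re.sub(r" {2,}", " ", text)               # collapse repeated spaces


cleaned = preprocess("網貸" + chr(0x3000) * 2 + "ＡＢＣ１２３**ok")
```

Converting to half-width before removing noise means that full-width variants of the noise characters are removed as well.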

The method and the device incorporate the structural composition analysis of sensitive words into sensitive-word analysis and detection, so that sensitive words formed by splitting characters can be detected and are not missed. In addition, when a message contains only one split character, the contextual association between the sensitive word and its surrounding words is analyzed to judge its sensitivity. This avoids detection errors caused by over-recognition of split characters, allows a message containing a single split character to be accurately judged as normal or not, and improves judgment accuracy. The invention also re-segments messages that do not directly contain split characters, in order to analyze whether they contain hidden split characters, and judges whether these form sensitive words according to mutual information (MI). This further improves the detection accuracy for character-splitting sensitive words and prevents both missed detections caused by under-recognition and false detections caused by over-recognition.

Example 2

the sensitive word character analysis module is used for analyzing whether the information from the content preprocessing module contains split characters, and for sending the information to the sensitive word mutual information MI calculation module or the character-splitting processing module: if the information contains split characters, it is sent to the character-splitting processing module; if it does not, it is sent to the sensitive word mutual information MI calculation module to judge whether it contains hidden split characters;

the sensitive word mutual information MI calculation module is used for calculating the mutual information of the sensitive words in the information, and for sending the information to the character-splitting processing module or the information auditing module: when the mutual information reaches the threshold, the information contains split characters and is sent to the character-splitting processing module for processing; if the mutual information never reaches the threshold, the information does not contain split characters and is sent directly to the information auditing module for conventional sensitive-word auditing;

the character-splitting processing module is used for sending the information to the information interception module or the information auditing module according to the following rules: when one message contains three or more split-character words, the message fails the audit and is intercepted;

result = 1.2*Y_{Interception word} - Y_{Released word},

wherein Y_{Interception word} denotes the weight values of the interception words and Y_{Released word} denotes the weight values of the released words; if result is greater than or equal to 0, the message fails the audit and is sent to the information interception module for interception; if result is less than 0, the message is sent to the information auditing module;

The information interception module is also used for recording an audit log of intercepted information.

Example 3

step three, acquiring information sent by a user;

step four, preprocessing each piece of acquired information sent by the user;

result = A*Y_{Interception word} - Y_{Released word},

Claims

1. A method for detecting character-splitting sensitive words in short messages, characterized in that the method comprises the following steps:

step one, establishing a word splitting sensitive word information base;

step three, acquiring information sent by a user;

step four, preprocessing each piece of acquired information sent by the user;

step five, performing character retrieval analysis of sensitive words on the preprocessed information through the retrieval tree, entering step six if the information does not contain character-splitting sensitive word content, and entering step seven if it does;

result = A*Y_{Interception word} - Y_{Released word},

2. The method for detecting character-splitting sensitive words in short messages according to claim 1, wherein the retrieval tree is a double-array Trie.

3. The method for detecting character-splitting sensitive words in short messages according to claim 1, wherein A is 1.2.

4. The method for detecting character-splitting sensitive words in short messages according to claim 1, wherein in step eight, information auditing is performed based on NLP natural language analysis and the fastText algorithm.

5. The method for detecting character-splitting sensitive words in short messages according to any one of claims 1 to 4, wherein in step four, the preprocessing of the data includes, but is not limited to: converting traditional characters to simplified characters, converting full-width characters to half-width characters, removing meaningless characters, and replacing consecutive spaces with a single space.

6. A device for detecting character-splitting sensitive words in short messages, characterized in that the detection device comprises: an information receiving module, a content preprocessing module, a sensitive word character analysis module, a sensitive word mutual information MI calculation module, a character-splitting processing module, an information auditing module, an information display module and an information interception module, wherein,

7. The detection device according to claim 6, wherein the information interception module is further used for recording an audit log of intercepted information.

8. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements steps one to eight of claim 1.