CN113761112A - Sensitive word filtering method and device - Google Patents
- Publication number
- CN113761112A (application CN202011070783.2A)
- Authority
- CN
- China
- Prior art keywords
- sensitive
- word
- target text
- preset
- decision tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/335 — Information retrieval; querying of unstructured textual data; filtering based on additional data, e.g. user or group profiles
- G06F16/3334 — Information retrieval; query processing; query translation; selection or weighting of terms from queries, including natural language queries
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24323 — Pattern recognition; classification techniques; tree-organised classifiers
- G06F18/295 — Pattern recognition; graphical models, e.g. Bayesian networks; Markov models or related models, e.g. semi-Markov models, Markov random fields, networks embedding Markov models
- G06F40/289 — Handling natural language data; phrasal analysis, e.g. finite state techniques or chunking
Abstract
The application provides a sensitive word filtering method and device. The method comprises the following steps: acquiring a target text; performing word segmentation processing on the target text; performing sensitive word matching on the segmented target text and storing the matched sensitive words in a first sensitive word set; determining, based on a preset Markov logic network model, whether the emotion type to which the context corresponding to each sensitive word in the target text belongs is a specified emotion type; deleting the sensitive words whose corresponding context belongs to the specified emotion type from the first sensitive word set to obtain a second sensitive word set; and filtering the target text using the sensitive words in the second sensitive word set. The method improves the accuracy of sensitive word filtering and avoids false positives while reducing labor cost.
Description
Technical Field
The invention relates to the technical field of information processing, and in particular to a sensitive word filtering method and device.
Background
At present, sensitive word filtering is mainly performed through a sensitive word lexicon and sensitive word dictionary trees, the latter mainly implemented with support vector machine classification, naive Bayes classification, and the like.
In the process of implementing the present application, the inventor found that the filtering accuracy of these sensitive word filtering methods falls short of expectations.
Disclosure of Invention
In view of this, the application provides a sensitive word filtering method and device, which can improve the accuracy of sensitive word filtering and avoid false positives while reducing labor cost.
In order to solve the technical problem, the technical scheme of the application is realized as follows:
in one embodiment, a sensitive word filtering method is provided, the method comprising:
acquiring a target text;
performing word segmentation processing on the target text;
sensitive word matching is carried out on the target text after word segmentation processing, and the matched sensitive words are stored in a first sensitive word set;
determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a specified emotion type or not based on a preset Markov logic network model;
deleting the sensitive words corresponding to the context belonging to the specified emotion type from the first sensitive word set to obtain a second sensitive word set;
filtering the target text using the sensitive words in the second set of sensitive words.
In another embodiment, there is provided a sensitive word filtering apparatus, the apparatus including: an acquisition unit, a word segmentation unit, a matching unit, an analysis unit, a first filtering unit and a second filtering unit;
the acquisition unit is used for acquiring a target text;
the word segmentation unit is used for carrying out word segmentation processing on the target text acquired by the acquisition unit;
the matching unit is used for matching sensitive words of the target text subjected to word segmentation processing by the word segmentation unit and storing the matched sensitive words into a first sensitive word set;
the analysis unit is used for determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a specified emotion type or not based on a preset Markov logic network model;
the first filtering unit is used for deleting the sensitive words corresponding to the contexts which belong to the specified emotion types and are determined by the analyzing unit from the first sensitive word set matched by the matching unit to obtain a second sensitive word set;
and the second filtering unit is used for filtering the target text acquired by the acquiring unit by using the sensitive words in the second sensitive word set acquired by the first filtering unit.
In another embodiment, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the sensitive word filtering method when executing the program.
In another embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the sensitive word filtering method.
According to the technical scheme, after sensitive word matching is performed on the target text, whether the emotion type to which the context corresponding to each sensitive word belongs is the designated emotion type is determined based on the preset Markov logic network model, which further determines whether the matched sensitive word needs to be filtered from the target text. This scheme combines sensitive word matching with emotion analysis to filter the sensitive words of the target text, improving the accuracy of sensitive word filtering while reducing labor cost and avoiding false positives.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating a sensitive word filtering process according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a sensitive word filtering process according to a second embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an obtaining process of a predetermined decision tree model in the embodiment of the present application;
fig. 4 is a schematic diagram illustrating an obtaining process of a preset markov logic network model in the embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for implementing the above technique in an embodiment of the present application;
fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
The embodiment of the application provides a sensitive word filtering method which is applied to a sensitive word filtering device, wherein the device can be a PC (personal computer), a server and the like.
In existing implementations, schemes that match sensitive words directly through a dictionary or a classification decision tree model can cause false positives, because the matched sensitive words are filtered directly without emotion analysis of the context. For example, "suicide" is a very negative word, but its meaning differs completely depending on the context in which it appears: "Never commit suicide!" and "I want to commit suicide" are entirely different in emotional expression. When the keyword "suicide" is blocked, benign anti-suicide messages may be erroneously filtered if the semantics cannot be distinguished. Likewise, in news reports the same word may appear in both negative and positive stories. To avoid such false positives, manual review has to be added, which greatly increases implementation cost and is itself error-prone.
To address these problems, the present application performs emotion analysis on the matched sensitive words through a Markov logic network model, thereby improving filtering accuracy, reducing the false positive rate, and saving cost.
The sensitive word filtering process is described in detail below with reference to the accompanying drawings.
Example one
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a filtering process of sensitive words in an embodiment of the present application. The method comprises the following specific steps:
Step 101: acquire the target text.

The target text is the text that requires sensitive word filtering.
The target text can be obtained by receiving a text request sent by a client;
or may be obtained by copying, transmission, or the like.
Before performing word segmentation on the target text, the target text may be preprocessed.
In this step, preprocessing the target text specifically includes:

filtering out special symbols in the target text, such as ", %, #, @, and the like;

converting traditional Chinese characters in the target text into simplified characters;

and filtering out stop words in the target text, such as modal particles, auxiliary words, and the like.
The above is only an example implementation, and the specific implementation is not limited to the above operation.
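For illustration, a minimal preprocessing sketch along these lines might look as follows. The use of the opencc library for traditional-to-simplified conversion is an assumption of this example, and stop words are removed per token after segmentation (see the next sketch), since Chinese stop words are whole tokens rather than characters:

```python
import re

def preprocess(text: str) -> str:
    # Filter out special symbols such as " , % # @.
    text = re.sub(r'[",%#@]', "", text)
    # Convert traditional characters to simplified ones; opencc is
    # assumed here as one possible library for this step.
    try:
        from opencc import OpenCC
        text = OpenCC("t2s").convert(text)
    except ImportError:
        pass  # leave the text unchanged if opencc is unavailable
    return text
```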
Step 102: perform word segmentation processing on the target text.
In the embodiment of the present application, a specific implementation manner of performing word segmentation is not limited, for example, a jieba word segmentation system may be used to perform word segmentation.
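For example, segmentation with jieba plus token-level stop-word removal could be sketched as follows (jieba is only one possible choice, as noted above; the stop-word set is a stand-in for a full lexicon):

```python
import jieba

# Hypothetical example stop words (modal particles, auxiliary words).
STOP_WORDS = frozenset({"的", "了", "吗", "啊", "呢"})

def segment(text: str, stop_words=STOP_WORDS) -> list:
    # jieba.lcut returns the segmentation result as a list of tokens.
    return [w for w in jieba.lcut(text) if w.strip() and w not in stop_words]
```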
After word segmentation is performed, sensitive word matching is carried out on the segmented text. The segmented text can also be backed up into a training set and a test set, to serve as samples for training and testing in subsequent iterations.
Step 103: perform sensitive word matching on the segmented target text, and store the matched sensitive words in a first sensitive word set.
Step 104: determine, based on a preset Markov logic network model, whether the emotion type to which the context corresponding to each sensitive word in the target text belongs is a specified emotion type.
In this step, the preset Markov logic network model is used to determine the emotion type of the context corresponding to each word in the first sensitive word set. There may be multiple emotion types or only two, as determined by the classification of the preset emotion analysis model; the emotion types whose corresponding sensitive words need to be removed from the first sensitive word set are defined as the specified emotion types.

When it is determined, based on the preset Markov logic network model, that the context of a sensitive word in the first set belongs to a specified emotion type, that sensitive word is deleted from the first sensitive word set.
Step 105: delete the sensitive words whose corresponding context belongs to the specified emotion type from the first sensitive word set to obtain a second sensitive word set.
Step 106: filter the target text using the sensitive words in the second sensitive word set.
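Putting steps 103 to 106 together, the overall flow can be sketched as follows, where `match_sensitive_words` and `context_emotion` are hypothetical stand-ins for the dictionary-tree matcher and the preset Markov logic network model:

```python
def filter_sensitive(text, tokens, match_sensitive_words, context_emotion,
                     specified_emotions, mask="***"):
    # Step 103: match sensitive words in the segmented text.
    first_set = set(match_sensitive_words(tokens))
    # Steps 104-105: drop words whose surrounding context carries a
    # specified (i.e. permitted) emotion type.
    second_set = {w for w in first_set
                  if context_emotion(text, w) not in specified_emotions}
    # Step 106: filter the target text with the remaining words.
    for w in second_set:
        text = text.replace(w, mask)
    return text
```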
In the embodiment of the application, after sensitive word matching is performed on the target text, whether the emotion type to which the context corresponding to each sensitive word belongs is a specified emotion type is determined based on a preset Markov logic network model, which further determines whether the matched sensitive word needs to be filtered from the target text. This implementation combines sensitive word matching with emotion analysis to filter the sensitive words of the target text, improving the accuracy of sensitive word filtering while reducing labor cost and avoiding false positives.
Example two
Referring to fig. 2, fig. 2 is a schematic diagram of a sensitive word filtering process in the second embodiment of the present application. The method comprises the following specific steps:
Step 201: acquire the target text.

The target text is the text that requires sensitive word filtering.
The target text can be obtained by receiving a text request sent by a client;
or may be obtained by copying, transmission, or the like.
Step 202: preprocess the target text.

In the embodiment of the application, the target text can be preprocessed before word segmentation is performed. The preprocessing specifically includes:
filtering out special symbols in the target text;
converting traditional characters in the target text into simplified characters;
and filtering out stop words in the target text.
The above is only an example implementation, and the specific implementation is not limited to the above operation.
Step 203: perform word segmentation processing on the target text.

In the embodiment of the present application, the specific word segmentation implementation is not limited; for example, the jieba word segmentation system may be used.
After word segmentation is performed, sensitive word matching is carried out on the segmented text. The segmented text can also be backed up into a training set and a test set, to serve as samples for training and testing in subsequent iterations.
Step 204: perform sensitive word matching on the segmented target text through a preset decision tree model, and store the matched sensitive words in a first sensitive word set.
The preset decision tree model is a Trie tree based on a DFA (deterministic finite automaton).
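As an illustration of such a structure (not the patent's own code), a minimal DFA-style Trie matcher can be built from nested dictionaries, with a sentinel key marking accepting states:

```python
END = "\0"  # sentinel marking the end of a sensitive word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True  # this state is accepting
    return root

def match(text, root):
    """Scan the text once, returning every sensitive word found."""
    hits = []
    for i in range(len(text)):
        node, j = root, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                hits.append(text[i:j])
    return hits
```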
Referring to fig. 3, fig. 3 is a schematic diagram of an obtaining process of a preset decision tree model in the embodiment of the present application. The method comprises the following specific steps:
Step 301: obtain training samples.

The training samples are sensitive words or word sets obtained by segmenting text.
Step 302: vectorize the training samples.

Vectorization is a technique widely applied in the field of machine learning; by vectorizing parameters, inner loops in code are eliminated, improving computational efficiency.
Step 303: determine the weight of each training sample through the TF-IDF algorithm to obtain a weight matrix of the training samples.

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining: TF is the term frequency, and IDF is the inverse document frequency.
The TF-IDF algorithm considers that if a word or phrase appears with a high frequency (TF) in one article and rarely appears in other articles, the word or phrase has good category discrimination capability and is suitable for classification. TF-IDF is simply TF × IDF, where TF denotes the frequency with which a term appears in a document d.
The main idea of IDF is: the fewer the documents containing term t (i.e., the smaller n), the larger the IDF, indicating that t has good category-distinguishing capability. However, if term t appears in m documents of a class C and in k documents of other classes, then the number of documents containing t is n = m + k; when m is large, n is also large, and the IDF value computed by the IDF formula is small, suggesting — incorrectly — that t is not strongly discriminative.
In practice, if a term appears frequently in the documents of one class, it represents the characteristics of that class well and should be given a higher weight and selected as a feature word of that class to distinguish its documents from other classes; this is the deficiency of IDF. Within a given document, the term frequency (TF) is the frequency with which a given word appears in that document, normalized by the document's term count to prevent bias toward long documents: the same word may occur more often in a long document than in a short one regardless of its importance. For word $t_i$ in document $d_j$, its importance can be expressed as:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where the numerator $n_{i,j}$ is the number of occurrences of the word in document $d_j$, and the denominator is the total number of occurrences of all words in $d_j$.
The inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term, and taking the base-10 logarithm of the quotient:

$$\mathrm{idf}_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the term $t_i$. If the term is not in the corpus this count would be zero, so $1 + |\{j : t_i \in d_j\}|$ is typically used as the denominator. The fewer the documents containing the word, the larger the IDF.
The TF-IDF weight is then the product of TF and IDF: a high term frequency within a particular document, combined with a low document frequency for that word across the whole document collection, yields a high TF-IDF weight.
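A direct sketch of this weighting, following the two formulas above (base-10 logarithm, with the +1 smoothing on the denominator), could be:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            t: (c / total) * math.log10(n_docs / (1 + df[t]))
            for t, c in counts.items()
        })
    return weights
```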
Step 304: classify the training samples using the C4.5 decision tree algorithm.
C4.5 is one of the decision tree algorithms. As a classification algorithm, a decision tree aims to classify n samples with p-dimensional features into c classes. This corresponds to a mapping c = f(n): each sample is assigned a class label via this transformation. To achieve this, a decision tree represents the classification process as a tree that branches at each node on a selected feature p_i.
In the embodiment of the application, classification can be realized through a C4.5 decision tree or through other algorithms, which is not limited here, such as CART and the like.
The establishment and training of the preset decision tree model in the embodiment of the present application may be performed on a sensitive word filtering device, or may be performed on other devices, which is not limited in the embodiment of the present application.
Step 305: construct the preset decision tree model according to the weight matrix of the training samples and the classification result.
After the C4.5 algorithm and the TF-IDF algorithm are applied, the initial training set becomes a training set with m feature dimensions, each feature having its own information gain ratio. The feature with the largest information gain ratio is selected as the root of the decision tree; the samples satisfying that feature then become child nodes of the root, and each child node continues to be divided until a classification can be uniquely determined.
For sensitive words, a classification dimension is first determined, namely whether a word is a sensitive word — the simplest case, where this dimension has 2 values (yes/no) — and features are determined, such as political sensitivity, military sensitivity, financial sensitivity, and the like, where each sensitivity takes one of three values: low, medium, or high. Following the steps above, a tree structure is finally generated, in which each leaf is a classification result.
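For illustration, the information gain ratio that C4.5 maximizes when choosing the root (and each subsequent split) can be computed as in the following sketch over discrete feature values — a simplified illustration, not the patented implementation:

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(samples, labels, feature):
    """samples: list of dicts mapping feature name -> discrete value."""
    base = entropy(labels)
    n = len(samples)
    # Partition the labels by the value this feature takes.
    parts = {}
    for s, y in zip(samples, labels):
        parts.setdefault(s[feature], []).append(y)
    cond = sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = entropy([s[feature] for s in samples])
    return (base - cond) / split_info if split_info else 0.0
```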
The method further comprises:
testing the preset decision tree model by using a word test set;
when the accuracy of the test result is smaller than a preset threshold value, updating the preset decision tree model through pruning;
testing the updated preset decision tree model again;
and when the accuracy of the test result is not less than a preset threshold value, taking the current preset decision tree model as the preset decision tree model to be subjected to sensitive word matching.
The objective of pruning is to keep the decision tree from overfitting: during repeated training, the branches of the decision tree inevitably multiply. If the accuracy is found to decrease during testing, the current decision tree model must be analyzed for branches that should be cut. For example, in the initially constructed decision tree, the root node is political sensitivity and its child is military sensitivity (with values low, medium, high): if military sensitivity is low, the decision result is not a sensitive word; if medium or high, it is a sensitive word. Suppose subsequent testing finds the current accuracy to be 60%, while analysis shows that directly treating words with high political sensitivity as sensitive words would reach an accuracy of 66.6%, which is greater than 60%. The branch is therefore pruned.
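The pruning decision just described — collapse a branch when the test-set accuracy does not suffer — can be sketched on a toy tree representation as follows. This is a minimal reduced-error-pruning sketch over an assumed tuple-based tree structure, not the patent's own implementation; for simplicity every candidate collapse is evaluated against the full test set:

```python
def predict(tree, sample):
    # A tree node is either a class label (leaf) or a tuple
    # ("feature", {feature_value: subtree, ...}, default_label).
    while isinstance(tree, tuple):
        feature, branches, default = tree
        tree = branches.get(sample.get(feature), default)
    return tree

def accuracy(tree, samples, labels):
    return sum(predict(tree, s) == y for s, y in zip(samples, labels)) / len(labels)

def prune(tree, samples, labels):
    """Bottom-up reduced-error pruning over the tuple-based tree."""
    if not isinstance(tree, tuple):
        return tree  # leaves cannot be pruned further
    feature, branches, default = tree
    pruned = (feature,
              {v: prune(t, samples, labels) for v, t in branches.items()},
              default)
    # Collapse this subtree into its default label if doing so
    # does not reduce accuracy on the test set.
    if accuracy(default, samples, labels) >= accuracy(pruned, samples, labels):
        return default
    return pruned
```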
Step 205: determine, based on a preset Markov logic network model, whether the emotion type to which the context corresponding to each sensitive word in the target text belongs is a specified emotion type, and delete the sensitive words whose context belongs to the specified emotion type from the first sensitive word set to obtain a second sensitive word set.
A Markov logic network is a first-order logic knowledge base in which each formula or statement carries a weight; constants represent objects in the domain, and every possible grounding of a first-order formula in the knowledge base has a corresponding weight. Markov logic network inference is performed by applying the Markov chain Monte Carlo (MCMC) method to the minimal subset of groundings required to answer a query. The weights are learned efficiently from relational databases by iteratively optimizing a pseudo-likelihood measure; optionally, additional clauses can be learned using inductive logic programming techniques. Experiments with a real-world database and knowledge base in a university domain have shown this approach to be promising.
A first-order logic knowledge base can be viewed as a set of hard constraints on the set of possible worlds: if a world violates even one formula, it has zero probability. The basic idea of the Markov logic network is to soften these constraints: a possible world that conflicts with formulas in the knowledge base is not impossible, merely less probable — the fewer formulas it violates, the more probable it is. Each formula is associated with a weight reflecting the strength of its constraint: the higher the weight, the greater the difference in log probability between a world that satisfies the formula and one that does not, other things being equal.
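In the standard Markov logic network formulation, this idea yields the following distribution over possible worlds $x$:

$$P(X = x) = \frac{1}{Z}\exp\Big(\sum_i w_i\, n_i(x)\Big)$$

where $n_i(x)$ is the number of true groundings of formula $i$ in $x$, $w_i$ is the weight of formula $i$, and $Z$ is the partition function that normalizes the distribution.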
Referring to fig. 4, fig. 4 is a schematic diagram of an obtaining process of a preset markov logic network model in the embodiment of the present application. The method comprises the following specific steps:
Step 401: define predicates and rules conforming to emotional expression based on the Markov logic network.

Step 402: extract feature words from the training samples.
Step 403: construct the preset Markov logic network model according to the defined predicates and rules conforming to emotional expression and the feature words.
The training samples can be collected with crawler technology from blogs or comments on Internet social platforms such as Weibo or Baidu Tieba, with the quality of a comment judged by its numbers of likes and dislikes; manually labeled corpora can also be used, such as the Chinese Opinion Analysis Evaluation (COAE) datasets or many other corpora on the Internet.
In the embodiment of the present application, the obtaining method of the training sample is not limited.
For example, suppose emotions are classified as joy, sadness, disgust, fright, and so on, and there is a corpus sentence "The XX phone is real garbage; it wasted a lot of my money and I will never buy it again", labeled with "disgust" and "anger" in the corpus. The first-order logic rules that can be constructed from it are: "XX phone is real garbage" expresses disgust; "wasted a lot of money" expresses disgust; "will never buy it again" expresses disgust. Similar sentences in the corpus carry certain weights for the disgust emotion. The logic rules, their ground formulas, and the weights together form a logic network. When a new corpus sentence hits the rules, an emotion classification can be confirmed according to the corresponding algorithm.
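As a toy illustration of this idea (the rules, weights, and tokens below are invented for the example rather than learned from a corpus), an emotion class can be chosen by summing the weights of the satisfied rules, mimicking the $\sum_i w_i\, n_i(x)$ term above:

```python
import math

# Hypothetical weighted rules: (predicate over tokens, emotion, weight).
RULES = [
    (lambda toks: "垃圾" in toks, "disgust", 1.5),  # "garbage" -> disgust
    (lambda toks: "浪费" in toks, "anger",   1.2),  # "waste"   -> anger
    (lambda toks: "谢谢" in toks, "joy",     1.0),  # "thanks"  -> joy
]

def emotion_scores(tokens, emotions=("disgust", "anger", "joy", "sadness")):
    # Unnormalized log-scores: sum of weights of satisfied rules per emotion.
    score = {e: 0.0 for e in emotions}
    for rule, emotion, w in RULES:
        if rule(tokens):
            score[emotion] += w
    z = sum(math.exp(s) for s in score.values())  # softmax normalization
    return {e: math.exp(s) / z for e, s in score.items()}

def classify_emotion(tokens):
    probs = emotion_scores(tokens)
    return max(probs, key=probs.get)
```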
By establishing the preset Markov logic network model for emotion analysis, the emotion of a sensitive word in its context can be determined, and whether a preliminarily matched sensitive word is a real sensitive word can be further decided, greatly reducing the false positive rate of sensitive word filtering.
The Markov logic network model may divide text into joy, sadness, anger, and so on; however many emotion categories there are, they fall into two classes according to the application scenario of the target text: the specified emotion type and the non-specified emotion type. If the specified emotion type is a positive emotion type, the non-specified emotion type is a negative emotion type. In the embodiment of the application, sensitive words whose context belongs to the specified emotion type are treated as words that need not be filtered from the target text, which avoids false positives; sensitive word filtering can thus be performed automatically, reducing cost and improving accuracy.
The method further comprises:
performing regression testing on the preset Markov logic network model by using a text test set;
when the test result does not meet the preset condition, reestablishing the preset Markov logic network model;

performing the regression test again on the reestablished preset Markov logic network model;
and when the test result meets the preset condition, taking the currently established preset Markov logic network model as the preset Markov logic network model for determining the emotion type.
The establishment and training of the preset markov logic network model in the embodiment of the application can be executed on a sensitive word filtering device and also can be executed on other equipment, and the embodiment of the application is not limited thereto.
In the embodiment of the application, the designated emotion type is set according to the application environment of the target text: for example, the designated emotion type for a memorial text is sadness, and the designated emotion type for wedding congratulations is joy. This is merely an example; the arrangement is not limited thereto.
Step 206: filter the target text using the sensitive words in the second sensitive word set.
In the embodiment of the application, after sensitive word matching is performed on the target text based on the DFA-based Trie tree, the emotion type to which the context corresponding to each sensitive word belongs is determined through the Markov logic network model, which further determines whether the matched sensitive word needs to be filtered from the target text. This implementation combines sensitive word matching with emotion analysis to filter the sensitive words of the target text, improving the accuracy of sensitive word filtering while reducing labor cost and avoiding false positives.
Based on the same inventive concept, the embodiment of the application also provides a sensitive word filtering device. Referring to fig. 5, fig. 5 is a schematic structural diagram of an apparatus applied to the above technology in the embodiment of the present application. The device comprises: the system comprises an acquisition unit 501, a word segmentation unit 502, a matching unit 503, an analysis unit 504, a first filtering unit 505 and a second filtering unit 506;
an obtaining unit 501, configured to obtain a target text;
a word segmentation unit 502, configured to perform word segmentation processing on the target text acquired by the obtaining unit 501;
the matching unit 503 is configured to perform sensitive word matching on the target text subjected to word segmentation processing by the word segmentation unit 502, and store the matched sensitive words in the first sensitive word set;
an analyzing unit 504, configured to determine, based on a preset markov logic network model, whether an emotion type to which a context corresponding to the sensitive word matched by the matching unit 503 in the target text belongs is a specified emotion type;
the first filtering unit 505 is configured to delete the sensitive word determined by the analyzing unit 504 and corresponding to the context of the specified emotion type from the first sensitive word set matched by the matching unit 503 to obtain a second sensitive word set;
the second filtering unit 506 is configured to filter the target text acquired by the acquiring unit 501 by using the sensitive words in the second sensitive word set acquired by the first filtering unit 505.
Preferably, the apparatus further comprises: a preprocessing unit 507;
the preprocessing unit 507 is configured to preprocess the target text acquired by the acquisition unit 501.
Preferably,
the matching unit 503 is specifically configured to perform sensitive word matching on the target text after word segmentation processing through a preset decision tree model; the obtaining of the preset decision tree model comprises: obtaining a training sample; vectorizing the training samples; determining the weight of each training sample through a TF-IDF algorithm to obtain a weight matrix of the training samples; classifying the training samples by using a C4.5 decision tree algorithm; and constructing a preset decision tree model according to the weight matrix of the training sample and the classification result.
Preferably,
the acquisition of the preset Markov logic network model comprises the following steps: defining predicates and rules which accord with emotional expression based on a Markov logic network; extracting feature words of the training samples; and constructing a preset Markov logic network model according to the defined predicates and rules which accord with the emotional expressions and the characteristic words.
The units of the above embodiments may be integrated into one body, or may be separately deployed; may be combined into one unit or further divided into a plurality of sub-units.
In another embodiment, an electronic device is also provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the sensitive word filtering method when executing the program.
In another embodiment, a computer-readable storage medium is also provided having stored thereon computer instructions that, when executed by a processor, may implement the steps in the sensitive word filtering method.
Fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device may include: a Processor (Processor)610, a communication Interface (Communications Interface)620, a Memory (Memory)630 and a communication bus 640, wherein the Processor 610, the communication Interface 620 and the Memory 630 communicate with each other via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform the following method:
acquiring a target text;
performing word segmentation processing on the target text;
sensitive word matching is carried out on the target text after word segmentation processing, and the matched sensitive words are stored in a first sensitive word set;
determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a specified emotion type or not based on a preset Markov logic network model;
deleting the sensitive words corresponding to the context belonging to the specified emotion type from the first sensitive word set to obtain a second sensitive word set;
filtering the target text using the sensitive words in the second set of sensitive words.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (11)
1. A sensitive word filtering method, the method comprising:
acquiring a target text;
performing word segmentation processing on the target text;
sensitive word matching is carried out on the target text after word segmentation processing, and the matched sensitive words are stored in a first sensitive word set;
determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a designated emotion type or not by using a preset Markov logic network model;
deleting the sensitive words corresponding to the context belonging to the specified emotion type from the first sensitive word set to obtain a second sensitive word set;
filtering the target text using the sensitive words in the second set of sensitive words.
2. The method of claim 1, wherein after the obtaining the target text and before the performing the word segmentation on the target text, the method further comprises:
and preprocessing the target text.
3. The method of claim 1, wherein performing sensitive word matching on the segmented target text comprises:
sensitive word matching is carried out on the target text after word segmentation processing through a preset decision tree model;
the obtaining of the preset decision tree model comprises:
obtaining a training sample;
vectorizing the training samples;
determining the weight of each training sample through a term frequency (TF)-inverse document frequency (IDF) algorithm to obtain a weight matrix of the training samples;
classifying the training samples by using a C4.5 decision tree algorithm;
and constructing a preset decision tree model according to the weight matrix of the training sample and the classification result.
4. The method of claim 3, further comprising:
testing the preset decision tree model by using a word test set;
when the accuracy of the test result is smaller than a preset threshold value, updating the preset decision tree model through pruning;
testing the updated preset decision tree model again;
and when the accuracy of the test result is not less than a preset threshold value, taking the current preset decision tree model as the preset decision tree model to be subjected to sensitive word matching.
5. The method of any one of claims 1-4, wherein obtaining the preset Markov logic network model comprises:
defining predicates and rules which accord with emotional expression based on a Markov logic network;
extracting feature words of the training samples;
and constructing a preset Markov logic network model according to the defined predicates and rules which accord with the emotional expressions and the characteristic words.
6. The method of claim 5, further comprising:
performing regression testing on the preset Markov logic network model by using a text test set;
when the test result does not meet the preset condition, reestablishing the preset Markov logic network model;

performing the regression test again on the reestablished preset Markov logic network model;
and when the test result meets the preset condition, taking the currently established preset Markov logic network model as the preset Markov logic network model for determining the emotion type.
7. A sensitive word filtering device, the device comprising: the device comprises an acquisition unit, a word segmentation unit, a matching unit, an analysis unit, a first filtering unit and a second filtering unit;
the acquisition unit is used for acquiring a target text;
the word segmentation unit is used for carrying out word segmentation processing on the target text acquired by the acquisition unit;
the matching unit is used for matching sensitive words of the target text subjected to word segmentation processing by the word segmentation unit and storing the matched sensitive words into a first sensitive word set;
the analysis unit is used for determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a specified emotion type or not based on a preset Markov logic network model;
the first filtering unit is used for deleting the sensitive words corresponding to the contexts which belong to the specified emotion types and are determined by the analyzing unit from the first sensitive word set matched by the matching unit to obtain a second sensitive word set;
and the second filtering unit is used for filtering the target text acquired by the acquiring unit by using the sensitive words in the second sensitive word set acquired by the first filtering unit.
8. The apparatus of claim 7,
the matching unit is specifically used for performing sensitive word matching on the target text after word segmentation processing through a preset decision tree model; the obtaining of the preset decision tree model comprises: obtaining training samples; vectorizing the training samples; determining the weight of each training sample through a term frequency (TF)-inverse document frequency (IDF) algorithm to obtain a weight matrix of the training samples; classifying the training samples by using a C4.5 decision tree algorithm; and constructing a preset decision tree model according to the weight matrix of the training samples and the classification result.
9. The apparatus of claim 7, wherein the obtaining of the preset Markov logic network model comprises: defining predicates and rules which accord with emotional expression based on a Markov logic network; extracting feature words of the training samples; and constructing a preset Markov logic network model according to the defined predicates and rules which accord with the emotional expressions and the characteristic words.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-6 when executing the program.
11. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011070783.2A CN113761112A (en) | 2020-10-09 | 2020-10-09 | Sensitive word filtering method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113761112A (en) | 2021-12-07
Family
ID=78785784
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011070783.2A Pending CN113761112A (en) | 2020-10-09 | 2020-10-09 | Sensitive word filtering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113761112A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100082332A1 (en) * | 2008-09-26 | 2010-04-01 | Rite-Solutions, Inc. | Methods and apparatus for protecting users from objectionable text |
CN107515877A (en) * | 2016-06-16 | 2017-12-26 | 百度在线网络技术(北京)有限公司 | The generation method and device of sensitive theme word set |
CN106055541A (en) * | 2016-06-29 | 2016-10-26 | 清华大学 | News content sensitive word filtering method and system |
CN107992471A (en) * | 2017-11-10 | 2018-05-04 | 北京光年无限科技有限公司 | Information filtering method and device in a kind of interactive process |
CN110991171A (en) * | 2019-09-30 | 2020-04-10 | 奇安信科技集团股份有限公司 | Sensitive word detection method and device |
Non-Patent Citations (3)
Title |
---|
成邦文,杨宏进著: "《统计调查数据质量控制 数据审核与评估的理论、方法及实践》", 31 October 2019, 北京:科学技术文献出版社, pages: 76 * |
李伟;: "网页敏感词过滤与敏感文本分类系统设计", 电脑知识与技术, no. 08 * |
郝志峰主编: "《数据科学与数学建模》", 31 January 2019, 武汉:华中科技大学出版社, pages: 90 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114417883A (en) * | 2022-01-10 | 2022-04-29 | 马上消费金融股份有限公司 | Data processing method, device and equipment |
CN114417883B (en) * | 2022-01-10 | 2022-10-25 | 马上消费金融股份有限公司 | Data processing method, device and equipment |
CN114706940A (en) * | 2022-01-19 | 2022-07-05 | 浙报融媒体科技(浙江)股份有限公司 | Sensitive word-based news file auditing method and system |
CN118013963A (en) * | 2024-04-09 | 2024-05-10 | 四川易景智能终端有限公司 | Method and device for identifying and replacing sensitive words |
Legal Events

Date | Code | Title |
---|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |