CN113761112A - Sensitive word filtering method and device - Google Patents

Sensitive word filtering method and device

Info

Publication number
CN113761112A
CN113761112A (application number CN202011070783.2A)
Authority
CN
China
Prior art keywords
sensitive
word
target text
preset
decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011070783.2A
Other languages
Chinese (zh)
Inventor
李雨航
余欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202011070783.2A priority Critical patent/CN113761112A/en
Publication of CN113761112A publication Critical patent/CN113761112A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G06F18/29: Graphical models, e.g. Bayesian networks
    • G06F18/295: Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a sensitive word filtering method and device. The method comprises the following steps: acquiring a target text; performing word segmentation on the target text; performing sensitive word matching on the segmented target text, and storing the matched sensitive words in a first sensitive word set; determining, based on a preset Markov logic network model, whether the emotion type of the context corresponding to each sensitive word in the target text is a specified emotion type; deleting the sensitive words whose context belongs to the specified emotion type from the first sensitive word set to obtain a second sensitive word set; and filtering the target text using the sensitive words in the second sensitive word set. The method can improve the accuracy of sensitive word filtering and avoid false positives while reducing labor cost.

Description

Sensitive word filtering method and device
Technical Field
The invention relates to the technical field of information processing, and in particular to a sensitive word filtering method and device.
Background
At present, sensitive word filtering is mainly performed through a sensitive word lexicon and a number of sensitive word dictionary trees, with the dictionary trees mainly implemented using support vector machine classification, naive Bayes classification, and the like.
In the process of implementing the present application, the inventors found that the filtering accuracy of such sensitive word filtering methods falls short of expectations.
Disclosure of Invention
In view of this, the present application provides a sensitive word filtering method and device, which can improve the accuracy of sensitive word filtering and avoid false positives while reducing labor cost.
In order to solve the technical problem, the technical scheme of the application is realized as follows:
in one embodiment, a sensitive word filtering method is provided, the method comprising:
acquiring a target text;
performing word segmentation processing on the target text;
performing sensitive word matching on the segmented target text, and storing the matched sensitive words in a first sensitive word set;
determining, based on a preset Markov logic network model, whether the emotion type of the context corresponding to each sensitive word in the target text is a specified emotion type;
deleting the sensitive words whose context belongs to the specified emotion type from the first sensitive word set to obtain a second sensitive word set;
filtering the target text using the sensitive words in the second set of sensitive words.
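As a minimal sketch of the claimed steps (illustrative only: whitespace splitting stands in for real word segmentation, `is_benign_context` stands in for the preset Markov logic network model, and all names are invented for this example):

```python
def filter_sensitive_words(text, lexicon, is_benign_context):
    """Sketch of the claimed method: match, contextual emotion check, filter.

    lexicon: iterable of sensitive words.
    is_benign_context: callable(text, word) -> True when the context around
    `word` belongs to the specified (benign) emotion type; in the patent this
    role is played by a preset Markov logic network model.
    """
    tokens = text.split()                     # stand-in for real word segmentation
    first_set = {w for w in tokens if w in set(lexicon)}   # matching step
    # keep only words whose context is NOT of the specified (benign) type
    second_set = {w for w in first_set if not is_benign_context(text, w)}
    for w in second_set:                      # filter the target text
        text = text.replace(w, "*" * len(w))
    return text, second_set
```

For example, `filter_sensitive_words("I want suicide", ["suicide"], lambda t, w: False)` masks the word, while a benign-context classifier that returns `True` leaves it intact.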
In another embodiment, there is provided a sensitive word filtering apparatus, the apparatus comprising: an acquisition unit, a word segmentation unit, a matching unit, an analysis unit, a first filtering unit, and a second filtering unit;
the acquisition unit is used for acquiring a target text;
the word segmentation unit is used for performing word segmentation on the target text acquired by the acquisition unit;
the matching unit is used for performing sensitive word matching on the target text segmented by the word segmentation unit, and storing the matched sensitive words in a first sensitive word set;
the analysis unit is used for determining, based on a preset Markov logic network model, whether the emotion type of the context corresponding to each sensitive word in the target text is a specified emotion type;
the first filtering unit is used for deleting, from the first sensitive word set obtained by the matching unit, the sensitive words whose context the analysis unit determines to belong to the specified emotion type, so as to obtain a second sensitive word set;
and the second filtering unit is used for filtering the target text acquired by the acquisition unit using the sensitive words in the second sensitive word set obtained by the first filtering unit.
In another embodiment, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the sensitive word filtering method when executing the program.
In another embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the sensitive word filtering method.
According to the technical solution above, after sensitive word matching is performed on the target text, whether the emotion type of the context corresponding to each sensitive word belongs to the specified emotion type is determined based on the preset Markov logic network model, which in turn determines whether the matched sensitive word needs to be filtered out of the target text. This scheme combines sensitive word matching with emotion analysis to filter the sensitive words of the target text; it can improve the accuracy of sensitive word filtering while reducing labor cost, and avoids false positives.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive labor.
FIG. 1 is a schematic diagram illustrating a sensitive word filtering process according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a sensitive word filtering process according to a second embodiment of the present application;
FIG. 3 is a schematic diagram illustrating the obtaining process of the preset decision tree model in the embodiment of the present application;
fig. 4 is a schematic diagram illustrating an obtaining process of a preset markov logic network model in the embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for implementing the above technique in an embodiment of the present application;
fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
The embodiment of the application provides a sensitive word filtering method which is applied to a sensitive word filtering device, wherein the device can be a PC (personal computer), a server and the like.
In existing implementations, schemes that match sensitive words directly through a dictionary or a classification decision tree model can cause false positives, because the matched sensitive words are filtered directly without emotion analysis of the context. For example, "suicide" is a very negative word, but its meaning differs completely depending on the context in which it appears: "Never commit suicide!" and "I want to commit suicide" are entirely different in the emotion they express. If the keyword "suicide" is blocked without distinguishing the semantics, benign anti-suicide messages are mistakenly filtered as well. Similarly, in news reports the same word may appear in both negative and positive stories. To avoid such false positives, manual review has to be added, which greatly increases the implementation cost and is itself error-prone.
To address these problems, the matched sensitive words are subjected to emotion analysis through a Markov logic network model, which improves filtering accuracy, reduces the false positive rate, and saves cost.
The sensitive word filtering process is described in detail below with reference to the accompanying drawings.
Example one
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a filtering process of sensitive words in an embodiment of the present application. The method comprises the following specific steps:
step 101, obtaining a target text.
The target text is the text which needs sensitive word filtering.
The target text can be obtained by receiving a text request sent by a client;
or may be obtained by copying, transmission, or the like.
Before word segmentation is performed on the target text, the target text may be preprocessed.
In this step, the preprocessing of the target text specifically includes:
filtering out special symbols in the target text, for example: , % # @ and the like;
converting traditional Chinese characters in the target text into simplified characters;
and filtering out stop words in the target text, for example modal particles, auxiliary words, and the like.
The above is only an example implementation; the specific implementation is not limited to these operations.
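The preprocessing steps above might be sketched as follows (an illustrative fragment; the symbol set, the stop-word list, and the tiny traditional-to-simplified table are placeholders, since a real system would use a full stop-word list and a converter such as OpenCC):

```python
import re

# Tiny illustrative tables, not production resources.
STOP_WORDS = {"的", "了", "吗", "呢"}          # modal particles / auxiliaries
T2S = {"愛": "爱", "車": "车"}                 # traditional -> simplified (sample)

def preprocess(text):
    # 1. filter out special symbols such as , % # @
    text = re.sub(r"[,%#@!?，。！？]", "", text)
    # 2. convert traditional characters to simplified ones
    text = "".join(T2S.get(ch, ch) for ch in text)
    # 3. filter out stop words (character-level here, for illustration)
    return "".join(ch for ch in text if ch not in STOP_WORDS)
```

For instance, `preprocess("我愛你了#@")` strips the symbols and the particle and simplifies the characters, yielding `"我爱你"`.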
And 102, performing word segmentation processing on the target text.
In the embodiment of the present application, a specific implementation manner of performing word segmentation is not limited, for example, a jieba word segmentation system may be used to perform word segmentation.
Sensitive word matching is performed on the segmented text; the segmented text can also be backed up into a training set and a test set, to be used as samples for training and testing in subsequent iterations.
Step 103: performing sensitive word matching on the segmented target text, and storing the matched sensitive words in a first sensitive word set.
Step 104: determining, based on a preset Markov logic network model, whether the emotion type of the context corresponding to each sensitive word in the target text is a specified emotion type.
In this step, the preset Markov logic network model is used to determine the emotion type corresponding to each word in the first sensitive word set. There may be many emotion types or only two, as determined by the classification of the preset emotion analysis model; the emotion type whose corresponding sensitive words need to be removed from the first sensitive word set is defined as the specified emotion type.
When it is determined, based on the preset Markov logic network model, that the context of a sensitive word in the first set belongs to the specified emotion type, that sensitive word is deleted from the first sensitive word set.
Step 105: deleting the sensitive words whose context belongs to the specified emotion type from the first sensitive word set to obtain a second sensitive word set.
Step 106: filtering the target text using the sensitive words in the second sensitive word set.
In the embodiment of the application, after sensitive word matching is performed on the target text, whether the emotion type of the context corresponding to each sensitive word belongs to the specified emotion type is determined based on the preset Markov logic network model, which further determines whether the matched sensitive word needs to be filtered out of the target text. This scheme combines sensitive word matching with emotion analysis to filter the sensitive words of the target text, improving the accuracy of sensitive word filtering while reducing labor cost and avoiding false positives.
Example two
Referring to fig. 2, fig. 2 is a schematic diagram of a sensitive word filtering process in the second embodiment of the present application. The method comprises the following specific steps:
step 201, obtaining a target text.
The target text is the text which needs sensitive word filtering.
The target text can be obtained by receiving a text request sent by a client;
or may be obtained by copying, transmission, or the like.
In the embodiment of the application, before word segmentation is performed on the target text, the target text may be preprocessed.
The preprocessing of the target text specifically includes:
filtering out special symbols in the target text;
converting traditional characters in the target text into simplified characters;
and filtering out stop words in the target text.
The above is only an example implementation, and the specific implementation is not limited to the above operation.
Step 202, performing word segmentation processing on the target text.
In the embodiment of the present application, a specific implementation manner of performing word segmentation is not limited, for example, a jieba word segmentation system may be used to perform word segmentation.
Sensitive word matching is performed on the segmented text; the segmented text can also be backed up into the training set and the test set, to be used as samples for training and testing in subsequent iterations.
Step 203: performing sensitive word matching on the segmented target text through a preset decision tree model, and storing the matched sensitive words in a first sensitive word set.
The preset decision tree model is built on a Trie tree based on a DFA (deterministic finite automaton).
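A DFA-style Trie matcher of the kind referred to above can be sketched as follows (a minimal illustration, not the patented preset decision tree model; the `"#"` end-of-word marker is a convention invented for this example):

```python
def build_trie(words):
    """Build a nested-dict trie; the key "#" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = True
    return root

def match_sensitive(text, trie):
    """Scan text character by character, walking the trie like a DFA;
    return the set of sensitive words found (including overlaps)."""
    found = set()
    for i in range(len(text)):
        node, j = trie, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "#" in node:          # a complete sensitive word ends here
                found.add(text[i:j])
    return found
```

Each character of the text is consumed at most once per start position, so matching is linear in the text length times the longest lexicon entry.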
Referring to fig. 3, fig. 3 is a schematic diagram of an obtaining process of a preset decision tree model in the embodiment of the present application. The method comprises the following specific steps:
step 301, a training sample is obtained.
The training samples are some set words or word sets obtained by segmenting the text.
Step 302, vectorizing the training samples.
Vectorization is a technique widely applied in the field of machine learning; by vectorizing the parameters, inner loops in the code are avoided, which improves computational efficiency.
Step 303, determining the weight of each training sample through a TF-IDF algorithm to obtain a weight matrix of the training samples.
TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining. TF is the term frequency and IDF is the inverse document frequency.
The TF-IDF algorithm considers that if a word or phrase appears in one article with a high frequency (high TF) and rarely appears in other articles (low document frequency), the word or phrase has good category discrimination capability and is suitable for classification. TF-IDF is simply TF × IDF, where TF denotes the frequency with which a term appears in document d.
The main idea of IDF is: if fewer documents contain the term t, i.e. the smaller n is, the larger the IDF, then the term t has good category-distinguishing capability. However, if the number of documents containing the term t within a class of documents C is m, and the total number of documents containing t in the other classes is k, then the number of documents containing t is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which would suggest that the category-distinguishing capability of t is weak.
In practice, however, if a term appears frequently in the documents of one class, it indicates that the term represents the characteristics of that class well; such terms should be given higher weight and selected as feature words of that class to distinguish it from other classes. This is the deficiency of IDF. Within a given document, the term frequency (TF) is the frequency with which a given word appears in that document. This count is normalized by the total number of words to prevent bias towards long documents: the same word may occur more often in a long document than in a short one regardless of its importance. For a word in a particular document, its importance can be expressed as:
$\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$
the numerator in the above equation is the number of occurrences of the word in the document, and the denominator is the sum of the number of occurrences of all words in the document.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular term may be obtained by dividing the total number of documents by the number of documents containing that term, and taking the base-10 logarithm of the quotient:
$\mathrm{idf}_i = \log_{10}\dfrac{|D|}{|\{\,j : t_i \in d_j\,\}|}$
where $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the term $t_i$. If the term does not appear in the corpus this denominator would be zero, so $1 + |\{j : t_i \in d_j\}|$ is typically used instead. The fewer the documents containing the word, the larger the IDF value.
The product of TF and IDF is then calculated.
A high term frequency within a particular document, combined with a low document frequency of that term across the whole document collection, yields a high TF-IDF weight.
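The TF-IDF weighting described above can be sketched as follows (a minimal illustration using a base-10 logarithm as in the formulas; production systems usually add smoothing to the denominator):

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights per document.

    TF  = count of term in doc / total terms in doc.
    IDF = log10(|D| / df), with df = number of docs containing the term.
    `docs` is a list of tokenised documents (lists of terms);
    returns a list of {term: weight} dicts, one per document.
    """
    n_docs = len(docs)
    df = {}                                   # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        total = len(doc)
        w = {}
        for term in set(doc):
            tf = doc.count(term) / total
            idf = math.log10(n_docs / df[term])
            w[term] = tf * idf
        weights.append(w)
    return weights
```

A term occurring in every document gets IDF = log10(1) = 0, so its weight vanishes, matching the intuition that ubiquitous terms carry no category information.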
And step 304, classifying the training samples by using a C4.5 decision tree algorithm.
C4.5 is one of the decision tree algorithms. As a classification algorithm, the goal of a decision tree is to classify n samples with p-dimensional features into c classes. This corresponds to a mapping c = f(n): each sample is assigned a class label via a transformation. The decision tree represents this classification process as a tree that branches at each node by selecting one feature p_i.
In the embodiment of the application, classification can be realized through a C4.5 decision tree, or through other algorithms such as CART; the present application is not limited in this respect.
The establishment and training of the preset decision tree model in the embodiment of the present application may be performed on a sensitive word filtering device, or may be performed on other devices, which is not limited in the embodiment of the present application.
And 305, constructing a preset decision tree model according to the weight matrix of the training sample and the classification result.
After the C4.5 algorithm and the TF-IDF algorithm are applied, the initial training set becomes a training set with m feature dimensions, and each feature has an information gain ratio. The feature with the largest information gain ratio is selected as the root of the decision tree; the samples satisfying that feature then become child nodes of the root, and each child node continues to be divided until a classification can be uniquely determined.
For sensitive words, the classification dimension is determined first, i.e. whether a word is a sensitive word in the simplest case; this dimension has 2 values (yes/no). Then the features are determined, such as political sensitivity, military sensitivity, financial sensitivity, and the like, where each sensitivity feature generally takes three values: low, medium, and high. Following the above steps, a tree structure is finally generated, with each classification result as a leaf.
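The C4.5 split criterion used above, the information gain ratio, can be sketched as follows (a minimal illustration for discrete-valued features; the function names are invented for this example):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def gain_ratio(rows, labels, feature):
    """C4.5 split criterion: information gain divided by split information.

    rows: list of {feature: value} dicts; labels: parallel class labels.
    """
    total = len(rows)
    groups = {}                               # feature value -> labels in branch
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    cond = sum(len(sub) / total * entropy(sub) for sub in groups.values())
    gain = entropy(labels) - cond             # plain information gain (ID3)
    split_info = -sum(len(sub) / total * math.log2(len(sub) / total)
                      for sub in groups.values())
    return gain / split_info if split_info else 0.0
```

The root feature is then simply `max(features, key=lambda f: gain_ratio(rows, labels, f))`, and the same computation recurses into each branch.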
The method further comprises:
testing the preset decision tree model by using a word test set;
when the accuracy of the test result is smaller than a preset threshold value, updating the preset decision tree model through pruning;
testing the updated preset decision tree model again;
and when the accuracy of the test result is not less than a preset threshold value, taking the current preset decision tree model as the preset decision tree model to be subjected to sensitive word matching.
The purpose of pruning is to keep the decision tree from overfitting: during repeated training, the branches of the decision tree inevitably multiply. If the accuracy is then found to drop during testing, the current decision tree model must be analysed for branches that need to be cut. For example, in the initially constructed decision tree, suppose political sensitivity is the root node and military sensitivity sits below it (taking the values low, medium, and high); when military sensitivity is low the decision result is not a sensitive word, while medium and high yield a sensitive word. If during subsequent testing the current accuracy is found to be 60%, and analysis shows that directly treating words with high political sensitivity as sensitive words reaches an accuracy of 66.6%, which is greater than 60%, then it is decided to trim those branches.
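The pruning decision described above, comparing test-set accuracy of a subtree against the accuracy of collapsing it into a single leaf, can be sketched as follows (illustrative only; predictors are modelled as plain callables):

```python
def accuracy(predict, test_set):
    """Fraction of (sample, label) pairs predicted correctly."""
    hits = sum(1 for x, y in test_set if predict(x) == y)
    return hits / len(test_set)

def maybe_prune(subtree_predict, leaf_predict, test_set, threshold):
    """Return the collapsed-leaf predictor when the subtree misses the
    accuracy threshold and the leaf does at least as well; this mirrors the
    60% subtree vs 66.6% leaf decision sketched in the example above."""
    sub_acc = accuracy(subtree_predict, test_set)
    if sub_acc < threshold and accuracy(leaf_predict, test_set) >= sub_acc:
        return leaf_predict
    return subtree_predict
```

Applied bottom-up over the tree, this is essentially reduced-error pruning: each subtree is kept only if it beats its own majority-class leaf on held-out data.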
Step 204: determining, based on a preset Markov logic network model, whether the emotion type of the context corresponding to each sensitive word in the target text is a specified emotion type.
A Markov logic network is a first-order logic knowledge base in which each formula or clause carries a weight; the constants represent objects in the domain, and every possible grounding of a first-order formula in the ground Markov network has the corresponding weight. Inference in a Markov logic network is performed by applying Markov chain Monte Carlo methods to the minimal subset of groundings required to answer a query. The weights are learned efficiently from relational databases by iteratively optimizing a pseudo-likelihood measure; optionally, additional clauses can be learned using inductive logic programming techniques. Experiments with a real-world database and knowledge base in a university domain have shown that this approach is promising.
A first-order logic knowledge base can be viewed as a set of hard constraints on the set of possible worlds: a world that violates even a single formula has zero probability. The basic idea of a Markov logic network is to soften these constraints: a world that conflicts with a formula in the knowledge base is not impossible, merely less probable, and the fewer formulas it violates, the more probable it is. Each formula is associated with a weight reflecting the strength of its constraint: the higher the weight, the greater the difference in log-probability, other things being equal, between a world that satisfies the formula and one that does not.
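The softening described above can be illustrated numerically: in a Markov logic network, the probability of a possible world is proportional to the exponential of the sum, over all weighted rules, of the rule weight times the number of its satisfied groundings. A minimal sketch over a finite set of candidate worlds (all names invented for this example):

```python
import math

def world_log_weight(world, rules):
    """Unnormalised log-probability of a possible world.

    rules: list of (weight, count_fn) pairs, where count_fn(world) returns
    the number of groundings of that rule the world satisfies.
    """
    return sum(w * count(world) for w, count in rules)

def world_probabilities(worlds, rules):
    """Normalise exp(log-weight) over a finite set of candidate worlds."""
    raw = [math.exp(world_log_weight(x, rules)) for x in worlds]
    z = sum(raw)                      # partition function over the candidates
    return [r / z for r in raw]
```

With a single rule of weight 1.0, a world satisfying one extra grounding is exactly e times more probable than one that does not, which is the "difference in log probability" the passage describes.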
Referring to fig. 4, fig. 4 is a schematic diagram of an obtaining process of a preset markov logic network model in the embodiment of the present application. The method comprises the following specific steps:
step 401, defining predicates and rules according with emotional expressions based on the Markov logic network.
And step 402, extracting feature words of the training samples.
And step 403, constructing a preset Markov logic network model according to the defined predicate and rule which accord with the emotional expression and the feature words.
The training samples can be collected with a crawler from blogs or comments on Internet social platforms such as Weibo and Baidu Tieba, with the quality of a comment judged by its numbers of likes and dislikes; one may also refer to the Chinese opinion analysis evaluation (COAE) corpora or the many corpora on the Internet, some of which are labelled manually.
In the embodiment of the present application, the obtaining method of the training sample is not limited.
For example, emotions are classified as happy, sad, disgusted, angry, frightened, and so on. Suppose there is a corpus sentence "The XX phone is real garbage, it wasted a lot of my money, and I will never buy it again", labelled in the corpus with the tags disgust and anger. The first-order logic rules that can then be constructed associate "real garbage", "wasted a lot of money", and "never buy it again" with the emotion disgust; similar sentences carry certain weights for disgust in the corpus. The logic rules, their groundings, and the weights form a logic network. When a new corpus sentence hits these rules, an emotion classification can be confirmed according to the inference algorithm.
By establishing a preset Markov logic network model for emotion analysis, the emotion of a sensitive word in its context can be determined, and whether a preliminarily matched sensitive word is a real sensitive word can then be decided, greatly reducing the false positive rate.
The Markov logic network model can classify the text as happy, sad, angry, and so on; however many kinds of emotion are distinguished, they are divided into two groups according to the application scenario of the target text: the specified emotion type and the non-specified emotion types. If the specified emotion type is a positive emotion type, the non-specified types are negative ones. In the embodiment of the application, the sensitive words whose context carries the specified emotion type are treated as words that do not need to be filtered out of the target text, which avoids false positives; sensitive word filtering can thus be realized automatically, reducing cost and improving the success rate.
The method further comprises:
performing regression testing on the preset Markov logic network model by using a text test set;
when the test result does not meet a preset condition, reestablishing the preset Markov logic network model;
performing regression testing again on the reestablished preset Markov logic network model;
and when the test result meets the preset condition, taking the currently established preset Markov logic network model as the preset Markov logic network model used for determining the emotion type.
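The rebuild-and-retest loop in the steps above might be organized as follows. The training and evaluation callables, the threshold, and the round limit are placeholders for whatever the system actually uses, not details from the patent.

```python
def build_until_passing(train_fn, evaluate_fn, test_set,
                        threshold=0.85, max_rounds=5):
    """Rebuild the model until its regression-test score meets the
    preset condition (here assumed to be: accuracy >= threshold)."""
    model = train_fn()
    for _ in range(max_rounds):
        if evaluate_fn(model, test_set) >= threshold:
            return model      # test result meets the preset condition
        model = train_fn()    # reestablish the model and test again
    raise RuntimeError("model never met the preset condition")
```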
The establishment and training of the preset Markov logic network model in the embodiment of the application may be performed on the sensitive word filtering device or on other equipment; the embodiment of the application is not limited in this respect.
Step 205, deleting the sensitive words corresponding to the context belonging to the specified emotion type from the first sensitive word set to obtain a second sensitive word set.
In the embodiment of the application, the specified emotion type is set according to the application environment of the target text. For example, the specified emotion type for a mourning text is sadness, and the specified emotion type for wedding congratulations is happiness; these are merely examples, and the arrangement is not limited thereto.
And step 206, filtering the target text by using the sensitive words in the second sensitive word set.
In the embodiment of the application, after sensitive word matching is performed on the target text based on a DFA Trie tree, the emotion type to which the context of each matched sensitive word belongs is determined through the Markov logic network model, which further decides whether the matched sensitive word needs to be filtered from the target text. This implementation filters the sensitive words of the target text by combining sensitive word matching with emotion analysis, improving the accuracy of sensitive word filtering while reducing labor cost and avoiding false killing.
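The combined flow of steps 201-206 can be condensed into a sketch like the following. The sensitive word lexicon, the stand-in emotion classifier, the whitespace segmentation, and the asterisk masking are all simplifying assumptions, not the patent's actual components.

```python
SENSITIVE = {"garbage", "scam"}   # assumed sensitive word lexicon
SPECIFIED = {"sad"}               # emotion types exempt in this scene

def emotion_of(context):
    """Stand-in for the Markov logic network emotion classifier."""
    return "sad" if "mourn" in context else "angry"

def filter_text(text):
    words = text.split()                          # toy word segmentation
    first_set = [w for w in words if w in SENSITIVE]
    second_set = [w for w in first_set
                  if emotion_of(text) not in SPECIFIED]
    for w in second_set:                          # mask remaining words
        text = text.replace(w, "*" * len(w))
    return text
```

A matched word is masked only when the context's emotion falls outside the specified set, mirroring the removal from the first set to form the second set.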
Based on the same inventive concept, the embodiment of the application also provides a sensitive word filtering device. Referring to fig. 5, fig. 5 is a schematic structural diagram of an apparatus applied to the above technology in the embodiment of the present application. The device comprises: the system comprises an acquisition unit 501, a word segmentation unit 502, a matching unit 503, an analysis unit 504, a first filtering unit 505 and a second filtering unit 506;
an obtaining unit 501, configured to obtain a target text;
a word segmentation unit 502, configured to perform word segmentation processing on the target text acquired by the obtaining unit 501;
the matching unit 503 is configured to perform sensitive word matching on the target text subjected to word segmentation processing by the word segmentation unit 502, and store the matched sensitive words in the first sensitive word set;
an analyzing unit 504, configured to determine, based on a preset markov logic network model, whether an emotion type to which a context corresponding to the sensitive word matched by the matching unit 503 in the target text belongs is a specified emotion type;
the first filtering unit 505 is configured to delete the sensitive word determined by the analyzing unit 504 and corresponding to the context of the specified emotion type from the first sensitive word set matched by the matching unit 503 to obtain a second sensitive word set;
the second filtering unit 506 is configured to filter the target text acquired by the acquiring unit 501 by using the sensitive words in the second sensitive word set acquired by the first filtering unit 505.
Preferably, the apparatus further comprises: a preprocessing unit 507;
the preprocessing unit 507 is configured to preprocess the target text acquired by the acquisition unit 501.
Preferably,
the matching unit 503 is specifically configured to perform sensitive word matching on the target text after word segmentation processing through a preset decision tree model; the obtaining of the preset decision tree model comprises: obtaining a training sample; vectorizing the training samples; determining the weight of each training sample through a TF-IDF algorithm to obtain a weight matrix of the training samples; classifying the training samples by using a C4.5 decision tree algorithm; and constructing a preset decision tree model according to the weight matrix of the training sample and the classification result.
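The TF-IDF weighting step described above can be illustrated without any library dependency. This toy version builds only the per-sample weight matrix; the subsequent C4.5 classification is not reproduced here, and a production system would normally rely on an established implementation.

```python
import math

def tf_idf_matrix(samples):
    """samples: list of token lists -> one {token: weight} row each.

    tf  = count(token in sample) / len(sample)
    idf = log(num_samples / num_samples_containing_token)
    """
    n = len(samples)
    doc_freq = {}
    for tokens in samples:
        for t in set(tokens):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    rows = []
    for tokens in samples:
        row = {}
        for t in set(tokens):
            tf = tokens.count(t) / len(tokens)
            row[t] = tf * math.log(n / doc_freq[t])
        rows.append(row)
    return rows
```

A token that appears in every sample gets idf = log(1) = 0, so its weight vanishes; rarer tokens receive larger weights, which is what makes the matrix useful as decision-tree input.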
Preferably,
the acquisition of the preset Markov logic network model comprises the following steps: defining predicates and rules which accord with emotional expression based on a Markov logic network; extracting feature words of the training samples; and constructing a preset Markov logic network model according to the defined predicates and rules which accord with the emotional expressions and the characteristic words.
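The three ingredients listed above — predicate definitions, weighted rules, and feature words extracted from training samples — can be assembled into a model object as sketched below. All names and numbers are illustrative assumptions; actual Markov logic network construction (weight learning, grounding, inference) is far more involved.

```python
def extract_feature_words(samples, top_k=3):
    """Pick the top_k most frequent tokens as feature words."""
    counts = {}
    for tokens in samples:
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:top_k]

def build_model(predicates, weighted_rules, samples):
    """Bundle predicates, rules and extracted feature words together."""
    return {
        "predicates": predicates,   # e.g. "HasPhrase(s, p)", "Emotion(s, e)"
        "rules": weighted_rules,    # (formula, emotion, weight) triples
        "features": extract_feature_words(samples),
    }
```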
The units of the above embodiments may be integrated into one body or deployed separately; they may be combined into one unit or further divided into a plurality of sub-units.
In another embodiment, an electronic device is also provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the sensitive word filtering method when executing the program.
In another embodiment, a computer-readable storage medium is also provided having stored thereon computer instructions that, when executed by a processor, may implement the steps in the sensitive word filtering method.
Fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device may include: a Processor (Processor)610, a communication Interface (Communications Interface)620, a Memory (Memory)630 and a communication bus 640, wherein the Processor 610, the communication Interface 620 and the Memory 630 communicate with each other via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform the following method:
acquiring a target text;
performing word segmentation processing on the target text;
sensitive word matching is carried out on the target text after word segmentation processing, and the matched sensitive words are stored in a first sensitive word set;
determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a specified emotion type or not based on a preset Markov logic network model;
deleting the sensitive words corresponding to the context belonging to the specified emotion type from the first sensitive word set to obtain a second sensitive word set;
filtering the target text using the sensitive words in the second set of sensitive words.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (11)

1. A sensitive word filtering method, the method comprising:
acquiring a target text;
performing word segmentation processing on the target text;
sensitive word matching is carried out on the target text after word segmentation processing, and the matched sensitive words are stored in a first sensitive word set;
determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a designated emotion type or not by using a preset Markov logic network model;
deleting the sensitive words corresponding to the context belonging to the specified emotion type from the first sensitive word set to obtain a second sensitive word set;
filtering the target text using the sensitive words in the second set of sensitive words.
2. The method of claim 1, wherein after the obtaining the target text and before the performing the word segmentation on the target text, the method further comprises:
and preprocessing the target text.
3. The method of claim 1, wherein performing sensitive word matching on the participled target text comprises:
sensitive word matching is carried out on the target text after word segmentation processing through a preset decision tree model;
the obtaining of the preset decision tree model comprises:
obtaining a training sample;
vectorizing the training samples;
determining the weight of each training sample through a term frequency (TF)-inverse document frequency (IDF) algorithm to obtain a weight matrix of the training samples;
classifying the training samples by using a C4.5 decision tree algorithm;
and constructing a preset decision tree model according to the weight matrix of the training sample and the classification result.
4. The method of claim 3, further comprising:
testing the preset decision tree model by using a word test set;
when the accuracy of the test result is smaller than a preset threshold value, updating the preset decision tree model through pruning;
testing the updated preset decision tree model again;
and when the accuracy of the test result is not less than a preset threshold value, taking the current preset decision tree model as the preset decision tree model to be subjected to sensitive word matching.
5. The method of any one of claims 1-4, wherein obtaining the preset Markov logic network model comprises:
defining predicates and rules which accord with emotional expression based on a Markov logic network;
extracting feature words of the training samples;
and constructing a preset Markov logic network model according to the defined predicates and rules which accord with the emotional expressions and the characteristic words.
6. The method of claim 5, further comprising:
performing regression testing on the preset Markov logic network model by using a text test set;
when the test result does not meet the preset condition, reestablishing the preset Markov logic network model;
performing regression testing again on the reestablished preset Markov logic network model;
and when the test result meets the preset condition, taking the currently established preset Markov logic network model as the preset Markov logic network model for determining the emotion type.
7. A sensitive word filtering device, the device comprising: the device comprises an acquisition unit, a word segmentation unit, a matching unit, an analysis unit, a first filtering unit and a second filtering unit;
the acquisition unit is used for acquiring a target text;
the word segmentation unit is used for carrying out word segmentation processing on the target text acquired by the acquisition unit;
the matching unit is used for matching sensitive words of the target text subjected to word segmentation processing by the word segmentation unit and storing the matched sensitive words into a first sensitive word set;
the analysis unit is used for determining whether the emotion type to which the context corresponding to the sensitive word in the target text belongs is a specified emotion type or not based on a preset Markov logic network model;
the first filtering unit is used for deleting the sensitive words corresponding to the contexts which belong to the specified emotion types and are determined by the analyzing unit from the first sensitive word set matched by the matching unit to obtain a second sensitive word set;
and the second filtering unit is used for filtering the target text acquired by the acquiring unit by using the sensitive words in the second sensitive word set acquired by the first filtering unit.
8. The apparatus of claim 7,
the matching unit is specifically used for performing sensitive word matching on the target text after word segmentation processing through a preset decision tree model; the obtaining of the preset decision tree model comprises: obtaining a training sample; vectorizing the training samples; determining the weight of each training sample through a term frequency (TF)-inverse document frequency (IDF) algorithm to obtain a weight matrix of the training samples; classifying the training samples by using a C4.5 decision tree algorithm; and constructing a preset decision tree model according to the weight matrix of the training sample and the classification result.
9. The apparatus of claim 7, wherein the obtaining of the preset Markov logic network model comprises: defining predicates and rules which accord with emotional expression based on a Markov logic network; extracting feature words of the training samples; and constructing a preset Markov logic network model according to the defined predicates and rules which accord with the emotional expressions and the characteristic words.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-6 when executing the program.
11. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 6.
CN202011070783.2A 2020-10-09 2020-10-09 Sensitive word filtering method and device Pending CN113761112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011070783.2A CN113761112A (en) 2020-10-09 2020-10-09 Sensitive word filtering method and device


Publications (1)

Publication Number Publication Date
CN113761112A true CN113761112A (en) 2021-12-07

Family

ID=78785784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011070783.2A Pending CN113761112A (en) 2020-10-09 2020-10-09 Sensitive word filtering method and device

Country Status (1)

Country Link
CN (1) CN113761112A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100082332A1 (en) * 2008-09-26 2010-04-01 Rite-Solutions, Inc. Methods and apparatus for protecting users from objectionable text
CN106055541A (en) * 2016-06-29 2016-10-26 清华大学 News content sensitive word filtering method and system
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN107992471A (en) * 2017-11-10 2018-05-04 北京光年无限科技有限公司 Information filtering method and device in a kind of interactive process
CN110991171A (en) * 2019-09-30 2020-04-10 奇安信科技集团股份有限公司 Sensitive word detection method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cheng Bangwen, Yang Hongjin: "Statistical Survey Data Quality Control: Theory, Methods and Practice of Data Review and Evaluation", 31 October 2019, Beijing: Science and Technology Literature Press, page 76 *
Li Wei: "Design of a Web Page Sensitive Word Filtering and Sensitive Text Classification System", Computer Knowledge and Technology, no. 08 *
Hao Zhifeng (ed.): "Data Science and Mathematical Modeling", 31 January 2019, Wuhan: Huazhong University of Science and Technology Press, page 90 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417883A (en) * 2022-01-10 2022-04-29 马上消费金融股份有限公司 Data processing method, device and equipment
CN114417883B (en) * 2022-01-10 2022-10-25 马上消费金融股份有限公司 Data processing method, device and equipment
CN114706940A (en) * 2022-01-19 2022-07-05 浙报融媒体科技(浙江)股份有限公司 Sensitive word-based news file auditing method and system
CN118013963A (en) * 2024-04-09 2024-05-10 四川易景智能终端有限公司 Method and device for identifying and replacing sensitive words

Similar Documents

Publication Publication Date Title
US11494648B2 (en) Method and system for detecting fake news based on multi-task learning model
KR102020756B1 (en) Method for Analyzing Reviews Using Machine Leaning
US8892580B2 (en) Transformation of regular expressions
CN113761112A (en) Sensitive word filtering method and device
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN110309304A (en) A kind of file classification method, device, equipment and storage medium
Sheikhi et al. An effective model for SMS spam detection using content-based features and averaged neural network
Suleiman et al. SMS spam detection using H2O framework
CN113239268B (en) Commodity recommendation method, device and system
KR102527937B1 (en) A method for searching the similar patents based on artificial intelligence and an apparatus thereof
CN115795061A (en) Knowledge graph construction method and system based on word vectors and dependency syntax
Deekshan et al. Detection and summarization of honest reviews using text mining
Dagar et al. Twitter sentiment analysis using supervised machine learning techniques
Redondo-Gutierrez et al. Detecting malware using text documents extracted from spam email through machine learning
CN112966507B (en) Method, device, equipment and storage medium for constructing recognition model and attack recognition
Bokolo et al. Cyberbullying detection on social media using machine learning
Guermazi et al. Using a semi-automatic keyword dictionary for improving violent web site filtering
CN115496066A (en) Text analysis system, text analysis method, electronic device, and storage medium
George et al. Bangla fake news detection based on multichannel combined CNN-LSTM
CN115098773A (en) Big data-based public opinion monitoring and analyzing system and method
Haque et al. Sentiment analysis in low-resource bangla text using active learning
Shil et al. An approach for detecting Bangla spam comments on Facebook
CN114117047A (en) Method and system for classifying illegal voice based on C4.5 algorithm
Boyko et al. Categorizing False Information in News Content Using an Ensemble Machine Learning Model.
Rajalingam et al. Implementation of vocabulary-based classification for spam filtering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination