CN112231442A - Sensitive word filtering method and device

Info

Publication number: CN112231442A
Application number: CN202011100936.3A
Authority: CN
Inventors: 尹琼; 底亚峰; 谭佳琳; 薛晗庆; 王晓天; 李萌萌; 金娜; 毛强; 窦小明; 顾天祺; 李昊星; 魏珂; 赵翔宇; 梁瑞卿
Original assignee: Beijing Institute of Near Space Vehicles System Engineering
Current assignee: Beijing Institute of Near Space Vehicles System Engineering
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2021-01-15

Abstract

The application discloses a sensitive word filtering method and device for improving the efficiency of sensitive word filtering in the aerospace field. The method comprises the following steps: preprocessing a text; performing feature extraction on the preprocessed text to form text feature data; filtering sensitive words from the text feature data according to a sensitive word dynamic dictionary library; and generating filtered document data. The application also provides a sensitive word filtering device.

Description

Sensitive word filtering method and device

Technical Field

The present application relates to the field of information filtering, and in particular, to a method and an apparatus for filtering sensitive words.

Background

With the deepening of the military-civil fusion development strategy, the number of externally communicated and published articles in the aerospace field increases year by year. To avoid serious harm to national security and interests, such communications and articles must undergo sensitive word filtering. In the prior art, however, sensitive words are screened mainly by manual reading, which is labor-intensive and time-consuming, and no automatic sensitive word filtering method aimed at the aerospace field exists at present.

Disclosure of Invention

In view of the above technical problems, embodiments of the present application provide a method and an apparatus for filtering sensitive words, so as to improve the efficiency of filtering sensitive words in the aerospace field.

In one aspect, a method for filtering sensitive words provided in an embodiment of the present application includes:

preprocessing the text;

performing feature extraction on the preprocessed text to form text feature data;

filtering sensitive words from the text feature data according to a sensitive word dynamic dictionary library;

and generating filtered document data.

Further, the preprocessing the text includes the following steps:

extracting text information;

judging the text type;

if the document type is Chinese, performing word segmentation on the text to be processed by adopting a Chinese word segmentation method; if the document type is English, performing word segmentation on the text to be processed by adopting an English word segmentation method;

deleting punctuation marks in the text;

searching whether stop words exist in the text or not according to a stop word bank, and if so, deleting the stop words from the text;

and deleting tags.

Further, the performing feature extraction on the preprocessed text to form text feature data includes:

performing corpus model training, converting the segmented text into word vectors, and extracting word surface features;

performing corpus model training, converting the segmented text into word vectors, and extracting word semantic features;

and performing serial feature fusion on the surface features and the semantic features of the text to form feature-fused text word vectors.

Further, the filtering the sensitive words of the text feature data according to the dynamic dictionary library of the sensitive words includes:

performing sensitive word matching and filtering on the word vectors of the text feature data according to the sensitive word dynamic dictionary library to form a first sensitive word set;

deleting the sensitive words in the first sensitive word set from the word vectors of the text feature data to form first data;

calculating the similarity between the first data and the sensitive word dynamic dictionary library;

if the similarity exceeds a first preset threshold, determining a second sensitive word set according to the sensitive word dynamic dictionary library and the first data;

merging the first sensitive word set and the second sensitive word set to form a third sensitive word set;

and according to the third sensitive word set, carrying out sensitive word highlight display or replacement on the text.
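The last step above, highlighting or replacing the words of the third sensitive word set in the text, can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `mask_or_highlight`, the `[[...]]` highlight markers and the `*` mask character are assumptions made for the example.

```python
import re


def mask_or_highlight(text, sensitive_words, mode="replace", mask_char="*"):
    """Replace or highlight every occurrence of the given sensitive words.

    `mode`, `mask_char` and the [[...]] markers are illustrative assumptions;
    the source only states that words are highlighted or replaced.
    """
    if not sensitive_words:
        return text
    # Try longer words first so that a short word which is a prefix of a
    # longer one does not pre-empt the longer match.
    ordered = sorted(sensitive_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))

    def substitute(match):
        word = match.group(0)
        if mode == "highlight":
            return "[[" + word + "]]"
        return mask_char * len(word)

    return pattern.sub(substitute, text)
```

Replacement keeps the length of the original word, which preserves the layout of the document; a fixed placeholder would work equally well if layout does not matter.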

As a preferred example, the sensitive word dynamic dictionary library is determined according to the following method:

preprocessing a secret document set in the aerospace field to obtain a first dictionary alternative text;

establishing word list language category mapping according to the first dictionary alternative text;

according to the first dictionary alternative text and the category mapping, counting the word frequency of all words in each category and the total number of words in each category;

performing feature extraction on the first dictionary alternative text to obtain a second dictionary alternative text;

determining a sensitive word classification, and constructing a sensitive word classifier according to the sensitive word classification and the second dictionary alternative text;

and determining a sensitive word dynamic dictionary library according to the second dictionary alternative text and the sensitive word classifier.

As a preferred example, the preprocessing the set of aerospace domain confidential documents includes:

extracting text information;

judging the text type;

deleting punctuation marks in the text;

and deleting tags.

As a preferred example, the determining a sensitive word dynamic dictionary according to the third dictionary alternative text and the sensitive word classifier includes:

and inputting the third dictionary alternative text into the sensitive word classifier to form a high-dimensional sparse matrix table of sensitive words, and determining this table as the sensitive word dynamic dictionary.

The sensitive word filtering method provided by the invention further comprises updating the sensitive word dynamic dictionary. Specifically, updating the sensitive word dynamic dictionary includes:

adding the newly added document into the data set;

preprocessing the documents in the data set to obtain a first updated text;

extracting the features of the first updated text to obtain a first updated text feature vector;

and using a sensitive word filter to perform information filtering on each candidate word in the first updated text feature vector; if the candidate word matches the filtering feature, checking whether it already exists in the existing sensitive word dynamic dictionary library, and if it does not exist, adding it to the sensitive word dynamic dictionary.

By the sensitive word filtering method provided by the invention, the efficiency of filtering the sensitive words in the aerospace field is improved, and the cost is reduced.

On the other hand, the embodiment of the present application further provides a sensitive word filtering apparatus, including:

the preprocessing module is used for extracting information of the text, judging the text type, segmenting words according to the text type, deleting punctuation marks after the words are segmented and deleting stop words after the words are segmented;

the text feature extraction module is used for extracting the word surface features and the semantic features of the text;

the information filtering module is used for filtering the text and deleting the sensitive words;

the dynamic dictionary library module is used for recording a sensitive word set in the aerospace field;

and the word adding and updating submodule is used for updating the sensitive words in the dynamic dictionary library.

The method constructs the dynamic dictionary library based on ontology knowledge in the aerospace field, which overcomes the limitation that sensitive words cannot be exhaustively enumerated, and filters sensitive word information in a hybrid mode based on the dynamic dictionary library. The precision and accuracy of sensitive word filtering in classified documents in the aerospace field can thus be remarkably improved, and the labor and time cost of sensitive word filtering and review is effectively reduced.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a sensitive word filtering method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a text preprocessing flow provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a text feature extraction process provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of an information filtering process provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of a dynamic dictionary library generation process provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of a dynamic dictionary library updating process according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a sensitive word filtering apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Some of the words that appear in the text are explained below:

1. The term "and/or" in the embodiments of the present invention describes an association relationship of associated objects and indicates that three relationships may exist. For example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

2. In the embodiments of the present application, the term "plurality" means two or more, and other terms are similar thereto.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the display sequence of the embodiment of the present application only represents the sequence of the embodiment, and does not represent the merits of the technical solutions provided by the embodiments.

Example one

Referring to FIG. 1, which is a schematic diagram of a sensitive word filtering method provided in an embodiment of the present application, the method includes steps S101 to S104:

S101, preprocessing a text;

In this step, preprocessing includes Chinese and English word segmentation, stop-word removal, noise removal and the like. Text information is extracted from the original confidential document in the aerospace field, the text is segmented by a Chinese or English word segmentation method, and stop words and noise are then removed to form a word segmentation list that serves as the basis for subsequent information filtering. The preprocessing steps are shown in FIG. 2 and comprise:

S201, extracting text information;

In this step, since the document includes multiple types of content such as a cover, pictures and figures, the text information in the document needs to be extracted.

S202, judging the text type; if the text is Chinese, Chinese preprocessing is performed; if the text is English, English preprocessing is performed;

Existing classified documents are usually of two types, Chinese and English. Because the two types are processed differently in subsequent steps, the text type needs to be determined and a word segmentation method selected accordingly.

S203, Chinese word segmentation;

If the text type is Chinese, the Chinese confidential text information is segmented by a Chinese word segmentation method;

S204, removing Chinese punctuation;

Punctuation marks that are not needed in the text information, such as ^ & $ % # @ and the like, are deleted. It should be noted that not all punctuation marks are necessarily deleted; which ones are unnecessary depends on the word segmentation rule and the specific word segmentation algorithm.

S205, English word segmentation;

If the text type is English, individual words are separated in the English classified text by punctuation and spaces.

S206, removing English punctuation;

Similar to S204, not all punctuation marks are necessarily deleted; unnecessary ones, such as & $ % # @ and the like, are deleted according to the specific word segmentation rule, and this step is not particularly limited.

S207, deleting stop words;

In this step, stop words are deleted from the confidential text information according to the stop word library.

S208, deleting tags.

In this step, unnecessary tags, such as tags created for attributes such as date, number, mobile phone number, and e-mail address, are deleted.

The preprocessing provided by the invention uses a text information extraction technique to extract text information from the original confidential document in the aerospace field. On this basis, the text is segmented by a Chinese or English word segmentation method, and stop-word removal, noise removal and similar operations are performed to form a word segmentation list that serves as the basis for subsequent information filtering.
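The preprocessing flow of S202 to S208 can be sketched in a few lines. The sketch rests on several assumptions that are not in the source: the stop-word list and tag patterns are hypothetical, the text type is decided by comparing the number of CJK characters with the number of Latin letters, the Chinese branch splits into single characters as a placeholder for a real segmenter, and tags are deleted before segmentation so that an e-mail address is not split into pieces first.

```python
import re
import string

# Hypothetical stop-word list; a real system would load a stop-word library.
STOP_WORDS = {"the", "a", "of", "is", "的", "了", "是"}

# Assumed tag patterns for attributes such as dates and e-mail addresses.
TAG_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),         # dates like 2020-10-15
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # e-mail addresses
]

CJK = re.compile(r"[\u4e00-\u9fff]")
PUNCTUATION = set(string.punctuation) | set("，。！？；：、（）“”《》")


def detect_type(text):
    """S202: 'zh' if the text has more CJK characters than Latin letters."""
    cjk = len(CJK.findall(text))
    latin = len(re.findall(r"[A-Za-z]", text))
    return "zh" if cjk > latin else "en"


def segment(text, text_type):
    """S203/S205: split the text into tokens.

    English is split on spaces and punctuation, as in the source. The source
    names no Chinese algorithm, so this placeholder emits single characters.
    """
    if text_type == "en":
        return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
    return [ch for ch in text if not ch.isspace()]


def preprocess(text):
    for pattern in TAG_PATTERNS:                               # S208
        text = pattern.sub(" ", text)
    tokens = segment(text, detect_type(text))                  # S202, S203/S205
    tokens = [t for t in tokens if t not in PUNCTUATION]       # S204/S206
    return [t for t in tokens if t.lower() not in STOP_WORDS]  # S207
```

In practice the single-character fallback would be swapped for a dictionary-based or statistical Chinese segmenter; the surrounding steps stay the same.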

S102, extracting the features of the preprocessed text to form text feature data;

In this step, feature extraction is performed on the preprocessed text for subsequent processing. It mainly comprises the following three steps:

Step 1: extracting word surface features. Surface features consider only the surface meaning of the text, for example keywords extracted by methods such as TF-IDF and mutual information. Using the preprocessed text data as the training corpus, text word vectors are extracted with a TF-IDF model;

Step 2: extracting word semantic features by converting the preprocessed, segmented text into word vectors. To further reduce the probability of semantic ambiguity, deep semantic vectors of the text, namely the semantic features of the word vectors, are obtained with models such as word2vec and GloVe. As a preferred example, the training process is as follows: using the preprocessed text data as the training corpus, a word2vec model is trained to obtain text word vectors;

Step 3: fusing the surface features and the semantic features to form text feature data.

Serial feature fusion is performed on the surface features and the semantic features of the text, that is, the word vectors obtained with the TF-IDF model and the word vectors obtained with the word2vec model are concatenated to form feature-fused text word vectors (converting text data into numerical data).

Specifically, as described with reference to FIG. 3, this includes steps S301 to S306:

S301, inputting the preprocessed text;

S302, training with the TF-IDF model to obtain word vectors;

S303, extracting word surface features;

In steps S302 and S303, corpus model training is performed, the segmented text is converted into word vectors, and word surface features are extracted;

S304, training with the word2vec model to obtain word vectors;

S305, extracting word semantic features;

In steps S304 and S305, corpus model training is performed, the segmented text is converted into word vectors, and word semantic features are extracted;

S306, performing serial feature fusion on the surface features and the semantic features to form text feature data. The specific fusion method is as follows: assume that the word surface feature of the text is $M = (m_1, m_2, \ldots, m_j)$ and the semantic feature of the text is $N = (n_1, n_2, \ldots, n_j)$; then the serial feature fusion result is expressed as $L = \{M \cup N\} = \{m_1, m_2, \ldots, m_j, n_1, n_2, \ldots, n_j\}$.

After the above processing, text feature data is formed. It should be noted that in this step, the word surface features and the word semantic features of the text are fused in a serial manner to obtain word vectors fused with the text features, so that the original text features are efficiently extracted, the probability of semantic ambiguity of the word features is reduced, and a data basis is provided for subsequent sensitive word information filtering.
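Serial feature fusion amounts to concatenating the surface vector and the semantic vector. The sketch below illustrates this with a hand-written TF-IDF and a toy embedding table standing in for a trained word2vec model. The smoothed IDF variant, the per-document averaging of embeddings and all function names are assumptions for illustration; the source does not fix these details.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc})
    doc_freq = Counter(word for doc in documents for word in set(doc))
    n_docs = len(documents)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vector = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            # Smoothed IDF (an assumption; the source names no variant).
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


def mean_embedding(doc, embeddings, dim):
    """Average the semantic vectors of the known words in a document."""
    known = [embeddings[w] for w in doc if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def serial_fusion(surface, semantic):
    """Concatenate surface features M and semantic features N into L."""
    return list(surface) + list(semantic)
```

A short usage example with made-up tokens and two-dimensional embeddings:

```python
docs = [["thrust", "engine", "thrust"], ["engine", "orbit"]]
toy_embeddings = {"thrust": [1.0, 0.0], "engine": [0.0, 1.0]}  # hypothetical values
vocabulary, surface = tfidf_vectors(docs)
fused = serial_fusion(surface[0], mean_embedding(docs[0], toy_embeddings, 2))
print(len(fused))  # 3 TF-IDF entries followed by 2 embedding entries
```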

S103, filtering sensitive words from the text feature data according to a sensitive word dynamic dictionary library;

In this step, based on the sensitive word dynamic dictionary library, two filtering modes are combined to improve the accuracy of sensitive word filtering. First, sensitive words are replaced or highlighted by sensitive word matching. Then the similarity between each input text feature word vector and the word vectors in the sensitive word dynamic dictionary library is calculated, and suspicious or similar sensitive words (those whose similarity exceeds a set threshold α) are further identified by information filtering, which effectively improves the efficiency and quality of sensitive word filtering.

Specifically, the processing of this step is shown in FIG. 4 and includes S401 to S407:

S401, performing sensitive word matching and filtering on the word vectors of the text feature data according to the sensitive word dynamic dictionary library to form a first sensitive word set;

S402, deleting the sensitive words in the first sensitive word set from the word vectors of the text feature data to form first data;

S403, for the first data (denoted D) and the sensitive word dynamic dictionary library (denoted W), calculating the similarity Sim(Di, Wj) between each word in D and each word in W, where the specific calculation method is as follows:

wherein Dis(Di, Wj) represents the distance between the word Di in the first data and the word Wj in the sensitive word dynamic dictionary library, and the calculation formula is

λ is an adjustable parameter whose default value is λ₀; the value of λ₀ can be set according to practical engineering experience. The calculation results Sim(Di, Wj) are stored in an array A in descending order;

S404, judging whether the similarity exceeds a first preset threshold; if so, executing S405, otherwise executing S406. For example, the first preset threshold is set as α with a default value α₀ (the value of α₀ should be not less than 0.6, and the specific value can be set according to practical engineering experience);

S405, determining a second sensitive word set according to the sensitive word dynamic dictionary library and the first data;

S406, merging the first sensitive word set and the second sensitive word set to form a third sensitive word set;

S407, performing sensitive word replacement on the text according to the third sensitive word set.

In the sensitive word filtering method provided by the invention, sensitive word matching is first performed on the feature-fused word vectors, and matched sensitive words are deleted or replaced. Similarity-based filtering is then applied: if the calculated similarity exceeds the threshold, the word is deleted or replaced as a sensitive word; otherwise no processing is performed. Through step S103, the sensitive words of the text to be processed are filtered and the filtering efficiency is improved.
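The two-stage filtering of S401 to S406 can be sketched as below. The formulas for Sim and Dis are not reproduced in this text, so the sketch assumes a Euclidean distance and the form Sim = λ / (λ + Dis); both are stand-ins chosen only because they map a distance of 0 to a similarity of 1 and depend on an adjustable λ. The function names and the dictionary-of-vectors representation are likewise illustrative.

```python
import math


def distance(u, v):
    """Euclidean distance (assumed; the source formula for Dis is not shown)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u, v, lam=1.0):
    """Assumed form Sim = lam / (lam + Dis); equals 1 when the vectors coincide."""
    return lam / (lam + distance(u, v))


def filter_sensitive(word_vectors, dictionary, alpha=0.6, lam=1.0):
    """Return (first_set, second_set, third_set) following S401 to S406.

    word_vectors: {word: vector} for the text being checked.
    dictionary:   {sensitive word: vector}, standing in for the dynamic library.
    """
    # S401: exact matching against the dictionary.
    first_set = {w for w in word_vectors if w in dictionary}
    # S402: the words that remain form the first data D.
    first_data = {w: v for w, v in word_vectors.items() if w not in first_set}
    # S403 to S405: keep words whose best similarity exceeds alpha.
    second_set = set()
    for word, vec in first_data.items():
        best = max((similarity(vec, d, lam) for d in dictionary.values()),
                   default=0.0)
        if best > alpha:
            second_set.add(word)
    # S406: merge both sets into the third set.
    return first_set, second_set, first_set | second_set
```

With the default α of 0.6, a word whose vector lies close to a dictionary entry is caught in the second stage even though it is not an exact match, while an unrelated word is left untouched.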

S104, generating filtered document data;

It should be noted that the sensitive word filtering method provided by the present invention needs to generate a sensitive word dynamic dictionary library before filtering. As a preferred example, the method for generating the sensitive word dynamic dictionary library is shown in FIG. 5 and includes the following steps:

S501, preprocessing a set of confidential documents in the aerospace field to obtain a first dictionary alternative text;

For the processing in this step, refer to S101; details are not repeated.

S502, establishing a mapping between word lists and categories according to the first dictionary alternative text, and counting the word frequency of all words in each category and the total number of words in each category according to the first dictionary alternative text and the category mapping;

As a preferred example, the mapping relationship can be expressed as: [category 1, category 2, category 3, ...] corresponding to [[word list 1], [word list 2], [word list 3], ...], that is, word list 1 belongs to category 1, word list 2 belongs to category 2, and so on.

S503, extracting the features of the first dictionary alternative text to obtain a second dictionary alternative text;

S504, determining the sensitive word classification, and constructing a sensitive word classifier according to the sensitive word classification and the second dictionary alternative text;

S505, determining a sensitive word dynamic dictionary library according to the second dictionary alternative text and the sensitive word classifier.

After steps S501 to S505, a dynamic dictionary library of sensitive words in the aerospace field is constructed. Based on ontology knowledge in the aerospace field, the sensitive word dynamic dictionary library is built through text word segmentation, establishment of the mapping between ontology data word lists and categories, word frequency statistics, text feature extraction with a selected strategy, construction of the sensitive word classifier, and similar operations.
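The category mapping and word frequency statistics of S502 can be illustrated with a small sketch. The category names and words are hypothetical, and the dictionary-based representation is one possible encoding of the mapping, not necessarily the one used by the applicant.

```python
from collections import Counter


def build_category_stats(category_to_words):
    """Count word frequencies per category and total words per category.

    category_to_words: {category: [word, word, ...]}, one word list per category.
    """
    frequencies = {cat: Counter(words) for cat, words in category_to_words.items()}
    totals = {cat: len(words) for cat, words in category_to_words.items()}
    return frequencies, totals


# Hypothetical categories and word lists, for illustration only.
mapping = {
    "propulsion": ["thrust", "nozzle", "thrust"],
    "guidance": ["gyro"],
}
frequencies, totals = build_category_stats(mapping)
print(frequencies["propulsion"]["thrust"], totals["propulsion"])  # 2 3
```

These two statistics are the usual inputs for a frequency-based text classifier, which may be why they are collected before the classifier is built in S504.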

In practical use, as missions in the aerospace field continue to increase, the coverage of sensitive words expands, and the sensitive word dynamic dictionary library needs to be updated continuously. The update process of the sensitive word dynamic dictionary library is shown in FIG. 6 and includes S601 to S604:

S601, adding the newly added document into a data set;

S602, preprocessing the documents in the data set to obtain a first updated text;

S603, extracting the features of the first updated text to obtain a first updated text feature vector;

S604, using a sensitive word filter to perform information filtering on each candidate word in the first updated text feature vector; if the candidate word matches the filtering feature, checking whether it already exists in the existing sensitive word dynamic dictionary library, and if it does not exist, adding it to the sensitive word dynamic dictionary.

Through text preprocessing, text feature extraction and similar operations on the updated data set, the processed text data are classified by the sensitive word filter, and the sensitive word data set in the dictionary is dynamically expanded through S601 to S604.
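The update logic of S604 reduces to a filter check followed by a duplicate check. A minimal sketch follows, in which the dictionary is modeled as a set of words and the sensitive word filter as an arbitrary callable; both choices, and the name `update_dictionary`, are assumptions for illustration.

```python
def update_dictionary(dictionary, candidate_words, is_sensitive):
    """Add candidates that pass the filter and are not yet in the dictionary.

    dictionary:      set of known sensitive words (modified in place).
    candidate_words: words taken from the newly preprocessed documents.
    is_sensitive:    callable standing in for the sensitive word filter.
    Returns the words that were added, in input order.
    """
    added = []
    for word in candidate_words:
        if not is_sensitive(word):
            continue  # does not match the filtering feature
        if word in dictionary:
            continue  # duplicate check against the existing library
        dictionary.add(word)
        added.append(word)
    return added
```

Returning the list of added words makes it easy to log or review what each batch of new documents contributed to the dictionary.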

The method can construct the dynamic dictionary library based on ontology knowledge in the aerospace field, which compensates for the fact that sensitive words cannot be exhaustively enumerated, and filters sensitive word information in a hybrid mode based on the dynamic dictionary library, thereby markedly improving the precision and accuracy of sensitive word filtering in confidential documents in the aerospace field and effectively reducing the labor and time cost of sensitive word filtering and review.

Example two

Based on the same inventive concept, an embodiment of the present invention further provides a sensitive word filtering apparatus. As shown in FIG. 7, the apparatus includes:

the preprocessing module 701 is used for extracting information of a text, judging the text type, segmenting words according to the text type, deleting punctuations after segmentation, and deleting stop words after segmentation;

a text feature extraction module 702, configured to extract word surface features and semantic features of the text;

the information filtering module 703 is configured to filter the text and delete the sensitive word;

the dynamic dictionary library module 704 is used for recording a sensitive word set in the aerospace field;

and the word adding updating sub-module 705 is used for updating the sensitive words in the dynamic dictionary library.

It should be noted that the preprocessing module 701 provided in this embodiment can implement all functions of the preprocessing method shown in FIG. 2; the text feature extraction module 702 can implement all functions of the feature extraction shown in FIG. 3; the information filtering module 703 can implement all functions of the sensitive word filtering method shown in FIG. 4; the dynamic dictionary library module 704 can implement all functions related to the sensitive word dynamic dictionary library in the first embodiment, including the construction method shown in FIG. 5; and the word adding and updating sub-module 705 can implement all functions of updating the sensitive word dynamic dictionary library shown in FIG. 6. Each module solves the same technical problem and achieves the same technical effect as the corresponding method, and details are not repeated here.

It should be noted that the apparatus provided in the second embodiment and the method provided in the first embodiment belong to the same inventive concept, solve the same technical problem, and achieve the same technical effect, and the apparatus provided in the second embodiment can implement all the methods of the first embodiment, and the same parts are not described again.
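One way to picture how the modules of the apparatus fit together is a small class that holds the dictionary and delegates to pluggable preprocessing and feature extraction callables. This is a structural sketch only: the class name, the callable interfaces and the `***` mask are assumptions, and feature extraction is reduced to a pass-through so that the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class SensitiveWordFilterDevice:
    """Wires the modules together; each module is a plain callable here."""
    preprocess: Callable[[str], List[str]]               # module 701
    extract_features: Callable[[List[str]], List[str]]   # module 702
    dictionary: Set[str] = field(default_factory=set)    # module 704

    def filter(self, text: str, mask: str = "***") -> List[str]:
        """Module 703: mask every token found in the dictionary."""
        tokens = self.extract_features(self.preprocess(text))
        return [mask if t in self.dictionary else t for t in tokens]

    def add_words(self, words: Iterable[str]) -> None:
        """Sub-module 705: extend the dynamic dictionary."""
        self.dictionary.update(words)


device = SensitiveWordFilterDevice(preprocess=str.split,
                                   extract_features=lambda tokens: tokens)
device.add_words(["thrust"])
print(device.filter("peak thrust value"))  # ['peak', '***', 'value']
```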

It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A sensitive word filtering method, comprising:

preprocessing the text;

and generating filtered document data.

2. The method of claim 1, wherein preprocessing the text comprises:

extracting text information;

judging the text type;

deleting punctuation marks in the text;

and deleting tags.

3. The method according to claim 1, wherein the performing feature extraction on the preprocessed text to form text feature data comprises:

and performing serial feature fusion on the surface features of the text and the semantic features to form feature-fused text word vectors.

4. The method of claim 1, wherein the sensitive word filtering the text feature data according to a sensitive word dynamic dictionary library comprises:

5. The method according to one of claims 1 to 4, characterized in that the sensitive word dynamic dictionary repository is determined according to the following method:

6. The method of claim 5, wherein preprocessing the set of aerospace domain secret-related documents comprises:

extracting text information;

judging the text type;

deleting punctuation marks in the text;

and deleting tags.

7. The method of claim 5, wherein determining a sensitive word dynamic dictionary based on the third dictionary alternative text and the sensitive word classifier comprises:

8. The method of claim 1, further comprising:

and updating the sensitive word dynamic dictionary.

9. The method of claim 8, wherein the updating the sensitive word dynamic dictionary comprises:

adding the newly added document into the data set;

preprocessing the documents in the data set to obtain a first updated text;

10. A sensitive word filtering device, comprising: