CN117113988A

CN117113988A - NLP-based sensitive vocabulary shielding method and system

Info

Publication number: CN117113988A
Application number: CN202311068514.6A
Authority: CN
Inventors: 陈竑; 韩三普
Original assignee: Beijing Shenwei Zhixin Technology Co ltd
Current assignee: Beijing Shenwei Zhixin Technology Co ltd
Priority date: 2023-08-23
Filing date: 2023-08-23
Publication date: 2023-11-24
Anticipated expiration: 2043-08-23
Also published as: CN117113988B

Abstract

The invention belongs to the technical field of sensitive vocabulary shielding, and discloses a sensitive vocabulary shielding method and system based on NLP. The method comprises the following steps: constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus; constructing a sensitive vocabulary recognition model by using an NLP algorithm; extracting text from a file to be analyzed; performing word segmentation on the text to be analyzed by using a word segmentation algorithm; inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition; according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, and if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol; and acquiring the text after the sensitive vocabulary shielding, and loading the text after the sensitive vocabulary shielding to a file to be analyzed to obtain the file after the sensitive vocabulary shielding. The invention solves the problems of low accuracy, low efficiency and low practicability of shielding sensitive words in the prior art.

Description

NLP-based sensitive vocabulary shielding method and system

Technical Field

The invention belongs to the technical field of sensitive vocabulary shielding, and particularly relates to a sensitive vocabulary shielding method and system based on NLP.

Background

To meet regulatory and security requirements, certain sensitive words in files circulated on the internet need to be shielded, such as private information (users' names, identity card numbers or mobile phone numbers), sensitive information such as inappropriate words, and business information (an enterprise's name, information, core technologies or employee circumstances).

The prior art has the defects that:

1) Existing sensitive vocabulary shielding algorithms can recognize only certain key sensitive words and cannot recognize their pinyin, visually similar forms, similar-sounding forms or synonyms, so the accuracy of sensitive vocabulary shielding is low;

2) Existing sensitive vocabulary shielding algorithms recognize sensitive words by matching character strings against a sensitive word library. This approach is inefficient and can shield sensitive vocabulary only in plain text files; it cannot perform text recognition and sensitive vocabulary shielding on image files or video files, so its practicability is low.

Disclosure of Invention

In order to solve the problems of low accuracy, low efficiency and low practicability of shielding sensitive vocabulary in the prior art, the invention aims to provide a sensitive vocabulary shielding method and system based on NLP.

The technical scheme adopted by the invention is as follows:

a sensitive vocabulary shielding method based on NLP comprises the following steps:

constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus;

according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, constructing a sensitive vocabulary recognition model by using an NLP algorithm;

extracting text from the file to be analyzed to obtain the text to be analyzed;

performing word segmentation on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed;

inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, so as to obtain sensitive vocabulary of the word sequence to be analyzed;

according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed;

and according to the word sequence after the sensitive word shielding, obtaining a text after the sensitive word shielding, and loading the text after the sensitive word shielding into a file to be analyzed to obtain the file after the sensitive word shielding.

Further, a sensitive vocabulary corpus and a non-sensitive vocabulary corpus are constructed, comprising the following steps:

capturing a plurality of known Chinese or English sensitive words from the Internet by using a crawler tool;

capturing the pinyin, synonyms and homonyms of the plurality of known sensitive words;

carrying out data compression processing, data noise reduction processing and data cleaning processing on the plurality of known sensitive words and their pinyin, synonyms and homonyms to obtain a plurality of processed known sensitive words and their pinyin, synonyms and homonyms;

constructing a sensitive vocabulary corpus from the processed known sensitive words and their pinyin, synonyms and homonyms;

collecting a plurality of Chinese or English universal words from the Internet by using a crawler tool;

according to the sensitive vocabulary corpus, removing the known sensitive words, and their pinyin, synonyms and homonyms, that are mixed into the plurality of universal words, to obtain a plurality of non-sensitive words;

carrying out data compression processing, data noise reduction processing and data cleaning processing on the plurality of non-sensitive words to obtain a plurality of processed non-sensitive words;

and constructing a non-sensitive vocabulary corpus according to the processed non-sensitive vocabularies.
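The corpus construction steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the crawler is replaced by in-memory lists, the `variants` mapping (word to its pinyin, synonyms and homonyms) is a hypothetical input, and only the data cleaning step (whitespace stripping and de-duplication) is shown; data compression and noise reduction are not modelled.

```python
def clean(words):
    """Strip whitespace, drop empty strings and remove duplicates, keeping order."""
    seen, out = set(), []
    for w in words:
        w = w.strip().lower()
        if w and w not in seen:
            seen.add(w)
            out.append(w)
    return out


def build_corpora(known_sensitive, variants, universal_words):
    """Return (sensitive_corpus, non_sensitive_corpus) as ordered lists.

    `variants` is a hypothetical mapping from a known sensitive word to its
    pinyin, synonyms and homonyms, standing in for the crawled data.
    """
    sensitive = []
    for w in known_sensitive:
        sensitive.append(w)
        sensitive.extend(variants.get(w, []))
    sensitive = clean(sensitive)

    # Reject every universal word that already appears in the sensitive corpus.
    blocked = set(sensitive)
    non_sensitive = [w for w in clean(universal_words) if w not in blocked]
    return sensitive, non_sensitive
```

For example, `build_corpora(["机密"], {"机密": ["jimi", "秘密"]}, ["天气", "jimi"])` keeps "jimi" in the sensitive corpus and removes it from the non-sensitive one.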

Further, according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, using an NLP algorithm to construct a sensitive vocabulary recognition model, comprising the following steps:

randomly extracting known sensitive words and their pinyin, synonyms and homonyms from the sensitive vocabulary corpus, and non-sensitive words from the non-sensitive vocabulary corpus, to form a plurality of training text data;

constructing an initial sensitive vocabulary recognition model by using a BERT-BILSTM-CRF algorithm in an NLP algorithm;

and optimizing network parameters of the initial sensitive vocabulary recognition model by using an IWOA optimizing algorithm, and inputting a plurality of training text data for optimization training to obtain the optimal sensitive vocabulary recognition model.

Further, the sensitive vocabulary recognition model comprises an input layer, a vector characterization layer provided with a BERT pre-training language sub-model, a BILSTM layer, a feature fusion layer, a CRF layer and an output layer which are connected in sequence;

introducing a Circle chaotic sequence initialization and dynamic reverse learning strategy to improve the traditional WOA optimizing algorithm to obtain an IWOA optimizing algorithm;

the formula for initializing the Circle chaotic sequence is as follows:

wherein $x_{i+1,j+1}$ is the initial position of the whale population generated by the Circle chaotic map; $x_{i,j}$ is the initial position of a randomly generated whale population; $\mathrm{mod}(\cdot)$ is the modulo function; i is the whale individual index; j is the dimension index;

the formula of the dynamic reverse learning strategy is:

$$x'_{ij}(t) = k\,\bigl(a_j(t) + b_j(t)\bigr) - x_{ij}(t)$$

wherein $x'_{ij}(t)$ and $x_{ij}(t)$ are the reverse and forward positions of the i-th whale individual in the j-th dimension, respectively; $a_j(t)$ and $b_j(t)$ are the upper and lower bounds, respectively, of the j-th dimension of the current whale population; k is a decreasing inertia factor, $k = 0.9 - 0.5\,D/D_{\max}$; D and $D_{\max}$ are the current iteration number and the maximum iteration number, respectively; t is the time index.
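The two IWOA ingredients can be illustrated with a short sketch. The Circle chaotic formula itself is not reproduced in the text, so the map below uses a form that is common in the chaotic-initialization literature, x ← mod(x + 0.2 − (0.5/2π)·sin(2πx), 1); the constants 0.2 and 0.5, and the choice to iterate the map once per whale, are assumptions rather than values taken from this document. The reverse-learning function follows the formula x' = k(a + b) − x given above.

```python
import math
import random


def circle_map(x, a=0.2, b=0.5):
    """One step of a Circle chaotic map on [0, 1); a and b are assumed constants."""
    return (x + a - (b / (2 * math.pi)) * math.sin(2 * math.pi * x)) % 1.0


def circle_init(pop_size, lower, upper, seed=0):
    """Generate pop_size positions inside [lower, upper] from a chaotic sequence."""
    rng = random.Random(seed)
    chaos = [rng.random() for _ in lower]  # random starting point per dimension
    population = []
    for _ in range(pop_size):
        chaos = [circle_map(c) for c in chaos]
        population.append(
            [lo + c * (hi - lo) for c, lo, hi in zip(chaos, lower, upper)]
        )
    return population


def inertia(d, d_max):
    """Decreasing inertia factor k = 0.9 - 0.5 * D / D_max."""
    return 0.9 - 0.5 * d / d_max


def reverse_position(x, a, b, k):
    """Dynamic reverse solution x'_j = k * (a_j + b_j) - x_j for one whale.

    a and b are the upper and lower bounds of each dimension of the current
    population, not the fixed bounds of the search space.
    """
    return [k * (aj + bj) - xj for xj, aj, bj in zip(x, a, b)]
```

With k shrinking from 0.9 to 0.4 over the run, the reverse solutions are pulled progressively toward smaller magnitudes as the iterations proceed.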

Further, the network parameters of the initial sensitive vocabulary recognition model are optimized by using an IWOA optimizing algorithm, and the method comprises the following steps:

taking the number of hidden layer neurons of the BILSTM layer, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate as optimization targets, namely the positions of whale individuals of the IWOA population;

initializing parameters of an IWOA optimizing algorithm, and initializing an IWOA population by using a Circle chaotic sequence;

calculating the fitness value of each whale individual in the IWOA population;

performing encircling of prey, bubble-net attack or search for prey, and updating the whale individuals and the IWOA population;

dynamically and reversely learning the updated IWOA population to obtain reverse solutions corresponding to each forward solution in the IWOA population, and screening optimal whale individuals and optimal fitness values thereof according to all forward solutions and fitness values of whale individuals of all reverse solutions in the IWOA population;

if the optimal fitness value meets the requirement or the iteration number meets the requirement, outputting the position of a global optimal solution corresponding to the optimal whale individual, namely the number of hidden layer neurons, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate of the BILSTM layer, and if not, repeatedly updating whale individual and IWOA population.
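The optimization loop described above can be sketched on a toy problem. This is an illustration under several assumptions: the fitness function is supplied by the caller and is minimised (in the method it would score a BILSTM configuration, for example by validation loss), the position updates follow the standard WOA equations, the Circle map constants are assumed, the loop stops on iteration count only, and the screening step keeps the best `pop_size` individuals among forward and reverse solutions.

```python
import math
import random


def iwoa(fitness, lower, upper, pop_size=20, max_iter=100, seed=0):
    """Minimise `fitness` over the box [lower, upper]; returns (position, value)."""
    rng = random.Random(seed)
    dim = len(lower)

    def clip(x):
        return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]

    # Circle chaotic initialisation (map constants 0.2 and 0.5 are assumptions).
    chaos = [rng.random() for _ in range(dim)]
    pop = []
    for _ in range(pop_size):
        chaos = [(c + 0.2 - 0.5 / (2 * math.pi) * math.sin(2 * math.pi * c)) % 1.0
                 for c in chaos]
        pop.append([lo + c * (hi - lo) for c, lo, hi in zip(chaos, lower, upper)])

    best = min(pop, key=fitness)
    for it in range(max_iter):
        a = 2.0 * (1 - it / max_iter)  # decreases linearly from 2 towards 0
        moved = []
        for x in pop:
            A = 2 * a * rng.random() - a
            C = 2 * rng.random()
            if rng.random() < 0.5:
                # |A| < 1: encircle the best whale; otherwise search near a random whale.
                ref = best if abs(A) < 1 else rng.choice(pop)
                x = [r - A * abs(C * r - v) for r, v in zip(ref, x)]
            else:
                # Bubble-net attack: spiral towards the best whale.
                ell = rng.uniform(-1, 1)
                x = [abs(b - v) * math.exp(ell) * math.cos(2 * math.pi * ell) + b
                     for b, v in zip(best, x)]
            moved.append(clip(x))

        # Dynamic reverse learning: x' = k * (a_j + b_j) - x with population bounds.
        k = 0.9 - 0.5 * it / max_iter
        upper_j = [max(p[j] for p in moved) for j in range(dim)]
        lower_j = [min(p[j] for p in moved) for j in range(dim)]
        reverse = [clip([k * (upper_j[j] + lower_j[j]) - p[j] for j in range(dim)])
                   for p in moved]

        # Screen forward and reverse solutions together, keep the fittest.
        pop = sorted(moved + reverse, key=fitness)[:pop_size]
        if fitness(pop[0]) < fitness(best):
            best = pop[0]
    return best, fitness(best)
```

A call such as `iwoa(lambda p: sum(v * v for v in p), [-5.0, -5.0], [5.0, 5.0])` searches a two-dimensional box; for the model, each dimension would instead encode a quantity such as the number of hidden-layer neurons or the initial learning rate.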

Further, text extraction is carried out on the file to be analyzed to obtain a text to be analyzed, and the method comprises the following steps:

receiving a file to be analyzed, and analyzing the file name suffix of the file to be analyzed to obtain the data format of the file to be analyzed;

if the data format of the file to be analyzed is a text format, extracting the text of the file to be analyzed in the text format to obtain a text to be analyzed corresponding to the file to be analyzed in the text format;

if the data format of the file to be analyzed is a picture format, text extraction is carried out on the file to be analyzed in the picture format by using a picture-text recognition model, and a text to be analyzed corresponding to the file to be analyzed in the picture format is obtained;

if the data format of the file to be analyzed is a video format, frame interception is carried out on the file to be analyzed in the video format to obtain continuous frames of images to be analyzed, text extraction is carried out on the continuous frames of images to be analyzed by using a picture-text recognition model to obtain a plurality of original texts to be analyzed, and text combination and de-duplication processing are carried out on the plurality of original texts to be analyzed to obtain texts to be analyzed corresponding to the file to be analyzed in the video format.
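The format-dependent extraction above can be sketched as a small dispatcher. The suffix sets are illustrative, and the three callables (`read_text`, `ocr_image`, `grab_frames`) are hypothetical stand-ins for a file reader, the picture-text recognition model and a frame extractor; none of them is specified in this document.

```python
from pathlib import Path

# Illustrative suffix sets; a real deployment would extend these.
TEXT_SUFFIXES = {".txt", ".md", ".csv"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}
VIDEO_SUFFIXES = {".mp4", ".avi", ".mov"}


def detect_format(filename):
    """Map the file name suffix to 'text', 'image', 'video' or 'unknown'."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXT_SUFFIXES:
        return "text"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix in VIDEO_SUFFIXES:
        return "video"
    return "unknown"


def merge_frame_texts(frame_texts):
    """Merge per-frame texts, dropping empty and repeated lines (first one kept)."""
    seen, merged = set(), []
    for text in frame_texts:
        for line in text.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    return "\n".join(merged)


def extract_text(filename, read_text, ocr_image, grab_frames):
    """Dispatch to the extractor that matches the file format.

    read_text, ocr_image and grab_frames are hypothetical callables supplied
    by the caller.
    """
    fmt = detect_format(filename)
    if fmt == "text":
        return read_text(filename)
    if fmt == "image":
        return ocr_image(filename)
    if fmt == "video":
        return merge_frame_texts(ocr_image(f) for f in grab_frames(filename))
    raise ValueError(f"unsupported file format: {filename}")
```

De-duplicating line by line is one simple reading of "text combination and de-duplication"; consecutive video frames usually repeat the same on-screen text, so most lines collapse to a single copy.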

Further, word segmentation is carried out on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed, and the method comprises the following steps:

performing word segmentation on the text to be analyzed by using a pkuseg word segmentation algorithm to obtain a first word sequence to be analyzed;

performing word segmentation on the text to be analyzed by using a jieba word segmentation algorithm to obtain a second word sequence to be analyzed;

performing word segmentation processing on the text to be analyzed by using an ltp word segmentation algorithm to obtain a third word sequence to be analyzed;

performing word segmentation on the text to be analyzed by using a hanlp word segmentation algorithm to obtain a fourth word sequence to be analyzed;

and merging and screening the first word sequence to be analyzed, the second word sequence to be analyzed, the third word sequence to be analyzed and the fourth word sequence to be analyzed to obtain a final word sequence to be analyzed.
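The document does not specify how the four word sequences are merged and screened. One plausible reading, shown here purely as an assumption, is a majority vote on word boundaries: a cut position is kept when most segmenters agree on it. The sketch takes the segmenter outputs as plain lists, so it does not call pkuseg, jieba, ltp or hanlp directly.

```python
from collections import Counter


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    cuts, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def merge_segmentations(text, segmentations, min_votes=None):
    """Merge several segmentations of `text` by voting on boundaries.

    Each segmentation must concatenate back to `text`. A boundary is kept
    when at least `min_votes` segmenters propose it (strict majority by default).
    """
    if min_votes is None:
        min_votes = len(segmentations) // 2 + 1
    votes = Counter()
    for tokens in segmentations:
        if "".join(tokens) != text:
            raise ValueError("segmentation does not reproduce the text")
        votes.update(boundaries(tokens))

    cuts = sorted(p for p, n in votes.items() if n >= min_votes)
    merged, start = [], 0
    for end in cuts:
        merged.append(text[start:end])
        start = end
    return merged
```

Because every segmenter ends its last token at the end of the text, the final boundary always wins the vote and the merged sequence covers the whole text. Segmenters that drop whitespace would need their output aligned to the text before voting.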

Further, the word sequence to be analyzed is input into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, so as to obtain sensitive vocabulary of the word sequence to be analyzed, and the method comprises the following steps:

receiving a word sequence to be analyzed by using an input layer of a sensitive vocabulary recognition model;

converting the plurality of word segments in the word sequence to be analyzed into word vectors by using the vector characterization layer of the sensitive vocabulary recognition model, to obtain a word sequence to be analyzed comprising a plurality of word vectors;

splitting each word in the word sequence to be analyzed into characters and converting them into character vectors by using the vector characterization layer of the sensitive vocabulary recognition model, to obtain a character sequence to be analyzed comprising a plurality of character vectors;

extracting the word-level semantic features of each word vector and the character-level semantic features of each character vector by using the BILSTM layer of the sensitive vocabulary recognition model;

fusing the word-level semantic features of all word vectors and the character-level semantic features of all character vectors by using the feature fusion layer of the sensitive vocabulary recognition model, to obtain a fusion feature sequence;

using a CRF layer of a sensitive vocabulary recognition model, carrying out dependency processing on each word vector in a word sequence to be analyzed according to the fusion feature sequence, and adding a sensitive vocabulary label to obtain a sensitive vocabulary mark word sequence;

and outputting the corresponding sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary labels in the sensitive vocabulary mark word sequence by using an output layer of the sensitive vocabulary recognition model, and recording the position information of the sensitive vocabulary of the word sequence to be analyzed in the sensitive vocabulary mark word sequence.
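The output-layer step, which turns per-token labels into sensitive words plus their positions, can be sketched as follows. The label scheme (`B-SEN`, `I-SEN`, `O`) is an assumption chosen for illustration; the document only says that the CRF layer adds sensitive vocabulary labels.

```python
def extract_sensitive_spans(tokens, tags):
    """Collect (word, start, end) for each labelled span; end is exclusive.

    Assumes BIO-style tags: 'B-SEN' opens a span, 'I-SEN' continues it and
    anything else closes it. An 'I-SEN' with no open span is ignored.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-SEN":
            if start is not None:      # back-to-back spans
                spans.append((start, i))
            start = i
        elif tag == "I-SEN" and start is not None:
            continue                   # still inside the current span
        else:
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:              # span running to the end of the sequence
        spans.append((start, len(tags)))

    return [("".join(tokens[s:e]), s, e) for s, e in spans]
```

Keeping the token offsets alongside each word is what later allows the shielding symbol to be written at the exact position in the word sequence.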

Further, according to the sensitive vocabulary corpus, verifying the sensitive vocabulary of the word sequence to be analyzed, if the verification result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed, and the method comprises the following steps:

inputting each sensitive word of the word sequence to be analyzed into the sensitive vocabulary corpus, and performing similarity matching against the known sensitive words and their pinyin, synonyms and homonyms in the sensitive vocabulary corpus;

if the similarity value between a known sensitive word, or its pinyin, synonym or homonym, and the current sensitive word of the word sequence to be analyzed is greater than a threshold value, entering the next step, otherwise, outputting a verification result of false;

if all the sensitive words of the word sequence to be analyzed have passed verification, outputting a verification result of true, otherwise, inputting the next sensitive word of the word sequence to be analyzed into the sensitive vocabulary corpus for verification;

if the verification result is true, replacing the sensitive vocabulary at the corresponding position in the word sequence to be analyzed by using a shielding symbol according to the position information of the sensitive vocabulary in the word sequence marked by the sensitive vocabulary, so as to obtain a word sequence after the sensitive vocabulary is shielded, otherwise, carrying out sensitive vocabulary recognition again on the word sequence to be analyzed.
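The verification and shielding steps can be sketched as below. The similarity measure is not specified in this document; the sketch assumes the character-level ratio from Python's `difflib.SequenceMatcher`, and the threshold of 0.8 and the `*` shielding symbol are likewise illustrative. The re-recognition branch taken on a false result is left to the caller.

```python
from difflib import SequenceMatcher


def best_similarity(word, corpus):
    """Highest similarity ratio between `word` and any corpus entry (0.0 if empty)."""
    return max(
        (SequenceMatcher(None, word, entry).ratio() for entry in corpus),
        default=0.0,
    )


def verify(candidates, corpus, threshold=0.8):
    """True only if every candidate word exceeds the threshold against the corpus.

    `candidates` is a list of (word, start, end) tuples; the threshold value
    is an assumption.
    """
    return all(best_similarity(word, corpus) > threshold for word, _, _ in candidates)


def mask(tokens, candidates, symbol="*"):
    """Replace the tokens at each recorded position with shielding symbols."""
    masked = list(tokens)
    for _, start, end in candidates:
        for i in range(start, end):
            masked[i] = symbol * len(tokens[i])
    return masked
```

A typical use is `mask(tokens, candidates) if verify(candidates, corpus) else None`, with `None` signalling that recognition should be run again. Preserving the length of each shielded token keeps character offsets stable when the text is written back to the file.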

The sensitive vocabulary shielding system based on the NLP is used for realizing a sensitive vocabulary shielding method and comprises a corpus construction unit, a sensitive vocabulary recognition model construction unit, a text extraction unit, a word segmentation processing unit, a sensitive vocabulary recognition unit and a sensitive vocabulary verification unit, wherein the corpus construction unit is respectively connected with the sensitive vocabulary recognition model construction unit and the sensitive vocabulary verification unit, the corpus construction unit is connected with an external internet corpus, the sensitive vocabulary recognition model construction unit is connected with the sensitive vocabulary recognition unit, the sensitive vocabulary recognition unit is respectively connected with the word segmentation processing unit and the sensitive vocabulary verification unit, and the word segmentation processing unit is connected with the text extraction unit;

the corpus construction unit is used for capturing a plurality of sensitive words and a plurality of universal words in an external internet corpus, constructing a sensitive word corpus according to the plurality of sensitive words and constructing a non-sensitive word corpus according to the plurality of universal words;

the sensitive vocabulary recognition model building unit is used for calling the sensitive vocabulary corpus and the non-sensitive vocabulary corpus built by the corpus building unit and building a sensitive vocabulary recognition model by using an NLP algorithm;

the text extraction unit is used for receiving the file to be analyzed and extracting the text of the file to be analyzed to obtain the text to be analyzed;

the word segmentation processing unit is used for performing word segmentation processing on the text to be analyzed obtained by the text extraction unit by using a word segmentation algorithm to obtain a word sequence to be analyzed, and sending the word sequence to be analyzed to the sensitive vocabulary recognition unit;

the sensitive vocabulary recognition unit is used for calling the sensitive vocabulary recognition model constructed by the sensitive vocabulary recognition model construction unit, inputting the word sequence to be analyzed sent by the word segmentation processing unit into the sensitive vocabulary recognition model to perform sensitive vocabulary recognition, obtaining sensitive vocabulary of the word sequence to be analyzed, receiving a verification result sent by the sensitive vocabulary verification unit, if the verification result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol, obtaining a word sequence after the sensitive vocabulary is shielded, and obtaining a text after the sensitive vocabulary is shielded according to the word sequence after the sensitive vocabulary is shielded;

the sensitive vocabulary verification unit is used for extracting the sensitive vocabulary of the word sequence to be analyzed, which is obtained by the sensitive vocabulary recognition unit, calling the sensitive vocabulary corpus constructed by the corpus construction unit, verifying the sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary corpus, obtaining a verification result, and sending the verification result to the sensitive vocabulary recognition unit.

The beneficial effects of the invention are as follows:

according to the sensitive vocabulary shielding method and system based on the NLP, the sensitive vocabulary corpus and the non-sensitive vocabulary corpus are built, so that the sensitive vocabulary recognition training samples are enriched, the data support of the sensitive vocabulary is expanded, the sensitive vocabulary recognition model is built by using an NLP algorithm, the sensitive vocabulary recognition training samples are fully learned, the automatic and accurate recognition of the sensitive vocabulary is realized, and the efficiency and accuracy of the subsequent sensitive vocabulary shielding are improved; the sensitive vocabulary recognition model can extract semantic features, and analyze the semantic features in combination with a semantic environment, so that false triggering of sensitive vocabulary is avoided, and the use experience of a user is improved; the method can be applied to files to be analyzed in different data formats, and the practicability of the method is improved; and the sensitive vocabulary is checked by using the sensitive vocabulary corpus, so that the accuracy of sensitive vocabulary shielding is further improved.

Other advantageous effects of the present invention will be further described in the detailed description.

Drawings

FIG. 1 is a flow block diagram of the NLP-based sensitive vocabulary shielding method of the present invention.

FIG. 2 is a block diagram of the NLP-based sensitive vocabulary shielding system of the present invention.

Detailed Description

The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings.

Example 1:

as shown in fig. 1, this embodiment provides a sensitive vocabulary shielding method based on NLP, which includes the following steps:

constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus, comprising the following steps:

the sensitivity of the sensitive vocabulary corpus to sensitive vocabulary is improved, sensitive vocabulary is prevented from escaping shielding by being replaced with its pinyin, synonyms or homonyms, the sensitive vocabulary recognition range is enlarged, and the effectiveness of sensitive vocabulary shielding is improved;

the data compression processing reduces the memory occupied by the sensitive words and their pinyin, synonyms and homonyms, improving the processing efficiency of the hardware; the data noise reduction processing removes noise data from the sensitive words and their pinyin, synonyms and homonyms, improving their authenticity and the accuracy of the subsequent sensitive vocabulary shielding method; the data cleaning processing removes repeated data, reducing the memory used and further improving the processing efficiency of the hardware;

constructing a non-sensitive vocabulary corpus according to the processed non-sensitive vocabularies;

according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, using an NLP algorithm to construct a sensitive vocabulary recognition model, comprising the following steps:

the effectiveness of the sensitive vocabulary recognition model depends critically on the quantity and quality of the training text data; generating a plurality of training text data from the non-sensitive vocabulary corpus and the sensitive vocabulary corpus enriches the number of training samples, and a series of processing steps improves the quality of the training text data, so that the sensitive vocabulary recognition model can fully learn the data characteristics of the training text data and accurately recognize the sensitive vocabulary in text;

constructing an initial sensitive vocabulary recognition model by using the BERT-BILSTM-CRF algorithm among natural language processing (NLP, Natural Language Processing) algorithms;

the sensitive vocabulary recognition model comprises an input layer, a vector characterization layer provided with a Bidirectional Encoder Representations from Transformers (BERT) pre-trained language sub-model, a bi-directional long short-term memory network (BILSTM, Bi-directional Long Short-Term Memory) layer, a feature fusion layer, a linear-chain conditional random field (CRF, Conditional Random Field) layer and an output layer, which are connected in sequence;

the BERT pre-trained language sub-model is built through pre-training; it converts the word segments in a word sequence into word vectors, and the words can further be split into characters and converted into character vectors, realizing vector characterization; the BILSTM layer combines context information to extract the semantic features of the word vectors and of the character vectors; the feature fusion layer fuses these features, which avoids missing sensitive vocabulary because of mixed Chinese and English text, disordered word order or misspelled words; the semantic features also reduce misidentification of sensitive vocabulary, improving the accuracy of the model;

optimizing network parameters of the initial sensitive vocabulary recognition model by using an Improved Whale Optimization Algorithm (IWOA), and inputting a plurality of training text data for optimization training to obtain an optimal sensitive vocabulary recognition model;

introducing Circle chaotic sequence initialization and a dynamic reverse learning strategy to improve the traditional Whale Optimization Algorithm (WOA) to obtain the IWOA optimizing algorithm;

the formula for initializing the Circle chaotic sequence is as follows:

compared with a randomly distributed initial population, an initial population generated by the Circle chaotic sequence mapping has more uniformly distributed initial positions, which enlarges the search range of the whales in space, increases the diversity of population positions, and to a certain extent mitigates the tendency of the algorithm to fall into a local extremum, thereby improving the optimizing efficiency of the algorithm;

the formula of the dynamic reverse learning strategy is:

$$x'_{ij}(t) = k\,\bigl(a_j(t) + b_j(t)\bigr) - x_{ij}(t)$$

wherein $x'_{ij}(t)$ and $x_{ij}(t)$ are the reverse and forward positions of the i-th whale individual in the j-th dimension, respectively; $a_j(t)$ and $b_j(t)$ are the upper and lower bounds, respectively, of the j-th dimension of the current whale population; k is a decreasing inertia factor, $k = 0.9 - 0.5\,D/D_{\max}$; D and $D_{\max}$ are the current iteration number and the maximum iteration number, respectively; t is the time index;

dynamic reverse learning reduces search blind spots and effectively prevents the algorithm from converging prematurely and falling into a local optimum;

the network parameters of the initial sensitive vocabulary recognition model are optimized by using an IWOA optimizing algorithm, and the method comprises the following steps:

calculating the fitness value of each whale individual in the IWOA population;

outputting the position of a global optimal solution corresponding to the optimal whale individual if the optimal fitness value meets the requirement or the iteration number meets the requirement, namely, the number of hidden layer neurons, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate of the BILSTM layer, otherwise, repeatedly updating whale individual and IWOA population;

this solves the problem that the BILSTM is sensitive to the initial values of the network parameters, and improves the training speed and accuracy of the model;

extracting text from a file to be analyzed to obtain the text to be analyzed, wherein the text extraction method comprises the following steps:

if the data format of the file to be analyzed is a video format, carrying out frame interception on the file to be analyzed in the video format to obtain continuous frames of images to be analyzed, carrying out text extraction on the continuous frames of images to be analyzed by using a picture-text recognition model to obtain a plurality of original texts to be analyzed, and carrying out text combination and de-duplication on the plurality of original texts to be analyzed to obtain texts to be analyzed corresponding to the file to be analyzed in the video format;

performing word segmentation on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed, wherein the word sequence to be analyzed comprises the following steps:

combining and screening the first word sequence to be analyzed, the second word sequence to be analyzed, the third word sequence to be analyzed and the fourth word sequence to be analyzed to obtain a final word sequence to be analyzed;

the word segmentation is carried out by adopting a plurality of word segmentation algorithms, so that the situation of wrong segmentation or missed segmentation existing in a single word segmentation algorithm is avoided, and the accuracy of the word segmentation algorithm is improved;

inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, and obtaining sensitive vocabulary of the word sequence to be analyzed, wherein the method comprises the following steps:

using an output layer of the sensitive vocabulary recognition model, outputting sensitive vocabulary of a corresponding word sequence to be analyzed according to sensitive vocabulary tags in the sensitive vocabulary tag word sequence, and recording position information of the sensitive vocabulary of the word sequence to be analyzed in the sensitive vocabulary tag word sequence;

according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed, and the method comprises the following steps:

if the verification result is true, replacing the sensitive vocabulary at the corresponding position in the word sequence to be analyzed by using a shielding symbol according to the position information of the sensitive vocabulary in the word sequence marked by the sensitive vocabulary, so as to obtain a word sequence after shielding the sensitive vocabulary, otherwise, carrying out sensitive vocabulary recognition again on the word sequence to be analyzed;

In this embodiment, after the sensitive-vocabulary-shielded text is obtained, it is loaded back into the file to be analyzed, and a different loading tool is used for each data format. For example, if the file to be analyzed is in a text format, the shielded text directly replaces the text to be analyzed in the file. If the file is in a picture format, an image editing tool is used to add the shielded text, or only the shielding symbols in it, at the corresponding position of the picture. If the file is in a video format, the shielded text is matched against the plurality of original texts to be analyzed to locate the images to be analyzed of the corresponding frames, and the shielded text, or only the shielding symbols in it, is added at the corresponding position of those frames.

Example 2:

As shown in Fig. 2, this embodiment provides an NLP-based sensitive vocabulary shielding system for implementing the sensitive vocabulary shielding method. The system comprises a corpus construction unit, a sensitive vocabulary recognition model construction unit, a text extraction unit, a word segmentation processing unit, a sensitive vocabulary recognition unit and a sensitive vocabulary verification unit, wherein the corpus construction unit is connected to the sensitive vocabulary recognition model construction unit and to the sensitive vocabulary verification unit, the corpus construction unit is connected to an external internet corpus, the sensitive vocabulary recognition model construction unit is connected to the sensitive vocabulary recognition unit, the sensitive vocabulary recognition unit is connected to the word segmentation processing unit and to the sensitive vocabulary verification unit, and the word segmentation processing unit is connected to the text extraction unit;
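The connections between the units can be made concrete with a small Python sketch. Everything in it is a stand-in chosen for illustration: `ShieldingSystem` and `build_recognizer` are hypothetical names, a set lookup replaces the trained recognition model, whitespace splitting replaces a real word segmentation algorithm, and the corpora are passed in directly rather than fetched from an external internet corpus.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


def build_recognizer(
    sensitive: Set[str], non_sensitive: Set[str]
) -> Callable[[List[str]], List[int]]:
    """Model construction unit stand-in: a set lookup replaces the trained model."""
    def recognize(words: List[str]) -> List[int]:
        return [i for i, w in enumerate(words)
                if w in sensitive and w not in non_sensitive]
    return recognize


@dataclass
class ShieldingSystem:
    sensitive: Set[str]      # produced by the corpus construction unit
    non_sensitive: Set[str]  # produced by the corpus construction unit

    def __post_init__(self) -> None:
        # corpus construction unit -> recognition model construction unit
        self.recognize = build_recognizer(self.sensitive, self.non_sensitive)

    def extract_text(self, file_content: str) -> str:  # text extraction unit
        return file_content.strip()

    def segment(self, text: str) -> List[str]:  # word segmentation processing unit
        return text.split()  # assumption: whitespace-delimited text

    def verify(self, words: List[str], positions: List[int]) -> bool:
        # corpus construction unit -> sensitive vocabulary verification unit
        return all(words[p] in self.sensitive for p in positions)

    def process(self, file_content: str) -> Tuple[List[str], List[int], bool]:
        words = self.segment(self.extract_text(file_content))
        positions = self.recognize(words)  # sensitive vocabulary recognition unit
        return words, positions, self.verify(words, positions)
```

The data flow in `process` follows the described connections: text extraction feeds word segmentation, word segmentation feeds recognition, and recognition feeds verification, while both corpora feed model construction and the sensitive corpus also feeds verification.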

The invention is not limited to the alternative embodiments described above; anyone may derive products in various other forms in light of the present invention. The above detailed description should not be construed as limiting the scope of protection of the invention, which is defined by the claims; the description may be used to interpret the claims.

Claims

1. A sensitive vocabulary shielding method based on NLP is characterized in that: the method comprises the following steps:

extracting text from the file to be analyzed to obtain the text to be analyzed;

2. The NLP-based sensitive vocabulary shielding method of claim 1, wherein: constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus, comprising the following steps:

3. The NLP-based sensitive vocabulary shielding method of claim 2, wherein: according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, using an NLP algorithm to construct a sensitive vocabulary recognition model, comprising the following steps:

4. The NLP-based sensitive vocabulary shielding method of claim 3, wherein: the sensitive vocabulary recognition model comprises an input layer, a vector characterization layer provided with a BERT pre-trained language sub-model, a BiLSTM layer, a feature fusion layer, a CRF layer and an output layer which are connected in sequence;

the formula for initializing the Circle chaotic sequence is as follows:

the formula of the dynamic reverse learning strategy is:

$x'_{ij}(t) = k\,(a_j(t) + b_j(t)) - x_{ij}(t)$
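As a worked illustration of the dynamic reverse learning formula, in which the reverse solution equals k times the sum of a_j(t) and b_j(t), minus x_ij(t), the Python sketch below applies it to a small population. The symbol meanings are assumptions, since this excerpt does not define them: i is taken to index whale individuals and j dimensions, a_j(t) and b_j(t) are taken to be the minimum and maximum of dimension j over the current population, and k is taken to be a coefficient supplied by the caller or drawn uniformly from [0, 1). The function name `dynamic_reverse` is hypothetical.

```python
import random
from typing import List, Optional


def dynamic_reverse(
    population: List[List[float]],
    k: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return the reverse solution x'_ij = k * (a_j + b_j) - x_ij for each individual."""
    rng = rng or random.Random()
    dims = len(population[0])
    # Assumed dynamic bounds: per-dimension min and max of the current population.
    a = [min(ind[j] for ind in population) for j in range(dims)]
    b = [max(ind[j] for ind in population) for j in range(dims)]
    reversed_pop = []
    for ind in population:
        coeff = rng.random() if k is None else k  # assumed form of k
        reversed_pop.append([coeff * (a[j] + b[j]) - ind[j] for j in range(dims)])
    return reversed_pop
```

With k = 1 each individual is reflected about the midpoint of the current bounds, so the population `[[1.0, 4.0], [3.0, 8.0]]` maps to `[[3.0, 8.0], [1.0, 4.0]]`.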

5. The NLP-based sensitive vocabulary shielding method of claim 4, wherein: the network parameters of the initial sensitive vocabulary recognition model are optimized by using an IWOA optimizing algorithm, and the method comprises the following steps:

calculating the fitness value of each whale individual in the IWOA population;

6. The NLP-based sensitive vocabulary shielding method of claim 1, wherein: extracting text from a file to be analyzed to obtain the text to be analyzed, wherein the text extraction method comprises the following steps:

7. The NLP-based sensitive vocabulary shielding method of claim 1, wherein: performing word segmentation on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed, comprising the following steps:

8. The NLP-based sensitive vocabulary shielding method of claim 4, wherein: inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, and obtaining sensitive vocabulary of the word sequence to be analyzed, wherein the method comprises the following steps:

9. The NLP-based sensitive vocabulary shielding method of claim 8, wherein: verifying the sensitive vocabulary of the word sequence to be analyzed against the sensitive vocabulary corpus, replacing the sensitive vocabulary with a shielding symbol if the verification result is true to obtain the word sequence after sensitive vocabulary shielding, and otherwise performing sensitive vocabulary recognition on the word sequence to be analyzed again, comprises the following steps:

10. A sensitive vocabulary shielding system based on NLP for implementing the sensitive vocabulary shielding method according to any one of claims 1-9, characterized in that: the system comprises a corpus construction unit, a sensitive vocabulary recognition model construction unit, a text extraction unit, a word segmentation processing unit, a sensitive vocabulary recognition unit and a sensitive vocabulary verification unit, wherein the corpus construction unit is respectively connected with the sensitive vocabulary recognition model construction unit and the sensitive vocabulary verification unit, the corpus construction unit is connected with an external internet corpus, the sensitive vocabulary recognition model construction unit is connected with the sensitive vocabulary recognition unit, the sensitive vocabulary recognition unit is respectively connected with the word segmentation processing unit and the sensitive vocabulary verification unit, and the word segmentation processing unit is connected with the text extraction unit;