CN117113988A - NLP-based sensitive vocabulary shielding method and system - Google Patents

NLP-based sensitive vocabulary shielding method and system

Info

Publication number
CN117113988A
Authority
CN
China
Prior art keywords
analyzed
sensitive
sensitive vocabulary
word
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311068514.6A
Other languages
Chinese (zh)
Other versions
CN117113988B (en)
Inventor
陈竑
韩三普
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenwei Zhixin Technology Co ltd
Original Assignee
Beijing Shenwei Zhixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenwei Zhixin Technology Co ltd
Priority to CN202311068514.6A
Publication of CN117113988A
Application granted
Publication of CN117113988B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of sensitive vocabulary shielding, and discloses a sensitive vocabulary shielding method and system based on NLP. The method comprises the following steps: constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus; constructing a sensitive vocabulary recognition model by using an NLP algorithm; extracting text from a file to be analyzed; performing word segmentation on the text to be analyzed by using a word segmentation algorithm; inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition; according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, and if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol; and acquiring the text after the sensitive vocabulary shielding, and loading the text after the sensitive vocabulary shielding to a file to be analyzed to obtain the file after the sensitive vocabulary shielding. The invention solves the problems of low accuracy, low efficiency and low practicability of shielding sensitive words in the prior art.

Description

NLP-based sensitive vocabulary shielding method and system
Technical Field
The invention belongs to the technical field of sensitive vocabulary shielding, and particularly relates to a sensitive vocabulary shielding method and system based on NLP.
Background
To meet regulatory and security requirements, certain sensitive words in files circulating on the internet need to be shielded. Examples include private information such as user names, identity card numbers or mobile phone numbers; sensitive information such as inappropriate words; and business information such as enterprise names, enterprise information, core technologies or employee circumstances.
The prior art has the following defects:
1) Existing sensitive vocabulary shielding algorithms can recognize only certain key sensitive words; they cannot recognize the pinyin forms, visually similar forms, similar-sounding forms or synonyms of those words, so the accuracy of sensitive vocabulary shielding is low;
2) Existing sensitive vocabulary shielding algorithms recognize sensitive words by string matching against a sensitive word library. This approach is inefficient and works only on plain text files; it cannot recognize text in, or shield sensitive words from, image files or video files, so its practicability is low.
Disclosure of Invention
In order to solve the problems of low accuracy, low efficiency and low practicability of shielding sensitive vocabulary in the prior art, the invention aims to provide a sensitive vocabulary shielding method and system based on NLP.
The technical scheme adopted by the invention is as follows:
a sensitive vocabulary shielding method based on NLP comprises the following steps:
constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus;
according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, constructing a sensitive vocabulary recognition model by using an NLP algorithm;
extracting text from the file to be analyzed to obtain the text to be analyzed;
performing word segmentation on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed;
inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, so as to obtain sensitive vocabulary of the word sequence to be analyzed;
according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed;
and according to the word sequence after the sensitive word shielding, obtaining a text after the sensitive word shielding, and loading the text after the sensitive word shielding into a file to be analyzed to obtain the file after the sensitive word shielding.
Further, a sensitive vocabulary corpus and a non-sensitive vocabulary corpus are constructed, comprising the following steps:
crawling a plurality of known Chinese or English sensitive words from the Internet by using a crawler tool;
crawling the pinyin, near-synonyms and homonyms of the plurality of known sensitive words;
performing data compression, data noise reduction and data cleaning on the known sensitive words and their pinyin, near-synonyms and homonyms, to obtain processed known sensitive words and their processed pinyin, near-synonyms and homonyms;
constructing the sensitive vocabulary corpus from the processed known sensitive words and their pinyin, near-synonyms and homonyms;
collecting a plurality of universal Chinese or English words from the Internet by using a crawler tool;
according to the sensitive vocabulary corpus, removing the known sensitive words, and their pinyin, near-synonyms and homonyms, that are mixed into the universal words, to obtain a plurality of non-sensitive words;
performing data compression, data noise reduction and data cleaning on the non-sensitive words to obtain processed non-sensitive words;
and constructing the non-sensitive vocabulary corpus from the processed non-sensitive words.
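The corpus construction steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the crawling step is replaced by in-memory lists, and the `clean` helper (trim, lowercase, de-duplicate) is only an assumed stand-in for the compression, noise reduction and cleaning steps, which the text does not specify. The names `clean` and `build_corpora` are hypothetical.

```python
def clean(words):
    """Assumed cleaning: trim whitespace, lowercase, drop empties and duplicates."""
    seen, out = set(), []
    for w in words:
        w = w.strip().lower()
        if w and w not in seen:
            seen.add(w)
            out.append(w)
    return out


def build_corpora(sensitive_words, variants, universal_words):
    """Build (sensitive corpus, non-sensitive corpus).

    variants maps a known sensitive word to its pinyin, near-synonyms
    and homonyms (hypothetical input; the text obtains these by crawling).
    """
    sensitive = []
    for w in sensitive_words:
        sensitive.append(w)
        sensitive.extend(variants.get(w, []))
    sensitive = clean(sensitive)
    blocked = set(sensitive)
    # Reject every universal word that already appears in the sensitive corpus.
    non_sensitive = [w for w in clean(universal_words) if w not in blocked]
    return sensitive, non_sensitive
```

For example, with the sensitive word "赌博" and the variants "dubo" and "博彩", a universal word list containing "dubo" loses that entry before the non-sensitive corpus is built.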
Further, according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, using an NLP algorithm to construct a sensitive vocabulary recognition model, comprising the following steps:
randomly extracting known sensitive words, together with their pinyin, near-synonyms and homonyms, from the sensitive vocabulary corpus, and non-sensitive words from the non-sensitive vocabulary corpus, to form a plurality of training text data;
constructing an initial sensitive vocabulary recognition model by using a BERT-BILSTM-CRF algorithm in an NLP algorithm;
and optimizing network parameters of the initial sensitive vocabulary recognition model by using an IWOA optimizing algorithm, and inputting a plurality of training text data for optimization training to obtain the optimal sensitive vocabulary recognition model.
Further, the sensitive vocabulary recognition model comprises an input layer, a vector characterization layer provided with a BERT pre-training language sub-model, a BILSTM layer, a feature fusion layer, a CRF layer and an output layer which are connected in sequence;
introducing Circle chaotic sequence initialization and a dynamic reverse learning strategy to improve the traditional WOA optimizing algorithm, thereby obtaining the IWOA optimizing algorithm;
the formula for initializing the Circle chaotic sequence is as follows:
wherein $x_{i+1,j+1}$ is the initial position of the whale population generated by the Circle chaotic map; $x_{i,j}$ is the initial position of a randomly generated whale population; $\operatorname{mod}(\cdot)$ is the modulo function; $i$ is the whale individual index; $j$ is the dimension index;
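The Circle map formula referred to here is not reproduced in this text. For orientation only, a commonly cited form of the Circle chaotic map, written with the symbols defined above, is shown below; the constants 0.2 and 0.5 are assumptions taken from that common form, not values confirmed by this document.

$$x_{i+1,j+1} = \operatorname{mod}\!\left(x_{i,j} + 0.2 - \frac{0.5}{2\pi}\sin\!\left(2\pi\,x_{i,j}\right),\ 1\right)$$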
the formula of the dynamic reverse learning strategy is:
$$x'_{ij}(t) = k\,\bigl(a_j(t) + b_j(t)\bigr) - x_{ij}(t)$$
wherein $x'_{ij}(t)$ and $x_{ij}(t)$ are the reverse and forward positions, respectively, of the $i$-th whale individual in the $j$-th dimension; $a_j(t)$ and $b_j(t)$ are the upper and lower bounds, respectively, of the current whale population in the $j$-th dimension; $k$ is a decreasing inertia factor, $k = 0.9 - 0.5\,D/D_{\max}$; $D$ and $D_{\max}$ are the current iteration number and the maximum iteration number, respectively; $t$ is the time index.
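The dynamic reverse learning formula can be sketched in a few lines. This is a minimal illustration under one assumption: the bounds $a_j(t)$ and $b_j(t)$ are taken as the per-dimension maximum and minimum of the current population. The function names are hypothetical.

```python
def inertia_factor(d, d_max):
    """Decreasing inertia factor k = 0.9 - 0.5 * D / D_max."""
    return 0.9 - 0.5 * d / d_max


def reverse_population(population, d, d_max):
    """Return the reverse position x' = k * (a_j + b_j) - x for every whale."""
    k = inertia_factor(d, d_max)
    dims = len(population[0])
    # a_j and b_j: bounds of the current population in dimension j (assumed).
    upper = [max(ind[j] for ind in population) for j in range(dims)]
    lower = [min(ind[j] for ind in population) for j in range(dims)]
    return [
        [k * (upper[j] + lower[j]) - ind[j] for j in range(dims)]
        for ind in population
    ]
```

Because $k$ falls from 0.9 at the first iteration to 0.4 at the last, the reverse positions are drawn progressively closer to the origin-scaled centre of the current bounds as the search proceeds.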
Further, the network parameters of the initial sensitive vocabulary recognition model are optimized by using an IWOA optimizing algorithm, and the method comprises the following steps:
taking the number of hidden layer neurons of the BILSTM layer, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate as optimization targets, namely the positions of whale individuals of the IWOA population;
initializing parameters of an IWOA optimizing algorithm, and initializing an IWOA population by using a Circle chaotic sequence;
calculating the fitness value of each whale individual in the IWOA population;
performing prey encirclement, a bubble-net attack or a search for prey, and updating the whale individuals and the IWOA population;
dynamically and reversely learning the updated IWOA population to obtain reverse solutions corresponding to each forward solution in the IWOA population, and screening optimal whale individuals and optimal fitness values thereof according to all forward solutions and fitness values of whale individuals of all reverse solutions in the IWOA population;
if the optimal fitness value meets the requirement or the iteration number meets the requirement, outputting the position of a global optimal solution corresponding to the optimal whale individual, namely the number of hidden layer neurons, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate of the BILSTM layer, and if not, repeatedly updating whale individual and IWOA population.
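The optimization loop described above can be sketched as follows. This is a simplified illustration, not the patented procedure: the coefficients `A` and `C` are scalars per whale, the spiral constant is fixed at 1, the selection rule (keep the fitter half of forward plus reverse solutions) is an assumption, and the toy `sphere` function stands in for the real fitness, which would be derived from training the recognition model. The initial population is passed in, so any initializer can be used.

```python
import math
import random


def sphere(x):
    """Toy fitness to minimise; a stand-in for the model's training loss."""
    return sum(v * v for v in x)


def iwoa_minimise(fitness, population, max_iter, rng, tol=None):
    """Minimise `fitness` from an initial `population` (list of position lists)."""
    dims = len(population[0])
    best = min(population, key=fitness)[:]
    for it in range(max_iter):
        a = 2.0 - 2.0 * it / max_iter            # shrinks linearly from 2 towards 0
        updated = []
        for x in population:
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # |A| < 1: encircle the best whale; otherwise search near a random whale
                ref = best if abs(A) < 1.0 else rng.choice(population)
                new = [ref[j] - A * abs(C * ref[j] - x[j]) for j in range(dims)]
            else:
                # bubble-net attack: logarithmic spiral towards the best whale
                l = rng.uniform(-1.0, 1.0)
                new = [
                    abs(best[j] - x[j]) * math.exp(l) * math.cos(2 * math.pi * l) + best[j]
                    for j in range(dims)
                ]
            updated.append(new)
        # dynamic reverse learning on the updated population
        k = 0.9 - 0.5 * it / max_iter
        hi = [max(x[j] for x in updated) for j in range(dims)]
        lo = [min(x[j] for x in updated) for j in range(dims)]
        reverse = [[k * (hi[j] + lo[j]) - x[j] for j in range(dims)] for x in updated]
        # assumed selection rule: keep the fitter half of forward + reverse solutions
        population = sorted(updated + reverse, key=fitness)[: len(updated)]
        if fitness(population[0]) < fitness(best):
            best = population[0][:]
        if tol is not None and fitness(best) <= tol:
            break                                 # fitness requirement met
    return best, fitness(best)
```

In the setting of this document, each position vector would encode the number of BILSTM hidden-layer neurons, their initial weights and thresholds, and the initial learning rate; how those are encoded and scored is not specified here.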
Further, text extraction is carried out on the file to be analyzed to obtain a text to be analyzed, and the method comprises the following steps:
receiving a file to be analyzed, and analyzing the file name suffix of the file to be analyzed to obtain the data format of the file to be analyzed;
if the data format of the file to be analyzed is a text format, extracting the text of the file to be analyzed in the text format to obtain a text to be analyzed corresponding to the file to be analyzed in the text format;
if the data format of the file to be analyzed is a picture format, text extraction is carried out on the file to be analyzed in the picture format by using a picture-text recognition model, and a text to be analyzed corresponding to the file to be analyzed in the picture format is obtained;
if the data format of the file to be analyzed is a video format, frame interception is carried out on the file to be analyzed in the video format to obtain continuous frames of images to be analyzed, text extraction is carried out on the continuous frames of images to be analyzed by using a picture-text recognition model to obtain a plurality of original texts to be analyzed, and text combination and de-duplication processing are carried out on the plurality of original texts to be analyzed to obtain texts to be analyzed corresponding to the file to be analyzed in the video format.
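The format dispatch and the merge-and-de-duplicate step for video frames can be sketched as below. The suffix sets are illustrative assumptions, the function names are hypothetical, and the picture-text recognition model itself is out of scope: `merge_frame_texts` simply takes the per-frame texts such a model would return.

```python
from pathlib import Path

# Assumed suffix groups; the text does not enumerate the supported formats.
TEXT_SUFFIXES = {".txt", ".md", ".csv"}
PICTURE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}
VIDEO_SUFFIXES = {".mp4", ".avi", ".mov"}


def detect_format(filename):
    """Classify a file as 'text', 'picture' or 'video' from its name suffix."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXT_SUFFIXES:
        return "text"
    if suffix in PICTURE_SUFFIXES:
        return "picture"
    if suffix in VIDEO_SUFFIXES:
        return "video"
    raise ValueError(f"unsupported file suffix: {suffix!r}")


def merge_frame_texts(frame_texts):
    """Combine per-frame texts, dropping empty and repeated lines in order."""
    seen, merged = set(), []
    for text in frame_texts:
        for line in text.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    return "\n".join(merged)
```

De-duplicating by line is one simple choice; consecutive video frames usually repeat the same on-screen text, so most of the recognized lines collapse to a single copy.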
Further, word segmentation is carried out on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed, and the method comprises the following steps:
performing word segmentation on the text to be analyzed by using a pkuseg word segmentation algorithm to obtain a first word sequence to be analyzed;
performing word segmentation on the text to be analyzed by using a jieba word segmentation algorithm to obtain a second word sequence to be analyzed;
performing word segmentation processing on the text to be analyzed by using an ltp word segmentation algorithm to obtain a third word sequence to be analyzed;
performing word segmentation on the text to be analyzed by using a hanlp word segmentation algorithm to obtain a fourth word sequence to be analyzed;
and merging and screening the first word sequence to be analyzed, the second word sequence to be analyzed, the third word sequence to be analyzed and the fourth word sequence to be analyzed to obtain a final word sequence to be analyzed.
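The text does not say how the four word sequences are merged and screened. One plausible rule, shown here purely as an assumption, is boundary voting: a cut position is kept when at least `min_votes` segmenters agree on it. The sketch takes already-computed segmentations as input, so it does not depend on the pkuseg, jieba, ltp or hanlp packages; the function names are hypothetical.

```python
from collections import Counter


def boundaries(tokens):
    """Return the set of character offsets at which a segmentation cuts."""
    pos, cuts = 0, set()
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def merge_segmentations(text, segmentations, min_votes=2):
    """Keep cut positions proposed by at least `min_votes` segmentations."""
    votes = Counter()
    for tokens in segmentations:
        if "".join(tokens) != text:
            raise ValueError("segmentation does not reproduce the text")
        votes.update(boundaries(tokens))
    cuts = sorted(c for c, n in votes.items() if n >= min_votes)
    if not cuts or cuts[-1] != len(text):
        cuts.append(len(text))          # always close the final token
    merged, start = [], 0
    for cut in cuts:
        if cut > start:
            merged.append(text[start:cut])
        start = cut
    return merged
```

For the text "南京市长江大桥", three of four hypothetical segmenters cutting after "南京市" outvote a single one that cuts after "南京", so the merged sequence is "南京市" followed by "长江大桥".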
Further, the word sequence to be analyzed is input into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, so as to obtain sensitive vocabulary of the word sequence to be analyzed, and the method comprises the following steps:
receiving a word sequence to be analyzed by using an input layer of a sensitive vocabulary recognition model;
converting the segmented words in the word sequence to be analyzed into word vectors by using the vector characterization layer of the sensitive vocabulary recognition model, to obtain a word sequence to be analyzed comprising a plurality of word vectors;
converting each word vector in that word sequence into character vectors by using the vector characterization layer of the sensitive vocabulary recognition model, to obtain a word sequence to be analyzed comprising a plurality of character vectors;
extracting the word semantic features of each word vector and the character semantic features of each character vector by using the BILSTM layer of the sensitive vocabulary recognition model;
performing feature fusion on the word semantic features of all word vectors and the character semantic features of all character vectors by using the feature fusion layer of the sensitive vocabulary recognition model, to obtain a fusion feature sequence;
using a CRF layer of a sensitive vocabulary recognition model, carrying out dependency processing on each word vector in a word sequence to be analyzed according to the fusion feature sequence, and adding a sensitive vocabulary label to obtain a sensitive vocabulary mark word sequence;
and outputting the corresponding sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary labels in the sensitive vocabulary mark word sequence by using an output layer of the sensitive vocabulary recognition model, and recording the position information of the sensitive vocabulary of the word sequence to be analyzed in the sensitive vocabulary mark word sequence.
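The labelling and position-recording steps can be illustrated with a small decoder. This sketch covers only the last two layers: it assumes the earlier layers have already produced per-token label scores (`emissions`), uses a B/I/O label scheme that the text does not name, and applies standard Viterbi decoding for a linear-chain CRF. All names and the numeric scores in any example are hypothetical.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions: one dict {label: score} per token.
    transitions: dict {(previous_label, label): score}; missing pairs score 0.
    """
    if not emissions:
        return []
    score = {lab: emissions[0][lab] for lab in labels}
    back = []
    for t in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: score[p] + transitions.get((p, cur), 0.0))
            new_score[cur] = score[prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            ptr[cur] = prev
        back.append(ptr)
        score = new_score
    path = [max(labels, key=lambda lab: score[lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def extract_spans(tokens, tags):
    """Return (start, end, text) for each B/I run; `end` is exclusive."""
    spans, start = [], None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        if start is not None and tag != "I":
            spans.append((start, i, "".join(tokens[start:i])))
            start = None
        if tag == "B":
            start = i
    return spans
```

The transition scores are what let the CRF layer model dependencies between neighbouring labels: a large negative score on the pair ("O", "I") prevents a span from starting in the middle, even when the per-token scores alone would prefer it.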
Further, according to the sensitive vocabulary corpus, verifying the sensitive vocabulary of the word sequence to be analyzed, if the verification result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed, and the method comprises the following steps:
inputting each sensitive word of the word sequence to be analyzed into the sensitive vocabulary corpus, and performing similarity matching against the known sensitive words and their pinyin, near-synonyms and homonyms in the corpus;
if the similarity between the current sensitive word of the word sequence to be analyzed and a known sensitive word, or its pinyin, near-synonym or homonym, is greater than a threshold, proceeding to the next step; otherwise, outputting a verification result of false;
if all the sensitive words of the word sequence to be analyzed have passed verification, outputting a verification result of true; otherwise, inputting the next sensitive word of the word sequence to be analyzed into the sensitive vocabulary corpus for verification;
if the verification result is true, replacing the sensitive vocabulary at the corresponding position in the word sequence to be analyzed by using a shielding symbol according to the position information of the sensitive vocabulary in the word sequence marked by the sensitive vocabulary, so as to obtain a word sequence after the sensitive vocabulary is shielded, otherwise, carrying out sensitive vocabulary recognition again on the word sequence to be analyzed.
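The verification and masking steps can be sketched as below. The similarity measure is not specified in the text; `difflib.SequenceMatcher.ratio()` from the Python standard library is used here only as an assumed stand-in, and the threshold of 0.8, the `*` shielding symbol and the function names are likewise illustrative.

```python
from difflib import SequenceMatcher


def best_similarity(word, corpus):
    """Highest similarity between `word` and any entry of the corpus."""
    return max(
        (SequenceMatcher(None, word, known).ratio() for known in corpus),
        default=0.0,
    )


def verify_and_mask(tokens, positions, corpus, threshold=0.8, mask="*"):
    """Mask flagged tokens if every one of them passes verification.

    positions: indices in `tokens` flagged by the recognition model.
    Returns the masked token list, or None when verification fails,
    in which case the caller would run recognition again.
    """
    for i in positions:
        if best_similarity(tokens[i], corpus) <= threshold:
            return None
    flagged = set(positions)
    return [mask * len(t) if i in flagged else t for i, t in enumerate(tokens)]
```

Matching by similarity rather than equality is what lets a disguised spelling such as "badw0rd" still verify against the corpus entry "badword".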
The sensitive vocabulary shielding system based on the NLP is used for realizing a sensitive vocabulary shielding method and comprises a corpus construction unit, a sensitive vocabulary recognition model construction unit, a text extraction unit, a word segmentation processing unit, a sensitive vocabulary recognition unit and a sensitive vocabulary verification unit, wherein the corpus construction unit is respectively connected with the sensitive vocabulary recognition model construction unit and the sensitive vocabulary verification unit, the corpus construction unit is connected with an external internet corpus, the sensitive vocabulary recognition model construction unit is connected with the sensitive vocabulary recognition unit, the sensitive vocabulary recognition unit is respectively connected with the word segmentation processing unit and the sensitive vocabulary verification unit, and the word segmentation processing unit is connected with the text extraction unit;
the corpus construction unit is used for capturing a plurality of sensitive words and a plurality of universal words in an external internet corpus, constructing a sensitive word corpus according to the plurality of sensitive words and constructing a non-sensitive word corpus according to the plurality of universal words;
the sensitive vocabulary recognition model building unit is used for calling the sensitive vocabulary corpus and the non-sensitive vocabulary corpus built by the corpus building unit and building a sensitive vocabulary recognition model by using an NLP algorithm;
the text extraction unit is used for receiving the file to be analyzed and extracting the text of the file to be analyzed to obtain the text to be analyzed;
the word segmentation processing unit is used for performing word segmentation processing on the text to be analyzed obtained by the text extraction unit by using a word segmentation algorithm to obtain a word sequence to be analyzed, and sending the word sequence to be analyzed to the sensitive vocabulary recognition unit;
the sensitive vocabulary recognition unit is used for calling the sensitive vocabulary recognition model constructed by the sensitive vocabulary recognition model construction unit, inputting the word sequence to be analyzed sent by the word segmentation processing unit into the sensitive vocabulary recognition model to perform sensitive vocabulary recognition, obtaining sensitive vocabulary of the word sequence to be analyzed, receiving a verification result sent by the sensitive vocabulary verification unit, if the verification result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol, obtaining a word sequence after the sensitive vocabulary is shielded, and obtaining a text after the sensitive vocabulary is shielded according to the word sequence after the sensitive vocabulary is shielded;
the sensitive vocabulary verification unit is used for extracting the sensitive vocabulary of the word sequence to be analyzed, which is obtained by the sensitive vocabulary recognition unit, calling the sensitive vocabulary corpus constructed by the corpus construction unit, verifying the sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary corpus, obtaining a verification result, and sending the verification result to the sensitive vocabulary recognition unit.
The beneficial effects of the invention are as follows:
according to the sensitive vocabulary shielding method and system based on the NLP, the sensitive vocabulary corpus and the non-sensitive vocabulary corpus are built, so that the sensitive vocabulary recognition training samples are enriched, the data support of the sensitive vocabulary is expanded, the sensitive vocabulary recognition model is built by using an NLP algorithm, the sensitive vocabulary recognition training samples are fully learned, the automatic and accurate recognition of the sensitive vocabulary is realized, and the efficiency and accuracy of the subsequent sensitive vocabulary shielding are improved; the sensitive vocabulary recognition model can extract semantic features, and analyze the semantic features in combination with a semantic environment, so that false triggering of sensitive vocabulary is avoided, and the use experience of a user is improved; the method can be applied to files to be analyzed in different data formats, and the practicability of the method is improved; and the sensitive vocabulary is checked by using the sensitive vocabulary corpus, so that the accuracy of sensitive vocabulary shielding is further improved.
Other advantageous effects of the present invention will be further described in the detailed description.
Drawings
FIG. 1 is a block flow diagram of an NLP-based sensitive vocabulary masking method in accordance with the present invention.
Fig. 2 is a block diagram of the sensitive vocabulary masking system based on NLP in the present invention.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings.
Example 1:
as shown in fig. 1, this embodiment provides a sensitive vocabulary shielding method based on NLP, which includes the following steps:
constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus, comprising the following steps:
crawling a plurality of known Chinese or English sensitive words from the Internet by using a crawler tool;
crawling the pinyin, near-synonyms and homonyms of the plurality of known sensitive words;
this broadens the coverage of the sensitive vocabulary corpus, prevents sensitive words from escaping shielding by being replaced with their pinyin, near-synonyms or homonyms, enlarges the recognition range, and improves the effectiveness of sensitive vocabulary shielding;
performing data compression, data noise reduction and data cleaning on the known sensitive words and their pinyin, near-synonyms and homonyms, to obtain processed known sensitive words and their processed pinyin, near-synonyms and homonyms;
data compression reduces the memory occupied by the sensitive words and their pinyin, near-synonyms and homonyms, improving hardware processing efficiency; data noise reduction removes noisy data from them, improving their reliability and the accuracy of the subsequent shielding method; data cleaning removes duplicate data, reducing memory use and further improving hardware processing efficiency;
constructing the sensitive vocabulary corpus from the processed known sensitive words and their pinyin, near-synonyms and homonyms;
collecting a plurality of universal Chinese or English words from the Internet by using a crawler tool;
according to the sensitive vocabulary corpus, removing the known sensitive words, and their pinyin, near-synonyms and homonyms, that are mixed into the universal words, to obtain a plurality of non-sensitive words;
carrying out data compression processing, data noise reduction processing and data cleaning processing on a plurality of non-sensitive words to obtain a plurality of processed non-sensitive words;
constructing a non-sensitive vocabulary corpus according to the processed non-sensitive vocabularies;
according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, using an NLP algorithm to construct a sensitive vocabulary recognition model, comprising the following steps:
randomly extracting known sensitive words, together with their pinyin, near-synonyms and homonyms, from the sensitive vocabulary corpus, and non-sensitive words from the non-sensitive vocabulary corpus, to form a plurality of training text data;
the effectiveness of the sensitive vocabulary recognition model depends critically on the quantity and quality of the training text data; generating the training text data from both the non-sensitive vocabulary corpus and the sensitive vocabulary corpus enlarges the number of training samples, and the preceding processing steps improve their quality, so that the model can fully learn the data characteristics of the training text data and accurately recognize sensitive words in text;
constructing an initial sensitive vocabulary recognition model by using a BERT-BILSTM-CRF algorithm among natural language processing (NLP) algorithms;
the sensitive vocabulary recognition model comprises an input layer, a vector characterization layer provided with a Bidirectional Encoder Representations from Transformers (BERT) pre-trained language sub-model, a bi-directional long short-term memory (BILSTM) layer, a feature fusion layer, a linear-chain conditional random field (CRF) layer and an output layer, connected in sequence;
the BERT pre-trained language sub-model is obtained through pre-training; it converts the segmented words in a word sequence into word vectors, and splits the words into characters to produce character vectors, thereby realizing vector characterization; the BILSTM layer uses context information to extract the semantic features of the word vectors and the character vectors; the feature fusion layer fuses these features, which avoids missing sensitive words because of mixed Chinese and English, scrambled word order or miswritten words, while the semantic features also avoid misidentifying sensitive words, so the accuracy of the model is improved;
optimizing network parameters of the initial sensitive vocabulary recognition model by using an Improved Whale Optimization Algorithm (IWOA), and inputting the plurality of training text data for optimization training to obtain the optimal sensitive vocabulary recognition model;
introducing Circle chaotic sequence initialization and a dynamic reverse learning strategy to improve the traditional Whale Optimization Algorithm (WOA), thereby obtaining the IWOA optimizing algorithm;
the formula for initializing the Circle chaotic sequence is as follows:
wherein $x_{i+1,j+1}$ is the initial position of the whale population generated by the Circle chaotic map; $x_{i,j}$ is the initial position of the randomly generated whale population; mod(·) is the modulo function; i is the whale individual index; j is the dimension index;
compared with a randomly distributed initial population, the initial population generated by the Circle chaotic sequence map has more uniformly distributed initial positions, which enlarges the search range of the whales in the space, increases the diversity of population positions and, to a certain extent, mitigates the tendency of the algorithm to become trapped in a local extremum, thereby improving the optimizing efficiency of the algorithm;
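The Circle chaotic initialization described above can be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes the commonly cited form of the Circle map, x_next = mod(x + 0.2 - (0.5 / 2π)·sin(2π·x), 1); the constants 0.2 and 0.5 and the names `circle_map` and `init_population` are assumptions made for illustration, not values taken from the patent.

```python
import math
import random


def circle_map(x, a=0.5, b=0.2):
    """One step of the commonly cited Circle chaotic map (constants a, b are assumed)."""
    return (x + b - (a / (2 * math.pi)) * math.sin(2 * math.pi * x)) % 1.0


def init_population(n_whales, n_dims, lower, upper, seed=0):
    """Map random values through the Circle map, then scale them into [lower, upper]."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_whales):
        individual = []
        for j in range(n_dims):
            x = circle_map(rng.random())  # chaotic value in [0, 1)
            individual.append(lower[j] + x * (upper[j] - lower[j]))
        population.append(individual)
    return population
```

For example, `init_population(20, 4, [0.0] * 4, [1.0] * 4)` would produce twenty four-dimensional starting positions inside the unit box.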
the formula of the dynamic reverse learning strategy is:
$x'_{ij}(t) = k\,(a_j(t) + b_j(t)) - x_{ij}(t)$
wherein $x'_{ij}(t)$ and $x_{ij}(t)$ are the reverse position and the forward position, respectively, of the i-th whale individual in the j-th dimension; $a_j(t)$ and $b_j(t)$ are the upper and lower bounds, respectively, of the j-th dimension of the current whale population; k is a decreasing inertia factor, $k = 0.9 - 0.5\,D/D_{\max}$; D and $D_{\max}$ are the current iteration number and the maximum iteration number, respectively; t is the time index;
dynamic reverse learning reduces search blind spots, and effectively prevents the algorithm from converging prematurely and becoming trapped in a local optimum;
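A minimal sketch of the dynamic reverse learning formula, assuming that the bounds a_j(t) and b_j(t) are taken as the per-dimension maximum and minimum of the current population. The function names are hypothetical.

```python
def inertia_factor(d, d_max):
    """Decreasing inertia factor k = 0.9 - 0.5 * D / D_max."""
    return 0.9 - 0.5 * d / d_max


def reverse_population(population, d, d_max):
    """Return the reverse position x' = k * (a_j + b_j) - x for every individual."""
    k = inertia_factor(d, d_max)
    n_dims = len(population[0])
    # Assumed: bounds are the current population's extremes in each dimension.
    upper = [max(ind[j] for ind in population) for j in range(n_dims)]
    lower = [min(ind[j] for ind in population) for j in range(n_dims)]
    return [
        [k * (upper[j] + lower[j]) - ind[j] for j in range(n_dims)]
        for ind in population
    ]
```

Because k shrinks from 0.9 toward 0.4 as the iteration count grows, the reverse positions are pulled progressively closer to the origin side of the current bounds.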
the network parameters of the initial sensitive vocabulary recognition model are optimized by using an IWOA optimizing algorithm, and the method comprises the following steps:
taking the number of hidden layer neurons of the BILSTM layer, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate as optimization targets, namely the positions of whale individuals of the IWOA population;
initializing parameters of an IWOA optimizing algorithm, and initializing an IWOA population by using a Circle chaotic sequence;
calculating the fitness value of each whale individual in the IWOA population;
performing encircling of prey, bubble-net attack or search for prey, and updating the whale individuals and the IWOA population;
performing dynamic reverse learning on the updated IWOA population to obtain the reverse solution corresponding to each forward solution in the IWOA population, and screening the optimal whale individual and its optimal fitness value according to the fitness values of the whale individuals of all forward solutions and all reverse solutions in the IWOA population;
outputting the position of a global optimal solution corresponding to the optimal whale individual if the optimal fitness value meets the requirement or the iteration number meets the requirement, namely, the number of hidden layer neurons, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate of the BILSTM layer, otherwise, repeatedly updating whale individual and IWOA population;
this solves the problem that the BILSTM is sensitive to the initial values of its network parameters, and improves the training speed and accuracy of the model;
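The IWOA steps above can be sketched as a single loop. This is an illustrative sketch under several assumptions, not the patented implementation: the Circle map constants (0.2 and 0.5) are the commonly cited ones, the update rules are those of the standard WOA, the screening step keeps the fitter of each forward and reverse pair, and `fitness` is any function to minimize. In the method described here the fitness would presumably come from training the BILSTM with the candidate parameters, which the text does not specify.

```python
import math
import random


def iwoa_minimize(fitness, lower, upper, n_whales=20, max_iter=50, target=None, seed=0):
    """Minimize `fitness` over the box [lower, upper]; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dims = len(lower)

    def clip(pos):
        return [min(max(pos[j], lower[j]), upper[j]) for j in range(dims)]

    def circle(x):
        # Assumed Circle map constants; the source formula is not shown.
        return (x + 0.2 - (0.5 / (2 * math.pi)) * math.sin(2 * math.pi * x)) % 1.0

    # Circle chaotic initialization of the population.
    pop = [[lower[j] + circle(rng.random()) * (upper[j] - lower[j]) for j in range(dims)]
           for _ in range(n_whales)]
    best = list(min(pop, key=fitness))

    for it in range(max_iter):
        a = 2.0 * (1.0 - it / max_iter)  # WOA control parameter, decreasing from 2 toward 0
        moved = []
        for x in pop:
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # |A| < 1: encircle the best whale; otherwise search around a random whale.
                ref = best if abs(A) < 1.0 else rng.choice(pop)
                new = [ref[j] - A * abs(C * ref[j] - x[j]) for j in range(dims)]
            else:
                # Bubble-net attack: logarithmic spiral around the best whale.
                spiral = rng.uniform(-1.0, 1.0)
                new = [abs(best[j] - x[j]) * math.exp(spiral)
                       * math.cos(2 * math.pi * spiral) + best[j] for j in range(dims)]
            moved.append(clip(new))

        # Dynamic reverse learning: x' = k * (a_j + b_j) - x.
        k = 0.9 - 0.5 * it / max_iter
        hi = [max(x[j] for x in moved) for j in range(dims)]
        lo = [min(x[j] for x in moved) for j in range(dims)]
        reverse = [clip([k * (hi[j] + lo[j]) - x[j] for j in range(dims)]) for x in moved]

        # Assumed screening rule: keep the fitter of each forward/reverse pair.
        pop = [f if fitness(f) <= fitness(r) else r for f, r in zip(moved, reverse)]
        candidate = min(pop, key=fitness)
        if fitness(candidate) < fitness(best):
            best = list(candidate)
        if target is not None and fitness(best) <= target:
            break  # fitness requirement met before the iteration limit
    return best, fitness(best)
```

A toy call such as `iwoa_minimize(lambda p: sum(v * v for v in p), [-5.0, -5.0], [5.0, 5.0])` exercises the loop on a sphere function; each dimension of the position vector would stand for one of the BILSTM parameters being tuned.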
extracting text from a file to be analyzed to obtain the text to be analyzed, wherein the text extraction method comprises the following steps:
receiving a file to be analyzed, and analyzing the file name suffix of the file to be analyzed to obtain the data format of the file to be analyzed;
if the data format of the file to be analyzed is a text format, extracting the text of the file to be analyzed in the text format to obtain a text to be analyzed corresponding to the file to be analyzed in the text format;
if the data format of the file to be analyzed is a picture format, text extraction is carried out on the file to be analyzed in the picture format by using a picture-text recognition model, and a text to be analyzed corresponding to the file to be analyzed in the picture format is obtained;
if the data format of the file to be analyzed is a video format, carrying out frame interception on the file to be analyzed in the video format to obtain continuous frames of images to be analyzed, carrying out text extraction on the continuous frames of images to be analyzed by using a picture-text recognition model to obtain a plurality of original texts to be analyzed, and carrying out text combination and de-duplication on the plurality of original texts to be analyzed to obtain texts to be analyzed corresponding to the file to be analyzed in the video format;
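The suffix-based dispatch above might look like the following sketch. The suffix tables are hypothetical, since the text lists no concrete extensions, and the reader, picture-text recognition and frame capture steps are passed in as callables because the text does not name specific tools for them.

```python
from pathlib import Path

# Hypothetical suffix tables; the source does not list concrete extensions.
FORMATS = {
    "text": {".txt", ".md", ".csv"},
    "picture": {".png", ".jpg", ".jpeg", ".bmp"},
    "video": {".mp4", ".avi", ".mov"},
}


def detect_format(filename):
    """Derive the data format from the file name suffix."""
    suffix = Path(filename).suffix.lower()
    for name, suffixes in FORMATS.items():
        if suffix in suffixes:
            return name
    raise ValueError(f"unsupported file suffix: {suffix!r}")


def merge_and_deduplicate(frame_texts):
    """Merge per-frame texts, keeping each non-empty line once in first-seen order."""
    seen, merged = set(), []
    for text in frame_texts:
        for line in text.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    return "\n".join(merged)


def extract_text(filename, read_text, recognize_image, capture_frames):
    """Route the file to the matching extraction path; the three callables are injected."""
    kind = detect_format(filename)
    if kind == "text":
        return read_text(filename)
    if kind == "picture":
        return recognize_image(filename)
    frames = capture_frames(filename)
    return merge_and_deduplicate(recognize_image(frame) for frame in frames)
```

Injecting the callables keeps the routing logic testable without committing to a particular recognition model or video library.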
performing word segmentation on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed, which comprises the following steps:
performing word segmentation on the text to be analyzed by using a pkuseg word segmentation algorithm to obtain a first word sequence to be analyzed;
performing word segmentation on the text to be analyzed by using a jieba word segmentation algorithm to obtain a second word sequence to be analyzed;
performing word segmentation processing on the text to be analyzed by using an ltp word segmentation algorithm to obtain a third word sequence to be analyzed;
performing word segmentation on the text to be analyzed by using a hanlp word segmentation algorithm to obtain a fourth word sequence to be analyzed;
combining and screening the first word sequence to be analyzed, the second word sequence to be analyzed, the third word sequence to be analyzed and the fourth word sequence to be analyzed to obtain a final word sequence to be analyzed;
the word segmentation is carried out by adopting a plurality of word segmentation algorithms, so that the situation of wrong segmentation or missed segmentation existing in a single word segmentation algorithm is avoided, and the accuracy of the word segmentation algorithm is improved;
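The text does not say how the four word sequences are combined and screened. One possible reading, shown here purely as an assumption, is a majority vote over cut positions: a boundary is kept when at least `min_votes` segmenters agree on it. The segmenter outputs are passed in as plain lists, so the sketch does not depend on pkuseg, jieba, ltp or hanlp being installed.

```python
from collections import Counter


def boundaries(tokens):
    """Return the set of cut positions (character offsets) implied by a segmentation."""
    cuts, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def merge_segmentations(text, segmentations, min_votes=2):
    """Keep a cut position when at least `min_votes` segmentations agree on it."""
    votes = Counter()
    for tokens in segmentations:
        if "".join(tokens) != text:
            raise ValueError("segmentation does not reproduce the text")
        votes.update(boundaries(tokens))
    # The end of the text is always a boundary.
    cuts = sorted(p for p, n in votes.items() if n >= min_votes or p == len(text))
    merged, start = [], 0
    for cut in cuts:
        merged.append(text[start:cut])
        start = cut
    return merged
```

Raising `min_votes` yields coarser tokens, because fewer boundaries survive the vote; lowering it yields finer ones.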
inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, and obtaining sensitive vocabulary of the word sequence to be analyzed, wherein the method comprises the following steps:
receiving a word sequence to be analyzed by using an input layer of a sensitive vocabulary recognition model;
converting a plurality of word fragments in the word sequence to be analyzed into word vectors by using a vector characterization layer of the sensitive vocabulary recognition model to obtain the word sequence to be analyzed comprising a plurality of word vectors;
converting each word vector in the word sequence to be analyzed comprising a plurality of word vectors into character vectors by using the vector characterization layer of the sensitive vocabulary recognition model to obtain a word sequence to be analyzed comprising a plurality of character vectors;
extracting word semantic features of each word vector and character semantic features of each character vector by using a BILSTM layer of the sensitive vocabulary recognition model;
fusing the word semantic features of all word vectors and the character semantic features of all character vectors by using a feature fusion layer of the sensitive vocabulary recognition model, so as to obtain a fusion feature sequence;
using a CRF layer of a sensitive vocabulary recognition model, carrying out dependency processing on each word vector in a word sequence to be analyzed according to the fusion feature sequence, and adding a sensitive vocabulary label to obtain a sensitive vocabulary mark word sequence;
using an output layer of the sensitive vocabulary recognition model, outputting the sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary labels in the sensitive vocabulary mark word sequence, and recording the position information of the sensitive vocabulary of the word sequence to be analyzed in the sensitive vocabulary mark word sequence;
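The output-layer step, which turns per-token labels into sensitive vocabulary plus position information, can be illustrated with a small decoder. The labeling scheme is not specified in the text, so this sketch assumes a BIO scheme with the hypothetical labels `B-SEN`, `I-SEN` and `O`, and records positions as token index ranges.

```python
def extract_sensitive(tokens, tags):
    """Collect labeled spans as dicts with the joined text and [start, end) token indices."""
    spans, current = [], None
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag == "B-SEN":
            # A new span starts; close any span that was still open.
            if current:
                spans.append(current)
            current = {"text": tok, "start": i, "end": i + 1}
        elif tag == "I-SEN" and current:
            # Continue the open span.
            current["text"] += tok
            current["end"] = i + 1
        else:
            # "O", or an "I-SEN" with no open span, ends the current span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```

The recorded `start` and `end` indices are what a later replacement step would use to locate the tokens to shield.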
according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed, and the method comprises the following steps:
inputting each sensitive vocabulary of the word sequence to be analyzed into the sensitive vocabulary corpus, and performing similarity matching against the known sensitive vocabulary in the sensitive vocabulary corpus and its pinyin, near-synonyms and homonyms;
if the similarity value between a known sensitive vocabulary, or its pinyin, near-synonym or homonym, and the current sensitive vocabulary of the word sequence to be analyzed is greater than a threshold value, entering the next step, otherwise, outputting the verification result as not true;
outputting a verification result to be true if all the sensitive words of the word sequence to be analyzed pass verification, otherwise, inputting the next sensitive word of the word sequence to be analyzed into a sensitive word corpus to perform verification;
if the verification result is true, replacing the sensitive vocabulary at the corresponding position in the word sequence to be analyzed by using a shielding symbol according to the position information of the sensitive vocabulary in the sensitive vocabulary mark word sequence, so as to obtain a word sequence after sensitive vocabulary shielding, otherwise, carrying out sensitive vocabulary recognition again on the word sequence to be analyzed;
and according to the word sequence after the sensitive word shielding, obtaining a text after the sensitive word shielding, and loading the text after the sensitive word shielding into a file to be analyzed to obtain the file after the sensitive word shielding.
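The verification and shielding steps can be sketched as follows. The similarity measure and the threshold are not specified in the text; this sketch assumes the standard-library `difflib.SequenceMatcher` ratio and a hypothetical threshold of 0.8, and treats the corpus as a flat collection of known sensitive words and their variants.

```python
from difflib import SequenceMatcher


def best_similarity(word, lexicon):
    """Highest similarity ratio between `word` and any entry of the lexicon."""
    return max((SequenceMatcher(None, word, known).ratio() for known in lexicon), default=0.0)


def verify(candidates, lexicon, threshold=0.8):
    """Verification is true only if every candidate exceeds the threshold."""
    return all(best_similarity(word, lexicon) > threshold for word in candidates)


def mask(tokens, spans, symbol="*"):
    """Replace the tokens in each [start, end) span with shielding symbols."""
    masked = list(tokens)
    for start, end in spans:
        for i in range(start, end):
            masked[i] = symbol * len(tokens[i])
    return masked
```

In this sketch the shielding symbol is repeated to the length of each replaced token, so the shielded text keeps the same character count as the original.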
In this embodiment, as part of the sensitive vocabulary shielding method, after the text after sensitive vocabulary shielding is obtained, it is further loaded into the file to be analyzed, and different loading tools are used for files to be analyzed in different data formats. For example, if the data format of the file to be analyzed is a text format, the text after sensitive vocabulary shielding directly replaces the text to be analyzed in the file to be analyzed. If the data format is a picture format, an image editing tool is used to add either the text after sensitive vocabulary shielding, or the shielding symbols in that text, at the corresponding position of the file to be analyzed in the picture format. If the data format is a video format, the text after sensitive vocabulary shielding is matched against the plurality of original texts to be analyzed to find the images to be analyzed of the corresponding frames, and either the text after sensitive vocabulary shielding, or the shielding symbols in that text, is added at the corresponding position of the image to be analyzed of each corresponding frame.
Example 2:
As shown in FIG. 2, this embodiment provides an NLP-based sensitive vocabulary shielding system for implementing the sensitive vocabulary shielding method, the system comprising a corpus construction unit, a sensitive vocabulary recognition model construction unit, a text extraction unit, a word segmentation processing unit, a sensitive vocabulary recognition unit and a sensitive vocabulary verification unit, wherein the corpus construction unit is respectively connected with the sensitive vocabulary recognition model construction unit and the sensitive vocabulary verification unit, the corpus construction unit is connected with an external internet corpus, the sensitive vocabulary recognition model construction unit is connected with the sensitive vocabulary recognition unit, the sensitive vocabulary recognition unit is respectively connected with the word segmentation processing unit and the sensitive vocabulary verification unit, and the word segmentation processing unit is connected with the text extraction unit;
the corpus construction unit is used for capturing a plurality of sensitive words and a plurality of universal words in an external internet corpus, constructing a sensitive word corpus according to the plurality of sensitive words and constructing a non-sensitive word corpus according to the plurality of universal words;
the sensitive vocabulary recognition model building unit is used for calling the sensitive vocabulary corpus and the non-sensitive vocabulary corpus built by the corpus building unit and building a sensitive vocabulary recognition model by using an NLP algorithm;
the text extraction unit is used for receiving the file to be analyzed and extracting the text of the file to be analyzed to obtain the text to be analyzed;
the word segmentation processing unit is used for performing word segmentation processing on the text to be analyzed obtained by the text extraction unit by using a word segmentation algorithm to obtain a word sequence to be analyzed, and sending the word sequence to be analyzed to the sensitive vocabulary recognition unit;
the sensitive vocabulary recognition unit is used for calling the sensitive vocabulary recognition model constructed by the sensitive vocabulary recognition model construction unit, inputting the word sequence to be analyzed sent by the word segmentation processing unit into the sensitive vocabulary recognition model to perform sensitive vocabulary recognition, obtaining sensitive vocabulary of the word sequence to be analyzed, receiving a verification result sent by the sensitive vocabulary verification unit, if the verification result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol, obtaining a word sequence after the sensitive vocabulary is shielded, and obtaining a text after the sensitive vocabulary is shielded according to the word sequence after the sensitive vocabulary is shielded;
the sensitive vocabulary verification unit is used for extracting the sensitive vocabulary of the word sequence to be analyzed, which is obtained by the sensitive vocabulary recognition unit, calling the sensitive vocabulary corpus constructed by the corpus construction unit, verifying the sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary corpus, obtaining a verification result, and sending the verification result to the sensitive vocabulary recognition unit.
According to the sensitive vocabulary shielding method and system based on the NLP, the sensitive vocabulary corpus and the non-sensitive vocabulary corpus are built, so that the sensitive vocabulary recognition training samples are enriched, the data support of the sensitive vocabulary is expanded, the sensitive vocabulary recognition model is built by using an NLP algorithm, the sensitive vocabulary recognition training samples are fully learned, the automatic and accurate recognition of the sensitive vocabulary is realized, and the efficiency and accuracy of the subsequent sensitive vocabulary shielding are improved; the sensitive vocabulary recognition model can extract semantic features, and analyze the semantic features in combination with a semantic environment, so that false triggering of sensitive vocabulary is avoided, and the use experience of a user is improved; the method can be applied to files to be analyzed in different data formats, and the practicability of the method is improved; and the sensitive vocabulary is checked by using the sensitive vocabulary corpus, so that the accuracy of sensitive vocabulary shielding is further improved.
The invention is not limited to the alternative embodiments described above, but any person may derive other various forms of products in the light of the present invention. The above detailed description should not be construed as limiting the scope of the invention, which is defined in the claims and the description may be used to interpret the claims.

Claims (10)

1. A sensitive vocabulary shielding method based on NLP is characterized in that: the method comprises the following steps:
constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus;
according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, constructing a sensitive vocabulary recognition model by using an NLP algorithm;
extracting text from the file to be analyzed to obtain the text to be analyzed;
performing word segmentation on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed;
inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, so as to obtain sensitive vocabulary of the word sequence to be analyzed;
according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed;
and according to the word sequence after the sensitive word shielding, obtaining a text after the sensitive word shielding, and loading the text after the sensitive word shielding into a file to be analyzed to obtain the file after the sensitive word shielding.
2. The NLP-based sensitive vocabulary shielding method of claim 1, wherein: constructing a sensitive vocabulary corpus and a non-sensitive vocabulary corpus, comprising the following steps:
capturing a plurality of known sensitive words of Chinese or English in the Internet by using a crawler tool;
capturing the pinyin, near-synonyms and homonyms of the plurality of known sensitive vocabularies;
carrying out data compression processing, data noise reduction processing and data cleaning processing on the plurality of known sensitive vocabularies and their pinyin, near-synonyms and homonyms to obtain a plurality of processed known sensitive vocabularies and their pinyin, near-synonyms and homonyms;
constructing the sensitive vocabulary corpus according to the plurality of processed known sensitive vocabularies and their pinyin, near-synonyms and homonyms;
collecting a plurality of Chinese or English universal words in the Internet by using a crawler tool;
according to the sensitive vocabulary corpus, rejecting the known sensitive vocabularies, and their pinyin, near-synonyms and homonyms, that are mixed into the plurality of universal vocabularies to obtain a plurality of non-sensitive vocabularies;
carrying out data compression processing, data noise reduction processing and data cleaning processing on a plurality of non-sensitive words to obtain a plurality of processed non-sensitive words;
and constructing a non-sensitive vocabulary corpus according to the processed non-sensitive vocabularies.
3. The NLP-based sensitive vocabulary shielding method of claim 2, wherein: according to the sensitive vocabulary corpus and the non-sensitive vocabulary corpus, using an NLP algorithm to construct a sensitive vocabulary recognition model, comprising the following steps:
randomly extracting known sensitive vocabularies and their pinyin, near-synonyms and homonyms from the sensitive vocabulary corpus, and non-sensitive vocabularies from the non-sensitive vocabulary corpus, to form a plurality of training text data;
constructing an initial sensitive vocabulary recognition model by using a BERT-BILSTM-CRF algorithm in an NLP algorithm;
and optimizing network parameters of the initial sensitive vocabulary recognition model by using an IWOA optimizing algorithm, and inputting a plurality of training text data for optimization training to obtain the optimal sensitive vocabulary recognition model.
4. A sensitive vocabulary shielding method based on NLP according to claim 3, wherein: the sensitive vocabulary recognition model comprises an input layer, a vector characterization layer provided with a BERT pre-training language sub-model, a BILSTM layer, a feature fusion layer, a CRF layer and an output layer which are connected in sequence;
introducing a Circle chaotic sequence initialization and dynamic reverse learning strategy to improve the traditional WOA optimizing algorithm to obtain an IWOA optimizing algorithm;
the formula for initializing the Circle chaotic sequence is as follows:
wherein $x_{i+1,j+1}$ is the initial position of the whale population generated by the Circle chaotic map; $x_{i,j}$ is the initial position of the randomly generated whale population; mod(·) is the modulo function; i is the whale individual index; j is the dimension index;
the formula of the dynamic reverse learning strategy is:
$x'_{ij}(t) = k\,(a_j(t) + b_j(t)) - x_{ij}(t)$
wherein $x'_{ij}(t)$ and $x_{ij}(t)$ are the reverse position and the forward position, respectively, of the i-th whale individual in the j-th dimension; $a_j(t)$ and $b_j(t)$ are the upper and lower bounds, respectively, of the j-th dimension of the current whale population; k is a decreasing inertia factor, $k = 0.9 - 0.5\,D/D_{\max}$; D and $D_{\max}$ are the current iteration number and the maximum iteration number, respectively; t is the time index.
5. The NLP-based sensitive vocabulary shielding method of claim 4, wherein: the network parameters of the initial sensitive vocabulary recognition model are optimized by using an IWOA optimizing algorithm, and the method comprises the following steps:
taking the number of hidden layer neurons of the BILSTM layer, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate as optimization targets, namely the positions of whale individuals of the IWOA population;
initializing parameters of an IWOA optimizing algorithm, and initializing an IWOA population by using a Circle chaotic sequence;
calculating the fitness value of each whale individual in the IWOA population;
performing encircling of prey, bubble-net attack or search for prey, and updating the whale individuals and the IWOA population;
performing dynamic reverse learning on the updated IWOA population to obtain the reverse solution corresponding to each forward solution in the IWOA population, and screening the optimal whale individual and its optimal fitness value according to the fitness values of the whale individuals of all forward solutions and all reverse solutions in the IWOA population;
if the optimal fitness value meets the requirement or the iteration number meets the requirement, outputting the position of a global optimal solution corresponding to the optimal whale individual, namely the number of hidden layer neurons, the initial weight and the initial threshold of the hidden layer neurons and the initial learning rate of the BILSTM layer, and if not, repeatedly updating whale individual and IWOA population.
6. The NLP-based sensitive vocabulary shielding method of claim 1, wherein: extracting text from a file to be analyzed to obtain the text to be analyzed, wherein the text extraction method comprises the following steps:
receiving a file to be analyzed, and analyzing the file name suffix of the file to be analyzed to obtain the data format of the file to be analyzed;
if the data format of the file to be analyzed is a text format, extracting the text of the file to be analyzed in the text format to obtain a text to be analyzed corresponding to the file to be analyzed in the text format;
if the data format of the file to be analyzed is a picture format, text extraction is carried out on the file to be analyzed in the picture format by using a picture-text recognition model, and a text to be analyzed corresponding to the file to be analyzed in the picture format is obtained;
if the data format of the file to be analyzed is a video format, frame interception is carried out on the file to be analyzed in the video format to obtain continuous frames of images to be analyzed, text extraction is carried out on the continuous frames of images to be analyzed by using a picture-text recognition model to obtain a plurality of original texts to be analyzed, and text combination and de-duplication processing are carried out on the plurality of original texts to be analyzed to obtain texts to be analyzed corresponding to the file to be analyzed in the video format.
7. The NLP-based sensitive vocabulary shielding method of claim 1, wherein: performing word segmentation on the text to be analyzed by using a word segmentation algorithm to obtain a word sequence to be analyzed, which comprises the following steps:
performing word segmentation on the text to be analyzed by using a pkuseg word segmentation algorithm to obtain a first word sequence to be analyzed;
performing word segmentation on the text to be analyzed by using a jieba word segmentation algorithm to obtain a second word sequence to be analyzed;
performing word segmentation processing on the text to be analyzed by using an ltp word segmentation algorithm to obtain a third word sequence to be analyzed;
performing word segmentation on the text to be analyzed by using a hanlp word segmentation algorithm to obtain a fourth word sequence to be analyzed;
and merging and screening the first word sequence to be analyzed, the second word sequence to be analyzed, the third word sequence to be analyzed and the fourth word sequence to be analyzed to obtain a final word sequence to be analyzed.
8. The NLP-based sensitive vocabulary shielding method of claim 4, wherein: inputting the word sequence to be analyzed into a sensitive vocabulary recognition model to perform sensitive vocabulary recognition, and obtaining sensitive vocabulary of the word sequence to be analyzed, wherein the method comprises the following steps:
receiving a word sequence to be analyzed by using an input layer of a sensitive vocabulary recognition model;
converting a plurality of word fragments in the word sequence to be analyzed into word vectors by using a vector characterization layer of the sensitive vocabulary recognition model to obtain the word sequence to be analyzed comprising a plurality of word vectors;
converting each word vector in the word sequence to be analyzed comprising a plurality of word vectors into character vectors by using the vector characterization layer of the sensitive vocabulary recognition model to obtain a word sequence to be analyzed comprising a plurality of character vectors;
extracting word semantic features of each word vector and character semantic features of each character vector by using a BILSTM layer of the sensitive vocabulary recognition model;
fusing the word semantic features of all word vectors and the character semantic features of all character vectors by using a feature fusion layer of the sensitive vocabulary recognition model, so as to obtain a fusion feature sequence;
using a CRF layer of a sensitive vocabulary recognition model, carrying out dependency processing on each word vector in a word sequence to be analyzed according to the fusion feature sequence, and adding a sensitive vocabulary label to obtain a sensitive vocabulary mark word sequence;
and outputting the corresponding sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary labels in the sensitive vocabulary mark word sequence by using an output layer of the sensitive vocabulary recognition model, and recording the position information of the sensitive vocabulary of the word sequence to be analyzed in the sensitive vocabulary mark word sequence.
9. The NLP-based sensitive vocabulary shielding method of claim 8, wherein: according to the sensitive vocabulary corpus, checking the sensitive vocabulary of the word sequence to be analyzed, if the checking result is true, replacing the sensitive vocabulary of the word sequence to be analyzed by using a shielding symbol to obtain a word sequence after shielding the sensitive vocabulary, otherwise, re-carrying out sensitive vocabulary recognition on the word sequence to be analyzed, and the method comprises the following steps:
inputting each sensitive vocabulary of the word sequence to be analyzed into the sensitive vocabulary corpus, and performing similarity matching against the known sensitive vocabulary in the sensitive vocabulary corpus and its pinyin, near-synonyms and homonyms;
if the similarity value between a known sensitive vocabulary, or its pinyin, near-synonym or homonym, and the current sensitive vocabulary of the word sequence to be analyzed is greater than a threshold value, entering the next step, otherwise, outputting the verification result as not true;
outputting a verification result to be true if all the sensitive words of the word sequence to be analyzed pass verification, otherwise, inputting the next sensitive word of the word sequence to be analyzed into a sensitive word corpus to perform verification;
if the verification result is true, replacing the sensitive vocabulary at the corresponding position in the word sequence to be analyzed by using a shielding symbol according to the position information of the sensitive vocabulary in the sensitive vocabulary mark word sequence, so as to obtain a word sequence after sensitive vocabulary shielding, otherwise, carrying out sensitive vocabulary recognition again on the word sequence to be analyzed.
10. A sensitive vocabulary shielding system based on NLP for implementing the sensitive vocabulary shielding method according to any one of claims 1-9, characterized in that: the system comprises a corpus construction unit, a sensitive vocabulary recognition model construction unit, a text extraction unit, a word segmentation processing unit, a sensitive vocabulary recognition unit and a sensitive vocabulary verification unit, wherein the corpus construction unit is respectively connected with the sensitive vocabulary recognition model construction unit and the sensitive vocabulary verification unit, the corpus construction unit is connected with an external internet corpus, the sensitive vocabulary recognition model construction unit is connected with the sensitive vocabulary recognition unit, the sensitive vocabulary recognition unit is respectively connected with the word segmentation processing unit and the sensitive vocabulary verification unit, and the word segmentation processing unit is connected with the text extraction unit;
the corpus construction unit is used for capturing a plurality of sensitive words and a plurality of universal words in an external internet corpus, constructing a sensitive word corpus according to the plurality of sensitive words and constructing a non-sensitive word corpus according to the plurality of universal words;
the sensitive vocabulary recognition model building unit is used for calling the sensitive vocabulary corpus and the non-sensitive vocabulary corpus built by the corpus building unit and building a sensitive vocabulary recognition model by using an NLP algorithm;
the text extraction unit is used for receiving the file to be analyzed and extracting the text of the file to be analyzed to obtain the text to be analyzed;
the word segmentation processing unit is used for performing word segmentation processing on the text to be analyzed obtained by the text extraction unit by using a word segmentation algorithm to obtain a word sequence to be analyzed, and sending the word sequence to be analyzed to the sensitive vocabulary recognition unit;
the sensitive vocabulary recognition unit is used for calling the sensitive vocabulary recognition model constructed by the sensitive vocabulary recognition model construction unit, inputting the word sequence to be analyzed sent by the word segmentation processing unit into the sensitive vocabulary recognition model to perform sensitive vocabulary recognition and obtain the sensitive vocabulary of the word sequence to be analyzed, and receiving the verification result sent by the sensitive vocabulary verification unit; if the verification result is true, the unit replaces the sensitive vocabulary of the word sequence to be analyzed with a shielding symbol to obtain a word sequence with the sensitive vocabulary shielded, and obtains the text with the sensitive vocabulary shielded from that word sequence;
the sensitive vocabulary verification unit is used for extracting the sensitive vocabulary of the word sequence to be analyzed, which is obtained by the sensitive vocabulary recognition unit, calling the sensitive vocabulary corpus constructed by the corpus construction unit, verifying the sensitive vocabulary of the word sequence to be analyzed according to the sensitive vocabulary corpus, obtaining a verification result, and sending the verification result to the sensitive vocabulary recognition unit.
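The data flow between the units of claim 10 can be sketched as a small Python class. This is an illustrative sketch under stated assumptions: the trained recognition model is replaced by an arbitrary callable, word segmentation is replaced by whitespace splitting, and a retry bound is added so that the re-recognition loop terminates. `ShieldingSystem`, `whitespace_segment`, `max_attempts`, and the `*` shielding symbol are all hypothetical names and values, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

# A recognizer maps a word sequence to the indices it flags as sensitive.
Recognizer = Callable[[List[str]], List[int]]


def whitespace_segment(text: str) -> List[str]:
    """Stand-in for the word segmentation unit (real text needs a segmenter)."""
    return text.split()


@dataclass
class ShieldingSystem:
    sensitive_corpus: Set[str]   # output of the corpus construction unit
    recognizer: Recognizer       # stand-in for the trained recognition model
    segmenter: Callable[[str], List[str]] = whitespace_segment
    mask: str = "*"              # assumed shielding symbol
    max_attempts: int = 2        # assumed bound on re-recognition

    def verify(self, words: List[str], flagged: List[int]) -> bool:
        """Verification unit: every flagged word must be in the corpus."""
        return all(words[i] in self.sensitive_corpus for i in flagged)

    def shield(self, text: str) -> str:
        """Segment, recognize, verify, then mask the flagged positions."""
        words = self.segmenter(text)
        for _ in range(self.max_attempts):
            flagged = self.recognizer(words)
            if self.verify(words, flagged):
                masked = [
                    self.mask * len(w) if i in flagged else w
                    for i, w in enumerate(words)
                ]
                return " ".join(masked)
        return text  # verification never succeeded; input returned unchanged


corpus = {"secret", "password"}
toy_recognizer = lambda words: [i for i, w in enumerate(words) if w in corpus]
system = ShieldingSystem(corpus, toy_recognizer)
shielded = system.shield("send the secret password now")
```

Passing the recognizer in as a callable keeps the sketch independent of any particular NLP model, so a trained classifier could be substituted without changing the verification or masking logic.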
CN202311068514.6A 2023-08-23 NLP-based sensitive vocabulary shielding method and system Active CN117113988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311068514.6A CN117113988B (en) 2023-08-23 NLP-based sensitive vocabulary shielding method and system

Publications (2)

Publication Number Publication Date
CN117113988A (en) 2023-11-24
CN117113988B (en) 2024-06-07

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 A kind of detection method and detection system of network sensitive content
US10878124B1 (en) * 2017-12-06 2020-12-29 Dataguise, Inc. Systems and methods for detecting sensitive information using pattern recognition
CN113988061A (en) * 2021-10-22 2022-01-28 平安国际智慧城市科技股份有限公司 Sensitive word detection method, device and equipment based on deep learning and storage medium
CN114298039A (en) * 2021-11-19 2022-04-08 马上消费金融股份有限公司 Sensitive word recognition method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
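小陈IT: "Machine Learning MATLAB Code: IWOA_BILSTM (BiLSTM Prediction Algorithm Optimized by an Improved Whale Algorithm) (Part 16)", pages 1 - 6, Retrieved from the Internet <URL:https://blog.csdn.net/weixin_44312889/article/details/128121895> *
张达敏 et al.: "Whale Optimization Algorithm Embedding Circle Mapping and Dimension-by-Dimension Pinhole-Imaging Opposition-Based Learning", 控制与决策 (Control and Decision), vol. 36, no. 5, 31 May 2021 (2021-05-31), pages 1173 - 1180 *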

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant