CN112364637B - Sensitive word detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112364637B
Authority
CN
China
Prior art keywords
word
sensitive
word sequence
sequence
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011384381.XA
Other languages
Chinese (zh)
Other versions
CN112364637A (en)
Inventor
潘季明
贾蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202011384381.XA priority Critical patent/CN112364637B/en
Publication of CN112364637A publication Critical patent/CN112364637A/en
Application granted granted Critical
Publication of CN112364637B publication Critical patent/CN112364637B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a sensitive word detection method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring words to be detected in a text of data to be detected; determining the language of each word to be detected; screening, from a preset sensitive word library, first sensitive words of the same language as the word to be detected, wherein the preset sensitive word library stores sensitive words of different languages; and calculating the similarity between the word to be detected and the first sensitive words. In this way, sensitive word detection can be performed for words to be detected in different languages; compared with prior-art approaches that can only detect words of a single language in a text, the detection range is more complete, and the flexibility and reliability are higher.

Description

Sensitive word detection method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of network security, in particular to a sensitive word detection method and device, electronic equipment and a storage medium.
Background
With the rapid development of information technology, computers and networks have become necessary tools and approaches for daily office work, communication and collaborative interaction. Data security is receiving increasing attention as an important topic in the field of information security. When various data are required to be transmitted through a network and processed through a terminal, whether the service data to be transmitted or processed currently are sensitive or not needs to be judged, and whether the service data can be subjected to various operations such as network transmission or the like is determined according to the sensitivity degree and the management strategy. Sensitive data identification is a very critical ring in sensitive data anti-leakage solutions, and only sensitive data can be effectively protected if the sensitive data is accurately identified.
However, the sensitive data that different industries need to filter differ, and even the sensitive words of different enterprises in the same industry differ. Existing sensitive word detection methods can only detect words of a single language in a text, so the detection range is small and the flexibility and reliability are low.
Disclosure of Invention
The embodiments of the application aim to provide a sensitive word detection method and device, electronic equipment and a storage medium, so as to solve the problems that existing sensitive word detection methods can only detect words of a single language in a text, the detection range is small, and the flexibility and reliability are low.
The invention is realized in the following way:
in a first aspect, an embodiment of the present application provides a method for detecting a sensitive word, including: acquiring words to be detected in a text of data to be detected; determining the language of the word to be tested; screening a first sensitive word with the same language as the word to be detected from a preset sensitive word stock; wherein, the preset sensitive word library stores sensitive words of different languages; and calculating the similarity between the word to be detected and the first sensitive word.
In this embodiment of the application, by identifying the language of each word to be tested, first sensitive words of the same language are screened from the preset sensitive word library, and the similarity between the word to be tested and the first sensitive words is then calculated. In this way, sensitive word detection can be performed for words to be tested in different languages, so the detection range is more complete and the flexibility and reliability are higher.
With reference to the foregoing technical solution provided in the first aspect, in some possible implementation manners, constructing the preset sensitive word library includes: acquiring an original data text; converting the original data text into word sequences, wherein each word sequence is a sensitive word; obtaining the language of each word sequence; inputting each word sequence into a language model corresponding to the language of the word sequence to obtain near-synonyms and/or synonyms of the word sequence; and constructing the preset sensitive word library based on the word sequences and the near-synonyms and/or synonyms of each word sequence.
In the embodiment of the application, the near-synonyms and/or synonyms of the word sequences are obtained by inputting word sequences of different languages into the corresponding language models, and the number of words in the preset sensitive word library is expanded in this way. The method also expands the sensitive words of different languages, facilitating subsequent sensitive word detection for words to be tested of different languages.
With reference to the foregoing technical solution provided in the first aspect, in some possible implementation manners, the converting the original data text into word sequences includes: performing word segmentation on the original data text to obtain first word sequences; calculating a characteristic value of each first word sequence through a feature selection algorithm; and taking the first word sequences whose characteristic values are larger than a preset threshold as the word sequences.
In the embodiment of the application, the first word sequence is obtained by word segmentation processing of the original data text; and calculating the characteristic value of the first word sequence through a characteristic selection algorithm, and taking the first word sequence with the characteristic value larger than a preset threshold value as the word sequence. By the method, the accuracy and the rationality of the subsequent construction of the preset sensitive word stock are improved.
With reference to the foregoing technical solution provided in the first aspect, in some possible implementation manners, after the performing word segmentation on the original data text to obtain a first word sequence, the method further includes: clustering the original data text according to text content; correspondingly, the calculating the feature value of the first word sequence through the feature selection algorithm comprises the following steps: and respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm.
In the embodiment of the application, the original data text is clustered according to the text content; and then, respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm. In this way, the calculation of the characteristic value is more accurate.
With reference to the foregoing technical solution provided in the first aspect, in some possible implementation manners, after the calculating of the similarity between the word to be tested and the first sensitive words, the method further includes: outputting those first sensitive words whose similarity to the word to be tested is greater than a preset sensitive threshold.
In the embodiment of the application, after the similarity between the word to be tested and the first sensitive words is calculated, the first sensitive words whose similarity to the word to be tested is greater than the preset sensitive threshold are output. In this way, a user can intuitively see whether the text of the data to be tested contains sensitive words.
With reference to the foregoing technical solution provided in the first aspect, in some possible implementation manners, after the calculating of the similarity between the word to be tested and the first sensitive words, the method further includes: outputting the top N sensitive words among the first sensitive words, ordered by similarity to the word to be tested from largest to smallest.
In the embodiment of the application, after the similarity between the word to be tested and the first sensitive words is calculated, the top N sensitive words among the first sensitive words, ordered by similarity to the word to be tested from largest to smallest, are output. In this way, a user can intuitively understand the content related to the text of the data to be tested, which facilitates analysis and management of the text.
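Both output strategies above (threshold filtering and top-N ranking) reduce to a sort over the similarity scores. A minimal sketch (the score values and word names are invented for illustration):

```python
# Hypothetical similarity scores between one word to be tested and the first sensitive words.
scores = {"network": 0.93, "internet": 0.71, "chapter": 0.10, "history": 0.05}

def over_threshold(scores, threshold):
    """Output the sensitive words whose similarity exceeds a preset sensitive threshold."""
    return [w for w, s in scores.items() if s > threshold]

def top_n(scores, n):
    """Output the top N sensitive words, ordered by similarity from largest to smallest."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]]
```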
With reference to the foregoing technical solution of the first aspect, in some possible implementation manners, the calculating a similarity between the word to be detected and the first sensitive word includes: calculating the similarity between the word to be detected and the first sensitive word through a similarity algorithm corresponding to the language of the word to be detected; wherein, the similarity algorithm corresponding to different languages is different.
In the embodiment of the application, the similarity between the word to be detected and the first sensitive word is calculated through the similarity algorithm corresponding to the language of the word to be detected, so that the accuracy of similarity calculation between the word to be detected and the first sensitive word in different languages is improved, and the accuracy of sensitive word detection is further improved.
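Since the similarity algorithm differs per language, a dispatch table keyed by language is one natural realization. The specific algorithm choices below (character-set Jaccard for Chinese, normalized edit distance for English) are illustrative assumptions, not the patent's prescribed algorithms:

```python
def jaccard_chars(a, b):
    """Character-set Jaccard similarity, a common choice for short Chinese words."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def edit_similarity(a, b):
    """1 minus normalized Levenshtein distance, a common choice for English words."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # one-row dynamic programming table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (a[i - 1] != b[j - 1]))  # substitution
    return 1 - dp[n] / max(m, n)

# Different languages use different similarity algorithms.
SIMILARITY_BY_LANGUAGE = {"zh": jaccard_chars, "en": edit_similarity}

def similarity(word, sensitive_word, lang):
    return SIMILARITY_BY_LANGUAGE[lang](word, sensitive_word)
```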
In a second aspect, an embodiment of the present application provides a method for constructing a sensitive word library, including: acquiring an original data text; converting the original data text into word sequences, wherein each word sequence is a sensitive word; obtaining the language of each word sequence; inputting each word sequence into a language model corresponding to the language of the word sequence to obtain near-synonyms and/or synonyms of the word sequence; and constructing a sensitive word library based on the word sequences and the near-synonyms and/or synonyms of each word sequence.
In a third aspect, an embodiment of the present application provides a sensitive word detection apparatus, including: the acquisition module is used for acquiring words to be detected in the text of the data to be detected; the determining module is used for determining the language of the word to be detected; the screening module is used for screening first sensitive words with the same language as the words to be tested from a preset sensitive word stock; wherein, the preset sensitive word library stores sensitive words of different languages; and the calculating module is used for calculating the similarity between the word to be detected and the first sensitive word.
In a fourth aspect, an embodiment of the present application provides a sensitive word library construction device, including: a first acquisition module, used for acquiring the original data text; a conversion module, used for converting the original data text into word sequences, wherein each word sequence is a sensitive word; a second acquisition module, used for obtaining the language of each word sequence; a processing module, used for inputting each word sequence into a language model corresponding to the language of the word sequence to obtain near-synonyms and/or synonyms of the word sequence; and a construction module, used for constructing a sensitive word library based on the word sequences and the near-synonyms and/or synonyms of each word sequence.
In a fifth aspect, embodiments of the present application provide an electronic device, including: the device comprises a processor and a memory, wherein the processor is connected with the memory; the memory is used for storing programs; the processor is configured to invoke a program stored in the memory, perform a method as provided by the embodiments of the first aspect described above, and/or perform a method as provided by the embodiments of the second aspect described above.
In a sixth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program which, when executed by a processor, performs a method as provided in the embodiments of the first aspect described above, and/or performs a method as provided in the embodiments of the second aspect described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 2 is a flowchart of steps of a method for detecting a sensitive word according to an embodiment of the present application.
Fig. 3 is a flowchart of a step of constructing a preset sensitive word stock according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a process for determining a word sequence according to an embodiment of the present application.
Fig. 5 is a block diagram of a sensitive word detection device according to an embodiment of the present application.
Fig. 6 is a block diagram of a sensitive word stock construction device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In view of the fact that existing sensitive word detection methods can only detect words of a single language in a text, and therefore suffer from a small detection range, low flexibility and low reliability, the inventor of the application provides the following embodiments, arrived at through research and exploration, to solve these problems.
Referring to fig. 1, a schematic block diagram of an electronic device 100 applying a sensitive word detection method and/or a sensitive word stock construction method is provided in an embodiment of the present application. In the present embodiment, the electronic device 100 may be, but is not limited to, a server, a personal computer (Personal Computer, PC), a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), or the like. Structurally, the electronic device 100 may include a processor 110 and a memory 120.
The processor 110 is electrically connected to the memory 120, directly or indirectly, to enable data transmission or interaction; for example, these elements may be electrically connected to each other via one or more communication buses or signal lines. The sensitive word detection device and/or the sensitive word library construction device comprise at least one software module, which may be stored in the memory 120 in the form of software or firmware, or embedded in the operating system (OS) of the electronic device 100. The processor 110 is configured to execute executable modules stored in the memory 120, for example, to execute the software function modules and computer programs included in the sensitive word detection device so as to implement the sensitive word detection method, and to execute the software function modules and computer programs included in the sensitive word library construction device so as to implement the sensitive word library construction method. The processor 110 may execute the computer program after receiving the execution instructions.
The processor 110 may be an integrated circuit chip with signal processing capability. The processor 110 may also be a general-purpose processor, for example, a central processing unit (Central Processing Unit, CPU), digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), discrete gate or transistor logic, discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. Further, the general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), or an Electrically Erasable Programmable Read-Only Memory (EEPROM). The memory 120 is used for storing a program, and the processor 110 executes the program after receiving an execution instruction.
It should be understood that the configuration shown in fig. 1 is merely illustrative, and the electronic device 100 provided in the embodiment of the present application may also have fewer or more components than those shown in fig. 1, or may have a different configuration than that shown in fig. 1. In addition, the components shown in fig. 1 may be implemented by software, hardware, or a combination thereof.
Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a method for detecting a sensitive word according to an embodiment of the present application, where the method is applied to the electronic device 100 shown in fig. 1. It should be noted that, the method for detecting a sensitive word provided in the embodiment of the present application is not limited by the sequence shown in fig. 2 and the following description, and the method includes: step S101 to step S104.
Step S101: and obtaining the words to be tested in the text of the data to be tested.
Step S102: and determining the language of the word to be tested.
Step S103: screening a first sensitive word with the same language as the word to be detected from a preset sensitive word stock; wherein the preset sensitive word library stores sensitive words of different languages.
Step S104: and calculating the similarity between the word to be detected and the first sensitive word.
That is, after receiving the text of the data to be tested, the electronic device obtains the words to be tested in the text, determines the language of each word to be tested, and then screens out, from the preset sensitive word library, the first sensitive words of the same language as each word to be tested. For example, if the word to be tested is the Chinese word "网络", all Chinese sensitive words are screened out from the preset sensitive word library as first sensitive words; if the word to be tested is the English word "network", all English sensitive words are screened out from the preset sensitive word library as first sensitive words. It should be noted that the preset sensitive word library is constructed from words of different languages, so it stores sensitive words of different languages. Finally, the electronic device calculates the similarity between each word to be tested and the corresponding sensitive words, thereby realizing sensitive word detection.
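Steps S101 to S104 can be sketched as follows. This is a minimal illustration only: the toy word library, the script-based language guess, and the edit-ratio similarity are assumptions for demonstration, not the patent's prescribed components:

```python
import difflib

# Toy preset sensitive word library, keyed by language (assumed structure).
SENSITIVE_WORDS = {
    "zh": ["网络", "病毒"],
    "en": ["network", "virus"],
}

def detect_language(word):
    """Crude language guess: any CJK character means Chinese, otherwise English."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in word) else "en"

def detect_sensitive(word):
    """Steps S102-S104: determine language, filter the library, score similarity."""
    lang = detect_language(word)                        # step S102
    candidates = SENSITIVE_WORDS.get(lang, [])          # step S103
    return {w: difflib.SequenceMatcher(None, word, w).ratio()
            for w in candidates}                        # step S104

scores = detect_sensitive("networks")
```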
In this embodiment of the application, by identifying the language of each word to be tested, first sensitive words of the same language are screened from the preset sensitive word library, and the similarity between the word to be tested and the first sensitive words is then calculated. In this way, sensitive word detection can be performed for words to be tested in different languages, so the detection range is more complete and the flexibility and reliability are higher.
To facilitate explanation of the scheme, the method for constructing the preset sensitive word library used in step S103 is explained first. Referring to fig. 3, fig. 3 is a possible flowchart for constructing the preset sensitive word library; the construction steps include: step S201 to step S205.
Step S201: and acquiring the original data text.
When a preset sensitive word stock is constructed, firstly, the original data text containing sensitive words needs to be acquired. The original data text may be data text obtained by a crawler, such as by the crawler automatically crawling information on the world wide web, resulting in some sensitive data text. The original data text may also be a data text collected by a network device, such as a data text intercepted by a firewall, and of course, the original data text may also be a data text specified by a user, such as a user pre-screening some sensitive word sets as the original data text. The present application is not limited thereto.
Step S202: converting the original data text into a word sequence; wherein the word sequence is a sensitive word.
After the original data text is obtained, it is converted into a word sequence. Specifically, the original data text can be processed through a word segmentation tool and an invalid word filtering tool to obtain a word sequence. The word sequence can be used as an original sensitive word for constructing a preset sensitive word bank subsequently.
It should be noted that the word segmentation tool splits the sentences in the text into terms; for example, "obtain the original data text" may be processed by the word segmentation tool to obtain the four terms "obtain", "original", "data" and "text". The word segmentation tool may be, but is not limited to, the jieba word segmentation tool or the NLPIR word segmentation tool. The invalid word filtering tool removes invalid words from the original data text, such as common modal particles, adverbs, conjunctions, prepositions, and the like. Of course, the invalid word filtering tool can also filter numbers, characters and punctuation in the original data text. By way of example, if the original data text "obtain the original data text" is processed by the word segmentation tool and the invalid word filtering tool, the four word sequences "obtain", "original", "data" and "text" can be obtained.
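Off-the-shelf segmenters such as jieba or NLPIR would normally perform the segmentation step. As a dependency-free stand-in, the sketch below uses simple regex tokenization with a small invented invalid-word list, only to illustrate the segmentation-plus-filtering step:

```python
import re

# Small illustrative invalid-word list (articles, conjunctions, prepositions...).
INVALID_WORDS = {"the", "a", "of", "and", "to", "in"}

def to_word_sequences(text):
    """Segment text and drop invalid words; digits and punctuation never match."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())  # segmentation stand-in
    return [t for t in tokens if t not in INVALID_WORDS]

words = to_word_sequences("Obtain the original data text")
```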
As an alternative implementation manner, in order to improve the accuracy and rationality of the subsequent construction of the preset sensitive word stock. In converting the original data text into a word sequence, the word sequence may be screened once. Specifically, the step of converting the original data text into a word sequence includes: word segmentation processing is carried out on the original data text to obtain a first word sequence; calculating a characteristic value of the first word sequence through a characteristic selection algorithm; the first word sequence with the characteristic value larger than the preset threshold value is a word sequence.
The feature selection algorithm described above may be, but is not limited to, a desired cross entropy algorithm, an Information Gain (IG) algorithm.
For example, take the expected cross entropy algorithm: expected cross entropy may be used to measure the importance of a word to the corpus as a whole. The specific formula is as follows:
ECE(t) = P(t) × Σi P(Ci|t) × log( P(Ci|t) / P(Ci) ) (1)
In formula (1), t represents a word sequence; Ci represents the i-th category (when there is only 1 text, only 1 category is included); P(t) represents the frequency with which the word sequence appears in the documents; P(Ci|t) represents the text frequency of category Ci under the condition that the word sequence occurs; P(Ci) represents the frequency of category Ci; and ECE(t) represents the expected cross entropy of the word sequence t.
The above-mentioned preset threshold is a specific value, and may be set according to different situations, for example, may be set to 0.02, 0.04, 0.1, etc.
For example, suppose that after calculating the characteristic values of the first word sequences through the feature selection algorithm, the results are "novel 0.084, network 0.048, chapter 0.022, history 0.0005". If the preset threshold is 0.02, the first word sequences whose characteristic values are greater than the preset threshold, namely "novel", "network" and "chapter", are the word sequences obtained after screening. That is, in this manner, the word sequences most representative of the original data text are screened out. For example, if a user takes an article introducing a game as the original data text, the word sequences most relevant to the article can be screened out through the feature selection algorithm, so that the preset sensitive word library can be constructed from the screened word sequences.
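A toy computation of expected cross entropy followed by threshold filtering, matching formula (1); the corpus, its class labels, and the threshold value are invented for illustration:

```python
import math

def expected_cross_entropy(docs_by_class, term):
    """ECE(t) = P(t) * sum_i P(Ci|t) * log(P(Ci|t) / P(Ci))."""
    total = sum(len(docs) for docs in docs_by_class.values())
    docs_with_t = {c: sum(term in doc for doc in docs)
                   for c, docs in docs_by_class.items()}
    n_t = sum(docs_with_t.values())
    if n_t == 0:
        return 0.0
    p_t = n_t / total
    ece = 0.0
    for c, docs in docs_by_class.items():
        p_ci = len(docs) / total
        p_ci_t = docs_with_t[c] / n_t
        if p_ci_t > 0:
            ece += p_ci_t * math.log(p_ci_t / p_ci)
    return p_t * ece

# Toy corpus: two classes of documents, each document a set of word sequences.
corpus = {
    "science": [{"world", "science"}, {"science", "data"}],
    "novel":   [{"novel", "chapter"}, {"novel", "network"}],
}

def select_words(corpus, threshold):
    """Keep only word sequences whose characteristic value exceeds the threshold."""
    vocab = set().union(*(doc for docs in corpus.values() for doc in docs))
    return {t for t in vocab if expected_cross_entropy(corpus, t) > threshold}
```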
Optionally, in order to make the calculation of the feature value more accurate, after the word segmentation processing is performed on the original data text to obtain the first word sequence, the method further includes: and clustering the original data text according to the text content.
It should be noted that a process of dividing a collection of physical or abstract objects into a plurality of classes composed of similar objects is called clustering. Clusters generated by a cluster are a collection of data objects that are similar to objects in the same cluster, and are different from objects in other clusters. In the embodiment of the application, the clustering can adopt an algorithm including but not limited to a K-means clustering algorithm and a mean shift clustering.
For example, take the K-means clustering algorithm, an iteratively solved cluster analysis algorithm. The data are pre-divided into K groups: K objects are randomly selected as initial cluster centers, the distance between each object and each seed cluster center is calculated, and each object is assigned to the closest cluster center. The cluster centers and the objects assigned to them represent a cluster. Each time a sample is assigned, the cluster center of the cluster is recalculated based on the objects currently in the cluster. This process repeats until a termination condition is met. The termination condition may be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change again, or that the sum of squared errors reaches a local minimum. It should be noted that, in this embodiment, the text content of the original data texts is clustered; for example, after clustering, the texts related to network information form one cluster and the texts related to game information form another cluster.
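The K-means procedure described above can be sketched compactly. A library implementation (e.g. scikit-learn's KMeans over text feature vectors) would normally be used; the toy 2-D points and k=2 below are illustrative:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: pick random initial centers, assign, recompute, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:  # termination: centers no longer change
            break
        centers = new_centers
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
clusters = kmeans(pts, 2)
```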
If the above manner is adopted to cluster the original data text, correspondingly, calculating the feature value of the first word sequence through a feature selection algorithm, including: and respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm.
That is, the first word sequence is divided into a plurality of class clusters, and the characteristic value of the first word sequence in the class cluster is calculated in different class clusters, so that the characteristic value is calculated more accurately.
The above process will be described with reference to fig. 4. The original data texts shown in fig. 4 include: 4110788.txt, 411088.txt, 697638.txt, and 892479.txt. After the four original data texts are obtained, word segmentation is first performed on them to obtain the first word sequences. The word segmentation results can be seen in lines 7-10 of fig. 4; for example, word segmentation of "world science" in 4110788.txt yields the two words "world" and "science". After word segmentation is completed, the four original data texts are clustered according to text content; the clustering results can be seen in lines 12-13 of fig. 4, where 4110788.txt and 411088.txt form cluster 1 (related to science) and 697638.txt and 892479.txt form cluster 2 (related to novels). Finally, the characteristic value of each first word sequence within each cluster is calculated using the feature selection algorithm described above, and the word sequences are obtained by feature screening (for example, with the threshold set to 0.02); the finally obtained word sequences can be seen in lines 16-17 of fig. 4.
Step S203: the language of each word sequence is obtained.
After the original data text is converted into word sequences, the language of each word sequence is obtained; for example, a fasttext language detection model can be used to detect the language of each word sequence.
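The fasttext detector the patent names requires a separately downloaded model file (typically `lid.176.bin`, loaded with `fasttext.load_model(...)` and queried with `model.predict(...)`), so the self-contained stand-in below uses a simple character-range heuristic just to illustrate the step of tagging each word sequence with a language before routing it to a language-specific model. It distinguishes only Chinese from English and is not the detector the patent uses:

```python
# Hypothetical stand-in for a language-detection model: tag a word as Chinese
# if it contains any CJK Unified Ideograph (U+4E00..U+9FFF), else English.
def detect_language(word: str) -> str:
    if any('\u4e00' <= ch <= '\u9fff' for ch in word):
        return "zh"
    return "en"
```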
Step S204: and inputting each word sequence into a language model corresponding to the language of the word sequence to obtain the paraphrasing and/or synonyms of the word sequence.
After the language of each word sequence is determined, the corresponding language model is selected according to that language (such as the BERT pre-trained language models for different languages) to obtain the paraphrases and/or synonyms of each word sequence. To avoid duplication, the paraphrases and/or synonyms of the word sequences may be merged and duplicate words removed.
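The dedup step can be sketched as a set-based merge. The dictionary layout (word sequence mapped to the list of paraphrases/synonyms returned for it) is an assumption for illustration, not a structure the patent specifies:

```python
# Merge the paraphrase/synonym expansions of all word sequences into a single
# duplicate-free list, keeping first-seen order.
def merge_expansions(expansions):
    """expansions: dict mapping word -> list of its paraphrases/synonyms."""
    seen, merged = set(), []
    for word, related in expansions.items():
        for w in [word, *related]:
            if w not in seen:
                seen.add(w)
                merged.append(w)
    return merged
```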
Step S205: and constructing the preset sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence.
Finally, a preset sensitive word stock is constructed based on the word sequences and the paraphrases and/or synonyms of each word sequence. That is, the constructed preset sensitive word stock includes the word sequences obtained in the above manner together with the paraphrases and/or synonyms corresponding to each word sequence. Here too, to avoid duplication, the word sequences and the paraphrases and/or synonyms of each word sequence may be merged and duplicate words removed.
In summary, in the embodiment of the present application, word sequences of different languages are input into the corresponding language models to obtain their synonyms and/or paraphrases, thereby expanding the number of words in the preset sensitive word stock. The method also expands sensitive words across different languages, which facilitates subsequent sensitive word detection on words to be tested in different languages.
It should be noted that, the preset sensitive word library may be directly constructed only according to the sensitive word set of each language selected by the user, which is not limited in this application.
The flow and steps of the sensitive word detection method shown in fig. 2 are described in detail below with reference to specific examples.
Step S101: and obtaining the words to be tested in the text of the data to be tested.
In the actual detection process, after the electronic device detects the text of the data to be tested, it first obtains the words to be tested in that text; that is, the text of the data to be tested needs to be converted into individual words to be tested. Specifically, the text of the data to be tested can be processed through a word segmentation tool and an invalid-word filtering tool to obtain the words to be tested. It should be noted that the word segmentation tool and the invalid-word filtering tool have been described in the foregoing embodiments and are not repeated here.
Step S102: and determining the language of the word to be tested.
After the text of the data to be tested is converted into words to be tested, the language of each word to be tested is obtained; for example, a fasttext language detection model can be used to detect the language of each word to be tested.
Step S103: screening a first sensitive word with the same language as the word to be detected from a preset sensitive word stock; wherein the preset sensitive word library stores sensitive words of different languages.
After the language of each word to be tested is determined, first sensitive words in the same language as the word to be tested are screened out from the preset sensitive word stock. For example, when the word to be tested is "novel" (Chinese), all Chinese sensitive words in the preset sensitive word stock are selected as first sensitive words, such as "files and texts" in the preset sensitive word stock; and when the word to be tested is "work" (English), all English sensitive words in the preset sensitive word stock are selected as first sensitive words, such as "files and apply" in the preset sensitive word stock.
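The screening step can be sketched as a simple language filter. Storing the word stock as (word, language-tag) pairs is an assumption for illustration; the patent does not specify how the stock is laid out:

```python
# Keep only the sensitive words whose language tag matches the language of
# the word under test; these become the "first sensitive words".
def screen_first_sensitive_words(word_bank, language):
    """word_bank: iterable of (word, language_tag) pairs."""
    return [w for w, lang in word_bank if lang == language]
```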
Step S104: and calculating the similarity between the word to be detected and the first sensitive word.
After the first sensitive words in the same language as the word to be tested are obtained, the similarity between the word to be tested and each first sensitive word can be calculated. The similarity may be calculated by a similarity algorithm such as the Euclidean distance algorithm or the cosine similarity algorithm, which is not limited in this application.
In order to improve accuracy of similarity calculation between the words to be detected and the first sensitive words in different languages, in the embodiment of the application, the similarity between the words to be detected and the first sensitive words is calculated through a similarity algorithm corresponding to the languages of the words to be detected; wherein, the similarity algorithm corresponding to different languages is different.
For example, the similarity for a Chinese word to be tested can be calculated with a cosine similarity algorithm, while the similarity for an English word to be tested can be calculated with the Jaro-Winkler similarity.
Specifically, when the word to be tested is detected as Chinese, all Chinese sensitive words (namely, the first sensitive words) in the preset sensitive word stock are obtained, the BERT Chinese model is used to vectorize the word to be tested and the first sensitive words, and then the similarity between the word to be tested and each first sensitive word is calculated with the cosine similarity algorithm. The cosine similarity algorithm is formulated as follows:
similarity = (Σ_i A_i·B_i) / (√(Σ_i A_i²) · √(Σ_i B_i²))   (2)

In formula (2), A_i denotes the i-th component of the vector of the word A to be tested, and B_i denotes the i-th component of the vector of the first sensitive word B; similarity denotes the cosine similarity between the word A to be tested and the first sensitive word B.
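Formula (2) translates directly into code; plain numeric lists stand in here for the BERT word vectors the patent uses:

```python
# Cosine similarity between two equal-length vectors, per formula (2).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```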
Specifically, when the word to be tested is detected as English, all English sensitive words (namely, the first sensitive words) in the preset sensitive word stock are obtained, and the similarity between the word to be tested and each first sensitive word is calculated with the Jaro-Winkler similarity algorithm. The Jaro-Winkler similarity is formulated as follows:
sim_j = (1/3)·(m/|s1| + m/|s2| + (m − t)/m), with sim_j = 0 when m = 0

sim_w = sim_j + l·p·(1 − sim_j)   (3)
In formula (3), s1 and s2 denote the character strings of the word to be tested and of the first sensitive word respectively (|s1| and |s2| their lengths), m denotes the number of matched characters of the two strings, and t denotes half the number of transpositions; l denotes the number of common prefix characters of the two strings (at most 4), and p is a scaling-factor constant describing the contribution of the common prefix to the similarity: the larger p is, the greater the prefix weight. p must not exceed 0.25 and defaults to 0.1.
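A self-contained implementation of formula (3), using the defaults stated above (p = 0.1, common prefix capped at 4):

```python
# Jaro similarity: count characters matching within a sliding window, then
# count transpositions among the matched characters.
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1 = [False] * len(s1)
    m2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):                       # count matched characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                      # t = transpositions / 2
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

# Jaro-Winkler: boost the Jaro score by the shared prefix, per formula (3).
def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    sim_j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == 4:                         # common prefix, max 4
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)
```

For the classic pair "MARTHA"/"MARHTA", m = 6, t = 1 and the common prefix is "MAR", giving sim_j ≈ 0.944 and sim_w ≈ 0.961.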
In summary, in the embodiment of the application, the similarity between the word to be detected and the first sensitive word is calculated through the similarity algorithm corresponding to the language of the word to be detected, so that the accuracy of similarity calculation between the word to be detected and the first sensitive word in different languages is improved, and the accuracy of sensitive word detection is further improved.
Optionally, after calculating the similarity between the word to be tested and the first sensitive words in step S104, the method further includes: outputting those first sensitive words whose similarity with the word to be tested is greater than a preset sensitive threshold.
The preset sensitive threshold may be determined according to the actual situation; for example, it may be 40%, 60%, 0.5, 0.7, etc. That is, by setting a preset sensitive threshold, whether the text of the data to be tested contains sensitive words can be checked conveniently. For example, with the preset sensitive threshold set to 0.6, if the above steps yield a first sensitive word whose similarity with the word to be tested is greater than 0.6, that first sensitive word is output; if no such first sensitive word is found, 0 is output, meaning that the currently tested text contains no sensitive words. In this way, the user can see intuitively whether the text of the data to be tested contains sensitive words.
Optionally, after calculating the similarity between the word to be tested and the first sensitive words in step S104, the method further includes: outputting the top N first sensitive words when sorted by similarity with the word to be tested in descending order.
The above N may be determined according to the actual situation; for example, N may be 3, 5, 6, 8, etc. Outputting the N first sensitive words most similar to the word to be tested lets the user see what the current text of the data to be tested relates to: the N output words can be understood as the sensitive words closest to that text, so the user can intuitively grasp its content and conveniently analyze and manage it.
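The two optional output modes above can be sketched as follows; `scored`, mapping each first sensitive word to its computed similarity, is an illustrative structure, not one the patent defines:

```python
# Output mode 1: every first sensitive word above a preset sensitive threshold.
def over_threshold(scored, threshold=0.6):
    return [w for w, s in scored.items() if s > threshold]

# Output mode 2: the top-N first sensitive words by descending similarity.
def top_n(scored, n=3):
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:n]]
```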
Based on the same concept, the embodiment of the application also provides a sensitive word stock construction method, which can be independently applied to the electronic device 100 shown in fig. 1. The method comprises the following steps: acquiring an original data text; converting the original data text into a word sequence; wherein the word sequence is a sensitive word; obtaining the language of each word sequence; inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a paraphrasing and/or synonym of the word sequence; and constructing a sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence.
It should be noted that, the method for constructing the sensitive word stock provided in the embodiment of the present application is the same as the method for constructing the preset sensitive word stock in the foregoing embodiment, so specific processes of the method for constructing the sensitive word stock may refer to specific processes of the method for constructing the preset sensitive word stock in the foregoing embodiment, and in order to avoid redundancy, repeated description is not made here.
Referring to fig. 5, based on the same inventive concept, the embodiment of the present application further provides a sensitive word detection device 200, which includes an obtaining module 201, a determining module 202, a screening module 203, and a calculating module 204.
The obtaining module 201 is configured to obtain a word to be tested in the text of the data to be tested.
And the determining module 202 is used for determining the language of the word to be tested.
The screening module 203 is configured to screen a first sensitive word having the same language as the word to be tested from a preset sensitive word stock; wherein the preset sensitive word library stores sensitive words of different languages.
The calculating module 204 is configured to calculate a similarity between the word to be detected and the first sensitive word.
In the embodiment of the present application, each module uses Spark (a computing engine) as the distributed computing framework.
Referring to fig. 6, based on the same inventive concept, the embodiment of the present application further provides a sensitive word stock construction device 300, including: a first acquisition module 301, a conversion module 302, a second acquisition module 303, a processing module 304, and a construction module 305.
A first obtaining module 301, configured to obtain an original data text.
A conversion module 302, configured to convert the original data text into a word sequence; wherein the word sequence is a sensitive word.
A second obtaining module 303, configured to obtain the language of each word sequence.
The processing module 304 is configured to input each word sequence into a language model corresponding to a language of the word sequence, so as to obtain a paraphrase and/or a synonym of the word sequence.
A construction module 305, configured to construct a sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
Based on the same inventive concept, the present application also provides a storage medium having stored thereon a computer program which, when executed, performs the method provided in the above embodiments.
The storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for detecting a sensitive word, comprising:
acquiring words to be detected in a text of data to be detected;
determining the language of the word to be tested;
screening a first sensitive word with the same language as the word to be detected from a preset sensitive word stock; wherein, the preset sensitive word library stores sensitive words of different languages;
calculating the similarity between the word to be detected and the first sensitive word;
the preset sensitive word stock is constructed through the following steps:
acquiring an original data text;
converting the original data text into a word sequence; wherein the word sequence is a sensitive word;
obtaining the language of each word sequence;
inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a paraphrasing and/or synonym of the word sequence;
constructing the preset sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence;
the converting the original data text into word sequence includes:
word segmentation is carried out on the original data text to obtain a first word sequence;
calculating a characteristic value of the first word sequence through a characteristic selection algorithm, wherein the characteristic selection algorithm is an expected cross entropy algorithm;
the first word sequence with the characteristic value larger than a preset threshold value is the word sequence.
2. The method for detecting sensitive words according to claim 1, wherein after said word segmentation of said original data text to obtain a first word sequence, said method further comprises:
clustering the original data text according to text content;
correspondingly, the calculating the feature value of the first word sequence through the feature selection algorithm comprises the following steps:
and respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm.
3. The method for detecting a sensitive word according to claim 1, further comprising, after said calculating a similarity between the word to be detected and the first sensitive word:
and outputting the sensitive words, of which the similarity with the word to be detected is larger than a preset sensitive threshold, in the first sensitive words.
4. The method for detecting a sensitive word according to claim 1, further comprising, after said calculating a similarity between the word to be detected and the first sensitive word:
and outputting the first N sensitive words with the similarity to the words to be tested in the first sensitive words sequenced from big to small.
5. The method for detecting a sensitive word according to claim 1, wherein the calculating the similarity between the word to be detected and the first sensitive word includes:
calculating the similarity between the word to be detected and the first sensitive word through a similarity algorithm corresponding to the language of the word to be detected; wherein, the similarity algorithm corresponding to different languages is different.
6. The sensitive word stock construction method is characterized by comprising the following steps:
acquiring an original data text;
converting the original data text into a word sequence; wherein the word sequence is a sensitive word;
obtaining the language of each word sequence;
inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a paraphrasing and/or synonym of the word sequence;
constructing a sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence;
the converting the original data text into word sequence includes:
word segmentation is carried out on the original data text to obtain a first word sequence;
calculating a characteristic value of the first word sequence through a characteristic selection algorithm, wherein the characteristic selection algorithm is an expected cross entropy algorithm;
the first word sequence with the characteristic value larger than a preset threshold value is the word sequence.
7. A sensitive word detection apparatus, comprising:
the acquisition module is used for acquiring words to be detected in the text of the data to be detected;
the determining module is used for determining the language of the word to be detected;
the screening module is used for screening first sensitive words with the same language as the words to be tested from a preset sensitive word stock; wherein, the preset sensitive word library stores sensitive words of different languages;
the computing module is used for computing the similarity between the word to be tested and the first sensitive word;
the preset sensitive word stock is constructed through the following steps:
acquiring an original data text;
converting the original data text into a word sequence; wherein the word sequence is a sensitive word;
obtaining the language of each word sequence;
inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a paraphrasing and/or synonym of the word sequence;
constructing the preset sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence;
the converting the original data text into word sequence includes:
word segmentation is carried out on the original data text to obtain a first word sequence;
calculating a characteristic value of the first word sequence through a characteristic selection algorithm, wherein the characteristic selection algorithm is an expected cross entropy algorithm;
the first word sequence with the characteristic value larger than a preset threshold value is the word sequence.
8. A sensitive word stock construction device, comprising:
the first acquisition module is used for acquiring the original data text;
the conversion module is used for converting the original data text into a word sequence; wherein the word sequence is a sensitive word;
the second acquisition module is used for acquiring the language of each word sequence;
the processing module is used for inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a paraphrase and/or a synonym of the word sequence;
the construction module is used for constructing a sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence;
the sensitive word stock is constructed by the following steps:
acquiring an original data text;
converting the original data text into a word sequence; wherein the word sequence is a sensitive word;
obtaining the language of each word sequence;
inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a paraphrasing and/or synonym of the word sequence;
constructing the sensitive word stock based on the word sequences and the paraphrasing and/or synonyms of each word sequence;
the converting the original data text into word sequence includes:
word segmentation is carried out on the original data text to obtain a first word sequence;
calculating a characteristic value of the first word sequence through a characteristic selection algorithm, wherein the characteristic selection algorithm is an expected cross entropy algorithm;
the first word sequence with the characteristic value larger than a preset threshold value is the word sequence.
9. An electronic device, comprising: the device comprises a processor and a memory, wherein the processor is connected with the memory;
the memory is used for storing programs;
the processor is configured to run a program stored in the memory, to perform the method according to any one of claims 1-5, and/or to perform the method according to claim 6.
10. A storage medium, characterized in that it has stored thereon a computer program which, when run by a computer, performs the method according to any of claims 1-5 and/or performs the method according to claim 6.
CN202011384381.XA 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium Active CN112364637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011384381.XA CN112364637B (en) 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011384381.XA CN112364637B (en) 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112364637A CN112364637A (en) 2021-02-12
CN112364637B true CN112364637B (en) 2024-02-09

Family

ID=74535855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011384381.XA Active CN112364637B (en) 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112364637B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076748B (en) * 2021-04-16 2024-01-19 平安国际智慧城市科技股份有限公司 Bullet screen sensitive word processing method, device, equipment and storage medium
CN113434775B (en) * 2021-07-15 2024-03-26 北京达佳互联信息技术有限公司 Method and device for determining search content
CN114707499B (en) * 2022-01-25 2023-10-24 中国电信股份有限公司 Sensitive word recognition method and device, electronic equipment and storage medium
CN117493532B (en) * 2023-12-29 2024-03-29 深圳智汇创想科技有限责任公司 Text processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528731A (en) * 2016-10-27 2017-03-22 新疆大学 Sensitive word filtering method and system
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN111506708A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Text auditing method, device, equipment and medium
CN111640420A (en) * 2020-06-10 2020-09-08 上海明略人工智能(集团)有限公司 Audio data processing method and device and storage medium
CN111859013A (en) * 2020-07-17 2020-10-30 腾讯音乐娱乐科技(深圳)有限公司 Data processing method, device, terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9430463B2 (en) * 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing


Also Published As

Publication number Publication date
CN112364637A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN112364637B (en) Sensitive word detection method and device, electronic equipment and storage medium
US10628507B2 (en) Analyzing concepts over time
US10095780B2 (en) Automatically mining patterns for rule based data standardization systems
CN111291070B (en) Abnormal SQL detection method, equipment and medium
CN112866023B (en) Network detection method, model training method, device, equipment and storage medium
CN111898366A (en) Document subject word aggregation method and device, computer equipment and readable storage medium
US10528609B2 (en) Aggregating procedures for automatic document analysis
Lathabai et al. Contextual productivity assessment of authors and journals: a network scientometric approach
US20100191753A1 (en) Extracting Patterns from Sequential Data
CN111177719A (en) Address category determination method, device, computer-readable storage medium and equipment
US11676231B1 (en) Aggregating procedures for automatic document analysis
CN113761867A (en) Address recognition method and device, computer equipment and storage medium
CN108021595A (en) Examine the method and device of knowledge base triple
CN115373982A (en) Test report analysis method, device, equipment and medium based on artificial intelligence
CN111639173B (en) Epidemic situation data processing method, device, equipment and storage medium
KR20220024251A (en) Method and apparatus for building event library, electronic device, and computer-readable medium
CN115051863A (en) Abnormal flow detection method and device, electronic equipment and readable storage medium
CN114490390A (en) Test data generation method, device, equipment and storage medium
CN113204954A (en) Data detection method and device based on big data and computer readable storage medium
CN112749258A (en) Data searching method and device, electronic equipment and storage medium
CN110083817B (en) Naming disambiguation method, device and computer readable storage medium
CN108009233B (en) Image restoration method and device, computer equipment and storage medium
CN110929033A (en) Long text classification method and device, computer equipment and storage medium
CN109086363B (en) File information maintenance degree determining method, device and equipment
CN111382244B (en) Deep retrieval matching classification method and device and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant