CN112364637A - Sensitive word detection method and device, electronic equipment and storage medium - Google Patents

Sensitive word detection method and device, electronic equipment and storage medium

Info

Publication number
CN112364637A
CN112364637A
Authority
CN
China
Prior art keywords
word
sensitive
detected
words
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011384381.XA
Other languages
Chinese (zh)
Other versions
CN112364637B (en)
Inventor
潘季明
贾蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202011384381.XA priority Critical patent/CN112364637B/en
Publication of CN112364637A publication Critical patent/CN112364637A/en
Application granted granted Critical
Publication of CN112364637B publication Critical patent/CN112364637B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms

Abstract

The application provides a sensitive word detection method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring words to be detected in a text of data to be detected; determining the language of the word to be detected; screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; sensitive words of different languages are stored in the preset sensitive word bank; and calculating the similarity between the word to be detected and the first sensitive word. In this way, sensitive word detection can be performed for words to be detected in different languages; compared with the prior art, which can only detect words of a single language in a text, the detection range is more complete and the flexibility and reliability are higher.

Description

Sensitive word detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a method and an apparatus for detecting a sensitive word, an electronic device, and a storage medium.
Background
With the rapid development of information technology, computers and networks have become essential tools and approaches for daily office work, communication and cooperative interaction. Data security is receiving increasing attention as an important issue in the field of information security. When various data needs to be transmitted and processed by a terminal, whether the current service data to be transmitted or processed is sensitive needs to be judged, and whether the service data can be subjected to various operations such as network transmission and the like is determined according to the sensitivity degree and a management strategy. Sensitive data identification is a very critical loop in sensitive data leakage prevention solutions, and only if sensitive data is accurately identified can the data be effectively protected.
However, the sensitive data that needs to be filtered differs across industries, and even different enterprises in the same industry have different sensitive words. The existing sensitive word detection approach can only detect words of a single language in a text, so its detection range is small and its flexibility and reliability are low.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for detecting a sensitive word, an electronic device, and a storage medium, so as to solve the problems that an existing sensitive word detection method can only detect words in a single language in a text, and has a small detection range and low flexibility and reliability.
The invention is realized by the following steps:
in a first aspect, an embodiment of the present application provides a sensitive word detection method, including: acquiring words to be detected in a text of data to be detected; determining the language of the word to be detected; screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; sensitive words of different languages are stored in the preset sensitive word bank; and calculating the similarity between the word to be detected and the first sensitive word.
In the embodiment of the application, the first sensitive words in the same language are screened out from the preset sensitive word library by identifying the language of the word to be detected, and then the similarity between the word to be detected and the first sensitive words is calculated.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the preset sensitive word library is constructed by the following steps: acquiring an original data text; converting the original data text into a word sequence; wherein the word sequence is a sensitive word; acquiring the language of each word sequence; inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a near synonym and/or a synonym of the word sequence; and constructing the preset sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
In the embodiment of the application, the synonyms and/or the near-synonyms of the word sequence are obtained by inputting the word sequences of different languages into the corresponding language models, and the number of words in the preset sensitive word bank is expanded by the method. And the method also realizes the expansion of the sensitive words of different languages, so that the sensitive word detection is conveniently carried out on the following words to be detected based on different languages.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the converting the original data text into a word sequence includes: performing word segmentation processing on the original data text to obtain a first word sequence; calculating a characteristic value of the first word sequence through a characteristic selection algorithm; and the first word sequence with the characteristic value larger than a preset threshold value is the word sequence.
In the embodiment of the application, a first word sequence is obtained by carrying out word segmentation processing on an original data text; and calculating a characteristic value of the first word sequence through a characteristic selection algorithm, and taking the first word sequence with the characteristic value larger than a preset threshold value as the word sequence. By the method, the accuracy and the reasonability of the subsequent construction of the preset sensitive word bank are improved.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, after performing word segmentation processing on the original data text to obtain a first word sequence, the method further includes: clustering the original data texts according to text contents; correspondingly, the calculating the feature value of the first word sequence by the feature selection algorithm includes: and respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm.
In the embodiment of the application, original data texts are clustered according to text contents; and respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm. By the method, the characteristic value is calculated more accurately.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, after the calculating the similarity between the word to be tested and the first sensitive word, the method further includes: and outputting the sensitive words with the similarity with the words to be detected larger than a preset sensitive threshold in the first sensitive words.
In the embodiment of the application, after the similarity between the word to be detected and the first sensitive word is calculated, the sensitive word in the first sensitive word, of which the similarity with the word to be detected is greater than the preset sensitive threshold value, is output. By the method, the user can intuitively know whether the text of the data to be detected contains the sensitive words.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, after the calculating the similarity between the word to be tested and the first sensitive word, the method further includes: and outputting the first N sensitive words in the first sensitive words, wherein the first N sensitive words are ranked from large to small in similarity with the words to be detected.
In the embodiment of the application, after the similarity between the word to be detected and the first sensitive word is calculated, the first N sensitive words in the first sensitive word, which are ranked from large to small in similarity with the word to be detected, are output. By the method, the user can visually know the content related to the to-be-detected data text, and the user can analyze and manage the to-be-detected data text conveniently.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the calculating the similarity between the word to be tested and the first sensitive word includes: calculating the similarity between the word to be detected and the first sensitive word by a similarity algorithm corresponding to the language of the word to be detected; wherein, the similarity calculation methods corresponding to different languages are different.
In the embodiment of the application, the similarity between the word to be detected and the first sensitive word is calculated by the similarity calculation method corresponding to the language of the word to be detected, so that the calculation accuracy of the similarity between the word to be detected and the first sensitive word in different languages is improved, and the detection accuracy of the sensitive word is further improved.
In a second aspect, an embodiment of the present application provides a sensitive word stock construction method, including: acquiring an original data text; converting the raw data text into a word sequence; wherein the word sequence is a sensitive word; acquiring the language of each word sequence; inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a near synonym and/or a synonym of the word sequence; and constructing a sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
In a third aspect, an embodiment of the present application provides a sensitive word detection apparatus, including: the acquisition module is used for acquiring words to be detected in the text of the data to be detected; the determining module is used for determining the language of the word to be detected; the screening module is used for screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; sensitive words of different languages are stored in the preset sensitive word bank; and the calculating module is used for calculating the similarity between the word to be detected and the first sensitive word.
In a fourth aspect, an embodiment of the present application provides a sensitive thesaurus construction apparatus, including: the first acquisition module is used for acquiring an original data text; the conversion module is used for converting the original data text into a word sequence; wherein the word sequence is a sensitive word; the second acquisition module is used for acquiring the language of each word sequence; the processing module is used for inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a near synonym and/or a synonym of the word sequence; and the construction module is used for constructing a sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory, the processor and the memory connected; the memory is used for storing programs; the processor is configured to invoke a program stored in the memory, to perform a method as provided in the embodiments of the first aspect described above, and/or to perform a method as provided in the embodiments of the second aspect described above.
In a sixth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program, which, when executed by a processor, performs the method as provided in the embodiments of the first aspect, and/or performs the method as provided in the embodiments of the second aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating steps of a sensitive word detection method according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating steps of constructing a preset sensitive word library according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a process for determining a word sequence according to an embodiment of the present application.
Fig. 5 is a block diagram of a sensitive word detection apparatus according to an embodiment of the present disclosure.
Fig. 6 is a block diagram of a sensitive thesaurus construction apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In view of the fact that the existing sensitive word detection method can only detect words in a single language in a text, and has the problems of small detection range, low flexibility and low reliability, the inventors of the present application have conducted research and propose the following embodiments to solve the above problems.
Referring to fig. 1, a schematic structural block diagram of an electronic device 100 applying a sensitive word detection method and/or a sensitive word library construction method according to an embodiment of the present application is provided. In the embodiment of the present application, the electronic device 100 may be, but is not limited to, a server, a Personal Computer (PC), a tablet PC, a Personal Digital Assistant (PDA), and the like. Structurally, electronic device 100 may include a processor 110 and a memory 120.
The processor 110 and the memory 120 are electrically connected directly or indirectly to enable data transmission or interaction, for example, the components may be electrically connected to each other via one or more communication buses or signal lines. The sensitive word detecting device and/or the sensitive word library constructing device include at least one software module which can be stored in the memory 120 in the form of software or Firmware (Firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 110 is configured to execute the executable modules stored in the memory 120, for example, execute the software functional modules and the computer programs included in the sensitive word detecting apparatus to implement the sensitive word detecting method, and for example, execute the software functional modules and the computer programs included in the sensitive word library constructing apparatus to implement the sensitive word library constructing method. The processor 110 may execute the computer program upon receiving the execution instruction.
The processor 110 may be an integrated circuit chip having signal processing capabilities. The Processor 110 may also be a general-purpose Processor, for example, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, which may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present Application. Further, a general purpose processor may be a microprocessor or any conventional processor or the like.
The Memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), and an electrically Erasable Programmable Read-Only Memory (EEPROM). The memory 120 is used for storing a program, and the processor 110 executes the program after receiving the execution instruction.
It should be understood that the structure shown in fig. 1 is merely an illustration, and the electronic device 100 provided in the embodiment of the present application may have fewer or more components than those shown in fig. 1, or may have a different configuration than that shown in fig. 1. Further, the components shown in fig. 1 may be implemented by software, hardware, or a combination thereof.
Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a sensitive word detection method according to an embodiment of the present application, where the method is applied to the electronic device 100 shown in fig. 1. It should be noted that the sensitive word detection method provided in the embodiment of the present application is not limited by the sequence shown in fig. 2 and the following, and the method includes: step S101-step S104.
Step S101: and acquiring words to be detected in the text of the data to be detected.
Step S102: and determining the language of the word to be detected.
Step S103: screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; and sensitive words of different languages are stored in the preset sensitive word bank.
Step S104: and calculating the similarity between the word to be detected and the first sensitive word.
That is, after receiving the data text to be detected, the electronic device obtains the words to be detected in the data text to be detected, then determines the language of each word to be detected, and then screens out the first sensitive words having the same language as each word to be detected from the preset sensitive word library. For example, if the word to be detected is the Chinese word "网络" (network), all Chinese sensitive words are screened out from the preset sensitive word bank as first sensitive words; if the word to be detected is the English word "network", all English sensitive words are screened out from the preset sensitive word bank as first sensitive words. It should be noted that the preset sensitive word stock is constructed from words of different languages, and therefore stores sensitive words of different languages. Finally, the electronic device calculates the similarity between the words to be detected and the corresponding first sensitive words, thereby realizing the detection of sensitive words.
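As an illustration only (the function names, the dictionary-of-sets layout of the word bank, and the default threshold of 0.6 are assumptions and not part of the original disclosure), the four steps can be strung together in a minimal Python sketch:

    def detect_sensitive_words(words, detect_language, word_bank, similarity, threshold=0.6):
        """Sketch of steps S101-S104.

        words: words to be detected, already segmented and filtered (S101).
        detect_language: callable mapping a word to a language code such as 'zh' or 'en' (S102).
        word_bank: preset sensitive word bank, assumed to map language code -> set of sensitive words.
        similarity: callable (language, word, sensitive_word) -> similarity score in [0, 1] (S104).
        Returns (word, first_sensitive_word, score) triples above the threshold, highest score first.
        """
        hits = []
        for word in words:
            lang = detect_language(word)                    # S102: language of the word to be detected
            for sensitive in word_bank.get(lang, set()):    # S103: first sensitive words of the same language
                score = similarity(lang, word, sensitive)   # S104: similarity calculation
                if score > threshold:
                    hits.append((word, sensitive, score))
        return sorted(hits, key=lambda h: h[2], reverse=True)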
In the embodiment of the application, the first sensitive words in the same language are screened out from the preset sensitive word library by identifying the language of the word to be detected, and then the similarity between the word to be detected and the first sensitive words is calculated.
For convenience of explanation, a construction method of the sensitive word stock preset in step S103 will be explained first. Referring to fig. 3, fig. 3 is a flowchart illustrating a possible process of constructing a preset sensitive word bank, where the step of constructing the preset sensitive word bank includes: step S201 to step S205.
Step S201: and acquiring an original data text.
When the preset sensitive word bank is constructed, an original data text containing sensitive words is first acquired. The original data text may be data text obtained by a crawler, for example sensitive data text obtained by a crawler automatically crawling information on the World Wide Web. The original data text may also be data text collected by a network device, such as data text intercepted by a firewall. Of course, the original data text may also be data text specified by a user, such as sets of sensitive words that the user has screened out in advance. The present application is not limited thereto.
Step S202: converting the raw data text into a word sequence; wherein the word sequence is a sensitive word.
After the original data text is acquired, it is converted into a word sequence. Specifically, the original data text can be processed by a word segmentation tool and an invalid word filtering tool to obtain a word sequence. The word sequence can be used as an original sensitive word for subsequently constructing a preset sensitive word bank.
It should be noted that the word segmentation tool may split a sentence in a text into words; for example, the four words "obtain", "original", "data", and "text" may be obtained by processing "obtain an original data text" with the word segmentation tool. The word segmentation tool can be, but is not limited to, the jieba word segmentation tool or the NLPIR word segmentation tool. The invalid word filtering tool can remove invalid words in the original data text, such as common modal particles, adverbs, conjunctions, prepositions and the like. Of course, the invalid word filtering tool can also filter out numbers, characters and punctuation in the original data text. For example, if the original data text "obtain an original data text" is processed by the word segmentation tool and the invalid word filtering tool, the four word sequences "obtain", "original", "data", and "text" can be obtained.
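For illustration only, the following sketch shows one way this step could look in Python, using jieba as the segmenter; the stop-word set is a small placeholder, not the patent's actual invalid-word list:

    import re
    import jieba  # one possible Chinese word segmentation tool; NLPIR is another option

    # Illustrative stop-word set; a real deployment would load a full list of
    # modal particles, adverbs, conjunctions, prepositions, etc.
    STOP_WORDS = {"的", "了", "和", "而且", "在", "a", "an", "the", "of", "and"}

    def to_word_sequence(raw_text):
        """Segment the raw data text and drop invalid words, numbers and punctuation."""
        kept = []
        for w in jieba.lcut(raw_text):
            w = w.strip()
            if not w or w in STOP_WORDS:
                continue
            if re.fullmatch(r"[\d\W_]+", w):   # numbers, symbols, punctuation
                continue
            kept.append(w)
        return kept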
As an optional implementation, in order to improve the accuracy and reasonableness of the subsequent construction of the preset sensitive word stock, the word sequences may be filtered once during the process of converting the original data text into word sequences. Specifically, the step of converting the original data text into a word sequence includes: performing word segmentation processing on the original data text to obtain a first word sequence; calculating a feature value of the first word sequence through a feature selection algorithm; and taking the first word sequence whose feature value is greater than a preset threshold value as the word sequence.
The above-mentioned feature selection algorithm may be, but is not limited to, an expected cross entropy algorithm or an Information Gain (IG) algorithm.
For example, the expected cross entropy algorithm can be used to measure the importance of a word to the whole. The specific formula is as follows:
ECE(t) = P(t) · Σ_i P(Ci|t) · log( P(Ci|t) / P(Ci) )    (1)
in formula (1), t represents a word sequence; Ci represents the i-th category (when the texts form only one class, only 1 category is included); P(t) represents the frequency with which the word sequence occurs in the documents; P(Ci|t) represents the text frequency of the category Ci on condition that the word sequence appears; P(Ci) represents the text frequency of the category Ci; and ECE(t) represents the expected cross entropy of the word sequence t.
The preset threshold is a specific value, and may be set according to different situations, for example, 0.02, 0.04, 0.1, and the like.
For example, after the feature values of the first word sequences are calculated by the feature selection algorithm, the result may be "novel 0.084, network 0.048, section 0.022, history 0.0005, ...". If the preset threshold is 0.02, the first word sequences whose feature values are greater than the preset threshold, namely "novel 0.084, network 0.048, section 0.022", are the word sequences obtained after filtering. That is, in this manner, the word sequences that best represent the original data text are screened out. For example, if a user takes an article introducing a game as the original data text, the word sequences most relevant to the article can be screened out through the feature selection algorithm, so that the preset sensitive word bank is subsequently constructed from the screened word sequences.
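As a sketch only, formula (1) can be computed over segmented documents as follows; estimating the probabilities as document frequencies is an assumption, since the text does not spell out the estimator:

    import math
    from collections import Counter

    def expected_cross_entropy(docs, labels):
        """docs: list of word lists (one per document); labels: class/cluster id per document.
        Returns {word: ECE(t)} per formula (1): ECE(t) = P(t) * sum_i P(Ci|t) * log(P(Ci|t)/P(Ci))."""
        n_docs = len(docs)
        class_count = Counter(labels)          # documents per class      -> P(Ci)
        doc_freq = Counter()                   # documents containing t   -> P(t)
        joint = Counter()                      # (t, Ci) document counts  -> P(Ci|t)
        for words, label in zip(docs, labels):
            for w in set(words):
                doc_freq[w] += 1
                joint[(w, label)] += 1
        scores = {}
        for w, df in doc_freq.items():
            p_t = df / n_docs
            s = 0.0
            for c, n_c in class_count.items():
                p_c = n_c / n_docs
                p_c_given_t = joint[(w, c)] / df
                if p_c_given_t > 0:            # treat 0 * log 0 as 0
                    s += p_c_given_t * math.log(p_c_given_t / p_c)
            scores[w] = p_t * s
        return scores

    # Keep only first word sequences whose feature value exceeds the preset threshold, e.g. 0.02:
    # word_sequences = [w for w, v in expected_cross_entropy(docs, labels).items() if v > 0.02]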
Optionally, in order to make the calculation of the feature value more accurate, after performing word segmentation processing on the original data text to obtain a first word sequence, the method further includes: and clustering the original data texts according to the text contents.
It should be noted that the process of dividing a set of physical or abstract objects into a plurality of classes composed of similar objects is called clustering. A cluster generated by clustering is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. In the embodiment of the present application, the clustering may use algorithms including, but not limited to, the K-means clustering algorithm and mean-shift clustering.
Taking the K-means clustering algorithm as an example, K-means is an iteratively solved cluster analysis algorithm. Its steps are: the data are divided into K groups in advance, K objects are randomly selected as initial cluster centers, the distance between each object and each seed cluster center is then calculated, and each object is assigned to the nearest cluster center. The cluster centers and the objects assigned to them represent a cluster. Each time a sample is assigned, the cluster center of the cluster is recalculated based on the objects currently in the cluster. This process is repeated until some termination condition is met, for example that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change again, or that the sum of squared errors reaches a local minimum. It should be noted that, in this embodiment, the original data texts are clustered by text content; for example, after clustering, texts related to network information fall into one class cluster and texts related to game information fall into another class cluster.
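A sketch of this clustering step is shown below; the use of scikit-learn and a TF-IDF bag-of-words representation is an assumption, since the patent does not specify how texts are represented:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def cluster_texts(texts, k=2):
        """Cluster original data texts by content with K-means.
        texts: list of texts; Chinese texts are assumed to be pre-segmented and space-joined.
        k: the preset number of clusters. Returns a cluster label for each text."""
        vectors = TfidfVectorizer().fit_transform(texts)
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)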
If the original data text is clustered by adopting the above method, correspondingly, the characteristic value of the first word sequence is calculated by a characteristic selection algorithm, which comprises the following steps: and respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm.
That is, the first word sequences are divided into a plurality of class clusters, and the feature value of each first word sequence is calculated within its own class cluster, so that the calculation of the feature value is more accurate.
In the following, the above process is explained with reference to fig. 4. The original data texts shown in fig. 4 include: 4110788.txt, 4111088.txt, 697638.txt, and 892479.txt. After the four original data texts are obtained, word segmentation is first performed on them to obtain first word sequences. The word segmentation result can be seen in lines 7-10 of fig. 4; for example, segmenting "global science" in 4110788.txt yields the two words "global" and "science". After word segmentation, the four original data texts are clustered by text content; the clustering result can be seen in lines 12-13 of fig. 4, where 4110788.txt and 4111088.txt form class cluster 1 (related to science) and 697638.txt and 892479.txt form class cluster 2 (related to novels). Finally, the feature values of the first word sequences are calculated within each class cluster by the above feature selection algorithm and feature extraction is performed (for example, with the threshold set to 0.02), yielding the word sequences; the final word sequences can be seen in lines 16-17 of fig. 4.
Step S203: and acquiring the language of each word sequence.
After the original data text is converted into word sequences, the language of each word sequence is obtained, for example, the language of each word sequence can be detected by using a fasttext language detection model.
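For example, with the fastText library this could look as follows; the lid.176.ftz file is the pre-trained language identification model distributed by fastText and must be downloaded separately:

    import fasttext

    lid_model = fasttext.load_model("lid.176.ftz")

    def detect_language(word):
        """Return a language code such as 'zh' or 'en' for a single word sequence."""
        labels, _probs = lid_model.predict(word)
        return labels[0].replace("__label__", "")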
Step S204: and inputting each word sequence into a language model corresponding to the language of the word sequence to obtain the similar meaning words and/or the synonyms of the word sequence.
After determining the language of each word sequence, a corresponding language model (e.g., a BERT pre-trained language model for that language) is selected according to the language to obtain the near-synonyms and/or synonyms of each word sequence. To avoid duplication, the near-synonyms and/or synonyms of each word sequence may be merged and duplicate words in the merged set deleted.
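The patent does not detail how the language model produces near-synonyms; one common realization, shown below purely as an assumption, is to embed the word and a candidate vocabulary with a per-language BERT model and take the nearest candidates. The model names, the mean pooling, and the top-k cutoff are illustrative choices:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Example per-language model choices; any BERT-style model per language could be used.
    MODELS = {"zh": "bert-base-chinese", "en": "bert-base-uncased"}

    def embed(words, lang):
        """Mean-pooled, L2-normalized BERT vectors for a list of words."""
        tok = AutoTokenizer.from_pretrained(MODELS[lang])
        model = AutoModel.from_pretrained(MODELS[lang])
        enc = tok(words, padding=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc).last_hidden_state.mean(dim=1)
        return torch.nn.functional.normalize(out, dim=1)

    def near_synonyms(word, candidates, lang, top_k=5):
        """Treat the top_k candidates closest to `word` in embedding space as near synonyms."""
        candidates = list(candidates)
        vecs = embed([word] + candidates, lang)
        sims = vecs[1:] @ vecs[0]              # cosine similarity, since vectors are normalized
        order = torch.argsort(sims, descending=True)[:top_k]
        return [candidates[int(i)] for i in order]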
Step S205: and constructing the preset sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
Finally, the preset sensitive word library is constructed based on the word sequences and the near-synonyms and/or synonyms of each word sequence. That is, the constructed preset sensitive word library includes the word sequences obtained in the above manner and the corresponding near-synonyms and/or synonyms of each word sequence. Likewise, to avoid duplication, the word sequences and the near-synonyms and/or synonyms of each word sequence may be merged and duplicate words in the same set deleted.
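Putting the pieces together, a sketch of the construction step might look like the following; the per-language dictionary-of-sets layout is an assumption, chosen because it performs the merge and de-duplication described above in one step:

    def build_sensitive_word_bank(word_sequences, detect_language, expand):
        """word_sequences: filtered sensitive word sequences.
        detect_language: callable word -> language code.
        expand: callable (word, language) -> iterable of near synonyms and/or synonyms.
        Returns {language: set of sensitive words}; sets remove duplicates automatically."""
        bank = {}
        for word in word_sequences:
            lang = detect_language(word)
            bucket = bank.setdefault(lang, set())
            bucket.add(word)
            bucket.update(expand(word, lang))
        return bank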
In summary, in the embodiment of the present application, word sequences of different languages are input into corresponding language models, so as to obtain synonyms and/or near-synonyms of the word sequences, and in this way, the number of words in the preset sensitive word library is expanded. And the method also realizes the expansion of the sensitive words of different languages, so that the sensitive word detection is conveniently carried out on the following words to be detected based on different languages.
It should be noted that the preset sensitive word library may also be directly constructed only according to the sensitive word sets of each language screened out by the user, which is not limited in the present application.
The following describes in detail the flow and steps of the sensitive word detection method shown in fig. 2 with reference to specific examples.
Step S101: and acquiring words to be detected in the text of the data to be detected.
In the actual detection process, when the electronic device detects a data text to be detected, it first acquires the words to be detected in the data text to be detected; that is, the data text to be detected needs to be converted into individual words to be detected. Specifically, the data text to be detected can be processed by a word segmentation tool and an invalid word filtering tool to obtain the words to be detected. It should be noted that the word segmentation tool and the invalid word filtering tool have already been described in the foregoing embodiments and are not repeated here.
Step S102: and determining the language of the word to be detected.
After the data text to be detected is converted into words to be detected, the language of each word to be detected is obtained; for example, the language of each word to be detected can be detected by using a fasttext language detection model.
Step S103: screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; and sensitive words of different languages are stored in the preset sensitive word bank.
After the language of each word to be detected is determined, first sensitive words with the same language as the word to be detected are screened out from the preset sensitive word library. For example, when the word to be detected is a Chinese word (such as the Chinese word for "novel"), all Chinese sensitive words in the preset sensitive word library, for example the entries corresponding to "file" and "text", are screened out as first sensitive words; and when the word to be detected is an English word, all English sensitive words in the preset sensitive word library, for example "file" and "application", are screened out as first sensitive words.
Step S104: and calculating the similarity between the word to be detected and the first sensitive word.
After the first sensitive words with the same language as the word to be detected are obtained, the similarity between the word to be detected and each first sensitive word can be calculated. The similarity can be calculated by a similarity algorithm such as the Euclidean distance algorithm or the cosine similarity algorithm; the present application is not limited thereto.
In order to improve the accuracy of similarity calculation between the words to be detected and the first sensitive words in different languages, in the embodiment of the application, the similarity between the words to be detected and the first sensitive words is calculated by a similarity calculation method corresponding to the languages of the words to be detected; wherein, the similarity calculation methods corresponding to different languages are different.
For example, the similarity of Chinese words to be detected can be calculated by a cosine similarity algorithm, and the similarity of English words to be detected can be calculated by the Jaro-Winkler algorithm.
Specifically, when the word to be detected is detected as a Chinese word, all Chinese sensitive words (namely first sensitive words) in a preset sensitive word library are obtained, then a BERT Chinese model is adopted to vectorize the word to be detected and the first sensitive words, and then the similarity between the word to be detected and each first sensitive word is calculated according to a cosine similarity algorithm. The formula of the cosine similarity algorithm is as follows:
similarity = cos(θ) = ( Σ_i Ai·Bi ) / ( √(Σ_i Ai²) · √(Σ_i Bi²) )    (2)
in formula (2), Ai denotes the components of the vector representing the word A to be detected, Bi denotes the components of the vector representing the first sensitive word B, and similarity denotes the cosine similarity between the word A to be detected and the first sensitive word B.
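A direct rendering of formula (2) is shown below; the input vectors are assumed to come from a Chinese BERT model, for example the embedding sketch given earlier:

    import numpy as np

    def cosine_similarity(vec_a, vec_b):
        """Formula (2): cosine similarity between the vector A of the word to be detected
        and the vector B of a first sensitive word."""
        a = np.asarray(vec_a, dtype=float)
        b = np.asarray(vec_b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))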
Specifically, when the word to be detected is English, all English sensitive words (namely first sensitive words) in a preset sensitive word bank are obtained, and the similarity between the word to be detected and each first sensitive word is calculated according to a Jaro-Winkler similarity algorithm. The formula of the Jaro-Winkler similarity algorithm is as follows:
sim_w = sim_j + l · p · (1 - sim_j)    (3)
sim_j = (1/3) · ( m/|s1| + m/|s2| + (m - t)/m )    (4)
in formulas (3) and (4), |s1| represents the string length of the word to be detected; |s2| represents the string length of the first sensitive word; m represents the number of matching characters of the two strings; t represents half the number of transpositions; l represents the number of common prefix characters of the two strings (at most 4); p is a constant scaling factor describing the contribution of the common prefix to the similarity; the larger p is, the greater the weight of the common prefix, and p does not exceed 0.25, with a default value of 0.1.
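For completeness, a plain-Python rendering of formulas (3) and (4) could look as follows; this is a sketch, not the patent's prescribed implementation:

    def jaro(s1, s2):
        """Jaro similarity, formula (4)."""
        if s1 == s2:
            return 1.0
        if not s1 or not s2:
            return 0.0
        window = max(len(s1), len(s2)) // 2 - 1
        match1 = [False] * len(s1)
        match2 = [False] * len(s2)
        m = 0
        for i, c in enumerate(s1):                         # count matching characters
            lo, hi = max(0, i - window), min(len(s2), i + window + 1)
            for j in range(lo, hi):
                if not match2[j] and s2[j] == c:
                    match1[i] = match2[j] = True
                    m += 1
                    break
        if m == 0:
            return 0.0
        k = t = 0                                          # t = half the number of transpositions
        for i in range(len(s1)):
            if match1[i]:
                while not match2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    t += 1
                k += 1
        t //= 2
        return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

    def jaro_winkler(s1, s2, p=0.1):
        """Jaro-Winkler similarity, formula (3); l is the common prefix length, at most 4."""
        sim_j = jaro(s1, s2)
        l = 0
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            l += 1
        return sim_j + l * p * (1 - sim_j)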
In summary, in the embodiment of the present application, the similarity between the word to be detected and the first sensitive word is calculated by the similarity algorithm corresponding to the language of the word to be detected, so that the accuracy of calculating the similarity between the word to be detected and the first sensitive word in different languages is improved, and the accuracy of detecting the sensitive word is further improved.
Optionally, after the similarity between the word to be tested and the first sensitive word is calculated in step S104, the method further includes: and outputting the sensitive words with the similarity greater than a preset sensitive threshold value with the words to be detected in the first sensitive words.
The preset sensitive threshold may be determined according to the actual situation; for example, it may be 40%, 60%, 0.5, 0.7, and so on. That is, by setting the preset sensitive threshold, it can be conveniently checked whether the data text to be detected contains sensitive words. For example, if the preset sensitive threshold is 0.6 and the above steps find first sensitive words whose similarity to the word to be detected is greater than 0.6, those first sensitive words are output; if no first sensitive word has a similarity to the word to be detected greater than 0.6, nothing is output, that is, the currently detected data text to be detected does not contain sensitive words. In this way, the user can intuitively know whether the data text to be detected contains sensitive words.
Optionally, after the similarity between the word to be tested and the first sensitive word is calculated in step S104, the method further includes: and outputting the first N sensitive words in the first sensitive words after the similarity of the first sensitive words and the words to be detected is sorted from large to small.
The above N may be determined according to the actual situation; for example, N may be 3, 5, 6, 8, and so on. That is, in this manner, the first N first sensitive words, ranked by similarity to the word to be detected from large to small, are output so that the user can know the content related to the current data text to be detected; the output N sensitive words can also be understood as the sensitive words closest to the data text to be detected.
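Both output modes (threshold filtering and top-N ranking) can be expressed over the scored results, as in the sketch below; the default values 0.6 and 5 are examples only:

    def report(scored, threshold=0.6, top_n=5):
        """scored: (word, first_sensitive_word, similarity) triples from the detection step.
        Returns the sensitive words whose similarity exceeds the preset sensitive threshold,
        and the top-N sensitive words ranked by similarity from large to small."""
        above = [s for s in scored if s[2] > threshold]
        top = sorted(scored, key=lambda s: s[2], reverse=True)[:top_n]
        return above, top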
Based on the same concept, the embodiment of the present application further provides a sensitive word stock construction method, which can be independently applied to the electronic device 100 shown in fig. 1. The method comprises the following steps: acquiring an original data text; converting the raw data text into a word sequence; wherein the word sequence is a sensitive word; acquiring the language of each word sequence; inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a near synonym and/or a synonym of the word sequence; and constructing a sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
It should be noted that the method for constructing the sensitive thesaurus provided in the embodiment of the present application is the same as the method for constructing the preset sensitive thesaurus in the foregoing embodiment, and therefore, for the specific process of the method for constructing the sensitive thesaurus, reference may be made to the specific process of the method for constructing the preset sensitive thesaurus in the foregoing embodiment, and in order to avoid redundancy, a repeated explanation is not provided here.
Referring to fig. 5, based on the same inventive concept, an embodiment of the present application further provides a sensitive word detection apparatus 200, which includes an obtaining module 201, a determining module 202, a screening module 203, and a calculating module 204.
The obtaining module 201 is configured to obtain a word to be tested in the data text to be tested.
And the determining module 202 is configured to determine the language of the word to be detected.
The screening module 203 is used for screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; and sensitive words of different languages are stored in the preset sensitive word bank.
A calculating module 204, configured to calculate a similarity between the word to be detected and the first sensitive word.
In the embodiment of the present application, each module uses Spark (a computation engine) as the distributed computation framework.
Referring to fig. 6, based on the same inventive concept, an embodiment of the present application further provides a sensitive word library constructing apparatus 300, including: a first obtaining module 301, a converting module 302, a second obtaining module 303, a processing module 304, and a building module 305.
The first obtaining module 301 is configured to obtain an original data text.
A conversion module 302, configured to convert the original data text into a word sequence; wherein the word sequence is a sensitive word.
A second obtaining module 303, configured to obtain a language of each word sequence.
And the processing module 304 is configured to input each word sequence into a language model corresponding to the language of the word sequence, so as to obtain a near-synonym and/or a synonym of the word sequence.
A building module 305, configured to build a sensitive word library based on the word sequences and the synonyms and/or synonyms of each word sequence.
It should be noted that, as those skilled in the art can clearly understand, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Based on the same inventive concept, the present application further provides a storage medium, on which a computer program is stored, and when the computer program is executed, the computer program performs the method provided in the foregoing embodiments.
The storage medium may be any available medium that can be accessed by a computer or a data storage device including one or more integrated servers, data centers, and the like. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A sensitive word detection method, comprising:
acquiring words to be detected in a text of data to be detected;
determining the language of the word to be detected;
screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; sensitive words of different languages are stored in the preset sensitive word bank;
and calculating the similarity between the word to be detected and the first sensitive word.
2. The sensitive word detection method according to claim 1, wherein the preset sensitive word bank is constructed by the following steps:
acquiring an original data text;
converting the raw data text into a word sequence; wherein the word sequence is a sensitive word;
acquiring the language of each word sequence;
inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a near synonym and/or a synonym of the word sequence;
and constructing the preset sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
3. The sensitive word detection method of claim 2, wherein the converting the raw data text into a word sequence comprises:
performing word segmentation processing on the original data text to obtain a first word sequence;
calculating a characteristic value of the first word sequence through a characteristic selection algorithm;
and the first word sequence with the characteristic value larger than a preset threshold value is the word sequence.
4. The method according to claim 3, wherein after the segmenting the original data text into words to obtain a first word sequence, the method further comprises:
clustering the original data texts according to text contents;
correspondingly, the calculating the feature value of the first word sequence by the feature selection algorithm includes:
and respectively calculating the characteristic value of the first word sequence in each clustered class through a characteristic selection algorithm.
5. The sensitive word detection method according to claim 1, wherein after the calculating the similarity between the word to be detected and the first sensitive word, the method further comprises:
and outputting the sensitive words with the similarity with the words to be detected larger than a preset sensitive threshold in the first sensitive words.
6. The sensitive word detection method according to claim 1, wherein after the calculating the similarity between the word to be detected and the first sensitive word, the method further comprises:
and outputting the first N sensitive words in the first sensitive words, wherein the first N sensitive words are ranked from large to small in similarity with the words to be detected.
7. The sensitive word detection method according to claim 1, wherein the calculating the similarity between the word to be detected and the first sensitive word includes:
calculating the similarity between the word to be detected and the first sensitive word by a similarity algorithm corresponding to the language of the word to be detected; wherein, the similarity calculation methods corresponding to different languages are different.
8. A sensitive word stock construction method is characterized by comprising the following steps:
acquiring an original data text;
converting the raw data text into a word sequence; wherein the word sequence is a sensitive word;
acquiring the language of each word sequence;
inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a near synonym and/or a synonym of the word sequence;
and constructing a sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
9. A sensitive word detection apparatus, comprising:
the acquisition module is used for acquiring words to be detected in the text of the data to be detected;
the determining module is used for determining the language of the word to be detected;
the screening module is used for screening out a first sensitive word with the same language as the word to be detected from a preset sensitive word bank; sensitive words of different languages are stored in the preset sensitive word bank;
and the calculating module is used for calculating the similarity between the word to be detected and the first sensitive word.
10. A sensitive word stock construction device is characterized by comprising:
the first acquisition module is used for acquiring an original data text;
the conversion module is used for converting the original data text into a word sequence; wherein the word sequence is a sensitive word;
the second acquisition module is used for acquiring the language of each word sequence;
the processing module is used for inputting each word sequence into a language model corresponding to the language of the word sequence to obtain a near synonym and/or a synonym of the word sequence;
and the construction module is used for constructing a sensitive word library based on the word sequences and the similar words and/or synonyms of each word sequence.
11. An electronic device, comprising: a processor and a memory, the processor and the memory connected;
the memory is used for storing programs;
the processor is configured to run a program stored in the memory, to perform the method of any of claims 1-7, and/or to perform the method of claim 8.
12. A storage medium having stored thereon a computer program which, when executed by a computer, performs the method of any one of claims 1-7 and/or performs the method of claim 8.
CN202011384381.XA 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium Active CN112364637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011384381.XA CN112364637B (en) 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011384381.XA CN112364637B (en) 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112364637A (en) 2021-02-12
CN112364637B CN112364637B (en) 2024-02-09

Family

ID=74535855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011384381.XA Active CN112364637B (en) 2020-11-30 2020-11-30 Sensitive word detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112364637B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150347393A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Exemplar-based natural language processing
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN106528731A (en) * 2016-10-27 2017-03-22 新疆大学 Sensitive word filtering method and system
CN111506708A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Text auditing method, device, equipment and medium
CN111640420A (en) * 2020-06-10 2020-09-08 上海明略人工智能(集团)有限公司 Audio data processing method and device and storage medium
CN111859013A (en) * 2020-07-17 2020-10-30 腾讯音乐娱乐科技(深圳)有限公司 Data processing method, device, terminal and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium
CN113076748B (en) * 2021-04-16 2024-01-19 平安国际智慧城市科技股份有限公司 Bullet screen sensitive word processing method, device, equipment and storage medium
CN113434775A (en) * 2021-07-15 2021-09-24 北京达佳互联信息技术有限公司 Method and device for determining search content
CN113434775B (en) * 2021-07-15 2024-03-26 北京达佳互联信息技术有限公司 Method and device for determining search content
CN114707499A (en) * 2022-01-25 2022-07-05 中国电信股份有限公司 Sensitive word recognition method and device, electronic equipment and storage medium
CN114707499B (en) * 2022-01-25 2023-10-24 中国电信股份有限公司 Sensitive word recognition method and device, electronic equipment and storage medium
CN117493532A (en) * 2023-12-29 2024-02-02 深圳智汇创想科技有限责任公司 Text processing method, device, equipment and storage medium
CN117493532B (en) * 2023-12-29 2024-03-29 深圳智汇创想科技有限责任公司 Text processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112364637B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN112364637B (en) Sensitive word detection method and device, electronic equipment and storage medium
CN110826648B (en) Method for realizing fault detection by utilizing time sequence clustering algorithm
US10163063B2 (en) Automatically mining patterns for rule based data standardization systems
CN111291070B (en) Abnormal SQL detection method, equipment and medium
CN111898366B (en) Document subject word aggregation method and device, computer equipment and readable storage medium
RU2722692C1 (en) Method and system for detecting malicious files in a non-isolated medium
US10528609B2 (en) Aggregating procedures for automatic document analysis
CN112433874A (en) Fault positioning method, system, electronic equipment and storage medium
Lathabai et al. Contextual productivity assessment of authors and journals: a network scientometric approach
CN110609952A (en) Data acquisition method and system and computer equipment
US20100191753A1 (en) Extracting Patterns from Sequential Data
WO2007007410A1 (en) Message analyzing device, message analyzing method and message analyzing program
CN114584377A (en) Flow anomaly detection method, model training method, device, equipment and medium
CN114722794A (en) Data extraction method and data extraction device
CN111177719A (en) Address category determination method, device, computer-readable storage medium and equipment
CN111738009A (en) Method and device for generating entity word label, computer equipment and readable storage medium
US11676231B1 (en) Aggregating procedures for automatic document analysis
JP4979637B2 (en) Compound word break estimation device, method, and program for estimating compound word break position
CN113761867A (en) Address recognition method and device, computer equipment and storage medium
CN117113247A (en) Drainage system abnormality monitoring method, equipment and storage medium based on two-classification and clustering algorithm
CN115237721A (en) Method, device and storage medium for predicting fault based on frequent sequence of windows
CN115051863A (en) Abnormal flow detection method and device, electronic equipment and readable storage medium
CN114499408A (en) Information testing method and system based on battery big data
CN114513341A (en) Malicious traffic detection method, device, terminal and computer readable storage medium
CN113204954A (en) Data detection method and device based on big data and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant