CN114707499A - Sensitive word recognition method and device, electronic equipment and storage medium - Google Patents


Publication number
CN114707499A
Authority
CN
China
Prior art keywords
word
sensitive
character
character string
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210086774.5A
Other languages
Chinese (zh)
Other versions
CN114707499B (en)
Inventor
马兆铭
王铮
任华
杨迪
汪少敏
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202210086774.5A priority Critical patent/CN114707499B/en
Publication of CN114707499A publication Critical patent/CN114707499A/en
Application granted granted Critical
Publication of CN114707499B publication Critical patent/CN114707499B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

The disclosure provides a sensitive word recognition method, a sensitive word recognition device, an electronic device, and a storage medium. The sensitive word recognition method includes the following steps: obtaining, from a preset coding library, a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word; preprocessing the first character string and the second character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word; calculating the cosine similarity of the first character vector and the second character vector; and determining, according to the calculation result, whether the word to be recognized is a sensitive word. By obtaining the first and second character strings that have a mapping relationship with the word to be recognized and the sensitive sample word, vectorizing them, calculating the cosine similarity of the resulting vectors, and deciding from that similarity whether the word to be recognized is a sensitive word, the method improves both the accuracy and the efficiency of sensitive word recognition.

Description

Sensitive word recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of network information recognition technologies, and in particular to a sensitive word recognition method and apparatus, an electronic device, and a storage medium.
Background
With the development of communication networks, anyone can freely publish text online, and some malicious users exploit this to publish harmful information. To evade review by network platforms, the means by which malicious users publish such information have become varied and complex; for example, a sensitive word may be represented by its split components or by visually similar characters. Publishing malicious information this way not only makes it harder for platforms to filter sensitive words, but also lets harmful information leak through or forces a manual second review.
In the prior art, zone-bit coding and the KMP algorithm (an improved string-matching algorithm by D. E. Knuth, J. H. Morris, and V. R. Pratt) are generally adopted to handle split Chinese characters.
Therefore, how to improve the accuracy and efficiency of sensitive word recognition has become an urgent technical problem to be solved.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a sensitive word recognition method, an apparatus, an electronic device, and a storage medium, which overcome, at least to some extent, the problems of low accuracy and low efficiency of sensitive word recognition in the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, there is provided a sensitive word recognition method including: respectively acquiring a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, wherein a plurality of words and character strings with mapping relations are stored in the preset coding library; respectively preprocessing the first character string and the second character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word; calculating cosine similarity of the first character vector and the second character vector; and determining whether the word to be recognized is a sensitive word or not according to a calculation result.
In an embodiment of the present disclosure, preprocessing the first character string and the second character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word includes: merging and de-duplicating the first character string and the second character string to obtain a characteristic character string; and vectorizing the first character string and the second character string by using the characteristic character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word.
In an embodiment of the present disclosure, performing vectorization processing on the first character string and the second character string by using the feature character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word, includes: traversing and searching the first character string by using each character in the characteristic character string, if the character in the characteristic character string is searched in the first character string, marking as 1, otherwise, marking as 0, and obtaining a first character vector of the word to be recognized; and traversing and searching in the second character string by using each character in the characteristic character string, if the character in the characteristic character string is searched in the second character string, marking as 1, otherwise, marking as 0, and obtaining a second character vector of the sensitive sample word.
In an embodiment of the present disclosure, before the first character string corresponding to the word to be recognized and the second character string corresponding to the sensitive sample word are obtained from the preset coding library, the method further includes: obtaining, from the preset coding library and according to the word to be recognized, the sensitive sample word corresponding to it, wherein the preset coding library also stores a number of sensitive words and variant words of the sensitive words that have a mapping relationship; the word to be recognized belongs to the variant words of the sensitive words, the variant words being words formed from characters visually similar to the sensitive word or from its split-and-recombined components, and the sensitive sample word belongs to the sensitive words.
In one embodiment of the present disclosure, the cosine similarity of the first character vector and the second character vector is calculated by the following formula:

$$\mathrm{similar}(P,Q)=\frac{\sum_{i=1}^{n}P_i\,Q_i}{\sqrt{\sum_{i=1}^{n}P_i^{2}}\;\sqrt{\sum_{i=1}^{n}Q_i^{2}}}$$

where $P$ is the first character vector, $Q$ is the second character vector, $n$ is the number of elements in the first character vector and in the second character vector ($n$ is a positive integer), $\mathrm{similar}(P,Q)$ is the target similarity of the first character vector and the second character vector, and $i$ is the element index.
In an embodiment of the present disclosure, determining whether the word to be recognized is a sensitive word according to a calculation result includes: if the target similarity is greater than or equal to the preset similarity threshold, determining the word to be recognized as a sensitive word; and if the target similarity is smaller than the preset similarity threshold, determining that the word to be recognized is a non-sensitive word.
In one embodiment of the present disclosure, the method further comprises: acquiring a plurality of groups of sample words; generating a character string corresponding to each group of sample words according to the five-stroke coding rule; generating a mapping table according to the multiple groups of sample words and the corresponding character strings; and storing the mapping table into the preset coding library.
According to another aspect of the present disclosure, there is provided a sensitive word recognition apparatus including: the character string acquisition module is used for respectively acquiring a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, wherein the preset coding library stores a plurality of words and character strings with mapping relations; the preprocessing module is used for preprocessing the first character string and the second character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word; the similarity calculation module is used for calculating cosine similarity of the first character vector and the second character vector; and the result determining module is used for determining whether the word to be recognized is a sensitive word or not according to the calculation result.
In an embodiment of the present disclosure, the preprocessing module is further configured to perform merging and deduplication processing on the first character string and the second character string to obtain a feature character string; and vectorizing the first character string and the second character string by using the characteristic character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word.
In an embodiment of the present disclosure, the preprocessing module is further configured to perform traversal search on the first character string by using each character in the feature character string, and if a character in the feature character string is found in the first character string, the character is marked as 1, otherwise, the character is marked as 0, so as to obtain a first character vector of the word to be recognized; and traversing and searching in the second character string by using each character in the characteristic character string, if the character in the characteristic character string is searched in the second character string, marking as 1, otherwise, marking as 0, and obtaining a second character vector of the sensitive sample word.
In an embodiment of the present disclosure, the apparatus further includes a sample word obtaining module. Before the first character string corresponding to the word to be recognized and the second character string corresponding to the sensitive sample word are obtained from the preset coding library, the sample word obtaining module obtains, according to the word to be recognized, the sensitive sample word corresponding to it from the preset coding library. The preset coding library also stores a number of sensitive words and variant words of those sensitive words that have a mapping relationship; the word to be recognized belongs to the variant words, the variant words being words formed from characters visually similar to the sensitive word or from its split-and-recombined components, and the sensitive sample word belongs to the sensitive words.
In an embodiment of the disclosure, the similarity calculation module calculates the cosine similarity between the first character vector and the second character vector according to the following formula:

$$\mathrm{similar}(P,Q)=\frac{\sum_{i=1}^{n}P_i\,Q_i}{\sqrt{\sum_{i=1}^{n}P_i^{2}}\;\sqrt{\sum_{i=1}^{n}Q_i^{2}}}$$

where $P$ is the first character vector, $Q$ is the second character vector, $n$ is the number of elements in each vector ($n$ is a positive integer), $\mathrm{similar}(P,Q)$ is the target similarity of the first character vector and the second character vector, and $i$ is the element index.
In an embodiment of the present disclosure, the result determining module is further configured to determine that the word to be recognized is a sensitive word if the target similarity is greater than or equal to the preset similarity threshold; and if the target similarity is smaller than the preset similarity threshold, determining that the word to be recognized is a non-sensitive word.
In an embodiment of the present disclosure, the apparatus further includes a mapping table storage module, where the mapping table storage module is configured to obtain a plurality of groups of sample words; generating a character string corresponding to each group of sample words according to the five-stroke coding rule; generating a mapping table according to the multiple groups of sample words and the corresponding character strings; and storing the mapping table into the preset coding library.
According to yet another aspect of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the sensitive word recognition method described above via execution of the executable instructions.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the above-mentioned sensitive word recognition method.
According to the sensitive word recognition method and apparatus, the electronic device, and the storage medium provided by the present disclosure, a first character string and a second character string that have a mapping relationship with the word to be recognized and the sensitive sample word are obtained, the two character strings are vectorized, the cosine similarity of the first character vector and the second character vector is calculated, and whether the word to be recognized is a sensitive word is determined according to the obtained cosine similarity, improving the accuracy and efficiency of sensitive word recognition.
Further, before the first character string corresponding to the word to be recognized and the second character string corresponding to the sensitive sample word are obtained from the preset coding library, the sensitive sample word corresponding to the word to be recognized is obtained from the preset coding library according to the word to be recognized. The preset coding library also stores a number of sensitive words and variant words of those sensitive words that have a mapping relationship; the word to be recognized belongs to the variant words, which are words formed from characters visually similar to the sensitive word or from its split-and-recombined components, and the sensitive sample word belongs to the sensitive words. By retrieving from the preset coding library the sensitive sample word that has a mapping relationship with the word to be recognized, the method avoids the complexity of splitting sensitive words and exhaustively combining the pieces, as required by the prior-art zone-bit coding and KMP matching algorithms, can accurately match split and recombined words, and improves anti-interference capability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a diagram illustrating a sensitive word matching method in an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a sensitive word recognition method in an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating another sensitive word recognition method in an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating another sensitive word recognition method in an embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating another sensitive word recognition method in an embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating another sensitive word recognition method in an embodiment of the present disclosure;
FIG. 7 is a diagram illustrating yet another sensitive word recognition method in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a sensitive word recognition apparatus in an embodiment of the present disclosure; and
fig. 9 shows a block diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As mentioned in the background, malicious information has changed greatly in recent years, and the means by which malicious users publish it have become varied and complex in order to evade platform review. Variants of Chinese sensitive words are especially problematic: a bad word such as "poison-selling" can be written as a combination of visually similar characters and split variant characters, such as "rice poison", "shellfish anti-poison", or "shellfish anti-foggy mother". Such variant combinations do not hinder a reader's understanding, yet network platforms cannot recognize them, which greatly increases the difficulty of filtering sensitive words and often causes bad information to be missed or to require a manual second review.
In the prior art, zone-bit coding and the KMP matching algorithm are usually adopted to handle split Chinese characters. Referring to the schematic diagram of a sensitive word matching method shown in Fig. 1, that method includes the following steps:
Split the Chinese characters of the sample fonts in the sensitive word sample library and of the text to be tested: the characters are split according to rules such as the top-bottom, left-right, inside-outside, and enclosing structures of Chinese characters, yielding suspected sensitive words and split sensitive words;
Convert the suspected sensitive words and the split sensitive words into zone-bit codes: the text containing suspected sensitive words and the split sensitive words are converted into the corresponding zone-bit codes S and A according to the standard Chinese character coding character set;
Match with a pattern matching algorithm: align the font-split zone-bit code A of the sample sensitive word with the left end of the font-split zone-bit code S of the suspected sensitive word, then shift A to the right bit by bit, comparing it with S, until A equals a code segment in S, at which point the match succeeds. The pattern matching algorithm is the KMP (Knuth-Morris-Pratt) matching algorithm.
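The bit-by-bit KMP matching described above can be sketched as follows. This is a generic textbook Knuth-Morris-Pratt search over code strings, shown only to illustrate the prior-art step, not the patent's own implementation:

```python
def build_failure(pattern):
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also its suffix; lets the search skip re-comparisons
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    # Return the first index where pattern occurs in text, or -1.
    # Here `text` plays the role of the suspected-word code S and
    # `pattern` the sample-word code A.
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

Because the scan advances through `text` only once, the search runs in O(|S| + |A|) time, which is what makes KMP preferable to naive shifting; the drawback the inventors point out below is that it is still an exact match.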
Analyzing the prior art, the inventors found the following problems:
1. The prior art must split existing sensitive word samples into components and exhaustively combine them, which increases preprocessing complexity.
2. Bit-by-bit comparison of zone-bit codes is computationally expensive; if the word to be detected is a longer phrase, the computational complexity grows and matching efficiency drops.
3. Bit-by-bit matching of zone-bit codes is an exact matching method: if any disturbance is added to a suspected sensitive word, such as ordinary characters mixed into it, the match fails, so the anti-interference capability of the prior art is weak.
Based on this, the present disclosure provides a sensitive word recognition method, an apparatus, an electronic device, and a storage medium that aim to solve the above technical problems. A first character string and a second character string that have a mapping relationship with the word to be recognized and the sensitive sample word are obtained; the two character strings are vectorized into a first character vector and a second character vector; the cosine similarity of the two vectors is calculated; and whether the word to be recognized is a sensitive word is determined from the resulting similarity, improving the accuracy and efficiency of sensitive word recognition. At the same time, existing sensitive word samples do not need to be split and exhaustively combined in advance, extensible recognition of split-font sensitive words can be achieved on the basis of the existing sensitive words, and the method has better robustness and anti-interference capability.
The present exemplary embodiment will be described in detail below with reference to the drawings and examples.
First, the embodiment of the present disclosure provides a sensitive word recognition method, which can be executed by any electronic device with computing processing capability.
Fig. 2 shows a flowchart of a sensitive word recognition method in an embodiment of the present disclosure. As shown in Fig. 2, the sensitive word recognition method provided by the embodiment of the present disclosure includes the following steps:
s202, respectively acquiring a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, wherein a plurality of words and character strings with mapping relations are stored in the preset coding library;
in this step, the word to be recognized may be a split word, a shape near word or an homophone word of the sensitive word, the sensitive sample word is a combined word corresponding to the word to be recognized, for example, the sensitive sample word is a "poison vending", and the word to be recognized may be a shape near and split deformed font combination such as "rice poison", "shell anti-poison" or "shell anti-feng mother". The method comprises the steps of storing a plurality of words and character strings with mapping relations in a preset coding library, wherein the mapping relations can be five-stroke coding rules, generating character strings corresponding to Chinese characters, storing all the Chinese characters and the character strings corresponding to the Chinese characters in the preset coding library, and respectively obtaining a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from the preset coding library according to the corresponding relation between each Chinese character and the character strings.
S204, respectively preprocessing the first character string and the second character string to obtain a first character vector of a word to be recognized and a second character vector of a sensitive sample word;
in this step, the first character string and the second character string may be processed to obtain a feature character string, and the first character string and the second character string are respectively processed by using the feature character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word.
S206, calculating cosine similarity of the first character vector and the second character vector;
in the step, cosine similarity between a first character vector and a second character vector is calculated according to a cosine theorem, wherein the cosine similarity is measured by cosine values of included angles of the two vectors, the cosine value of an angle of 0 degree is 1, and the cosine values of any other angles are not more than 1; and its minimum value is-1. The cosine of the angle between the two vectors thus determines whether the two vectors point in approximately the same direction. When the two vectors have the same direction, the cosine similarity value is 1; when the included angle of the two vectors is 90 degrees, the value of the cosine similarity is 0; the cosine similarity has a value of-1 when the two vectors point in completely opposite directions.
And S208, determining whether the word to be recognized is a sensitive word or not according to the calculation result.
In this step, the cosine similarity of the first character vector and the second character vector is calculated to obtain the target similarity, which reflects how similar the word to be recognized is to the sensitive sample word: the larger the value, the higher the similarity. A similarity threshold can be set, and the target similarity is compared with the threshold to determine whether the word to be recognized is a sensitive word.
According to the sensitive word recognition method described above, a first character string and a second character string that have a mapping relationship with the word to be recognized and the sensitive sample word are obtained, the two character strings are vectorized into a first character vector and a second character vector, the cosine similarity of the two vectors is calculated, and whether the word to be recognized is a sensitive word is determined according to the obtained cosine similarity, improving the accuracy and efficiency of sensitive word recognition.
In an embodiment of the present disclosure, the first character string and the second character string may be preprocessed into the first character vector of the word to be recognized and the second character vector of the sensitive sample word through the steps shown in Fig. 3, a flowchart of another sensitive word recognition method. The steps may specifically include:
s302, merging and de-duplicating the first character string and the second character string to obtain a characteristic character string;
in this step, a first character string and a second character string are merged, for example, the first character string of the word to be recognized is "iyygfcugnghwxbaltn", the second character string is "ifcylwxnalin", the merged character string is "iyygfcuggnhwxbaitnilxnalin", and the merged character string is subjected to deduplication processing to obtain a characteristic character string "abcfghilntnwxy".
S304, vectorizing the first character string and the second character string by using the characteristic character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word.
In this step, as described in S302, the feature string "abcfghilntuwxy" is used to vectorize the first string "iyygfcugnghwxbaltn" and the second string "ifcylwxnalin". Each character of the feature string is searched for in the first string: if the character is found, it is marked as 1, otherwise it is marked as 0. For example, the character "a" of the feature string is found in the first string, so it is marked as 1. After the traversal ends, the marks are collected to obtain the first character vector "11111111111111" of the word to be recognized. Similarly, each character of the feature string is searched for in the second string, and the marks collected after the traversal yield the second character vector "10110011100111" of the sensitive sample word.
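The merging, de-duplication, and marking steps described above can be sketched in a few lines. This is only a minimal illustration using the example code strings from the description; the function names are not from the patent, and the feature string is sorted alphabetically purely to give a stable, reproducible order.

```python
def feature_string(s1: str, s2: str) -> str:
    # Merge the two code strings, then de-duplicate the characters
    # (sorted for a stable order, matching the example in the text).
    return "".join(sorted(set(s1 + s2)))

def vectorize(s: str, feature: str) -> list[int]:
    # Mark 1 if the feature character occurs in the string, else 0.
    return [1 if ch in s else 0 for ch in feature]

first = "iyygfcugnghwxbaltn"   # code string of the word to be recognized
second = "ifcylwxnalin"        # code string of the sensitive sample word

feat = feature_string(first, second)   # "abcfghilntuwxy"
p = vectorize(first, feat)             # first character vector (all ones)
q = vectorize(second, feat)            # second character vector
```

Because the feature string is built from the union of both code strings, every one of its characters occurs in at least one of them, which is why the first vector here comes out as all ones.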
In an embodiment of the present disclosure, the first character string and the second character string may be vectorized through the steps disclosed in fig. 4 to obtain the first character vector of the word to be recognized and the second character vector of the sensitive sample word. Referring to the flowchart of another sensitive word recognition method shown in fig. 4, the steps may specifically include:
S402, traversing and searching in the first character string by using each character in the characteristic character string, if the character in the characteristic character string is found in the first character string, marking it as 1, otherwise marking it as 0, and obtaining a first character vector of the word to be recognized;
S404, traversing and searching in the second character string by using each character in the characteristic character string, if the character in the characteristic character string is found in the second character string, marking it as 1, otherwise marking it as 0, and obtaining a second character vector of the sensitive sample word.
It should be noted that, during the traversal search, each character of the feature string is searched for in the first character string; if the character is found in the first character string, it is marked as 1, otherwise it is marked as 0. After the traversal ends, the search results are collected to obtain the first character vector of the word to be recognized, P = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1). The same traversal search operation is carried out in the second character string to obtain the second character vector of the sensitive sample word, Q = (1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1). By converting the character strings into character vectors, the relation between the word to be recognized and the sensitive sample word can be judged according to the cosine similarity between the vectors, thereby determining whether the word to be recognized is a sensitive word.
In an embodiment of the present disclosure, before the first character string corresponding to the word to be recognized and the second character string corresponding to the sensitive sample word are respectively obtained from the preset coding library, a step of obtaining the sensitive sample word corresponding to the word to be recognized may also be included: the sensitive sample word corresponding to the word to be recognized is obtained from the preset coding library according to the word to be recognized. The preset coding library further stores a plurality of sensitive words and deformed words of the sensitive words that have a mapping relationship; the word to be recognized belongs to the deformed words of the sensitive words, a deformed word of a sensitive word is a word that is similar in shape to the sensitive word or is obtained by splitting and recombining the sensitive word, and the sensitive sample word belongs to the sensitive words.
According to the method, the sensitive sample word that has a mapping relation with the word to be recognized is obtained from the preset coding library, so the existing sensitive word samples do not need to be split and exhaustively combined in advance. Extensible recognition of split-glyph sensitive words can thus be achieved on the basis of the existing sensitive words, the complexity of splitting and exhaustively combining sensitive words with flag-bit coding and KMP matching algorithms in the prior art is reduced, split characters and combined characters can be accurately matched, and the anti-interference capability of the recognition method is improved.
In one embodiment of the present disclosure, the cosine similarity of the first character vector and the second character vector may be calculated by the following formula:
similar(P, Q) = Σᵢ₌₁ⁿ PᵢQᵢ / ( √(Σᵢ₌₁ⁿ Pᵢ²) · √(Σᵢ₌₁ⁿ Qᵢ²) )

wherein P = (P₁, P₂, …, Pₙ) is the first character vector, Q = (Q₁, Q₂, …, Qₙ) is the second character vector, n is the number of elements in the first character vector and the second character vector and is a positive integer, similar(P, Q) is the target similarity of the first character vector and the second character vector, and i is the element serial number.
For example, the first character string is "iyygfcugnghwxbaltn", the second character string is "ifcylwxnalin", and the characteristic character string is "abcfghilntuwxy", so the first character vector is P = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) and the second character vector is Q = (1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1). The target similarity of the first character vector and the second character vector is calculated with the above formula, where n = 14:

Σᵢ₌₁¹⁴ PᵢQᵢ = 9, √(Σᵢ₌₁¹⁴ Pᵢ²) = √14 ≈ 3.74, √(Σᵢ₌₁¹⁴ Qᵢ²) = √9 = 3

similar(P, Q) = 9 / (3.74 × 3) ≈ 0.80

The similarity of the character vectors of the word to be recognized and the sensitive sample word is thus calculated to be 0.80; if this word to be recognized is to be screened out, the preset similarity threshold can be set to a value not greater than 0.80.
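The 0.80 result can be reproduced with a short sketch of the cosine similarity formula. The function name is illustrative, not from the patent, and the two binary vectors are the ones from the worked example.

```python
import math

def cosine_similarity(p: list[int], q: list[int]) -> float:
    # similar(P, Q) = sum(P_i * Q_i) / (||P|| * ||Q||)
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)

p = [1] * 14                                    # first character vector
q = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1]  # second character vector
sim = cosine_similarity(p, q)                   # 9 / (sqrt(14) * 3) ≈ 0.80
```

For binary vectors such as these, the dot product simply counts the feature characters shared by both code strings, which is why the numerator is 9.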
In an embodiment of the present disclosure, whether the word to be recognized is a sensitive word may be determined through the steps in fig. 5. Referring to the flowchart of another sensitive word recognition method shown in fig. 5, determining whether the word to be recognized is a sensitive word according to the calculation result includes:
S502, if the target similarity is greater than or equal to a preset similarity threshold, determining that the word to be recognized is a sensitive word;
S504, if the target similarity is smaller than the preset similarity threshold, determining that the word to be recognized is a non-sensitive word.
It should be noted that the preset similarity threshold may be a preset value, for example 0.9. If the accuracy of sensitive word screening is to be improved, the threshold may be set to a higher value, such as 0.93, 0.95, or 0.97; if a wider range of sensitive words is to be screened, it may be set to a lower value, such as 0.85, 0.8, or 0.75. The present disclosure can thus change the range and the accuracy of sensitive word screening by adjusting the preset similarity threshold.
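Steps S502 and S504 amount to a single comparison; a minimal sketch follows, where the default threshold of 0.9 is only the example value mentioned in the text.

```python
def is_sensitive(target_similarity: float, threshold: float = 0.9) -> bool:
    # S502: similarity >= threshold -> sensitive word
    # S504: similarity <  threshold -> non-sensitive word
    return target_similarity >= threshold

# Lowering the threshold widens the screening range; raising it improves precision.
```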
In an embodiment of the present disclosure, the mapping relationship between multiple groups of sample words and character strings may be stored in the preset encoding library through the steps disclosed in fig. 6. Referring to the flowchart of another sensitive word recognition method shown in fig. 6, the steps may specifically include:
S602, acquiring a plurality of groups of sample words;
In this step, a sample word is a word that has been determined to be a sensitive word. A sensitive word may be a word with a sensitive political tendency, a violent tendency, unhealthy or uncivilized content, or a special sensitive word set according to the actual situation of the user; the words that may be sensitive words are taken as sample words and collected.
S604, generating character strings corresponding to each group of sample words according to the five-stroke coding rule;
In this step, a character string corresponding to the sample word is generated according to the five-stroke encoding rule of the sample word, and a character string corresponding to each sample word in the plurality of groups of sample words is obtained.
S606, generating a mapping table according to the multiple groups of sample words and the corresponding character strings;
In this step, a mapping table is generated according to the one-to-one correspondence between each sample word and the character string corresponding to it.
S608, storing the mapping table into a preset code library.
Through the embodiment, the first character string corresponding to the word to be recognized and the second character string corresponding to the sensitive sample word can be directly obtained from the preset coding library, so that the process of recognizing the sensitive word is simplified, and the efficiency of recognizing the sensitive word is improved.
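Steps S602 to S608 can be sketched as building a lookup table. The sample-word names and the five-stroke code strings below are placeholders for illustration only: the real code strings come from the five-stroke coding rule, which is not reproduced here.

```python
# Hypothetical sample words mapped to placeholder five-stroke code strings.
sample_words = {
    "sample_word_1": "ifcylwxnalin",
    "sample_word_2": "iyygfcugnghwxbaltn",
}

def build_code_library(words_to_codes: dict[str, str]) -> dict[str, str]:
    # S606/S608: generate and store the one-to-one word -> code-string
    # mapping table as the preset code library.
    return dict(words_to_codes)

library = build_code_library(sample_words)
# During recognition, a code string is fetched by lookup instead of
# being re-encoded each time, which is the efficiency gain claimed above.
code = library["sample_word_1"]
```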
In an embodiment of the present disclosure, another sensitive word recognition method is further provided. Referring to the sensitive word recognition method shown in fig. 7, in specific implementation, the method mainly includes:
Five-stroke code string mapping: according to the five-stroke glyph coding rule set, the sensitive word sample and the suspected sensitive word are respectively mapped into corresponding code strings to obtain a combined character string of the sensitive word sample and a split character string of the suspected sensitive word, where the sensitive word sample is a combined character and the suspected sensitive word is a split character combination.
Feature string generation: the two generated character strings are merged, and the merged character string is then de-duplicated to generate the feature string.
Character vectorization: the combined character string and the split character string are respectively vectorized with the generated feature string to generate a sensitive word character vector and a suspected sensitive word character vector.
Cosine similarity calculation: cosine similarity calculation is performed on the generated sensitive word character vector and the suspected sensitive word character vector to obtain their cosine similarity value.
Threshold judgment: the similarity result is compared with the threshold; if it is greater than the threshold, the suspected sensitive word is judged to be a sensitive word, otherwise it is judged to be a non-sensitive word.
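Taken together, the five stages can be sketched end to end. This is a minimal illustration using the example code strings from the description; the 0.75 threshold and the function name are assumptions for the demo, not values fixed by the patent.

```python
import math

def recognize(split_code: str, combined_code: str, threshold: float) -> bool:
    # 1. Feature string: merge both code strings and de-duplicate.
    feature = sorted(set(split_code + combined_code))
    # 2. Character vectorization against the feature string.
    p = [1 if ch in combined_code else 0 for ch in feature]  # sensitive word vector
    q = [1 if ch in split_code else 0 for ch in feature]     # suspected word vector
    # 3. Cosine similarity of the two binary vectors
    #    (for 0/1 vectors, sum(v) equals sum of squares).
    dot = sum(a * b for a, b in zip(p, q))
    sim = dot / (math.sqrt(sum(p)) * math.sqrt(sum(q)))
    # 4. Threshold judgment: above threshold -> sensitive word.
    return sim > threshold

# Suspected split-character word vs. sensitive sample word from the example.
flagged = recognize("iyygfcugnghwxbaltn", "ifcylwxnalin", threshold=0.75)
```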
This another sensitive word recognition method solves the same technical problem and achieves the same technical effect as the sensitive word recognition method provided above by the present disclosure, which is not repeated herein.
Based on the same inventive concept, the embodiment of the present disclosure further provides a sensitive word recognition apparatus, such as the following embodiments. Because the principle of the embodiment of the apparatus for solving the problem is similar to that of the embodiment of the method, the embodiment of the apparatus can be implemented by referring to the implementation of the embodiment of the method, and repeated details are not described again.
Fig. 8 is a schematic diagram of a sensitive word recognition apparatus in an embodiment of the present disclosure, and as shown in fig. 8, the apparatus includes:
the character string obtaining module 810 is configured to obtain a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, where the preset coding library stores a plurality of words and character strings having mapping relationships;
the preprocessing module 820 is configured to preprocess the first character string and the second character string to obtain a first character vector of a word to be recognized and a second character vector of a sensitive sample word;
a similarity calculation module 830, configured to calculate cosine similarities of the first character vector and the second character vector;
and the result determining module 840 is used for determining whether the word to be recognized is a sensitive word according to the calculation result.
In an embodiment of the present disclosure, the preprocessing module is further configured to perform merging and deduplication processing on the first character string and the second character string to obtain a feature character string; and vectorizing the first character string and the second character string by using the characteristic character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word.
In an embodiment of the present disclosure, the preprocessing module is further configured to perform traversal search on a first character string by using each character in the feature character string, and if a character in the feature character string is found in the first character string, the character is marked as 1, otherwise, the character is marked as 0, so as to obtain a first character vector of a word to be recognized; and traversing and searching each character in the characteristic character string in the second character string, if the character in the characteristic character string is searched in the second character string, marking as 1, otherwise, marking as 0, and obtaining a second character vector of the sensitive sample word.
In an embodiment of the present disclosure, the apparatus further includes a sample word obtaining module, where the sample word obtaining module is configured to obtain a sensitive sample word corresponding to a word to be recognized from a preset coding library according to the word to be recognized before obtaining a first character string corresponding to the word to be recognized and a second character string corresponding to the sensitive sample word from the preset coding library respectively, where the preset coding library further stores a plurality of sensitive words having a mapping relationship and deformed words of the sensitive words, the word to be recognized belongs to the deformed words of the sensitive words, the deformed words of the sensitive words are words that are similar to the shape of the sensitive words or are split and combined with the sensitive words, and the sensitive sample word belongs to the sensitive words.
In an embodiment of the disclosure, the similarity calculation module calculates the cosine similarity between the first character vector and the second character vector according to the following formula:

similar(P, Q) = Σᵢ₌₁ⁿ PᵢQᵢ / ( √(Σᵢ₌₁ⁿ Pᵢ²) · √(Σᵢ₌₁ⁿ Qᵢ²) )

wherein P is the first character vector, Q is the second character vector, n is the number of elements in the first character vector and the second character vector and is a positive integer, similar(P, Q) is the target similarity of the first character vector and the second character vector, and i is the element serial number.
In an embodiment of the present disclosure, the result determining module is further configured to determine that the word to be recognized is a sensitive word if the target similarity is greater than or equal to a preset similarity threshold; and if the target similarity is smaller than a preset similarity threshold, determining that the word to be recognized is a non-sensitive word.
In an embodiment of the present disclosure, the apparatus further includes a mapping table storage module, where the mapping table storage module is configured to obtain a plurality of groups of sample words; generating a character string corresponding to each group of sample words according to the five-stroke coding rule; generating a mapping table according to the multiple groups of sample words and the corresponding character strings; and storing the mapping table into a preset coding library.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," a "module," or a "system."
An electronic device 900 according to this embodiment of the disclosure is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one memory unit 920, and a bus 930 that couples various system components including the memory unit 920 and the processing unit 910.
Where the storage unit stores program code, the program code may be executed by the processing unit 910 to cause the processing unit 910 to perform the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "exemplary methods" section of this specification. For example, the processing unit 910 may perform the following steps of the above-described method embodiments: respectively acquiring a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, wherein a plurality of words and character strings with mapping relations are stored in the preset coding library; respectively preprocessing the first character string and the second character string to obtain a first character vector of a word to be recognized and a second character vector of a sensitive sample word; calculating cosine similarity of the first character vector and the second character vector; and determining whether the word to be recognized is a sensitive word or not according to the calculation result.
In some embodiments, in the electronic device provided in the embodiments of the present disclosure, the processing unit 910 is further configured to: merging and de-duplicating the first character string and the second character string to obtain a characteristic character string; and vectorizing the first character string and the second character string by using the characteristic character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word.
In some embodiments, in the electronic device provided in the embodiments of the present disclosure, the processing unit 910 is further configured to: traversing and searching in the first character string by utilizing each character in the characteristic character string, recording as 1 if the character in the characteristic character string is searched in the first character string, and recording as 0 if not, and obtaining a first character vector of the word to be recognized; and traversing and searching each character in the characteristic character string in the second character string, if the character in the characteristic character string is searched in the second character string, marking as 1, otherwise, marking as 0, and obtaining a second character vector of the sensitive sample word.
In some embodiments, in the electronic device provided in the embodiments of the present disclosure, the processing unit 910 is further configured to: the method comprises the steps of obtaining sensitive sample words corresponding to the words to be recognized from a preset coding library according to the words to be recognized, wherein the preset coding library also stores a plurality of sensitive words with mapping relations and deformed words of the sensitive words, the words to be recognized belong to the deformed words of the sensitive words, the deformed words of the sensitive words are words which are similar to the shapes of the sensitive words or are split and combined with the sensitive words, and the sensitive sample words belong to the sensitive words.
In some embodiments, in the electronic device provided in the embodiments of the present disclosure, the processing unit 910 is further configured to: calculating the cosine similarity of the first character vector and the second character vector by the following formula:
similar(P, Q) = Σᵢ₌₁ⁿ PᵢQᵢ / ( √(Σᵢ₌₁ⁿ Pᵢ²) · √(Σᵢ₌₁ⁿ Qᵢ²) )

wherein P is the first character vector, Q is the second character vector, n is the number of elements in the first character vector and the second character vector and is a positive integer, similar(P, Q) is the target similarity of the first character vector and the second character vector, and i is the element serial number.
In some embodiments, in the electronic device provided in the embodiments of the present disclosure, the processing unit 910 is further configured to: if the target similarity is larger than or equal to a preset similarity threshold, determining the word to be recognized as a sensitive word; and if the target similarity is smaller than a preset similarity threshold, determining that the word to be recognized is a non-sensitive word.
In some embodiments, in the electronic device provided in the embodiments of the present disclosure, the processing unit 910 is further configured to: acquiring a plurality of groups of sample words; generating a character string corresponding to each group of sample words according to the five-stroke coding rule; generating a mapping table according to the multiple groups of sample words and the corresponding character strings; and storing the mapping table into a preset coding library.
The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 9201 and/or a cache memory unit 9202, and may further include a read-only memory unit (ROM) 9203.
Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 940 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium, which may be a readable signal medium or a readable storage medium. On which a program product capable of implementing the above-described method of the present disclosure is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device. For example, when the computer program stored on the computer readable storage medium in the embodiment of the present disclosure is executed by the processor, the following steps of the following method can be realized: respectively acquiring a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, wherein a plurality of words and character strings with mapping relations are stored in the preset coding library; respectively preprocessing the first character string and the second character string to obtain a first character vector of a word to be recognized and a second character vector of a sensitive sample word; calculating cosine similarity of the first character vector and the second character vector; and determining whether the word to be recognized is a sensitive word or not according to the calculation result.
More specific examples of the computer-readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may include a propagated data signal with readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing devices may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to external computing devices (e.g., through the internet using an internet service provider).
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A sensitive word recognition method, comprising:
respectively acquiring a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, wherein a plurality of words and character strings with mapping relations are stored in the preset coding library;
respectively preprocessing the first character string and the second character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word;
calculating cosine similarity of the first character vector and the second character vector;
and determining whether the word to be recognized is a sensitive word or not according to a calculation result.
2. The sensitive word recognition method of claim 1, wherein preprocessing the first character string and the second character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word comprises:
merging and de-duplicating the first character string and the second character string to obtain a characteristic character string;
and vectorizing the first character string and the second character string by using the characteristic character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word.
3. The sensitive word recognition method of claim 2, wherein vectorizing the first character string and the second character string with the feature character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word comprises:
traversing and searching the first character string by using each character in the characteristic character string, if the character in the characteristic character string is searched in the first character string, marking as 1, otherwise, marking as 0, and obtaining a first character vector of the word to be recognized;
and traversing and searching in the second character string by using each character in the characteristic character string, if the character in the characteristic character string is searched in the second character string, marking as 1, otherwise, marking as 0, and obtaining a second character vector of the sensitive sample word.
4. The sensitive word recognition method according to claim 1, wherein before the first character string corresponding to the word to be recognized and the second character string corresponding to the sensitive sample word are respectively obtained from a preset code library, the method further comprises:
and acquiring a sensitive sample word corresponding to the word to be recognized from the preset coding library according to the word to be recognized, wherein the preset coding library further stores a plurality of sensitive words and deformed words of the sensitive words having mapping relations; the word to be recognized belongs to the deformed words of the sensitive words, the deformed words being words whose shape is similar to that of the sensitive words or which are formed by splitting and recombining the sensitive words; and the sensitive sample word belongs to the sensitive words.
5. The sensitive word recognition method of claim 1, wherein the cosine similarity of the first character vector and the second character vector is calculated by the following formula:

$$\mathrm{similar}(P,Q)=\frac{\sum_{i=1}^{n} P_i \times Q_i}{\sqrt{\sum_{i=1}^{n} P_i^{2}} \times \sqrt{\sum_{i=1}^{n} Q_i^{2}}}$$

wherein P is the first character vector, Q is the second character vector, n is the number of elements in the first character vector and the second character vector, n is a positive integer, similar(P, Q) is the target similarity of the first character vector and the second character vector, and i is the element serial number.
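As a sketch of this cosine-similarity calculation (the vectors below are illustrative binary presence vectors, not taken from the patent):

```python
import math

def similar(P, Q):
    # similar(P, Q) = sum(P_i * Q_i) / (sqrt(sum(P_i^2)) * sqrt(sum(Q_i^2)))
    dot = sum(p * q for p, q in zip(P, Q))
    return dot / (math.sqrt(sum(p * p for p in P)) *
                  math.sqrt(sum(q * q for q in Q)))

# For binary presence vectors such as P = [1, 0, 1] and Q = [1, 1, 0],
# the target similarity is 1 / (sqrt(2) * sqrt(2)) = 0.5.
```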
6. The sensitive word recognition method of claim 5, wherein determining whether the word to be recognized is a sensitive word according to the calculation result comprises:
if the target similarity is greater than or equal to the preset similarity threshold, determining the word to be recognized as a sensitive word;
and if the target similarity is smaller than the preset similarity threshold, determining that the word to be recognized is a non-sensitive word.
7. The sensitive word recognition method of claim 1, further comprising:
acquiring a plurality of groups of sample words;
generating a character string corresponding to each group of sample words according to the five-stroke coding rule;
generating a mapping table according to the multiple groups of sample words and the corresponding character strings;
and storing the mapping table into the preset coding library.
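A sketch of building such a mapping table. The per-character codes below are placeholders standing in for the five-stroke (Wubi) encoding rule; real Wubi codes are derived from character structure and are not reproduced here.

```python
# Placeholder per-character codes standing in for the five-stroke (Wubi)
# encoding rule; these are NOT real Wubi codes.
WUBI_CODES = {"字": "pb", "词": "yn"}

def build_mapping_table(sample_word_groups):
    # Generate a character string for each sample word by concatenating
    # per-character codes, then collect word -> string mappings.
    table = {}
    for group in sample_word_groups:
        for word in group:
            table[word] = "".join(WUBI_CODES.get(ch, "?") for ch in word)
    return table

# The resulting table would then be stored in the preset coding library.
```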
8. A sensitive word recognition apparatus, comprising:
the character string acquisition module is used for respectively acquiring a first character string corresponding to a word to be recognized and a second character string corresponding to a sensitive sample word from a preset coding library, wherein the preset coding library stores a plurality of words and character strings with mapping relations;
the preprocessing module is used for preprocessing the first character string and the second character string to obtain a first character vector of the word to be recognized and a second character vector of the sensitive sample word;
the similarity calculation module is used for calculating cosine similarity of the first character vector and the second character vector;
and the result determining module is used for determining whether the word to be recognized is a sensitive word or not according to the calculation result.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the sensitive word recognition method of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the sensitive word recognition method according to any one of claims 1 to 7.
CN202210086774.5A 2022-01-25 2022-01-25 Sensitive word recognition method and device, electronic equipment and storage medium Active CN114707499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210086774.5A CN114707499B (en) 2022-01-25 2022-01-25 Sensitive word recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210086774.5A CN114707499B (en) 2022-01-25 2022-01-25 Sensitive word recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114707499A true CN114707499A (en) 2022-07-05
CN114707499B CN114707499B (en) 2023-10-24

Family

ID=82167299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210086774.5A Active CN114707499B (en) 2022-01-25 2022-01-25 Sensitive word recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114707499B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170806A (en) * 2017-12-28 2018-06-15 东软集团股份有限公司 Sensitive word detection filter method, device and computer equipment
CN109977416A (en) * 2019-04-03 2019-07-05 中山大学 A kind of multi-level natural language anti-spam text method and system
CN110929477A (en) * 2018-09-03 2020-03-27 阿里巴巴集团控股有限公司 Keyword variant determining method and device
CN111859914A (en) * 2020-07-24 2020-10-30 中国平安人寿保险股份有限公司 Sensitive information detection method and device, computer equipment and storage medium
CN112001170A (en) * 2020-05-29 2020-11-27 中国人民大学 Method and system for recognizing deformed sensitive words
CN112364637A (en) * 2020-11-30 2021-02-12 北京天融信网络安全技术有限公司 Sensitive word detection method and device, electronic equipment and storage medium
CN112989815A (en) * 2021-03-23 2021-06-18 平安国际智慧城市科技股份有限公司 Text similarity recognition method, device, equipment and medium based on information interaction

Also Published As

Publication number Publication date
CN114707499B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN107992596B (en) Text clustering method, text clustering device, server and storage medium
CN107767870B (en) Punctuation mark adding method and device and computer equipment
CN110334209B (en) Text classification method, device, medium and electronic equipment
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111814822B (en) Sensitive picture detection method and device and electronic equipment
US20220301334A1 (en) Table generating method and apparatus, electronic device, storage medium and product
US12118770B2 (en) Image recognition method and apparatus, electronic device and readable storage medium
CN115221516B (en) Malicious application program identification method and device, storage medium and electronic equipment
CN110362832B (en) Paragraph merging method and device, storage medium and electronic equipment
CN113420822B (en) Model training method and device and text prediction method and device
CN112818685A (en) Address matching method and device, electronic equipment and storage medium
CN113361523A (en) Text determination method and device, electronic equipment and computer readable storage medium
CN114547301A (en) Document processing method, document processing device, recognition model training equipment and storage medium
CN111738290B (en) Image detection method, model construction and training method, device, equipment and medium
CN117633813A (en) Security vulnerability detection method and device, electronic equipment and storage medium
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN112307183A (en) Search data identification method and device, electronic equipment and computer storage medium
CN114707499B (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN111444319B (en) Text matching method and device and electronic equipment
CN115373982A (en) Test report analysis method, device, equipment and medium based on artificial intelligence
CN111949765B (en) Semantic-based similar text searching method, system, device and storage medium
CN111310442B (en) Method for mining shape-word error correction corpus, error correction method, device and storage medium
CN114298024A (en) Text data enhancement method and device, electronic equipment and storage medium
CN110442714B (en) POI name normative evaluation method, device, equipment and storage medium
CN114297235A (en) Risk address identification method and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220705

Assignee: Tianyiyun Technology Co.,Ltd.

Assignor: CHINA TELECOM Corp.,Ltd.

Contract record no.: X2024110000020

Denomination of invention: Sensitive word recognition methods, devices, electronic devices, and storage media

Granted publication date: 20231024

License type: Common License

Record date: 20240315