CN114386385A - Method, device, system and storage medium for discovering sensitive word derived vocabulary - Google Patents

Publication number
CN114386385A
Authority
CN
China
Prior art keywords
similarity
word
character
chinese character
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210281857.XA
Other languages
Chinese (zh)
Inventor
孟令铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Developer Technology Co ltd
Beijing Innovation Lezhi Network Technology Co ltd
Original Assignee
Changsha Developer Technology Co ltd
Beijing Innovation Lezhi Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Developer Technology Co ltd, Beijing Innovation Lezhi Network Technology Co ltd filed Critical Changsha Developer Technology Co ltd
Priority to CN202210281857.XA priority Critical patent/CN114386385A/en
Publication of CN114386385A publication Critical patent/CN114386385A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, a device, a system and a storage medium for discovering derived vocabulary of sensitive words are provided. The method comprises: judging the similarity between the Chinese characters in a character library based on one or more similarity judgment indexes, including indexes based on computer image vision; dynamically setting a selection threshold corresponding to each of the one or more similarity judgment indexes, selecting a similar Chinese character set for each Chinese character based on the set similarity judgment condition, and creating a similarity word stock dictionary that stores each Chinese character and its corresponding similar Chinese character set; and, for each input sensitive word, traversing the similarity word stock dictionary to generate the derived words corresponding to that sensitive word. The method and the device generate the large number of derived words corresponding to a sensitive word and store them in the sensitive word library in a single operation, which mitigates the difficulty of exhausting the deformed variants of a sensitive word and reduces the heavy manual recording workload of auditors.

Description

Method, device, system and storage medium for discovering sensitive word derived vocabulary
Technical Field
The present disclosure relates to the field of network information technologies, and in particular to a method, an apparatus, a system, and a computer-readable storage medium for discovering derived vocabulary of sensitive words based on the sound order and font of Chinese characters.
Background
With the development of network information technology, netizens can publish their insights and opinions on contemporary politics, literature and art, history and other topics anytime and anywhere. Platforms for information, learning, communication, interaction and sharing, such as CSDN, are typical network information applications on which users can exchange information, share ideas and discuss topics. This brings convenience for information sharing, but it also brings hidden dangers. Because not everyone obeys the internet management laws and regulations established by the state, some people publish harmful content on the network, including pornographic, violent and politically sensitive statements, which damages network security and adversely affects social stability. Harmful speech is mainly composed of harmful sensitive words, so effective measures are urgently needed to detect and filter sensitive words and characters in order to purify the network environment and create a healthy network space. Most current detection methods for sensitive words use simple string matching, such as the KMP algorithm, which finds the positions at which a pattern string occurs in a given target string by exact matching and therefore requires every character of the pattern string to match the target string.
With the rapid growth in the amount of information, the content that needs to be audited has also proliferated. The key point of auditing is whether blogs and comments contain sensitive or forbidden words, or derived variants of them. It is therefore important to expand the sensitive words with variants derived from sound order and font, and to quickly identify whether sensitive words and spam are present in content such as blogs and comments. Accurately finding the derived words of sensitive and forbidden words is thus a technical problem to be solved urgently, and it is the basis of network content review.
In the prior art, when standard sensitive words are collected and a sensitive word text library is constructed, thousands of popular written variants can be derived from a single sensitive word. The workload of drafting these variants and recording them in the word library is huge, and the derived forms, such as homophones and similar-looking characters, are numerous, so the prior art can hardly exhaust the deformed variants of a sensitive word.
Disclosure of Invention
In view of the above technical problems, the present disclosure provides a method, an apparatus, a system, and a computer-readable storage medium for discovering a sensitive word derived vocabulary.
In a first aspect, a method for discovering a word derived from a sensitive word includes:
judging the similarity between the Chinese characters in the character library based on a plurality of similarity judgment indexes;
dynamically setting a selection threshold corresponding to each similarity judgment index in the similarity judgment indexes, selecting a similar Chinese character set of each Chinese character based on the set similarity judgment condition, and creating a similarity word stock dictionary; the similarity word stock dictionary stores each Chinese character and a corresponding similar Chinese character set;
and traversing the similarity word library dictionary by a DFS (depth-first traversal) algorithm aiming at each input sensitive word to generate a derivative word corresponding to the sensitive word.
Further, the similarity judgment index includes font detail similarity, font overall similarity, and sound order similarity.
Further, the font detail similarity comprises stroke number similarity, structure similarity and four-corner code similarity among Chinese characters.
Further, the overall similarity of the character patterns is determined by converting Chinese characters into a gray-scale image by using a model based on computer image vision, extracting the characteristics of the Chinese characters from the gray-scale image to form characteristic vectors, and calculating the cosine similarity and the matrix similarity among the characteristic vectors.
Further, the initial consonant and the final of each Chinese character are extracted, and the sound order similarity is determined according to any one or a combination of the following three evaluation criteria: the initials and the finals are both similar; the initials are the same and the finals are similar; or the finals are the same and the initials are similar.
In a second aspect, an apparatus for discovering a vocabulary derived from sensitive words includes:
the inter-character similarity judging module is used for judging the similarity between the Chinese characters in the character library based on one or more similarity judging indexes;
a similarity word stock dictionary creating module, configured to dynamically set a selection threshold corresponding to each similarity determination index of the one or more similarity determination indexes, select a similar Chinese character set of each Chinese character based on a set similarity determination condition, and create a similarity word stock dictionary; the similarity word stock dictionary stores each Chinese character and a corresponding similar Chinese character set;
and the derived vocabulary generating module is used for traversing the similarity word stock dictionary through a DFS (depth-first traversal) algorithm aiming at each input sensitive word to generate a derived vocabulary corresponding to the sensitive word.
In a third aspect, a system for discovering a word derived from a sensitive word, the system comprising a processor and a memory, the processor executing computer instructions stored in the memory to implement the method of any of the first aspect.
In a fourth aspect, a computer-readable storage medium stores computer instructions for causing a computer system to perform the method of any of the preceding first aspects.
The present disclosure provides a method, a device, a system and a computer-readable storage medium for discovering derived vocabulary of sensitive words. The method, based on the sound order and font of Chinese characters, comprises: judging the similarity between the Chinese characters in a character library based on one or more similarity judgment indexes; dynamically setting a selection threshold corresponding to each of the one or more similarity judgment indexes, selecting a similar Chinese character set for each Chinese character based on the set similarity judgment condition, and creating a similarity word stock dictionary that stores each Chinese character and its corresponding similar Chinese character set; and, for each input sensitive word, traversing the similarity word stock dictionary to generate the derived words corresponding to that sensitive word. This scheme addresses the problems of the prior art in discovering derived vocabulary of sensitive words: high judgment error, low accuracy, difficulty of recording, and the heavy workload of manual recording. The similarity relationship between Chinese characters is established by judging similarity from multiple angles, namely shape, sound and computer-image visual similarity, and the derived vocabulary corresponding to the sensitive words is generated from this relationship, which improves the comprehensiveness and accuracy of the recorded derived words.
The foregoing is only a summary of the technical solution of the present disclosure, given to promote a clear understanding of its technical means; the present disclosure may be embodied in other specific forms without departing from its spirit or essential attributes.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained according to the drawings without creative efforts for those skilled in the art.
FIG. 1: flowchart of discovering sensitive word derived vocabulary based on the sound order and font of Chinese characters according to one embodiment of the present disclosure;
FIG. 2: structure of the device for discovering sensitive word derived vocabulary based on the sound order and font of Chinese characters according to one embodiment of the present disclosure;
FIG. 3: structure of the system for discovering sensitive word derived vocabulary based on the sound order and font of Chinese characters according to one embodiment of the present disclosure;
FIG. 4: structure of the computer-readable storage medium for discovering sensitive word derived vocabulary based on the sound order and font of Chinese characters according to one embodiment of the present disclosure;
FIG. 5: partial view of the similarity dictionary of one embodiment of the present disclosure;
FIG. 6: example diagram of traversing the similarity dictionary of one embodiment of the present disclosure;
FIG. 7: partial view of the Chinese character visual similarity thresholds based on the computer vision image model of one embodiment of the present disclosure;
FIG. 8: partial view of Python code embedded cross-platform into a running Java back end based on the Runtime framework technology;
FIG. 9: another specific flowchart of discovering sensitive word derived vocabulary based on the sound order of Chinese characters.
Detailed Description
The embodiments of the present disclosure are described below with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure in the specification. It is to be understood that the described embodiments are merely illustrative of some, and not restrictive, of the embodiments of the disclosure. The disclosure may be embodied or carried out in various other specific embodiments, and various modifications and changes may be made in the details within the description without departing from the spirit of the disclosure. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present disclosure, and the drawings only show the components related to the present disclosure rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
Fig. 1 is a flowchart of a method for discovering sensitive word derived vocabulary based on the sound order and font of Chinese characters according to an embodiment of the present disclosure. The method may be performed by a device for discovering sensitive word derived vocabulary based on the sound order and font of Chinese characters. The device may be implemented as software, or as a combination of software and hardware, and may be integrated in an electronic device in a data processing system, such as a server or a terminal device.
As shown in fig. 1, a method for discovering a sensitive word-derived vocabulary based on a phonetic sequence font of a chinese character includes the following steps:
step 1, judging the similarity between Chinese characters in a character library based on one or more similarity judgment indexes.
In the prior art, corresponding sensitive word derivative words are determined by converting Chinese characters in a text library and a dictionary into pictures and calculating cosine similarity of the pictures, and similar Chinese characters are found only from the shape of the Chinese characters in the image comparison mode. Therefore, in order to find and determine the similarity relationship between the Chinese characters more accurately and improve the determination accuracy of the derived word, the scheme of the disclosure judges the similarity between the Chinese characters from at least three dimensions:
(1) similarity of font details between Chinese characters: according to the characteristic that the Chinese character belongs to ideographic characters, similar Chinese characters are determined from the character form details, wherein the character form details comprise stroke numbers, Chinese character structures and four-corner codes capable of embodying the shape characteristics of square characters of the Chinese characters;
further, each aspect is assigned its own weight; in one embodiment, for example, font detail similarity = 0.3 × stroke number similarity + 0.4 × Chinese character structure similarity + 0.3 × four-corner code similarity.
(2) Overall font similarity between Chinese characters: in one embodiment, based on a computer image vision model, the Chinese characters in the character library are converted into grayscale images by calling the pygame and opencv image processing packages. Vector representations of all the Chinese characters are obtained and stored in dictionary form, and two parameters are set: cosine_similar (recording the cosine similarity of the vectors of two Chinese characters) and similar_index (recording the matrix similarity of two Chinese characters on their grayscale images). When the values of both parameters exceed a given threshold, such as 0.9, the overall fonts of the two Chinese characters are judged to be similar. In one embodiment, other image comparison methods may be used to determine the overall font similarity between Chinese characters.
(3) Pronunciation and sound order similarity between Chinese characters: in one embodiment, the pronunciation of each Chinese character is obtained by calling the pypinyin Chinese pinyin processing package, the initial and the final of each Chinese character are extracted, and the sound order similarity between Chinese characters is judged according to at least three evaluation criteria: the initials and the finals are both close (for example, a sound order similarity of 0.7); the initials are the same and the finals are close (for example, a sound order similarity of 0.8); or the finals are the same and the initials are close (for example, a sound order similarity of 0.8).
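The three phonetic criteria in item (3) can be illustrated with a minimal Python sketch. It assumes the initial and final of each character have already been extracted (for example with pypinyin); the groups of confusable initials and finals and the function names are hypothetical and do not come from the disclosure, while the scores 0.7 and 0.8 are the example values quoted above.

```python
from typing import List, Set, Tuple

# Hypothetical groups of easily confused initials and finals (assumption).
NEAR_INITIALS: List[Set[str]] = [{"z", "zh"}, {"c", "ch"}, {"s", "sh"}, {"n", "l"}]
NEAR_FINALS: List[Set[str]] = [{"an", "ang"}, {"en", "eng"}, {"in", "ing"}]


def _near(a: str, b: str, groups: List[Set[str]]) -> bool:
    """Return True if a and b differ but belong to the same confusable group."""
    return a != b and any(a in g and b in g for g in groups)


def phonetic_similarity(p1: Tuple[str, str], p2: Tuple[str, str]) -> float:
    """Score two (initial, final) pairs with the three criteria from the text."""
    (i1, f1), (i2, f2) = p1, p2
    if i1 == i2 and f1 == f2:
        return 1.0  # same pronunciation, tone ignored (assumption)
    if i1 == i2 and _near(f1, f2, NEAR_FINALS):
        return 0.8  # same initial, close finals
    if f1 == f2 and _near(i1, i2, NEAR_INITIALS):
        return 0.8  # same final, close initials
    if _near(i1, i2, NEAR_INITIALS) and _near(f1, f2, NEAR_FINALS):
        return 0.7  # both initial and final are close
    return 0.0


print(phonetic_similarity(("z", "an"), ("zh", "ang")))
```

Keeping the confusable groups in plain data tables makes it easy to tune which sounds count as "close" without touching the scoring logic.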
In one embodiment, the similarity relationship between the Chinese characters in the character library is calculated and judged from three angles: the font detail similarity, the overall font similarity, and the pronunciation and sound order similarity between Chinese characters.
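As a rough illustration of the two font-based measures, the sketch below computes the weighted font detail similarity with the example weights 0.3, 0.4 and 0.3, and the cosine similarity between two feature vectors. The function names are hypothetical, the per-aspect similarities are assumed to be already normalized to the range 0 to 1, and the feature vectors stand in for whatever is extracted from the grayscale images.

```python
import math
from typing import Sequence, Tuple


def font_detail_similarity(
    stroke_sim: float,
    structure_sim: float,
    four_corner_sim: float,
    weights: Tuple[float, float, float] = (0.3, 0.4, 0.3),
) -> float:
    """Weighted sum of the three font detail aspects (example weights)."""
    w_stroke, w_structure, w_corner = weights
    return (w_stroke * stroke_sim
            + w_structure * structure_sim
            + w_corner * four_corner_sim)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def overall_font_similar(cosine_similar: float, similar_index: float,
                         threshold: float = 0.9) -> bool:
    """Judge two characters similar only when both values exceed the threshold."""
    return cosine_similar > threshold and similar_index > threshold


print(font_detail_similarity(1.0, 0.5, 0.0))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

How the matrix similarity similar_index is computed is not specified in the text, so it is passed in here as a precomputed value.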
In one embodiment, a machine learning model is employed, and a similarity relationship calculation and judgment model is trained based on sample data. The sample data comprise Chinese characters labeled with known similarity relationships. Chinese character pairs or Chinese character sets whose similarity relationship is to be calculated are input into the trained model, which calculates and judges their similarity relationship. The model is then further trained according to the calculation and judgment results.
Step 2, dynamically setting a selection threshold corresponding to each similarity judgment index in the one or more similarity judgment indexes, selecting a similar Chinese character set of each Chinese character based on the set similarity judgment condition, and creating a similarity word stock dictionary; and the similarity word stock dictionary stores each Chinese character and a corresponding similar Chinese character set.
In one embodiment, a selection threshold is dynamically set for each of the similarity judgment indexes, and similarity judgment conditions are set, for example: the font detail similarity between Chinese characters is greater than a threshold of 0.9, the overall font similarity between Chinese characters is greater than a threshold of 0.9, and the pronunciation and sound order similarity between Chinese characters is greater than a threshold of 0.9. Because both the thresholds and the judgment conditions are set dynamically, the judgment precision can be adjusted in a targeted manner.
The Chinese characters with similar pronunciations and similar fonts in the character library are traversed according to the similarity judgment condition, a similar Chinese character set is selected for each Chinese character, and a similarity word stock dictionary is created. Further, for example, using the dictionary data structure of Python, the set of Chinese characters that are close in sound or shape to each character is stored in a specified file, and the similarity word stock dictionary of each Chinese character is finally generated, as shown in fig. 5.
In one embodiment, the traversal is performed as a DFS (depth-first traversal) using two nested loops, with an algorithmic time complexity of O(n^2), so that the thousands of derived sensitive words that can be derived from a single sensitive word are fully included.
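The dictionary construction described in step 2 can be sketched as a two-level nested loop over the character library. This is only an illustration under assumptions: the function name, the metric callables and the rule that a pair qualifies when any one index exceeds its threshold are hypothetical, since the text leaves the exact judgment condition configurable.

```python
from typing import Callable, Dict, Iterable, List

Metric = Callable[[str, str], float]


def build_similarity_dictionary(
    chars: Iterable[str],
    metrics: Dict[str, Metric],
    thresholds: Dict[str, float],
) -> Dict[str, List[str]]:
    """Map each character to the characters judged similar to it.

    Two nested loops compare every ordered pair, giving O(n^2) comparisons.
    A pair qualifies when any metric exceeds its threshold (assumption).
    """
    char_list = list(chars)
    dictionary: Dict[str, List[str]] = {c: [] for c in char_list}
    for a in char_list:
        for b in char_list:
            if a == b:
                continue
            if any(metrics[name](a, b) > thresholds[name] for name in metrics):
                dictionary[a].append(b)
    return dictionary


# Toy metric for demonstration only: "a" and "b" count as similar.
toy_metrics = {"toy": lambda x, y: 1.0 if {x, y} == {"a", "b"} else 0.0}
toy_thresholds = {"toy": 0.9}
print(build_similarity_dictionary("abc", toy_metrics, toy_thresholds))
```

Passing the thresholds as a separate mapping mirrors the idea of setting them dynamically: the same metrics can be reused with stricter or looser cut-offs.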
Step 3, traversing the similarity word library dictionary for each input sensitive word to generate the derived vocabulary corresponding to the sensitive word.
In one embodiment, sensitive words are input and the generated similarity word library dictionary is searched by DFS (depth-first search): the combinations of near-sound and near-shape Chinese characters for each Chinese character in the sensitive word are traversed by permutation and combination, so that the derived vocabulary of the sensitive words and forbidden words is discovered, and the resulting derived vocabulary sets are stored in a designated file. The way in which the sensitive word derived vocabulary is discovered and generated is shown in fig. 6.
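The depth-first generation of derived words can be sketched as follows. The function name and the placeholder dictionary are hypothetical; the sketch keeps the original character as one option at each position and excludes the unmodified word from the result, which is an assumption about the intended output.

```python
from typing import Dict, List


def generate_derived_words(word: str,
                           similarity_dict: Dict[str, List[str]]) -> List[str]:
    """Enumerate variants of word by substituting similar characters (DFS)."""
    results: List[str] = []

    def dfs(pos: int, prefix: List[str]) -> None:
        if pos == len(word):
            candidate = "".join(prefix)
            if candidate != word:  # skip the original word itself
                results.append(candidate)
            return
        ch = word[pos]
        # Options at this position: the character itself, then its similar set.
        options = [ch] + [s for s in similarity_dict.get(ch, []) if s != ch]
        for option in options:
            prefix.append(option)
            dfs(pos + 1, prefix)
            prefix.pop()

    dfs(0, [])
    return results


# Placeholder letters stand in for Chinese characters and their similar sets.
print(generate_derived_words("ab", {"a": ["x"], "b": ["y", "z"]}))
```

The number of variants is the product of the option counts per position minus one, so it grows quickly with word length; this is consistent with the observation that a single sensitive word can yield thousands of derived forms.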
In one embodiment, another specific way of generating the similarity word library dictionary and of discovering and generating the sensitive word derived vocabulary is shown in fig. 7: the Chinese characters in the character library are converted into pictures and stored as a picture library; a Chinese character stroke number dictionary is established; a character library list of simplified and traditional characters is established; a Chinese character four-corner code dictionary is established; a dictionary of similar sounds and font structures is established; near-sound characters are judged according to their pinyin pronunciation by adopting a probability; the Judge similarity function over the Chinese character corpus is taken as the main function for calculating the similarity between characters; and the similarity word library dictionary is established, and the derived vocabulary corresponding to the sensitive words is discovered and generated by DFS depth-first search using Judge similarity.
In the method for discovering sensitive word derived vocabulary based on the sound order and font of Chinese characters, the similarity between the Chinese characters in the character library is judged based on one or more similarity judgment indexes; a selection threshold corresponding to each of the one or more similarity judgment indexes is dynamically set, a similar Chinese character set is selected for each Chinese character based on the set similarity judgment condition, and a similarity word stock dictionary is created that stores each Chinese character and its corresponding similar Chinese character set; and, for each input sensitive word, the similarity word stock dictionary is traversed to generate the derived words corresponding to that sensitive word. In generating the similarity word stock dictionary, the method borrows ideas from the field of computer vision. From the angle of similar-shaped characters, it considers whether Chinese characters look alike visually and also considers the four-corner code of the balanced radicals; from the angle of similar-sounding characters, it considers the pronunciation of the initial and the final and finds the Chinese characters with similar pronunciation for each character. The resulting similarity word stock dictionary has good universality and high software portability.
In discovering and generating the sensitive word derived vocabulary, the generated similarity word library dictionary is traversed depth-first (DFS), so that the derived words of a sensitive word that may appear on the network are exhaustively covered in a single pass. The coverage is wide, and the discovered results are stored in a specified file, which can be flexibly added to major auditing systems as sensitive words to be intercepted, greatly reducing the workload of content auditing staff.
Fig. 2 shows a device for discovering sensitive word derived vocabulary based on the sound order of Chinese characters according to an embodiment of the present disclosure, including:
the similarity judging module is used for judging the similarity between the Chinese characters in the character library based on one or more similarity judging indexes;
a similarity word stock dictionary generating module, configured to dynamically set a selection threshold corresponding to each similarity determination index in the one or more similarity determination indexes, select a similar Chinese character set of each Chinese character based on the set similarity determination condition, and create a similarity word stock dictionary; the similarity word stock dictionary stores each Chinese character and a corresponding similar Chinese character set;
and the derived vocabulary generating module is used for traversing the similarity word stock dictionary aiming at each input sensitive word and generating a derived vocabulary corresponding to the sensitive word.
The similarity judging module is used for judging the similarity between the Chinese characters in the character library based on one or more similarity judging indexes;
in the prior art, corresponding sensitive word derivative words are determined by converting Chinese characters in a text library and a dictionary into pictures and calculating cosine similarity of the pictures, and similar Chinese characters are found only from the shape of the Chinese characters in the image comparison mode. Therefore, in order to find and determine the similarity relationship between the Chinese characters more accurately and improve the determination accuracy of the derived word, the scheme of the disclosure judges the similarity between the Chinese characters from at least three dimensions:
(1) similarity of font details between Chinese characters: according to the characteristic that the Chinese character belongs to ideographic characters, similar Chinese characters are determined from the character form details, wherein the character form details comprise stroke numbers, Chinese character structures and four-corner codes capable of embodying the shape characteristics of square characters of the Chinese characters;
further, each aspect is assigned its own weight; in one embodiment, for example, font detail similarity = 0.3 × stroke number similarity + 0.4 × Chinese character structure similarity + 0.3 × four-corner code similarity.
(2) Overall font similarity between Chinese characters: in one embodiment, borrowing ideas from computer vision, the Chinese characters in the character library are converted into grayscale images by calling the pygame and opencv image processing packages. Vector representations of all the Chinese characters are obtained and stored in dictionary form, and two parameters are set: cosine_similar (recording the cosine similarity of the vectors of two Chinese characters) and similar_index (recording the matrix similarity of two Chinese characters on their grayscale images). When the values of both parameters exceed a given threshold, for example 0.9, the overall fonts of the two Chinese characters are judged to be similar. In one embodiment, other image comparison methods may be used to determine the overall font similarity between Chinese characters.
(3) Pronunciation and phonetic sequence similarity between Chinese characters: in one embodiment, the pronunciation of each Chinese character is obtained by calling the pypinyin package for Chinese pinyin processing, the initial and the final of each Chinese character are extracted, and the pronunciation and phonetic sequence similarity between Chinese characters is judged according to at least three evaluation indexes: the initials and the finals are both close (for example, a similarity of 0.7); the initials are the same and the finals are close (for example, a similarity of 0.8); the finals are the same and the initials are close (for example, a similarity of 0.8).
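The three evaluation indexes in item (3) can be illustrated with a small scoring function. This sketch assumes the initial and final of each character have already been extracted (for instance with pypinyin) and are passed in as tuples. The tables of "close" initials and finals are hypothetical examples, tones are ignored, and the score of 1.0 for identical pronunciation is an assumption; only the 0.7 and 0.8 values come from the text.

```python
# Hypothetical confusable pairs; a real table would be larger and tuned.
CLOSE_INITIALS = {frozenset(p) for p in [("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l")]}
CLOSE_FINALS = {frozenset(p) for p in [("an", "ang"), ("en", "eng"), ("in", "ing")]}


def _close(a: str, b: str, table) -> bool:
    """True when the unordered pair (a, b) is listed as confusable."""
    return frozenset((a, b)) in table


def phonetic_similarity(p1, p2) -> float:
    """Score two (initial, final) pairs using the example values 0.7 and 0.8."""
    (i1, f1), (i2, f2) = p1, p2
    if i1 == i2 and f1 == f2:
        return 1.0  # assumption: identical pronunciation scores 1.0
    if i1 == i2 and _close(f1, f2, CLOSE_FINALS):
        return 0.8  # same initial, close finals
    if f1 == f2 and _close(i1, i2, CLOSE_INITIALS):
        return 0.8  # same final, close initials
    if _close(i1, i2, CLOSE_INITIALS) and _close(f1, f2, CLOSE_FINALS):
        return 0.7  # both initial and final are close
    return 0.0


same_final = phonetic_similarity(("z", "an"), ("zh", "an"))
both_close = phonetic_similarity(("z", "an"), ("zh", "ang"))
```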
In one embodiment, the similarity relationships between the Chinese characters in the character library are calculated and judged from these three angles: the glyph detail similarity, the overall glyph similarity, and the pronunciation and phonetic sequence similarity between Chinese characters.
A similarity word stock dictionary generating module, configured to dynamically set a selection threshold corresponding to each similarity determination index in the one or more similarity determination indexes, select a similar Chinese character set of each Chinese character based on the set similarity determination condition, and create a similarity word stock dictionary; the similarity word stock dictionary stores each Chinese character and a corresponding similar Chinese character set;
In one embodiment, a selection threshold is dynamically set for each similarity determination index, and similarity determination conditions are set, for example: the glyph detail similarity between Chinese characters is greater than a threshold of 0.9, the overall glyph similarity between Chinese characters is greater than a threshold of 0.9, and the pronunciation and phonetic sequence similarity between Chinese characters is greater than a threshold of 0.9. Because both the thresholds and the determination conditions are set dynamically, the determination precision can be adjusted in a targeted manner.
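One way to keep the thresholds and the determination condition adjustable is sketched below. The `Thresholds` class, the `is_similar` function, and the `require_all` switch are hypothetical names introduced for illustration; the patent does not state whether the per-index conditions are combined with "and" or "or", so the sketch leaves that choice to the caller.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Per-index selection thresholds; 0.9 mirrors the example in the text."""
    glyph_detail: float = 0.9
    glyph_overall: float = 0.9
    phonetic: float = 0.9


def is_similar(scores: dict, t: Thresholds, require_all: bool = False) -> bool:
    """Apply the thresholds; require_all selects AND instead of OR (assumption)."""
    checks = [
        scores["glyph_detail"] > t.glyph_detail,
        scores["glyph_overall"] > t.glyph_overall,
        scores["phonetic"] > t.phonetic,
    ]
    return all(checks) if require_all else any(checks)


# Placeholder scores for one pair of characters.
scores = {"glyph_detail": 0.88, "glyph_overall": 0.93, "phonetic": 0.8}
loose = is_similar(scores, Thresholds())
strict = is_similar(scores, Thresholds(), require_all=True)
```

Raising a single threshold, for example `Thresholds(glyph_overall=0.95)`, tightens only that index and leaves the others untouched.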
According to the similarity determination conditions, the character library is traversed for Chinese characters with similar pronunciations and similar glyphs, the similar Chinese character set of each Chinese character is selected, and the similarity word library dictionary is created. Further, for example, using the dictionary data structure of Python, the set of Chinese characters that are close in sound or shape to each character is stored in a specified file, and the similarity word library dictionary of each Chinese character is finally generated, as shown in fig. 5.
In one embodiment, the traversal is performed with a two-level nested loop, giving an algorithmic time complexity of O(n^2).
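The two-level traversal can be sketched as below. The function name, the predicate interface, and the toy character pairs are assumptions made for illustration; a real predicate would combine the three similarity indexes with their thresholds.

```python
from typing import Callable, Dict, List


def build_similarity_dict(
    chars: List[str], is_similar: Callable[[str, str], bool]
) -> Dict[str, List[str]]:
    """Map every character to the list of characters judged similar to it."""
    similar: Dict[str, List[str]] = {c: [] for c in chars}
    for a in chars:          # outer loop over the character library
        for b in chars:      # inner loop: n * n comparisons, hence O(n^2)
            if a != b and is_similar(a, b):
                similar[a].append(b)
    return similar


# Toy predicate with two hand-picked pairs (hypothetical, for illustration).
TOY_PAIRS = {frozenset("天夭"), frozenset("气汽")}


def toy_is_similar(a: str, b: str) -> bool:
    return frozenset((a, b)) in TOY_PAIRS


similarity_dict = build_similarity_dict(list("天夭气汽"), toy_is_similar)
```

The resulting plain dictionary can be serialized to a file, matching the storage step described above.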
A derived vocabulary generating module, configured to traverse the similarity word stock dictionary for each input sensitive word and generate the derived vocabulary corresponding to the sensitive word.
In one embodiment, a sensitive word is input and, based on the generated similarity word library dictionary, the dictionary is searched by depth-first search (DFS): the various combinations of the Chinese characters that are close in sound or shape to each Chinese character in the sensitive word are traversed by permutation and combination, so that the derived vocabulary of the sensitive words and forbidden words is discovered, derived vocabulary sets of the sensitive words and forbidden words are generated, and the derived vocabulary sets are stored in a specified file. The way in which the sensitive word derived vocabulary is discovered and generated is shown in fig. 6.
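The depth-first enumeration described above can be sketched as follows. The function name and the toy dictionary are assumptions, and the example word is a neutral placeholder rather than an actual sensitive word.

```python
from typing import Dict, List


def derive_words(word: str, similarity_dict: Dict[str, List[str]]) -> List[str]:
    """Enumerate every variant of `word` by depth-first search.

    At each position the search tries the original character first and then
    each of its similar characters; the unmodified word itself is excluded.
    """
    results: List[str] = []

    def dfs(pos: int, prefix: List[str]) -> None:
        if pos == len(word):
            candidate = "".join(prefix)
            if candidate != word:
                results.append(candidate)
            return
        ch = word[pos]
        for option in [ch] + list(similarity_dict.get(ch, [])):
            prefix.append(option)
            dfs(pos + 1, prefix)
            prefix.pop()  # backtrack before trying the next option

    dfs(0, [])
    return results


# Toy similarity dictionary (hypothetical entries).
toy_dict = {"天": ["夭"], "气": ["汽"]}
derived = derive_words("天气", toy_dict)
```

If position i has k_i similar characters, the number of derived variants is the product of (1 + k_i) over all positions, minus one, so the output grows quickly with word length and dictionary density.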
In one embodiment, another specific way of generating the similarity word library dictionary and of discovering and generating the sensitive word derived vocabulary is shown in fig. 7: converting the Chinese characters in the character library into pictures and storing them as a picture library; establishing a Chinese character stroke count dictionary; establishing a character library list of simplified and traditional characters; establishing a Chinese character four-corner code dictionary; establishing a dictionary of similar sounds and glyph structures; judging phonetically similar characters according to their pinyin pronunciation; taking the Judge similarity function over the Chinese character corpus as the main function for calculating the similarity between characters; establishing the similarity word library dictionary; and, using the Judge similarity function, discovering and generating the derived vocabulary corresponding to the sensitive words by DFS depth-first search.
In one example, while the algorithm for deriving sensitive words based on pronunciation and glyph is prepared, and in combination with the specific services of the audit department of CSDN, the algorithm functions written under the PyCharm framework are called from Java by combining a Runtime framework technology with Jython, a Python implementation that runs on the JVM. The Python source code is compiled into JVM bytecode, and the JVM executes the corresponding bytecode. The approach therefore integrates well with the JVM: specified functions or object methods in a Python program are called directly through the Runtime at a finer granularity, and the Jython dependency is added and configured in a pom.
The method is implemented by the apparatus for discovering sensitive word derived vocabulary based on the pronunciation and glyph of Chinese characters: the similarity between the Chinese characters in the character library is judged based on one or more similarity determination indexes; a selection threshold corresponding to each of the one or more similarity determination indexes is dynamically set, a similar Chinese character set of each Chinese character is selected based on the set similarity determination conditions, and a similarity word stock dictionary is created, which stores each Chinese character and its corresponding similar Chinese character set; and the similarity word stock dictionary is traversed for each input sensitive word to generate the derived vocabulary corresponding to that sensitive word. In generating the similarity word stock dictionary, the method borrows ideas from the field of computer vision. From the angle of similar-shaped characters, it considers whether Chinese characters look alike visually and takes into account the four-corner code of the balanced radicals; from the angle of similar-sounding characters, it considers the pronunciation of the initial and the final and finds, for each character, the Chinese characters with similar pronunciation. The resulting similarity word stock dictionary has good universality and high software portability.
In discovering and generating the sensitive word derived vocabulary, the generated similarity word library dictionary is traversed depth-first by DFS, so that the derivative words of a sensitive word that may appear on the network are captured in a single sweep with wide coverage. The discovered results are stored in a specified file and can be flexibly added to any major auditing system as sensitive words to be intercepted, which greatly reduces the workload of content auditing personnel.
Fig. 3 is a system block diagram according to an embodiment of the present disclosure. As shown in fig. 3, the system 30 includes a processor 31 and a memory 32, and the processor executes computer instructions stored in the memory to implement all or part of the steps of the method for discovering sensitive word derived vocabulary based on the pronunciation and glyph of Chinese characters according to the embodiments of the present disclosure.
Fig. 4 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present disclosure. As shown in fig. 4, a computer-readable storage medium 40 according to an embodiment of the present disclosure has non-transitory computer-readable instructions 41 stored thereon. When executed by a processor, the non-transitory computer-readable instructions 41 perform all or part of the steps of the method for discovering sensitive word derived vocabulary based on the pronunciation and glyph of Chinese characters according to the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: constructing a basic page, wherein a page code of the basic page is used for constructing an environment required by the operation of the business page and/or realizing the same abstract workflow in the same business scene; constructing one or more page templates, wherein the page templates are used for providing code templates for realizing service functions in service scenes; generating a final page code of each page of the business scene through code conversion of a specific function of each page of the business scene based on the corresponding page template; and combining the generated final page code of each page into the page code of the basic page to generate the code of the service page.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: constructing a basic page, wherein a page code of the basic page is used for constructing an environment required by the operation of the business page and/or realizing the same abstract workflow in the same business scene; constructing one or more page templates, wherein the page templates are used for providing code templates for realizing service functions in service scenes; generating a final page code of each page of the business scene through code conversion of a specific function of each page of the business scene based on the corresponding page template; and combining the generated final page code of each page into the page code of the basic page to generate the code of the service page.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with features having similar functions that are disclosed in (but not limited to) this disclosure.

Claims (20)

1. A method for discovering a vocabulary derived from sensitive words, comprising:
judging the similarity between the Chinese characters in the character library based on one or more similarity judgment indexes;
dynamically setting a selection threshold corresponding to each similarity judgment index in the one or more similarity judgment indexes, selecting a similar Chinese character set of each Chinese character based on the set similarity judgment condition, and creating a similarity word stock dictionary; the similarity word stock dictionary stores each Chinese character and a corresponding similar Chinese character set;
and traversing the similarity word library dictionary aiming at each input sensitive word to generate a derivative word corresponding to the sensitive word.
2. The method of claim 1, wherein the similarity determination indicators comprise font detail similarity, font overall similarity, and pronunciation sequence similarity.
3. The method of claim 2, wherein the font-detail similarity comprises stroke-number similarity, structural similarity and four-corner-code similarity between Chinese characters.
4. The method of claim 2, wherein the overall similarity of the character patterns is determined by converting Chinese characters into gray-scale images, extracting the characteristics of the Chinese characters from the gray-scale images based on computer image vision, generating a matrix of the images of the gray-scale images corresponding to each character, forming characteristic vectors, and calculating cosine similarity and matrix similarity between the characteristic vectors of each character.
5. The method of claim 2, wherein the initial and the final of each Chinese character are extracted, and the pronunciation sequence similarity is determined according to any one or a combination of three evaluation criteria: (i) the initials and the finals are both similar; (ii) the initials are the same and the finals are similar; or (iii) the finals are the same and the initials are similar.
6. The method as claimed in claim 3, wherein the weight value corresponding to each of the stroke number similarity, the structure similarity and the four-corner code similarity among the Chinese characters is dynamically set, and the font detail similarity is comprehensively calculated.
7. The method of claim 1, wherein the similarity between the Chinese characters in the word stock is calculated using a machine learning model.
8. The method of claim 1, wherein a derived vocabulary for the sensitive words is computed using a machine learning model.
9. The method of claim 7 or 8, wherein the machine learning model is a neural network model.
10. An apparatus for discovering a vocabulary derived from sensitive words, comprising:
an inter-character similarity judging module, configured to judge the similarity between the Chinese characters in the character library based on one or more similarity judgment indexes;
a similarity word stock dictionary creating module, configured to dynamically set a selection threshold corresponding to each similarity judgment index in the one or more similarity judgment indexes, select a similar Chinese character set of each Chinese character based on the set similarity judgment condition, and create a similarity word stock dictionary; the similarity word stock dictionary stores each Chinese character and a corresponding similar Chinese character set;
and a derivative vocabulary generating module, configured to traverse the similarity word stock dictionary through a DFS algorithm for each input sensitive word and generate, in particular, the derivative vocabulary corresponding to the sensitive words in the sensitive word stock.
11. The apparatus of claim 10, wherein the similarity determination index comprises a font detail similarity, a font overall similarity based on computer vision images, and a pronunciation sequence similarity.
12. The apparatus of claim 11, wherein the font detail similarity comprises stroke number similarity between Chinese characters, structural similarity, and similarity of four-corner codes and similarity of feature matrices based on computer vision images.
13. The apparatus of claim 11, wherein the overall similarity of the character patterns is determined by converting a Chinese character into a gray-scale image, extracting the Chinese character features from the gray-scale image to form feature vectors, and calculating cosine similarity and matrix similarity between the feature vectors.
14. The apparatus of claim 11, wherein the initial and the final of each Chinese character are extracted, and the pronunciation sequence similarity is determined according to any one or a combination of three evaluation criteria: (i) the initials and the finals are both similar; (ii) the initials are the same and the finals are similar; or (iii) the finals are the same and the initials are similar.
15. The apparatus of claim 12, wherein weight values corresponding to each of the stroke number similarity, the structural similarity and the four-corner code similarity between the Chinese characters are dynamically set, and the font detail similarity is comprehensively calculated.
16. The apparatus of claim 10, wherein the similarity between the Chinese characters in the word stock is calculated based on a feature matrix of computer vision images by using a machine learning model.
17. The apparatus of claim 10, wherein a machine learning model is used to calculate the similarity between the Chinese characters in the word stock based on the feature matrix of the computer vision image, and the derived sensitive words for searching the sensitive words are calculated by the DFS algorithm.
18. The apparatus of claim 16 or 17, wherein the machine learning model is a neural network model.
19. A system for discovery of a vocabulary derived from sensitive words, the system comprising a processor and a memory, the processor executing computer instructions stored in the memory to implement the method of any of claims 1-9.
20. A computer-readable storage medium storing non-transitory computer-readable instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 1-9.
CN202210281857.XA 2022-03-22 2022-03-22 Method, device, system and storage medium for discovering sensitive word derived vocabulary Pending CN114386385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210281857.XA CN114386385A (en) 2022-03-22 2022-03-22 Method, device, system and storage medium for discovering sensitive word derived vocabulary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210281857.XA CN114386385A (en) 2022-03-22 2022-03-22 Method, device, system and storage medium for discovering sensitive word derived vocabulary

Publications (1)

Publication Number Publication Date
CN114386385A true CN114386385A (en) 2022-04-22

Family

ID=81206197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210281857.XA Pending CN114386385A (en) 2022-03-22 2022-03-22 Method, device, system and storage medium for discovering sensitive word derived vocabulary

Country Status (1)

Country Link
CN (1) CN114386385A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407179A (en) * 2016-08-26 2017-02-15 福建网龙计算机网络信息技术有限公司 Chinese character pattern similarity calculation method and system thereof
CN111079379A (en) * 2019-12-03 2020-04-28 微梦创科网络科技(中国)有限公司 Shape and proximity character acquisition method and device, electronic equipment and storage medium
CN111209447A (en) * 2019-02-27 2020-05-29 山东大学 Chinese character string similarity calculation method and device based on sound-shape codes
CN112001170A (en) * 2020-05-29 2020-11-27 中国人民大学 Method and system for recognizing deformed sensitive words
CN112329390A (en) * 2020-09-30 2021-02-05 海南大学 Chinese word similarity detection algorithm based on sound, shape and meaning
CN112990353A (en) * 2021-04-14 2021-06-18 中南大学 Chinese character confusable set construction method based on multi-mode model
CN113822059A (en) * 2021-09-18 2021-12-21 北京云上曲率科技有限公司 Chinese sensitive text recognition method and device, storage medium and equipment
CN113988061A (en) * 2021-10-22 2022-01-28 平安国际智慧城市科技股份有限公司 Sensitive word detection method, device and equipment based on deep learning and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周昊等: "基于改进音形码的中文敏感词检测算法", 《南京大学学报(自然科学)》 *
山阴少年: "NLP(四十七)文本纠错之获取形近字", 《HTTPS://BLOG.CSDN.NET/JCLIAN91/ARTICLE/DETAILS/118345177》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220422