CN115081440A - Method, device and equipment for recognizing variant words in text and extracting original sensitive words - Google Patents
- Publication number
- CN115081440A (application CN202210860492.6A)
- Authority
- CN
- China
- Prior art keywords
- words
- text
- variant
- word
- recognized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Character Discrimination (AREA)
Abstract
The application relates to a method, a device, and equipment for recognizing variant words in text and extracting original sensitive words. The method comprises: performing original-word search and matching on a text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant-word verification on the text according to the matching result; if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin, traversing and comparing the character strings, and joining the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; traversing the variant word against the sensitive word bank and comparing character strings, adding spaces to the left and right of the original sensitive word's pinyin in the variant word, and applying regular-expression processing to the segmented variant word to obtain an array; and extracting the original sensitive word from the array according to its position. The method can improve the accuracy of extracting original sensitive words.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a computer device, and a storage medium for identifying variant words in a text and extracting original sensitive words.
Background
With the development of internet technology, live-video bullet comments, community forum messages, and in-app private messages have appeared. Text such as user comments and private messages needs to be checked, and once illegal content submitted by a user is found, it is audited automatically and filtered in real time, so as to safeguard the user experience of the product.
However, existing sensitive word detection techniques match and verify only against the original words in a sensitive word bank, and none of them detects pinyin, homophones, or complex mixed Chinese-English input. Meanwhile, users of various internet applications have been driven to invent all kinds of variant vocabulary. For example, when the sensitive word is 'WeChat number' (literally 'micro signal code'), a user may instead type a homophone of it such as 'heart only and ma'. If the user types a variant of the kind 'wei heart x ha horse', a traditional recognition scheme can neither recognize nor extract it, because the sensitive word has been split up or Chinese characters and pinyin have been mixed, so the accuracy of extracting the original sensitive word is low.
Disclosure of Invention
In view of the foregoing, there is a need to provide a method, an apparatus, a computer device, and a storage medium for identifying variant words in a text and extracting original sensitive words, which can improve the accuracy of extracting original sensitive words.
A method for identifying variant words in text and extracting original sensitive words, the method comprising:
acquiring a text to be recognized;
performing original-word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant-word verification on the text according to the matching result to obtain a verification result; the verification result includes the fact that the text to be recognized contains variant words, and the lengths of the variant words;
if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin, then traversing and comparing the character strings, and joining the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant word against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the original sensitive word's pinyin in the variant word to obtain a segmented variant word;
applying regular-expression processing to the segmented variant word to obtain an array;
and extracting the original sensitive word from the array according to the position of the original sensitive word.
In one embodiment, extracting the original sensitive word from the array according to the position of the original sensitive word includes:
determining the position of the original sensitive word in the array from the position at which the original sensitive word first appears in the variant word's pinyin and the positions of the hyphens, and extracting the original sensitive word from the array using the length of the identified original sensitive word.
In one embodiment, performing original-word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant-word verification on the text according to the matching result to obtain a verification result, includes:
performing original-word search and matching on the text to be recognized using the pre-constructed sensitive word bank; if a sensitive word is found, the matching succeeds and the result is output; if the matching fails, performing homophone and variant-word verification on the text to be recognized to obtain the verification result.
In one embodiment, performing homophone and variant-word verification on the text to be recognized to obtain a verification result includes:
converting the text to be recognized and the sensitive word bank into pinyin through ASCII codes and performing sensitive word matching; if the matching succeeds, extracting the original sensitive word by hyphen segmentation from the text to be recognized in which each Chinese character has been converted into pinyin;
if the matching fails, performing a character string search on the text to be recognized in which each Chinese character has been converted into pinyin, to obtain the variant words in the text; a variant word is a mixed Chinese-English phrase that contains a sensitive word.
In one embodiment, extracting the original sensitive word by hyphen segmentation from the text to be recognized in which each Chinese character has been converted into pinyin includes:
calculating the position of the original sensitive word's pinyin in the pinyin-converted text to be recognized and the number of characters in the original sensitive word, and determining the position of the original sensitive word in the text to be recognized from the position of its pinyin and the number of hyphens before that position;
and extracting the original sensitive word using its position in the text to be recognized and its number of characters.
In one embodiment, if the text to be recognized contains a special symbol, the text is divided at the special symbol into a first text to be recognized and a second text to be recognized, and sensitive word recognition and original sensitive word extraction are performed separately on each.
An apparatus for identifying variant words in text and extracting original sensitive words, the apparatus comprising:
a sensitive word verification module, configured to acquire a text to be recognized; perform original-word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and perform homophone and variant-word verification on the text according to the matching result to obtain a verification result; the verification result includes the fact that the text to be recognized contains variant words, and the lengths of the variant words;
a sensitive word traversal and comparison module, configured to, if the text to be recognized contains variant words, convert each Chinese character in the sensitive word bank and in the variant words into pinyin, then traverse and compare the character strings, and join the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
an original sensitive word extraction module, configured to traverse the variant word against the sensitive word bank and compare character strings, and add spaces to the left and right of the original sensitive word's pinyin in the variant word to obtain a segmented variant word; apply regular-expression processing to the segmented variant word to obtain an array; and extract the original sensitive word from the array according to the position of the original sensitive word.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a text to be recognized;
performing original-word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant-word verification on the text according to the matching result to obtain a verification result; the verification result includes the fact that the text to be recognized contains variant words, and the lengths of the variant words;
if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin, then traversing and comparing the character strings, and joining the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant word against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the original sensitive word's pinyin in the variant word to obtain a segmented variant word;
applying regular-expression processing to the segmented variant word to obtain an array;
and extracting the original sensitive word from the array according to the position of the original sensitive word.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, carrying out the following steps:
acquiring a text to be recognized;
performing original-word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant-word verification on the text according to the matching result to obtain a verification result; the verification result includes the fact that the text to be recognized contains variant words, and the lengths of the variant words;
if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin, then traversing and comparing the character strings, and joining the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant word against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the original sensitive word's pinyin in the variant word to obtain a segmented variant word;
applying regular-expression processing to the segmented variant word to obtain an array;
and extracting the original sensitive word from the array according to the position of the original sensitive word.
The method first uses a pre-constructed sensitive word bank to perform original-word search and matching on the text to be recognized, and performs homophone and variant-word verification on the text according to the matching result, so that mixed Chinese-English variant words containing sensitive words are recognized. Each Chinese character in the sensitive word bank and in the variant words is then converted into pinyin, the character strings are traversed and compared, and the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word are joined with hyphens to obtain the position of the original sensitive word, the original sensitive word being the sensitive word contained in the variant word. The variant word is traversed against the sensitive word bank and the character strings are compared, spaces are added to the left and right of the original sensitive word's pinyin in the variant word to obtain a segmented variant word, regular-expression processing is applied to the segmented variant word to obtain an array, and the original sensitive word is extracted from the array according to its position. This overcomes the limitation that traditional approaches can detect only the original words and homophones of sensitive words: sensitive words written as a mixture of Chinese characters and pinyin letters can be matched, and in that scenario the original sensitive word in the text to be recognized can be extracted from the match.
Drawings
FIG. 1 is a flowchart illustrating a method for identifying variant words in a text and extracting original sensitive words according to an embodiment;
FIG. 2 is a schematic diagram of the overall flow of scheme matching of the present application in one embodiment;
FIG. 3 is a diagram illustrating variant word recognition and original sensitive word extraction performed in the present application, according to an embodiment;
FIG. 4 is a block diagram of an apparatus for identifying variant words in text and extracting original sensitive words according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a method for identifying a variant word in a text and extracting an original sensitive word is provided, which includes the following steps:
Step 102: acquiring a text to be recognized; performing original-word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant-word verification on the text according to the matching result to obtain a verification result; the verification result includes the fact that the text to be recognized contains variant words, and the lengths of the variant words.
As shown in fig. 2, the pre-constructed sensitive word bank contains all the Chinese sensitive words to be recognized. The sensitive word bank is first used to perform original-word matching on the text to be recognized. If the matching succeeds, the original sensitive word is output directly. If it fails, the text is checked for Chinese homophones: the text to be recognized and the pre-constructed sensitive word bank are both converted into pinyin through ASCII codes and matched, and a successful match means that the text contains a Chinese homophone. If this matching also fails, variant-word recognition is performed: all Chinese characters in the text to be recognized and in the original sensitive words of the sensitive word bank are converted into pinyin through ASCII codes, with no separator between the pinyin of successive characters, and a character string search is performed on the converted pinyin against the sensitive word bank. A hit indicates that a variant word mixing Chinese and English is present.
As shown in fig. 3, the text to be recognized is laozidasi, the sensitive word is killed (the sensitive word recognized by the present application is a violent word, so this example is selected), if the original word matching and the chinese matching of the homophone fail, and the variant word recognition succeeds, if the converted pinyin is divided as the homophone extracting the original sensitive word, the variant word will become xiao-xin-l-a-o-z-i-d-a-s-i-ni, and the sensitive word extraction will fail, the present application obtains the position xiao-xin-laozi-da-si-ni of the original sensitive word by connecting the pinyin corresponding to the chinese in the variant word and the pinyin corresponding to the original sensitive word by middle-stroke lines, and then the variant word is traversed and compared with the character string, adding spaces to the left and right of pinyin of an original sensitive word in a variant word for segmentation to obtain 'careful laozi da si you', and then performing regularization expression processing to obtain an array [ 'small', 'heart', 'laozi', 'da', 'si', 'ni' ], wherein the position of the original sensitive word can be known to have an initial position of 16, a middle-drawn line is 3, an end position is 24, three middle-drawn lines are formed between the positions 16 and 24, the position of the original sensitive word at 4-6 of the array can be known, and the original sensitive word 'dai you' is obtained by extraction.
In the above method for identifying variant words in text and extracting original sensitive words, a pre-constructed sensitive word bank is first used to perform original-word search and matching on the text to be recognized, and homophone and variant-word verification is performed on the text according to the matching result, so that mixed Chinese-English variant words containing sensitive words are recognized. Each Chinese character in the sensitive word bank and in the variant words is then converted into pinyin, the character strings are traversed and compared, and the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word are joined with hyphens to obtain the position of the original sensitive word, the original sensitive word being the sensitive word contained in the variant word. The variant word is traversed against the sensitive word bank and the character strings are compared, spaces are added to the left and right of the original sensitive word's pinyin in the variant word to obtain a segmented variant word, regular-expression processing is applied to the segmented variant word to obtain an array, and the original sensitive word is extracted from the array according to its position. This overcomes the limitation that traditional approaches can detect only the original words and homophones of sensitive words: sensitive words written as a mixture of Chinese characters and pinyin letters can be matched, and in that scenario the original sensitive word in the text to be recognized can be extracted from the match.
In one embodiment, extracting the original sensitive word from the array according to the position of the original sensitive word includes:
determining the position of the original sensitive word in the array from the position at which the original sensitive word first appears in the variant word's pinyin and the positions of the hyphens, and extracting the original sensitive word from the array using the length of the identified original sensitive word.
In one embodiment, performing original-word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant-word verification on the text according to the matching result to obtain a verification result, includes:
performing original-word search and matching on the text to be recognized using the pre-constructed sensitive word bank; if a sensitive word is found, the matching succeeds and the result is output; if the matching fails, performing homophone and variant-word verification on the text to be recognized to obtain the verification result.
In one embodiment, performing homophone and variant-word verification on the text to be recognized to obtain a verification result includes:
converting the text to be recognized and the sensitive word bank into pinyin through ASCII codes and performing sensitive word matching; if the matching succeeds, extracting the original sensitive word by hyphen segmentation from the text to be recognized in which each Chinese character has been converted into pinyin;
if the matching fails, performing a character string search on the text to be recognized in which each Chinese character has been converted into pinyin, to obtain the variant words in the text; a variant word is a mixed Chinese-English phrase that contains a sensitive word.
In one embodiment, extracting the original sensitive word by hyphen segmentation from the text to be recognized in which each Chinese character has been converted into pinyin includes:
calculating the position of the original sensitive word's pinyin in the pinyin-converted text to be recognized and the number of characters in the original sensitive word, and determining the position of the original sensitive word in the text to be recognized from the position of its pinyin and the number of hyphens before that position;
and extracting the original sensitive word using its position in the text to be recognized and its number of characters.
In a specific embodiment, the example input is a homophone rendering of 'add me on WeChat', whose pinyin is: jia-wo-wei-xin-ba
Example original sensitive word: WeChat (微信), whose pinyin is: wei-xin
The extraction proceeds as follows:
The pinyin of the original sensitive word is found at position 8 of the input pinyin, the original sensitive word is 2 characters long, and two hyphens appear before position 8, so the sensitive word is the 3rd to 4th Chinese characters of the input, which read as WeChat.
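The hyphen-counting arithmetic of this example can be sketched as below. The function name `extract_homophone`, the small `PINYIN` table, and the input 加我威信吧 are assumptions chosen for illustration (the input is one possible homophone spelling whose pinyin is jia-wo-wei-xin-ba); the sketch also assumes every input character is Chinese and present in the table.

```python
from typing import Optional

# Hypothetical pinyin table covering only the characters of this example.
PINYIN = {"加": "jia", "我": "wo", "威": "wei", "微": "wei", "信": "xin", "吧": "ba"}


def extract_homophone(text: str, word: str) -> Optional[str]:
    """Return the characters of `text` that sound like `word`, or None."""
    text_py = "-".join(PINYIN[ch] for ch in text)   # e.g. jia-wo-wei-xin-ba
    word_py = "-".join(PINYIN[ch] for ch in word)   # e.g. wei-xin
    pos = text_py.find(word_py)                     # 0-based; the text counts from 1
    if pos < 0:
        return None
    # Each hyphen before the match corresponds to one preceding character.
    start = text_py.count("-", 0, pos)
    return text[start:start + len(word)]
```

Here `str.find` returns the 0-based offset 7, which is position 8 in the 1-based counting used above; two hyphens precede it, so the slice covers the 3rd and 4th characters.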
In one embodiment, if the text to be recognized contains a special symbol, the text is divided at the special symbol into a first text to be recognized and a second text to be recognized, and sensitive word recognition and original sensitive word extraction are performed separately on each.
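A minimal sketch of the splitting step follows. The source does not define which characters count as special symbols, so this sketch assumes that anything other than an ASCII letter, a digit, or a character in the CJK Unified Ideographs block (U+4E00 to U+9FFF) is a boundary; `split_on_special` is a hypothetical helper name.

```python
import re

# Assumption: a "special symbol" is any character that is not an ASCII
# letter, a digit, or a CJK Unified Ideograph.
_SPECIAL = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]+")


def split_on_special(text: str) -> list[str]:
    """Split `text` at special symbols and drop empty fragments."""
    return [part for part in _SPECIAL.split(text) if part]
```

Each returned fragment would then be passed through recognition and extraction on its own. With more than one special symbol the sketch simply yields more than two fragments.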
In one embodiment, as shown in fig. 3, the text to be recognized is laozidasi, and the sensitive word is dead, and the processes of sensitive word recognition and original sensitive word extraction using the present application are as follows:
step 1: it is verified whether a sensitive word is present in the input.
Step 1.1: verifying whether the original word exists;
carrying out character string searching and matching on the user input and the source sensitive words, if the character string searching and matching exists, carrying out an extraction step 2.1, and if the character string searching and matching does not exist, continuing the step 1.2;
example results: the matching fails.
Step 1.2: and verifying whether homophone Chinese exists.
Step 1.2.1: converting all Chinese input by a user and source sensitive words needing to be matched into pinyin through ASCII codes, and dividing each Chinese converted pinyin by a middle-drawn line (-);
step 1.2.2: carrying out character string searching and matching on the converted pinyin, if the converted pinyin shows that homophone Chinese exists, carrying out an extraction step 2.2, and if the converted pinyin does not show that homophone Chinese exists, continuing the step 1.3;
example input pinyin: xiao-xin-laozidasi-ni;
example source sensitive word pinyin: da-si-ni;
example results: the matching fails.
Step 1.3: Verify whether a mixed Chinese-English variant word is present.
Step 1.3.1: All Chinese characters in the user input and in the source sensitive word to be matched are converted into pinyin through ASCII codes, with no separator between the pinyin of successive characters.
Step 1.3.2: A character string search is performed on the converted pinyin. A hit indicates that a mixed Chinese-English variant word is present, and step 2.3 is performed; otherwise the input contains no sensitive word, and step 3 is performed to return the data.
Example input pinyin: xiaoxinlaozidasini;
Example source sensitive word pinyin: dasini;
Example result: the matching succeeds.
Step 2: Extract the source sensitive word from the input.
Step 2.1: This step is reached when verification passes at step 1.1. The sensitive word itself is present in the input, so no further extraction is needed: the sensitive word is the source sensitive word, and data processing step 3 is performed.
Step 2.2: This step is reached when verification passes at step 1.2. A homophone of the sensitive word is present in the input, and data processing step 3 is performed.
Step 2.3: This step is reached when verification passes at step 1.3. A mixed Chinese-English variant word is present in the input.
Step 2.3.1: Process the input pinyin. Each pinyin syllable of the original sensitive word is traversed and compared against the character string of the input pinyin, and the pinyin syllables of the original sensitive word that appear in the input are joined with hyphens. (Because step 1.3 succeeded, every pinyin syllable of the original sensitive word must be present in the input pinyin. This step is needed because a computer cannot tell whether a run of consecutive letters is the pinyin of Chinese characters and does not know where to split it, so the segmentation has to be driven by this traversal.)
Input pinyin: xiao-xin-laozi-da-si-ni.
Step 2.3.2, processing the Chinese input by the user, traversing each pinyin of the original sensitive word pinyin, comparing and processing character strings with the input by the user, and adding space segmentation to the left and right of the pinyin of the original sensitive word appearing in the input;
inputting Chinese: care was taken by laozi da si you.
Step 2.3.3, the processed input Chinese in the step 2.3.2 is processed through a regular expression, the Chinese and continuous letters are divided and converted into arrays, and the following steps are convenient to extract
Inputting a Chinese array: [ "small", "heart", "laozi", "da", "si", "you" ].
Step 2.3.4: formally extracting the sensitive words, firstly calculating the positions of the original sensitive words appearing in the input pinyin, wherein the first appearing position is 16 in the example, and the number of times of appearing middle-drawn lines before the first appearing is 3. Therefore, the sensitive word should be the fourth position in the array, and the length of the sensitive word is three characters as the length of the original sensitive word, so the hit sensitive word should be [ "da", "si", "you" ] in the Chinese array, and converted into a character string through the array: daii you, then proceed to data processing step 3.
Step 3: Data processing. The sensitive word appearing in the input, its pinyin, the original sensitive word, and the original sensitive word's pinyin obtained in step 2 are returned, and technical personnel process the data according to their own requirements.
Compared with traditional methods, this approach greatly reduces the size of the sensitive word bank, because the many variants of each source sensitive word no longer need to be configured. In addition, the data finally returned contains the sensitive word, the sensitive word's pinyin, the original sensitive word, the original sensitive word's pinyin, and other data a technical user needs, so the user can perform processing such as filtering, color marking, or replacement as required.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided an apparatus for identifying a variant word in a text and extracting an original sensitive word, including: a sensitive word verification module 402, a sensitive word traversal and comparison module 404, and an original sensitive word extraction module 406, wherein:
a sensitive word verification module 402, configured to obtain a text to be recognized; perform original word search matching on the text to be recognized by using a pre-constructed sensitive word library, and perform homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result includes the fact that the text to be recognized contains variant words, and the lengths of the variant words;
a sensitive word traversal and comparison module 404, configured to, if the text to be recognized contains a variant word, convert each Chinese character in the sensitive word library and in the variant word into pinyin and then perform traversal and character string comparison, and join the pinyin corresponding to the Chinese characters in the variant word with the pinyin corresponding to the original sensitive word using hyphens (middle-drawn lines) to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
an original sensitive word extraction module 406, configured to traverse the variant word against the sensitive word library and compare character strings, and add spaces to the left and right of the pinyin of the original sensitive word in the variant word to segment it, obtaining a segmented variant word; apply regular expression processing to the segmented variant word to obtain an array; and extract the original sensitive word from the array according to the position of the original sensitive word.
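The segmentation performed by the extraction module can be sketched as below: spaces are placed on both sides of the sensitive word's pinyin, and a regular expression then breaks the string into an array. The function name `segment_variant`, the sample string, and the choice of pattern are assumptions for illustration; the original text does not specify the regular expression used.

```python
import re


def segment_variant(pinyin_text, word_pinyin):
    # Surround the sensitive word's pinyin with spaces so it is isolated
    # from any Latin letters that happen to be adjacent to it.
    spaced = pinyin_text.replace(word_pinyin, f" {word_pinyin} ")
    # Collect every run of characters that is neither whitespace nor a hyphen.
    return re.findall(r"[^\s-]+", spaced)


print(segment_variant("ni-haopingguo-pai", "pingguo"))
# ['ni', 'hao', 'pingguo', 'pai']
```

Without the added spaces, `haopingguo` would remain a single array element and the sensitive word could not be picked out by index.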
In one embodiment, the original sensitive word extraction module 406 is further configured to extract the original sensitive word from the array according to its position, including:
determining the position of the original sensitive word in the array according to the position at which the original sensitive word first appears in the pinyin of the variant word and the positions of the hyphens, and extracting the original sensitive word from the array using the length of the identified original sensitive word.
In one embodiment, the sensitive word verification module 402 is further configured to perform original word search matching on the text to be recognized by using a pre-constructed sensitive word library, and perform homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result, including:
performing original word search matching on the text to be recognized by using the pre-constructed sensitive word library; if a sensitive word is found, outputting that the matching succeeded; and if the matching fails, performing homophone and variant word verification on the text to be recognized to obtain the verification result.
In one embodiment, the sensitive word verification module 402 is further configured to perform homophone and variant word verification on the text to be recognized to obtain a verification result, including:
converting the text to be recognized and the sensitive word library into pinyin by means of ASCII codes and performing sensitive word matching; if the matching succeeds, extracting the original sensitive word from the text to be recognized in which each Chinese character has been converted into pinyin and the syllables are separated by hyphens;
if the matching fails, performing a character string search on the text to be recognized in which each Chinese character has been converted into pinyin, to obtain the variant words in the text to be recognized; a variant word is a mixed Chinese-English phrase that contains a sensitive word.
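The staged check described in the preceding embodiments (exact search first, then a homophone match on pinyin, then a plain string search for mixed variants) can be sketched as follows. Everything here is an assumption for illustration: the tiny `PINYIN` table stands in for a real converter, the names `syllables` and `classify` are hypothetical, and the original text does not spell out how the variant stage searches the string.

```python
# Toy pinyin table, an assumption for illustration; a real system would use a full dictionary.
PINYIN = {"我": "wo", "吃": "chi", "苹": "ping", "平": "ping",
          "果": "guo", "你": "ni", "好": "hao"}


def syllables(text):
    # Characters missing from the table (for example Latin letters) pass through unchanged.
    return [PINYIN.get(ch, ch) for ch in text]


def classify(text, lexicon):
    # Stage 1: exact search for the original word.
    for word in lexicon:
        if word in text:
            return "exact", word
    # Stage 2: homophone check on hyphen-joined pinyin, padded so matches align to syllables.
    text_py = "-" + "-".join(syllables(text)) + "-"
    for word in lexicon:
        if "-" + "-".join(syllables(word)) + "-" in text_py:
            return "homophone", word
    # Stage 3: variant check by substring search on the concatenated pinyin.
    flat = "".join(syllables(text))
    for word in lexicon:
        if "".join(syllables(word)) in flat:
            return "variant", word
    return "none", None


lexicon = ["苹果"]
print(classify("我吃苹果", lexicon))    # ('exact', '苹果')
print(classify("我吃平果", lexicon))    # ('homophone', '苹果')
print(classify("我吃ping果", lexicon))  # ('variant', '苹果')
```

The third stage ignores syllable boundaries, so in this simplified form it can also match pinyin that merely spans two neighbouring syllables; a production system would need an additional check.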
In one embodiment, the sensitive word verification module 402 is further configured to extract the original sensitive word from the text to be recognized in which each Chinese character has been converted into pinyin and the syllables are separated by hyphens, including:
calculating the position and the number of characters of the original sensitive word's pinyin in the pinyin-converted text to be recognized, and determining the position of the original sensitive word in the text to be recognized according to the position of its pinyin and the number of hyphens before that position;
extracting the original sensitive word using its position in the text to be recognized and its number of characters.
In one embodiment, if the text to be recognized contains a special symbol, the text to be recognized is divided, with the special symbol as the boundary, into a first text to be recognized and a second text to be recognized, and sensitive word recognition and original sensitive word extraction are performed on the first text and the second text separately.
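A minimal sketch of this splitting step is shown below. The definition of a "special symbol" is an assumption made here (any run of characters that is neither a word character nor whitespace), as is the function name `split_on_special`; the original text does not define the symbol set.

```python
import re

# Assumption: a special symbol is any run of characters that is
# neither a word character (letters, digits, CJK, underscore) nor whitespace.
SPECIAL = re.compile(r"[^\w\s]+")


def split_on_special(text):
    # Split at special symbols and drop empty pieces.
    return [part for part in SPECIAL.split(text) if part]


print(split_on_special("你好@世界"))  # ['你好', '世界']
```

Each returned piece would then be passed through recognition and extraction on its own, so that a symbol inserted into the text cannot hide a match that lies entirely on one side of it.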
For specific limitations on the apparatus for identifying variant words in a text and extracting original sensitive words, reference may be made to the limitations on the corresponding method above; details are not repeated here. Each module in the apparatus can be implemented wholly or partly in software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a method for identifying variant words in the text and extracting original sensitive words. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the configuration shown in fig. 5 is a block diagram of only the part of the configuration relevant to the present application and does not limit the computer device to which the present application may be applied; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (8)
1. A method for identifying variant words in a text and extracting original sensitive words is characterized by comprising the following steps:
acquiring a text to be identified;
performing original word searching and matching on the text to be recognized by utilizing a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to a matching result to obtain a verification result; the verification result comprises that the text to be recognized contains variant words and the lengths of the variant words;
if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin and then performing traversal and character string comparison, and joining the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word using hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant words against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the pinyin of the original sensitive word in the variant words for segmentation, to obtain segmented variant words;
applying regular expression processing to the segmented variant words to obtain an array;
and extracting the original sensitive words from the array according to the positions of the original sensitive words.
2. The method of claim 1, wherein extracting the original sensitive word from the array according to the position of the original sensitive word comprises:
determining the position of the original sensitive word in the array according to the position at which the original sensitive word first appears in the pinyin of the variant word and the positions of the hyphens, and extracting the original sensitive word from the array using the length of the identified original sensitive word.
3. The method of claim 1, wherein performing original word searching and matching on the text to be recognized by utilizing a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to a matching result to obtain a verification result, comprises:
performing original word searching and matching on the text to be recognized by utilizing the pre-constructed sensitive word bank; if a sensitive word is found, outputting that the matching succeeded; and if the matching fails, performing homophone and variant word verification on the text to be recognized to obtain the verification result.
4. The method of claim 3, wherein performing homophone and variant word verification on the text to be recognized to obtain a verification result comprises:
converting the text to be recognized and the sensitive word bank into pinyin by means of ASCII codes and performing sensitive word matching; if the matching succeeds, extracting the original sensitive word from the text to be recognized in which each Chinese character has been converted into pinyin and the syllables are separated by hyphens;
if the matching fails, performing a character string search on the text to be recognized in which each Chinese character has been converted into pinyin, to obtain the variant words in the text to be recognized; a variant word is a mixed Chinese-English phrase containing a sensitive word.
5. The method of claim 4, wherein extracting the original sensitive word from the text to be recognized in which each Chinese character has been converted into pinyin and the syllables are separated by hyphens comprises:
calculating the position and the number of characters of the original sensitive word's pinyin in the pinyin-converted text to be recognized, and determining the position of the original sensitive word in the text to be recognized according to the position of its pinyin and the number of hyphens before that position;
extracting the original sensitive word using its position in the text to be recognized and its number of characters.
6. The method of claim 5, further comprising:
and if the text to be recognized comprises the special symbol, dividing the text to be recognized into a first text to be recognized and a second text to be recognized by taking the special symbol as a boundary, and respectively recognizing the sensitive words and extracting the original sensitive words.
7. An apparatus for identifying variant words in a text and extracting original sensitive words, the apparatus comprising:
the sensitive word verification module is used for acquiring a text to be recognized; performing original word searching and matching on the text to be recognized by utilizing a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to a matching result to obtain a verification result; the verification result comprises that the text to be recognized contains variant words and the lengths of the variant words;
a sensitive word traversal and comparison module, configured to, if the text to be recognized contains variant words, convert each Chinese character in the sensitive word bank and in the variant words into pinyin and then perform traversal and character string comparison, and join the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word using hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
an original sensitive word extraction module, configured to traverse the variant words against the sensitive word bank and compare character strings, and add spaces to the left and right of the pinyin of the original sensitive word in the variant words for segmentation, to obtain segmented variant words; apply regular expression processing to the segmented variant words to obtain an array; and extract the original sensitive word from the array according to the position of the original sensitive word.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210860492.6A CN115081440B (en) | 2022-07-22 | 2022-07-22 | Method, device and equipment for recognizing variant words in text and extracting original sensitive words |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115081440A (en) | 2022-09-20 |
CN115081440B (en) | 2022-11-01 |
Family
ID=83243778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210860492.6A Active CN115081440B (en) | 2022-07-22 | 2022-07-22 | Method, device and equipment for recognizing variant words in text and extracting original sensitive words |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115081440B (en) |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101047606A (en) * | 2006-03-28 | 2007-10-03 | 腾讯科技(深圳)有限公司 | Method for data transmission |
US20090132651A1 (en) * | 2007-11-15 | 2009-05-21 | Target Brands, Inc. | Sensitive Information Handling On a Collaboration System |
JP2011158947A (en) * | 2010-01-29 | 2011-08-18 | Casio Computer Co Ltd | Electronic apparatus and information display program |
US20160179774A1 (en) * | 2014-12-18 | 2016-06-23 | International Business Machines Corporation | Orthographic Error Correction Using Phonetic Transcription |
CN107463666A (en) * | 2017-08-02 | 2017-12-12 | 成都德尔塔信息科技有限公司 | A kind of filtering sensitive words method based on content of text |
US20190164539A1 (en) * | 2017-11-28 | 2019-05-30 | International Business Machines Corporation | Automatic blocking of sensitive data contained in an audio stream |
CN111259151A (en) * | 2020-01-20 | 2020-06-09 | 广州多益网络股份有限公司 | Method and device for recognizing mixed text sensitive word variants |
WO2021139268A1 (en) * | 2020-07-16 | 2021-07-15 | 平安科技(深圳)有限公司 | Sensitive word detection method and apparatus, computer device, and storage medium |
WO2022063133A1 (en) * | 2020-09-27 | 2022-03-31 | 深圳前海微众银行股份有限公司 | Sensitive information detection method and apparatus, and device and computer-readable storage medium |
CN112464667A (en) * | 2020-11-18 | 2021-03-09 | 北京华彬立成科技有限公司 | Text entity identification method and device, electronic equipment and storage medium |
CN113822059A (en) * | 2021-09-18 | 2021-12-21 | 北京云上曲率科技有限公司 | Chinese sensitive text recognition method and device, storage medium and equipment |
CN114118065A (en) * | 2021-10-28 | 2022-03-01 | 国网江苏省电力有限公司电力科学研究院 | Chinese text error correction method and device in electric power field, storage medium and computing equipment |
Non-Patent Citations (1)
Title |
---|
涂晴宇: "面向人机交互的语音情感识别与文本敏感词检测", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116996216A (en) * | 2023-09-25 | 2023-11-03 | 湖南马栏山视频先进技术研究院有限公司 | Data security processing method and system applied to artificial intelligent content generation |
CN116996216B (en) * | 2023-09-25 | 2023-12-01 | 湖南马栏山视频先进技术研究院有限公司 | Data security processing method and system applied to artificial intelligent content generation |
CN117725161A (en) * | 2023-12-21 | 2024-03-19 | 伟金投资有限公司 | Method and system for identifying variant words in text and extracting sensitive words |
CN117592473A (en) * | 2024-01-18 | 2024-02-23 | 武汉杏仁桉科技有限公司 | Harmonic splitting processing method and device for multiple Chinese phrases |
CN117592473B (en) * | 2024-01-18 | 2024-04-09 | 武汉杏仁桉科技有限公司 | Harmonic splitting processing method and device for multiple Chinese phrases |
CN117892724A (en) * | 2024-03-15 | 2024-04-16 | 成都赛力斯科技有限公司 | Text detection method, device, equipment and storage medium |
CN117892724B (en) * | 2024-03-15 | 2024-06-04 | 成都赛力斯科技有限公司 | Text detection method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN115081440B (en) | 2022-11-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||