CN115081440A - Method, device and equipment for recognizing variant words in text and extracting original sensitive words - Google Patents


Info

Publication number
CN115081440A
Authority
CN
China
Prior art keywords
words
text
variant
word
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210860492.6A
Other languages
Chinese (zh)
Other versions
CN115081440B (en)
Inventor
钟正阳
李一文
李顺
周渝雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Xiangsheng Network Information Co ltd
Original Assignee
Hunan Xiangsheng Network Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Xiangsheng Network Information Co ltd
Priority to CN202210860492.6A
Publication of CN115081440A
Application granted
Publication of CN115081440B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

The application relates to a method, a device and equipment for recognizing variant words in text and extracting original sensitive words. The method comprises: performing original word search and matching on a text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text according to the matching result; if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin, traversing and comparing the character strings, and connecting the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word; traversing the variant words against the sensitive word bank and comparing character strings, adding spaces to the left and right of the pinyin of the original sensitive word in the variant words to segment them, and processing the segmented variant words with a regular expression to obtain an array; and extracting the original sensitive word from the array according to its position. The method can improve the accuracy of extracting original sensitive words.

Description

Method, device and equipment for recognizing variant words in text and extracting original sensitive words
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a computer device, and a storage medium for identifying variant words in a text and extracting original sensitive words.
Background
With the development of internet technology, live-video bullet comments (barrage), community forum messages and in-app private messages have appeared. Text information such as user comments and private messages needs to be checked, and once illegal content submitted by a user is found, it is automatically reviewed and filtered in real time so as to ensure a good user experience of the product.
However, in known patents on sensitive word detection, matching verification is performed only against the original words in a sensitive word bank, and no technique detects pinyin, homophones or complex mixed Chinese-English input. This has prompted users of various internet applications to invent all kinds of variant vocabulary. For example, when the sensitive word is 'micro signal code', a user may enter not 'micro signal code' itself but a homophone such as 'heart only and ma'. If the user enters a variant of the type 'wei heart x ha horse', in which the sensitive word is split up or Chinese characters are mixed with pinyin, a traditional recognition scheme cannot recognize and extract it, so the accuracy of extracting the original sensitive word is low.
Disclosure of Invention
In view of the foregoing, there is a need to provide a method, an apparatus, a computer device, and a storage medium for identifying variant words in a text and extracting original sensitive words, which can improve the accuracy of extracting original sensitive words.
A method for identifying variant words in text and extracting original sensitive words, the method comprising:
acquiring a text to be recognized;
performing original word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises an indication that the text to be recognized contains variant words, and the lengths of the variant words;
if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin and then traversing and comparing character strings, and connecting the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant words against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the pinyin of the original sensitive word in the variant words to obtain segmented variant words;
processing the segmented variant words with a regular expression to obtain an array;
and extracting the original sensitive word from the array according to the position of the original sensitive word.
In one embodiment, extracting the original sensitive word from the array according to the position of the original sensitive word comprises:
determining the position of the original sensitive word in the array according to the position at which the original sensitive word first appears in the pinyin of the variant word and the positions of the hyphens, and extracting the original sensitive word from the array using the length of the recognized original sensitive word.
In one embodiment, performing original word search and matching on the text to be recognized using the pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result, comprises:
performing original word search and matching on the text to be recognized using the pre-constructed sensitive word bank; if a sensitive word is found, outputting that the matching succeeded; and if the matching fails, performing homophone and variant word verification on the text to be recognized to obtain the verification result.
In one embodiment, performing homophone and variant word verification on the text to be recognized to obtain a verification result comprises:
converting the text to be recognized and the sensitive word bank into pinyin through ASCII codes and performing sensitive word matching; if the matching succeeds, extracting the original sensitive word from the text to be recognized in which the pinyin of each Chinese character is separated by hyphens;
if the matching fails, performing a character string search on the text to be recognized after each Chinese character has been converted into pinyin, to obtain the variant words in the text to be recognized; a variant word is a mixed Chinese-English phrase that contains a sensitive word.
In one embodiment, extracting the original sensitive word from the text to be recognized in which the pinyin of each Chinese character is separated by hyphens comprises:
calculating the position of the pinyin of the original sensitive word in the pinyin-converted text to be recognized and the number of characters in the original sensitive word, and determining the position of the original sensitive word in the text to be recognized according to the position of its pinyin and the number of hyphens before that position;
and extracting the original sensitive word using its position in the text to be recognized and its number of characters.
In one embodiment, if the text to be recognized contains a special symbol, the text is divided at the special symbol into a first text to be recognized and a second text to be recognized, and sensitive word recognition and original sensitive word extraction are performed on the first and second texts separately.
An apparatus for identifying variant words in text and extracting original sensitive words, the apparatus comprising:
a sensitive word verification module, configured to acquire a text to be recognized; perform original word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and perform homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises an indication that the text to be recognized contains variant words, and the lengths of the variant words;
a sensitive word traversal and comparison module, configured to, if the text to be recognized contains variant words, convert each Chinese character in the sensitive word bank and in the variant words into pinyin and then traverse and compare character strings, and connect the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
an original sensitive word extraction module, configured to traverse the variant words against the sensitive word bank and compare character strings, and add spaces to the left and right of the pinyin of the original sensitive word in the variant words to obtain segmented variant words; process the segmented variant words with a regular expression to obtain an array; and extract the original sensitive word from the array according to the position of the original sensitive word.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a text to be recognized;
performing original word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises an indication that the text to be recognized contains variant words, and the lengths of the variant words;
if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin and then traversing and comparing character strings, and connecting the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant words against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the pinyin of the original sensitive word in the variant words to obtain segmented variant words;
processing the segmented variant words with a regular expression to obtain an array;
and extracting the original sensitive word from the array according to the position of the original sensitive word.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a text to be recognized;
performing original word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises an indication that the text to be recognized contains variant words, and the lengths of the variant words;
if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin and then traversing and comparing character strings, and connecting the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant words against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the pinyin of the original sensitive word in the variant words to obtain segmented variant words;
processing the segmented variant words with a regular expression to obtain an array;
and extracting the original sensitive word from the array according to the position of the original sensitive word.
In the method, apparatus, computer device and storage medium described above, a pre-constructed sensitive word bank is first used to perform original word search and matching on the text to be recognized, and homophone and variant word verification is performed on the text according to the matching result, so that mixed Chinese-English variant words containing sensitive words are recognized. Each Chinese character in the sensitive word bank and in the variant words is then converted into pinyin, the character strings are traversed and compared, and the pinyin corresponding to the Chinese characters in the variant words is connected with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word, the original sensitive word being the sensitive word contained in the variant word. The variant words are traversed against the sensitive word bank and compared as character strings, and spaces are added to the left and right of the pinyin of the original sensitive word in the variant words to obtain segmented variant words. The segmented variant words are processed with a regular expression to obtain an array, and the original sensitive word is extracted from the array according to its position. This overcomes the limitation that traditional approaches can detect only the original words and homophones of sensitive words: sensitive words written as a mixture of Chinese characters and pinyin letters can be matched, and in that scenario the original sensitive word in the text to be recognized can be extracted from the match.
Drawings
FIG. 1 is a flowchart illustrating a method for identifying variant words in a text and extracting original sensitive words according to an embodiment;
FIG. 2 is a schematic diagram of the overall flow of scheme matching of the present application in one embodiment;
FIG. 3 is a diagram illustrating variant word recognition and original sensitive word extraction performed in the present application, according to an embodiment;
FIG. 4 is a block diagram of an apparatus for identifying variant words in text and extracting original sensitive words according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a method for identifying a variant word in a text and extracting an original sensitive word is provided, which includes the following steps:
Step 102: acquiring a text to be recognized; performing original word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises an indication that the text to be recognized contains variant words, and the lengths of the variant words.
As shown in fig. 2, the pre-constructed sensitive word bank contains the Chinese text of every sensitive word to be recognized. The sensitive word bank is first used to perform original word matching on the text to be recognized. If the matching succeeds, the original sensitive word is output directly. If it fails, the text is checked for homophone Chinese: the text to be recognized and the pre-constructed sensitive word bank are both converted into pinyin through ASCII codes and matched, and a successful match indicates that the text to be recognized contains a homophone of a sensitive word. If this matching also fails, variant word recognition is performed: all Chinese characters in the text to be recognized and in the original sensitive words of the sensitive word bank are converted into pinyin through ASCII codes, with no separator between the pinyin of successive characters, and a character string search is performed on the converted pinyin. A hit indicates that a mixed Chinese-English variant word is present.
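The three-stage check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `PINYIN` table is a hypothetical toy mapping covering only the sample characters, the function names are invented for the example, and the padding hyphens used in stage 2 are an added guard against partial-syllable matches that the text does not describe.

```python
import re

# Hypothetical toy pinyin table (an assumption); real use needs a full mapping.
PINYIN = {"小": "xiao", "心": "xin", "你": "ni", "打": "da", "死": "si",
          "大": "da", "四": "si"}


def pinyin_tokens(text: str) -> list[str]:
    """Pinyin of each Chinese character; a run of letters stays one token."""
    tokens = re.findall(r"[\u4e00-\u9fff]|[A-Za-z]+", text)
    return [PINYIN.get(tok, tok.lower()) for tok in tokens]


def classify(text: str, lexicon: list[str]):
    """Return (stage, sensitive_word), or (None, None) when nothing matches."""
    # Stage 1: original word match on the raw text.
    for word in lexicon:
        if word in text:
            return "original", word
    # Stage 2: homophone match on hyphen-separated pinyin. The padding hyphens
    # (an added guard) stop a match from starting or ending inside a token.
    hyphenated = "-" + "-".join(pinyin_tokens(text)) + "-"
    for word in lexicon:
        if "-" + "-".join(pinyin_tokens(word)) + "-" in hyphenated:
            return "homophone", word
    # Stage 3: variant match on pinyin with no separators.
    joined = "".join(pinyin_tokens(text))
    for word in lexicon:
        if "".join(pinyin_tokens(word)) in joined:
            return "variant", word
    return None, None


lexicon = ["打死你"]  # sample sensitive word, pinyin da-si-ni
print(classify("小心打死你", lexicon))
print(classify("小心大四你", lexicon))
print(classify("小心laozidasi你", lexicon))
```

With this toy table the three calls fall through to the original, homophone and variant stages respectively. The sample inputs are illustrative strings chosen to match the pinyin used in the worked example, not data taken from the patent.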
Step 104: if the text to be recognized contains variant words, converting each Chinese character in the sensitive word bank and in the variant words into pinyin and then traversing and comparing character strings, and connecting the pinyin corresponding to the Chinese characters in the variant words with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word.
Step 106: traversing the variant words against the sensitive word bank and comparing character strings, and adding spaces to the left and right of the pinyin of the original sensitive word in the variant words to obtain segmented variant words; processing the segmented variant words with a regular expression to obtain an array; and extracting the original sensitive word from the array according to the position of the original sensitive word.
As shown in fig. 3, the text to be recognized is "小心laozidasi你" (pinyin xiao-xin-laozidasi-ni) and the sensitive word is "打死你" (pinyin da-si-ni); the sensitive words recognized by the present application are violent words, hence this example. Original word matching and homophone matching both fail, and variant word recognition succeeds. If the converted pinyin were segmented in the same way as for homophone extraction, the variant word would become xiao-xin-l-a-o-z-i-d-a-s-i-ni and extraction of the sensitive word would fail. Instead, the present application connects the pinyin corresponding to the Chinese characters in the variant word with the pinyin corresponding to the original sensitive word by hyphens, giving xiao-xin-laozi-da-si-ni, from which the position of the original sensitive word is obtained. The variant word is then traversed and compared against the sensitive word bank, and spaces are added to the left and right of the pinyin of the original sensitive word to segment it, giving "小心laozi da si 你". Regular expression processing then yields the array ["小", "心", "laozi", "da", "si", "你"]. In the hyphenated pinyin the original sensitive word starts at position 16, is preceded by 3 hyphens, and ends at position 24. Because three hyphens precede the start position, the original sensitive word occupies elements 4 to 6 of the array, and extraction yields "dasi你".
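The worked example above can be reproduced with a short Python sketch. Several things here are assumptions made for illustration: the Chinese characters in the sample input are reconstructed from the pinyin xiao-xin-laozi-da-si-ni, the `PINYIN` table is a hypothetical toy mapping, `extract_variant` is an invented name, and the spacing step uses a naive regular-expression substitution that would over-split inputs in which a sensitive-word syllable also occurs elsewhere in a letter run.

```python
import re

# Hypothetical toy table; only the characters used in the example are covered.
PINYIN = {"小": "xiao", "心": "xin", "你": "ni", "打": "da", "死": "si"}


def extract_variant(text: str, word: str):
    """Locate a sensitive word hidden in mixed Chinese/pinyin text."""
    syllables = [PINYIN[ch] for ch in word]            # e.g. ['da', 'si', 'ni']
    # Add spaces to the left and right of every sensitive-word syllable that
    # appears in the input (naive: any occurrence in a letter run is split).
    pattern = "|".join(re.escape(s) for s in syllables)
    spaced = re.sub(f"({pattern})", r" \1 ", text)
    # Regular expression step: single Chinese characters and letter runs.
    array = re.findall(r"[\u4e00-\u9fff]|[A-Za-z]+", spaced)
    # Hyphen-joined pinyin of the whole input.
    hyphenated = "-".join(PINYIN.get(tok, tok) for tok in array)
    target = "-".join(syllables)
    # Padding hyphens force the match to start and end on token boundaries.
    start = f"-{hyphenated}-".find(f"-{target}-")
    if start < 0:
        return None
    # The number of hyphens before the match is the 0-based array index.
    index = hyphenated[:start].count("-")
    hit = array[index:index + len(syllables)]
    return {
        "position": start + 1,        # 1-based, as counted in the text
        "hyphens_before": index,
        "array": array,
        "extracted": "".join(hit),
    }


result = extract_variant("小心laozidasi你", "打死你")
print(result)
```

For the sample input this reports position 16 with 3 hyphens before it, so the hit covers the 4th to 6th array elements, in line with the figures quoted in the text.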
In the method for identifying variant words in text and extracting original sensitive words described above, a pre-constructed sensitive word bank is first used to perform original word search and matching on the text to be recognized, and homophone and variant word verification is performed on the text according to the matching result, so that mixed Chinese-English variant words containing sensitive words are recognized. Each Chinese character in the sensitive word bank and in the variant words is then converted into pinyin, the character strings are traversed and compared, and the pinyin corresponding to the Chinese characters in the variant words is connected with the pinyin corresponding to the original sensitive word by hyphens to obtain the position of the original sensitive word, the original sensitive word being the sensitive word contained in the variant word. The variant words are traversed against the sensitive word bank and compared as character strings, and spaces are added to the left and right of the pinyin of the original sensitive word in the variant words to obtain segmented variant words. The segmented variant words are processed with a regular expression to obtain an array, and the original sensitive word is extracted from the array according to its position. This overcomes the limitation that traditional approaches can detect only the original words and homophones of sensitive words: sensitive words written as a mixture of Chinese characters and pinyin letters can be matched, and in that scenario the original sensitive word in the text to be recognized can be extracted from the match.
In one embodiment, extracting the original sensitive word from the array according to the position of the original sensitive word comprises:
determining the position of the original sensitive word in the array according to the position at which the original sensitive word first appears in the pinyin of the variant word and the positions of the hyphens, and extracting the original sensitive word from the array using the length of the recognized original sensitive word.
In one embodiment, performing original word search and matching on the text to be recognized using the pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result, comprises:
performing original word search and matching on the text to be recognized using the pre-constructed sensitive word bank; if a sensitive word is found, outputting that the matching succeeded; and if the matching fails, performing homophone and variant word verification on the text to be recognized to obtain the verification result.
In one embodiment, performing homophone and variant word verification on the text to be recognized to obtain a verification result comprises:
converting the text to be recognized and the sensitive word bank into pinyin through ASCII codes and performing sensitive word matching; if the matching succeeds, extracting the original sensitive word from the text to be recognized in which the pinyin of each Chinese character is separated by hyphens;
if the matching fails, performing a character string search on the text to be recognized after each Chinese character has been converted into pinyin, to obtain the variant words in the text to be recognized; a variant word is a mixed Chinese-English phrase that contains a sensitive word.
In one embodiment, extracting the original sensitive word from the text to be recognized in which the pinyin of each Chinese character is separated by hyphens comprises:
calculating the position of the pinyin of the original sensitive word in the pinyin-converted text to be recognized and the number of characters in the original sensitive word, and determining the position of the original sensitive word in the text to be recognized according to the position of its pinyin and the number of hyphens before that position;
and extracting the original sensitive word using its position in the text to be recognized and its number of characters.
In a specific embodiment, the input example is "加我微信吧" ("add me on WeChat"), whose pinyin is jia-wo-wei-xin-ba.
The example original sensitive word is "微信" (WeChat), whose pinyin is wei-xin.
The extraction proceeds as follows:
the pinyin of the original sensitive word appears at position 8 in the input pinyin, the original sensitive word has 2 characters, and two hyphens appear before position 8; the sensitive word is therefore the 3rd to 4th Chinese characters of the input: "微信" (WeChat).
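The position arithmetic in this example can be checked with a small Python sketch. It assumes the input consists only of Chinese characters, uses a hypothetical toy `PINYIN` table, and treats the Chinese sample strings as reconstructions from the pinyin jia-wo-wei-xin-ba and wei-xin; the function name is invented for the illustration.

```python
# Hypothetical toy table covering only the sample characters (an assumption).
PINYIN = {"加": "jia", "我": "wo", "微": "wei", "信": "xin", "吧": "ba",
          "威": "wei"}


def extract_homophone(text: str, word: str):
    """Return (position, hyphens_before, hit) or None.

    Assumes every character of `text` and `word` is in the PINYIN table.
    """
    hyphenated = "-".join(PINYIN[ch] for ch in text)   # jia-wo-wei-xin-ba
    target = "-".join(PINYIN[ch] for ch in word)       # wei-xin
    # Padding hyphens force the match onto whole syllables.
    start = f"-{hyphenated}-".find(f"-{target}-")
    if start < 0:
        return None
    # Each hyphen before the match corresponds to one preceding character.
    hyphens_before = hyphenated[:start].count("-")
    hit = text[hyphens_before:hyphens_before + len(word)]
    return start + 1, hyphens_before, hit              # position is 1-based


print(extract_homophone("加我微信吧", "微信"))
print(extract_homophone("加我威信吧", "微信"))
```

Both calls report position 8 with two hyphens before it. The second call shows the homophone case: the hit returned is the text as the user typed it ("威信"), while the lexicon entry that matched is "微信".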
In one embodiment, if the text to be recognized contains a special symbol, the text is divided at the special symbol into a first text to be recognized and a second text to be recognized, and sensitive word recognition and original sensitive word extraction are performed on the first and second texts separately.
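A minimal Python sketch of this splitting step is given below. The text does not define which characters count as special symbols, so the sketch assumes that anything other than a CJK ideograph, an ASCII letter or a digit is a special symbol, and it generalizes from two segments to any number of segments. The name `split_on_special` is hypothetical.

```python
import re

# Assumption: any character that is not a CJK ideograph, ASCII letter or digit
# is treated as a special symbol. The source does not define the symbol set.
SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def split_on_special(text: str) -> list[str]:
    """Split the text at special symbols and drop empty segments."""
    return [part for part in SPECIAL.split(text) if part]


# Each returned segment would then be recognized and extracted separately.
print(split_on_special("加我微信吧@小心laozi"))
```

Each segment returned by the function is then passed through recognition and extraction on its own.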
In one embodiment, as shown in fig. 3, the text to be recognized is "小心laozidasi你" and the sensitive word is "打死你"; sensitive word recognition and original sensitive word extraction proceed as follows:
Step 1: verify whether a sensitive word is present in the input.
Step 1.1: verify whether the original word is present.
Perform a character string search to match the user input against the source sensitive words; if a match exists, go to extraction step 2.1; otherwise continue with step 1.2.
Example result: the matching fails.
Step 1.2: verify whether homophone Chinese is present.
Step 1.2.1: convert all Chinese characters in the user input and in the source sensitive words to be matched into pinyin through ASCII codes, separating the pinyin of each Chinese character with a hyphen (-).
Step 1.2.2: perform a character string search on the converted pinyin; a match indicates that homophone Chinese is present, in which case go to extraction step 2.2; otherwise continue with step 1.3.
Example input pinyin: xiao-xin-laozidasi-ni;
Example source sensitive word pinyin: da-si-ni;
Example result: the matching fails.
Step 1.3: verify whether a mixed Chinese-English variant word is present.
Step 1.3.1: convert all Chinese characters in the user input and in the source sensitive words to be matched into pinyin through ASCII codes, with no separator between the pinyin of successive characters.
Step 1.3.2: perform a character string search on the converted pinyin; a match indicates that a mixed Chinese-English variant word is present, in which case go to step 2.3; no match indicates that the input contains no sensitive word, in which case go to step 3 and return the data.
Example input pinyin: xiaoxinlaozidasini;
Example source sensitive word pinyin: dasini;
Example result: the matching succeeds.
Step 2: extract the source sensitive word from the input.
Step 2.1: entered when verification passes at step 1.1. The sensitive word itself is present in the input, so no additional extraction is needed: the sensitive word is the source sensitive word. Go to data processing step 3.
Step 2.2: entered when verification passes at step 1.2. A homophone of the sensitive word is present in the input. Go to data processing step 3.
Step 2.3: entered when verification passes at step 1.3. A mixed Chinese-English variant word is present in the input.
and 2.3.1, processing the pinyin input by the user, traversing each pinyin of the original sensitive word, comparing and processing character strings of the pinyin input by the user, and connecting the pinyin of the original sensitive word appearing in the input by a middle-drawn line. (because step 1.3 succeeds, each pinyin of the original sensitive word must exist in the input pinyin, and the function of this step is that the computer can not identify whether the continuous letters are pinyin corresponding to Chinese, and can not know the specific segmentation position, so that the segmentation process needs to be performed throughout)
Inputting pinyin: xiao-xin-laozi-da-si-ni.
Step 2.3.2: process the Chinese text of the user input. Traverse each pinyin syllable of the original sensitive word, compare it against the user input string, and add a space to the left and right of every syllable of the original sensitive word that appears in the input.
Input Chinese: 小心laozi da si 你.
Step 2.3.3: process the input Chinese from step 2.3.2 with a regular expression that separates Chinese characters from runs of consecutive letters, and convert the result into an array to simplify the extraction that follows.
Input Chinese array: ["小", "心", "laozi", "da", "si", "你"].
Step 2.3.4: extract the sensitive word. First compute the position at which the pinyin of the original sensitive word first occurs in the input pinyin; in this example the first occurrence is at position 16 (counting characters from 1), and 3 hyphens precede it. The sensitive word therefore starts at the fourth element of the array, and its length equals that of the original sensitive word, three characters. The hit sensitive word is thus ["da", "si", "你"] in the Chinese array, which is joined into the string "dasi你". Then proceed to data processing in step 3.
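The segmentation and extraction in steps 2.3.1 to 2.3.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Chinese characters (小心laozidasi你 as input, 打死你 as the original sensitive word) are reconstructed from the example, the `PINYIN` table is a hypothetical stand-in for a real pinyin conversion library, and the helper names are invented for illustration. The substring replacement is deliberately naive and would mis-segment inputs in which a syllable also occurs inside an unrelated letter run.

```python
import re

# Hypothetical per-character pinyin table (assumption); a real system would
# use a complete pinyin conversion library.
PINYIN = {"小": "xiao", "心": "xin", "打": "da", "死": "si", "你": "ni"}
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")  # one Chinese char or a letter run


def segment(s, syllables, sep):
    """Wrap each syllable occurrence in `sep`, then collapse repeated separators."""
    for syl in dict.fromkeys(syllables):  # de-duplicate, keep order
        s = s.replace(syl, sep + syl + sep)
    return re.sub(re.escape(sep) + "+", sep, s).strip(sep)


def extract_variant(text, original_word):
    syllables = [PINYIN[ch] for ch in original_word]
    # Step 2.3.1: hyphen-joined input pinyin with the word's syllables split out.
    raw = "-".join(PINYIN.get(tok, tok) for tok in TOKEN.findall(text))
    input_pinyin = segment(raw, syllables, "-")
    # Step 2.3.2: the same splitting on the original text, using spaces.
    spaced = segment(text, syllables, " ")
    # Step 2.3.3: regular expression -> array of characters and letter runs.
    tokens = TOKEN.findall(spaced)
    # Step 2.3.4: the hyphens before the first occurrence give the array index.
    pos = input_pinyin.find("-".join(syllables))  # 0-based character offset
    start = input_pinyin[:pos].count("-")
    hit = tokens[start:start + len(original_word)]
    return input_pinyin, tokens, pos, "".join(hit)


input_pinyin, tokens, pos, hit = extract_variant("小心laozidasi你", "打死你")
print(input_pinyin)  # xiao-xin-laozi-da-si-ni
print(tokens)        # ['小', '心', 'laozi', 'da', 'si', '你']
print(pos, hit)      # 15 dasi你
```

The 0-based offset 15 returned by `str.find` corresponds to the 16th character of the input pinyin, and the three hyphens before it select array index 3, the fourth element.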
Step 3: data processing. The sensitive word appearing in the input, its pinyin, the original sensitive word and the pinyin of the original sensitive word obtained in step 2 are made available, and practitioners can process this data according to their own needs.
Compared with conventional methods, this method can greatly reduce the size of the sensitive word bank, because there is no need to configure every original sensitive word together with all of its variant forms. In addition, the data finally returned contains the sensitive word, the pinyin of the sensitive word, the original sensitive word, the pinyin of the original sensitive word and other data that a user of the technique may need, so the user can perform filtering, color marking, replacement and other processing as required.
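As an illustration of the data handed to step 3, the following sketch bundles the four returned fields in a record and applies one possible downstream action, replacement with asterisks. The `Hit` class, its field names and the `mask` helper are hypothetical; the patent leaves the processing to the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for the data returned by step 2."""
    word: str             # sensitive word as it appears in the input
    word_pinyin: str      # its pinyin
    original: str         # original sensitive word from the word bank
    original_pinyin: str  # pinyin of the original sensitive word


def mask(text, hit, fill="*"):
    """Replace every occurrence of the hit in the text with fill characters."""
    return text.replace(hit.word, fill * len(hit.word))


example_hit = Hit(word="dasi你", word_pinyin="da-si-ni",
                  original="打死你", original_pinyin="da-si-ni")
print(mask("小心laozidasi你", example_hit))  # 小心laozi*****
```

Color marking or filtering would use the same record; only the function applied to the matched span changes.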
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, an apparatus for identifying variant words in a text and extracting original sensitive words is provided, including a sensitive word verification module 402, a sensitive word traversal and comparison module 404, and an original sensitive word extraction module 406, wherein:
the sensitive word verification module 402 is configured to acquire a text to be recognized; perform original word search and matching on the text to be recognized using a pre-constructed sensitive word bank; and perform homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises that the text to be recognized contains variant words, and the lengths of the variant words;
the sensitive word traversal and comparison module 404 is configured to, if the text to be recognized contains a variant word, convert each Chinese character in the sensitive word bank and in the variant word into pinyin, perform traversal and string comparison, and connect the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
the original sensitive word extraction module 406 is configured to traverse the variant word against the sensitive word bank and perform string comparison, adding spaces to segment the pinyin of the original sensitive word in the variant word to obtain a segmented variant word; process the segmented variant word with a regular expression to obtain an array; and extract the original sensitive word from the array according to the position of the original sensitive word.
In one embodiment, the original sensitive word extraction module 406 is further configured to determine the position of the original sensitive word in the array according to the position at which the pinyin of the original sensitive word first occurs in the variant word pinyin and the positions of the hyphens, and to extract the original sensitive word from the array using the length of the identified original sensitive word.
In one embodiment, the sensitive word verification module 402 is further configured to perform original word search and matching on the text to be recognized using the pre-constructed sensitive word bank; if a sensitive word is found, output that the matching succeeded; and if the matching fails, perform homophone and variant word verification on the text to be recognized to obtain the verification result.
In one embodiment, the sensitive word verification module 402 is further configured to convert the text to be recognized and the sensitive word bank into pinyin via ASCII codes and perform sensitive word matching; if the matching succeeds, extract the original sensitive word, by means of hyphen segmentation, from the text to be recognized in which each Chinese character has been converted into pinyin;
if the matching fails, perform a string search on the text to be recognized in which each Chinese character has been converted into pinyin, to obtain the variant words in the text to be recognized; a variant word is a mixed Chinese-English phrase containing a sensitive word.
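The staged check described above (exact match, then homophone match, then mixed Chinese-English variant match) might be sketched as follows. How the homophone and variant stages differ in detail is an interpretation here: the sketch treats a homophone as a run whose per-character pinyin equals that of a sensitive word, and a variant as a match that only appears once syllable separators are ignored. `PINYIN`, `WORD_BANK`, `syllable_string` and `classify` are hypothetical names.

```python
import re

# Hypothetical pinyin table and word bank (assumptions for illustration).
PINYIN = {"小": "xiao", "心": "xin", "打": "da", "死": "si", "你": "ni",
          "大": "da", "四": "si"}
WORD_BANK = ["打死你"]
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")


def syllable_string(text):
    """Hyphen-joined pinyin: each Chinese character is one syllable, each run
    of Latin letters is kept as a single unsegmented token."""
    return "-".join(PINYIN.get(t, t.lower()) for t in TOKEN.findall(text))


def classify(text):
    """Return (kind, original_word), or (None, None) if nothing matches."""
    for word in WORD_BANK:           # stage 1: original word present verbatim
        if word in text:
            return "exact", word
    text_py = "-" + syllable_string(text) + "-"
    for word in WORD_BANK:           # stage 2: same syllables, other characters
        if "-" + syllable_string(word) + "-" in text_py:
            return "homophone", word
    flat = text_py.replace("-", "")
    for word in WORD_BANK:           # stage 3: pinyin hidden inside letter runs
        if syllable_string(word).replace("-", "") in flat:
            return "variant", word
    return None, None


print(classify("小心打死你"))       # ('exact', '打死你')
print(classify("小心大四你"))       # ('homophone', '打死你')
print(classify("小心laozidasi你"))  # ('variant', '打死你')
print(classify("小心"))             # (None, None)
```

Ordering the stages from cheapest to most permissive means the later, looser comparisons only run when the stricter ones have failed.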
In one embodiment, the sensitive word verification module 402 is further configured to compute the position and the number of characters of the pinyin of the original sensitive word in the pinyin-converted text to be recognized, and determine the position of the original sensitive word in the text to be recognized from the position of its pinyin and the number of hyphens preceding that position;
and extract the original sensitive word using its position in the text to be recognized and its number of characters.
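For the homophone case, the position calculation can be sketched as below. Because every Chinese character contributes exactly one hyphen-separated syllable, the number of hyphens before the match equals the character index in the text. The table and the function name are hypothetical, and the sketch assumes the text consists only of Chinese characters present in the table.

```python
# Hypothetical pinyin table (assumption for illustration).
PINYIN = {"小": "xiao", "心": "xin", "打": "da", "死": "si", "你": "ni",
          "大": "da", "四": "si"}


def extract_homophone(text, original_word):
    """Return the substring of `text` whose pinyin matches `original_word`, or None."""
    # Pad with hyphens so that only whole syllables can match.
    padded = "-" + "-".join(PINYIN[ch] for ch in text) + "-"
    target = "-" + "-".join(PINYIN[ch] for ch in original_word) + "-"
    pos = padded.find(target)
    if pos < 0:
        return None
    start = padded[:pos].count("-")  # one hyphen per preceding character
    return text[start:start + len(original_word)]


print(extract_homophone("小心大四你", "打死你"))  # 大四你
```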
In one embodiment, if the text to be recognized includes a special symbol, the text is divided at the special symbol into a first text to be recognized and a second text to be recognized, and sensitive word recognition and original sensitive word extraction are performed on each of them separately.
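Splitting on special symbols could look like the sketch below, which treats any character that is neither a Chinese character nor a Latin letter as a special symbol. That definition and the function name are assumptions; the patent does not enumerate the symbols. Each resulting segment would then go through recognition and extraction separately.

```python
import re

# Assumption: a "special symbol" is anything that is not a CJK unified
# ideograph (U+4E00..U+9FFF) or an ASCII letter.
SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z]+")


def split_on_special(text):
    """Split text at runs of special symbols and drop empty segments."""
    return [part for part in SPECIAL.split(text) if part]


print(split_on_special("小心laozi!dasi你"))  # ['小心laozi', 'dasi你']
```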
For specific limitations of the apparatus for identifying variant words in a text and extracting original sensitive words, reference may be made to the limitations of the corresponding method above, which are not repeated here. Each module in the apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a method for identifying variant words in the text and extracting original sensitive words. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 5 is a block diagram of only the part of the structure relevant to the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (8)

1. A method for identifying variant words in a text and extracting original sensitive words, characterized by comprising:
acquiring a text to be recognized;
performing original word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises that the text to be recognized contains variant words, and the lengths of the variant words;
if the text to be recognized contains a variant word, converting each Chinese character in the sensitive word bank and in the variant word into pinyin, performing traversal and string comparison, and connecting the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
traversing the variant word against the sensitive word bank and performing string comparison, and adding spaces to the left and right of the pinyin of the original sensitive word in the variant word to obtain a segmented variant word;
processing the segmented variant word with a regular expression to obtain an array;
and extracting the original sensitive word from the array according to the position of the original sensitive word.
2. The method of claim 1, wherein extracting the original sensitive word from the array according to the position of the original sensitive word comprises:
determining the position of the original sensitive word in the array according to the position at which the pinyin of the original sensitive word first occurs in the variant word pinyin and the positions of the hyphens, and extracting the original sensitive word from the array using the length of the identified original sensitive word.
3. The method of claim 1, wherein performing original word search and matching on the text to be recognized using a pre-constructed sensitive word bank, and performing homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result, comprises:
performing original word search and matching on the text to be recognized using the pre-constructed sensitive word bank; if a sensitive word is found, outputting that the matching succeeded; and if the matching fails, performing homophone and variant word verification on the text to be recognized to obtain the verification result.
4. The method of claim 3, wherein performing homophone and variant word verification on the text to be recognized to obtain a verification result comprises:
converting the text to be recognized and the sensitive word bank into pinyin via ASCII codes and performing sensitive word matching; if the matching succeeds, extracting the original sensitive word, by means of hyphen segmentation, from the text to be recognized in which each Chinese character has been converted into pinyin;
if the matching fails, performing a string search on the text to be recognized in which each Chinese character has been converted into pinyin, to obtain the variant words in the text to be recognized; a variant word is a mixed Chinese-English phrase containing a sensitive word.
5. The method of claim 4, wherein extracting the original sensitive word, by means of hyphen segmentation, from the text to be recognized in which each Chinese character has been converted into pinyin comprises:
computing the position and the number of characters of the pinyin of the original sensitive word in the pinyin-converted text to be recognized, and determining the position of the original sensitive word in the text to be recognized from the position of its pinyin and the number of hyphens preceding that position;
and extracting the original sensitive word using its position in the text to be recognized and its number of characters.
6. The method of claim 5, further comprising:
if the text to be recognized includes a special symbol, dividing the text to be recognized at the special symbol into a first text to be recognized and a second text to be recognized, and performing sensitive word recognition and original sensitive word extraction on each of them separately.
7. An apparatus for identifying variant words in a text and extracting original sensitive words, the apparatus comprising:
a sensitive word verification module, configured to acquire a text to be recognized; perform original word search and matching on the text to be recognized using a pre-constructed sensitive word bank; and perform homophone and variant word verification on the text to be recognized according to the matching result to obtain a verification result; the verification result comprises that the text to be recognized contains variant words, and the lengths of the variant words;
a sensitive word traversal and comparison module, configured to, if the text to be recognized contains a variant word, convert each Chinese character in the sensitive word bank and in the variant word into pinyin, perform traversal and string comparison, and connect the pinyin corresponding to the Chinese characters in the variant word and the pinyin corresponding to the original sensitive word with hyphens to obtain the position of the original sensitive word; the original sensitive word is the sensitive word contained in the variant word;
an original sensitive word extraction module, configured to traverse the variant word against the sensitive word bank and perform string comparison, adding spaces to segment the pinyin of the original sensitive word in the variant word to obtain a segmented variant word; process the segmented variant word with a regular expression to obtain an array; and extract the original sensitive word from the array according to the position of the original sensitive word.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
CN202210860492.6A 2022-07-22 2022-07-22 Method, device and equipment for recognizing variant words in text and extracting original sensitive words Active CN115081440B (en)


Publications (2)

Publication Number Publication Date
CN115081440A true CN115081440A (en) 2022-09-20
CN115081440B CN115081440B (en) 2022-11-01

Family

ID=83243778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210860492.6A Active CN115081440B (en) 2022-07-22 2022-07-22 Method, device and equipment for recognizing variant words in text and extracting original sensitive words

Country Status (1)

Country Link
CN (1) CN115081440B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101047606A (en) * 2006-03-28 2007-10-03 腾讯科技(深圳)有限公司 Method for data transmission
US20090132651A1 (en) * 2007-11-15 2009-05-21 Target Brands, Inc. Sensitive Information Handling On a Collaboration System
JP2011158947A (en) * 2010-01-29 2011-08-18 Casio Computer Co Ltd Electronic apparatus and information display program
US20160179774A1 (en) * 2014-12-18 2016-06-23 International Business Machines Corporation Orthographic Error Correction Using Phonetic Transcription
CN107463666A (en) * 2017-08-02 2017-12-12 成都德尔塔信息科技有限公司 A kind of filtering sensitive words method based on content of text
US20190164539A1 (en) * 2017-11-28 2019-05-30 International Business Machines Corporation Automatic blocking of sensitive data contained in an audio stream
CN111259151A (en) * 2020-01-20 2020-06-09 广州多益网络股份有限公司 Method and device for recognizing mixed text sensitive word variants
CN112464667A (en) * 2020-11-18 2021-03-09 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
WO2021139268A1 (en) * 2020-07-16 2021-07-15 平安科技(深圳)有限公司 Sensitive word detection method and apparatus, computer device, and storage medium
CN113822059A (en) * 2021-09-18 2021-12-21 北京云上曲率科技有限公司 Chinese sensitive text recognition method and device, storage medium and equipment
CN114118065A (en) * 2021-10-28 2022-03-01 国网江苏省电力有限公司电力科学研究院 Chinese text error correction method and device in electric power field, storage medium and computing equipment
WO2022063133A1 (en) * 2020-09-27 2022-03-31 深圳前海微众银行股份有限公司 Sensitive information detection method and apparatus, and device and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
涂晴宇: "面向人机交互的语音情感识别与文本敏感词检测", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116996216A (en) * 2023-09-25 2023-11-03 湖南马栏山视频先进技术研究院有限公司 Data security processing method and system applied to artificial intelligent content generation
CN116996216B (en) * 2023-09-25 2023-12-01 湖南马栏山视频先进技术研究院有限公司 Data security processing method and system applied to artificial intelligent content generation
CN117725161A (en) * 2023-12-21 2024-03-19 伟金投资有限公司 Method and system for identifying variant words in text and extracting sensitive words
CN117592473A (en) * 2024-01-18 2024-02-23 武汉杏仁桉科技有限公司 Harmonic splitting processing method and device for multiple Chinese phrases
CN117592473B (en) * 2024-01-18 2024-04-09 武汉杏仁桉科技有限公司 Harmonic splitting processing method and device for multiple Chinese phrases
CN117892724A (en) * 2024-03-15 2024-04-16 成都赛力斯科技有限公司 Text detection method, device, equipment and storage medium
CN117892724B (en) * 2024-03-15 2024-06-04 成都赛力斯科技有限公司 Text detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115081440B (en) 2022-11-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant