CN110457428B - Sensitive word detection and filtering method and device and electronic equipment - Google Patents


Publication number
CN110457428B
Authority
CN
China
Prior art keywords
text
sensitive
poetry
sentence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910561689.8A
Other languages
Chinese (zh)
Other versions
CN110457428A (en)
Inventor
游福成
王少梅
赵帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Graphic Communication
Original Assignee
Beijing Institute of Graphic Communication
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Graphic Communication
Priority to CN201910561689.8A
Publication of CN110457428A
Application granted
Publication of CN110457428B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sensitive word detection and filtering method, a sensitive word detection and filtering device, and electronic equipment, which can accurately and efficiently detect and filter sensitive words hidden in acrostic poems (藏头诗, literally "hidden-head poems"). The method comprises the following steps: constructing a dynamic sensitive word stock; obtaining a text to be tested and preprocessing it to obtain a text to be screened; judging the text to be screened according to poetry form characteristics and screening out poetry sentence segments; extracting a target sentence segment from each poetry sentence segment according to the form characteristics of acrostic poems, and obtaining keywords through word segmentation; and matching the keywords against the dynamic sensitive word stock, determining a sensitivity value according to the matching result, and filtering the keywords according to the sensitivity value. The device comprises a sensitive word stock module, a preprocessing module, a poetry sentence segment module, a target sentence segment module, a word segmentation module, a matching detection module and a filtering module. The electronic equipment includes a memory, a processor, and a computer program stored on the memory and executable on the processor.

Description

Sensitive word detection and filtering method and device and electronic equipment
Technical Field
The invention relates to the field of network information security, in particular to a sensitive word detection and filtration method, a device and electronic equipment.
Background
In the self-media era, information published by self-media outlets and in user comments spreads quickly, widely and with great influence, owing to the openness and reach of self-media networks. Sensitive information inevitably arises in this process, and once it spreads, public opinion is hard to control and the negative impact is large. The best approach is to detect and filter sensitive information before the content containing it is released, so that the negative influence on public opinion is nipped in the bud.
By analyzing existing sensitive word detection and filtering methods, the inventors found at least the following problems in the prior art:
Because of the complexity of the Chinese language, sensitive words in certain special forms are difficult to detect. One such form is a sensitive word hidden in an acrostic poem (藏头诗). The prior art usually relies on manual inspection to detect such words, which is inefficient and unreliable. Existing sensitive word detection and filtering work focuses mainly on updating detection techniques, and existing methods cannot detect sensitive words hidden in acrostic poems.
Disclosure of Invention
Therefore, the invention aims to provide a sensitive word detection and filtering method, a device and electronic equipment capable of accurately and efficiently detecting and filtering sensitive words hidden in acrostic poems.
Based on the above object, the present invention provides a method for detecting and filtering sensitive words, which is characterized by comprising:
constructing a periodically updated dynamic sensitive word stock;
obtaining a text to be tested, and preprocessing the text to be tested to obtain a text to be screened;
judging the text to be screened according to the poetry form characteristics, and screening to obtain poetry sentence fragments according to the judging result;
extracting a target sentence segment from the poetry sentence segment according to the form characteristics of acrostic poems;
word segmentation is carried out on the target sentence segment, so that a plurality of keywords forming the target sentence segment are obtained;
matching and detecting a plurality of keywords with the dynamic sensitive word stock;
and calculating a sensitivity value of the text to be detected according to the matching detection result, and filtering the keywords in the text to be detected according to the sensitivity value.
Optionally, the constructing a periodically updated dynamic sensitive word stock includes:
collecting sensitive words in a current network, determining sensitive level factors of the sensitive words, and recording the sensitive words and the corresponding sensitive level factors into the dynamic sensitive word stock;
setting an update period and, at each update, adding to the dynamic sensitive word stock the new sensitive words found in the network at the update time, determining the sensitivity level factors of the new sensitive words, and recording those factors in the dynamic sensitive word stock;
where the first character of a sensitive word is a Chinese character, classifying the sensitive word according to the pinyin initial letter of that character;
where a sensitive word begins with pinyin or an English word, classifying the sensitive word according to its first letter.
Optionally, the preprocessing the text to be tested to obtain the text to be screened includes:
determining a topic of the text to be tested according to its semantic content and data source, classifying the text to be tested according to the topic, and adding a classification mark to the text to be tested;
and removing meaningless marks and connecting characters from the text to be tested to obtain the text to be screened, wherein the meaningless marks comprise HTML tags and comments, and the connecting characters comprise modal particles, special symbols and numbers.
Optionally, the determining the text to be screened according to the poetry form feature, and screening to obtain the poetry sentence segment according to the determination result, includes:
performing clause splitting on the text to be screened according to its punctuation marks by using a semantic analysis technology;
if at least four consecutive sentences of the same length appear, cutting out those consecutive sentences as an equal-length sentence segment;
judging, according to the pinyin information of the characters in the equal-length sentence segment, whether the segment conforms to the level and oblique tone (平仄) rules and the rhyme rules of poetry, and taking the equal-length sentence segments that conform as the poetry sentence segments.
Optionally, the extracting the target sentence segment from the poetry sentence segment according to the form characteristics of acrostic poems includes:
extracting the first character of each sentence in the poetry sentence segment and connecting them in order, to obtain the target sentence segment corresponding to a head-hiding acrostic poem;
extracting the last character of each sentence in the poetry sentence segment and connecting them in order, to obtain the target sentence segment corresponding to a tail-hiding acrostic poem;
extracting the middle character of each sentence in the poetry sentence segment and connecting them in order, to obtain the target sentence segment corresponding to a middle-hiding acrostic poem;
extracting the first and last characters of the first sentence and the first and last characters of the last sentence in the poetry sentence segment and connecting them in order, to obtain the target sentence segment corresponding to a corner-hiding acrostic poem;
and arranging the sentences of the poetry sentence segment in rows, extracting the characters on the diagonal and connecting them in order, to obtain the target sentence segment corresponding to a slanted-ladder (diagonal) acrostic poem.
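The five extraction rules above can be illustrated with a minimal Python sketch. The function name `extract_target_segments`, the dictionary keys, and the choice of index `len(s) // 2` for the "middle character" are assumptions made for illustration; the text does not specify them. The sample is an ordinary four-line, five-character poem used only to show which characters each rule selects.

```python
from typing import Dict, List


def extract_target_segments(lines: List[str]) -> Dict[str, str]:
    """Build one candidate target segment per acrostic type.

    `lines` holds the equal-length sentences of one poetry segment,
    with punctuation already removed.
    """
    if not lines or any(len(s) == 0 for s in lines):
        raise ValueError("expected non-empty sentences")
    first, last = lines[0], lines[-1]
    return {
        # first character of every sentence
        "head": "".join(s[0] for s in lines),
        # last character of every sentence
        "tail": "".join(s[-1] for s in lines),
        # middle character of every sentence (assumption: index len // 2)
        "middle": "".join(s[len(s) // 2] for s in lines),
        # first and last characters of the first and last sentences
        "corner": first[0] + first[-1] + last[0] + last[-1],
        # sentence i contributes its i-th character (main diagonal)
        "diagonal": "".join(s[i] for i, s in enumerate(lines) if i < len(s)),
    }


poem = ["白日依山尽", "黄河入海流", "欲穷千里目", "更上一层楼"]
segments = extract_target_segments(poem)
```

Each of the five strings would then be passed to word segmentation and matching, since the type of acrostic is not known in advance.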
Optionally, word segmentation is performed on the target sentence segment to obtain a plurality of keywords forming the target sentence segment, including:
performing word segmentation on the target sentence segment by adopting a dictionary-based Chinese word segmentation technology to obtain a plurality of keywords used for forming the target sentence segment; the dictionary used in the Chinese word segmentation technology comprises all sensitive words in the dynamic sensitive word stock.
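The text does not say which dictionary-based segmentation algorithm is used. As one possible illustration, the sketch below uses forward maximum matching, a common dictionary-based technique; the function name and the toy dictionary of Latin-letter strings are hypothetical stand-ins for a dictionary that contains every entry of the dynamic sensitive word stock.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Segment `text` greedily, preferring the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # try the longest window first, shrink until a dictionary hit;
        # a single character is always accepted as a fallback
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


demo_dictionary = {"ab", "abc", "de"}  # hypothetical entries
tokens = forward_max_match("abcdex", demo_dictionary)
```

Because the dictionary includes all sensitive words, a sensitive word embedded in a target sentence segment tends to come out as a single keyword rather than being split into pieces.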
Optionally, the matching detection of the plurality of keywords with the dynamic sensitive word stock includes:
selecting corresponding sensitive word classification from the dynamic sensitive word stock according to the pinyin initial of the first character of the keyword;
starting from the first character of the keyword, screening out, within the corresponding sensitive word classification, the sensitive words that match the first character of the keyword, and then continuing to screen, from the sensitive words already screened out, those that match the next character of the keyword, until the sensitive words that match the last character of the keyword are screened out;
detecting whether the sensitive words screened out as matching the last character of the keyword include a sensitive word with the same number of characters as the keyword, and if so, determining that the keyword is a sensitive word.
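The character-by-character narrowing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the bucket key (`initial`) is supplied by the caller, because deriving a pinyin initial from a Chinese character needs a pinyin lookup that is outside this sketch, and the bucket contents are hypothetical placeholder words.

```python
from typing import Dict, List


def is_sensitive(keyword: str, buckets: Dict[str, List[str]], initial: str) -> bool:
    """Return True if `keyword` equals some word in the bucket for `initial`."""
    candidates = buckets.get(initial, [])
    for pos, ch in enumerate(keyword):
        # keep only the words that still agree with the keyword at this position
        candidates = [w for w in candidates if len(w) > pos and w[pos] == ch]
        if not candidates:
            return False
    # a surviving word with the same number of characters is an exact match
    return any(len(w) == len(keyword) for w in candidates)


# Hypothetical buckets keyed by first letter.
buckets = {"b": ["bad", "badge", "ban"], "s": ["spam"]}
```

The final length check matters: a keyword that is only a prefix of a stocked word survives every narrowing step but is not reported as sensitive.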
Optionally, the calculating the sensitivity value of the text to be detected according to the matching detection result, and filtering the keyword in the text to be detected according to the sensitivity value includes:
calculating word frequencies of all the keywords which are detected in the text to be detected and are sensitive words, and determining word frequency factors of the keywords according to the word frequencies of the keywords:
$$W_{i,j} = \frac{w_{i,j}}{\sum_{x} w_{x,j}}$$
wherein $W_{i,j}$ represents the word frequency of the keyword $i$ in the text $j$ to be tested, $w_{i,j}$ represents the number of occurrences of the keyword $i$ in the text $j$ to be tested, and $\sum_{x} w_{x,j}$ represents the total number of keywords in the text $j$ to be tested;
the word frequency factors of the keywords are as follows:
Figure BDA0002108449450000042
wherein $wf_i$ represents the word frequency factor of the keyword $i$;
determining the theme factors of the keywords according to the classification marks of the text to be tested;
determining the position factor of each keyword according to the position, in the text to be tested, of the poetry sentence segment containing the keyword;
calculating a sensitivity weight of the keyword according to the word frequency factor, the topic factor, the position factor and the sensitivity level factor of the sensitive word matched by the keyword, and calculating a sensitivity value of the text to be tested according to the sensitivity weights of the keywords:
$$\mathrm{value}_i = \alpha \times wf_i + \beta \times pos_i + \lambda \times lev_i + \theta \times top_i$$
wherein $\mathrm{value}_i$ represents the sensitivity weight of the keyword $i$, $wf_i$ represents the word frequency factor of the keyword $i$, $pos_i$ represents the position factor of the keyword $i$, $lev_i$ represents the sensitivity level factor of the sensitive word matched by the keyword $i$, and $top_i$ represents the topic factor of the keyword $i$; $\alpha$, $\beta$, $\lambda$ and $\theta$ are respectively the word frequency, position, sensitivity level and topic adjusting parameters of the keyword $i$;
the sensitivity value of the text to be tested is as follows:
Figure BDA0002108449450000043
wherein V represents the sensitivity value of the text to be tested, and k represents the total number of the keywords which are screened out from the text to be tested and are all sensitive words;
comparing the sensitivity value $V$ of the text to be tested with a text sensitivity threshold $\mu$;
if $V \ge \mu$, filtering and masking, one by one, all the keywords detected in the text to be tested that are sensitive words;
if $V < \mu$, comparing the word frequency $W_{i,j}$ of each keyword detected in the text to be tested that is a sensitive word with a sensitivity frequency threshold $l$, and filtering and masking the keywords whose word frequency $W_{i,j}$ is smaller than the sensitivity frequency threshold $l$.
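The scoring and decision steps can be sketched in Python as below. This is a hedged illustration, not the patented implementation: the formulas for the word frequency factor and for aggregating the weights into the text sensitivity value are given only as figures in the source, so the sketch takes both as inputs. All function and parameter names are hypothetical, and the comparison direction in the low-sensitivity branch follows the wording of the text as written.

```python
from typing import Dict, List


def word_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Occurrences of each keyword divided by all keyword occurrences."""
    total = sum(counts.values())
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


def sensitivity_weight(wf: float, pos: float, lev: float, top: float,
                       alpha: float, beta: float, lam: float, theta: float) -> float:
    """Weighted sum of the four factors of one keyword."""
    return alpha * wf + beta * pos + lam * lev + theta * top


def keywords_to_mask(v: float, mu: float, freqs: Dict[str, float],
                     freq_threshold: float) -> List[str]:
    """Apply the two-branch rule to the detected sensitive keywords."""
    if v >= mu:
        return sorted(freqs)  # mask every detected sensitive keyword
    # below the text threshold: as worded in the text, mask the keywords
    # whose word frequency is smaller than the frequency threshold
    return sorted(k for k, w in freqs.items() if w < freq_threshold)


counts = {"x": 2, "y": 1, "z": 1}  # hypothetical keyword counts
freqs = word_frequencies(counts)
weight = sensitivity_weight(1.0, 2.0, 3.0, 2.0, 0.25, 0.25, 0.25, 0.25)
```

Keeping the four adjusting parameters as explicit arguments makes it easy to tune how much each factor contributes without touching the decision rule.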
Based on the above object, the present invention further provides a sensitive word detection and filtration device, which is characterized by comprising:
The sensitive word stock module is configured to construct a dynamic sensitive word stock which is updated periodically;
the preprocessing module is configured to acquire a text to be detected, and preprocess the text to be detected to obtain a text to be screened;
the poem sentence segment module is configured to judge the text to be screened according to the poem form characteristics, and screen the poem sentence segment according to the judging result;
the target sentence segment module is configured to extract a target sentence segment from the poetry sentence segment according to the form characteristics of acrostic poems;
the word segmentation module is configured to segment the target sentence segment to obtain a plurality of keywords forming the target sentence segment;
the matching detection module is configured to perform matching detection on the keywords and the dynamic sensitive word stock;
and the filtering module is configured to calculate a sensitivity value of the text to be detected according to the matching detection result, and filter the keywords in the text to be detected according to the sensitivity value.
Based on the above object, the present invention further provides an electronic device for detecting and filtering sensitive words, where the electronic device includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, and the processor implements the method for detecting and filtering sensitive words when executing the program.
From the above, it can be seen that the sensitive word detection and filtering method, device and electronic equipment provided by the invention screen poetry sentence segments from the text to be tested according to poetry form characteristics, extract from them, according to the form characteristics of acrostic poems, the target sentence segments that may contain sensitive words, and detect sensitive words in the target sentence segments using the periodically updated dynamic sensitive word stock, thereby detecting sensitive words hidden in acrostic poems. Using a periodically updated dynamic sensitive word stock as the detection basis means that newly appearing sensitive words are recorded in real time, so target sentence segments are checked more completely, omissions are avoided even though sensitive words keep changing in practice, and the method is more robust. Classifying the text to be tested by topic and removing redundancy before the poetry sentence segments are screened reduces misjudgment, avoids manual interference and improves the accuracy of detection and filtering. All sensitive words are classified when the dynamic sensitive word stock is constructed; during matching detection, the candidate data range is narrowed step by step according to the keywords obtained by segmenting the target sentence segment before matching is performed; the sensitivity value of the text to be tested is then calculated from the matching result, and different handling is applied depending on the sensitivity value, which can greatly improve the speed of detection and filtering.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a method for detecting and filtering sensitive words according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a method for screening poetry segments in a method for detecting and filtering sensitive words according to an embodiment of the present invention;
FIG. 3-a is a schematic diagram of a head-hiding five-character acrostic poem;
FIG. 3-b is a schematic diagram of a head-hiding seven-character acrostic poem;
FIG. 4-a is a schematic diagram of a tail-hiding five-character acrostic poem;
FIG. 4-b is a schematic diagram of a tail-hiding seven-character acrostic poem;
FIG. 5-a is a schematic diagram of a middle-hiding five-character acrostic poem;
FIG. 5-b is a schematic diagram of a middle-hiding seven-character acrostic poem;
FIG. 6-a is a schematic diagram of a corner-hiding five-character acrostic poem;
FIG. 6-b is a schematic diagram of a corner-hiding seven-character acrostic poem;
FIG. 7-a is a schematic diagram of a slanted-ladder (diagonal) five-character acrostic poem;
FIG. 7-b is a schematic diagram of a slanted-ladder (diagonal) seven-character acrostic poem;
FIG. 8 is a schematic diagram of a sensitive word detection filter device according to an embodiment of the present invention;
fig. 9 is a schematic diagram of an electronic device for detecting and filtering sensitive words according to an embodiment of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
It should be noted that, in the embodiments of the present invention, all the expressions "first" and "second" are used to distinguish two entities with the same name but different entities or different parameters, and it is noted that the "first" and "second" are only used for convenience of expression, and should not be construed as limiting the embodiments of the present invention, and the following embodiments are not described one by one.
In one aspect, some alternative embodiments of the present invention provide a sensitive word detection filtering method.
As shown in fig. 1, some alternative embodiments of the present invention provide a method for detecting and filtering sensitive words, which includes:
S1: constructing a periodically updated dynamic sensitive word stock;
S2: obtaining a text to be tested, and preprocessing the text to be tested to obtain a text to be screened;
S3: judging the text to be screened according to poetry form characteristics, and screening out poetry sentence segments according to the judging result;
S4: extracting a target sentence segment from the poetry sentence segment according to the form characteristics of acrostic poems;
S5: performing word segmentation on the target sentence segment to obtain a plurality of keywords forming the target sentence segment;
S6: matching the plurality of keywords against the dynamic sensitive word stock;
S7: calculating a sensitivity value of the text to be tested according to the matching detection result, and filtering the keywords in the text to be tested according to the sensitivity value.
In this method, poetry sentence segments are screened from the text to be tested according to poetry form characteristics, target sentence segments that may contain sensitive words are extracted from the poetry sentence segments according to the form characteristics of acrostic poems, and the periodically updated dynamic sensitive word stock is used to detect sensitive words in the target sentence segments; in this way, sensitive words hidden in acrostic poems can be detected. Using a periodically updated dynamic sensitive word stock as the detection basis means that newly appearing sensitive words are recorded in real time, so target sentence segments are checked more completely, omissions are avoided even though sensitive words keep changing in practice, and the method is more robust. Classifying the text to be tested by topic and removing redundancy before the poetry sentence segments are screened reduces misjudgment, avoids manual interference and improves the accuracy of detection and filtering. All sensitive words are classified when the dynamic sensitive word stock is constructed; during matching detection, the candidate data range is narrowed step by step according to the keywords obtained by segmenting the target sentence segment before matching is performed; the sensitivity value of the text to be tested is then calculated from the matching result, and different handling is applied depending on the sensitivity value, which can greatly improve the speed of detection and filtering.
In some optional embodiments of the present invention, in the method for detecting and filtering sensitive words, the constructing a periodically updated dynamic sensitive word stock S1 includes:
collecting sensitive words in a current network, determining sensitive level factors of the sensitive words, and recording the sensitive words and the corresponding sensitive level factors into the dynamic sensitive word stock;
setting an updating period, adding new sensitive words into the dynamic sensitive word stock according to sensitive words in a network at the updating time when updating each time, determining sensitive level factors of the new sensitive words, and recording the sensitive level factors of the new sensitive words into the dynamic sensitive word stock;
where the first character of a sensitive word is a Chinese character, classifying the sensitive word according to the pinyin initial letter of that character;
where a sensitive word begins with pinyin or an English word, classifying the sensitive word according to its first letter.
It should be understood by those skilled in the art that the sensitivity level factor of a sensitive word may be determined from the specific content of the word with reference to existing rules for auditing sensitive words on the network. The update period can be preset according to actual working requirements and then adjusted adaptively according to the observed detection and filtering results, so that the final detection and filtering effect is as good as possible.
In the sensitive word detection and filtering method, a periodically updated dynamic sensitive word stock is constructed as the basis for matching detection. A flexibly adjustable update period ensures that the dynamic sensitive word stock records the sensitive words in the network completely, so that the method can still detect and filter sensitive words completely when they keep changing in practice, which avoids omissions and makes the method more robust. In addition, all sensitive words are pre-classified according to the pinyin initial letter of their first character when the dynamic sensitive word stock is constructed, which narrows the data range handled in the subsequent matching detection, reduces the matching workload and improves matching efficiency.
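A minimal sketch of such a word stock is shown below. The class name `DynamicLexicon`, the `initial_of` callback and the numeric level factors are hypothetical; the demo classifies words by their first Latin letter, whereas words that start with a Chinese character would need a pinyin lookup supplied through the same callback.

```python
from collections import defaultdict
from typing import Callable, Dict


class DynamicLexicon:
    """Sensitive words bucketed by an initial letter, each with a level factor."""

    def __init__(self, initial_of: Callable[[str], str]):
        # initial_of maps a word to its bucket key (hypothetical helper)
        self.initial_of = initial_of
        self.buckets: Dict[str, Dict[str, float]] = defaultdict(dict)

    def add(self, word: str, level_factor: float) -> None:
        self.buckets[self.initial_of(word)][word] = level_factor

    def update(self, new_words: Dict[str, float]) -> int:
        """Periodic update: add only words not yet recorded; return how many."""
        added = 0
        for word, level in new_words.items():
            if word not in self.buckets[self.initial_of(word)]:
                self.add(word, level)
                added += 1
        return added


def latin_initial(word: str) -> str:
    # Assumption for the demo: every word starts with a Latin letter.
    return word[0].lower()


lexicon = DynamicLexicon(latin_initial)
lexicon.add("spam", 0.8)  # illustrative level factors
added = lexicon.update({"scam": 0.9, "spam": 0.8, "Fraud": 0.7})
```

A scheduler would call `update` once per update period; only the bucket matching a keyword's initial needs to be searched during matching detection.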
In some optional embodiments of the present invention, in a method for detecting and filtering sensitive words, the preprocessing the text to be detected to obtain a text to be screened S2 includes:
determining a theme of the text to be tested according to semantic content and data sources of the text to be tested, classifying the text to be tested according to the theme of the text to be tested, and adding a classification mark for the text to be tested;
and removing meaningless marks and connecting characters from the text to be tested to obtain the text to be screened, wherein the meaningless marks comprise HTML tags and comments, and the connecting characters comprise modal particles, special symbols and numbers.
In this method, after the text data to be tested are collected, the text to be tested is subjected to classification, redundancy removal and word segmentation. It will be appreciated by those skilled in the art that whether a word counts as a sensitive word depends on the topic category of the text in which it appears. For example, a word that is sensitive on a web page with sensitive content may not properly be classified as sensitive when it appears on a health or educational-science web page. The topic classification of a text therefore affects the final result when judging whether a word in that text is sensitive. In some optional embodiments of the invention, the text to be tested is classified by topic, which avoids misjudgment and ensures the accuracy of the method's results.
It should be further understood by those skilled in the art that the collected text to be tested contains many meaningless marks and connecting characters: HTML tags and comments; modal particles such as "wa", "ou" and "o"; connective words such as "although …" and "even …"; and meaningless symbols and numbers such as "&", "#" and "555". These marks and characters occur more frequently than sensitive words, which increases the computation required for detection and affects the accuracy of the detection and filtering result. The method therefore removes redundancy from the text to be tested, eliminating these marks and characters, which avoids their interference and improves the accuracy and efficiency of sensitive word detection and filtering.
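The redundancy-removal step can be sketched with regular expressions from the Python standard library. This is an illustration under assumptions: the particle list and the symbol set are small hypothetical samples, "comments" is read as HTML comments, and sentence punctuation is deliberately kept because the later clause-splitting step depends on it.

```python
import re

# Assumed examples of modal particles; a real list would be longer.
MODAL_PARTICLES = "哇哦噢"


def remove_redundancy(raw: str) -> str:
    """Strip markup, comments, digits, special symbols and modal particles."""
    text = re.sub(r"<!--.*?-->", "", raw, flags=re.DOTALL)  # HTML comments
    text = re.sub(r"<[^>]+>", "", text)                      # HTML tags
    text = re.sub(r"[0-9&#@*^~_|\\/]+", "", text)            # digits and symbols
    text = re.sub("[" + MODAL_PARTICLES + "]", "", text)     # modal particles
    return text


cleaned = remove_redundancy("<p>白日依山尽，<!-- note -->黄河入海流哇555&#</p>")
```

Comments are removed before tags so that a `>` inside a comment cannot end a tag match early.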
As shown in fig. 2, in a method for detecting and filtering sensitive words provided in some optional embodiments of the present invention, the determining the text to be screened according to the poetry form feature, and screening to obtain a poetry segment S3 according to the determination result includes:
s31: performing clause processing on the text to be screened according to punctuation marks in the text to be screened by using a semantic analysis technology;
s32: if at least four sentences with the same length appear continuously, then the sentences with the same length of the at least four sentences continuously are cut and selected as equal-length sentence fragments;
s33: judging whether the equal-length sentence segments accord with the rule of level and rhyme of poems according to the text pinyin information of the equal-length sentence segments, and screening the equal-length sentence segments accord with the rule of level and rhyme of poems to obtain the poems.
Those skilled in the art will understand that, in order to detect sensitive words hidden in acrostic poems (hidden-head poems) that may exist in the text to be tested, the poetry content in that text must first be screened out. In the sensitive word detection and filtering method, the text to be screened is judged according to the form features of poetry: first, equal-length sentence segments that may be poetry are selected from the text to be screened according to sentence length; then these equal-length sentence segments are judged according to the level and oblique tone patterns and the rhyme patterns characteristic of poetry.
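For illustration, the length-based screening of steps S31 and S32 might be sketched as follows. The sketch splits on punctuation with a plain regular expression rather than the semantic analysis technology named in the method, and the names `split_clauses`, `equal_length_segments` and the parameter `min_run` are assumptions introduced for the example.

```python
import re

# Punctuation (Chinese and ASCII) and whitespace treated as clause delimiters.
SPLIT_RE = re.compile(r"[，。！？；、,.!?;\s]+")


def split_clauses(text):
    """S31 (simplified): split text into clauses at punctuation marks."""
    return [clause for clause in SPLIT_RE.split(text) if clause]


def equal_length_segments(clauses, min_run=4):
    """S32: return every maximal run of at least `min_run` consecutive
    clauses that all have the same length."""
    segments = []
    start = 0
    for i in range(1, len(clauses) + 1):
        run_ended = i == len(clauses) or len(clauses[i]) != len(clauses[start])
        if run_ended:
            if i - start >= min_run:
                segments.append(clauses[start:i])
            start = i
    return segments
```

Each returned segment is only a candidate; under the method it would still have to pass the tonal and rhyme judgment of step S33 before being treated as a poetry segment.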
Poetry is subject to strict rules on the number of characters, the number of lines, the level and oblique tones, the rhyme and so on. Representative forms such as the jueju (quatrain) and the lüshi (regulated verse) have a fixed number of lines, a fixed number of characters per line, fixed positions, fixed tones and fixed parallelism. For example, Wang Zhihuan's jueju "Climbing Stork Tower" ("The white sun sets behind the mountains, the Yellow River flows into the sea. To see a thousand li further, climb one more storey of the tower.") follows a prescribed level and oblique tone pattern and uses the rhyme "iu"; Du Fu's lüshi "Spring View" ("The state is shattered, yet mountains and rivers remain; in the city in spring, grass and trees grow deep. Moved by the times, flowers shed tears; hating parting, birds startle the heart. Beacon fires have burned for three months; a letter from home is worth ten thousand in gold. My white hair, scratched ever thinner, can barely hold a hairpin.") likewise follows a prescribed level and oblique tone pattern and uses the rhyme "an". The pinyin information of the characters in an equal-length sentence segment is compared with the level and oblique tone rules and the rhyme rules of poetry, thereby finally determining whether the equal-length sentence segment is poetry.
By judging the equal-length sentence segments in this way, the sensitive word detection and filtering method can further determine whether each equal-length sentence segment is poetry, and can thereby screen the poetry segments out of the text to be screened.
In some optional embodiments of the present invention, in the sensitive word detection and filtering method, step S4 of extracting a target sentence segment from the poetry segment according to the form features of acrostic poems (hidden-head poems) includes:
extracting the first character of each sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a head-hiding type acrostic poem;
As shown in fig. 3-a and fig. 3-b, which take five-character and seven-character acrostic poems as examples, the characters carrying the hidden information of a head-hiding type acrostic poem are located at the beginning of each sentence. The method therefore extracts the first character of each sentence in the poetry segment and connects these characters in sequence to obtain the corresponding target sentence segment.
extracting the last character of each sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a tail-hiding type acrostic poem;
As shown in fig. 4-a and fig. 4-b, which take five-character and seven-character acrostic poems as examples, the characters carrying the hidden information of a tail-hiding type acrostic poem are located at the end of each sentence. The method therefore extracts the last character of each sentence in the poetry segment and connects these characters in sequence to obtain the corresponding target sentence segment.
extracting a middle character of each sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a middle-hiding type acrostic poem;
As shown in fig. 5-a and fig. 5-b, which take five-character and seven-character acrostic poems as examples, the characters carrying the hidden information of a middle-hiding type acrostic poem are located in the middle of each sentence. The method therefore extracts one character from the middle of each sentence in the poetry segment and connects these characters in sequence to obtain the corresponding target sentence segment.
extracting the first and last characters of the first sentence and the first and last characters of the last sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a corner-hiding type acrostic poem;
As shown in fig. 6-a and fig. 6-b, which take five-character and seven-character acrostic poems as examples, the characters carrying the hidden information of a corner-hiding type acrostic poem are located at the four corners of the poem. The method therefore extracts the first and last characters of the first sentence and the first and last characters of the last sentence, and connects these characters in sequence to obtain the corresponding target sentence segment.
arranging the sentences of the poetry segment one per row, extracting the characters on a diagonal of the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a slanted-ladder type acrostic poem.
As shown in fig. 7-a and fig. 7-b, which take five-character and seven-character acrostic poems as examples, the characters carrying the hidden information of a slanted-ladder type acrostic poem are arranged diagonally. The method therefore extracts the diagonal characters and connects them in sequence to obtain the corresponding target sentence segment. During extraction, after a character at a certain position has been extracted from the first row, then starting from the second row the position of the character extracted from each row is shifted one position forward or one position backward relative to the position of the character extracted from the previous row. Fig. 7-a shows a five-character acrostic poem, in which the arrows indicate the order of character extraction. For example, the first character is extracted from the first row and, with the extraction position shifting backward from the second row onward, the second character is extracted from the second row, the third character from the third row, and so on; or the second character is extracted from the first row, the third character from the second row, the fourth character from the third row, and so on; or the last character is extracted from the first row and, with the extraction position shifting forward from the second row onward, the second-to-last character is extracted from the second row, the third-to-last character from the third row, and so on. Fig. 7-b shows a seven-character acrostic poem, in which the arrows indicate the order of character extraction (to avoid clutter, only the case in which the extraction position shifts backward is shown); the extraction principle for the seven-character poem is the same as for the five-character poem.
In the method, target sentence segments corresponding to the different types of acrostic poems are extracted from the poetry segments according to the form features of each type, so that sensitive word detection can then be performed on the target sentence segments.
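The five extraction patterns described above might be sketched as follows, purely for illustration. The function names are assumptions, the "middle" character is assumed to be the one at index `len(line) // 2` (the third character of a five-character line, the fourth of a seven-character line), and `lines` is assumed to be a non-empty list of non-empty, equal-length sentences of one poetry segment.

```python
def head_chars(lines):
    """First character of every line (head-hiding type)."""
    return "".join(line[0] for line in lines)


def tail_chars(lines):
    """Last character of every line (tail-hiding type)."""
    return "".join(line[-1] for line in lines)


def middle_chars(lines):
    """Middle character of every line (assumed to be index len // 2)."""
    return "".join(line[len(line) // 2] for line in lines)


def corner_chars(lines):
    """First and last characters of the first and of the last line."""
    first, last = lines[0], lines[-1]
    return first[0] + first[-1] + last[0] + last[-1]


def diagonal_chars(lines, start=0, step=1):
    """Characters along a diagonal: column `start` in the first row, then
    shifted by `step` (+1 backward, -1 forward) in each following row.
    Extraction stops when the column leaves the line."""
    picked = []
    for row, line in enumerate(lines):
        col = start + step * row
        if not 0 <= col < len(line):
            break
        picked.append(line[col])
    return "".join(picked)
```

Since the type of a given poetry segment is not known in advance, one plausible use is to produce all of these candidate target sentence segments and pass each of them on to word segmentation and matching.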
In some optional embodiments of the present invention, in the sensitive word detection and filtering method, performing word segmentation on the target sentence segment to obtain a plurality of keywords forming the target sentence segment includes:
performing word segmentation on the target sentence segment using a dictionary-based Chinese word segmentation technology to obtain the plurality of keywords forming the target sentence segment, wherein the dictionary used by the Chinese word segmentation technology includes all sensitive words in the dynamic sensitive word stock.
In sensitive word detection, words are the most direct objects of matching. In general, however, only characters, sentences and paragraphs in a text are marked off by obvious delimiters; words alone have no formal delimiter, so the text must first be segmented into words. In the sensitive word detection and filtering method, a dictionary-based Chinese word segmentation technology is used to segment the text, and before segmentation all sensitive words in the dynamic sensitive word stock are added to the dictionary used by the segmentation technology as user-defined words. The segmentation result therefore aligns more closely with the sensitive words in the dynamic sensitive word stock, which makes the final detection and filtering result more accurate.
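The specification does not name a particular dictionary-based segmentation algorithm. As one illustrative possibility, forward maximum matching, a common dictionary-based technique, might be sketched as follows; the function name `forward_max_match` and the parameter `max_len` are assumptions, and the sensitive words are simply merged into the dictionary before segmentation, as the paragraph above describes.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment `text` greedily, always taking the longest dictionary word
    that starts at the current position; unknown characters become
    single-character tokens."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)  # guarantee progress even for odd inputs
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For example, with a general dictionary `{"今天", "天气"}` extended by a placeholder sensitive word `"坏词"`, the call `forward_max_match("今天坏词天气", {"今天", "天气", "坏词"})` keeps the sensitive word intact as a single keyword instead of splitting it into characters.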
In some optional embodiments of the present invention, in the sensitive word detection and filtering method, performing matching detection of the plurality of keywords against the dynamic sensitive word stock includes:
selecting the corresponding sensitive word classification from the dynamic sensitive word stock according to the pinyin initial of the first character of the keyword;
starting from the first character of the keyword, screening out, within the corresponding sensitive word classification, the sensitive words that match the first character of the keyword, and continuing to screen out, from the sensitive words already screened, those that match the next character of the keyword, until the sensitive words that match the last character of the keyword have been screened out;
detecting whether, among the sensitive words screened as matching through the last character of the keyword, there is a sensitive word with the same number of characters as the keyword, and if so, determining that the keyword is a sensitive word.
When matching detection is performed on a keyword, the sensitive word classification corresponding to the keyword is first selected from the dynamic sensitive word stock, and the sensitive words matching the keyword are then screened within that classification. Screening starts from the first character of the keyword and proceeds character by character, narrowing the candidate set step by step, so that the matching range shrinks quickly and an accurate matching result is obtained. If no matching sensitive word can be found at some point in this process, the keyword is not a sensitive word. If a sensitive word matching every character of the keyword is found, but the character matching the last character of the keyword is not the last character of that sensitive word, then the keyword matches only the first few characters of the sensitive word and the sensitive word is longer than the keyword; in this case, too, the keyword is not a sensitive word, as will be apparent to those skilled in the art.
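The character-by-character narrowing described above might be sketched as follows. This is an illustrative sketch only: `build_buckets` and `is_sensitive` are hypothetical names, and `initial_of` stands for whatever function maps a word to its classification key (the pinyin initial of its first character for Chinese words, or the first letter for pinyin and English words). No pinyin conversion is implemented here; the usage example relies on placeholder English words.

```python
def build_buckets(sensitive_words, initial_of):
    """Group the sensitive word stock by classification key."""
    buckets = {}
    for word in sensitive_words:
        buckets.setdefault(initial_of(word), []).append(word)
    return buckets


def is_sensitive(keyword, buckets, initial_of):
    """Return True only if some sensitive word equals `keyword` exactly."""
    if not keyword:
        return False
    # Step 1: select the classification by the keyword's initial.
    candidates = buckets.get(initial_of(keyword), [])
    # Step 2: narrow the candidates one character at a time.
    for pos, ch in enumerate(keyword):
        candidates = [w for w in candidates if len(w) > pos and w[pos] == ch]
        if not candidates:
            return False  # no sensitive word shares this prefix
    # Step 3: a survivor must have the same length as the keyword;
    # longer survivors only share a prefix with it.
    return any(len(w) == len(keyword) for w in candidates)
```

With `initial_of = lambda w: w[0].lower()` and the placeholder stock `["spam", "spammer", "scam"]`, the keyword `"spam"` is reported as sensitive, whereas `"spa"` is not, because every surviving candidate is longer than the keyword.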
In some optional embodiments of the present invention, in a method for detecting and filtering a sensitive word, the calculating a sensitivity value of the text to be detected according to a matching detection result, and filtering the keyword in the text to be detected according to the sensitivity value includes:
calculating word frequencies of all the keywords which are detected in the text to be detected and are sensitive words, and determining word frequency factors of the keywords according to the word frequencies of the keywords:
$$W_{i,j} = \frac{w_{i,j}}{\sum_{x} w_{x,j}}$$
wherein $W_{i,j}$ represents the word frequency of the keyword $i$ in the text $j$ to be tested, $w_{i,j}$ represents the number of occurrences of the keyword $i$ in the text $j$ to be tested, and $\sum_{x} w_{x,j}$ represents the total number of keywords in the text $j$ to be tested;
the word frequency factors of the keywords are as follows:
[Equation image not reproduced in this text: definition of the word frequency factor $wf_i$]
wherein $wf_i$ represents the word frequency factor of the keyword $i$;
in the sensitive word detection and filtering method, the word frequency factor of each keyword is calculated from the word frequency of that keyword, where the word frequency $W_{i,j}$ is the ratio of the number of occurrences of the keyword $i$ in the text $j$ to be tested to the total number of keywords in the text $j$ to be tested. The more often the keyword $i$ appears, the larger its word frequency factor $wf_i$ and the greater its influence.
determining a topic factor of the keyword according to the classification mark of the text to be tested;
determining a position (azimuth) factor of the keyword according to the position, within the text to be tested, of the poetry segment containing the keyword;
calculating a sensitivity weight of the keyword according to the word frequency factor, the theme factor, the azimuth factor and the sensitivity level factor of the matched sensitive word of the keyword, and calculating a sensitivity value of the text to be detected according to the sensitivity weight of the keyword:
$$value_i = \alpha \times wf_i + \beta \times pos_i + \lambda \times lev_i + \theta \times top_i$$
wherein $value_i$ represents the sensitivity weight of the keyword $i$, $wf_i$ represents the word frequency factor of the keyword $i$, $pos_i$ represents the position (azimuth) factor of the keyword $i$, $lev_i$ represents the sensitivity level factor of the sensitive word matched by the keyword $i$, and $top_i$ represents the topic factor of the keyword $i$; $\alpha$, $\beta$, $\lambda$ and $\theta$ are respectively the word frequency adjustment parameter, the position adjustment parameter, the sensitivity level adjustment parameter and the topic adjustment parameter of the keyword $i$;
those skilled in the art will understand that the word frequency adjustment parameter, the position adjustment parameter, the sensitivity level adjustment parameter and the topic adjustment parameter may be preset according to the degree of influence of the word frequency, the position, the sensitivity level and the topic, and may afterwards be adjusted adaptively according to the actual detection and filtering effect, so as to optimize the final sensitive word detection and filtering result.
The sensitivity value of the text to be tested is as follows:
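[Equation image not reproduced in this text: the sensitivity value $V$ of the text to be tested, expressed in terms of the sensitivity weights $value_i$ of the keywords]
wherein $V$ represents the sensitivity value of the text to be tested, and $k$ represents the total number of keywords in the text to be tested that are detected as sensitive words;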
comparing the sensitivity value $V$ of the text to be tested with a text sensitivity threshold $\mu$;
if $V \geq \mu$, filtering and masking, one by one, all the keywords in the text to be tested that are detected as sensitive words;
if $V < \mu$, comparing the word frequency $W_{i,j}$ of each keyword in the text to be tested that is detected as a sensitive word with a sensitive frequency threshold $l$, and filtering and masking the keywords whose word frequency $W_{i,j}$ is smaller than the sensitive frequency threshold $l$.
In the sensitive word detection and filtering method, the sensitivity weight of each keyword is calculated from its word frequency factor, topic factor, position (azimuth) factor and the sensitivity level factor of its matched sensitive word, and the overall sensitivity value of the text to be tested is then determined. Because four influencing factors (word frequency, topic, position and sensitivity level) are considered together, the final detection and filtering result is more accurate and appropriate.
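The scoring and filtering decision described above might be sketched as follows. Two points are assumptions made only so that the example can run, because the corresponding equations are not reproduced in this text: the word frequency factor $wf_i$ is taken to equal the word frequency $W_{i,j}$, and the sensitivity value $V$ is taken to be the sum of the sensitivity weights of the $k$ sensitive keywords. The names `Weights`, `sensitivity_weight` and `decide` are likewise hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Weights:
    alpha: float = 1.0  # word frequency adjustment parameter
    beta: float = 1.0   # position adjustment parameter
    lam: float = 1.0    # sensitivity level adjustment parameter (lambda)
    theta: float = 1.0  # topic adjustment parameter


def sensitivity_weight(wf, pos, lev, top, w):
    """value_i = alpha*wf_i + beta*pos_i + lambda*lev_i + theta*top_i."""
    return w.alpha * wf + w.beta * pos + w.lam * lev + w.theta * top


def decide(keywords, factors, w, mu, freq_threshold):
    """Return (V, set of keywords to mask).

    keywords: all keywords of the text, in order, with repetitions.
    factors:  sensitive keyword -> (pos_i, lev_i, top_i).
    """
    counts = Counter(keywords)
    total = sum(counts.values())
    # W_ij = occurrences of keyword i / total number of keywords in text j
    freq = {k: counts[k] / total for k in factors if counts[k]}
    # ASSUMPTION: wf_i = W_ij (the patent's own formula is not available).
    values = {k: sensitivity_weight(freq[k], *factors[k], w) for k in freq}
    # ASSUMPTION: V is the sum of value_i over the k sensitive keywords.
    v = sum(values.values())
    if v >= mu:
        masked = set(freq)  # mask every detected sensitive keyword
    else:
        # As stated in the method: mask keywords whose word frequency is
        # smaller than the sensitive frequency threshold l.
        masked = {k for k, f in freq.items() if f < freq_threshold}
    return v, masked
```

In practice the four adjustment parameters and both thresholds would be tuned against the observed detection and filtering effect, as the description notes.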
In another aspect, some optional embodiments of the present invention further provide a sensitive word detection filtering apparatus.
As shown in fig. 8, some alternative embodiments of the present invention provide a sensitive word detection filtering apparatus, including:
the sensitive word stock module 1 is configured to construct a dynamic sensitive word stock which is updated periodically;
the preprocessing module 2 is configured to acquire a text to be detected, and preprocess the text to be detected to obtain a text to be screened;
the poetry segment module 3 is configured to judge the text to be screened according to poetry form features, and to screen out poetry segments according to the judgment result;
the target sentence segment module 4 is configured to extract a target sentence segment from the poetry segment according to the form features of acrostic poems (hidden-head poems);
the word segmentation module 5 is configured to perform word segmentation processing on the target sentence segment to obtain a plurality of keywords forming the target sentence segment;
a match detection module 6 configured to perform match detection on a plurality of the keywords and the dynamic sensitive word stock;
and the filtering module 7 is configured to calculate a sensitivity value of the text to be detected according to the matching detection result, and filter the keywords in the text to be detected according to the sensitivity value.
The sensitive word detection and filtering device screens poetry segments out of the text to be tested according to poetry form features, then extracts from the poetry segments, according to the form features of acrostic poems (hidden-head poems), target sentence segments that may contain sensitive words, and detects the sensitive words in the target sentence segments using a periodically updated dynamic sensitive word stock. Because the periodically updated dynamic sensitive word stock serves as the basis for detection and newly emerging sensitive words are recorded in it in real time, the target sentence segments can be detected more comprehensively, omissions are avoided even though sensitive words keep changing in practice, and the device is more robust. Topic classification and redundancy removal are performed on the text to be tested before the poetry segments are screened, which reduces misjudgment, avoids deliberate interference and improves the accuracy of sensitive word detection and filtering. All sensitive words are classified when the dynamic sensitive word stock is constructed; during matching detection, the candidate range is narrowed step by step for each keyword obtained by segmenting the target sentence segment before matching is carried out; the sensitivity value of the text to be tested is calculated from the matching detection result, and different handling is applied depending on the sensitivity value. Together these measures can greatly increase the speed of detection and filtering.
In another aspect, some optional embodiments of the present invention further provide a sensitive word detection filtering electronic device.
As shown in fig. 9, the electronic device includes:
one or more processors 801, and a memory 802, one processor 801 being illustrated in fig. 9.
The electronic device for executing the sensitive word detection filtering method may further include: an input device 803 and an output device 804.
The processor 801, the memory 802, the input device 803 and the output device 804 may be connected by a bus or by other means; connection by a bus is taken as the example in fig. 9.
The memory 802 is used as a non-volatile computer readable storage medium, and may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the sensitive word detection filtering method in the embodiments of the present application. The processor 801 executes various functional applications of the server and data processing, that is, implements the sensitive word detection filtering method of the above-described method embodiment, by running nonvolatile software programs, instructions, and modules stored in the memory 802.
The memory 802 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through use of the device performing the sensitive word detection filtering method, and the like. In addition, the memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 802 may optionally include memory located remotely from the processor 801, and such remote memory may be connected to the sensitive word detection filtering device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the means for performing the sensitive word detection filtering method. The output device 804 may include a display device such as a display screen.
One or more modules are stored in the memory 802 and, when executed by the one or more processors 801, perform the sensitive word detection filtering method of any of the method embodiments described above.
The device of the foregoing embodiment is used to implement the corresponding method of the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the invention, the steps may be implemented in any order and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the invention. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the present invention should be included in the scope of the present invention.

Claims (8)

1. A method for detecting and filtering sensitive words, comprising:
constructing a periodically updated dynamic sensitive word stock;
obtaining a text to be tested, and preprocessing the text to be tested to obtain a text to be screened;
judging the text to be screened according to poetry form features, and screening out poetry segments according to the judgment result, wherein said judging and screening comprises:
splitting the text to be screened into sentences according to the punctuation marks in the text to be screened, using a semantic analysis technology;
if at least four consecutive sentences of the same length appear, extracting those consecutive equal-length sentences as an equal-length sentence segment;
judging, according to the pinyin information of the characters in the equal-length sentence segment, whether the segment conforms to the level and oblique tone rules and the rhyme rules of poetry, and selecting the equal-length sentence segments that conform to these rules as the poetry segments;
extracting a target sentence segment from the poetry segment according to the form features of acrostic poems (hidden-head poems), wherein said extracting comprises:
extracting the first character of each sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a head-hiding type acrostic poem;
extracting the last character of each sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a tail-hiding type acrostic poem;
extracting a middle character of each sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a middle-hiding type acrostic poem;
extracting the first and last characters of the first sentence and the first and last characters of the last sentence in the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a corner-hiding type acrostic poem;
arranging the sentences of the poetry segment one per row, extracting the characters on a diagonal of the poetry segment, and connecting these characters in sequence to obtain the target sentence segment corresponding to a slanted-ladder type acrostic poem;
performing word segmentation on the target sentence segment to obtain a plurality of keywords forming the target sentence segment;
performing matching detection of the plurality of keywords against the dynamic sensitive word stock;
and calculating a sensitivity value of the text to be tested according to the matching detection result, and filtering the keywords in the text to be tested according to the sensitivity value.
2. The method of claim 1, wherein constructing the periodically updated dynamic sensitive word stock comprises:
collecting sensitive words in the current network, determining a sensitivity level factor for each sensitive word, and recording the sensitive words and their corresponding sensitivity level factors in the dynamic sensitive word stock;
setting an update period and, at each update, adding new sensitive words to the dynamic sensitive word stock according to the sensitive words in the network at the time of the update, determining the sensitivity level factors of the new sensitive words, and recording the sensitivity level factors of the new sensitive words in the dynamic sensitive word stock;
classifying sensitive words whose first character is a Chinese character according to the pinyin initial of that first character;
classifying sensitive words whose first character is pinyin or an English word according to the first letter.
3. The method according to claim 2, wherein preprocessing the text to be tested to obtain the text to be screened comprises:
determining a topic of the text to be tested according to the semantic content and the data source of the text to be tested, classifying the text to be tested according to its topic, and adding a classification mark to the text to be tested;
and removing meaningless marks and connecting characters from the text to be tested to obtain the text to be screened, wherein the meaningless marks comprise HTML tags and annotations, and the connecting characters comprise modal particles, special symbols and numbers.
4. The method of claim 1, wherein performing word segmentation on the target sentence segment to obtain a plurality of keywords forming the target sentence segment comprises:
performing word segmentation on the target sentence segment using a dictionary-based Chinese word segmentation technology to obtain the plurality of keywords forming the target sentence segment, wherein the dictionary used by the Chinese word segmentation technology comprises all sensitive words in the dynamic sensitive word stock.
5. The method of claim 1, wherein performing matching detection of the plurality of keywords against the dynamic sensitive word stock comprises:
selecting the corresponding sensitive word classification from the dynamic sensitive word stock according to the pinyin initial of the first character of the keyword;
starting from the first character of the keyword, screening out, within the corresponding sensitive word classification, the sensitive words that match the first character of the keyword, and continuing to screen out, from the sensitive words already screened, those that match the next character of the keyword, until the sensitive words that match the last character of the keyword have been screened out;
detecting whether, among the sensitive words screened as matching through the last character of the keyword, there is a sensitive word with the same number of characters as the keyword, and if so, determining that the keyword is a sensitive word.
6. The method of claim 3, wherein calculating the sensitivity value of the text to be tested according to the matching detection result, and filtering the keywords in the text to be tested according to the sensitivity value, comprises:
calculating the word frequency of each keyword detected as a sensitive word in the text to be tested, and determining the word frequency factor of the keyword according to its word frequency:
[Formula QLYQS_1]
wherein [Symbol QLYQS_2] represents the word frequency of keyword i in the text to be tested j, [Symbol QLYQS_3] represents the number of occurrences of keyword i in the text to be tested j, and [Symbol QLYQS_4] represents the total number of keywords in the text to be tested j;
the word frequency factor of the keyword is:
[Formula QLYQS_5]
wherein [Symbol QLYQS_6] represents the word frequency factor of keyword i;
determining the theme factor of the keyword according to the classification mark of the text to be tested;
determining the azimuth factor of the keyword according to the position, within the text to be tested, of the poetry sentence segment containing the keyword;
calculating the sensitivity weight of the keyword according to the word frequency factor, the theme factor, the azimuth factor, and the sensitivity level factor of the sensitive word matched by the keyword, and calculating the sensitivity value of the text to be tested according to the sensitivity weights of the keywords:
[Formula QLYQS_7]
wherein [Symbol QLYQS_8] represents the sensitivity weight of keyword i, [Symbol QLYQS_9] represents the word frequency factor of keyword i, [Symbol QLYQS_10] represents the azimuth factor of keyword i, [Symbol QLYQS_11] represents the sensitivity level factor of the sensitive word matched by keyword i, and [Symbol QLYQS_12] represents the theme factor of keyword i; [Symbols QLYQS_13] are, respectively, the word frequency adjusting parameter, the azimuth adjusting parameter, the sensitivity level adjusting parameter and the theme adjusting parameter of keyword i;
the sensitivity value of the text to be tested is:
[Formula QLYQS_14]
wherein [Symbol QLYQS_15] represents the sensitivity value of the text to be tested, and [Symbol QLYQS_16] represents the total number of keywords detected as sensitive words in the text to be tested;
comparing the sensitivity value [Symbol QLYQS_17] of the text to be tested with a text sensitivity threshold [Symbol QLYQS_18];
if [Condition QLYQS_19], filtering and masking, one by one, all the keywords detected as sensitive words in the text to be tested;
if [Condition QLYQS_20], comparing the word frequency [Symbol QLYQS_21] of each keyword detected as a sensitive word in the text to be tested with a sensitive frequency threshold l, and filtering and masking the keywords whose word frequency [Symbol QLYQS_22] is less than the sensitive frequency threshold l.
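The scoring flow of claim 6 can be sketched as follows. The formula images are not reproduced in this text, so only the word frequency (occurrences of a keyword divided by the total number of keywords) is taken from the claim wording. Everything else is an assumption made for illustration: the word frequency factor is taken to be the word frequency itself, the sensitivity weight is a weighted sum of the four factors, the text sensitivity value is the mean weight, and the first branch fires when the value reaches the threshold. The names `Factors`, `sensitivity_weight` and `words_to_mask` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Factors:
    """Per-keyword factors named in claim 6; the values used here are made up."""
    position: float  # azimuth factor
    level: float     # sensitivity level factor of the matched sensitive word
    theme: float     # theme factor


def term_frequencies(keywords):
    """Word frequency: occurrences of a keyword / total keywords in the text."""
    counts = Counter(keywords)
    total = len(keywords)
    return {word: count / total for word, count in counts.items()}


def sensitivity_weight(tf, f, params=(0.25, 0.25, 0.25, 0.25)):
    """ASSUMED form: weighted sum of the four factors (not from the source)."""
    a, b, c, d = params  # word frequency, azimuth, level, theme parameters
    return a * tf + b * f.position + c * f.level + d * f.theme


def words_to_mask(keywords, sensitive, text_threshold, freq_threshold):
    """Return the sensitive keywords that would be filtered and masked."""
    tf = term_frequencies(keywords)
    hits = {w: f for w, f in sensitive.items() if w in tf}
    if not hits:
        return set()
    weights = [sensitivity_weight(tf[w], f) for w, f in hits.items()]
    score = sum(weights) / len(weights)  # ASSUMED aggregation: mean weight
    if score >= text_threshold:          # ASSUMED direction of the comparison
        return set(hits)                 # mask every detected sensitive keyword
    # Otherwise compare each word frequency with the frequency threshold l;
    # "less than" follows the translated claim text as written.
    return {w for w in hits if tf[w] < freq_threshold}


keywords = ["a", "b", "a", "c"]
sensitive = {"a": Factors(1.0, 1.0, 1.0), "b": Factors(0.0, 0.0, 0.0)}
print(sorted(words_to_mask(keywords, sensitive, 0.4, 0.3)))
```

Keeping the four adjusting parameters in a single tuple makes it easy to swap in a different weighting once the actual formulas are available.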
7. A sensitive word detection and filtering device, comprising:
the sensitive word stock module is configured to construct a dynamic sensitive word stock which is updated periodically;
the preprocessing module is configured to acquire a text to be tested, and preprocess the text to be tested to obtain a text to be screened;
the poetry sentence segment module is configured to judge the text to be screened according to poetry form characteristics and to screen out poetry sentence segments according to the judgment result, wherein the judging and screening comprises:
performing clause processing on the text to be screened, using a semantic analysis technique, according to the punctuation marks in the text to be screened;
if at least four consecutive sentences of the same length appear, extracting those consecutive sentences of the same length as an equal-length sentence segment;
judging, according to the pinyin information of the characters in the equal-length sentence segment, whether the equal-length sentence segment conforms to the tonal (level and oblique) and rhyme rules of poetry, and screening out the equal-length sentence segments that conform to those rules to obtain the poetry sentence segments;
the target sentence segment module is configured to extract a target sentence segment from the poetry sentence segment according to the form characteristics of acrostic (hidden-head) poetry, wherein the extracting comprises:
extracting the first character of each sentence in the poetry sentence segment and connecting them in order to obtain the target sentence segment corresponding to a head-hiding acrostic poem;
extracting the last character of each sentence in the poetry sentence segment and connecting them in order to obtain the target sentence segment corresponding to a tail-hiding acrostic poem;
extracting the middle character of each sentence in the poetry sentence segment and connecting them in order to obtain the target sentence segment corresponding to a middle-hiding acrostic poem;
extracting the first and last characters of the first sentence and the first and last characters of the last sentence in the poetry sentence segment and connecting them in order to obtain the target sentence segment corresponding to a corner-hiding acrostic poem;
arranging the sentences of the poetry sentence segment side by side in rows, extracting the characters on the diagonal, and connecting them in order to obtain the target sentence segment corresponding to a diagonal (inclined-ladder) acrostic poem;
the word segmentation module is configured to segment the target sentence segment to obtain a plurality of keywords forming the target sentence segment;
the matching detection module is configured to perform matching detection on the keywords and the dynamic sensitive word stock;
the filtering module is configured to calculate a sensitivity value of the text to be tested according to the matching detection result, and filter the keywords in the text to be tested according to the sensitivity value.
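The poetry-screening and target-extraction steps of claim 7 can be sketched as follows. This is a minimal illustration under stated assumptions and not the patented implementation. Clause splitting uses a plain regular expression instead of a semantic analysis technique, the tonal and rhyme check is omitted because it needs pinyin data, and for sentences of even length the "middle" character is taken to be the one just right of centre. All function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split on common Chinese and ASCII punctuation; drop empty pieces."""
    return [s for s in re.split(r"[,。!?;、,.!?;\s]+", text) if s]


def equal_length_runs(sentences, min_run=4):
    """Return runs of at least `min_run` consecutive sentences of equal length."""
    runs, start = [], 0
    for i in range(1, len(sentences) + 1):
        # Close the current run at the end of input or when the length changes.
        if i == len(sentences) or len(sentences[i]) != len(sentences[start]):
            if i - start >= min_run:
                runs.append(sentences[start:i])
            start = i
    return runs


def acrostic_candidates(lines):
    """Build the five candidate target segments for one equal-length run."""
    width = len(lines[0])
    return {
        "head": "".join(s[0] for s in lines),
        "tail": "".join(s[-1] for s in lines),
        "middle": "".join(s[width // 2] for s in lines),
        "corner": lines[0][0] + lines[0][-1] + lines[-1][0] + lines[-1][-1],
        # Diagonal: character i of sentence i, while the sentence is long enough.
        "diagonal": "".join(s[i] for i, s in enumerate(lines) if i < width),
    }


text = "这是一段普通文字。床前明月光,疑是地上霜。举头望明月,低头思故乡。"
poem = equal_length_runs(split_sentences(text))[0]
candidates = acrostic_candidates(poem)
print(candidates["head"])  # 床疑举低
```

Each candidate string would then go through word segmentation and sensitive-word matching in the same way, so the extraction step only has to enumerate the reading orders and does not need to know which one, if any, hides a message.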
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 6.
CN201910561689.8A 2019-06-26 2019-06-26 Sensitive word detection and filtering method and device and electronic equipment Active CN110457428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910561689.8A CN110457428B (en) 2019-06-26 2019-06-26 Sensitive word detection and filtering method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910561689.8A CN110457428B (en) 2019-06-26 2019-06-26 Sensitive word detection and filtering method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110457428A CN110457428A (en) 2019-11-15
CN110457428B CN110457428B (en) 2023-07-04

Family

ID=68481144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910561689.8A Active CN110457428B (en) 2019-06-26 2019-06-26 Sensitive word detection and filtering method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110457428B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241389B (en) * 2019-12-30 2024-03-22 西安鼎辉物联智能科技有限公司 Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
CN111737398B (en) * 2020-05-26 2023-06-23 北京百度网讯科技有限公司 Method and device for retrieving sensitive words in text, electronic equipment and storage medium
CN111737627A (en) * 2020-06-28 2020-10-02 北京明略软件系统有限公司 Page sensitivity detection method and device, electronic equipment and storage medium
CN111859013A (en) * 2020-07-17 2020-10-30 腾讯音乐娱乐科技(深圳)有限公司 Data processing method, device, terminal and storage medium
CN112231442A (en) * 2020-10-15 2021-01-15 北京临近空间飞行器系统工程研究所 Sensitive word filtering method and device
CN112364153A (en) * 2020-11-10 2021-02-12 中数通信息有限公司 Keyword identification method and device based on interference characteristics
CN114257563B (en) * 2021-12-20 2023-10-24 创盛视联数码科技(北京)有限公司 Filtering method for chat content callback in live broadcasting room
CN116776862A (en) * 2023-08-25 2023-09-19 福昕鲲鹏(北京)信息科技有限公司 Sensitive word shielding method, device, equipment and medium of OFD file

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117339A (en) * 2011-03-30 2011-07-06 曹晓晶 Filter supervision method specific to unsecure web page texts
CN106254074A (en) * 2016-08-12 2016-12-21 南京航空航天大学 Text information hiding technique using Song-dynasty poetry as carrier, based on hybrid encryption
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word
CN106569995A (en) * 2016-09-26 2017-04-19 天津大学 Method for automatically generating Chinese poetry based on corpus and metrical rule
CN108280130A (en) * 2017-12-22 2018-07-13 中国电子科技集团公司第三十研究所 Method for finding sensitive data in big text data
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 Method for identifying sensitive information in text, electronic device, and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
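Research on a word segmentation method suitable for Tang poetry verses; Yan Wei et al.; Modern Computer (Professional Edition); 2016-01-25 (No. 03); abstract, sections 1-5 *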

Also Published As

Publication number Publication date
CN110457428A (en) 2019-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant