CN109657228A - Sensitive text determination method and device - Google Patents
- Publication number
- CN109657228A CN201811290233.4A
- Authority
- CN
- China
- Prior art keywords
- character
- text
- sensitive
- target text
- white list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Embodiments of the disclosure provide a sensitive text determination method and device. The method comprises: determining whether at least one character in a target text belongs to a preset blacklist; when no character belongs to the preset blacklist, matching the target text against a preset white list and counting the total length of the matched characters; determining a match parameter between the target text and the white list according to the total length of the matched characters and the length of the target text; and, when the match parameter is greater than a preset match parameter threshold, determining that the target text is non-sensitive text. Because the match parameter is calculated from the matched length and the text length, and whether the text is sensitive is decided from the match parameter, the recognition accuracy for sensitive text is improved. Because the white list and blacklist, which hold less data, are consulted first to decide whether the text is sensitive, recognition speed is also effectively improved.
Description
Technical field
Embodiments of the disclosure relate to the field of text matching technology, and in particular to a sensitive text determination method and device.
Background
Online commodity sales platforms make people's lives more convenient. To ensure the healthy development of a platform and to reduce operational risk, sensitive information in commodity information needs to be identified and filtered.
In the prior art, improved sensitive text determination schemes achieve better matching efficiency than sensitive information identification based on full-text matching. They mainly identify the sensitive information in a text by means of a matching algorithm. For example, the KMP algorithm uses string matching: it repeatedly shifts a reference string along the target string to judge whether the target string contains the reference string. When the reference string is identical to some segment of the target string, the target string is determined to contain the reference string and the match succeeds; when the reference string is identical to no segment of the target string, the target string is determined not to contain the reference string.
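The KMP matching described above can be sketched as follows. This is a minimal illustration only; the function names `build_failure` and `kmp_contains` are chosen here for the example and do not come from the disclosure.

```python
def build_failure(pattern):
    """fail[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(target, pattern):
    """Return True if the reference string `pattern` occurs in `target`."""
    if not pattern:
        return True
    fail = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for ch in target:
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # shift the pattern without re-reading target
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return True
    return False
```

The failure table lets the reference string be shifted by more than one position after a mismatch, which is what makes the scan linear in the length of the target string.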
As can be seen, when deciding whether a text is sensitive, the above scheme treats the text as sensitive as soon as a match succeeds. The algorithm is simplistic, which leads to low accuracy; in addition, matching entries one by one leads to low recognition speed.
Summary of the invention
Embodiments of the disclosure provide a sensitive text determination method and device, which help to improve the accuracy of determining sensitive text.
According to a first aspect of the embodiments of the disclosure, a sensitive text determination method is provided, the method comprising:
determining whether at least one character in a target text belongs to a preset blacklist;
when no character belongs to the preset blacklist, matching the target text against a preset white list, and counting the total length of the matched characters;
determining a match parameter between the target text and the white list according to the total length of the matched characters and the length of the target text;
when the match parameter is greater than a preset match parameter threshold, determining that the target text is non-sensitive text.
According to a second aspect of the embodiments of the disclosure, a sensitive text determining device is provided, the device comprising:
a blacklist matching module, configured to determine whether at least one character in a target text belongs to a preset blacklist;
a white list matching module, configured to, when no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters;
a match parameter determining module, configured to determine a match parameter between the target text and the white list according to the total length of the matched characters and the length of the target text;
a sensitivity determining module, configured to determine that the target text is non-sensitive text when the match parameter is greater than a preset match parameter threshold.
According to a third aspect of the embodiments of the disclosure, an electronic device is provided, comprising:
a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the aforementioned sensitive text determination method when executing the program.
According to a fourth aspect of the embodiments of the disclosure, a readable storage medium is provided; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the aforementioned sensitive text determination method.
Embodiments of the disclosure provide a sensitive text determination method and device. The method comprises: determining whether at least one character in a target text belongs to a preset blacklist; when no character belongs to the preset blacklist, matching the target text against a preset white list and counting the total length of the matched characters; determining a match parameter between the target text and the white list according to the total length of the matched characters and the length of the target text; and, when the match parameter is greater than a preset match parameter threshold, determining that the target text is non-sensitive text. Because the match parameter is calculated from the matched length and the text length, and whether the text is sensitive is decided from the match parameter, the recognition accuracy for sensitive text is improved. Because the white list and blacklist, which hold less data, are consulted first to decide whether the text is sensitive, recognition speed is also effectively improved.
Brief description of the drawings
To explain the technical solutions of the embodiments of the disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the specific steps of a sensitive text determination method provided by Embodiment one of the disclosure;
Fig. 2 is a flow chart of the specific steps of a sensitive text determination method provided by Embodiment two of the disclosure;
Fig. 3 is a structure diagram of a sensitive text determining device provided by Embodiment three of the disclosure;
Fig. 4 is a structure diagram of a sensitive text determining device provided by Embodiment four of the disclosure;
Fig. 5 is a structure diagram of the electronic device provided by the embodiments of the disclosure.
Detailed description
The technical solutions in the embodiments of the disclosure are described clearly and completely below with reference to the drawings in the embodiments of the disclosure. Obviously, the described embodiments are some, rather than all, of the embodiments of the disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the disclosure without creative effort fall within the scope of protection of the disclosure.
Embodiment one
Referring to Fig. 1, a flow chart of the specific steps of a sensitive text determination method provided by Embodiment one of the disclosure is shown.
Step 101: determine whether at least one character in the target text belongs to a preset blacklist.
The blacklist stores seriously sensitive objects, for example words related to terrorism, drugs, or violence.
Because the blacklist is independent of the business scenario and its entries have serious consequences, it needs to be filtered first.
Specifically, each word in the blacklist is matched against the target text. If at least one blacklist object is present in the target text, the match succeeds and the target text is confirmed to be sensitive text. If no blacklist object is present in the target text, the match fails, and the target text must be processed further to identify whether it is sensitive text.
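The blacklist check of step 101 can be sketched as a simple containment test. The function name and the placeholder blacklist entries below are assumptions made for illustration; the disclosure does not prescribe a particular data structure.

```python
def hits_blacklist(target_text, blacklist):
    """Return True if at least one blacklist entry occurs in the text."""
    return any(word in target_text for word in blacklist)


# Placeholder entries, not taken from the disclosure.
BLACKLIST = {"forbidden-item", "banned-word"}

sensitive = hits_blacklist("cheap forbidden-item for sale", BLACKLIST)
needs_more_checks = not hits_blacklist("fresh mineral water", BLACKLIST)
```

For a large blacklist, a multi-pattern matcher would avoid scanning the text once per entry; the loop above is kept for clarity.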
Step 102: when no character belongs to the preset blacklist, match the target text against a preset white list, and count the total length of the matched characters.
The white list stores text information that is exempt from sensitivity checks. It can be set manually or mined from a standard database. For example, in a typical application scenario of the embodiment of the invention, an online sales platform performs sensitivity identification on commodity text. Commodities that the platform considers certain to contain no sensitive information, or commodities uploaded by merchants it cooperates with frequently, can be added to a standard commodity library, and the white list can then be mined from that library.
It can be understood that the simplest white list matching method is to match each object in the white list against the target text: if at least one white list object is present in the target text, the match succeeds; otherwise, the match fails.
Step 103: determine the match parameter between the target text and the white list according to the total length of the matched characters and the length of the target text.
In the embodiment of the invention, a match parameter is used to characterize the degree of matching more precisely. For example, when the match fails, the match parameter is 0. When the match succeeds, the more objects are matched and the longer they are, the larger the match parameter; the fewer objects are matched and the shorter they are, the smaller the match parameter.
It should be noted that the embodiment of the invention can be applied to various text matching scenarios and is not limited to sensitivity identification of commodity text.
Step 104: when the match parameter is greater than a preset match parameter threshold, determine that the target text is non-sensitive text.
The match parameter threshold is used to judge whether sensitivity identification needs to be performed on the target text. It can be understood that the match parameter threshold can be set according to the practical application scenario; the embodiment of the invention places no restriction on it.
It follows that when the match parameter is less than the match parameter threshold, the target text is not considered exempt from sensitivity identification, and it must be further judged whether the target text is sensitive text. When the match parameter is greater than or equal to the match parameter threshold, the target text is considered exempt from sensitivity identification and is directly regarded as non-sensitive text.
In a typical application scenario of the embodiment of the invention, when the online sales platform determines that the target text of a commodity is sensitive text, the commodity fails the audit and the merchant is prompted to modify the commodity information; when the platform determines that the target text of a commodity is not sensitive text, the commodity passes the audit and is allowed onto the platform.
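Steps 101 to 104 can be put together in a short sketch. Embodiment one does not fix how the match parameter is computed from the two lengths, so the ratio used in Embodiment two is assumed here; the function name, the return labels, and the sample white list are hypothetical.

```python
def classify_embodiment_one(text, blacklist, whitelist, threshold):
    """Blacklist first, then white list ratio, then threshold."""
    # Step 101: any blacklist hit makes the text sensitive.
    if any(word in text for word in blacklist):
        return "sensitive"
    # Step 102: total length of the white list entries found in the text.
    matched_len = sum(len(word) for word in whitelist if word in text)
    # Step 103: match parameter, assumed to be matched length / text length.
    match_para = matched_len / len(text) if text else 0.0
    # Step 104: above the threshold, the text is exempt from further checks.
    if match_para > threshold:
        return "non-sensitive"
    return "needs further check"


result = classify_embodiment_one(
    "mineral water 500ml",
    blacklist={"banned-word"},
    whitelist={"mineral water", "500ml"},
    threshold=0.8,
)
```

In this example 18 of the 19 characters are covered by white list entries, so the text is treated as non-sensitive without consulting any larger sensitive word database.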
In conclusion, the embodiment of the disclosure provides a sensitive text determination method, the method comprising: determining whether at least one character in a target text belongs to a preset blacklist; when no character belongs to the preset blacklist, matching the target text against a preset white list and counting the total length of the matched characters; determining a match parameter between the target text and the white list according to the total length of the matched characters and the length of the target text; and, when the match parameter is greater than a preset match parameter threshold, determining that the target text is non-sensitive text. Because the match parameter is calculated from the matched length and the text length, and whether the text is sensitive is decided from the match parameter, the recognition accuracy for sensitive text is improved. Because the white list and blacklist, which hold less data, are consulted first to decide whether the text is sensitive, recognition speed is also effectively improved.
Embodiment two
Referring to Fig. 2, a flow chart of the specific steps of a sensitive text determination method provided by Embodiment two of the disclosure is shown.
Step 201: determine whether at least one character in the target text belongs to a preset blacklist.
For this step, refer to the detailed description of step 101; it is not repeated here.
Step 202: when no character belongs to the preset blacklist and the main information is not in the main body white list, the total length of the matched characters is 0.
The main information is the key information that distinguishes different target texts. For example, for a target text about a commodity, the commodity identifier or name is the main information, so the main body white list stores the commodity identifiers or names that are exempt from sensitivity identification.
It can be understood that when the main information is not in the white list, the target text fails to match the white list.
Optionally, in another embodiment of the disclosure, the main information includes a name, and the main body white list includes a name white list; the related information includes a brand and a specification, and the association white list includes a brand white list and a specification white list.
Specifically, the white list can also be divided differently according to the application scenario. For commodities, for example, it can be divided into a commodity white list, a brand white list, and a specification white list. The commodity white list identifies commodities that do not need sensitivity identification, for example mineral water, rice, and other commodities that cannot carry sensitive information. The brand white list identifies brands that do not need sensitivity identification, for example premium brands such as Chef Kang and Wang Wang, or other inspection-exempt brands approved by the state. The specification white list can be set in terms of weight, quantity, volume, and so on, for example 20 kilograms or less, 100 pieces or fewer, or 500 milliliters or less.
Step 203: when no character belongs to the preset blacklist and the main information is in the main body white list, match the related information against the association white list to obtain the successfully matched related information.
The related information is the other relevant information in the text apart from the main information. For a commodity, for example, the related information can be the brand, the specification, and so on. The association white list therefore includes a brand white list, which stores brands exempt from sensitivity identification, and a specification white list, which stores specifications exempt from sensitivity identification.
Specifically, each item of related information in the association white list is matched against the target text. If the target text contains the item, it is successfully matched related information; if the target text does not contain the item, it is related information that failed to match.
Step 204: calculate the sum of the lengths of the successfully matched related information and the main information to obtain the total length of the matched characters.
It can be understood that length is expressed as a number of characters.
Specifically, for a target text, the total length of the matched characters, MatchLen, can be obtained according to the following formula:
$$\mathrm{MatchLen} = \mathrm{MLen} + \sum_{i=1}^{M} \mathrm{Len1}_i \tag{1}$$
where MLen is the length of the successfully matched main information, M is the number of successfully matched items of related information, and Len1_i is the length of the i-th successfully matched item of related information.
It can be understood that when several texts of a commodity's target text are matched at the same time, the main information length can also be the sum of several main information lengths.
Step 205: calculate the ratio of the total length of the matched characters to the length of the target text to obtain the match parameter.
Specifically, the match parameter MatchPara can be calculated according to the following formula:
$$\mathrm{MatchPara} = \frac{\mathrm{MatchLen}}{L} \tag{2}$$
where L is the length of the target text, which can be expressed as a number of characters.
It can be understood that, in practical applications, this ratio can also be transformed before being used as the match parameter, so that the value range of the match parameter can be adjusted flexibly.
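Steps 202 to 205 can be sketched as follows. The sketch assumes that the main information has already been extracted from the target text by some upstream step, which the disclosure does not detail; the function names and the sample commodity data are hypothetical.

```python
def matched_length(text, main_info, main_whitelist, assoc_whitelist):
    """Total matched character length (steps 202 to 204)."""
    # Step 202: main information outside the main body white list -> 0.
    if main_info not in main_whitelist:
        return 0
    # Step 203: association white list entries contained in the text.
    matched_assoc = [item for item in assoc_whitelist if item in text]
    # Step 204: main information length plus matched related information.
    return len(main_info) + sum(len(item) for item in matched_assoc)


def match_parameter(text, main_info, main_whitelist, assoc_whitelist):
    """Step 205: ratio of the matched length to the text length."""
    if not text:
        return 0.0
    total = matched_length(text, main_info, main_whitelist, assoc_whitelist)
    return total / len(text)


MAIN_WHITELIST = {"rice", "mineral water"}   # hypothetical name white list
ASSOC_WHITELIST = {"Acme", "10kg", "500ml"}  # hypothetical brand/spec entries

para = match_parameter("Acme rice 10kg", "rice", MAIN_WHITELIST, ASSOC_WHITELIST)
```

Here the main information contributes 4 characters and the two matched related items contribute 8, out of a 14-character text, so the match parameter is 12/14.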
Step 206: when the match parameter is greater than the preset match parameter threshold, determine that the target text is non-sensitive text.
For this step, refer to the detailed description of step 104; it is not repeated here.
Step 207: when the match parameter is less than or equal to the preset match parameter threshold, segment the pinyin characters in the target text using a pre-generated pinyin probability matrix.
The match parameter threshold is used to judge the degree of matching between the target text and the white list. It can be set according to the practical application scenario; the embodiment of the invention places no restriction on its value.
The pinyin probability matrix expresses the probability that two syllables are spliced together in practice. For example, the probability that "h" and "u" are spliced into "hu" is 0.7, and the probability that "h" and "ang" are spliced into "hang" is 0.8. Segmentation can therefore follow the splice with the highest probability, and "hang" is selected as the segmentation result.
Optionally, in another embodiment of the disclosure, step 207 includes sub-steps 2071 to 2073:
Sub-step 2071: segment the pinyin characters in the target text to obtain segmentation groups, where each segmentation group includes syllable groups formed by splicing at least one syllable.
Specifically, a syllable splicing table can be used to determine the possible splicing results. For example, for "huanghe", the possible segmentation results are "hu ang he" and "huang he", whereas "hu an ghe" and "h ua ng he" are impossible segmentation results.
For "huang he", the segmentation consists of the two syllable groups "huang" and "he"; for "hu ang he", it consists of the three syllable groups "hu", "ang", and "he".
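Enumerating the possible segmentation groups of sub-step 2071 can be sketched with a recursive split against a syllable table. The table below is a deliberately tiny, hypothetical stand-in for a full syllable splicing table, and the function name is chosen for this example.

```python
# Hypothetical, tiny syllable table; a real table would list every valid syllable.
SYLLABLES = {"hu", "ang", "he", "huang"}


def segmentations(pinyin):
    """Return every way to split `pinyin` into syllables from the table."""
    if not pinyin:
        return [[]]  # one way to segment the empty string: no syllables
    results = []
    for end in range(1, len(pinyin) + 1):
        head = pinyin[:end]
        if head in SYLLABLES:
            for rest in segmentations(pinyin[end:]):
                results.append([head] + rest)
    return results


candidates = segmentations("huanghe")
```

Splits such as "hu an ghe" never appear because "ghe" is not in the table, which mirrors how the syllable splicing table rules out impossible results.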
Sub-step 2072: for each segmentation group, determine the segmentation accuracy of the group using the pre-generated pinyin probability matrix.
Specifically, for each syllable group in each segmentation group, the probability of splicing the first syllable with the second syllable is first looked up in the pinyin probability matrix, and the splice probabilities of the second with the third, and so on, are obtained in the same way. The splice probabilities of all adjacent syllables are multiplied to give the probability of the syllable group. Finally, the probabilities of all syllable groups are multiplied to give the segmentation accuracy of the segmentation group.
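The accuracy calculation of sub-step 2072 can be sketched as a product of splice probabilities. All probability values below are invented for illustration, and treating a syllable group that consists of a single unit as having probability 1.0 is an assumption of this sketch, since the text does not say how such groups are scored.

```python
from math import prod

# Hypothetical splice probabilities between adjacent units.
SPLICE_PROB = {
    ("h", "uang"): 0.8,
    ("h", "u"): 0.7,
    ("h", "e"): 0.9,
}


def group_probability(units):
    """Product of the splice probabilities of adjacent units.
    A group with a single unit has no splice, so the empty product 1.0
    is used (an assumption). Unknown splices score 0.0."""
    return prod(SPLICE_PROB.get(pair, 0.0) for pair in zip(units, units[1:]))


def segmentation_accuracy(groups):
    """Product of the probabilities of all syllable groups."""
    return prod(group_probability(units) for units in groups)


def best_segmentation(candidates):
    """Pick the segmentation group with the highest accuracy."""
    return max(candidates, key=segmentation_accuracy)


huang_he = [["h", "uang"], ["h", "e"]]
hu_ang_he = [["h", "u"], ["ang"], ["h", "e"]]
chosen = best_segmentation([hu_ang_he, huang_he])
```

With these illustrative values, "huang he" scores 0.8 × 0.9 and "hu ang he" scores 0.7 × 1.0 × 0.9, so "huang he" is chosen.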
Sub-step 2073: replace the pinyin characters with the segmentation group that has the highest segmentation accuracy.
In practical applications, the character groups corresponding to different segmentations, or several character groups with relatively high segmentation accuracy, can also be added to the pinyin characters. If any one of these character groups matches successfully, this counts as one successful match; if none of them matches, the match fails.
Step 208: perform generalization processing on the target text, the generalization processing including character merging, character splitting, character order extension, and character conversion.
In the embodiment of the invention, to improve the accuracy of sensitivity identification for the target text, generalization processing is performed on the target text before sensitivity identification in order to eliminate inaccurate information in it. The length of the target text after generalization processing may differ from its length before. The processing unifies the target text into pinyin form and corrects non-standard characters that may be present in the target text.
Before generalization processing, the stop words and invalid characters in the target text can be filtered out.
Stop words are unimportant characters or words that are filtered out automatically before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency. Modal particles are one example.
Invalid characters are other characters, apart from stop words, that do not affect the original meaning of the text. Invalid characters can differ from scenario to scenario.
In practical applications, a stop word dictionary can be used to identify the stop words in the target text. Different invalid character libraries can likewise be set up for different application scenarios.
In the embodiment of the invention, filtering out stop words and invalid characters reduces their interference with text matching and can effectively improve the efficiency and accuracy of text matching.
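The filtering described above can be sketched as two passes over the text. The stop word set and the invalid character set below are small hypothetical examples; in practice they would come from a stop word dictionary and a scenario-specific invalid character library.

```python
def filter_text(text, stop_words, invalid_chars):
    """Remove stop words, then drop invalid characters."""
    # Remove longer stop words first so that a short stop word does not
    # break up a longer one before it can be removed.
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    return "".join(ch for ch in text if ch not in invalid_chars)


STOP_WORDS = {"的", "啊"}          # hypothetical stop word dictionary
INVALID_CHARS = {"!", "*", " "}   # hypothetical invalid character library

cleaned = filter_text("新鲜的 大米!!", STOP_WORDS, INVALID_CHARS)
```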
Specifically, the character merging step includes: merging adjacent Chinese characters in the target text using a character-splitting dictionary, and, when the merge succeeds, adding the merged character to the target text.
The character-splitting dictionary records Chinese characters that can be split into two to three parts. Splitting is of two kinds: top-bottom and left-right. For example, "贾" can be split top-bottom into "西" and "贝", and "楼" can be split left-right into "木" and "娄".
Specifically, adjacent Chinese characters can be merged left-right or top-bottom, and it is then judged whether the merged Chinese character is in the character-splitting dictionary. If the merge succeeds, the merged character is added to the target text, for example directly after the characters it was merged from.
The character splitting step includes: splitting each Chinese character in the target text using the character-splitting dictionary, and, when the split succeeds, adding the characters obtained by splitting to the target text.
Specifically, it can be judged whether the Chinese character exists in the character-splitting dictionary; if it does, the characters obtained by splitting are added to the target text, for example directly after the character that was split.
In addition, the characters obtained by splitting or merging can be marked. Then, during matching, if the character before splitting or merging and the characters after splitting or merging both match successfully, this is counted as a single successful match, and the character before splitting or merging is taken as the successfully matched character.
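Character splitting and merging can be sketched with a small dictionary that holds the two entries given as examples in the text (贾 = 西 + 贝, 楼 = 木 + 娄). The sketch handles only two-part characters and does not mark the added characters; both simplifications, and the function names, are assumptions of this example.

```python
# Character-splitting dictionary limited to the two examples from the text.
SPLIT_DICT = {"贾": ("西", "贝"), "楼": ("木", "娄")}
# Reverse lookup used for merging adjacent characters.
MERGE_DICT = {parts: whole for whole, parts in SPLIT_DICT.items()}


def add_splits(text):
    """After each splittable character, append its component parts."""
    out = []
    for ch in text:
        out.append(ch)
        out.extend(SPLIT_DICT.get(ch, ()))
    return "".join(out)


def add_merges(text):
    """After each adjacent pair that forms a known character, append it."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i > 0 and (text[i - 1], ch) in MERGE_DICT:
            out.append(MERGE_DICT[(text[i - 1], ch)])
    return "".join(out)


split_result = add_splits("贾楼")
merge_result = add_merges("西贝")
```

Keeping the original characters and appending the variants means that a sensitive word written either in whole or in split form can still be found in the expanded text.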
The character order extension step includes: for adjacent Chinese characters in the target text, expanding the adjacent Chinese characters into groups of Chinese characters in different orders, and adding these groups to the target text.
Specifically, several Chinese characters are selected as a group and reordered. For example, "黄鹤楼" (Yellow Crane Tower) can be expanded into "黄楼鹤", "楼黄鹤", "楼鹤黄", "鹤黄楼", and "鹤楼黄".
In the embodiment of the invention, the number of Chinese characters to reorder can be chosen according to the practical application scenario; usually two to three adjacent Chinese characters are chosen.
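Character order extension amounts to generating the other orderings of a short run of adjacent characters. A minimal sketch using the standard library follows; the function name is chosen for this example.

```python
from itertools import permutations


def order_variants(chars):
    """Return all reorderings of `chars` other than the original order."""
    variants = {"".join(p) for p in permutations(chars)}
    variants.discard(chars)
    return sorted(variants)


variants = order_variants("黄鹤楼")
```

A group of n distinct characters yields n! − 1 variants, which is one practical reason to restrict the group to two or three characters.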
It can be understood that the order of the segmentation, Chinese character splitting and merging, and character order extension described above can be adjusted as appropriate, or only one or more of them can be implemented.
The character conversion step includes: Step A1, replacing the emoticons in the target text with the corresponding Chinese characters.
Emoticons may include figures such as a smiling face or a greeting.
In practical applications, each emoticon can be assigned its corresponding Chinese characters when it is defined, producing an emoticon library. Thus, when a user inputs Chinese characters, the corresponding emoticon can be suggested, or the corresponding Chinese characters can be looked up from an emoticon.
Step A2: replace the Chinese characters in the target text with the corresponding pinyin characters.
Specifically, the pinyin characters corresponding to a Chinese character can be looked up in a dictionary.
It can be understood that the Chinese characters here include both the Chinese characters originally in the target text and the Chinese characters obtained by conversion in step A1.
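Steps A1 and A2 can be sketched as two dictionary lookups applied in sequence. Both dictionaries below are tiny hypothetical stand-ins for an emoticon library and a Chinese-to-pinyin dictionary, and the sketch only handles emoticons that are a single code point.

```python
EMOJI_TO_HANZI = {"😀": "笑"}  # hypothetical emoticon library
HANZI_TO_PINYIN = {"笑": "xiao", "黄": "huang", "河": "he"}  # hypothetical dictionary


def to_pinyin(text):
    """Step A1: emoticon -> Chinese character; step A2: character -> pinyin.
    Characters missing from a dictionary are passed through unchanged."""
    text = "".join(EMOJI_TO_HANZI.get(ch, ch) for ch in text)
    return "".join(HANZI_TO_PINYIN.get(ch, ch) for ch in text)


converted = to_pinyin("黄河😀")
```

Running A1 before A2 ensures that the Chinese characters produced from emoticons are also converted to pinyin.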
Step 209: match the target text against a preset sensitive database to obtain the successfully matched sensitive words.
It can be understood that, in line with step 208, the sensitive information in the sensitive database is represented in pinyin characters.
Specifically, each pinyin entry in the sensitive database is matched against the target text. If the pinyin entry is present in the target text, the match succeeds, and the Chinese characters corresponding to the pinyin entry are taken as a successfully matched sensitive word. If the pinyin entry is not present in the target text, the match fails, the Chinese characters corresponding to the pinyin entry are not a successfully matched sensitive word, and matching continues with the other pinyin entries.
Step 210: determine the sensitive parameter of the target text according to the total length of the successfully matched sensitive words.
The sensitive parameter is related to the number and length of the successfully matched sensitive words. For example, the more sensitive words are matched and the longer they are, the larger the sensitive parameter; the fewer sensitive words are matched and the shorter they are, the smaller the sensitive parameter.
It can be understood that when the number of successfully matched sensitive words is 0, the sensitive parameter can be 0.
Optionally, in another embodiment of the invention, step 210 includes sub-steps 2101 to 2102:
Sub-step 2101: calculate the sum of the lengths of the successfully matched sensitive words to obtain the sensitive length.
Specifically, for a target text, the sensitive length SenLen can be obtained according to the following formula:
$$\mathrm{SenLen} = \sum_{j=1}^{N} \mathrm{Len2}_j \tag{3}$$
where N is the number of successfully matched sensitive words and Len2_j is the length of the j-th successfully matched sensitive word.
Sub-step 2102: calculate the ratio of the sensitive length to the length of the target text to obtain the sensitive parameter of the target text.
Specifically, the sensitive parameter SenPara can be calculated according to the following formula:
$$\mathrm{SenPara} = \frac{\mathrm{SenLen}}{L} \tag{4}$$
where L is the same as L in formula (2): the length of the target text before any processing, which can be expressed as a number of characters.
Step 211: when the sensitive parameter is greater than a preset sensitive parameter threshold, determine that the target text is sensitive text.
The sensitive parameter threshold is used to determine whether the target text is sensitive text, and can be set according to the practical application scenario.
It can be understood that when the sensitive parameter is greater than or equal to the sensitive parameter threshold, the target text is considered sensitive text; when the sensitive parameter is less than the sensitive parameter threshold, the target text is considered non-sensitive text.
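Steps 209 to 211 can be sketched as follows. The sketch assumes that the sensitive database maps each pinyin entry to its Chinese word, that a sensitive word's length is counted in Chinese characters, and that L is the length of the unprocessed text; the function names, the placeholder entry, and the threshold value are hypothetical.

```python
def matched_sensitive_words(processed_text, sensitive_db):
    """Step 209: Chinese words whose pinyin entry occurs in the text."""
    return [hanzi for pinyin, hanzi in sensitive_db.items()
            if pinyin in processed_text]


def sensitive_parameter(original_text, processed_text, sensitive_db):
    """Step 210: sensitive length divided by the unprocessed text length."""
    if not original_text:
        return 0.0
    words = matched_sensitive_words(processed_text, sensitive_db)
    sen_len = sum(len(word) for word in words)
    return sen_len / len(original_text)


def is_sensitive(original_text, processed_text, sensitive_db, threshold):
    """Step 211: sensitive when the parameter exceeds the threshold."""
    return sensitive_parameter(original_text, processed_text, sensitive_db) > threshold


SENSITIVE_DB = {"huaici": "坏词"}  # placeholder entry meaning "bad word"

flagged = is_sensitive("这是坏词商品", "zheshihuaicishangpin", SENSITIVE_DB, 0.3)
```

In this example one two-character sensitive word is found in a six-character text, giving a sensitive parameter of 2/6, which exceeds the illustrative threshold of 0.3.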
In conclusion, the embodiment of the present disclosure provides a sensitive text determining method, the method comprising: determining whether at least one character in the target text belongs to a preset blacklist; in the case where no character belongs to the preset blacklist and the main information is not in the main body white list, setting the matched character total length to 0; in the case where no character belongs to the preset blacklist and the main information is in the main body white list, matching the related information according to the association white list to obtain the successfully matched related information; calculating the sum of the lengths of the successfully matched related information and the main information to obtain the matched character total length; calculating the ratio of the matched character total length to the length of the target text to obtain the match parameter; in the case where the match parameter is greater than the preset match parameter threshold, determining that the target text is non-sensitive text; in the case where the match parameter is less than or equal to the preset match parameter threshold, segmenting the pinyin characters in the target text using a pre-generated pinyin probability matrix; performing generalization processing on the target text, the generalization processing including character merging, character splitting, character order extension, and character conversion; matching the target text using a preset sensitive database to obtain the successfully matched sensitive words; determining the sensitive parameter of the target text according to the total length of the successfully matched sensitive words; and in the case where the sensitive parameter is greater than the preset sensitive parameter threshold, determining that the target text is sensitive text. The match parameter can be calculated from the matching length and the text length, and whether the text is sensitive can be determined from the match parameter, which helps improve the recognition accuracy for sensitive text. The white list and blacklist, which hold less data, can be checked first, which helps effectively improve recognition speed. In addition, the target text can be segmented, Chinese characters can be split or merged, and the character order can be extended, with the final sensitivity check performed uniformly on pinyin characters, which helps further improve recognition accuracy.
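The order of the stages summarized above can be sketched as a single control-flow function. This is a sketch under stated assumptions rather than the disclosed implementation: each stage is passed in as a hypothetical callable, and treating a blacklist hit as immediately sensitive is an assumption, since the summary does not state the outcome of that branch.

```python
from typing import Callable


def determine_sensitive(
    text: str,
    has_blacklisted_char: Callable[[str], bool],
    match_parameter: Callable[[str], float],
    normalize: Callable[[str], str],
    sensitive_parameter: Callable[[str], float],
    match_threshold: float,
    sensitive_threshold: float,
) -> bool:
    """Return True when the text is judged sensitive, False otherwise."""
    # Stage 1: blacklist check (assumed here to mean "sensitive" on a hit).
    if has_blacklisted_char(text):
        return True
    # Stage 2: white list match; a high match parameter means non-sensitive.
    if match_parameter(text) > match_threshold:
        return False
    # Stage 3: pinyin segmentation and generalization, modeled as one callable.
    normalized = normalize(text)
    # Stage 4: sensitive database match and threshold comparison.
    return sensitive_parameter(normalized) > sensitive_threshold
```

The cheap list lookups run first, so the costlier normalization and sensitive database matching only run for texts that the white list could not clear.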
Embodiment three
Referring to Fig. 3, a structural diagram of a sensitive text determining device provided by embodiment three of the present disclosure is shown, as detailed below.
Blacklist matching module 301, configured to determine whether at least one character in the target text belongs to a preset blacklist.
White list matching module 302, configured to, in the case where no character belongs to the preset blacklist, match the target text according to a preset white list and count the matched character total length.
Match parameter determining module 303, configured to determine the match parameter of the target text and the white list according to the matched character total length and the length of the target text.
Sensitive determining module 304, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold.
In conclusion, the embodiment of the present disclosure provides a sensitive text determining device, the device comprising: a blacklist matching module, configured to determine whether at least one character in the target text belongs to a preset blacklist; a white list matching module, configured to, in the case where no character belongs to the preset blacklist, match the target text according to a preset white list and count the matched character total length; a match parameter determining module, configured to determine the match parameter of the target text and the white list according to the matched character total length and the length of the target text; and a sensitive determining module, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold. The match parameter can be calculated from the matching length and the text length, and whether the text is sensitive can be determined from the match parameter, which helps improve the recognition accuracy for sensitive text. The white list and blacklist, which hold less data, can be checked first, which helps effectively improve recognition speed.
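The four modules of this device can be pictured as methods of one class. The sketch below is illustrative only: the class name `SensitiveTextDevice`, the substring-based white list matching, and returning `None` when no decision is reached are assumptions not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SensitiveTextDevice:
    blacklist: set[str]       # single characters on the preset blacklist
    whitelist: list[str]      # entries of the preset white list
    match_threshold: float    # preset match parameter threshold

    def blacklist_hit(self, text: str) -> bool:
        """Blacklist matching module: does any character belong to the blacklist?"""
        return any(ch in self.blacklist for ch in text)

    def matched_length(self, text: str) -> int:
        """White list matching module: total length of entries found in the text.

        Assumes substring matching; overlapping entries are each counted.
        """
        return sum(len(entry) for entry in self.whitelist if entry and entry in text)

    def match_parameter(self, text: str) -> float:
        """Match parameter determining module: matched length / text length."""
        return self.matched_length(text) / len(text) if text else 0.0

    def is_non_sensitive(self, text: str) -> Optional[bool]:
        """Sensitive determining module: True if cleared, None if undecided."""
        if self.blacklist_hit(text):
            return None
        if self.match_parameter(text) > self.match_threshold:
            return True
        return None
```

For a device built with the white list `["apple", "juice"]` and a threshold of 0.8, the text `"apple juice"` has a match parameter of 10/11 and is cleared as non-sensitive, while `"apple pie now"` (5/13) is left undecided.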
Embodiment three is the device embodiment corresponding to method embodiment one. For a detailed description, refer to embodiment one; the details are not repeated here.
Embodiment four
Referring to Fig. 4, a structural diagram of a sensitive text determining device provided by embodiment four of the present disclosure is shown, as detailed below.
Blacklist matching module 401, configured to determine whether at least one character in the target text belongs to a preset blacklist.
White list matching module 402, configured to, in the case where no character belongs to the preset blacklist, match the target text according to a preset white list and count the matched character total length. Optionally, in an embodiment of the present disclosure, the white list matching module 402 includes:
Matching failure submodule 4021, configured to set the matched character total length to 0 in the case where the main information is not in the main body white list.
Related information matching submodule 4022, configured to, in the case where the main information is in the main body white list, match the related information according to the association white list to obtain the successfully matched related information.
Matching length calculation submodule 4023, configured to calculate the sum of the lengths of the successfully matched related information and the main information to obtain the matched character total length.
Match parameter determining module 403, configured to determine the match parameter of the target text and the white list according to the matched character total length and the length of the target text. Optionally, in an embodiment of the present disclosure, the match parameter determining module 403 includes:
Match parameter calculation submodule 4031, configured to calculate the ratio of the matched character total length to the length of the target text to obtain the match parameter.
Sensitive determining module 404, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold.
Word segmentation module 405, configured to segment the pinyin characters in the target text using a pre-generated pinyin probability matrix in the case where the match parameter is less than or equal to the preset match parameter threshold.
Generalization processing module 406, configured to perform generalization processing on the target text, the generalization processing including character merging, character splitting, character order extension, and character conversion.
Sensitive word matching module 407, configured to match the target text using a preset sensitive database to obtain the successfully matched sensitive words.
Sensitive parameter determining module 408, configured to determine the sensitive parameter of the target text according to the total length of the successfully matched sensitive words.
Second sensitive determining module 409, configured to determine that the target text is sensitive text in the case where the sensitive parameter is greater than the preset sensitive parameter threshold.
Optionally, in another embodiment of the present disclosure, the main information includes a title, the main body white list includes a title white list, the related information includes a brand and a specification, and the association white list includes a brand white list and a specification white list.
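The title, brand, and specification matching can be sketched as follows. The sketch assumes the target text has already been split into named fields and that a field matches only when it appears verbatim in the corresponding white list; the function name `matched_total_length` and the dictionary layout are hypothetical.

```python
def matched_total_length(
    title: str,
    related: dict[str, str],
    title_whitelist: set[str],
    association_whitelists: dict[str, set[str]],
) -> int:
    """Return the matched character total length for one target text.

    related maps a field name (e.g. "brand", "specification") to its value;
    association_whitelists maps the same field names to their white lists.
    """
    # Main information not in the main body white list: matched length is 0.
    if title not in title_whitelist:
        return 0
    # Keep only the related fields found in their own association white list.
    matched = [
        value
        for field, value in related.items()
        if value in association_whitelists.get(field, set())
    ]
    # Sum of the lengths of the matched related information and the title.
    return len(title) + sum(len(value) for value in matched)
```

For the title `"milk"` with brand `"Acme"` and specification `"500ml"`, where only the title and brand are white-listed, the matched length is 4 + 4 = 8. Dividing by the length of the target text then gives the match parameter.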
Optionally, in another embodiment of the present disclosure, the word segmentation module 405 includes:
Segmentation group generation submodule, configured to segment the pinyin characters in the target text to obtain segmentation groups, each segmentation group comprising a syllable group formed by splicing at least one syllable.
Segmentation accuracy determining submodule, configured to determine, for each segmentation group, the segmentation accuracy of that group using the pre-generated pinyin probability matrix.
Segmentation submodule, configured to replace the pinyin characters with the segmentation group having the highest segmentation accuracy.
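The three submodules can be illustrated with a small sketch. The form of the pinyin probability matrix is not specified here, so the sketch assumes it is a nested mapping giving the probability that one syllable follows another, and that the segmentation accuracy of a group is the product of those pairwise probabilities. The syllable set, the default probability, and all function names are illustrative assumptions.

```python
def segmentation_groups(pinyin: str, syllables: set[str]) -> list[list[str]]:
    """Enumerate every way to split the string into known syllables."""
    if not pinyin:
        return [[]]
    groups = []
    for end in range(1, len(pinyin) + 1):
        head = pinyin[:end]
        if head in syllables:
            for rest in segmentation_groups(pinyin[end:], syllables):
                groups.append([head] + rest)
    return groups


def segmentation_accuracy(group, matrix, default=1e-6):
    """Assumed score: product of P(next syllable | previous syllable)."""
    score = 1.0
    for prev, nxt in zip(group, group[1:]):
        score *= matrix.get(prev, {}).get(nxt, default)
    return score


def best_segmentation(pinyin, syllables, matrix):
    """Return the group with the highest accuracy, or None if none exists."""
    groups = segmentation_groups(pinyin, syllables)
    if not groups:
        return None
    return max(groups, key=lambda g: segmentation_accuracy(g, matrix))
```

With the syllables `{"fan", "gan", "fang", "an"}` and a matrix in which `"an"` follows `"fang"` with probability 0.3 and `"gan"` follows `"fan"` with probability 0.01, the string `"fangan"` is segmented as `["fang", "an"]`. The enumeration is exponential in the worst case and a one-syllable group always scores 1.0 under this simplified scoring, so a practical version would likely use dynamic programming and also weight individual syllables.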
In conclusion, the embodiment of the present disclosure provides a sensitive text determining device, the device comprising: a blacklist matching module, configured to determine whether at least one character in the target text belongs to a preset blacklist; a white list matching module, configured to, in the case where no character belongs to the preset blacklist, match the target text according to a preset white list and count the matched character total length, the white list matching module comprising: a matching failure submodule, configured to set the matched character total length to 0 in the case where the main information is not in the main body white list; a related information matching submodule, configured to, in the case where the main information is in the main body white list, match the related information according to the association white list to obtain the successfully matched related information; and a matching length calculation submodule, configured to calculate the sum of the lengths of the successfully matched related information and the main information to obtain the matched character total length; a match parameter determining module, configured to determine the match parameter of the target text and the white list according to the matched character total length and the length of the target text, the match parameter determining module comprising: a match parameter calculation submodule, configured to calculate the ratio of the matched character total length to the length of the target text to obtain the match parameter; a sensitive determining module, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold; a word segmentation module, configured to segment the pinyin characters in the target text using a pre-generated pinyin probability matrix in the case where the match parameter is less than or equal to the preset match parameter threshold; a generalization processing module, configured to perform generalization processing on the target text, the generalization processing including character merging, character splitting, character order extension, and character conversion; a sensitive word matching module, configured to match the target text using a preset sensitive database to obtain the successfully matched sensitive words; a sensitive parameter determining module, configured to determine the sensitive parameter of the target text according to the total length of the successfully matched sensitive words; and a second sensitive determining module, configured to determine that the target text is sensitive text in the case where the sensitive parameter is greater than the preset sensitive parameter threshold. The match parameter can be calculated from the matching length and the text length, and whether the text is sensitive can be determined from the match parameter, which helps improve the recognition accuracy for sensitive text. In addition, the target text can be segmented, Chinese characters can be split or merged, and the character order can be extended, with the final sensitivity check performed uniformly on pinyin characters, which helps further improve recognition accuracy.
Embodiment four is the device embodiment corresponding to method embodiment two. For a detailed description, refer to embodiment two; the details are not repeated here.
The embodiment of the present disclosure further provides an electronic device which, referring to Fig. 5, comprises: a processor 501, a memory 502, and a computer program 5021 stored on the memory 502 and runnable on the processor 501, wherein the processor 501 implements the foregoing sensitive text determining method when executing the program.
The embodiment of the present disclosure further provides a readable storage medium. When the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the foregoing sensitive text determining method.
Since the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the method embodiments.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the disclosure is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the content of the disclosure described herein, and that the description of a specific language above is given to disclose the best mode of the disclosure.
In the specification provided here, numerous specific details are set forth. It should be understood, however, that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the disclosure.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in the embodiments can be combined into one module, unit, or component, and can furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
The various component embodiments of the disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the sequencing display device according to the embodiments of the present disclosure. The disclosure may also be implemented as a device or device program for performing part or all of the method described herein. Such a program implementing the disclosure may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the disclosure, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words may be interpreted as names.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The foregoing are merely preferred embodiments of the disclosure and are not intended to limit the disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the disclosure shall fall within the protection scope of the disclosure.
The above are only specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed herein shall be covered by the protection scope of the disclosure. Therefore, the protection scope of the disclosure shall be subject to the protection scope of the claims.
Claims (11)
1. A sensitive text determining method, characterized in that the method comprises:
determining whether at least one character in a target text belongs to a preset blacklist;
in the case where no character belongs to the preset blacklist, matching the target text according to a preset white list, and counting the matched character total length;
determining the match parameter of the target text and the white list according to the matched character total length and the length of the target text;
in the case where the match parameter is greater than a preset match parameter threshold, determining that the target text is non-sensitive text.
2. The method according to claim 1, characterized in that the target text includes main information and related information, the white list includes a main body white list and an association white list, and the step of matching the target text according to the preset white list and counting the matched character total length comprises:
in the case where the main information is not in the main body white list, setting the matched character total length to 0;
in the case where the main information is in the main body white list, matching the related information according to the association white list to obtain the successfully matched related information;
calculating the sum of the lengths of the successfully matched related information and the main information to obtain the matched character total length.
3. The method according to claim 1, characterized in that the step of determining the match parameter of the target text and the white list according to the matched character total length and the length of the target text comprises:
calculating the ratio of the matched character total length to the length of the target text to obtain the match parameter.
4. The method according to claim 1, characterized in that the method further comprises:
in the case where the match parameter is less than or equal to the preset match parameter threshold, matching the target text using a preset sensitive database to obtain the successfully matched sensitive words;
determining the sensitive parameter of the target text according to the total length of the successfully matched sensitive words;
in the case where the sensitive parameter is greater than a preset sensitive parameter threshold, determining that the target text is sensitive text.
5. The method according to claim 1, characterized in that, before the step of matching the target text using the preset sensitive database to obtain the successfully matched sensitive words, the method further comprises:
segmenting the pinyin characters in the target text using a pre-generated pinyin probability matrix.
6. The method according to claim 5, characterized in that the step of segmenting the pinyin characters in the target text using the pre-generated pinyin probability matrix comprises:
segmenting the pinyin characters in the target text to obtain segmentation groups, each segmentation group comprising a syllable group formed by splicing at least one syllable;
for each segmentation group, determining the segmentation accuracy of the segmentation group using the pre-generated pinyin probability matrix;
replacing the pinyin characters with the segmentation group having the highest segmentation accuracy.
7. The method according to claim 1, characterized in that, before the step of matching the target text using the preset sensitive database to obtain the successfully matched sensitive words, the method further comprises:
performing generalization processing on the target text, the generalization processing including character merging, character splitting, character order extension, and character conversion.
8. The method according to claim 2, characterized in that the main information includes a title, the main body white list includes a title white list, the related information includes a brand and a specification, and the association white list includes a brand white list and a specification white list.
9. A sensitive text determining device, characterized in that the device comprises:
a blacklist matching module, configured to determine whether at least one character in a target text belongs to a preset blacklist;
a white list matching module, configured to, in the case where no character belongs to the preset blacklist, match the target text according to a preset white list and count the matched character total length;
a match parameter determining module, configured to determine the match parameter of the target text and the white list according to the matched character total length and the length of the target text;
a sensitive determining module, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than a preset match parameter threshold.
10. An electronic device, characterized by comprising:
a processor, a memory, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the sensitive text determining method according to one or more of claims 1 to 8.
11. A readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the sensitive text determining method according to one or more of method claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811290233.4A CN109657228B (en) | 2018-10-31 | 2018-10-31 | Sensitive text determining method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109657228A true CN109657228A (en) | 2019-04-19 |
CN109657228B CN109657228B (en) | 2023-06-06 |
Family ID: 66110662
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811290233.4A Active CN109657228B (en) | 2018-10-31 | 2018-10-31 | Sensitive text determining method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657228B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||