CN109657228A - Method and device for determining sensitive text - Google Patents


Publication number
CN109657228A
Authority
CN
China
Prior art keywords
character
text
sensitive
target text
white list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811290233.4A
Other languages
Chinese (zh)
Other versions
CN109657228B (en
Inventor
袁喆
张晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN201811290233.4A
Publication of CN109657228A
Application granted
Publication of CN109657228B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the present disclosure provide a method and device for determining sensitive text. The method comprises: determining whether at least one character in a target text belongs to a preset blacklist; if no character belongs to the preset blacklist, matching the target text against a preset white list and counting the total length of the matched characters; determining a matching parameter between the target text and the white list from the total matched character length and the length of the target text; and, if the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text. Because the matching parameter is calculated from the matched length and the text length, and whether the text is sensitive is decided from that parameter, the method helps improve the recognition accuracy for sensitive text. Because the white list and blacklist, which hold comparatively little data, are checked first, it also helps improve recognition speed.

Description

Method and device for determining sensitive text
Technical field
Embodiments of the present disclosure relate to the field of text matching technology, and in particular to a method and device for determining sensitive text.
Background
Online sales platforms for commodities make people's lives more convenient. To ensure the healthy development of a platform and reduce operating risk, sensitive information in commodity information needs to be identified and filtered.
In the prior art, compared with identifying sensitive information by full-text matching, improved sensitive-text determination schemes have better matching efficiency. They mainly identify sensitive information in text with a matching algorithm. For example, the KMP algorithm performs string matching: by repeatedly shifting a reference string, it judges whether the target string contains the reference string. When the reference string is identical to a segment of the target string, the target string is determined to contain the reference string and the match succeeds; when the reference string is identical to no segment of the target string, the target string is determined not to contain the reference string.
As can be seen, when deciding whether text is sensitive, the above scheme regards the text as sensitive whenever a match succeeds. The algorithm is simplistic, which leads to low accuracy; moreover, matching entries one by one leads to low recognition speed.
Summary
Embodiments of the present disclosure provide a method and device for determining sensitive text, which help improve the accuracy of determining sensitive text.
According to a first aspect of the embodiments of the present disclosure, a method for determining sensitive text is provided, the method comprising:
determining whether at least one character in a target text belongs to a preset blacklist;
if no character belongs to the preset blacklist, matching the target text against a preset white list and counting the total length of the matched characters;
determining a matching parameter between the target text and the white list from the total matched character length and the length of the target text;
if the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text.
According to a second aspect of the embodiments of the present disclosure, a device for determining sensitive text is provided, the device comprising:
a blacklist matching module, configured to determine whether at least one character in a target text belongs to a preset blacklist;
a white list matching module, configured to, if no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters;
a matching parameter determining module, configured to determine a matching parameter between the target text and the white list from the total matched character length and the length of the target text;
a sensitivity determining module, configured to determine that the target text is non-sensitive text if the matching parameter is greater than a preset matching parameter threshold.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, comprising:
a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the aforementioned method for determining sensitive text when executing the program.
According to a fourth aspect of the embodiments of the present disclosure, a readable storage medium is provided; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the aforementioned method for determining sensitive text.
Embodiments of the present disclosure provide a method and device for determining sensitive text. The method comprises: determining whether at least one character in a target text belongs to a preset blacklist; if no character belongs to the preset blacklist, matching the target text against a preset white list and counting the total length of the matched characters; determining a matching parameter between the target text and the white list from the total matched character length and the length of the target text; and, if the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text. Because the matching parameter is calculated from the matched length and the text length, and whether the text is sensitive is decided from that parameter, the method helps improve the recognition accuracy for sensitive text. Because the white list and blacklist, which hold comparatively little data, are checked first, it also helps improve recognition speed.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present disclosure; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the specific steps of a method for determining sensitive text provided by embodiment one of the present disclosure;
Fig. 2 is a flow chart of the specific steps of a method for determining sensitive text provided by embodiment two of the present disclosure;
Fig. 3 is a structure diagram of a device for determining sensitive text provided by embodiment three of the present disclosure;
Fig. 4 is a structure diagram of a device for determining sensitive text provided by embodiment four of the present disclosure;
Fig. 5 is a structure diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present disclosure.
Embodiment one
Referring to Fig. 1, a flow chart of the specific steps of a method for determining sensitive text provided by embodiment one of the present disclosure is shown.
Step 101: determine whether at least one character in the target text belongs to a preset blacklist.
Here, the blacklist stores seriously sensitive objects, for example words relating to terrorism, drugs, or violence.
Because blacklist entries are sensitive regardless of the business scenario and their impact is serious, they need to be filtered first.
Specifically, each word in the blacklist is matched against the target text. If at least one blacklist object is present in the target text, the match succeeds and the target text is confirmed as sensitive text; if no blacklist object is present, the match fails and the target text must be processed further to determine whether it is sensitive.
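The blacklist check of step 101 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `hits_blacklist` and the placeholder entries are hypothetical, and plain substring search stands in for whatever matching algorithm an implementation actually uses.

```python
def hits_blacklist(target_text: str, blacklist: set[str]) -> bool:
    """Return True if at least one blacklist entry occurs in the target text."""
    return any(word in target_text for word in blacklist)

# Hypothetical placeholder entries; a real blacklist would hold seriously sensitive words.
BLACKLIST = {"badword", "worseword"}

print(hits_blacklist("fresh rice 5kg", BLACKLIST))
print(hits_blacklist("cheap badword for sale", BLACKLIST))
```

A text that hits the blacklist is treated as sensitive immediately; only texts that pass this check go on to white list matching.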
Step 102: if no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters.
Here, the white list stores text information that is exempt from sensitivity checking. The white list can be set manually or mined from a standard database. For example, in a typical application scenario of the embodiment of the present invention, an online sales platform performs sensitivity identification on commodity text; commodities that the platform is certain contain no sensitive information, or that are uploaded by merchants with whom it frequently cooperates, can be added to a standard commodity library, and the white list can then be mined from that library.
It will be appreciated that the simplest white list matching method matches each object in the white list against the target text: if at least one white list object is present in the target text, the match succeeds; otherwise, the match fails.
Step 103: determine the matching parameter between the target text and the white list from the total matched character length and the length of the target text.
In the embodiment of the present invention, to further determine the degree of matching, a matching parameter is used to characterize it. For example, when the match fails, the matching parameter is 0; when the match succeeds, the more objects are matched and the longer they are, the larger the matching parameter, and the fewer objects are matched and the shorter they are, the smaller the matching parameter.
It should be noted that the embodiment of the present invention can be applied to various text matching scenarios and is not limited to sensitivity identification of commodity text.
Step 104: if the matching parameter is greater than a preset matching parameter threshold, determine that the target text is non-sensitive text.
Here, the matching parameter threshold is used to judge whether sensitivity identification needs to be carried out on the target text. It will be appreciated that the matching parameter threshold can be set according to the practical application scenario; the embodiment of the present invention places no restriction on it.
It follows that when the matching parameter is less than the matching parameter threshold, the target text is not regarded as exempt from sensitivity identification, and whether it is sensitive text must be judged further; when the matching parameter is greater than or equal to the matching parameter threshold, the target text is regarded as exempt from sensitivity identification and is directly treated as non-sensitive text.
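Steps 101 to 104 can be put together as a minimal Python sketch. The function name `classify_embodiment_one`, the return labels, and the example lists are assumptions made for illustration; the sketch uses the strict "greater than" comparison stated in step 104 and simple substring matching.

```python
def classify_embodiment_one(target_text, blacklist, whitelist, threshold):
    """Blacklist first, then white list coverage ratio against a threshold."""
    # Step 101: any blacklist hit makes the text sensitive.
    if any(word in target_text for word in blacklist):
        return "sensitive"
    # Step 102: total length (in characters) of the white list entries found.
    matched_len = sum(len(word) for word in whitelist if word in target_text)
    # Step 103: matching parameter = matched length / text length.
    match_para = matched_len / len(target_text) if target_text else 0.0
    # Step 104: above the threshold, the text is exempt from further checks.
    if match_para > threshold:
        return "non-sensitive"
    return "needs further check"

# Hypothetical lists for illustration only.
WHITELIST = {"mineral water", "500ml"}
print(classify_embodiment_one("mineral water 500ml", {"bad"}, WHITELIST, 0.5))
print(classify_embodiment_one("unknown gadget", {"bad"}, WHITELIST, 0.5))
```

Note that overlapping white list entries would be counted more than once in this sketch, so the ratio could exceed 1; a real implementation would need to decide how to treat overlaps.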
In a typical application scenario of the embodiment of the present invention, when the online sales platform determines that the target text of a commodity is sensitive text, the commodity fails review and the merchant is prompted to modify the commodity information; when it determines that the target text of a commodity is not sensitive text, the commodity passes review and is allowed onto the platform.
In summary, the embodiment of the present disclosure provides a method for determining sensitive text. The method comprises: determining whether at least one character in a target text belongs to a preset blacklist; if no character belongs to the preset blacklist, matching the target text against a preset white list and counting the total length of the matched characters; determining a matching parameter between the target text and the white list from the total matched character length and the length of the target text; and, if the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text. Because the matching parameter is calculated from the matched length and the text length, and whether the text is sensitive is decided from that parameter, the method helps improve the recognition accuracy for sensitive text. Because the white list and blacklist, which hold comparatively little data, are checked first, it also helps improve recognition speed.
Embodiment two
Referring to Fig. 2, a flow chart of the specific steps of a method for determining sensitive text provided by embodiment two of the present disclosure is shown.
Step 201: determine whether at least one character in the target text belongs to a preset blacklist.
For this step, refer to the detailed description of step 101; details are not repeated here.
Step 202: if no character belongs to the preset blacklist and the main information is not in the main white list, the total matched character length is 0.
Here, the main information is the key information that distinguishes different target texts. For example, for commodity-based target text, the commodity identifier or name is the main information, so the main white list stores the identifiers or names of commodities that are exempt from sensitivity identification.
It will be appreciated that when the main information is not in the main white list, the target text fails to match the white list.
Optionally, in another embodiment of the present disclosure, the main information includes a name, the main white list includes a name white list, the associated information includes a brand and a specification, and the association white list includes a brand white list and a specification white list.
Specifically, the white list can also be divided differently according to the application scenario. For commodities, for example, it can be divided into a commodity white list, a brand white list, and a specification white list. The commodity white list identifies commodities that need no sensitivity identification, for example commodities such as mineral water and rice that cannot carry sensitive information. The brand white list identifies brands that need no sensitivity identification, for example premium brands such as Master Kong and Want Want, or other officially recognized inspection-exempt brands. The specification white list can be set in terms of weight, quantity, volume, and so on, for example 20 kilograms or less, 100 pieces or fewer, or 500 milliliters or less.
Step 203: if no character belongs to the preset blacklist and the main information is in the main white list, match the associated information against the association white list to obtain the successfully matched associated information.
Here, the associated information is the other relevant information in the text apart from the main information; for a commodity, for example, it can be the brand, the specification, and so on. The association white list accordingly includes a brand white list, which stores brands exempt from sensitivity identification, and a specification white list, which stores specifications exempt from sensitivity identification.
Specifically, each piece of associated information in the association white list is matched against the target text. If the target text contains the associated information, it is successfully matched associated information; if the target text does not contain it, it is associated information that failed to match.
Step 204: calculate the sum of the lengths of the successfully matched associated information and the main information to obtain the total matched character length.
It will be appreciated that length is expressed as a number of characters.
Specifically, for one target text, the total matched character length MatchLen can be obtained by the following formula:
$$\mathrm{MatchLen} = \mathrm{MLen} + \sum_{i=1}^{M} \mathrm{Len1}_i \tag{1}$$
where MLen is the length of the successfully matched main information, M is the number of successfully matched pieces of associated information, and Len1_i is the length of the i-th successfully matched piece of associated information.
It will be appreciated that when multiple texts of a commodity are matched simultaneously, the main information length can also be the sum of multiple main information lengths.
Step 205: calculate the ratio of the total matched character length to the length of the target text to obtain the matching parameter.
Specifically, the matching parameter MatchPara can be calculated by the following formula:
$$\mathrm{MatchPara} = \frac{\mathrm{MatchLen}}{L} \tag{2}$$
where L is the length of the target text, which can be expressed as a number of characters.
It will be appreciated that in practice this ratio can also be transformed and the result used as the matching parameter, so that the value range of the matching parameter can be adjusted flexibly.
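Steps 202 to 205 can be illustrated with a short Python sketch. The function name `match_parameter`, the way the main information is passed in, and the example lists are assumptions for illustration; how the main information is extracted from the target text is not specified here and is left out of the sketch.

```python
def match_parameter(target_text, main_info, main_whitelist, assoc_whitelist):
    """Ratio of matched main + associated information length to text length."""
    # Step 202: main information not white-listed, so the matched length is 0.
    if main_info not in main_whitelist or not target_text:
        return 0.0
    # Step 203: associated entries (brand, specification, ...) found in the text.
    matched_assoc = [a for a in assoc_whitelist if a in target_text]
    # Step 204: MatchLen = MLen + sum of Len1_i, all counted in characters.
    match_len = len(main_info) + sum(len(a) for a in matched_assoc)
    # Step 205: MatchPara = MatchLen / L.
    return match_len / len(target_text)

# Hypothetical lists: a name white list and a combined brand/specification list.
MAIN_WHITELIST = {"矿泉水"}
ASSOC_WHITELIST = {"康师傅", "旺旺", "500毫升"}

text = "康师傅矿泉水500毫升特价"  # 13 characters
print(match_parameter(text, "矿泉水", MAIN_WHITELIST, ASSOC_WHITELIST))
```

In this example the main information contributes 3 characters and the two matched associated entries contribute 3 and 5, giving 11 matched characters out of 13.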
Step 206: if the matching parameter is greater than the preset matching parameter threshold, determine that the target text is non-sensitive text.
For this step, refer to the detailed description of step 104; details are not repeated here.
Step 207: if the matching parameter is less than or equal to the preset matching parameter threshold, segment the pinyin characters in the target text using a pre-generated pinyin probability matrix.
Here, the matching parameter threshold is used to judge the degree of matching between the target text and the white list and can be set according to the practical application scenario; the embodiment of the present invention places no restriction on its value.
The pinyin probability matrix represents the probability of two syllables being spliced together in practice. For example, the probability that "h" and "u" are spliced into "hu" is 0.7, and the probability that "h" and "ang" are spliced into "hang" is 0.8. Segmentation then follows the splicing with the highest probability, so "hang" is selected as the segmentation result.
Optionally, in another embodiment of the present disclosure, step 207 includes sub-steps 2071 to 2073:
Sub-step 2071: segment the pinyin characters in the target text to obtain segmentation groups, each segmentation group comprising syllable groups formed by splicing at least one syllable.
Specifically, a syllable splicing table can be used to determine the possible splicing results. For example, for "huanghe", the possible segmentation results are "hu ang he" and "huang he", whereas "hu an ghe" and "h ua ng he" are impossible segmentation results.
For "huang he", there are two syllable groups, "huang" and "he"; for "hu ang he", there are three syllable groups, "hu", "ang", and "he".
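Enumerating the candidate segmentation groups of sub-step 2071 can be sketched as a small recursive search. The function name `segmentations` and the tiny syllable table are hypothetical; a real syllable splicing table would be far larger and would therefore yield more candidates.

```python
def segmentations(pinyin: str, syllables: set[str]) -> list[list[str]]:
    """Enumerate every way to split `pinyin` into entries of the syllable table."""
    if not pinyin:
        return [[]]  # one way to segment the empty string: no syllable groups
    results = []
    for end in range(1, len(pinyin) + 1):
        head = pinyin[:end]
        if head in syllables:
            # Keep this prefix and segment the remainder recursively.
            for tail in segmentations(pinyin[end:], syllables):
                results.append([head] + tail)
    return results

# Hypothetical, deliberately tiny table of allowed syllable groups.
SYLLABLES = {"hu", "ang", "he", "huang"}
print(segmentations("huanghe", SYLLABLES))
```

With this table, "huanghe" yields exactly the two candidates discussed in the text, while splits such as "hu an ghe" never appear because "ghe" is not in the table.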
Sub-step 2072: for each segmentation group, determine its segmentation accuracy using the pre-generated pinyin probability matrix.
Specifically, for each syllable group in each segmentation group, the splicing probability of the first and second syllables is looked up in the pinyin probability matrix, then that of the second and third, and so on; the splicing probabilities of all adjacent syllables are multiplied to give the probability of the syllable group. Finally, the probabilities of all syllable groups are multiplied to give the segmentation accuracy of the segmentation group.
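The scoring in sub-step 2072 can be sketched as a product of products. Everything here is an illustrative assumption: the dictionary `SPLICE_PROB` stands in for the pinyin probability matrix, only the values for ("h", "u") and ("h", "ang") come from the text, each syllable group is represented as a tuple of the parts being spliced, and a group with a single part is given probability 1.

```python
from math import prod

# Stand-in for the pinyin probability matrix; the last two values are invented.
SPLICE_PROB = {("h", "u"): 0.7, ("h", "ang"): 0.8, ("h", "uang"): 0.9, ("h", "e"): 0.9}

def group_probability(parts, splice_prob):
    """Multiply the splicing probabilities of adjacent parts in one syllable group."""
    # A single-part group has no adjacent pair; the empty product is 1 (assumption).
    return prod(splice_prob.get(pair, 0.0) for pair in zip(parts, parts[1:]))

def segmentation_accuracy(groups, splice_prob):
    """Multiply the probabilities of all syllable groups in one segmentation group."""
    return prod(group_probability(g, splice_prob) for g in groups)

candidates = [
    [("h", "u"), ("ang",), ("h", "e")],  # "hu ang he"
    [("h", "uang"), ("h", "e")],         # "huang he"
]
best = max(candidates, key=lambda c: segmentation_accuracy(c, SPLICE_PROB))
print(["".join(g) for g in best])
```

With these invented values, "huang he" scores about 0.81 against about 0.63 for "hu ang he", so it would be the segmentation chosen in sub-step 2073.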
Sub-step 2073: replace the pinyin characters with the segmentation group that has the highest segmentation accuracy.
In practice, the character groups corresponding to different segmentations, or several character groups with comparatively high segmentation accuracy, can also be added to the pinyin characters. If any one of these character groups matches successfully, this counts as one successful match; if all of them fail to match, the match fails.
Step 208: perform generalization processing on the target text, the generalization processing comprising: character merging, character splitting, character order expansion, and character conversion.
In the embodiment of the present invention, to improve the accuracy of sensitivity identification, generalization processing is performed on the target text before sensitivity identification in order to eliminate inaccurate information in the target text. The length of the target text after generalization may differ from its length before. Generalization brings the target text into a uniform pinyin format and can correct non-standard characters that may be present in the target text.
Before generalization processing, stop words and invalid characters in the target text can also be filtered out.
Here, stop words are words or characters that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency; modal particles are an example.
Invalid characters are other characters, apart from stop words, whose removal does not change the original meaning of the text. Invalid characters can differ between scenarios.
In practice, a stop-word dictionary can be used to identify stop words in the target text, and different invalid-character libraries can be set for different application scenarios.
In the embodiment of the present invention, filtering out stop words and invalid characters reduces their interference with text matching and can effectively improve the efficiency and accuracy of text matching.
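Filtering stop words and invalid characters before generalization can be sketched as follows. The function name `filter_text` and both example collections are hypothetical; a real system would load a stop-word dictionary and a scenario-specific invalid-character library.

```python
def filter_text(text: str, stop_words: set[str], invalid_chars: set[str]) -> str:
    """Remove stop words, then drop any remaining invalid characters."""
    # Remove longer stop words first so that shorter ones do not break them up.
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    return "".join(ch for ch in text if ch not in invalid_chars)

# Hypothetical collections for illustration only.
STOP_WORDS = {"的"}
INVALID_CHARS = {"!", "*"}

print(filter_text("正宗的矿泉水!!", STOP_WORDS, INVALID_CHARS))
```

Removing these characters shortens the text that later steps have to match, which is the efficiency gain described above.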
Specifically, the character merging step comprises: merging adjacent Chinese characters in the target text using a character-splitting dictionary and, if the merge succeeds, adding the merged character to the target text.
Here, the character-splitting dictionary records Chinese characters that can be split into two to three parts. Splitting is of two kinds: top-bottom and left-right. For example, "贾" (merchant) can be split top-bottom into "西" (west) and "贝" (shell), and "楼" (building) can be split left-right into "木" and "娄".
Specifically, adjacent Chinese characters can be merged left-right or top-bottom, and the merged character is checked against the character-splitting dictionary. If it is present, the merge succeeds and the merged character is added to the target text, for example immediately after the characters it was merged from.
The character splitting step comprises: splitting each Chinese character in the target text using the character-splitting dictionary and, if the split succeeds, adding the resulting characters to the target text.
Specifically, each Chinese character is checked against the character-splitting dictionary; if it is present, the characters produced by the split are added to the target text, for example immediately after the character that was split.
In addition, the characters produced by splitting or merging can be marked, so that during matching, if both the character before splitting or merging and the characters after it match successfully, this counts as a single successful match, with the character before splitting or merging taken as the matched character.
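Character splitting and merging can be sketched with a small dictionary. The names `SPLIT_DICT`, `MERGE_DICT`, `expand_splits`, and `expand_merges` are hypothetical, the dictionary holds only the two examples from the text, and the sketch handles two-part characters only, whereas a real character-splitting dictionary would also cover three-part characters.

```python
# Hypothetical character-splitting dictionary: whole character -> its parts.
SPLIT_DICT = {"贾": ("西", "贝"), "楼": ("木", "娄")}
# Reverse lookup used for merging: adjacent parts -> whole character.
MERGE_DICT = {parts: whole for whole, parts in SPLIT_DICT.items()}

def expand_splits(text: str) -> str:
    """After each splittable character, append the parts it splits into."""
    out = []
    for ch in text:
        out.append(ch)
        out.extend(SPLIT_DICT.get(ch, ()))
    return "".join(out)

def expand_merges(text: str) -> str:
    """After each mergeable adjacent pair, append the merged character."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i > 0:
            merged = MERGE_DICT.get((text[i - 1], ch))
            if merged:
                out.append(merged)
    return "".join(out)

print(expand_splits("黄鹤楼"))
print(expand_merges("西贝商店"))
```

Both functions keep the original characters and only add variants, matching the description that split or merged characters are added to the target text rather than substituted for the originals.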
The character order expansion step comprises: for adjacent Chinese characters in the target text, expanding the adjacent characters into groups of Chinese characters in different orders and adding them to the target text.
Specifically, several Chinese characters are selected as a group and reordered. For example, "黄鹤楼" (Yellow Crane Tower) can be expanded into "黄楼鹤", "楼黄鹤", "楼鹤黄", "鹤黄楼", and "鹤楼黄".
In the embodiment of the present invention, the number of Chinese characters to reorder can be chosen according to the practical application scenario; usually two to three adjacent Chinese characters are chosen.
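Order expansion for a small group of adjacent characters amounts to listing its permutations. The sketch below uses Python's standard `itertools.permutations`; the function name `order_variants` is hypothetical, and how the groups of adjacent characters are selected from the target text is left out.

```python
from itertools import permutations

def order_variants(chars: str) -> list[str]:
    """Return every reordering of `chars` other than the original order."""
    variants = ("".join(p) for p in permutations(chars))
    return [v for v in variants if v != chars]

# "黄鹤楼" (Yellow Crane Tower) has 3! = 6 orderings, 5 of them new.
print(order_variants("黄鹤楼"))
```

Because the number of orderings grows factorially with group size, keeping groups to two or three characters, as suggested above, keeps the expansion small.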
It will be appreciated that the order of the above segmentation, character splitting and merging, and character order expansion can be adjusted as appropriate, or only one or more of them may be implemented.
The character conversion step comprises: Step A1, replace the emoticons in the target text with the corresponding Chinese characters.
Here, emoticons may include figures such as a smiling face or a greeting.
In practice, each emoticon can be assigned its corresponding Chinese characters when it is defined, producing an emoticon library. A user entering Chinese characters can then be offered the associated emoticon, and conversely the corresponding Chinese characters can be looked up from an emoticon.
Step A2: replace the Chinese characters in the target text with the corresponding pinyin characters.
Specifically, the pinyin characters corresponding to a Chinese character can be looked up in a dictionary.
It will be appreciated that the Chinese characters here include both the Chinese characters originally in the target text and the Chinese characters produced by the conversion in step A1.
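The two conversion steps can be sketched with a pair of lookup tables. The names `EMOJI_TO_HANZI`, `HANZI_TO_PINYIN`, and `convert` are hypothetical, and the tiny tables stand in for the emoticon library and the pinyin dictionary. The sketch treats each emoticon as a single code point, which does not hold for every emoji.

```python
# Hypothetical stand-ins for the emoticon library and the pinyin dictionary.
EMOJI_TO_HANZI = {"😊": "笑"}
HANZI_TO_PINYIN = {"笑": "xiao", "黄": "huang", "河": "he"}

def convert(text: str) -> str:
    """Step A1: emoticon -> Chinese character; step A2: Chinese character -> pinyin."""
    # Step A1: characters without an entry are left unchanged.
    text = "".join(EMOJI_TO_HANZI.get(ch, ch) for ch in text)
    # Step A2 also converts the Chinese characters produced by step A1.
    return "".join(HANZI_TO_PINYIN.get(ch, ch) for ch in text)

print(convert("黄河😊"))
```

Running step A1 before step A2 is what lets the Chinese characters that replace emoticons end up in pinyin as well.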
Step 209: match the target text against a preset sensitive database to obtain the successfully matched sensitive words.
It will be appreciated that, in line with step 208, the sensitive information in the sensitive database is represented in pinyin characters.
Specifically, each pinyin string in the sensitive database is matched against the target text. If the pinyin string is present in the target text, the match succeeds and the Chinese characters corresponding to that pinyin string are taken as a successfully matched sensitive word; if the pinyin string is not present, the match fails, the corresponding Chinese characters are not a successfully matched sensitive word, and matching continues with the other pinyin strings.
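Matching against the sensitive database can be sketched as a lookup from pinyin strings to the Chinese words they stand for. The function name `match_sensitive`, the dictionary layout, and the placeholder entries are assumptions for illustration.

```python
def match_sensitive(target_pinyin: str, sensitive_db: dict[str, str]) -> list[str]:
    """Return the Chinese sensitive words whose pinyin occurs in the target text."""
    return [hanzi for pinyin, hanzi in sensitive_db.items() if pinyin in target_pinyin]

# Hypothetical database: pinyin string -> corresponding Chinese sensitive word.
SENSITIVE_DB = {"weijin": "违禁", "dupin": "毒品"}

# Target text after generalization, i.e. already converted to pinyin.
print(match_sensitive("zhengzongweijinpin", SENSITIVE_DB))
```

Plain substring search on unsegmented pinyin can match across syllable boundaries, which is one reason the segmentation of step 207 matters in practice.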
Step 210: determine the sensitive parameter of the target text from the total length of the successfully matched sensitive words.
Here, the sensitive parameter depends on the number and length of the successfully matched sensitive words: the more sensitive words are matched and the longer they are, the larger the sensitive parameter; the fewer are matched and the shorter they are, the smaller the sensitive parameter.
It will be appreciated that when the number of successfully matched sensitive words is 0, the sensitive parameter can be 0.
Optionally, in another embodiment of the invention, step 210 includes sub-steps 2101 to 2102:
Sub-step 2101: calculate the sum of the lengths of the successfully matched sensitive words to obtain the sensitive length.
Specifically, for one target text, the sensitive length SenLen can be obtained by the following formula:
$$\mathrm{SenLen} = \sum_{j=1}^{N} \mathrm{Len2}_j$$
where N is the number of successfully matched sensitive words and Len2_j is the length of the j-th successfully matched sensitive word.
Sub-step 2102: calculate the ratio of the sensitive length to the length of the target text to obtain the sensitive parameter of the target text.
Specifically, the sensitive parameter SenPara can be calculated by the following formula:
$$\mathrm{SenPara} = \frac{\mathrm{SenLen}}{L}$$
where L is the same as L in formula (2), namely the length of the target text before any processing, which can be expressed as a number of characters.
Step 211: if the sensitive parameter is greater than a preset sensitive parameter threshold, determine that the target text is sensitive text.
Here, the sensitive parameter threshold is used to determine whether the target text is sensitive text and can be set according to the practical application scenario.
It will be appreciated that when the sensitive parameter is greater than or equal to the sensitive parameter threshold, the target text is regarded as sensitive text; when the sensitive parameter is less than the sensitive parameter threshold, the target text is regarded as non-sensitive text.
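Steps 210 and 211 can be sketched as a ratio followed by a threshold test. The function names and the example values are hypothetical; the sketch measures sensitive word length in Chinese characters, which is an assumption, and uses the strict "greater than" comparison stated in step 211.

```python
def sensitive_parameter(matched_words: list[str], original_length: int) -> float:
    """SenPara = (sum of matched sensitive word lengths) / original text length."""
    sen_len = sum(len(word) for word in matched_words)  # SenLen
    return sen_len / original_length if original_length else 0.0

def is_sensitive(matched_words: list[str], original_length: int, threshold: float) -> bool:
    """Step 211: sensitive if the sensitive parameter exceeds the threshold."""
    return sensitive_parameter(matched_words, original_length) > threshold

# One two-character sensitive word in a ten-character original text.
print(sensitive_parameter(["违禁"], 10))
print(is_sensitive(["违禁"], 10, 0.1))
```

Dividing by the length of the unprocessed text keeps the parameter comparable across texts even though generalization may change the processed length.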
In summary, the embodiment of the present disclosure provides a method for determining sensitive text. The method comprises: determining whether at least one character in a target text belongs to a preset blacklist; if no character belongs to the preset blacklist and the main information is not in the main white list, taking the total matched character length as 0; if no character belongs to the preset blacklist and the main information is in the main white list, matching the associated information against the association white list to obtain the successfully matched associated information; calculating the sum of the lengths of the successfully matched associated information and the main information to obtain the total matched character length; calculating the ratio of the total matched character length to the length of the target text to obtain the matching parameter; if the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text; if the matching parameter is less than or equal to the preset matching parameter threshold, segmenting the pinyin characters in the target text using a pre-generated pinyin probability matrix; performing generalization processing on the target text, the generalization processing comprising character merging, character splitting, character order expansion, and character conversion; matching the target text against a preset sensitive database to obtain the successfully matched sensitive words; determining the sensitive parameter of the target text from the total length of the successfully matched sensitive words; and, if the sensitive parameter is greater than a preset sensitive parameter threshold, determining that the target text is sensitive text. Because the matching parameter is calculated from the matched length and the text length, and whether the text is sensitive is decided from that parameter, the method helps improve the recognition accuracy for sensitive text. Because the white list and blacklist, which hold comparatively little data, are checked first, it also helps improve recognition speed. In addition, the target text can be segmented, its Chinese characters split and merged, and its character order expanded, with sensitivity finally confirmed uniformly on pinyin characters, which helps further improve recognition accuracy.
Embodiment three
Referring to Fig. 3, a structural diagram of a sensitive text determining device provided by embodiment three of the present disclosure is shown, specifically as follows.
Blacklist matching module 301, configured to determine whether at least one character in the target text belongs to a preset blacklist.
White list matching module 302, configured to, in the case where no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters.
Match parameter determining module 303, configured to determine the match parameter between the target text and the white list according to the matched character total length and the length of the target text.
Sensitive determining module 304, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold.
In summary, the embodiment of the present disclosure provides a sensitive text determining device, the device comprising: a blacklist matching module, configured to determine whether at least one character in the target text belongs to a preset blacklist; a white list matching module, configured to, in the case where no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters; a match parameter determining module, configured to determine the match parameter between the target text and the white list according to the matched character total length and the length of the target text; and a sensitive determining module, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold. The match parameter can be calculated from the matching length and the text length, and whether the text is sensitive text is determined according to the match parameter, which helps improve the recognition accuracy of sensitive text; the white list and blacklist, which contain relatively little data, can also be used to make a first determination of whether the text is sensitive, which helps effectively improve recognition speed.
Embodiment three is the device embodiment corresponding to method embodiment one; for a detailed description, refer to embodiment one, which is not repeated here.
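The four modules of embodiment three can be pictured as methods of a single class. The sketch below is illustrative only: the class name, the substring-coverage way of counting matched characters, the default threshold value, and the "undetermined" result for texts that are not cleared are all assumptions, since this passage only states when a text is determined to be non-sensitive.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SensitiveTextDevice:
    blacklist: Set[str]            # preset blacklist of single characters
    whitelist: List[str]           # preset white list entries
    match_threshold: float = 0.8   # assumed value, not from the source

    def has_blacklisted_char(self, text: str) -> bool:
        # Role of blacklist matching module 301.
        return any(ch in self.blacklist for ch in text)

    def matched_length(self, text: str) -> int:
        # Role of white list matching module 302: count the characters
        # covered by at least one white list entry (assumed counting rule).
        covered = [False] * len(text)
        for entry in self.whitelist:
            if not entry:
                continue
            start = text.find(entry)
            while start != -1:
                for i in range(start, start + len(entry)):
                    covered[i] = True
                start = text.find(entry, start + 1)
        return sum(covered)

    def match_parameter(self, text: str) -> float:
        # Role of match parameter determining module 303.
        return self.matched_length(text) / len(text) if text else 0.0

    def classify(self, text: str) -> str:
        # Role of sensitive determining module 304.
        if self.has_blacklisted_char(text):
            return "undetermined"
        if self.match_parameter(text) > self.match_threshold:
            return "non-sensitive"
        return "undetermined"
```

For example, with a white list of `["apple", "juice"]` the text `"apple juice"` has 10 of its 11 characters covered, so its match parameter exceeds the assumed threshold of 0.8.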
Embodiment four
Referring to Fig. 4, a structural diagram of a sensitive text determining device provided by embodiment four of the present disclosure is shown, specifically as follows.
Blacklist matching module 401, configured to determine whether at least one character in the target text belongs to a preset blacklist.
White list matching module 402, configured to, in the case where no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters. Optionally, in an embodiment of the present disclosure, the white list matching module 402 comprises:
Matching failure submodule 4021, configured so that, in the case where the main information is not in the main-body white list, the matched character total length is 0.
Related information matching submodule 4022, configured to, in the case where the main information is in the main-body white list, match the related information against the association white list to obtain the successfully matched related information.
Matching length calculation submodule 4023, configured to calculate the sum of the lengths of the successfully matched related information and the main information to obtain the matched character total length.
Match parameter determining module 403, configured to determine the match parameter between the target text and the white list according to the matched character total length and the length of the target text. Optionally, in an embodiment of the present disclosure, the match parameter determining module 403 comprises:
Match parameter calculation submodule 4031, configured to calculate the ratio of the matched character total length to the length of the target text to obtain the match parameter.
Sensitive determining module 404, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold.
Word segmentation module 405, configured to segment the pinyin characters in the target text using a pre-generated pinyin probability matrix in the case where the match parameter is less than or equal to the preset match parameter threshold.
Generalization processing module 406, configured to perform generalization processing on the target text, the generalization processing including character merging, character splitting, character order expansion, and character conversion.
Sensitive word matching module 407, configured to match the target text against a preset sensitive database to obtain the successfully matched sensitive words.
Sensitive parameter determining module 408, configured to determine the sensitive parameter of the target text according to the total length of the successfully matched sensitive words.
Second sensitive determining module 409, configured to determine that the target text is sensitive text in the case where the sensitive parameter is greater than the preset sensitive parameter threshold.
Optionally, in another embodiment of the present disclosure, the main information includes a title, the main-body white list includes a title white list, the related information includes a brand and a specification, and the association white list includes a brand white list and a specification white list.
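As an illustration of how the title, brand, and specification white lists could feed the match parameter, here is a minimal sketch under stated assumptions. The `TargetText` class, exact-match set lookups, and treating the text length as the sum of the three field lengths are hypothetical choices for this example, not details confirmed by the disclosure.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class TargetText:
    title: str   # main information
    brand: str   # related information
    spec: str    # related information

    def length(self) -> int:
        # Assumption: the text length is the sum of the field lengths.
        return len(self.title) + len(self.brand) + len(self.spec)


def matched_total_length(t: TargetText, title_wl: Set[str],
                         brand_wl: Set[str], spec_wl: Set[str]) -> int:
    # Title not in the title white list: matched length is 0.
    if t.title not in title_wl:
        return 0
    matched = len(t.title)
    # Add the length of each related field found in its white list.
    if t.brand in brand_wl:
        matched += len(t.brand)
    if t.spec in spec_wl:
        matched += len(t.spec)
    return matched


def match_parameter(t: TargetText, title_wl: Set[str],
                    brand_wl: Set[str], spec_wl: Set[str]) -> float:
    # Ratio of matched character total length to text length.
    total = t.length()
    if total == 0:
        return 0.0
    return matched_total_length(t, title_wl, brand_wl, spec_wl) / total
```

A text whose title is white-listed but whose specification is not would therefore score below 1, and the preset threshold decides whether it is still cleared as non-sensitive.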
Optionally, in another embodiment of the present disclosure, the word segmentation module 405 comprises:
Segmentation group generation submodule, configured to segment the pinyin characters in the target text to obtain segmentation groups, each segmentation group comprising a syllable group formed by concatenating at least one syllable.
Segmentation accuracy determination submodule, configured to determine, for each segmentation group, the segmentation accuracy of that group using the pre-generated pinyin probability matrix.
Segmentation submodule, configured to replace the pinyin characters with the segmentation group having the highest segmentation accuracy.
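The three segmentation steps above can be sketched as follows. The details are assumptions for illustration: the syllable inventory is a small hypothetical set, the pinyin probability matrix is modeled as a dictionary of transition probabilities between consecutive syllables with a hypothetical start token `<s>`, and segmentation accuracy is taken to be the product of those probabilities. The disclosure does not specify the matrix layout in this passage.

```python
from typing import Dict, List, Optional, Set, Tuple

START = "<s>"  # hypothetical start-of-string token


def segmentations(s: str, syllables: Set[str]) -> List[List[str]]:
    """Enumerate every way to split s into known syllables."""
    if not s:
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in syllables:
            for rest in segmentations(s[end:], syllables):
                results.append([head] + rest)
    return results


def accuracy(group: List[str], prob: Dict[Tuple[str, str], float],
             default: float = 1e-6) -> float:
    """Score a candidate as the product of its transition probabilities."""
    score = 1.0
    prev = START
    for syllable in group:
        score *= prob.get((prev, syllable), default)
        prev = syllable
    return score


def best_segmentation(s: str, syllables: Set[str],
                      prob: Dict[Tuple[str, str], float]) -> Optional[List[str]]:
    """Return the candidate with the highest score, or None if none exists."""
    groups = segmentations(s, syllables)
    if not groups:
        return None
    return max(groups, key=lambda g: accuracy(g, prob))
```

The exhaustive enumeration is exponential in the worst case; for long pinyin runs a dynamic-programming search over the same scores would be a natural substitute.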
In summary, the embodiment of the present disclosure provides a sensitive text determining device, the device comprising: a blacklist matching module, configured to determine whether at least one character in the target text belongs to a preset blacklist; a white list matching module, configured to, in the case where no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters, the white list matching module comprising: a matching failure submodule, configured so that, in the case where the main information is not in the main-body white list, the matched character total length is 0; a related information matching submodule, configured to, in the case where the main information is in the main-body white list, match the related information against the association white list to obtain the successfully matched related information; and a matching length calculation submodule, configured to calculate the sum of the lengths of the successfully matched related information and the main information to obtain the matched character total length; a match parameter determining module, configured to determine the match parameter between the target text and the white list according to the matched character total length and the length of the target text, the match parameter determining module comprising: a match parameter calculation submodule, configured to calculate the ratio of the matched character total length to the length of the target text to obtain the match parameter; a sensitive determining module, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than the preset match parameter threshold; a word segmentation module, configured to segment the pinyin characters in the target text using a pre-generated pinyin probability matrix in the case where the match parameter is less than or equal to the preset match parameter threshold; a generalization processing module, configured to perform generalization processing on the target text, the generalization processing including character merging, character splitting, character order expansion, and character conversion; a sensitive word matching module, configured to match the target text against a preset sensitive database to obtain the successfully matched sensitive words; a sensitive parameter determining module, configured to determine the sensitive parameter of the target text according to the total length of the successfully matched sensitive words; and a second sensitive determining module, configured to determine that the target text is sensitive text in the case where the sensitive parameter is greater than the preset sensitive parameter threshold. The match parameter can be calculated from the matching length and the text length, and whether the text is sensitive text is determined according to the match parameter, which helps improve the recognition accuracy of sensitive text. In addition, the target text can be segmented, Chinese characters can be split and merged, and the character order can be expanded, with pinyin characters finally used uniformly for the sensitivity confirmation, which helps further improve recognition accuracy.
Embodiment four is the device embodiment corresponding to method embodiment two; for a detailed description, refer to embodiment two, which is not repeated here.
The embodiment of the present disclosure further provides an electronic device which, referring to Fig. 5, comprises: a processor 501, a memory 502, and a computer program 5021 stored in the memory 502 and executable on the processor 501, wherein the processor 501 implements the aforementioned sensitive text determination method when executing the program.
The embodiment of the present disclosure further provides a readable storage medium; when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform the aforementioned sensitive text determination method.
Since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present disclosure is not directed to any particular programming language. It should be understood that the content of the present disclosure described herein may be implemented in various programming languages, and that the above description of a specific language is given to disclose the best mode of the present disclosure.
Numerous specific details are set forth in the specification provided here. It is to be understood, however, that embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present disclosure, the features of the disclosure are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present disclosure.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
The various component embodiments of the present disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the sequencing display device according to the embodiments of the present disclosure. The present disclosure may also be implemented as a device or device program for performing part or all of the method described herein. Such a program implementing the present disclosure may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present disclosure, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The above are merely preferred embodiments of the present disclosure and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.
The above are only specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto; any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed herein shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. A sensitive text determination method, characterized in that the method comprises:
determining whether at least one character in a target text belongs to a preset blacklist;
in the case where no character belongs to the preset blacklist, matching the target text against a preset white list, and counting the total length of the matched characters;
determining a match parameter between the target text and the white list according to the matched character total length and the length of the target text;
in the case where the match parameter is greater than a preset match parameter threshold, determining that the target text is non-sensitive text.
2. The method according to claim 1, characterized in that the target text includes main information and related information, the white list includes a main-body white list and an association white list, and the step of matching the target text against the preset white list and counting the total length of the matched characters comprises:
in the case where the main information is not in the main-body white list, the matched character total length is 0;
in the case where the main information is in the main-body white list, matching the related information against the association white list to obtain the successfully matched related information;
calculating the sum of the lengths of the successfully matched related information and the main information to obtain the matched character total length.
3. The method according to claim 1, characterized in that the step of determining the match parameter between the target text and the white list according to the matched character total length and the length of the target text comprises:
calculating the ratio of the matched character total length to the length of the target text to obtain the match parameter.
4. The method according to claim 1, characterized in that the method further comprises:
in the case where the match parameter is less than or equal to the preset match parameter threshold, matching the target text against a preset sensitive database to obtain the successfully matched sensitive words;
determining a sensitive parameter of the target text according to the total length of the successfully matched sensitive words;
in the case where the sensitive parameter is greater than a preset sensitive parameter threshold, determining that the target text is sensitive text.
5. The method according to claim 1, characterized in that, before the step of matching the target text against the preset sensitive database to obtain the successfully matched sensitive words, the method further comprises:
segmenting the pinyin characters in the target text using a pre-generated pinyin probability matrix.
6. The method according to claim 5, characterized in that the step of segmenting the pinyin characters in the target text using the pre-generated pinyin probability matrix comprises:
segmenting the pinyin characters in the target text to obtain segmentation groups, each segmentation group comprising a syllable group formed by concatenating at least one syllable;
for each segmentation group, determining the segmentation accuracy of the segmentation group using the pre-generated pinyin probability matrix;
replacing the pinyin characters with the segmentation group having the highest segmentation accuracy.
7. The method according to claim 1, characterized in that, before the step of matching the target text against the preset sensitive database to obtain the successfully matched sensitive words, the method further comprises:
performing generalization processing on the target text, the generalization processing including character merging, character splitting, character order expansion, and character conversion.
8. The method according to claim 2, characterized in that the main information includes a title, the main-body white list includes a title white list, the related information includes a brand and a specification, and the association white list includes a brand white list and a specification white list.
9. A sensitive text determining device, characterized in that the device comprises:
a blacklist matching module, configured to determine whether at least one character in a target text belongs to a preset blacklist;
a white list matching module, configured to, in the case where no character belongs to the preset blacklist, match the target text against a preset white list and count the total length of the matched characters;
a match parameter determining module, configured to determine a match parameter between the target text and the white list according to the matched character total length and the length of the target text;
a sensitive determining module, configured to determine that the target text is non-sensitive text in the case where the match parameter is greater than a preset match parameter threshold.
10. An electronic device, characterized by comprising:
a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the sensitive text determination method according to one or more of claims 1 to 8.
11. A readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the sensitive text determination method according to one or more of method claims 1 to 8.
CN201811290233.4A 2018-10-31 2018-10-31 Sensitive text determining method and device Active CN109657228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811290233.4A CN109657228B (en) 2018-10-31 2018-10-31 Sensitive text determining method and device

Publications (2)

Publication Number Publication Date
CN109657228A true CN109657228A (en) 2019-04-19
CN109657228B CN109657228B (en) 2023-06-06

Family

ID=66110662

Country Status (1)

Country Link
CN (1) CN109657228B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant