CN105320659A - Sensitive word filtering method - Google Patents

Info

Publication number
CN105320659A
Authority
CN
China
Prior art keywords
word
perform
mailbox
list
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410243936.7A
Other languages
Chinese (zh)
Inventor
王专
吴志祥
吴剑
张海龙
马和平
郭凤林
沈健
查江
靳彩娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongcheng Network Technology Co Ltd
Original Assignee
Tongcheng Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongcheng Network Technology Co Ltd
Priority to CN201410243936.7A
Publication of CN105320659A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a sensitive word filtering method. The method adopts a Chinese-character detection algorithm and judges sensitive words with a forward maximum matching algorithm. Consecutively repeated words are detected by forward scanning: once a word is found to occur twice, a repeat-detection mode is entered and continues until the minimum number of repetitions is met. The method also analyzes email addresses, mobile phone numbers, and web addresses, judging whether a continuous character string has the features of an email address, a mobile phone number, or the domain name of a web address. Content that meets any of these filtering conditions is filtered; all other content is released. Starting from Chinese words and phrases and combining them with other kinds of characters, such as email addresses, mobile phone numbers, and web addresses, the method achieves comprehensive sensitive word judgment and filtering. More importantly, with a software matching algorithm, the detection of consecutively repeated words, the analysis of email addresses, mobile phone numbers, and web addresses, the matching of continuous character strings, and the comparison against a sensitive word dictionary can all run automatically without human participation. This improves the accuracy of website data and, in particular, the efficiency of reviewing user comments.

Description

Sensitive word filtering method
Technical field
The present invention relates to a filtering method, and in particular to a sensitive word filtering method.
Background art
In the prior art, when a website lets users submit comments, filtering or screening sensitive words mostly requires manual review, or a program that handles only a few frequently occurring sensitive words. This approach is cumbersome and consumes a great deal of people's time. The list of possible sensitive words also has to be supplemented continually, and repeated words are not easy to judge. More importantly, as the number of visitors grows and comments multiply, this approach becomes almost infeasible. To solve this problem, a powerful sensitive word processing method that can be executed in software is needed.
Summary of the invention
The object of the present invention is to solve the above problems in the prior art by providing a sensitive word filtering method.
The object of the present invention is achieved through the following technical solutions:
A sensitive word filtering method, wherein at least the following techniques are adopted: a Chinese-character detection algorithm; sensitive word judgment by a forward maximum matching algorithm; detection of consecutively repeated words by forward scanning, in which, once a word is found to occur twice, a repeat-detection mode is entered and continues until the minimum number of repetitions is met; and analysis of email addresses, mobile phone numbers, and web addresses, judging whether a continuous character string has the features of an email address, a mobile phone number, or the domain name of a web address. Content that meets any of these filtering conditions is filtered; otherwise it is released.
In the above sensitive word filtering method, the Chinese-character detection algorithm judges whether the underlying encoding falls within the Chinese character code range. That range is: first byte, 0x81 to 0xFE; second byte, 0x40 to 0x7E or 0xA1 to 0xFE.
Further, in the above sensitive word filtering method, the sensitive word judgment process is as follows:
Step 1: build a sensitive word dictionary and put all the sensitive words to be judged into it.
Step 2: initialize the judgment by setting s=1 and n=1, where s is the position of the character to start from and n is the number of characters to take.
Step 3: take n characters starting from the s-th character of the input text and check whether that string exists in the dictionary. If it exists, perform step 4; if not, set n=n+1 and continue with step 3.
Step 4: put the matched word into the match list. If s equals the maximum length of the input text, perform step 5. If s+n equals the maximum length of the input text, set s=s+1 and n=1; otherwise set n=n+1. Continue with step 3.
Step 5: de-duplicate the match list and return it.
Further, in the above sensitive word filtering method, the process for detecting consecutively repeated words is as follows:
Step 1: initialize by setting s=1.
Step 2: take the s-th character and search the following text for an identical character. If one is found, record its position as p and perform step 3. If none is found, set s=s+1 and continue with this step; if s equals the maximum length of the input text, perform step 4.
Step 3: set n=1. If s+n>p, set s=s+1 and perform step 2. Compare the (s+n)-th character with the (p+n)-th character; if they are the same, set n=n+1 and continue with this step. If they differ and s+n=p, take the preceding s+n-1 characters and put them into the repeated-word list. If s+n<p, set s=s+1 and perform step 2.
Step 4: de-duplicate the repeated content and return the repeated-word list.
Further, in the above sensitive word filtering method, the email address analysis process is as follows:
Step 1: initialize by setting s=1.
Step 2: starting from the s-th character, search for the position of the "@" symbol and record it as p. If it is found, perform step 3; if not, set s=s+1 and continue with this step.
Step 3: from position p, search to the left and to the right for characters permitted in an email address, namely one or more of letters, digits, underscores, and periods. Record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they form an email address. If so, put them into the email list; if not, set s=p+1 and perform step 2.
Step 4: de-duplicate the email list and return it.
Further, in the above sensitive word filtering method, the mobile phone number analysis process is as follows:
Step 1: initialize by setting s=1.
Step 2: starting from the s-th character, search for the position of a digit and record it as p. If one is found, perform step 3; if not, set s=s+1 and continue with this step.
Step 3: from position p, search to the left and to the right for digits. Record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they form a mobile phone number. If so, put them into the mobile phone number list; if not, set s=p+1 and perform step 2.
Step 4: de-duplicate the mobile phone number list and return it.
Further, in the above sensitive word filtering method, the web address analysis process is as follows:
Step 1: initialize by setting s=1.
Step 2: starting from the s-th character, search for the position of the "." symbol and record it as p. If it is found, perform step 3; if not, set s=s+1 and continue with this step.
Step 3: from position p, search to the right for characters that fit a domain suffix. Record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they are a domain suffix. If they are not, set s=p+1 and return to step 2. If they are, search to the left from position p for matching characters and record the leftmost position as q; take the characters between q and n and judge whether they are web address content. If so, put them into the website list; if not, set s=p+1 and perform step 2.
Still further, in the above sensitive word filtering method, the characters comprise letters, digits, and periods; the domain suffix is ".com" or ".cn"; the required content comprises letters, digits, periods, backslashes, and colons; and the web address content is "http://" or "ftp://".
The advantages of the technical solution of the present invention are mainly as follows. Starting from the judgment of Chinese words and combining it with other kinds of characters, such as email addresses, mobile phone numbers, and web addresses, the method achieves comprehensive sensitive word judgment and filtering. More importantly, once the invention is implemented as a software algorithm, the detection of consecutively repeated words, the analysis of email addresses, mobile phone numbers, and web addresses, the matching of continuous character strings, and the comparison against the sensitive word dictionary can all run automatically without human participation, which improves the efficiency of reviewing website data, especially user comments.
Embodiment
The sensitive word filtering method is distinctive in that it adopts at least the following techniques. First, it adopts a Chinese-character detection algorithm. It also judges sensitive words with a forward maximum matching algorithm. To widen the range of sensitive words that can be judged, and to reflect how Chinese is customarily written, sensitive word judgment can also support pinyin. Further, consecutively repeated words are detected by forward scanning: once a word is found to occur twice, a repeat-detection mode is entered and continues until the minimum number of repetitions is met. In addition, email addresses, mobile phone numbers, and web addresses are analyzed by judging whether a continuous character string has the features of an email address, a mobile phone number, or the domain name of a web address, which improves overall judgment efficiency. Content that meets any of these filtering conditions is filtered; otherwise it is released.
Given the particular way Chinese characters are encoded, and in order to improve judgment accuracy, the Chinese-character detection algorithm judges whether the underlying encoding falls within the Chinese character code range. Specifically, that range is: first byte, 0x81 to 0xFE; second byte, 0x40 to 0x7E or 0xA1 to 0xFE.
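The byte-range test described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not name the encoding, so GBK is assumed here because the quoted ranges resemble its two-byte area, and `is_chinese_char` is a hypothetical helper name.

```python
def is_chinese_char(ch: str) -> bool:
    """Check one character against the two-byte ranges quoted in the text.

    Assumption: the underlying encoding is GBK (not stated in the source).
    """
    try:
        raw = ch.encode("gbk")
    except UnicodeEncodeError:
        # The character has no GBK encoding at all.
        return False
    if len(raw) != 2:
        # ASCII characters encode to a single byte.
        return False
    first, second = raw
    in_first_range = 0x81 <= first <= 0xFE
    in_second_range = 0x40 <= second <= 0x7E or 0xA1 <= second <= 0xFE
    return in_first_range and in_second_range
```

Because the test is purely a range check, any two-byte symbol that falls inside the same area would also pass; a stricter filter would need an additional check.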
In a preferred embodiment of the present invention, the sensitive word judgment process is as follows:
Step 1: build a sensitive word dictionary and put all the sensitive words to be judged into it.
Step 2: initialize the judgment by setting s=1 and n=1, where s is the position of the character to start from and n is the number of characters to take.
Step 3: take n characters starting from the s-th character of the input text and check whether that string exists in the dictionary. If it exists, perform step 4; if not, set n=n+1 and continue with step 3.
Step 4: put the matched word into the match list. If s equals the maximum length of the input text, perform step 5. If s+n equals the maximum length of the input text, set s=s+1 and n=1; otherwise set n=n+1. Continue with step 3.
Step 5: de-duplicate the match list and return it.
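The dictionary scan in steps 1 to 5 can be sketched roughly as below. This is an illustrative reading rather than the patented implementation: positions are 0-based instead of 1-based, the two counters become nested loops, and the name `find_sensitive_words` is hypothetical.

```python
def find_sensitive_words(text: str, dictionary: set) -> list:
    """Collect every dictionary word that occurs in text, in first-seen order."""
    matches = []
    for s in range(len(text)):                 # s: start position
        for n in range(1, len(text) - s + 1):  # n: number of characters taken
            candidate = text[s:s + n]
            if candidate in dictionary:
                matches.append(candidate)
    # Step 5: de-duplicate while keeping the order of first appearance.
    return list(dict.fromkeys(matches))


if __name__ == "__main__":
    print(find_sensitive_words("this is bad and badly bad", {"bad", "badly"}))
```

Trying every length from every start is quadratic in the length of the comment. Capping n at the length of the longest dictionary entry would be a natural optimization, although the text does not mention one.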
Further, the process for detecting consecutively repeated words is as follows:
Step 1: initialize by setting s=1.
Step 2: take the s-th character and search the following text for an identical character. If one is found, record its position as p and perform step 3. If none is found, set s=s+1 and continue with this step; if s equals the maximum length of the input text, perform step 4.
Step 3: set n=1. If s+n>p, set s=s+1 and perform step 2. Compare the (s+n)-th character with the (p+n)-th character; if they are the same, set n=n+1 and continue with this step. If they differ and s+n=p, take the preceding s+n-1 characters and put them into the repeated-word list. If s+n<p, set s=s+1 and perform step 2.
Step 4: de-duplicate the repeated content and return the repeated-word list.
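One way to read the repeat-detection steps is sketched below. It is a simplification, not a line-by-line transcription: the unit is taken as the text between a character and its nearest later occurrence, consecutive copies of that unit are counted, and the unit is recorded once the count reaches a minimum. The name `find_repeated_units` and the default `min_repeats=3` are assumptions.

```python
def find_repeated_units(text: str, min_repeats: int = 3) -> list:
    """Return units that repeat back-to-back at least min_repeats times."""
    found = []
    s = 0
    while s < len(text):
        # Nearest later occurrence of the character at position s.
        p = text.find(text[s], s + 1)
        if p == -1:
            s += 1
            continue
        unit = text[s:p]
        width = len(unit)
        count = 1
        # Count how many consecutive copies of the unit follow position s.
        while text[s + count * width:s + (count + 1) * width] == unit:
            count += 1
        if count >= min_repeats:
            found.append(unit)
            s += count * width  # skip past the whole repeated run
        else:
            s += 1
    # De-duplicate while keeping the order of first appearance.
    return list(dict.fromkeys(found))
```

A unit that contains the same character twice (for example "aab") is not found by this sketch, because only the nearest later occurrence is tried.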
Still further, the email address analysis process adopted by the present invention is as follows:
Step 1: initialize by setting s=1.
Step 2: starting from the s-th character, search for the position of the "@" symbol and record it as p. If it is found, perform step 3; if not, set s=s+1 and continue with this step.
Step 3: from position p, search to the left and to the right for characters permitted in an email address, namely one or more of letters, digits, underscores, and periods. Record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they form an email address. If so, put them into the email list; if not, set s=p+1 and perform step 2.
Step 4: finally, de-duplicate the email list and return it.
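The email steps can be sketched as follows. Two points are assumptions rather than statements from the text: the anchor symbol is taken to be "@", and the final "is it an email address" judgment, which the text leaves unspecified, is approximated with a simple regular expression. `find_emails` is a hypothetical name.

```python
import re
import string

# Characters the text allows around the anchor: letters, digits, underscore, period.
EMAIL_CHARS = set(string.ascii_letters + string.digits + "_.")
# Assumed validity rule: local part, "@", then at least two dot-separated labels.
EMAIL_RE = re.compile(r"[A-Za-z0-9_.]+@[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)+")


def find_emails(text: str) -> list:
    """Expand left and right from each "@" and keep candidates that look valid."""
    found = []
    s = 0
    while True:
        p = text.find("@", s)
        if p == -1:
            break
        m = p
        while m > 0 and text[m - 1] in EMAIL_CHARS:
            m -= 1                      # leftmost allowed character
        n = p + 1
        while n < len(text) and text[n] in EMAIL_CHARS:
            n += 1                      # one past the rightmost allowed character
        candidate = text[m:n].strip(".")  # drop sentence-ending periods
        if EMAIL_RE.fullmatch(candidate):
            found.append(candidate)
        s = p + 1
    return list(dict.fromkeys(found))
```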
Similarly, to screen effectively for mobile phone numbers, the mobile phone number analysis process is as follows:
Step 1: initialize by setting s=1.
Step 2: starting from the s-th character, search for the position of a digit and record it as p. If one is found, perform step 3; if not, set s=s+1 and continue with this step.
Step 3: from position p, search to the left and to the right for digits. Record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they form a mobile phone number. If so, put them into the mobile phone number list; if not, set s=p+1 and perform step 2.
Step 4: finally, de-duplicate the mobile phone number list and return it.
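A sketch of the digit-run scan follows. The text does not say how a run of digits is judged to be a mobile phone number; the rule used here (exactly 11 digits beginning with "1", as mainland China mobile numbers are commonly written) is an assumption, and `find_mobile_numbers` is a hypothetical name. Because the scan moves left to right, each run is reached at its first digit, so no leftward expansion is needed.

```python
import string

DIGITS = set(string.digits)  # ASCII digits only


def find_mobile_numbers(text: str) -> list:
    """Collect maximal digit runs that satisfy the assumed mobile-number rule."""
    found = []
    s = 0
    while s < len(text):
        if text[s] not in DIGITS:
            s += 1
            continue
        n = s
        while n < len(text) and text[n] in DIGITS:
            n += 1                      # one past the last digit of the run
        run = text[s:n]
        # Assumed rule: 11 digits starting with "1".
        if len(run) == 11 and run.startswith("1"):
            found.append(run)
        s = n                           # continue after the run
    return list(dict.fromkeys(found))
```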
Further, because the content under review may contain web addresses, and in order to improve overall filtering of the content, the web address analysis process is as follows:
Step 1: initialize by setting s=1.
Step 2: because common web addresses mostly use "." as a separator, search for the position of the "." symbol starting from the s-th character and record it as p. If it is found, perform step 3; if not, set s=s+1 and continue with this step.
Step 3: from position p, search to the right for characters that fit a domain suffix. Record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they are a domain suffix. If they are not, set s=p+1 and return to step 2. If they are, search to the left from position p for matching characters and record the leftmost position as q; take the characters between q and n and judge whether they are web address content. If so, put them into the website list; if not, set s=p+1 and perform step 2.
In line with common web address formats, the characters most often involved are letters, digits, and periods. The domain suffix is ".com" or ".cn". The matching characters comprise letters, digits, periods, backslashes, and colons. For ease of identification, the web address content is "http://" or "ftp://". Other suffixes or other web address content may of course occur, and can be added later.
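The web address steps, with the suffixes and prefixes listed above, can be sketched as follows. Some details are assumptions: the sketch accepts the forward slash that "http://" actually contains, it requires the suffix to end at a non-alphanumeric character, and the name `find_urls` is hypothetical.

```python
import string

SUFFIXES = (".com", ".cn")          # domain suffixes named in the text
SCHEMES = ("http://", "ftp://")     # web address prefixes named in the text
ALNUM = set(string.ascii_letters + string.digits)
URL_CHARS = ALNUM | set("./:")      # assumed set of matching characters


def find_urls(text: str) -> list:
    """Find "." followed by a known suffix, then expand left to the scheme."""
    found = []
    s = 0
    while True:
        p = text.find(".", s)
        if p == -1:
            break
        s = p + 1
        suffix = next((x for x in SUFFIXES if text.startswith(x, p)), None)
        if suffix is None:
            continue
        n = p + len(suffix)             # one past the end of the suffix
        if n < len(text) and text[n] in ALNUM:
            continue                    # e.g. ".company" is not the ".com" suffix
        q = p
        while q > 0 and text[q - 1] in URL_CHARS:
            q -= 1                      # leftmost matching character
        candidate = text[q:n]
        if candidate.startswith(SCHEMES):
            found.append(candidate)
    return list(dict.fromkeys(found))
```

Adding a suffix or scheme later, as the text allows, only means extending the two tuples.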
As an example of the invention in actual use, suppose a user named 13402129234 comments: "The dishes on http://ffb.bandcmac.com are very tasty, nice nice nice nice nice nice. A hundred times nicer than the dishes on http://www.fdgood.com; the dishes on http://www.fdgood.com taste bad, Nima, I wipes." With the filtering method of the invention, the following content can be filtered out: one mobile phone number, 13402129234; two web addresses, http://ffb.bandcmac.com and http://www.fdgood.com; the consecutively repeated word "nice nice nice"; and two sensitive words, "Nima" and "I wipes". To improve overall filtering performance, at least three sensitive word detection servers can be used, dedicated respectively to filtering Chinese sensitive words; to analyzing email addresses, mobile phone numbers, and web addresses; and to judging whether a continuous character string has the features of an email address, a mobile phone number, or the domain name of a web address. This balances the load as far as possible and makes effective use of the sensitive word database, whose content can be changed and supplemented at any time to meet screening needs. In actual implementation, any user action request that is detected, including the user's posts and replies of all kinds, can be filtered. After filtering, the sensitive words can be masked directly or replaced with substitute characters. Rules can also be set to block the entire post, or to block the posting user and add them to a blacklist.
As the above description shows, the present invention starts from the judgment of Chinese words and combines it with other kinds of characters, such as email addresses, mobile phone numbers, and web addresses, to achieve comprehensive sensitive word judgment and filtering. More importantly, once the invention is implemented as a software algorithm, the detection of consecutively repeated words, the analysis of email addresses, mobile phone numbers, and web addresses, the matching of continuous character strings, and the comparison against the sensitive word dictionary can all run automatically without human participation, which improves the efficiency of reviewing website data, especially user comments.

Claims (8)

1. A sensitive word filtering method, characterized in that at least the following techniques are adopted: a Chinese-character detection algorithm; sensitive word judgment by a forward maximum matching algorithm; detection of consecutively repeated words by forward scanning, in which, once a word is found to occur twice, a repeat-detection mode is entered and continues until the minimum number of repetitions is met; and analysis of email addresses, mobile phone numbers, and web addresses, judging whether a continuous character string has the features of an email address, a mobile phone number, or the domain name of a web address; content that meets any of these filtering conditions is filtered, and otherwise it is released.
2. The sensitive word filtering method according to claim 1, characterized in that the Chinese-character detection algorithm judges whether the underlying encoding falls within the Chinese character code range, the range being: first byte, 0x81 to 0xFE; second byte, 0x40 to 0x7E or 0xA1 to 0xFE.
3. The sensitive word filtering method according to claim 1, characterized in that the sensitive word judgment process is as follows:
Step 1: build a sensitive word dictionary and put all the sensitive words to be judged into it;
Step 2: initialize the judgment by setting s=1 and n=1, where s is the position of the character to start from and n is the number of characters to take;
Step 3: take n characters starting from the s-th character of the input text and check whether that string exists in the dictionary; if it exists, perform step 4; if not, set n=n+1 and continue with step 3;
Step 4: put the matched word into the match list; if s equals the maximum length of the input text, perform step 5; if s+n equals the maximum length of the input text, set s=s+1 and n=1; otherwise set n=n+1; continue with step 3;
Step 5: de-duplicate the match list and return it.
4. The sensitive word filtering method according to claim 1, characterized in that the process for detecting consecutively repeated words is as follows:
Step 1: initialize by setting s=1;
Step 2: take the s-th character and search the following text for an identical character; if one is found, record its position as p and perform step 3; if none is found, set s=s+1 and continue with this step; if s equals the maximum length of the input text, perform step 4;
Step 3: set n=1; if s+n>p, set s=s+1 and perform step 2; compare the (s+n)-th character with the (p+n)-th character; if they are the same, set n=n+1 and continue with this step; if they differ and s+n=p, take the preceding s+n-1 characters and put them into the repeated-word list; if s+n<p, set s=s+1 and perform step 2;
Step 4: de-duplicate the repeated content and return the repeated-word list.
5. The sensitive word filtering method according to claim 1, characterized in that the email address analysis process is as follows:
Step 1: initialize by setting s=1;
Step 2: starting from the s-th character, search for the position of the "@" symbol and record it as p; if it is found, perform step 3; if not, set s=s+1 and continue with this step;
Step 3: from position p, search to the left and to the right for characters permitted in an email address, namely one or more of letters, digits, underscores, and periods; record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they form an email address; if so, put them into the email list; if not, set s=p+1 and perform step 2;
Step 4: de-duplicate the email list and return it.
6. The sensitive word filtering method according to claim 1, characterized in that the mobile phone number analysis process is as follows:
Step 1: initialize by setting s=1;
Step 2: starting from the s-th character, search for the position of a digit and record it as p; if one is found, perform step 3; if not, set s=s+1 and continue with this step;
Step 3: from position p, search to the left and to the right for digits; record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they form a mobile phone number; if so, put them into the mobile phone number list; if not, set s=p+1 and perform step 2;
Step 4: de-duplicate the mobile phone number list and return it.
7. The sensitive word filtering method according to claim 1, characterized in that the web address analysis process is as follows:
Step 1: initialize by setting s=1;
Step 2: starting from the s-th character, search for the position of the "." symbol and record it as p; if it is found, perform step 3; if not, set s=s+1 and continue with this step;
Step 3: from position p, search to the right for characters that fit a domain suffix; record the leftmost position as m and the rightmost position as n, take the characters between m and n, and judge whether they are a domain suffix; if they are not, set s=p+1 and return to step 2; if they are, search to the left from position p for matching characters and record the leftmost position as q; take the characters between q and n and judge whether they are web address content; if so, put them into the website list; if not, set s=p+1 and perform step 2.
8. The sensitive word filtering method according to claim 7, characterized in that the characters comprise letters, digits, and periods; the domain suffix is ".com" or ".cn"; the required content comprises letters, digits, periods, backslashes, and colons; and the web address content is "http://" or "ftp://".
CN201410243936.7A 2014-06-04 2014-06-04 Sensitive word filtering method Pending CN105320659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410243936.7A CN105320659A (en) 2014-06-04 2014-06-04 Sensitive word filtering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410243936.7A CN105320659A (en) 2014-06-04 2014-06-04 Sensitive word filtering method

Publications (1)

Publication Number Publication Date
CN105320659A true CN105320659A (en) 2016-02-10

Family

ID=55248063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410243936.7A Pending CN105320659A (en) 2014-06-04 2014-06-04 Sensitive word filtering method

Country Status (1)

Country Link
CN (1) CN105320659A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682090A (en) * 2012-04-26 2012-09-19 焦点科技股份有限公司 System and method for matching and processing sensitive words on basis of polymerized word tree
CN103514238A (en) * 2012-06-30 2014-01-15 重庆新媒农信科技有限公司 Sensitive word recognition processing method based on classification searching
CN103678651A (en) * 2013-12-20 2014-03-26 Tcl集团股份有限公司 Sensitive word searching method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280560A (en) * 2017-01-06 2018-07-13 广州市动景计算机科技有限公司 A kind of anti-brush method and device of subject evaluation
CN107704538A (en) * 2017-09-22 2018-02-16 北京锐安科技有限公司 A kind of rubbish text processing method, device, equipment and storage medium
CN115208789A (en) * 2022-07-14 2022-10-18 上海斗象信息科技有限公司 Method and device for determining directory blasting behavior, electronic equipment and storage medium
CN115208789B (en) * 2022-07-14 2023-06-09 上海斗象信息科技有限公司 Method and device for determining directory blasting behavior, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Mejova et al. GOP primary season on twitter: "popular" political sentiment in social media
CN106815263B (en) The searching method and device of legal provision
CN103778548B (en) Merchandise news and key word matching method, merchandise news put-on method and device
CN104809108B (en) Information monitoring analysis system
CN105183731B (en) Recommendation information generation method, device and system
JP5575902B2 (en) Information retrieval based on query semantic patterns
CN102612691B (en) Method and system for scoring texts
CN104636371B (en) Information recommendation method and equipment
CN102495892A (en) Webpage information extraction method
CN102880647A (en) Method and device for acquiring another name of organization
CN104504027B (en) The auto-screening method and device of web page contents
CN104750704A (en) Webpage uniform resource locator (URL) classification and identification method and device
CN106055539A (en) Name disambiguation method and apparatus
CN110688455A (en) Method, medium and computer equipment for filtering invalid comments based on artificial intelligence
CN105320659A (en) Sensitive word filtering method
CN108197243A (en) Method and device is recommended in a kind of input association based on user identity
CN106547924A (en) The sentiment analysis method and device of text message
Dorle et al. Political sentiment analysis through social media
KR101326313B1 (en) Method of classifying emotion from multi sentence using context information
CN111125561A (en) Network heat display method and device
CN106611029A (en) Method and device for improving site search efficiency in website
Xu et al. Sentiment Analysis On Twitter Posts About The Russia and Ukraine War With Long Short-Term Memory
CN105955990A (en) Method for sequencing and screening of comments with consideration of diversity and effectiveness
CN106844743B (en) Emotion classification method and device for Uygur language text
Liu et al. Detecting spam comments posted in micro-blogs using the self-extensible spam dictionary

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160210