CN103838710B

CN103838710B - Text filtering method and system based on keyword weights

Info

Publication number: CN103838710B
Application number: CN201210479196.8A
Authority: CN
Inventors: 粟栗; 张峰; 付俊
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Filing date: 2012-11-22
Publication date: 2016-11-30
Anticipated expiration: 2032-11-22

Abstract

This application provides a text filtering method based on keyword weights. The method comprises the following steps: calculating the weight of a keyword; and filtering text based on the calculated weight of the keyword. The step of calculating the keyword weight includes: determining whether the keyword is a brand-new keyword; if not, counting the number of correctly judged items and the number of misjudged items in the historical judgment data, as well as the number of correctly judged items and the number of misjudged items that contain the keyword; and calculating the weight of the keyword. The application also provides a text filtering system based on keyword weights.

Description

Text filtering method and system based on keyword weights

Technical field

The application relates to the fields of security and data services, and in particular to a text filtering method and system based on keyword weights.

Background

Text is the most abundant type of content carried in mobile Internet traffic, and includes web pages, SMS messages, MMS messages, instant messaging, and so on. Filtering sensitive content (such as political, pornographic, or gambling content) in transmitted text is an important Internet technology. In general, a system classifies text into two classes: "normal" and "needs filtering".

In terms of volume, the users on a single link (10G) access several hundred million text items per day, and the whole network carries hundreds of billions of items. The information that needs filtering accounts for a very small share of this, generally less than 1%, so accurately capturing it from such a mass of data is difficult. Even a small misjudgment rate (10%) causes misjudged items to make up more than 90% of the data the system captures. To avoid misjudgment, the final determination has to be made by a secondary manual review. The results of manual review are the most accurate, but its efficiency is comparatively low.
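The "more than 90%" figure follows from simple base-rate arithmetic, which the short Python sketch below reproduces. It assumes that the 10% misjudgment rate applies to normal items (a false-positive rate) and that every item needing filtering is caught; the function name `misjudged_share` and both of these readings are assumptions of this sketch, not statements from the application.

```python
def misjudged_share(prevalence: float, false_positive_rate: float,
                    recall: float = 1.0) -> float:
    """Fraction of captured items that are actually normal (misjudged)."""
    true_hits = prevalence * recall                        # items that really need filtering
    false_hits = (1.0 - prevalence) * false_positive_rate  # normal items flagged by mistake
    return false_hits / (true_hits + false_hits)


# 1% of the stream needs filtering; 10% of normal items are flagged by mistake.
share = misjudged_share(prevalence=0.01, false_positive_rate=0.10)
print(f"{share:.1%}")
```

Under these assumptions the misjudged items outnumber the true hits roughly ten to one, which is why the captured data still needs manual review.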

Existing content filtering systems (hereinafter "filtering systems") identify and classify text mainly by the following methods:

(1) Judgment based on keyword count

The main idea of this scheme is to set up a keyword library in which no further metric is attached to each keyword. The keywords contained in a text are identified, and the text is classified according to whether their number reaches a threshold preset in the system.

(2) Judgment based on the sum of entropy values (weights)

This technique assigns an entropy value to each keyword: important keywords are given higher entropy values and unimportant keywords lower ones. When a text is identified, the sum of the entropy values of the keywords it contains is calculated, and the text is classified according to whether that sum reaches a threshold preset in the system.

(3) Judgment based on semantic recognition

Semantic recognition defines not only keywords but also relationships between keywords (for example, co-occurrence) in order to determine entropy values, and ultimately determines the class of a text from the semantics of the full text. For example, when the keywords "gun" and "sale" each appear alone, the text should be judged normal; if they appear within a certain distance of each other, as in "sale of imported guns", a semantic relationship between the two words is judged to exist.
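The three existing approaches can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: text is already split into tokens, "semantic recognition" is reduced to a token-distance check between two keywords, and all function names, weights, and thresholds are hypothetical.

```python
def count_judge(tokens, keywords, threshold):
    """(1) Flag text when the number of keyword hits reaches the threshold."""
    hits = sum(1 for t in tokens if t in keywords)
    return hits >= threshold


def weight_sum_judge(tokens, weights, threshold):
    """(2) Flag text when the summed weights of keyword hits reach the threshold."""
    total = sum(weights.get(t, 0) for t in tokens)
    return total >= threshold


def proximity_judge(tokens, word_a, word_b, max_distance):
    """(3) Flag text when two keywords occur within max_distance tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


shop_page = "sale sale sale of shoes and bags on sale".split()
risky_text = "sale of imported gun".split()

# Counting alone flags the harmless shop page.
print(count_judge(shop_page, {"sale", "gun"}, threshold=3))
# Weights separate the two texts, provided the weights are chosen well.
print(weight_sum_judge(shop_page, {"sale": 1, "gun": 60}, threshold=50))
print(weight_sum_judge(risky_text, {"sale": 1, "gun": 60}, threshold=50))
# The proximity check only fires when both words appear close together.
print(proximity_judge(risky_text, "sale", "gun", max_distance=3))
```

The second approach depends entirely on how the weights are chosen, which is the gap the application goes on to address.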

However, each of the three existing schemes has shortcomings, as follows:

(1) Judgment based on keyword count

Judging simply by the number of keywords produces a great many misjudgments. For example, if "sale" and "gun" are both keywords, an article in which "sale" appears many times may be judged as needing filtering, although it may equally be a normal transaction page (such as one on Taobao).

The misjudgment rate with this technique can be as high as 50%. For a raw data stream in which only 1% of the information needs filtering, the impact of such misjudgment is essentially unacceptable.

(2) Judgment based on the sum of entropy values

Judgment based on entropy values undoubtedly strengthens decision-making considerably. For example, the entropy values of "sale" and "gun" can each be set to 1, and the entropy value of the combination "sale" & "gun" set to 100. A text in which only "sale" or only "gun" appears can then be judged normal, and only a text in which "sale" and "gun" appear together is judged as needing filtering.

This technique greatly reduces misjudgments, but it leaves open the problem of how to adjust keyword weights that have been set unreasonably.

(3) Judgment based on semantic recognition

The more conditions semantic recognition uses, the more accurate it is. However, semantic recognition also faces two technical difficulties. First, it too faces the problem of how keyword weights are set, because existing schemes lack a defined way of setting keyword weights. Second, semantic recognition analysis is inefficient and is not suited to processing massive data.

Summary of the invention

To solve the problem that the system judges text with relatively low accuracy, this application provides a text filtering method based on keyword weights. The method comprises the following steps: calculating the weight of a keyword; and filtering text based on the calculated weight of the keyword. The step of calculating the keyword weight includes: determining whether the keyword is a brand-new keyword; if not, counting the number M of correctly judged items and the number N of misjudged items in the historical judgment data, as well as the number M1 of correctly judged items and the number N1 of misjudged items that contain the keyword; and calculating the keyword weight

$$\mathrm{Value0} = VL + \frac{M1/M}{M1/M + N1/N}\,(VH - VL)$$

where VL is the minimum keyword weight set by the user and VH is the maximum keyword weight set by the user.
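As a rough illustration of the formula above, the following Python sketch computes Value0 from the four counts. The function name `initial_weight`, the default range of 0 to 100, the sample counts, and the fallback to the midpoint when the keyword appears in neither class (where the denominator would be zero) are assumptions of this sketch. It also assumes that M and N are both non-zero.

```python
def initial_weight(m: int, n: int, m1: int, n1: int,
                   vl: float = 0.0, vh: float = 100.0) -> float:
    """Value0 = VL + (M1/M) / (M1/M + N1/N) * (VH - VL)."""
    share_correct = m1 / m   # share of correctly judged items containing the keyword
    share_wrong = n1 / n     # share of misjudged items containing the keyword
    total = share_correct + share_wrong
    if total == 0:
        # Assumption: the keyword never occurs in reviewed data, so use the midpoint.
        return (vh + vl) / 2
    return vl + share_correct / total * (vh - vl)


# Hypothetical counts: the keyword is in 300 of 1000 correctly judged items
# and in 100 of 1000 misjudged items.
weight = initial_weight(m=1000, n=1000, m1=300, n1=100)
print(round(weight, 1))
```

Because the formula works on shares rather than raw counts, the result does not depend on how many correctly judged and misjudged items happen to be in the history.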

In another aspect, the application also provides a text filtering system based on keyword weights. The system includes: a keyword weight calculation module for calculating the weight of a keyword; and a text filtering module for filtering text based on the calculated weight of the keyword. The keyword weight calculation module includes: a first judging unit for determining whether the keyword is a brand-new keyword; a first computing unit for counting, when the keyword is not a brand-new keyword, the number M of correctly judged items and the number N of misjudged items in the historical judgment data, as well as the number M1 of correctly judged items and the number N1 of misjudged items that contain the keyword; and a second computing unit for calculating the keyword weight

$$\mathrm{Value0} = VL + \frac{M1/M}{M1/M + N1/N}\,(VH - VL)$$

where VL is the minimum keyword weight set by the user and VH is the maximum keyword weight set by the user.

With the above method and system, the accuracy with which the system judges text can be effectively improved.

Brief description of the drawings

Fig. 1 is a schematic diagram of the text filtering system based on keyword weights;

Fig. 2 is a flowchart of the text filtering method based on keyword weights.

Detailed description

The set of keywords used for content filtering is small (hundreds to thousands), whereas the objects (texts) judged against the dictionary number in the hundreds of billions every day, and tens of thousands of samples require manual review each day.

The same keyword, for example "gun", may be used in normal text and may also be used in a violence-related web page. Because existing filtering systems include a manual review stage, the results of manual review can be used to determine how a keyword contributes to correct judgments and to misjudgments. The positive and negative effects of the keyword on judgment are analyzed together to determine its final weight.

The application proposes a mechanism for optimizing keyword weight settings based on classified samples. It is a keyword weight setting mechanism based on sample friction: samples are divided into two classes, correctly judged and wrongly judged, and for a keyword that may appear in samples of either class, the weight is increased or decreased according to how the keyword occurs in existing or newly added samples. The mechanism resembles placing a keyword between two wooden boards and rubbing it, and is therefore called the sample friction mechanism.

Specific embodiments of the application are described below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of a text filtering system 1000 based on keyword weights. The text filtering system 1000 includes: a keyword weight calculation module 1100 for calculating the weight of a keyword; and a text filtering module 1200 for filtering text based on the calculated weight of the keyword. The keyword weight calculation module 1100 includes: a first judging unit 1010 for determining whether the keyword is a brand-new keyword; a first computing unit 1011 for counting, when the keyword is not a brand-new keyword, the number M of correctly judged items and the number N of misjudged items in the historical judgment data, as well as the number M1 of correctly judged items and the number N1 of misjudged items that contain the keyword; and a second computing unit 1012 for calculating the keyword weight

$$\mathrm{Value0} = VL + \frac{M1/M}{M1/M + N1/N}\,(VH - VL)$$

where VL is the minimum keyword weight set by the user and VH is the maximum keyword weight set by the user.

In some embodiments, the text filtering system 1000 may also include a historical judgment database that stores the historical judgment data. Historical judgment data is data that has been manually reviewed and classified. It may include correctly judged data and misjudged data: correctly judged data is data confirmed by manual review as correctly judged, and misjudged data is data confirmed by manual review as misjudged.

In some embodiments, the keyword weight calculation module 1100 also includes: a third computing unit 1013 for counting the number of times Xi the keyword appears in the correctly judged data of the historical judgment data and the number of times Yi the keyword appears in the misjudged data of the historical judgment data; and a fourth computing unit 1014 for calculating the friction coefficient

$$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$

In some embodiments, the keyword weight calculation module 1100 also includes: a second judging unit 1018 for determining whether the magnitude of (Xi - Yi)μ exceeds a weight adjustment threshold preset by the user; and a weight adjusting unit 1019 for setting the keyword weight Value = Value0 + (Xi - Yi)μ when the magnitude of (Xi - Yi)μ exceeds the weight adjustment threshold, and Value = Value0 otherwise.

In some embodiments, the keyword weight calculation module 1100 also includes: a fifth computing unit 1017 for setting the keyword weight Value0 = (VH + VL)/2 when the keyword is a brand-new keyword, where VL is the minimum keyword weight set by the user and VH is the maximum keyword weight set by the user; a sixth computing unit 1018 for counting, when the keyword newly appears in the correctly judged data or the misjudged data of the historical judgment data, the number of times Xi the keyword appears in the correctly judged data and the number of times Yi the keyword appears in the misjudged data; and a seventh computing unit 1017 for calculating the friction coefficient

$$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$
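The counts consumed by the computing units can be derived from reviewed records roughly as follows. This Python sketch assumes a hypothetical record layout (`ReviewedRecord` with a token list and a review flag) and reads M, N, M1, N1 as numbers of items but Xi, Yi as numbers of occurrences, so a keyword appearing twice in one item adds 1 to M1 and 2 to Xi. That reading and all names here are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass
class ReviewedRecord:
    tokens: list    # tokenized text (tokenization is outside this sketch)
    correct: bool   # True if manual review confirmed the system's judgment


def keyword_stats(records, keyword):
    """Return the item counts M, N, M1, N1 and occurrence counts Xi, Yi."""
    m = n = m1 = n1 = xi = yi = 0
    for record in records:
        occurrences = record.tokens.count(keyword)
        if record.correct:
            m += 1
            m1 += 1 if occurrences else 0
            xi += occurrences
        else:
            n += 1
            n1 += 1 if occurrences else 0
            yi += occurrences
    return {"M": m, "N": n, "M1": m1, "N1": n1, "Xi": xi, "Yi": yi}


history = [
    ReviewedRecord(["gun", "sale", "gun"], correct=True),
    ReviewedRecord(["shoes", "sale"], correct=True),
    ReviewedRecord(["toy", "gun"], correct=False),
]
stats = keyword_stats(history, "gun")
print(stats)
```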

Fig. 2 is a flowchart of a text filtering method 2000 based on keyword weights. The method mainly includes calculating the weight of a keyword and filtering text based on the calculated weight. Filtering text based on keyword weights is well known to those skilled in the art and is not described again here. The main steps of calculating the keyword weight are described below.

In step 201, it is determined whether the keyword to be filtered is a brand-new keyword. If it is not, then in step 202 the number M of correctly judged items and the number N of misjudged items in the historical judgment data are counted, as well as the number M1 of correctly judged items and the number N1 of misjudged items that contain the keyword. Then, in step 203, the weight of the keyword is calculated as

$$\mathrm{Value0} = VL + \frac{M1/M}{M1/M + N1/N}\,(VH - VL)$$

where VL is the minimum keyword weight set by the user and VH is the maximum keyword weight set by the user.

M1/M and N1/N are the keyword's shares of the correctly judged data and of the misjudged data, respectively; these shares indicate how well the keyword discriminates between the two classes. For example, if the keyword appears often in correctly judged samples, say M1/M = 45%, and rarely in misjudged samples, say N1/N = 1%, the keyword tends to lead to a correct judgment rather than a misjudgment. In one embodiment, VL = 0 and VH = 100 by default, in which case the weight of this keyword is 45/(45 + 1) × 100 ≈ 97.8. In other embodiments, the user can set VL and VH freely as needed.

In step 204, the number of times Xi the keyword appears in the correctly judged data of the historical judgment data and the number of times Yi it appears in the misjudged data of the historical judgment data are counted. In step 205, the friction coefficient is calculated:

$$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$

In step 206, it is determined whether the magnitude of (Xi - Yi)μ exceeds the weight adjustment threshold preset by the user. In some embodiments, the weight adjustment threshold can be set to "1" (that is, the weight is adjusted in the system only once the change exceeds 1); the user can also set it to any suitable value as needed. If the magnitude of (Xi - Yi)μ exceeds the weight adjustment threshold, then in step 207 the keyword weight is set to Value = Value0 + (Xi - Yi)μ; otherwise, in step 208, the keyword weight is set to Value = Value0. Thereafter, whenever the keyword appears again in correctly judged data or in misjudged data, the method returns to step 204.

If the keyword to be filtered is a brand-new keyword, there is no historical analysis data. In this case, in step 209, the keyword weight is set to Value0 = (VH + VL)/2, where VL is the minimum keyword weight set by the user and VH is the maximum keyword weight set by the user. In other embodiments, the keyword weight can be set to any suitable value as needed, for example Value0 = 50. Then, in step 210, when the keyword newly appears in the correctly judged data or the misjudged data, the number of times Xi the keyword appears in the correctly judged data of the historical judgment data and the number of times Yi it appears in the misjudged data of the historical judgment data are counted. In step 211, the friction coefficient is calculated:

$$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$

The method then returns to step 206 for the judgment.
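The adjustment part of the flow (steps 205 to 208, reached either from step 204 or from step 211) can be sketched in Python as below. The function names, the default range of 0 to 100, the sample counts, and the handling of the case Xi + Yi = 0 are assumptions of this sketch; it also reads "magnitude of (Xi - Yi)μ" as an absolute value.

```python
def friction_coefficient(value0, xi, yi, vl=0.0, vh=100.0):
    """Steps 205/211: mu = min(VH - Value0, Value0 - VL) / (Xi + Yi)."""
    return min(vh - value0, value0 - vl) / (xi + yi)


def adjusted_weight(value0, xi, yi, vl=0.0, vh=100.0, threshold=1.0):
    """Steps 206-208: apply (Xi - Yi) * mu only if its magnitude exceeds the threshold."""
    if xi + yi == 0:
        return value0  # assumption: no occurrences yet, so there is nothing to adjust
    delta = (xi - yi) * friction_coefficient(value0, xi, yi, vl, vh)
    return value0 + delta if abs(delta) > threshold else value0


# Brand-new keyword (step 209): start at the midpoint of the range.
value0 = (100.0 + 0.0) / 2
# Hypothetical later counts: 30 occurrences in correct data, 10 in misjudged data.
print(adjusted_weight(value0, xi=30, yi=10))
```

When the same Xi and Yi are used for both μ and the adjustment, |Xi - Yi| can never exceed Xi + Yi, so the adjustment is at most min(VH - Value0, Value0 - VL) and the adjusted weight stays inside the range from VL to VH.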

An embodiment of the invention is described below. For example, for a newly added keyword "ammunition", analysis of the historical data gives M1/M = 15% and N1/N = 20%, and the weight range is 0 to 100. Its weight can then be set as:

$$\mathrm{Value0} = 0 + \frac{15}{15 + 20}\,(100 - 0) = 42.9$$

This weight is below 50, which indicates that the keyword contributes more to misjudgments. Suppose the system counts Xi = 15000 and Yi = 20000; then μ = 0.0012. If, over a subsequent period, "ammunition" appears 1200 times in data judged as needing filtering and 400 times in data found to be misjudged, the new weight is calculated as Value = Value0 + (Xi - Yi)μ = 42.9 + (1200 - 400) × 0.0012 = 42.9 + 0.96 = 43.86. If the weight adjustment threshold is set to 1, the weight need not be adjusted and remains 42.9.
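The arithmetic of this example can be checked with a few lines of Python. The sketch rounds Value0 to one decimal place and μ to four decimal places, which is how the figures in the text appear to have been obtained; that rounding is an assumption, as are the variable names.

```python
vl, vh = 0.0, 100.0

# Value0 from M1/M = 15% and N1/N = 20%, rounded to one decimal place.
value0 = round(vl + 0.15 / (0.15 + 0.20) * (vh - vl), 1)

# Friction coefficient from Xi = 15000 and Yi = 20000, rounded to four decimal places.
mu = round(min(vh - value0, value0 - vl) / (15000 + 20000), 4)

# Candidate adjustment from the later counts of 1200 and 400 occurrences.
delta = (1200 - 400) * mu

# With a threshold of 1 the adjustment is too small to be applied.
threshold = 1.0
new_value = value0 + delta if abs(delta) > threshold else value0

print(value0, mu, round(delta, 2), round(value0 + delta, 2), new_value)
```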

A typical feature of the method is rapid convergence. It has been applied to keyword adjustment and has completed testing. After keywords were adjusted according to the method, the misjudgment rate was reduced by 48% while the accuracy rate fell by only 4%.

Exemplary embodiments of the application have been described above with reference to the accompanying drawings. Those skilled in the art should understand that the above embodiments are examples given for purposes of illustration only and are not limiting. Any modification or equivalent substitution made under the teaching of the application and within the scope of the claims shall fall within the scope claimed by the application.

Claims

1. A text filtering method based on keyword weights, the method comprising the following steps:

calculating the weight of a keyword; and

filtering text based on the calculated weight of the keyword;

wherein the step of calculating the keyword weight includes:

determining whether the keyword is a brand-new keyword, and if the keyword is not a brand-new keyword,

counting the number M of correctly judged items and the number N of misjudged items in historical judgment data, and the number M1 of correctly judged items and the number N1 of misjudged items that contain the keyword; and

calculating the weight of the keyword $$\mathrm{Value0} = VL + \frac{M1/M}{M1/M + N1/N}\,(VH - VL)$$ where VL is the minimum weight of the keyword set by the user and VH is the maximum weight of the keyword set by the user;

if the keyword is a brand-new keyword,

setting the weight of the keyword Value0 = (VH + VL)/2, where VL is the minimum weight of the keyword set by the user and VH is the maximum weight of the keyword set by the user.

2. The method as claimed in claim 1, wherein the step of calculating the keyword weight also includes:

if the keyword is not a brand-new keyword,

counting the number of times Xi the keyword appears in the correctly judged data of the historical judgment data and the number of times Yi the keyword appears in the misjudged data of the historical judgment data; and

calculating the friction coefficient $$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$

3. The method as claimed in claim 1, wherein the step of calculating the keyword weight also includes:

if the keyword is a brand-new keyword,

when the keyword newly appears in the correctly judged data or in the misjudged data, counting the number of times Xi the keyword appears in the correctly judged data of the historical judgment data and the number of times Yi the keyword appears in the misjudged data of the historical judgment data; and

calculating the friction coefficient $$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$

4. The method as claimed in claim 2 or 3, wherein the step of calculating the keyword weight also includes:

determining whether the magnitude of (Xi - Yi)μ exceeds a weight adjustment threshold preset by the user; and

if the magnitude of (Xi - Yi)μ exceeds the weight adjustment threshold, setting the weight of the keyword Value = Value0 + (Xi - Yi)μ, and otherwise setting the weight of the keyword Value = Value0.

5. A text filtering system based on keyword weights, the system including:

a keyword weight calculation module for calculating the weight of a keyword; and

a text filtering module for filtering text based on the calculated weight of the keyword;

wherein the keyword weight calculation module includes:

a first judging unit for determining whether the keyword is a brand-new keyword;

a first computing unit for counting, when the keyword is not a brand-new keyword, the number M of correctly judged items and the number N of misjudged items in historical judgment data, and the number M1 of correctly judged items and the number N1 of misjudged items that contain the keyword;

a second computing unit for calculating the weight of the keyword $$\mathrm{Value0} = VL + \frac{M1/M}{M1/M + N1/N}\,(VH - VL)$$ where VL is the minimum weight of the keyword set by the user and VH is the maximum weight of the keyword set by the user;

a fifth computing unit for setting, when the keyword is a brand-new keyword, the weight of the keyword Value0 = (VH + VL)/2, where VL is the minimum weight of the keyword set by the user and VH is the maximum weight of the keyword set by the user.

6. The system as claimed in claim 5, wherein the keyword weight calculation module also includes:

for the case where the keyword is not a brand-new keyword,

a third computing unit for counting the number of times Xi the keyword appears in the correctly judged data of the historical judgment data and the number of times Yi the keyword appears in the misjudged data of the historical judgment data; and

a fourth computing unit for calculating the friction coefficient $$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$

7. The system as claimed in claim 5, wherein the keyword weight calculation module also includes:

for the case where the keyword is a brand-new keyword,

a sixth computing unit for counting, when the keyword newly appears in the correctly judged data or the misjudged data of the historical judgment data, the number of times Xi the keyword appears in the correctly judged data of the historical judgment data and the number of times Yi the keyword appears in the misjudged data of the historical judgment data; and

a seventh computing unit for calculating the friction coefficient $$\mu = \frac{\min\big((VH - \mathrm{Value0}),\ (\mathrm{Value0} - VL)\big)}{Xi + Yi}$$

8. The system as claimed in claim 6 or 7, wherein the keyword weight calculation module also includes:

a second judging unit for determining whether the magnitude of (Xi - Yi)μ exceeds a weight adjustment threshold preset by the user; and

a weight adjusting unit for setting the weight of the keyword Value = Value0 + (Xi - Yi)μ when the magnitude of (Xi - Yi)μ exceeds the weight adjustment threshold, and otherwise setting the weight of the keyword Value = Value0.