CN101699432A

CN101699432A - Ordering strategy-based information filtering system

Info

Publication number: CN101699432A
Application number: CN200910073206A
Authority: CN
Inventors: 齐浩亮; 杨沐昀; 韩咏; 李生; 运海红; 张艳艳; 黄成哲; 雷国华
Original assignee: Harbin Institute of Technology; Heilongjiang Institute of Technology
Current assignee: Harbin Institute of Technology; Heilongjiang Institute of Technology
Priority date: 2009-11-13
Filing date: 2009-11-13
Publication date: 2010-04-28
Anticipated expiration: 2029-11-13
Also published as: CN101699432B

Abstract

The invention provides an ordering strategy-based information filtering system, which relates to the technical field of information filtering, and solves the problems of inconsistency between an optimization objective and a filtering problem evaluating indicator, the deviation of a model optimization result and restricted performance in the conventional information filtering model. The information filtering system of the invention consists of a training model, a filter and a feature weight library, wherein a method for identifying a new information unit by the filter comprises the following steps: converting an information filtering problem into an ordering problem; performing optimization aiming at a core evaluating indicator 1-ROCA; establishing an ordering strategy-based information filtering model, wherein the ordering strategy-based information filtering model adopts an ordering logistic regression learning algorithm and comprehensively uses a TONE strategy-based parameter weight updating algorithm and resampling technology to obtain a weight parameter and obtain a prediction score value of the new information unit; and judging the attribute of a new mail according to the result of comparison of the prediction score with a predetermined threshold. The method of the invention can be applied to various information filtering and information push systems.

Description

Information filtering system based on ordering strategy

Technical field

The present invention relates to the technical field of information filtering.

Background technology

The rapid growth of information on the Internet makes it difficult for users to obtain, in a timely manner, the information they are interested in. At the same time, large volumes of junk information cause considerable trouble for both use and management. Information filtering technology is an effective solution to these problems: (1) it can proactively provide users with information relevant to their personal interests; (2) it can filter out junk information (for example, information that endangers national security, or violent, pornographic, or reactionary content).

Extensive work on information filtering has been carried out at the Text REtrieval Conference (TREC), the well-known international information retrieval evaluation co-sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST). The list of TREC evaluation tasks reflects the development of information filtering. When TREC began it had only two tasks, ad hoc and routing. The ad hoc task assumes that the document collection to be searched is fixed while the users' queries vary. The routing task is the opposite: it assumes that the user's information need is stable while the document collection keeps changing. Routing closely resembles what is commonly called news customization: for example, a user likes sports, this interest stays constant over a period of time, and the sports news keeps changing. This was the early form of the information filtering task.

As research progressed, TREC dropped the routing task and replaced it with filtering. Routing became a subtask of filtering, alongside two other subtasks: adaptive filtering and batch filtering. In adaptive filtering, each user interest (described by a topic) comes with very few positive examples, or even none; in batch filtering, many positive examples are provided. Adaptive filtering allows continuous feedback on the user's interest (hence "adaptive"), whereas batch filtering sometimes allows feedback and sometimes does not, depending on the requirements of each TREC. When feedback is not allowed, batch filtering is in effect a static classification process. Routing and batch filtering are essentially the same, except that routing returns a ranked list while batch filtering returns an unordered set. Because of this, their evaluation methods also differ: batch filtering can be assessed by recall and precision, while routing is evaluated much like the ad hoc task, whose results are also required to be ranked. Spam filtering is a typical information filtering task, and the spread of spam urgently needs a solution, so TREC began a spam filtering evaluation in 2005. Internationally, current research on information filtering is represented by research on spam filtering.

With the rapid development of Internet technology, e-mail has become a principal means of communication in people's daily work and life, and has effectively promoted production and social progress. However, the accompanying flood of spam has seriously affected the use of e-mail. In the third quarter of 2008, Chinese Internet users received an average of 17.86 spam messages per week, 1.17 more than in the same period of the previous year, an increase of 7.0%; spam accounted for 57.89% of received mail, up 2.04 percentage points year on year, and the global average is higher still ("China Anti-Spam Survey Report, Third Quarter 2008", http://www.12321.cn/viewnews.php?id=10752). Spam occupies large amounts of resources and causes enormous waste; it harms the development of the Internet, wastes users' time, energy, and money, and damages users' interests. A minority of people with ulterior motives use spam to spread false or harmful information, seriously endangering social stability. Spam has brought great economic and social harm, and the situation keeps worsening; how to identify spam effectively is a problem in urgent need of a solution.

Many solutions to the spam problem have been proposed. Among them, filtering techniques offer a high degree of automation, high accuracy, and easy acceptance by users; they have research value and room for development, and have gradually become a research focus. To test the effectiveness of filtering techniques on real spam, high-level conferences and evaluations have been held both in China and abroad. TREC began its spam filtering evaluation task in 2005 and added a Chinese spam filtering evaluation in 2006. CEAS (Conference on Email and Anti-Spam) began a dedicated spam filtering evaluation in 2007. In China, the national Search Engine and Web Mining symposium (SEWM) added a spam filtering evaluation for the first time in 2007. These conferences and evaluations have greatly promoted the development of filtering technology, improved its evaluation framework, and accumulated abundant experimental data.

Typical spam filtering techniques at present include black and white lists, rule-based and pattern-matching filtering, and filtering based on machine learning. Black and white lists are a simple, effective, and very commonly used filtering method; they filter by IP address, and can also filter by the recipient's address list. Their advantages are high processing speed and the ability to be deployed on gateways, which saves considerable network bandwidth, storage, and processing time; their disadvantages are rigidity, lack of flexibility, and low accuracy in identifying spam. Rule-based and pattern-matching filtering is another common method: a set of filtering rules is defined, and mail is filtered by searching for patterns that match known spam. Rules can be obtained by combining human experience with machine learning. The advantages are that rules are highly targeted, relatively easy to understand and modify, and capable of a degree of fuzzy matching; the disadvantages are that a large rule set is hard to match quickly, and that conflicts between rules create problems of conflict resolution and rule maintenance. Filtering based on machine learning analyzes the content of e-mail (such as subject, sender, sending time, and body text), trains a filter on the basis of machine learning models and parameter optimization theory, and uses the trained filter to identify spam samples. Because of its high accuracy and low cost, machine learning has gradually become the mainstream approach to spam filtering.

The goal of spam filtering is to sort mail into two categories, spam or normal mail (ham), so converting it into a binary classification problem is a natural way to analyze and model it. On this basis, the classification algorithms studied fall into two kinds: generative models, represented by the naive Bayes model, and discriminative learning models, represented by the support vector machine (SVM) and the maximum entropy (ME) model. Among filtering systems based on generative models, the well-known Bogo system is built on the naive Bayes model and serves as the baseline system in the TREC evaluations. In recent years, data compression algorithms such as CTW (Context Tree Weighting) and PPM (Prediction by Partial Match) have also been applied to spam filtering. CTW and PPM are dynamic compression algorithms: they predict the upcoming data stream from the data stream already seen, and the more accurate the prediction, the fewer bits are needed for coding; classification is performed on this basis. As early as 1999, Provost showed that the Bayesian model outperforms rule-based methods. Among filtering systems based on discriminative learning models, Drucker and Vapnik in 1999 used a linear SVM with features such as word features, binary features, and TF-IDF for spam filtering and obtained good experimental results. Goodman and Yih proposed using an online logistic regression model, which avoids the heavy computation of the SVM and maximum entropy models, and obtained results comparable to the best results of the previous year's (2005) TREC evaluation. Sculley and Wachman used a Relaxed Online SVM to solve the spam filtering problem, overcoming the high computational cost of the SVM, and achieved very good results in the TREC 2007 evaluation.

Traditional generative models assume that the data are generated from some distribution and model them accordingly. They use maximum likelihood estimation (MLE) to solve for the model parameters and smoothing algorithms to handle data sparsity. This approach is optimal only when two conditions both hold: first, the form of the probability distribution of the data is known; second, the training data are large enough for MLE to be used. In practice, these two conditions often cannot be met. Discriminative learning models differ fundamentally from generative models: their assumptions are much weaker than those of MLE, requiring only that the training and test data come from the same distribution. Moreover, the objective of a discriminative learning algorithm is often closely related to the evaluation criterion of the application (for example, minimizing the model's error rate on the training data). In text classification, a field closely related to spam filtering, discriminative learning models classify better than generative models, especially with small training sets. In 2004, Hulten and Goodman experimented with different kinds of filtering models on the PU-1 spam corpus and reached the same result: in mail filtering, discriminative learning models classify better than generative models. Discriminative learning models have succeeded in recent TREC and CEAS evaluations and in the domestic SEWM evaluation.

In addition, filters (classifiers) can be divided by learning mode into online learning and offline (batch) learning. In offline learning, the classifier's parameters are adjusted on training samples and are no longer adjusted during actual use. In online learning, the classifier continually adjusts its parameters according to user feedback, so that the system can adapt to a changing environment. Online learning suits environments that require fast updates; because the learner is updated online, the parameter update algorithm must have low complexity to meet practical demands. Spammers constantly change how they disguise and hide content in order to defeat filtering systems, which requires spam filters to be highly adaptive. Previous research shows that, in spam filtering, online filtering outperforms offline batch processing, because an online filter can adjust its parameters according to user feedback and thus keep up with constantly changing spam. Evaluation results in China and abroad confirm this, which is also why the TREC, CEAS, and SEWM evaluations adopt the online learning mode.

Although discriminative learning methods based on binary classifiers have achieved good results in spam filtering evaluations in China and abroad, solving spam filtering with a classification model is problematic from the standpoint of problem analysis and modeling. During training, a classifier's optimization objective is to find a set of weighted parameters, or an optimal separating surface, and to generalize from it to some degree, so as to minimize the number of misclassified messages. In other words, the objective is to reduce the total number of errors: spam misclassified as normal mail plus normal mail misclassified as spam. However, the 1-ROCA index is the core measure of spam filtering performance and is used consistently by TREC, CEAS, and SEWM. The number of misclassified messages is not directly related either to 1-ROCA or to the other reference measure of filtering performance, the logistic average misclassification rate (lam%), so the optimization objective of existing classification models is inconsistent with the filtering evaluation measures. Minimizing the total number of classification errors therefore does not guarantee that the filter's performance is optimal. Spam filtering performance still has room for improvement and calls for a better solution.

In machine learning, research related to ROC (Receiver Operating Characteristic) has attracted academic attention in recent years; for example, the International Conference on Machine Learning (ICML) held workshops on ROC-related issues in 2004, 2005, and 2006. To date, however, no information filtering system, and in particular no spam filter, has been seen that takes 1-ROCA as its optimization objective. Across machine learning as a whole, research that takes 1-ROCA as the optimization objective is also scarce. In work related to binary classification and information filtering, to our knowledge only the following three papers have studied it to some extent:

(1) L. Park and J. Moon. A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range. Proceedings of the International Conference on Intelligent Computing (ICIC-05), 2005.

(2) T. Joachims. A Support Vector Method for Multivariate Performance Measures. Proceedings of the 22nd International Conference on Machine Learning (ICML-05), 2005.

(3) L. Yan, R. Dodier, M. C. Mozer and R. Wolniewicz. Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic. Proceedings of the 20th Annual International Conference on Machine Learning (ICML-03), 2003.

The first paper optimizes this measure directly from the definition of 1-ROCA. The second and third papers point out that Wilcoxon's rank sum statistic is related to 1-ROCA. Because computing 1-ROCA directly is expensive, the third paper uses an approximation, but this introduces bias into the model optimization. The second paper modifies the SVM model to suit ordering methods and optimizes 1-ROCA directly by reducing the number of wrongly ordered sample pairs (swapped pairs); however, because the SVM model is complex, its computational cost is high. Consequently, none of these methods can be applied directly to spam filtering.

Scholars in China have also achieved a great deal in research on machine-learning-based filtering methods, and have contributed substantially to Chinese spam filtering in particular. Tsinghua University provided the Chinese spam filtering data for the TREC evaluation. South China University of Technology, with the support of Professor Dong Shoubin, hosted the domestic SEWM evaluation and provided its data, methods, and procedures. Dalian University of Technology studied filtering systems based on the SVM model, the naive Bayes (NB) model, and language models, and Shandong University adopted a filtering system based on rule-based techniques and classifier fusion; these universities took an active part in the evaluations and obtained good results. Wang Bin and colleagues at the Institute of Computing Technology, Chinese Academy of Sciences, carried out in-depth research on spam filtering and surveyed methods and results in China and abroad. Academician Fang Binxing and colleagues at Beijing University of Posts and Telecommunications made spam filtering systems practical, and Professor Zhong Yixin's research team also obtained results in short message filtering. Professor Wang Xufa of the University of Science and Technology of China proposed a multilayer spam filtering algorithm based on artificial immunity; Professor Chen Zhong of Peking University studied a Chinese spam filtering method based on suffix array clustering; Professor Xu Congfu of Zhejiang University studied spam filtering algorithms in depth and applied for a patent on a Chinese spam filtering method based on logistic regression; Professor Niu Junyu of Fudan University proposed a spam filtering method based on time-stream characteristics; and Professor Li Jianhua of Shanghai Jiao Tong University applied for a patent on an intelligent e-mail content filtering method.

In an existing information filtering system, the trainer adjusts the feature weights of the filtering model according to the user's feedback (normal information and junk information) and updates the feature weight library.

The filter judges newly received information on the basis of the features and weights in the feature weight library.

While processing information, the user supplies new feedback to the trainer at irregular intervals and in irregular amounts. This supports the dynamic adjustment and updating of the feature weights and enables the filter to keep up with constantly changing junk information.

Traditional information filtering research treats the task as a binary classification problem and, on that basis, builds classification models whose optimization objective is to minimize the number of classification errors.

Ideally, a classification model would predict every item in the test set correctly. In a real environment, however, fully correct predictions cannot be guaranteed, so the performance of different models must be assessed with suitable evaluation measures. The main measures of model performance are accuracy, error rate, precision, recall, and the F1 value. These measures are not suitable for information filtering problems such as spam filtering, for two reasons. First, each of them is valid only at a single operating point and cannot reflect the overall performance of the filtering model across different thresholds. Second, when the ratio of positive to negative examples in the test set changes or is highly skewed, when the class distribution changes, or when misclassifying a positive example costs differently from misclassifying a negative one, these measures cannot reflect the model's performance.
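By way of illustration only, the following Python sketch shows how accuracy, precision, recall, and F1 depend on the single threshold at which they are computed. The scores, labels, thresholds, and the function name point_metrics are hypothetical and are not taken from the invention.

```python
from typing import Dict, Sequence


def point_metrics(scores: Sequence[float], labels: Sequence[int],
                  threshold: float) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 at one decision threshold.

    Labels use 1 for a positive example and 0 for a negative example;
    a unit is predicted positive when its score reaches the threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(labels), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical scores for six units (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for threshold in (0.5, 0.35):
    print(threshold, point_metrics(scores, labels, threshold))
```

The same scores yield different values of every measure at the two thresholds, which is the single-operating-point limitation described above.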

The core evaluation criterion in spam filtering research is the receiver operating characteristic (ROC) curve, also called the recall-fallout plot. It has two advantages that overcome the shortcomings of the measures above. First, it is unaffected by the class distribution and insensitive to changes in it (that is, to changes in the proportion of spam to normal mail), so it is suited to assessing spam data sets in which spam and normal mail are unevenly distributed. Second, when the relative cost of the spam misclassification rate (sm%) and the normal-mail misclassification rate (hm%) to spam filtering performance (or rather, to the user's assessment) is unknown, the measure covers all possible thresholds and is not restricted to a chosen decision threshold.

For an introduction to the ROC curve, see G. Cormack and T. Lynam, TREC 2005 Spam Track Overview, The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings.

The most direct and effective way to improve model performance is to optimize the evaluation measure itself. Traditional classification methods take minimizing the number of classification errors as their objective, which makes the filtering model's optimization objective inconsistent with the evaluation measure of the information filtering problem and biases the optimization result.

Summary of the invention

To solve the problems of existing information filtering models, namely that the optimization objective is inconsistent with the evaluation measure of the filtering problem, that the optimization result is biased, and that performance is limited, the present invention proposes an information filtering system based on an ordering strategy.

The information filtering system based on an ordering strategy of the present invention consists of a trainer, a filter, and a feature weight library, wherein:

The feature weight library stores the features of information units and their weights;

The trainer adjusts and updates the features and weights in the feature weight library according to the user's feedback;

The filter performs feature extraction on each received information unit to obtain its feature information, and identifies the received information unit on the basis of the features in the feature weight library, classifying it as normal information or abnormal information;
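The division of labor among the three components can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class names, the whitespace-token feature extraction, the additive update rule in Trainer.feedback, and the example texts are all hypothetical placeholders, and the update rule in particular is not the learning algorithm of the invention.

```python
from typing import Dict, List


class FeatureWeightLibrary:
    """Stores features of information units and their weights."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}


def extract_features(unit: str) -> List[str]:
    """Hypothetical feature extraction: distinct lower-cased tokens."""
    return sorted(set(unit.lower().split()))


class Filter:
    """Scores a unit from the library and compares it with a threshold."""

    def __init__(self, library: FeatureWeightLibrary, threshold: float) -> None:
        self.library = library
        self.threshold = threshold

    def score(self, unit: str) -> float:
        # Unknown features contribute a weight of zero.
        return sum(self.library.weights.get(f, 0.0)
                   for f in extract_features(unit))

    def is_abnormal(self, unit: str) -> bool:
        return self.score(unit) >= self.threshold


class Trainer:
    """Adjusts the library from user feedback (placeholder update rule)."""

    def __init__(self, library: FeatureWeightLibrary, step: float = 1.0) -> None:
        self.library = library
        self.step = step

    def feedback(self, unit: str, abnormal: bool) -> None:
        # Placeholder: raise weights for abnormal units, lower them otherwise.
        delta = self.step if abnormal else -self.step
        for f in extract_features(unit):
            self.library.weights[f] = self.library.weights.get(f, 0.0) + delta


library = FeatureWeightLibrary()
trainer = Trainer(library)
spam_filter = Filter(library, threshold=0.5)
trainer.feedback("cheap pills now", abnormal=True)
trainer.feedback("meeting agenda attached", abnormal=False)
print(spam_filter.is_abnormal("cheap pills"))
print(spam_filter.is_abnormal("meeting agenda"))
```

The trainer and the filter share only the feature weight library, so feedback received at any time changes the filter's next decision without retraining from scratch.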

In the filter, a new information unit is identified as follows:

An information filtering model framework based on an ordering strategy is established.

Let x_i denote a positive example and x_j a negative example. x_{i,j} = (x_i, x_j) denotes a consistent ordered pair, whose target value is y'_{ij} = 1; x_{j,i} = (x_j, x_i) denotes an inconsistent ordered pair, whose target value is y'_{ij} = -1. The goal of the ordering model is to find, in the hypothesis space H, an h ∈ H that minimizes the number of inconsistent ordered pairs x_{j,i}, which gives:

Formula two:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi(w, x_{i}, x_{j}) \qquad (2)

In the formula, w denotes the feature weight vector, and Ψ(·): (x_i, x_j) → R.

Formula two is transformed following the construction of the ordering (ranking) support vector machine: taking x_i - x_j as a new feature vector x yields formula three:

Formula three:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi'(w, x_{i} - x_{j}) \qquad (3)

After the optimal parameter w is obtained from formula three, the prediction score of a new information unit x is Ψ'(w, x). The prediction score is compared with a preset threshold to judge whether the new information unit is normal information.
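The pair-difference construction can be sketched in Python as follows. This is an illustration under assumptions: the scoring function Ψ' is taken to be the linear dot product w · x, scores below the threshold are taken to mean normal information, and the function names and example vectors are hypothetical. The sketch only evaluates the summed objective for a given w; it does not solve the arg max.

```python
from itertools import product
from typing import List, Sequence

Vector = Sequence[float]


def psi_prime(w: Vector, x: Vector) -> float:
    """Assumed linear scoring function: the dot product w . x."""
    return sum(wk * xk for wk, xk in zip(w, x))


def difference_vectors(positives: Sequence[Vector],
                       negatives: Sequence[Vector]) -> List[List[float]]:
    """Build the new feature vector x_i - x_j for every (positive, negative) pair."""
    return [[a - b for a, b in zip(xi, xj)]
            for xi, xj in product(positives, negatives)]


def objective(w: Vector, positives: Sequence[Vector],
              negatives: Sequence[Vector]) -> float:
    """Sum of y'_ij * psi_prime(w, x_i - x_j), with y'_ij = 1 for these pairs."""
    return sum(psi_prime(w, d) for d in difference_vectors(positives, negatives))


def is_normal(w: Vector, x: Vector, threshold: float) -> bool:
    """Assumed decision rule: a score below the threshold means normal."""
    return psi_prime(w, x) < threshold


# Hypothetical two-dimensional feature vectors.
positives = [[1.0, 0.0], [1.0, 1.0]]
negatives = [[0.0, 1.0]]
w = [2.0, -1.0]
print(objective(w, positives, negatives))
print(is_normal(w, [0.0, 1.0], threshold=0.0))
```

With a linear Ψ', the score of a difference vector equals the difference of the two scores, so a larger objective means positive examples are scored further above negative ones.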

The present invention also provides another information filtering system based on an ordering strategy, consisting of a trainer, a filter, and a feature weight library, wherein:

In the filter, a new information unit is identified as follows:

Formula two:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi(w, x_{i}, x_{j}) \qquad (2)

In the formula, w denotes the feature weight vector. Ψ(w, x_i, x_j) is defined as Ψ'(w, x_i) - Ψ'(w, x_j), that is, the difference between the scores of the two classes of information units. Let Ψ(w, x_i, x_j) = sgn[Ψ'(w, x_i) - Ψ'(w, x_j)], where sgn(x) is the sign function: sgn(x) = 1 when x ≥ 0, and sgn(x) = -1 otherwise. Transforming formula two then gives:

Formula five:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} \operatorname{sgn}\{ y'_{ij} \cdot [\Psi'(w, x_{i}) - \Psi'(w, x_{j})] \} \qquad (5)
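As a worked step, the objective of formula five can be related to a count of wrongly ordered pairs. The step assumes that every (x_i, x_j) in the double sum pairs a positive example with a negative example, so that y'_{ij} = 1, and that there are m positive and n negative examples. The symbol S is introduced only for this illustration.

```latex
% S: number of pairs with \Psi'(w, x_i) < \Psi'(w, x_j)  (illustrative symbol)
% The remaining mn - S pairs have \Psi'(w, x_i) \ge \Psi'(w, x_j), so sgn = +1.
\sum_{i} \sum_{j} \operatorname{sgn}\{ y'_{ij} \cdot [\Psi'(w, x_{i}) - \Psi'(w, x_{j})] \}
  = (mn - S) \cdot (+1) + S \cdot (-1)
  = mn - 2S
```

Under these assumptions, maximizing the sum in formula five is equivalent to minimizing S, the number of pairs in which the positive example scores lower than the negative example. Because sgn(0) = 1, tied pairs count as correctly ordered.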

Based on formula five, and in combination with the logistic regression model, Ψ(w, x_i, x_j) is defined as:

Formula six:

\Psi(w, x_{i}, x_{j}) = \frac{\exp(w \cdot x_{i})}{1 + \exp(w \cdot x_{i})} - \frac{\exp(w \cdot x_{j})}{1 + \exp(w \cdot x_{j})} \qquad (6)

Let

f(w, x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)},

then:

Formula seven:

\frac{\partial \Psi}{\partial w} = \frac{\partial f(w, x_{i})}{\partial w} - \frac{\partial f(w, x_{j})}{\partial w} = f(w, x_{i}) \cdot (1 - f(w, x_{i})) \cdot x_{i} - f(w, x_{j}) \cdot (1 - f(w, x_{j})) \cdot x_{j} \qquad (7)
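For reference, the per-term derivative used in formula seven follows from the quotient rule applied to f(w, x). The derivation below is a standard calculus step written out for a single example x.

```latex
\frac{\partial f(w, x)}{\partial w}
  = \frac{\exp(w \cdot x) \, [1 + \exp(w \cdot x)] - \exp(w \cdot x) \, \exp(w \cdot x)}
         {[1 + \exp(w \cdot x)]^{2}} \cdot x
  = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)} \cdot \frac{1}{1 + \exp(w \cdot x)} \cdot x
  = f(w, x) \cdot (1 - f(w, x)) \cdot x
```

The last line uses 1 - f(w, x) = 1 / (1 + \exp(w \cdot x)). Applying it to x_i and to x_j and subtracting gives the two terms of formula seven.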

Formula six gives the online ordering logistic regression learning algorithm oriented toward 1-ROCA optimization.

Using formula seven, the parameter weight vector w can be updated and obtained by the gradient descent method. The new information unit is then scored accordingly, and the prediction score is compared with a preset threshold to judge whether the new information unit is normal information.
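A single gradient step on one (positive, negative) pair can be sketched in Python as follows. This is a sketch under assumptions rather than the invention's exact procedure: the step is taken in the direction that increases Ψ of formula six, and the learning rate eta, the function names, and the example vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def f(w: Vector, x: Vector) -> float:
    """Logistic score exp(w.x) / (1 + exp(w.x)), computed in a stable form."""
    z = sum(wk * xk for wk, xk in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def psi(w: Vector, x_i: Vector, x_j: Vector) -> float:
    """Formula six: score difference between a positive and a negative example."""
    return f(w, x_i) - f(w, x_j)


def grad_psi(w: Vector, x_i: Vector, x_j: Vector) -> List[float]:
    """Formula seven: gradient of psi with respect to w."""
    fi, fj = f(w, x_i), f(w, x_j)
    return [fi * (1 - fi) * a - fj * (1 - fj) * b for a, b in zip(x_i, x_j)]


def update(w: Vector, x_i: Vector, x_j: Vector, eta: float = 0.5) -> List[float]:
    """One step along the gradient (assumed direction: increase psi)."""
    return [wk + eta * gk for wk, gk in zip(w, grad_psi(w, x_i, x_j))]


# Hypothetical pair: x_pos is a positive example, x_neg a negative one.
w0 = [0.0, 0.0]
x_pos, x_neg = [1.0, 0.0], [0.0, 1.0]
w1 = update(w0, x_pos, x_neg)
print(w1)
print(psi(w0, x_pos, x_neg), psi(w1, x_pos, x_neg))
```

After the step, the positive example scores higher than the negative one, so the pair that was tied under w0 becomes a consistent ordered pair under w1.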

The 1-ROCA measure can be obtained by integrating the ROC curve, but this computation is relatively complex. It can also be obtained from formula one:

1 - \mathrm{ROCA} = \frac{\mathrm{SwappedPairs}}{mn} \qquad (1)

where m and n are the numbers of positive and negative examples, respectively. For spam filtering, for example, a positive example is a spam message (spam for short) and a negative example is a normal message (ham message, ham for short). In the set of ordered pairs (ham, spam) formed by any ham and any spam, a pair in which the spam scores lower than the ham is wrongly ordered and is defined as an inconsistent ordered pair (swapped pair); otherwise it is a consistent ordered pair. According to formula one, improving the filter's performance requires minimizing the number of inconsistent ordered pairs, not reducing the number of misclassifications. Because optimizing 1-ROCA requires minimizing inconsistent ordered pairs, which is exactly what pair-based ordering methods do, the optimization of 1-ROCA is better solved with an ordering model than with a classification model.
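Formula one can be computed directly by counting swapped pairs, as in the Python sketch below. The two sets of scores, the threshold of 0.5, and the function names are hypothetical; they are chosen so that two filters make the same number of classification errors yet differ in 1-ROCA.

```python
from typing import Sequence


def one_minus_roca(spam_scores: Sequence[float],
                   ham_scores: Sequence[float]) -> float:
    """Formula one: swapped pairs divided by m * n.

    A pair is swapped when the spam (positive) scores lower than the ham.
    """
    swapped = sum(1 for s in spam_scores for h in ham_scores if s < h)
    return swapped / (len(spam_scores) * len(ham_scores))


def error_count(spam_scores: Sequence[float], ham_scores: Sequence[float],
                threshold: float) -> int:
    """Number of misclassifications at one threshold (score >= threshold is spam)."""
    missed_spam = sum(1 for s in spam_scores if s < threshold)
    blocked_ham = sum(1 for h in ham_scores if h >= threshold)
    return missed_spam + blocked_ham


# Hypothetical scores from two filters on three spam and three ham messages.
filter_a = ([0.9, 0.7, 0.4], [0.8, 0.3, 0.1])
filter_b = ([0.9, 0.7, 0.45], [0.55, 0.3, 0.1])
for name, (spam, ham) in (("A", filter_a), ("B", filter_b)):
    print(name, error_count(spam, ham, 0.5), one_minus_roca(spam, ham))
```

Both filters make two errors at the threshold, but filter A has two swapped pairs out of nine and filter B only one, which illustrates why minimizing the error count does not by itself minimize 1-ROCA.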

The information unit in the present invention can be any electronic information, such as an e-mail message, a short message, or web page information.

Normal information in the present invention generally means legitimate information or information the user is interested in; abnormal information generally means illegal information or information the user is not interested in.

Unlike previous information filtering methods, the method of the present invention optimizes the core evaluation measure 1-ROCA and introduces an ordering model to solve the information filtering problem. The key problems solved by the invention include:

(1) Construction method of the information filtering model based on an ordering strategy

The basic idea of the invention is to optimize the core evaluation measure 1-ROCA by converting the information filtering problem into an ordering problem. A new method for constructing the filtering model is studied, with formal definitions and formulas, so as to establish an information filtering model based on an ordering strategy.

(2) Online ordering logistic regression learning algorithm for information filtering

Information filtering requires an ordering algorithm that is high-performing, fast, and low in storage, and existing ordering algorithms cannot meet these requirements. The invention proposes an ordering logistic regression learning algorithm to address this, and further proposes an online ordering logistic regression learning algorithm oriented toward 1-ROCA optimization to solve the performance degradation caused by the drift of information unit scores during online filtering.

(3) Excessive computation in model parameter optimization

After the online ordering logistic regression learning algorithm is adopted, the amount of computation increases markedly, and unless this is addressed it will hinder application of the model. The invention trains only on ordered pairs formed from recent information units, and combines this with the TONE (Train On or Near Error) algorithm, which solves the problem of heavy computation.

The advantages of the present invention are:

(1) A new method of modeling information filtering based on optimizing the evaluation index is proposed, and a basic framework for information filtering based on an ordering strategy is studied. Replacing the traditional classification model with an ordering model avoids the inconsistency between the model's optimization objective and the evaluation index of the filtering problem; this is a new line of thinking and exploration in information filtering research;

(2) On the basis of the established filtering model framework, a new ordering algorithm suited to information filtering is studied: an ordering logistic regression learning algorithm is proposed to solve the ordering problem, and, to address the performance degradation caused by the drift of information unit scores during online filtering, an online ordering logistic regression learning algorithm oriented to 1-ROCA optimization is further proposed;

(3) A parameter weight update algorithm based on the TONE strategy and a resampling technique are proposed and used together to solve the problem of excessive computation in parameter optimization and to meet the online, real-time requirements of the filtering model.

The method of the invention not only provides a solution strategy and supporting technology for information filtering problems such as spam filtering, network information filtering and mobile-phone junk short message filtering, but also offers a new approach to the many binary classification problems whose optimization objective is 1-ROCA, giving an important reference for problems such as medical diagnosis. It will also promote the development of ordering models.

Description of drawings

Fig. 1 is a structural block diagram of the information filtering system of the present invention.

Detailed description of embodiments

Embodiment one: the information filtering system based on an ordering strategy described in this embodiment comprises a feature weight library, a trainer and a filter, wherein:

Let x_i denote a positive example and x_j a negative example. x_{i,j} = (x_i, x_j) denotes a consistent ordered pair, whose target value is y'_{ij} = 1; x_{j,i} = (x_j, x_i) denotes an inconsistent ordered pair, whose target value is y'_{ij} = -1. The goal of the ordering model is to find an h ∈ H in the hypothesis space H that minimizes the inconsistent ordered pairs x_{j,i}, which gives:

Formula two:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi(w, x_{i}, x_{j}) \qquad (2)

Formula two is transformed in the manner used to construct the ranking support vector machine in T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002: taking x_i - x_j as the new feature vector x yields formula three:

Formula three:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi'(w, x_{i} - x_{j}) \qquad (3)

The preset threshold is set according to the specific situation; in logistic regression models, for example, the threshold is usually set to 0.5.

The above method of identifying a new information unit can further be implemented with a logistic regression model:

Formula four:

\Psi(w, x_{i}, x_{j}) = \frac{\exp(w \cdot (x_{i} - x_{j}))}{1 + \exp(w \cdot (x_{i} - x_{j}))} \qquad (4)

where the parameter weight vector w is updated using an existing gradient-descent-based weight update method.
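Formula four can be made concrete with a minimal Python sketch that evaluates the logistic function on the difference vector x_i - x_j. The helper names `sigmoid` and `pair_score`, the dense list representation of feature vectors and the sample values are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable evaluation of exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pair_score(w, x_i, x_j):
    """Formula four: logistic function of w . (x_i - x_j)."""
    diff = [a - b for a, b in zip(x_i, x_j)]   # new feature vector x_i - x_j
    return sigmoid(sum(wk * dk for wk, dk in zip(w, diff)))


w = [1.0, -1.0]                       # hypothetical weight vector
spam, ham = [1.0, 0.0], [0.0, 1.0]    # hypothetical positive / negative examples
print(pair_score(w, spam, ham))       # above 0.5: the pair is ordered consistently
print(pair_score(w, ham, spam))       # below 0.5: the reversed pair
```

Because only the difference x_i - x_j enters the formula, the scores of the two orderings of a pair always sum to 1, and nothing in the formula constrains the individual scores of x_i and x_j.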

The information unit described in this embodiment may be any electronic information, such as an e-mail message, a short message or web page information.

In this embodiment, normal information generally refers to information that is legitimate or of interest to the user, whereas abnormal information generally refers to information that is illegitimate or of no interest to the user.

Formula four in this embodiment follows the existing Ranking SVM approach to defining an ordering algorithm, taking the difference between the features of samples from the two classes as the feature value of a new sample.

This embodiment extends the traditional logistic regression model so that it can solve the ordering problem and thereby meet the requirements of information filtering with 1-ROCA as the evaluation index.

Formula four in this embodiment has no mechanism for keeping the scores of information units balanced, so the scores drift toward one side. In information filtering the filter cannot change a judgment it has already made, so this score drift raises the 1-ROCA value, that is, it degrades the performance of the filter.

Embodiment two: to solve the problem present in the information filtering system based on an ordering strategy of embodiment one, this embodiment provides another such system that further refines it, specifically:

Ψ(w, x_i, x_j) is defined as Ψ'(w, x_i) - Ψ'(w, x_j), that is, the difference between the scores of the information units of the two classes. Let Ψ(w, x_i, x_j) = sgn[Ψ'(w, x_i) - Ψ'(w, x_j)], where sgn(x) is the sign function: sgn(x) = 1 when x >= 0, and sgn(x) = -1 otherwise.

Formula two can then be rewritten as:

Formula five:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} \operatorname{sgn}\{ y'_{ij} \cdot [\Psi'(w, x_{i}) - \Psi'(w, x_{j})] \} \qquad (5)

Formula six:

\Psi(w, x_{i}, x_{j}) = \frac{\exp(w \cdot x_{i})}{1 + \exp(w \cdot x_{i})} - \frac{\exp(w \cdot x_{j})}{1 + \exp(w \cdot x_{j})} \qquad (6)

Let

f(w, x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)},

Then:

Formula seven:

\frac{\partial \Psi}{\partial w} = \frac{\partial f(w, x_{i})}{\partial w} - \frac{\partial f(w, x_{j})}{\partial w} = f(w, x_{i}) \cdot (1 - f(w, x_{i})) \cdot x_{i} - f(w, x_{j}) \cdot (1 - f(w, x_{j})) \cdot x_{j} \qquad (7)

According to formula seven, the parameter weight vector w can be updated and obtained by the gradient descent method. New information units are then predicted accordingly, and whether a new information unit is abnormal information is judged by comparing its prediction score with the preset threshold.

Formula seven shows that the feature weights are adjusted in a way that balances the target values of the two classes, which effectively prevents bias in the model optimization result and preserves the symmetry of the two classes of target values.
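The step from the definition of f(w, x) to formula seven is the standard derivative of the logistic function. Written out with the quotient rule, for a single information unit x:

```latex
\frac{\partial f(w,x)}{\partial w}
  = \frac{e^{w \cdot x}\,(1 + e^{w \cdot x}) - e^{w \cdot x}\, e^{w \cdot x}}{(1 + e^{w \cdot x})^{2}}\, x
  = \frac{e^{w \cdot x}}{1 + e^{w \cdot x}} \cdot \frac{1}{1 + e^{w \cdot x}}\, x
  = f(w,x)\,\bigl(1 - f(w,x)\bigr)\, x
```

Applying this to x_i and to x_j and subtracting gives the two terms of formula seven. The factor f(1 - f) is largest when the score is 0.5 and tends to 0 as the score approaches 0 or 1, so each unit contributes to the update in proportion to how uncertain its own score still is.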

After the optimal parameter w has been obtained from formula seven, for an information unit x of unknown class, Ψ'(w, x) is the score the model predicts for it.
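The prediction step can be sketched in Python as follows, assuming that Ψ'(w, x) is the logistic function f(w, x) defined above and that the feature weight library is held as a dictionary from feature to weight. The names `predict_score` and `is_abnormal`, the dictionary layout, and the convention that a score at or above the threshold means abnormal information are all assumptions; the text only states that the score is compared with a preset threshold.

```python
import math


def predict_score(weights, features):
    """Score of one information unit: logistic function of w . x over sparse features."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def is_abnormal(weights, features, threshold=0.5):
    """Assumed convention: a score at or above the threshold marks abnormal information."""
    return predict_score(weights, features) >= threshold


# Hypothetical feature weight library and incoming information unit.
weight_library = {"buy": 2.0, "meeting": -1.5}
unit = {"buy": 1.0, "now": 1.0}        # "now" is unseen, so it contributes 0
print(predict_score(weight_library, unit), is_abnormal(weight_library, unit))
```

Features absent from the weight library are treated as having weight 0, which keeps the lookup cheap for sparse inputs.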

In the online learning process, a greedy algorithm is used: the filtered samples are sampled in time order, and training samples are selected from among the latest m information units.

Under the ordering framework of the filtering model, the number of ordered pairs formed from any two training samples is huge, and computing over all ordered pairs directly would be prohibitively expensive and very inefficient. The invention therefore uses the existing TONE (Train On or Near Error) strategy to reduce the computation of model training and, by means of the greedy algorithm, samples the filtered samples in time order and selects training samples only from among the latest m information units. This avoids having every information unit in the set take part in the computation, which further reduces the amount of computation and further increases the training speed of the model.
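One way to restrict training to the latest m information units is a fixed-size window, sketched below in Python. The class name `RecentWindow`, its methods and the choice to pair every spam in the window with every ham in the window are illustrative assumptions; the text does not specify how pairs are drawn from the latest m units.

```python
from collections import deque
from itertools import product


class RecentWindow:
    """Keeps only the latest m labelled information units (hypothetical helper)."""

    def __init__(self, m):
        if m <= 0:
            raise ValueError("m must be a positive integer")
        self.units = deque(maxlen=m)   # the oldest unit is dropped automatically

    def add(self, features, is_spam):
        self.units.append((features, is_spam))

    def pairs(self):
        """All (spam, ham) ordered pairs that can be formed inside the window."""
        spam = [x for x, label in self.units if label]
        ham = [x for x, label in self.units if not label]
        return list(product(spam, ham))


window = RecentWindow(m=3)
for features, label in [([1.0], True), ([2.0], False), ([3.0], True), ([4.0], False)]:
    window.add(features, label)
print(window.pairs())   # the first unit has left the window
```

With a window of m units the number of pairs is bounded by roughly m²/4 regardless of how many units have been filtered so far, which is what keeps the per-message training cost constant.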

Embodiment three: this embodiment further describes the process, in the information filtering system based on an ordering strategy of embodiment two, of updating the parameter weight vector w according to formula seven and the gradient descent method. The specific process is:

Step Q1: initialize the weight w to be updated to 0;

Step Q2: for each training sample, that is, each abnormal information-normal information ordered pair, perform steps Q3 to Q5:

Step Q3: compute gap = Ψ(w, x_i, x_j);

Step Q4: judge whether gap is less than the set threshold TONE; if so, perform step Q5; otherwise return to step Q3 and obtain the gap of the next abnormal information-normal information ordered pair;

Step Q5: compute

Δw = (1 - gap) * TRAIN_RATE * \frac{\partial \Psi}{\partial w}

where TRAIN_RATE denotes the learning rate of the algorithm;

Step Q6: accumulate the weight change Δw obtained in step Q5 into the parameter weight vector w: w += Δw; then return to step Q3 and process the next abnormal information-normal information ordered pair.

The above process can be implemented by the following program:

Initialize: w = 0

parameters: TRAIN_RATE, TONE

for each spam-ham pair x_i and x_j: {

    calculate gap = Ψ(w, x_i, x_j)

    if gap < TONE then {

        Δw = (1 - gap) * TRAIN_RATE * \frac{\partial \Psi}{\partial w}

        w += Δw

    }

}
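The program above can be sketched as runnable Python under a few stated assumptions: Ψ is taken from formula six and its gradient from formula seven, feature vectors are dense lists of equal length, and the default values of `train_rate` and `tone` as well as the `epochs` parameter are hypothetical (the listing makes a single pass over the pairs). The sketch follows the listing's `w += Δw` literally.

```python
import math


def f(w, x):
    """Logistic score exp(w . x) / (1 + exp(w . x)), evaluated stably."""
    z = sum(wk * xk for wk, xk in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def psi(w, x_i, x_j):
    """Formula six: score of the spam unit minus score of the ham unit."""
    return f(w, x_i) - f(w, x_j)


def grad_psi(w, x_i, x_j):
    """Formula seven: gradient of psi with respect to w."""
    fi, fj = f(w, x_i), f(w, x_j)
    return [fi * (1.0 - fi) * a - fj * (1.0 - fj) * b for a, b in zip(x_i, x_j)]


def train(pairs, dim, train_rate=0.1, tone=0.5, epochs=1):
    """TONE-style update: adjust w only on pairs whose gap is below the TONE threshold."""
    w = [0.0] * dim                                   # step Q1
    for _ in range(epochs):                           # epochs > 1 is an assumption
        for x_i, x_j in pairs:                        # step Q2
            gap = psi(w, x_i, x_j)                    # step Q3
            if gap < tone:                            # step Q4
                g = grad_psi(w, x_i, x_j)
                for k in range(dim):                  # steps Q5 and Q6
                    w[k] += (1.0 - gap) * train_rate * g[k]
    return w


pairs = [([1.0, 0.0], [0.0, 1.0])]    # one hypothetical (spam, ham) pair
w = train(pairs, dim=2, epochs=50)
print(w, psi(w, *pairs[0]))
```

Pairs that are already separated by at least TONE are skipped, so the work per message concentrates on pairs that are wrongly ordered or only narrowly ordered correctly.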

The filter in this embodiment may use any existing feature extraction method to obtain the feature information of a received new information unit.

The construction of the filtering model is the central factor affecting system performance; the filtering model is a simulation, abstraction and formal description of the abnormal information filtering problem. The core evaluation index of the filtering model is 1-ROCA, and 1-ROCA is directly proportional to the number of inconsistent ordered pairs, so the abnormal information filtering problem is in essence an ordering problem. Guided by this idea, the invention studies the filtering problem as an ordering problem, and designs and implements an information filtering model based on an ordering strategy.

On the basis of the established filtering model framework, a suitable ordering algorithm is needed to approximate and fit the model parameters. Information filtering requires that the ordering algorithm process large-scale data quickly, that is, its time and space complexity must not be too high, while also demanding very high accuracy. Existing ordering algorithms cannot meet these requirements, so a new solution is needed. Drawing on the application of, and experience with, a variety of ordering algorithms, the invention proposes a relatively fast and effective online ordering logistic regression learning algorithm suited to the information filtering problem, and constructs the information filtering system accordingly.

Embodiment four: the information filtering system based on an ordering strategy of this embodiment differs from that of any one of embodiments one to three in that the filter extracts features from received information units using a feature extraction method based on byte-level n-grams.

The feature extraction method based on byte-level n-grams described in this embodiment is: apply a sliding window of n bytes to the information from which features are to be extracted, and take the resulting m byte segments of length n as the feature information, where m and n are integers greater than 0.

In this embodiment, the m byte segments of length n may be selected by any of the following methods:

a. Extract m consecutive byte segments of length n from the object information as the feature information, where the (i+1)-th byte segment takes the second byte of the i-th byte segment as its first byte.

b. Extract the first m byte segments of length n in the object information as the feature information, or the last m byte segments of length n, where the (i+1)-th byte segment takes the second byte of the i-th byte segment as its first byte.

c. Extract m byte segments of length n from the object information as the feature information according to an information gain or cross-entropy statistical method.

By using byte-level n-grams to obtain feature information, this embodiment simplifies feature extraction and also enables the filter to handle images and virus-bearing e-mail, laying a foundation for a substantial improvement in the performance of the information filter.
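Selection methods a and b can be sketched in Python as a sliding window over the raw bytes of an information unit. The function name `byte_ngrams` and its parameters are illustrative assumptions; method c (selection by information gain or cross-entropy) needs corpus statistics and is not covered by this sketch.

```python
from typing import List, Optional


def byte_ngrams(data: bytes, n: int, m: Optional[int] = None,
                from_end: bool = False) -> List[bytes]:
    """Slide an n-byte window over data one byte at a time.

    With m unset, every segment is returned; otherwise the first m segments
    are returned, or the last m when from_end is True.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if m is not None and m <= 0:
        raise ValueError("m must be a positive integer")
    # Segment k+1 starts at the second byte of segment k.
    grams = [data[k:k + n] for k in range(len(data) - n + 1)]
    if m is None:
        return grams
    return grams[-m:] if from_end else grams[:m]


message = b"spam"                                   # hypothetical raw content
print(byte_ngrams(message, 2))                      # all 2-byte segments
print(byte_ngrams(message, 2, m=2))                 # first two segments
print(byte_ngrams(message, 2, m=2, from_end=True))  # last two segments
```

Because the window operates on bytes rather than words, the same routine applies unchanged to text in any encoding and to binary attachments, which is the property the embodiment relies on for images and virus-bearing e-mail.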

Claims

1. An information filtering system based on an ordering strategy, consisting of a training model, a filter and a feature weight library, wherein:

Formula two:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi(w, x_{i}, x_{j}) \qquad (2)

Formula three:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi'(w, x_{i} - x_{j}) \qquad (3)

2. The information filtering system based on an ordering strategy according to claim 1, characterized in that the filter extracts features from received information units using a feature extraction method based on byte-level n-grams.

3. The information filtering system based on an ordering strategy according to claim 1, characterized in that the feature extraction method based on byte-level n-grams is: apply a sliding window of n bytes to the information from which features are to be extracted, and take the resulting m byte segments of length n as the feature information, where m and n are integers greater than 0.

4. The information filtering system based on an ordering strategy according to claim 1, characterized in that the m byte segments of length n are m consecutive byte segments of length n in the object information, where the (i+1)-th byte segment takes the second byte of the i-th byte segment as its first byte.

5. An information filtering system based on an ordering strategy, consisting of a training model, a filter and a feature weight library, wherein:

Formula two:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} y'_{ij} \cdot \Psi(w, x_{i}, x_{j}) \qquad (2)

In the formula, w denotes the feature weight vector. Ψ(w, x_i, x_j) is defined as Ψ'(w, x_i) - Ψ'(w, x_j), that is, the difference between the scores of the information units of the two classes. Let Ψ(w, x_i, x_j) = sgn[Ψ'(w, x_i) - Ψ'(w, x_j)], where sgn(x) is the sign function: sgn(x) = 1 when x >= 0, and sgn(x) = -1 otherwise. Formula two is then transformed to obtain:

Formula five:

h'_{w}(\bar{x}) = \arg\max \sum_{i} \sum_{j} \operatorname{sgn}\{ y'_{ij} \cdot [\Psi'(w, x_{i}) - \Psi'(w, x_{j})] \} \qquad (5)

Formula six:

\Psi(w, x_{i}, x_{j}) = \frac{\exp(w \cdot x_{i})}{1 + \exp(w \cdot x_{i})} - \frac{\exp(w \cdot x_{j})}{1 + \exp(w \cdot x_{j})} \qquad (6)

Let

f(w, x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)},

Then:

Formula seven:

\frac{\partial \Psi}{\partial w} = \frac{\partial f(w, x_{i})}{\partial w} - \frac{\partial f(w, x_{j})}{\partial w} = f(w, x_{i}) \cdot (1 - f(w, x_{i})) \cdot x_{i} - f(w, x_{j}) \cdot (1 - f(w, x_{j})) \cdot x_{j} \qquad (7)

6. The information filtering system based on an ordering strategy according to claim 5, characterized in that the process of updating the parameter weight vector w according to formula seven and the gradient descent method is:

Step Q1: initialize the weight w to be updated to 0;

Step Q3: compute gap = Ψ(w, x_i, x_j);

Step Q5: compute Δw = (1 - gap) * TRAIN_RATE * \frac{\partial \Psi}{\partial w}, where TRAIN_RATE denotes the learning rate of the algorithm;

7. The information filtering system based on an ordering strategy according to claim 5, characterized in that the filter extracts features from received information units using a feature extraction method based on byte-level n-grams.

8. The information filtering system based on an ordering strategy according to claim 7, characterized in that the feature extraction method based on byte-level n-grams is: apply a sliding window of n bytes to the information from which features are to be extracted, and take the resulting m byte segments of length n as the feature information, where m and n are integers greater than 0.

9. The information filtering system based on an ordering strategy according to claim 8, characterized in that the m byte segments of length n are m consecutive byte segments of length n in the object information, where the (i+1)-th byte segment takes the second byte of the i-th byte segment as its first byte.

10. The information filtering system based on an ordering strategy according to claim 8, characterized in that the m byte segments of length n are the first m byte segments of length n in the object information, or the last m byte segments of length n.