CN101097570A

CN101097570A - Advertisement classification method capable of automatically recognizing the classified advertisement type

Info

Publication number: CN101097570A
Application number: CNA2006100283059A
Authority: CN
Inventors: 陈壮坚; 徐丽
Original assignee: SHANGHAI VEKEE ADVERTISEMENT CO Ltd
Current assignee: SHANGHAI VEKEE ADVERTISEMENT CO Ltd
Priority date: 2006-06-29
Filing date: 2006-06-29
Publication date: 2008-01-02

Abstract

The invention relates to an advertisement classification method that automatically identifies the type of a classified advertisement. The method is programmed in the Java language and runs on a computer. It introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords can be classified automatically in a single pass. The advantage of the invention is that it determines the type of an advertisement from the submitted advertisement title and content, and improves classification accuracy.

Description

An advertisement classification method that automatically recognizes the classified advertisement type

Technical field

The present invention relates to an advertisement classification method that automatically recognizes the classified advertisement type. With this method, the user does not need to select the advertisement type when placing a classified advertisement. The invention belongs to the technical field of advertisement classification methods.

Background technology

Classified advertising is a new form of advertising that has grown up only in recent years, and some of the problems it raises have not been solved in a timely manner. Many of the procedures used to place traditional commercial advertisements are still used for classified advertisements, but because the advertisement types differ, the traditional procedures are no longer suitable. The most prominent problem in placing a classified advertisement is selecting its type. Classified advertisements have many types, and the types are updated quickly, so a user placing an advertisement does not necessarily know which type it belongs to. If the wrong category is selected, the advertisement becomes less effective or even useless.

At present, advertisement classification still relies mainly on manual judgment. Fig. 1 is a schematic flow chart of the manual classification method: when placing a classified advertisement at the client, the user must judge from experience which type the advertisement belongs to, then enter the title and content of the advertisement, and finally submit it to the database. For example, if the content of the user's advertisement is "office building for rent", the user judges from experience that the advertisement belongs to the house lease type.

This method, which relies on subjective judgment, rarely goes wrong when there are few advertisement types and the types in use are common ones, such as recruitment and job hunting or house lease. In the future, however, there will be more and more types of classified advertisement and the categories will become finer and finer, so the accuracy of manual judgment will fall.

Summary of the invention

The object of the invention is to provide an advertisement classification method that automatically determines the type of an advertisement from the title and content submitted by the user, and that improves classification accuracy.

To achieve this object, the technical solution of the invention provides an advertisement classification method that automatically recognizes the classified advertisement type, characterized in that it is programmed in the Java language, uses MySQL as the database, runs on a computer, and introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords are classified automatically in a single pass. When executed on a computer, the classification method comprises the following steps in order:

One, in the learning phase:

Step 1: Input the advertisement type set. Create the advertisement type database, add each advertisement type to it, and build the original advertisement type set. Create two data tables in the advertisement type database: an advertisement type table and a keyword table.

Step 2: Determine the attribute unit and the type of linear classifier to use. The classifier used in this method is a linear classifier based on the improved vector space model;

Step 3: Preprocess the advertisement type set. Preprocessing includes Chinese sentence segmentation or English stemming, synonym merging, and so on;

Step 4: Attribute extraction: index the advertisement type set to obtain the original attribute set and the frequency vector of each advertisement type. An advertisement type is denoted by D and an attribute term by T. An attribute term is a basic language unit that appears in an advertisement type and can represent that type; it consists mainly of words or phrases. An advertisement type can be expressed by its set of attribute terms as D (T1, T2, ..., Tn), where Tk is an attribute term, 1 <= k <= n;

Step 5: Apply an existing dimensionality reduction operation to the original attribute set, by frequency and weight, to obtain the attribute set. The basic idea of dimensionality reduction is to position the feature vectors in the reduced space iteratively so that the distances and dissimilarities between them are preserved as far as possible. To this end, the following squared-error measure must be reduced continually during the iteration:

P = \sum [d^{*} (x_{i}, x_{j}) - f (d (x_{i}, x_{j}))]^{2},

where x_{i} and x_{j} are any pair of different samples (i ≠ j), d (x_{i}, x_{j}) is the original dissimilarity between x_{i} and x_{j}, d^{*} (x_{i}, x_{j}) is the dissimilarity in the lower-dimensional space after the transformation, and f is a monotonic transformation function;
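The squared-error measure P of Step 5 can be illustrated with a minimal Java sketch. The class and method names are hypothetical, Euclidean distance is assumed as the dissimilarity d (the text does not name one), and each unordered pair is counted once.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch of the squared-error measure P; not the patent's own code. */
final class StressMeasure {
    /** Euclidean distance, assumed here as the dissimilarity d. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** P = sum over pairs i < j of [d*(x_i, x_j) - f(d(x_i, x_j))]^2. */
    static double stress(double[][] original, double[][] reduced, DoubleUnaryOperator f) {
        double p = 0.0;
        for (int i = 0; i < original.length; i++) {
            for (int j = i + 1; j < original.length; j++) {
                double d = distance(original[i], original[j]);    // original dissimilarity
                double dStar = distance(reduced[i], reduced[j]);  // dissimilarity after reduction
                double err = dStar - f.applyAsDouble(d);
                p += err * err;
            }
        }
        return p;
    }
}
```

An iterative dimensionality reduction would move the reduced vectors so that this value decreases; a value of 0 means every pairwise dissimilarity is preserved exactly under f.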

Step 6: Taking the type as the unit, merge the frequency vectors of the individual advertisements to obtain the profile description frequency vector of each type;

Step 7: For an advertisement type containing n attribute values, each attribute is usually given a weight that represents its importance, i.e. D = D (T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D (W1, W2, ..., Wn), which is the vector representation of advertisement type D, where Wk is the weight of Tk, 1 <= k <= n. The weights are calculated with the term frequency-inverse document frequency (TF*IDF) weighting method, which uses the frequency of occurrence of a term to approximate its importance. The formula is

W_{ij} = tf (t_{i}, d_{j}) \times \log \frac{m}{df (t_{i})},

where W_{ij} is the weight of term i in advertisement (document) j, tf (t_{i}, d_{j}) is the frequency with which term i occurs in advertisement j, df (t_{i}) is the number of advertisements that contain term i, and m is the total number of advertisements;
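As an illustration of the TF*IDF weighting in Step 7, the following Java sketch computes W_{ij} for one segmented advertisement against a small corpus. The class and method names are hypothetical, and the natural logarithm is assumed because the text does not state the base.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative TF*IDF sketch; names and log base are assumptions. */
final class TfIdfWeights {
    /** W_ij = tf(t_i, d_j) * log(m / df(t_i)). */
    static double weight(int tf, int m, int df) {
        return tf * Math.log((double) m / df);
    }

    /** Weights of every term of one segmented ad, given all segmented ads. */
    static Map<String, Double> weights(List<String> ad, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : ad) {
            tf.merge(term, 1, Integer::sum);           // term frequency in this ad
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = 0;                                  // number of ads containing the term
            for (List<String> doc : corpus) {
                if (doc.contains(e.getKey())) df++;
            }
            if (df > 0) {
                result.put(e.getKey(), weight(e.getValue(), corpus.size(), df));
            }
        }
        return result;
    }
}
```

A term that occurs in every advertisement receives weight 0 under this formula, which is why very common words contribute little to the classification.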

Step 8: In the vector space model, the content relevance Sim (D1, D2) between two concepts D1 and D2 is expressed by the cosine of the angle between their vectors:

Sim (D_{1}, D_{2}) = \cos θ = \frac{Σ_{k = 1}^{n} W_{1 k} \times W_{2 k}}{\sqrt{(Σ_{k = 1}^{n} W_{1 k}^{2}) (Σ_{k = 1}^{n} W_{2 k}^{2})}}
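The cosine formula of Step 8 translates directly into code. The sketch below is illustrative only: the class name is hypothetical, and returning 0 for an all-zero vector is an assumption, since the formula is undefined in that case.

```java
/** Illustrative cosine similarity between two weight vectors of equal length. */
final class CosineSimilarity {
    static double sim(double[] w1, double[] w2) {
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
        for (int k = 0; k < w1.length; k++) {
            dot += w1[k] * w2[k];
            norm1 += w1[k] * w1[k];
            norm2 += w2[k] * w2[k];
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0; // assumption: an empty vector is similar to nothing
        }
        return dot / Math.sqrt(norm1 * norm2);
    }
}
```

Vectors that share no attributes give 0, and vectors that point in the same direction give 1, regardless of their lengths.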

Step 9: Construct the corresponding linear classifier according to the following formula:

y (\vec{x}, c_{j}) = \sum_{\vec{d}_{i} \in kNN} sim (\vec{x}, \vec{d}_{i}) y (\vec{d}_{i}, c_{j}) - b_{j},

where y (\vec{x}, c_{j}) indicates whether concept \vec{x} belongs to category c_{j}; y (\vec{d}_{i}, c_{j}) takes the value 0 or 1 and indicates whether concept \vec{d}_{i} belongs to category c_{j}; sim (\vec{x}, \vec{d}_{i}) is the degree of similarity between the test concept \vec{x} and the neighboring concept \vec{d}_{i}, computed with the vector space model similarity formula of Step 8; and b_{j} is the threshold of the category, for which a good value can only be found by repeated tuning;
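The decision function of Step 9 can be sketched in Java as follows. All names are hypothetical. The sketch assumes that the k nearest neighbors are chosen by cosine similarity and that a positive score means the test vector is assigned to the type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative kNN score y(x, c_j) = sum of sim(x, d_i) * y(d_i, c_j) over the k nearest d_i, minus b_j. */
final class KnnScorer {
    /** A labelled training vector (hypothetical helper type). */
    static final class Sample {
        final double[] weights;
        final String type;
        Sample(double[] weights, String type) {
            this.weights = weights;
            this.type = type;
        }
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            na += a[k] * a[k];
            nb += b[k] * b[k];
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    static double score(double[] x, List<Sample> training, int k, String type, double threshold) {
        List<Sample> sorted = new ArrayList<>(training);
        // Most similar samples first.
        sorted.sort(Comparator.comparingDouble((Sample s) -> -cosine(x, s.weights)));
        double sum = 0.0;
        for (Sample s : sorted.subList(0, Math.min(k, sorted.size()))) {
            if (s.type.equals(type)) {          // y(d_i, c_j) is 1 for this type, otherwise 0
                sum += cosine(x, s.weights);
            }
        }
        return sum - threshold;                 // subtract the category threshold b_j
    }
}
```

The threshold is passed in as a parameter because the text says b_{j} has to be found by repeated tuning rather than computed.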

Step 10: Following the steps of the classification phase, test the classifier obtained in the previous step using part of the test types as the types to be classified, and optimize the performance of the classifier;

End of the learning phase;

Two, in the classification phase:

Step 1: Input the advertisement (set) to be classified and save it in the query database;

Step 2: Preprocess the advertisement to be classified by the same method as in the learning phase;

Step 3: Index the advertisement to be classified according to the attribute set built in the learning phase to obtain its weight vector; see Step 7 of the learning phase;

Step 4: Calculate the weight vector of the advertisement to be classified;

Step 5: Classify automatically with the classifier (see Step 9 of the learning phase) to obtain the classification result;

End of the classification phase.

Among advertisement types, the relationship between different types falls into two cases. In the first case the two types are far apart, i.e. very dissimilar, and the keywords they use are completely different, for example house lease and education and training. To predict which of the two an advertisement belongs to, it is enough to check which type's attribute set it mainly uses. This can be done with the KNN algorithm. In the second case the types are very similar and may even use the same attribute set to describe their subject matter. Here the KNN algorithm alone cannot tell the types apart: each attribute must be measured to see which advertisement type it tends to describe, and the measurements are then combined to predict the type of the advertisement. In advertisement classification, most advertisements belong to the first case, and the second case is the most difficult.

A statistic constructed to describe some statistical property of data contains error; only as the amount of data tends to infinity does it converge to the described property with probability 1. When the amount of data is small, or the data are sparse, the error between the statistic and the true value is large. To describe advertisement content expressed in natural language, the potential attribute set is very large, whereas the set of known advertisements used for machine learning (the learning set) is relatively small. Between types that are far apart, the attribute sets used are widely dispersed, which produces a large amount of sparse data. The statistics obtained in this case are therefore unreliable, and the more complex the statistic, the larger the error. Between close types, the attributes used are relatively concentrated, so the amount of data can reach a certain scale, and the statistics obtained between these types are more reliable.

The core idea of the invention comes from text classification. The problems that text classification addresses are how to let users find the information they want as quickly as possible and how to organize and maintain massive amounts of electronic information effectively. There are many text classification methods, for example: Bayes classifiers based on probability models, rule-based decision tree/decision rule classifiers, K-nearest-neighbor classifiers based on human classification experience, linear classifiers based on class descriptions, support vector machines based on the optimal hyperplane, and classifier committees that combine several classification methods. Building on text classification, the invention introduces the KNN nearest-neighbor algorithm into the linear classifier of the improved vector space model, combining them into a new classifier.

In the linear classifier, the vector space model describes advertisement content as a vector whose elements are attributes, i.e. characters, words, strings and so on, so that the computer can operate on advertisement content with vector operations, for example computing the length of a vector or measuring the similarity between advertisements. This makes it possible to classify accurately and effectively the majority of advertisement types, which are far apart. By adopting the "improved vector space model classifier based on the KNN algorithm", the problems of existing linear classifiers based on the vector space model are overcome. Results on large-scale data show that the automatic advertisement type recognition method of the invention improves classification accuracy significantly.

The advantage of the invention is that it automatically determines the type of an advertisement from the title and content submitted by the user, and improves classification accuracy.

Description of drawings

Fig. 1 is a flow chart of the manual classification method;

Fig. 2 is a flow chart of the advertisement classification method;

Fig. 3 is a program flow chart of the learning phase;

Fig. 4 is a program flow chart of the classification phase.

Embodiment

The invention is further described below with reference to the drawings and an embodiment.

Embodiment

The equipment used for the invention comprises an advertisement handling server, advertisement word segmentation equipment, a query server, a test server, an index server, a dictionary server, and so on.

Advertisement handling server: runs the application with which users place advertisements and provides the interface to the advertisement classification equipment;

Advertisement word segmentation equipment: the server is an industrial computer or a PC with stable performance; its database stores the word segmentation records of advertisements; its word segmentation program splits an advertisement into characters or words;

Query server: runs the query program, which obtains query results from the index server. If the cache server holds no index entry for a valid word, the lookup goes to the query server, which queries the index server for the advertisement type that the valid word belongs to.

Index server: an index database is built on the index server, i.e. an index from keywords to the dictionary database. The index increases query speed, and the index database is updated whenever the dictionary changes.

Dictionary server: hosts the advertisement dictionary database, which stores the advertisement classification words; all words that serve as valid words in each advertisement type are stored in this database.

Test server: runs the test program.

Fig. 2 is the flow chart of the advertisement classification method. A user placing a classified advertisement first enters the title and content of the advertisement, for example the content "office building for rent", and then sends a classification query request, which starts the automatic classification process. After the request is submitted, the valid keywords are extracted from the advertisement content. In this advertisement there are three valid keywords: "office building", "rent" and "building". The method then checks whether these keywords are present in the cache server. If the same query has been made before and has not yet expired, the result is returned directly from the cache server. If not, the query is submitted to the query server, which obtains the corresponding category from the index database according to the query words. From the three keywords "office building", "building" and "rent", the "house lease" type is found in the index database and returned as the result. If there are many keywords and several types are found in the index database, their weights are compared and the sorted result is returned; it is also stored in the cache server for the next call. This automatic classification process removes the step in which the user selects the category, and so avoids the problem of selecting an unsuitable category.
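The lookup flow described for Fig. 2 (cache first, then index, then ranking by weight) can be sketched in Java. Everything here is hypothetical: the class name, the in-memory maps that stand in for the cache server and the index database, and the choice to add up keyword weights per type before sorting. Cache expiry is left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative cache-then-index lookup; in-memory maps stand in for the servers. */
final class TypeLookup {
    private final Map<String, Map<String, Double>> index;            // keyword -> (type -> weight)
    private final Map<String, List<String>> cache = new HashMap<>(); // query -> ranked types
    int indexQueries = 0;                                             // counts index lookups, for illustration

    TypeLookup(Map<String, Map<String, Double>> index) {
        this.index = index;
    }

    List<String> classify(List<String> keywords) {
        String query = String.join("|", keywords);
        List<String> cached = cache.get(query);
        if (cached != null) {
            return cached;                     // answered from the cache
        }
        indexQueries++;
        Map<String, Double> totals = new HashMap<>();
        for (String keyword : keywords) {
            Map<String, Double> types = index.getOrDefault(keyword, Map.of());
            types.forEach((type, weight) -> totals.merge(type, weight, Double::sum));
        }
        List<String> ranked = new ArrayList<>(totals.keySet());
        // Larger total weight ranks earlier.
        ranked.sort((a, b) -> Double.compare(totals.get(b), totals.get(a)));
        cache.put(query, ranked);              // keep the result for the next call
        return ranked;
    }
}
```

A second call with the same keywords returns the cached list without touching the index, which mirrors the role the text gives to the cache server.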

In this process, the collection of valid keywords is the most critical step of the automatic advertisement classification method: whether the valid keywords are selected accurately directly determines whether the advertisement is classified accurately. The keyword extraction method proposed by the invention is an improved vector space classification method based on the KNN algorithm. It introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords are classified automatically in a single pass. The method is programmed in the Java language and runs on a computer; when executed, it comprises the following steps in order:

Fig. 3 is the program flow chart of the learning phase.

1. Input the advertisement type set. Create the advertisement type database, add each advertisement type to it, and build the original advertisement type set. Create two data tables in the advertisement type database: an advertisement type table and a keyword table.

Table 1: Advertisement type table

Field	Data type	Key	Nullable
Type number (type_id)	Integer	Primary key	Not null
Type name (type_name)	Varchar(20)		Not null

Table 2: Keyword table

Field	Data type	Key	Nullable
Keyword number (word_id)	Integer	Primary key	Not null
Keyword name (word_name)	Varchar(50)		Not null
Advertisement type number (type_id)	Integer	Foreign key	Not null

2. Build the index database from the advertisement types. The index database provides the index from query words to the dictionary server and speeds up the user's category lookup. Whenever the dictionary server changes, the index database must be rebuilt;

3. Set up the query server. The query server obtains results from the index database according to the query words, sorts them, and returns the result. An advertisement may yield several valid keywords, and the keyword weight is the main basis on which the query server judges the advertisement type: the larger the weight of a keyword, the earlier its type is ranked, and the smaller the weight, the later its advertisement type is ranked;

4. Preprocess the advertisement types and form the original attribute set according to the classified advertisement standard. Generate the attribute frequency vector of each type; taking the type as the unit, merge the attribute frequency vectors to generate the profile description frequency vector of each type, in the format shown in Table 1; then calculate the weight vector of each type, in the format shown in Table 2. Generate the classifier, with all parameters set to 1;

5. Attribute extraction in the classification phase: first split the advertisement content into words. Different segmentation methods lead to different classifications; because classified advertisements are short, a mechanical segmentation method based on string matching is used;

6. Remove invalid words. After segmentation, an advertisement has been split into a number of words. Some are useful for classification and some are not, for example function words such as "is". This method uses chi-square weighting for dimensionality reduction;

7. Check whether the query words are in the cache server. After the invalid words have been removed, all remaining keywords are useful for classification. First check whether the valid keywords already exist in the cache server. If they do, return the result directly from the cache server. If not, submit the valid keywords to the query server, which obtains and returns the result and saves it in the cache server for next use;
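The mechanical, string-matching segmentation named in the steps above is not specified further in the text. One common variant is forward maximum matching, sketched here in Java as an assumption; the class name, the dictionary and the maximum word length are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative forward maximum matching segmenter (an assumed variant of mechanical segmentation). */
final class ForwardMaxMatch {
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }
}
```

Characters that match no dictionary entry come out as single-character tokens, which the later invalid-word removal step could then discard.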

Fig. 4 is the program flow chart of the classification phase. In the classification phase, the advertisement (set) to be classified is input and saved in the query database; the advertisement (set) to be classified is preprocessed and fed to the classifier, which classifies it automatically and outputs the type (set) it may belong to. Table 3 shows the design of the query data table;

Table 3: Query data table

Field	Data type	Key	Nullable
Number of the advertisement to be classified (ad_id)	Integer	Primary key	Not null
Title of the advertisement to be classified (ad_title)	Varchar(100)		
Content of the advertisement to be classified (ad_content)	Varchar(500)		
Valid keywords of the advertisement to be classified (ad_key)	Varchar(500)		
Type number of the advertisement to be classified (ad_type_id)	Integer	Foreign key	Not null

For example:

An advertisement has the title "Shops for rent" and the content: "Ai Jianyuan shops in the Xujiahui business district are now open for lease to the public. Shop area 5680 m²; also 1250 m² of business office space and a 2400 m² warehouse. All are welcome to inquire. Phone: 64395012, 64395072. Contacts: Mr. Zhou, Mr. Ma."

(1) Preprocess the advertisement to be classified;

(2) According to the attribute set determined in the learning phase, segment and index the advertisement to be classified. It contains 35 attributes (segmented words) in total, which occur 43 times in this advertisement. Generate the attribute frequency vector; the result is shown in Table 4.

Table 4: Attribute frequencies of the advertisement to be classified

Attribute	Frequency	Attribute	Frequency
Retail shop	3	Society	2
Recruit	2	Area	1
Rent	2	5680	1
The Xujiahui	1	M ²	3
The merchant	1	In addition	1
Circle	1	Have	1
Like	1	Commercial affairs	1
Build	1	Office	1
The garden	1	With	1
Existing	1	The room	1
Solemnly	1	The warehouse	1
Right	1	2400	1
1250	1	Welcome	1
Each	1	The boundary	1
Come	1	Negotiation	1
64395012	1	64395072	1
Week	1	Horse	1
Sir	2		
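An attribute frequency vector like the one in Table 4 can be produced by counting the segmented words of an advertisement. The Java sketch below is illustrative only; the class name is hypothetical and the input is assumed to be the output of the segmentation step.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative attribute frequency count over the segmented words of one ad. */
final class AttributeFrequency {
    static Map<String, Integer> count(List<String> tokens) {
        // LinkedHashMap keeps attributes in order of first appearance.
        Map<String, Integer> frequency = new LinkedHashMap<>();
        for (String token : tokens) {
            frequency.merge(token, 1, Integer::sum);
        }
        return frequency;
    }
}
```

The number of keys is the number of attributes and the sum of the values is the total number of occurrences, which correspond to the 35 attributes and 43 occurrences reported for the example advertisement.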

(3) Calculate the weight vector of the advertisement to be classified, using the term frequency-inverse document frequency (TF*IDF) weighting method. The TF*IDF method uses the frequency of occurrence of a term to approximate its importance. The formula is

W_{ij} = tf (t_{i}, d_{j}) \times \log \frac{m}{df (t_{i})},

where W_{ij} is the weight of term i in advertisement (document) j, tf (t_{i}, d_{j}) is the frequency with which term i occurs in advertisement j, df (t_{i}) is the number of advertisements that contain term i, and m is the total number of advertisements. The result is shown in Table 5.

Table 5: Weight vector of the advertisement to be classified

Attribute	Weight	Attribute	Weight
Retail shop	0.416225	Society	0.017321
Recruit	0.145319	Area	0.226877
Rent	0.263862	5680	0.009753
The Xujiahui	0.096646	M ²	0.078965
The merchant	0.065671	In addition	0.001356
Circle	0.023696	Have	0.096372
Like	0.002678	Commercial affairs	0.264829
Build	0.007262	Office	0.438291
The garden	0.164923	With	0.343892
Existing	0.036385	The room	0.454829
Solemnly	0.013648	The warehouse	0.035271
Right	0.002648	2400	0.003528
1250	0.005672	Welcome	0.053728
Each	0.002891	The boundary	0.002356
Come	0.042145	Negotiation	0.035728
64395012	0.000743	64395072	0.000643
Week	0.024189	Horse	0.032157
Sir	0.147895		

(4) In the vector space model, the content relevance Sim (D1, D2) between two concepts D1 and D2 is expressed by the cosine of the angle between their vectors:

Sim (D_{1}, D_{2}) = \cos θ = \frac{Σ_{k = 1}^{n} W_{1 k} \times W_{2 k}}{\sqrt{(Σ_{k = 1}^{n} W_{1 k}^{2}) (Σ_{k = 1}^{n} W_{2 k}^{2})}}

The relevance can be calculated with this formula. The corresponding linear classifier is then constructed according to the formula of the KNN algorithm:

y (\vec{x}, c_{j}) = \sum_{\vec{d}_{i} \in kNN} sim (\vec{x}, \vec{d}_{i}) y (\vec{d}_{i}, c_{j}) - b_{j},

where y (\vec{x}, c_{j}) indicates whether concept \vec{x} belongs to category c_{j}; y (\vec{d}_{i}, c_{j}) takes the value 0 or 1 and indicates whether concept \vec{d}_{i} belongs to category c_{j}; sim (\vec{x}, \vec{d}_{i}) is the degree of similarity between the test concept \vec{x} and the neighboring concept \vec{d}_{i}, obtained from the formula in (4); and b_{j} is the threshold of the category. The KNN formula determines whether each segmented word belongs to the type. After the invalid words are eliminated, the result is shown in Table 6.

Table 6: Valid attribute set

Attribute	Weight	Attribute	Weight
Retail shop	0.416225	Area	0.226877
Recruit	0.145319	Office	0.438291
Rent	0.263862	Commercial affairs	0.264829
The garden	0.164923	With	0.343892
Sir	0.147895	The room	0.454829

(5) Input the weights of the valid attributes of the advertisement to be classified (Table 6) into the classifier generated in the learning phase, classify automatically, and output the classification result.

Taking the "house lease" type as an example: all 10 valid attributes of the advertisement to be classified occur in the feature set of the "house lease" type, so the valid attribute set of the advertisement belongs to the house lease type and the advertisement is classified as "house lease". This result agrees with the actual content of the advertisement, so the machine classification is correct.

(6) To check the classification performance of the automatic advertisement classification method of the invention, 50000 advertisements to be classified were input. The classification results are shown in Table 7:

Table 7: Classification accuracy (%) of different weight calculation methods on different attribute sets

Attribute set size	KNN algorithm only	Improved vector space model only (TF*IDF)	K-nearest-neighbor classifier based on the improved vector space model
10000	74.8	58.0	84.0
20000	76.7	75.0	89.3
30000	77.5	83.0	92.3
40000	78.3	87.1	93.8
50000	78.7	89.7	95.0

As Table 7 shows, the "KNN nearest-neighbor classifier based on the improved vector space model" of the invention significantly improves advertisement classification accuracy on all attribute sets. When the attribute set contains all attributes, classification accuracy is highest at 95.0%, which is 5.3 percentage points higher than the TF*IDF method with the improved vector space model alone (89.7%) and 16.3 percentage points higher than the KNN algorithm alone (78.7%). The improved vector space model alone classifies well only when the attribute set is large; when the attribute set contains only 10000 attributes, its accuracy is very low, only 58.0%. The method of the invention, by contrast, achieves high classification accuracy on all attribute sets.

Claims

1. An advertisement classification method that automatically recognizes the classified advertisement type, characterized in that it is programmed in the Java language, runs on a computer, and introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords are classified automatically in a single pass, the classification method comprising the following steps in order when executed on a computer:

One, in the learning phase:

Step 1: input the advertisement type set;

Step 2: determine the attribute unit and the type of linear classifier to use;

Step 3: preprocess the advertisement type set;

Step 4: attribute extraction: index the advertisement type set to obtain the original attribute set and the frequency vector of each advertisement type, wherein an advertisement type is denoted by D and an attribute term by T, an attribute term being a basic language unit that appears in an advertisement type and can represent that type and consisting mainly of words or phrases, and an advertisement type can be expressed by its set of attribute terms as D (T1, T2, ..., Tn), where Tk is an attribute term, 1 <= k <= n;

Step 5: apply an existing dimensionality reduction operation to the original attribute set, by frequency and weight, to obtain the attribute set, the basic idea of dimensionality reduction being to position the feature vectors in the reduced space iteratively so that the distances and dissimilarities between them are preserved as far as possible, for which purpose the following squared-error measure must be reduced continually during the iteration:

P = \sum [d^{*} (x_{i}, x_{j}) - f (d (x_{i}, x_{j}))]^{2},

W_{ij} = tf (t_{i}, d_{j}) \times \log \frac{m}{df (t_{i})},

Sim (D_{1}, D_{2}) = \cos θ = \frac{Σ_{k = 1}^{n} W_{1 k} \times W_{2 k}}{\sqrt{(Σ_{k = 1}^{n} W_{1 k}^{2}) (Σ_{k = 1}^{n} W_{2 k}^{2})}}

Step 9: construct the corresponding linear classifier according to the following formula:

y (\vec{x}, c_{j}) = \sum_{\vec{d}_{i} \in kNN} sim (\vec{x}, \vec{d}_{i}) y (\vec{d}_{i}, c_{j}) - b_{j},

where y (\vec{x}, c_{j}) indicates whether concept \vec{x} belongs to category c_{j}; y (\vec{d}_{i}, c_{j}) takes the value 0 or 1 and indicates whether concept \vec{d}_{i} belongs to category c_{j}; sim (\vec{x}, \vec{d}_{i}) is the degree of similarity between the test concept \vec{x} and the neighboring concept \vec{d}_{i}, computed with the vector space model similarity formula mentioned in (8); and b_{j} is the threshold of the category, for which a good value can only be found by repeated tuning;

End of the learning phase;

Two, in the classification phase:

Step 1: input the advertisement (set) to be classified;

Step 5: classify automatically with the classifier (see Step 9 of the learning phase) to obtain the classification result; end of the classification phase.