CN101097570A - Advertisement classification method capable of automatic recognizing classified advertisement type - Google Patents


Info

Publication number
CN101097570A
Authority
CN
China
Prior art keywords
advertisement
adline
frequency
classification
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006100283059A
Other languages
Chinese (zh)
Inventor
陈壮坚
徐丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI VEKEE ADVERTISEMENT CO Ltd
Original Assignee
SHANGHAI VEKEE ADVERTISEMENT CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI VEKEE ADVERTISEMENT CO Ltd filed Critical SHANGHAI VEKEE ADVERTISEMENT CO Ltd
Priority to CNA2006100283059A priority Critical patent/CN101097570A/en
Publication of CN101097570A publication Critical patent/CN101097570A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an advertisement classification method that automatically identifies the type of a classified advertisement. The method is programmed in the Java language, runs on a computer, and introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords can be classified automatically. Its advantage is that it judges the type of an advertisement from the submitted title and content and improves classification accuracy.

Description

An advertisement classification method that automatically identifies the type of a classified advertisement
Technical field
The present invention relates to an advertisement classification method that automatically identifies the type of a classified advertisement. With this method, the user no longer needs to select the advertisement type when placing an advertisement. The invention belongs to the technical field of advertisement classification methods.
Background art
Classified advertising is a new form of advertising that has grown up only in recent years, and some of the problems it raises have not been solved in time. Many of the procedures used to place classified advertisements are still carried over from traditional commercial advertising, but because the advertisement types differ, the traditional methods are no longer suitable. The most prominent problem in placing a classified advertisement is choosing its type. Classified advertisements have many types and the types are updated quickly, so a user placing an advertisement does not necessarily know which type it belongs to. Choosing the wrong category weakens the effect of the advertisement or even makes it ineffective.
At present, advertisement classification still relies mainly on manual judgment. Fig. 1 is a schematic flow diagram of the manual classification method: when placing a classified advertisement at the client, the user must judge from experience which type the advertisement belongs to, then enter its title and content, and finally submit it to the database. For example, if the content of the advertisement is "office building for rent", the user judges from experience that it belongs to the "house lease" type.
Classification by subjective judgment seldom goes wrong when there are few advertisement types and the types in use are common ones, such as recruitment, job hunting and house lease. In the future, however, the types of classified advertisement will multiply and the categories will become ever finer, and the accuracy of manual judgment will then decline.
Summary of the invention
The object of the invention is to provide an advertisement classification method that automatically judges the type of an advertisement from the title and content submitted by the user, and that improves classification accuracy.
To achieve this object, the technical solution of the invention provides an advertisement classification method that automatically identifies the type of a classified advertisement, characterized in that it is programmed in the Java language, uses MySQL as the database, runs on a computer, and introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords are classified automatically in a single pass. When executed on a computer, the method comprises the following steps in order:
I. Learning phase:
Step 1: Input the advertisement type set. Create the advertisement type database and add each advertisement type to it to build the original advertisement type set. Create two data tables in the advertisement type database: the advertisement type table and the keyword table.
Step 2: Determine the attribute unit and the type of linear classifier to be used. The classifier used in this method is a linear classifier based on the improved vector space model.
Step 3: Pre-process the advertisement type set. Pre-processing includes segmentation of Chinese sentences or stemming of English words, merging of synonyms, and so on.
Step 4: Attribute extraction. Index the advertisement type set to obtain the original attribute set and the frequency vector of each advertisement type. An advertisement type is denoted D and an attribute is denoted t. An attribute is a basic linguistic unit that appears in the advertisement type and can represent it, mainly a word or a phrase. An advertisement type can be expressed by its set of attributes as D(T1, T2, ..., Tn), where Tk is an attribute, 1 <= k <= n.
Step 5: Apply an existing dimensionality-reduction operation, based on frequency or weight, to the original attribute set to obtain the attribute set. The basic idea of dimensionality reduction is to use an iterative method to position the feature vectors in the reduced space so that the distances and dissimilarities between them are preserved as far as possible. To this end, the following squared-error measure must be reduced continually during the iteration:
$$P = \sum_{i \neq j} \left[ d^{*}(x_i, x_j) - f\big(d(x_i, x_j)\big) \right]^{2}$$
where $x_i, x_j$ is any pair of distinct samples ($i \neq j$), $d(x_i, x_j)$ is the original dissimilarity between $x_i$ and $x_j$, $d^{*}(x_i, x_j)$ is the dissimilarity after transformation into the low-dimensional space, and $f$ is a monotonic transformation function.
Step 6: Taking the type as the unit, merge the frequency vectors of the individual advertisements to obtain the profile frequency vector of each type.
Step 7: For an advertisement type containing n attribute values, each attribute is usually given a weight that represents its importance, i.e. D = D(T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D(W1, W2, ..., Wn), which is the vector representation of advertisement type D, where Wk is the weight of Tk, 1 <= k <= n. The weights are computed with the term frequency-inverse document frequency (TF*IDF) weighting method, which uses the frequency of occurrence of a term to approximate its importance. The formula is
$$W_{ij} = tf(t_i, d_j) \times \log \frac{m}{df(t_i)}$$
where $W_{ij}$ is the weight of term i in advertisement (document) j, $tf(t_i, d_j)$ is the frequency with which term i occurs in advertisement j, $df(t_i)$ is the number of advertisements that contain term i, and $m$ is the total number of advertisements.
Step 8: In the vector space model, the content relevance Sim(D1, D2) between two concepts D1 and D2 is expressed by the cosine of the angle between their vectors:
$$Sim(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} \times W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^{2}\right)\left(\sum_{k=1}^{n} W_{2k}^{2}\right)}}$$
Step 9: Build the corresponding linear classifier according to the following formula:
$$y(\vec{x}, c_j) = \sum_{\vec{d}_i \in kNN} sim(\vec{x}, \vec{d}_i)\, y(\vec{d}_i, c_j) - b_j$$
where $y(\vec{x}, c_j)$ indicates whether concept $\vec{x}$ belongs to category $c_j$; $y(\vec{d}_i, c_j)$ takes the value 0 or 1 and indicates whether concept $\vec{d}_i$ belongs to category $c_j$; $sim(\vec{x}, \vec{d}_i)$ is the similarity between the test concept $\vec{x}$ and its neighbouring concept $\vec{d}_i$, computed with the vector space model similarity formula of Step 8; and $b_j$ is the category threshold, for which a good value can be obtained only by repeated tuning.
Step 10: Following the steps of the classification phase, test the classifier obtained in the previous step using part of the test types as the types to be classified, and optimize the performance of the classifier.
End of the learning phase.
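Steps 4 and 6 of the learning phase can be illustrated with a minimal sketch. It assumes that each training advertisement has already been segmented into tokens and labelled with its type; the function names and the `(type_name, tokens)` input format are hypothetical and do not come from the patent. Python is used only for brevity, although the patent names Java.

```python
from collections import Counter


def frequency_vector(tokens):
    """Count how often each attribute (word or phrase) occurs in one advertisement."""
    return Counter(tokens)


def type_profiles(labeled_ads):
    """Merge per-advertisement frequency vectors into one profile vector per type.

    labeled_ads: iterable of (type_name, tokens) pairs (assumed input format).
    """
    profiles = {}
    for type_name, tokens in labeled_ads:
        profiles.setdefault(type_name, Counter()).update(frequency_vector(tokens))
    return profiles


# Toy training data, invented for illustration only.
training = [
    ("house lease", ["office", "rent", "area"]),
    ("house lease", ["shop", "rent"]),
    ("education", ["course", "teacher"]),
]
profiles = type_profiles(training)
```

Here `profiles["house lease"]` holds the merged counts of both lease advertisements, which is the profile frequency vector that Step 6 describes.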
II. Classification phase:
Step 1: Input the advertisement (or set of advertisements) to be classified and save it in the query database.
Step 2: Pre-process the advertisement to be classified using the same method as in the learning phase.
Step 3: Index the advertisement to be classified according to the attribute set built in the learning phase, to obtain the advertisement weight vector; see Step 7 of the learning phase.
Step 4: Compute the weight vector of the advertisement to be classified.
Step 5: Classify automatically with the classifier (see Step 9 of the learning phase) to obtain the classification result.
End of the classification phase.
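The similarity computation used in the classification phase (Step 8 of the learning phase) can be sketched as follows. This is a minimal illustration that assumes weight vectors are stored as sparse dictionaries mapping attribute to weight; `rank_types` and the toy weights are hypothetical, and ranking an advertisement directly against one vector per type is a simplification of the classifier in Step 9. Python is used only for brevity.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors (dict: attribute -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


def rank_types(ad_vector, type_vectors):
    """Return (similarity, type_name) pairs, most similar type first."""
    scored = [(cosine(ad_vector, vec), name) for name, vec in type_vectors.items()]
    return sorted(scored, reverse=True)


# Toy weight vectors, invented for illustration only.
type_vectors = {
    "house lease": {"rent": 0.9, "office": 0.4},
    "education": {"course": 0.8, "teacher": 0.6},
}
ad = {"rent": 0.5, "office": 0.5}
ranking = rank_types(ad, type_vectors)
```

Because the advertisement shares no attribute with the "education" vector, its similarity to that type is exactly zero, which corresponds to the first case discussed in the description: widely separated types with disjoint keywords.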
Among advertisement types, the relationship between two different types falls into two cases. In the first case the two types are far apart, that is, very dissimilar, and the keywords they use are completely different; house lease and education and training are an example. To predict which of the two an advertisement belongs to, it is enough to check which type's attribute set the advertisement mainly uses, and this can be done with the KNN algorithm. In the second case the types are very similar and may even use the same attribute set to describe their subject matter. Here the KNN algorithm alone cannot tell the types apart: it is necessary to measure which advertisement type each attribute tends to describe and then combine these measurements to predict the type of the advertisement. In advertisement classification most advertisements fall into the first case, and the second case is the hardest.
A statistic constructed to describe some statistical property of the data always carries an error; it converges to the property it describes with probability 1 only as the amount of data tends to infinity. When the amount of data is small, or the data are sparse, the error between the statistic and the true value is very large. To describe advertisement content expressed in natural language, the potential attribute set can be very large, whereas the known advertisement set used for machine learning (the learning set) is comparatively small. Between types that are far apart, the attribute sets in use are widely dispersed, which produces a large amount of sparse data. The statistics obtained in that case are unreliable, and the more complex the statistic, the larger the error. Between close types, the attributes in use are relatively concentrated and the amount of data can reach a certain scale, so the statistics obtained between such types are more reliable.
The core idea of the invention comes from text classification. The problems that text classification addresses are how to let the user find the wanted information as quickly as possible, and how to organize and maintain massive amounts of electronic information effectively. There are many text classification methods, for example: Bayes classifiers based on probability models, rule-based decision-tree or decision-rule classifiers, K-nearest-neighbour classifiers based on human classification experience, linear classifiers based on class descriptions, support vector machines based on the optimal hyperplane, and classifier committees that combine several classification methods. Following the text classification approach, the invention proposes introducing the KNN nearest-neighbour algorithm into a linear classifier based on the improved vector space model, combining them into a new classifier.
In the linear classifier, the vector space model describes advertisement content as a vector whose elements are attributes, i.e. words, characters, strings and so on. The computer can then operate on advertisement content with vector operations, for example computing the length of a vector or measuring the similarity between advertisements. This allows most widely separated advertisement types to be classified automatically, accurately and efficiently. By adopting the "improved vector space model classifier based on the KNN algorithm", the invention overcomes the problems of existing linear classifiers based on the vector space model. Results obtained on large-scale data show that the automatic advertisement type identification method of the invention improves classification accuracy significantly.
The advantage of the invention is that it automatically judges the type of an advertisement from the title and content submitted by the user, and improves classification accuracy.
Description of drawings
Fig. 1 is a flow diagram of the manual classification method;
Fig. 2 is a flow diagram of the advertisement classification method;
Fig. 3 is a program flow chart of the learning phase;
Fig. 4 is a program flow chart of the classification phase.
Embodiment
The invention will be further described below in conjunction with drawings and Examples.
Embodiment
The equipment used for the invention comprises an advertisement placement server, advertisement word-segmentation equipment, a query server, a test server, an index server, a dictionary server, and so on.
Advertisement placement server: runs the application with which users place advertisements, and provides the interface to the advertisement classification equipment.
Advertisement word-segmentation equipment: the server is an industrial computer or a PC with stable performance; its database stores the segmentation records of advertisements; the segmentation program splits an advertisement into characters or words.
Query server: the query program retrieves results from the index server. If the cache server holds no entry for an effective keyword, the lookup passes to the query server, which uses the index server to find the advertisement type to which the keyword belongs.
Index server: an index database is built on the index server, i.e. an index from keywords to the dictionary database. The index speeds up queries, and the index database is updated whenever the dictionary changes.
Dictionary server: holds the advertisement dictionary database, which stores the advertisement classification words; every word that serves as an effective keyword of any advertisement type is stored in this database.
Test server: runs the test program.
Fig. 2 is a flow diagram of the advertisement classification method. A user placing a classified advertisement first enters its title and content, for example "office building for rent", and then sends a classification query, which starts the automatic classification process. After the request is submitted, the effective keywords are extracted from the advertisement content. In this advertisement three effective keywords are extracted: "office building", "rent" and "building". The method then checks whether these keywords are present in the cache server. If the same query has been made before and the cached result has not yet expired, the result is returned directly from the cache server. Otherwise the query is submitted to the query server, which obtains the matching categories from the index database. For the three keywords "office building", "building" and "rent", the index database yields the "house lease" type, and this result is returned. If there are many keywords and several types are found in the index database, their weights are compared and the result is returned after sorting; it is also stored in the cache server for the next call. Automatic classification removes the step in which the user selects the category, and so avoids the problem of an unsuitable category being chosen.
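The cache-then-index lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `TypeLookup`, the in-memory dictionaries standing in for the cache server and the index database, the expiry time, and the rule of summing keyword weights per type before sorting are all hypothetical, since the patent says only that weights are compared and the results sorted. Python is used for brevity.

```python
import time


class TypeLookup:
    """In-memory stand-in for the cache server and the query/index servers."""

    def __init__(self, index, ttl_seconds=300.0, clock=time.monotonic):
        self.index = index      # keyword -> list of (type_name, weight)
        self.ttl = ttl_seconds  # how long a cached result stays valid (assumed)
        self.clock = clock
        self.cache = {}         # frozenset of keywords -> (expiry_time, result)

    def query(self, keywords):
        key = frozenset(keywords)
        now = self.clock()
        hit = self.cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # cached and not yet expired
        # Cache miss: consult the index and accumulate weights per type.
        totals = {}
        for word in key:
            for type_name, weight in self.index.get(word, []):
                totals[type_name] = totals.get(type_name, 0.0) + weight
        # Highest total weight first; ties broken by name for determinism.
        result = sorted(totals, key=lambda t: (-totals[t], t))
        self.cache[key] = (now + self.ttl, result)
        return result


# Toy index, invented for illustration only.
index = {
    "office building": [("house lease", 0.6)],
    "building": [("house lease", 0.2), ("construction", 0.3)],
    "rent": [("house lease", 0.7)],
}
lookup = TypeLookup(index)
result = lookup.query(["office building", "rent", "building"])
```

A second call with the same keywords, in any order, is answered from the cache until the entry expires.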
In this process, collecting the effective keywords is the most critical step of the automatic advertisement classification method: whether the effective keywords are selected accurately directly determines whether the advertisement is classified accurately. The keyword extraction method proposed by the invention is a classification method based on the improved vector space model and the KNN algorithm. It introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords are classified automatically in a single pass. It is programmed in the Java language and runs on a computer. When executed on a computer, the method comprises the following steps in order:
Fig. 3 is the program flow chart of the learning phase.
1. Input the advertisement type set. Create the advertisement type database and add each advertisement type to it to build the original advertisement type set. Create two data tables in the advertisement type database: the advertisement type table and the keyword table.
Table 1: Advertisement type table
| Field | Data type | Primary key | Nullable |
|---|---|---|---|
| Type number (type_id) | Integer | Primary key | Not null |
| Type name (type_name) | Varchar(20) | | Not null |
Table 2: Keyword table
| Field | Data type | Primary key | Nullable |
|---|---|---|---|
| Keyword number (word_id) | Integer | Primary key | Not null |
| Keyword name (word_name) | Varchar(50) | | Not null |
| Advertisement type number (type_id) | Integer | Foreign key | Not null |
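The two tables above can be created as in the following sketch. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for the MySQL database the patent names; the table names `ad_type` and `keyword` are hypothetical, while the column names and types follow Tables 1 and 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE ad_type (
    type_id   INTEGER PRIMARY KEY,
    type_name VARCHAR(20) NOT NULL
);
CREATE TABLE keyword (
    word_id   INTEGER PRIMARY KEY,
    word_name VARCHAR(50) NOT NULL,
    type_id   INTEGER NOT NULL REFERENCES ad_type(type_id)
);
""")

# Toy rows, invented for illustration only.
conn.execute("INSERT INTO ad_type (type_id, type_name) VALUES (1, 'house lease')")
conn.executemany(
    "INSERT INTO keyword (word_name, type_id) VALUES (?, ?)",
    [("office building", 1), ("rent", 1)],
)

# Look up the advertisement type of every stored keyword.
rows = conn.execute(
    "SELECT k.word_name, t.type_name FROM keyword k "
    "JOIN ad_type t ON t.type_id = k.type_id ORDER BY k.word_id"
).fetchall()
```

The join at the end is the kind of keyword-to-type lookup that the index database is meant to accelerate.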
2. Build the index database from the advertisement types. The index database provides an index from query words to the dictionary server and speeds up category lookup for the user. Whenever the dictionary server changes, the index database must be rebuilt.
3. Set up the query server. The query server obtains results from the index database according to the query words, sorts them, and returns the result. An advertisement may yield several effective keywords, and the keyword weight is the main basis on which the query server judges the advertisement type: the larger the weight of a keyword, the higher its type is ranked; conversely, the smaller the weight, the lower its advertisement type is ranked.
4. Pre-process the advertisement types and form the original attribute set according to the classified advertisement standard. Generate the attribute frequency vector of each type; taking the type as the unit, merge the attribute frequency vectors to generate the profile frequency vector of each type, in the form shown in Table 4; compute the weight vector of each type, in the form shown in Table 5. Generate the classifier and initialize all parameters to 1.
5. Attribute extraction in the classification phase. The advertisement content is first split into words. Different segmentation methods can lead to different classifications; because classified advertisements are short, mechanical segmentation based on string matching was chosen.
6. Remove invalid words. After segmentation, an advertisement has been split into a number of words. Some of them are useful for classification, while others, for example auxiliary words such as "的" ("of") and "是" ("is"), contribute nothing to it. This method uses chi-square weighting for dimensionality reduction.
7. Check whether the query words are in the cache server. After the invalid words have been removed, all remaining keywords are useful for classification. First check whether the effective keywords already exist in the cache server. If they do, return the result directly from the cache server. If they do not, submit the effective keywords to the query server; the query server obtains and returns the result and saves it in the cache server for next time.
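The segmentation and invalid-word removal steps in the list above can be sketched as follows. The patent says only that mechanical segmentation based on string matching is used; forward maximum matching is one common variant and is an assumption here, as are the tiny dictionary, the stop-word list, the `max_len` parameter and the sample sentence. Python is used only for brevity.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop words that contribute nothing to classification."""
    return [t for t in tokens if t not in stop_words]


# Toy dictionary and stop-word list, invented for illustration only.
dictionary = {"写字楼", "出租", "楼"}  # "office building", "rent", "building"
stop_words = {"的"}                    # auxiliary word "of"
tokens = forward_max_match("写字楼的出租", dictionary)
keywords = remove_stop_words(tokens, stop_words)
```

Forward maximum matching yields one segmentation per sentence, so it does not also emit the shorter word "楼" ("building") inside "写字楼"; a scheme that reports every dictionary match would be needed to reproduce the three keywords of the Fig. 2 example.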
Fig. 4 is the program flow chart of the classification phase. In the classification phase, the advertisement (or set of advertisements) to be classified is input and saved in the query database. The advertisement is then pre-processed and fed to the classifier, which classifies it automatically and outputs the type (or set of types) it may belong to. Table 3 shows the design of the query data table.
Table 3: Query data table
| Field | Data type | Primary key | Nullable |
|---|---|---|---|
| Number of the advertisement to be classified (ad_id) | Integer | Primary key | Not null |
| Title of the advertisement to be classified (ad_title) | Varchar(100) | | |
| Content of the advertisement to be classified (ad_content) | Varchar(500) | | |
| Effective keywords of the advertisement to be classified (ad_key) | Varchar(500) | | |
| Type number of the advertisement to be classified (ad_type_id) | Integer | Foreign key | Not null |
For example:
The title of an advertisement is "Shops for lease" and its content is: The Ai Jianyuan shops in the Xujiahui commercial district are now formally offered for lease to the public. Shop area 5680 m²; in addition there are business offices of 1250 m² and a warehouse of 2400 m². Enquiries from all quarters are welcome. Telephone: 64395012, 64395072. Contacts: Mr. Zhou, Mr. Ma.
(1) Pre-process the advertisement to be classified.
(2) According to the attribute set determined in the learning phase, segment and index the advertisement to be classified. The advertisement contains 35 attributes (segmented words) in total, which occur 43 times altogether. Generate the attribute frequency vector; the result is shown in Table 4.
Table 4: Attribute frequencies of the advertisement to be classified
| Attribute | Frequency | Attribute | Frequency |
|---|---|---|---|
| Retail shop | 3 | Society | 2 |
| Recruit | 2 | Area | 1 |
| Rent | 2 | 5680 | 1 |
| Xujiahui | 1 | m² | 3 |
| Merchant | 1 | In addition | 1 |
| Circle | 1 | Have | 1 |
| Like | 1 | Commercial affairs | 1 |
| Build | 1 | Office | 1 |
| Garden | 1 | With | 1 |
| Existing | 1 | Room | 1 |
| Solemnly | 1 | Warehouse | 1 |
| Right | 1 | 2400 | 1 |
| 1250 | 1 | Welcome | 1 |
| Each | 1 | Boundary | 1 |
| Come | 1 | Negotiation | 1 |
| 64395012 | 1 | 64395072 | 1 |
| Week | 1 | Horse | 1 |
| Sir | 2 | | |
(3) Compute the weight vector of the advertisement to be classified, using the term frequency-inverse document frequency (TF*IDF) weighting method. The TF*IDF method uses the frequency of occurrence of a term to approximate its importance. The formula is $W_{ij} = tf(t_i, d_j) \times \log \frac{m}{df(t_i)}$, where $W_{ij}$ is the weight of term i in advertisement (document) j, $tf(t_i, d_j)$ is the frequency with which term i occurs in advertisement j, $df(t_i)$ is the number of advertisements that contain term i, and $m$ is the total number of advertisements. The result is shown in Table 5.
Table 5: Weight vector of the advertisement to be classified
| Attribute | Weight | Attribute | Weight |
|---|---|---|---|
| Retail shop | 0.416225 | Society | 0.017321 |
| Recruit | 0.145319 | Area | 0.226877 |
| Rent | 0.263862 | 5680 | 0.009753 |
| Xujiahui | 0.096646 | m² | 0.078965 |
| Merchant | 0.065671 | In addition | 0.001356 |
| Circle | 0.023696 | Have | 0.096372 |
| Like | 0.002678 | Commercial affairs | 0.264829 |
| Build | 0.007262 | Office | 0.438291 |
| Garden | 0.164923 | With | 0.343892 |
| Existing | 0.036385 | Room | 0.454829 |
| Solemnly | 0.013648 | Warehouse | 0.035271 |
| Right | 0.002648 | 2400 | 0.003528 |
| 1250 | 0.005672 | Welcome | 0.053728 |
| Each | 0.002891 | Boundary | 0.002356 |
| Come | 0.042145 | Negotiation | 0.035728 |
| 64395012 | 0.000743 | 64395072 | 0.000643 |
| Week | 0.024189 | Horse | 0.032157 |
| Sir | 0.147895 | | |
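The TF*IDF weighting of step (3) can be sketched as follows. The sketch assumes raw term counts for tf and the natural logarithm, neither of which the patent specifies, and it uses an invented four-advertisement corpus, so it does not reproduce the values of Table 5; `tfidf_weights` is a hypothetical name. Python is used only for brevity.

```python
import math
from collections import Counter


def tfidf_weights(ad_tokens, corpus):
    """W_ij = tf(t_i, d_j) * log(m / df(t_i)) for every term of one advertisement.

    corpus: list of token lists, one per advertisement (m = len(corpus)).
    """
    m = len(corpus)
    tf = Counter(ad_tokens)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)  # advertisements containing the term
        if df == 0:
            continue  # term unseen in the corpus: no weight can be computed
        weights[term] = count * math.log(m / df)
    return weights


# Toy corpus, invented for illustration only.
corpus = [
    ["shop", "rent", "shop"],
    ["office", "rent"],
    ["course", "teacher"],
    ["shop", "area"],
]
weights = tfidf_weights(corpus[0], corpus)
```

A term that occurs in every advertisement gets weight zero under this formula, which is how TF*IDF discounts words that do not discriminate between types.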
(4) In the vector space model, the content relevance Sim(D1, D2) between two concepts D1 and D2 is expressed by the cosine of the angle between their vectors:
$$Sim(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} \times W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^{2}\right)\left(\sum_{k=1}^{n} W_{2k}^{2}\right)}}$$
The relevance can be computed with this formula. The corresponding linear classifier is then built according to the formula of the KNN algorithm:
$$y(\vec{x}, c_j) = \sum_{\vec{d}_i \in kNN} sim(\vec{x}, \vec{d}_i)\, y(\vec{d}_i, c_j) - b_j$$
where $y(\vec{x}, c_j)$ indicates whether concept $\vec{x}$ belongs to category $c_j$; $y(\vec{d}_i, c_j)$ takes the value 0 or 1 and indicates whether concept $\vec{d}_i$ belongs to category $c_j$; $sim(\vec{x}, \vec{d}_i)$ is the similarity between the test concept $\vec{x}$ and its neighbouring concept $\vec{d}_i$, obtained from the cosine formula in (4); and $b_j$ is the category threshold. The KNN formula determines whether each segmented word belongs to the type. After the invalid words have been excluded, the result is as shown in Table 6.
Table 6: Effective attribute set
| Attribute | Weight | Attribute | Weight |
|---|---|---|---|
| Retail shop | 0.416225 | Area | 0.226877 |
| Recruit | 0.145319 | Office | 0.438291 |
| Rent | 0.263862 | Commercial affairs | 0.264829 |
| Garden | 0.164923 | With | 0.343892 |
| Sir | 0.147895 | Room | 0.454829 |
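The KNN decision function of step (4), y(x, c_j) = Σ sim(x, d_i) · y(d_i, c_j) − b_j over the k nearest neighbours, can be sketched as follows. The training vectors, the value of k and the threshold are invented for illustration, and `knn_score` is a hypothetical name; the patent states only that the threshold b_j must be tuned repeatedly. Python is used for brevity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (dict: attribute -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_score(x, training, category, k, threshold):
    """Sum the similarities of the k nearest neighbours that belong to `category`,
    then subtract the category threshold b_j. A positive score means "belongs"."""
    neighbours = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    total = sum(cosine(x, vec) for vec, label in neighbours if label == category)
    return total - threshold


# Toy labelled vectors, invented for illustration only.
training = [
    ({"rent": 1.0, "shop": 1.0}, "house lease"),
    ({"rent": 1.0, "office": 1.0}, "house lease"),
    ({"course": 1.0, "teacher": 1.0}, "education"),
]
x = {"rent": 1.0, "shop": 0.5}
lease = knn_score(x, training, "house lease", k=2, threshold=0.5)
edu = knn_score(x, training, "education", k=2, threshold=0.5)
```

With k = 2 both nearest neighbours are "house lease" advertisements, so the lease score is positive while the education score equals the negated threshold.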
(5) Input the weights of the effective attributes of the advertisement to be classified (Table 6) into the classifier generated in the learning phase, classify automatically, and output the classification result.
Take the "house lease" type as an example. All 10 effective attributes of the advertisement to be classified appear in the feature set of the "house lease" type, so the effective attribute set of the advertisement belongs to that type and the advertisement is assigned to "house lease". This result matches the actual content of the advertisement, so the machine classification is correct.
(6) To check the classification performance of the automatic advertisement classification method of the invention, 50000 advertisements to be classified were input. The classification results are shown in Table 7.
Table 7: Classification accuracy (%) of different weighting methods on attribute sets of different sizes
| Attribute set size | KNN algorithm only | Improved vector space model only (TF*IDF) | K-nearest-neighbour classifier based on the improved vector space model |
|---|---|---|---|
| 10000 | 74.8 | 58.0 | 84.0 |
| 20000 | 76.7 | 75.0 | 89.3 |
| 30000 | 77.5 | 83.0 | 92.3 |
| 40000 | 78.3 | 87.1 | 93.8 |
| 50000 | 78.7 | 89.7 | 95.0 |
As Table 7 shows, the "KNN nearest-neighbour classification method based on the improved vector space model" of the invention improves the accuracy of advertisement classification significantly on every attribute set. Accuracy is highest when the attribute set contains all attributes, reaching 95.0%, which is 5.3 percentage points higher than the TF*IDF method using only the improved vector space model (89.7%) and 16.3 percentage points higher than the KNN algorithm alone (78.7%). The method using only the improved vector space model classifies well only when the attribute set is large: when the attribute set contains only 10000 attributes, its accuracy is very low, at just 58.0%. By contrast, the "KNN nearest-neighbour classification method based on the improved vector space model" of the invention achieves high classification accuracy on all attribute sets.

Claims (1)

1. An advertisement classification method that automatically identifies the type of a classified advertisement, characterized in that it is programmed in the Java language, runs on a computer, and introduces the KNN algorithm into a linear classifier based on the vector space model, so that advertisements with similar keywords are classified automatically in a single pass; when executed on a computer, the method comprises the following steps in order:
I. Learning phase:
Step 1: Input the advertisement type set.
Step 2: Determine the attribute unit and the type of linear classifier to be used.
Step 3: Pre-process the advertisement type set.
Step 4: Attribute extraction. Index the advertisement type set to obtain the original attribute set and the frequency vector of each advertisement type. An advertisement type is denoted D and an attribute is denoted t. An attribute is a basic linguistic unit that appears in the advertisement type and can represent it, mainly a word or a phrase. An advertisement type can be expressed by its set of attributes as D(T1, T2, ..., Tn), where Tk is an attribute, 1 <= k <= n.
Step 5: Apply an existing dimensionality-reduction operation, based on frequency or weight, to the original attribute set to obtain the attribute set. The basic idea of dimensionality reduction is to use an iterative method to position the feature vectors in the reduced space so that the distances and dissimilarities between them are preserved as far as possible. To this end, the following squared-error measure must be reduced continually during the iteration:
$$P = \sum_{i \neq j} \left[ d^{*}(x_i, x_j) - f\big(d(x_i, x_j)\big) \right]^{2}$$
where $x_i, x_j$ is any pair of distinct samples ($i \neq j$), $d(x_i, x_j)$ is the original dissimilarity between $x_i$ and $x_j$, $d^{*}(x_i, x_j)$ is the dissimilarity after transformation into the low-dimensional space, and $f$ is a monotonic transformation function.
Step 6: Taking the type as the unit, merge the frequency vectors of the individual advertisements to obtain the profile frequency vector of each type.
Step 7: For an advertisement type containing n attribute values, each attribute is usually given a weight that represents its importance, i.e. D = D(T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D(W1, W2, ..., Wn), which is the vector representation of advertisement type D, where Wk is the weight of Tk, 1 <= k <= n. The weights are computed with the term frequency-inverse document frequency (TF*IDF) weighting method, which uses the frequency of occurrence of a term to approximate its importance. The formula is
$$W_{ij} = tf(t_i, d_j) \times \log \frac{m}{df(t_i)}$$
where $W_{ij}$ is the weight of term i in advertisement (document) j, $tf(t_i, d_j)$ is the frequency with which term i occurs in advertisement j, $df(t_i)$ is the number of advertisements that contain term i, and $m$ is the total number of advertisements.
Step 8: In the vector space model, the content relevance Sim(D1, D2) between two concepts D1 and D2 is expressed by the cosine of the angle between their vectors:
$$Sim(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} \times W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^{2}\right)\left(\sum_{k=1}^{n} W_{2k}^{2}\right)}}$$
Step 9: construct the corresponding linear classifier according to the following formula:
$$y(\vec{x}, c_j) = \sum_{\vec{d}_i \in kNN} sim(\vec{x}, \vec{d}_i)\, y(\vec{d}_i, c_j) - b_j$$
where $y(\vec{x}, c_j)$ indicates whether concept $\vec{x}$ belongs to class $c_j$; $y(\vec{d}_i, c_j)$ takes the value 0 or 1 and indicates whether concept $\vec{d}_i$ belongs to class $c_j$; $sim(\vec{x}, \vec{d}_i)$ is the similarity between the test concept $\vec{x}$ and a neighbouring concept $\vec{d}_i$, computed with the vector space model similarity formula of step 8; and $b_j$ is the threshold of the class, for which a good value can only be found by repeated tuning;
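The decision function of step 9 can be sketched as follows: find the k training vectors most similar to the test vector, sum the similarities of those that belong to class c_j, and subtract the threshold b_j. The names `KnnScorer` and `score`, the integer class labels and the rule "assign when the score is positive" are assumptions made for illustration.

```java
import java.util.Arrays;

final class KnnScorer {
    /** Cosine similarity as in step 8; returns 0 for an all-zero vector (assumed convention). */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            na += a[k] * a[k];
            nb += b[k] * b[k];
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /**
     * y(x, c_j) = sum over the k nearest neighbours d_i of sim(x, d_i) * y(d_i, c_j) - b_j,
     * where y(d_i, c_j) is 1 if labels[i] == classId and 0 otherwise.
     */
    static double score(double[] x, double[][] train, int[] labels,
                        int classId, int k, double bj) {
        double[] sims = new double[train.length];
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) {
            sims[i] = cosine(x, train[i]);
            order[i] = i;
        }
        // Sort training indices by decreasing similarity to x.
        Arrays.sort(order, (p, q) -> Double.compare(sims[q], sims[p]));
        double sum = 0.0;
        for (int r = 0; r < Math.min(k, order.length); r++) {
            int i = order[r];
            if (labels[i] == classId) {
                sum += sims[i];
            }
        }
        return sum - bj;
    }

    public static void main(String[] args) {
        double[][] train = {{1, 0}, {0, 1}, {1, 1}};
        int[] labels = {0, 1, 0};
        double[] x = {1, 0};
        System.out.println("class 0: " + score(x, train, labels, 0, 2, 0.5));
        System.out.println("class 1: " + score(x, train, labels, 1, 2, 0.5));
    }
}
```

Under the assumed decision rule, the test vector is assigned to every class whose score is positive, so one advertisement may fall into several classes or into none.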
Step 10: following the steps of the classification phase, use a portion of the test types as the types to be classified, test the classifier obtained in the previous step, and optimize its performance;
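The text gives no procedure for the optimization in step 10 beyond testing on held-out types. One simple possibility, sketched below under that caveat, is to try several candidate values of the threshold b_j and keep the one that classifies the most test samples correctly. `ThresholdTuner`, `bestThreshold` and the candidate grid are hypothetical.

```java
final class ThresholdTuner {
    /**
     * rawScores[i] is the neighbour-similarity sum for test sample i before the
     * threshold is subtracted; belongs[i] says whether sample i truly belongs to
     * the class. Returns the candidate threshold with the most correct decisions
     * (the first such candidate if several tie).
     */
    static double bestThreshold(double[] rawScores, boolean[] belongs, double[] candidates) {
        double best = candidates[0];
        int bestCorrect = -1;
        for (double b : candidates) {
            int correct = 0;
            for (int i = 0; i < rawScores.length; i++) {
                boolean predicted = rawScores[i] - b > 0;
                if (predicted == belongs[i]) {
                    correct++;
                }
            }
            if (correct > bestCorrect) {
                bestCorrect = correct;
                best = b;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] rawScores = {0.2, 0.4, 0.9, 1.3};
        boolean[] belongs = {false, false, true, true};
        double[] candidates = {0.1, 0.5, 1.0};
        System.out.println(bestThreshold(rawScores, belongs, candidates));
    }
}
```

Accuracy is used as the selection criterion only for brevity; another measure could be substituted without changing the structure of the loop.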
The learning phase ends;
Two, in the classification phase:
Step 1: input the advertisement type (set) to be classified;
Step 2: preprocess the advertisement to be classified using the same method as in the learning phase;
Step 3: index the advertisement to be classified according to the attribute set built in the learning phase and obtain the advertisement type weight vectors (see step 7 of the learning phase);
Step 4: compute the weight vector of the advertisement to be classified;
Step 5: classify automatically with the classifier (see step 9 of the learning phase) and obtain the classification result. The classification phase ends.
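The classification phase can be sketched end to end in Java. Several parts are assumptions made for illustration: preprocessing is reduced to lower-casing and whitespace splitting (the text does not describe it here), the logarithm is natural, the winning class is the one with the highest positive score, and all names (`AdClassificationPhase`, `preprocess`, `weightVector`, `classify`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class AdClassificationPhase {
    /** Step 2 (assumed preprocessing): lower-case the text and split on whitespace. */
    static List<String> preprocess(String adText) {
        return Arrays.asList(adText.toLowerCase().trim().split("\\s+"));
    }

    /** Steps 3-4: index the tokens against the learned attribute set and apply TF*IDF. */
    static double[] weightVector(List<String> tokens, List<String> attributes, int[] df, int m) {
        double[] w = new double[attributes.size()];
        for (String token : tokens) {
            int k = attributes.indexOf(token);
            if (k >= 0) {
                w[k] += 1.0; // raw term frequency
            }
        }
        for (int k = 0; k < w.length; k++) {
            if (w[k] > 0.0) {
                w[k] *= Math.log((double) m / df[k]); // df[k] >= 1 for learned attributes
            }
        }
        return w;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            na += a[k] * a[k];
            nb += b[k] * b[k];
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Step 5: kNN scoring per class; returns the best class index, or -1 if no score is positive. */
    static int classify(double[] x, double[][] train, int[] labels, double[] thresholds, int k) {
        double[] sims = new double[train.length];
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) {
            sims[i] = cosine(x, train[i]);
            order[i] = i;
        }
        Arrays.sort(order, (p, q) -> Double.compare(sims[q], sims[p]));
        int best = -1;
        double bestScore = 0.0;
        for (int c = 0; c < thresholds.length; c++) {
            double score = -thresholds[c];
            for (int r = 0; r < Math.min(k, order.length); r++) {
                int i = order[r];
                if (labels[i] == c) {
                    score += sims[i];
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> attributes = List.of("rent", "room", "job", "salary");
        int[] df = {2, 2, 2, 2};
        double[][] train = {{1, 1, 0, 0}, {1, 0, 0, 0}, {0, 0, 1, 1}, {0, 0, 0, 1}};
        int[] labels = {0, 0, 1, 1}; // 0 = housing, 1 = jobs (made-up types)
        double[] x = weightVector(preprocess("Room for rent"), attributes, df, 4);
        System.out.println(classify(x, train, labels, new double[] {0.5, 0.5}, 2));
    }
}
```

An advertisement that shares no term with the attribute set produces an all-zero vector and is left unclassified (-1) in this sketch.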

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006100283059A CN101097570A (en) 2006-06-29 2006-06-29 Advertisement classification method capable of automatic recognizing classified advertisement type

Publications (1)

Publication Number Publication Date
CN101097570A (en) 2008-01-02

Family

ID=39011405

Family Applications (1)

Application Number Priority Date Filing Date Title Status Publication
CNA2006100283059A Pending CN101097570A (en) 2006-06-29 2006-06-29 Advertisement classification method capable of automatic recognizing classified advertisement type

Country Status (1)

Country Link
CN (1) CN101097570A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication