CN101477566A

CN101477566A - Method and apparatus used for putting candidate key words advertisement

Info

Publication number: CN101477566A
Application number: CNA2009100771855A
Authority: CN
Inventors: 王震; 方高林
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2009-01-19
Filing date: 2009-01-19
Publication date: 2009-07-08

Abstract

The invention discloses an advertisement delivery method based on candidate keywords, comprising: extracting at least one candidate keyword and computing its advertising relevance and topical relevance; determining an advertising-topic weight for each candidate keyword from the results of these two computations; and selecting candidate keywords for advertisement delivery according to that weight. The invention further provides a corresponding advertisement delivery device. The technical scheme resolves the conflict between topic-word priority and advertising-word priority and improves the accuracy of advertisement delivery.

Description

A method and device for advertisement delivery based on candidate keywords

Technical field

The present invention relates to the field of Internet processing technology, and more particularly to a method and device for advertisement delivery based on candidate keywords.

Background technology

With the development and spread of Internet technology, the network has gradually become an important medium for disseminating information. Online advertising is one of the most important profit models for Internet companies, and how to deliver advertisements in the best way is a research focus for websites and other content providers.

At present, the general process of advertisement promotion based on Internet text content is as shown in Figure 1: a topic-word extraction module computes weights for the extracted candidate keywords and ranks them, obtaining the feasibility of each candidate keyword as a keyword of the text. The main reference factors for this feasibility (shown in Figure 2) are:

(1) The TF-IDF value (the importance of each candidate keyword in the text), which combines two factors, the term frequency and the inverse document frequency of the word; the higher this value, the more topical the word.

(2) Features of the word itself: rule-based constraints added as needed, such as part of speech or word length. Nouns are generally more topical, and within a certain range a longer word is considered more important.

(3) Structural information of the word in the text: for example, the position where the word occurs (words appearing in the title, at the beginning of the text, in the first paragraph or in the last paragraph are usually more topical) and the distribution of the word in the text (a word that is evenly distributed and covers many paragraphs is usually more topical).

In summary, a quantitative calculation based on TF-IDF, combined with the features of the word itself and its structural information in the text, finds the words most relevant to the meaning of the article and yields the feasibility of each word.

However, this advertisement recommendation system has some shortcomings:

(a) A topic word is generally the word most relevant to the descriptive information of the text, but it does not necessarily have advertising value. When an article contains only a few topic words, the words that can be matched to relevant advertisements are very limited.

(b) The weight represents the priority of a topic word, not the priority of its advertising value. When a word with little advertising significance slightly outranks a word with strong advertising significance in topicality, advertisement delivery is likely to get its priorities backwards.

At present there is also a technical scheme for advertisement promotion based on Internet text content that uses an "advertisement dictionary + topic-word weight" approach (shown in Figure 3). In essence this solution still computes weights according to topicality and ranks the result. To meet the needs of advertising, however, an advertisement dictionary is applied after the candidate keywords are generated and before the weights are computed, so that only words with advertising relevance enter the topic calculation. This guarantees that advertisements can be delivered against the finally selected words. The words are then ranked and output by topic relevance (that is, the result of the weight calculation described above).

Topic relevance is mainly computed with TF-IDF. TF (Term Frequency) is the frequency with which a word occurs in a given text. DF (Document Frequency) is the proportion of documents that contain the word; it is prior knowledge obtained by training. For example, if 100 of 10,000 articles contain a certain word, the DF of that word is 0.01. IDF (Inverse Document Frequency) is the reciprocal of DF; the higher the IDF, the rarer the word and the better it characterizes an article. The weight of a word is computed as:

Weight = TF × log(IDF)
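As a worked illustration of this formula, the following Python sketch reproduces the example above (100 of 10,000 articles, so DF = 0.01 and IDF = 100). The function names, the choice of the natural logarithm, and the TF value of 0.05 are assumptions made for illustration; the text does not specify a logarithm base.

```python
import math


def document_frequency(docs_containing_term: int, total_docs: int) -> float:
    """DF: fraction of training documents that contain the term."""
    return docs_containing_term / total_docs


def tfidf_weight(tf: float, df: float) -> float:
    """Weight = TF x log(IDF), with IDF taken as 1 / DF as described in the text."""
    idf = 1.0 / df
    return tf * math.log(idf)  # natural log is an assumption


df = document_frequency(100, 10000)      # the example from the text: 0.01
weight = tfidf_weight(tf=0.05, df=df)    # tf=0.05 is an illustrative value
print(f"DF={df}, weight={weight:.4f}")
```

A rarer word (smaller DF) yields a larger log(IDF) factor, so at equal TF it receives a higher weight.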

The final weight is obtained from the above, supplemented by the attributes of the word itself or its structural information in the text. Although this method solves the problem of topic words that are not advertising words, it has three drawbacks: keeping the advertisement dictionary supplied with new words requires a very large maintenance effort; the conflict between topic priority and advertising priority remains unsolved; and because every word entering the topic ranking comes from the advertisement dictionary, the diversity of topic words is limited.

Summary of the invention

In view of the above problems in the prior art, the technical problem to be solved by the present invention is to provide a method and device for advertisement delivery based on candidate keywords that resolves the conflict between topic-word priority and advertising-word priority.

The objective of the invention is to be achieved through the following technical solutions:

The invention provides a method for advertisement delivery based on candidate keywords, comprising:

computing the advertising relevance and topical relevance of at least one extracted candidate keyword;

determining, from the advertising relevance and topical relevance results so obtained, an advertising-topic weight for the candidate keyword, so as to select candidate keywords for advertisement delivery.

Further, computing the advertising relevance of the candidate keyword specifically comprises:

matching the candidate keyword against an established advertisement dictionary, which stores a fixed weight for each advertising word, and determining the exact advertisement matching degree of the candidate keyword in the advertisement dictionary from the fixed weight corresponding to the candidate keyword; and computing the similarity between the candidate keyword and the advertisement context vector from the context vector of the candidate keyword and the advertisement context vector obtained from the advertising words in the advertisement dictionary;

taking the maximum of the exact advertisement matching degree and the similarity as the advertising relevance result of the candidate keyword.

Further, the method also comprises a process of establishing the advertisement dictionary, which specifically comprises:

submitting words from text information that users pay attention to into a search engine to search for advertising words, and recording the advertising words found together with their frequency of occurrence and grade, to obtain the advertising words of the advertisement dictionary;

determining and storing the fixed weight of each advertising word in the advertisement dictionary according to the advertisement degree weight of the advertising word in the search engine and the similarity between the advertising word and the obtained advertisement context vector; wherein the advertisement degree weight represents the degree of attention the advertising word receives in the search engine, and the similarity represents the degree of similarity between the context vector of the advertising word and the advertisement context vector obtained from a corpus storing a large number of articles.

Further, obtaining the advertisement degree weight specifically comprises:

taking, as the advertisement degree weight of the advertising word, the ratio of the computed advertisement degree value of the advertising word in the search engine to the largest computed advertisement degree value among the advertising words of the advertisement dictionary;

wherein the advertisement degree value is computed from the frequency and grade of the advertisements placed for the advertising word in the search engine.

Further, determining the exact advertisement matching degree of the candidate keyword in the advertisement dictionary specifically comprises:

computing the exact advertisement matching degree of the candidate keyword in the advertisement dictionary from the fixed weight corresponding to the candidate keyword in the advertisement dictionary and the character length of the candidate keyword;

or, computing the exact advertisement matching degree of the candidate keyword in the advertisement dictionary from the fixed weights, in the advertisement dictionary, of the component words obtained by splitting the candidate keyword and the character lengths of those component words.

Further, obtaining the advertisement context vector specifically comprises:

matching each advertising word of the advertisement dictionary in a corpus storing a large number of articles, and recording the nearest context words with substantive meaning and their frequency information; forming the context vector of each advertising word from its context words and their frequency information; and combining the context vectors of all advertising words in the advertisement dictionary to obtain the advertisement context vector; wherein each context word of each advertising word has a corresponding numerical value in the advertisement context vector.

Further, computing the similarity comprises:

obtaining and storing the similarity between each advertising word and the advertisement context vector as the cosine between the context vector of that advertising word in the advertisement dictionary and the advertisement context vector;

and determining the similarity between the candidate keyword and the advertisement context vector as the cosine between the context vector of the candidate keyword and the advertisement context vector.
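The cosine computation described above can be sketched in Python as follows. Representing a context vector as a dictionary from context word to value is an assumption made for illustration, as are the function name and the sample values; the text does not prescribe a data structure.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine between two sparse vectors keyed by context word."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical context vectors (context word -> value); the numbers are made up.
ad_context = {"price": 0.8, "buy": 0.5, "discount": 0.3}
candidate_context = {"price": 0.6, "review": 0.4}

similarity = cosine_similarity(candidate_context, ad_context)
print(f"similarity={similarity:.4f}")
```

Because the vector entries are non-negative frequency-derived values, the cosine falls between 0 and 1, which lets it be compared directly with the exact advertisement matching degree.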

Further, computing the topical relevance of the candidate keyword specifically comprises:

computing the topical relevance result of the candidate keyword from the computed importance value of the candidate keyword in the text, the feature weight of the candidate keyword itself, and the weight of the candidate keyword's structural information in the text.

Further, computing the importance value of the candidate keyword in the text comprises:

for a simple word, computing the importance value of the candidate keyword from its inverse document frequency and its term frequency in the single text;

or, for a compound word, computing the importance value of the candidate keyword from an overall estimate of the inverse document frequencies of the component words obtained by splitting the compound word and the term frequency of the candidate keyword in the single text; the overall estimate of the inverse document frequency mainly comprises the mean or the weighted mean and is used to approximate the inverse document frequency of the compound word.
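The compound-word case can be illustrated with a short Python sketch. It assumes that the importance value takes the TF × log(IDF) form given in the background section, which this passage does not itself state; the function names, the IDF dictionary contents and the part weights are all hypothetical.

```python
import math
from typing import Dict, List, Optional


def estimate_compound_idf(parts: List[str],
                          idf_dict: Dict[str, float],
                          part_weights: Optional[List[float]] = None) -> float:
    """Approximate a compound word's IDF by the mean or weighted mean of its parts' IDFs."""
    idfs = [idf_dict[p] for p in parts]  # raises KeyError if a part is not in the dictionary
    if part_weights is None:
        return sum(idfs) / len(idfs)
    return sum(w * v for w, v in zip(part_weights, idfs)) / sum(part_weights)


def importance(tf: float, idf: float) -> float:
    """Importance value, assumed here to be TF x log(IDF)."""
    return tf * math.log(idf)


# Hypothetical IDF dictionary entries for the parts of "digital camera".
idf_dict = {"digital": 50.0, "camera": 200.0}

mean_idf = estimate_compound_idf(["digital", "camera"], idf_dict)
weighted_idf = estimate_compound_idf(["digital", "camera"], idf_dict, [1.0, 3.0])
score = importance(tf=0.02, idf=mean_idf)
print(mean_idf, weighted_idf, round(score, 4))
```

The weighted mean lets a more informative component (here "camera") dominate the estimate, whereas the plain mean treats all components equally.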

Further, obtaining the inverse document frequency specifically comprises:

in an inverse document frequency training stage, extracting, from a corpus storing a large number of articles, the frequency of occurrence and document frequency of the words produced by the word segmentation system, computing the inverse document frequency of each word, and combining the inverse document frequencies of the segmented words into an inverse document frequency dictionary.

Further, the formula for computing the advertising-topic weight of the candidate keyword comprises:

Weight(w) = ADWeight(w) × TopicWeight(w)

where ADWeight(w) is the advertising relevance result of the candidate keyword and TopicWeight(w) is its topical relevance result.

The present invention also provides a device for advertisement delivery based on candidate keywords, comprising:

a vocabulary computing module, configured to compute the advertising relevance and topical relevance of at least one extracted candidate keyword;

a comprehensive processing module, configured to determine the advertising-topic weight of the candidate keyword from the obtained advertising relevance and topical relevance results, so as to select candidate keywords for advertisement delivery.

Preferably, the vocabulary computing module specifically comprises:

an advertising relevance computing unit, configured to match the candidate keyword against the established advertisement dictionary, which stores a fixed weight for each advertising word, and to determine the exact advertisement matching degree of the candidate keyword in the advertisement dictionary from the matched fixed weight; to compute the similarity between the candidate keyword and the advertisement context vector from the context vector of the candidate keyword and the advertisement context vector obtained from the advertising words in the advertisement dictionary; and to take the maximum of the exact advertisement matching degree and the similarity as the advertising relevance result of the candidate keyword;

a topical relevance computing unit, configured to determine the topical relevance result of the candidate keyword from the computed importance value of the candidate keyword in the text, the feature weight of the candidate keyword itself, and the weight of the keyword's structural information in the text.

Preferably, the advertising relevance computing unit specifically comprises:

an advertisement dictionary establishing subunit, configured to submit words from text information that users pay attention to into a search engine to search for advertising words, and to record the advertising words found together with their frequency of occurrence and grade, obtaining the advertising words of the advertisement dictionary; and to determine and store the fixed weight of each advertising word in the advertisement dictionary according to the advertisement degree weight of the advertising word in the search engine and the similarity between the advertising word and the obtained advertisement context vector;

an advertisement context obtaining subunit, configured to match the advertising words obtained by the advertisement dictionary establishing subunit in a corpus storing a large number of articles, and to record the nearest context words with substantive meaning and their frequency information; the context words and frequency information of every advertising word in the advertisement dictionary are combined into the advertisement context vector; each context word of each advertising word has a corresponding numerical value in the advertisement context vector.

Preferably, the advertising relevance computing unit further comprises:

a matching computation subunit, configured to match the candidate keyword against the advertisement dictionary, which stores a fixed weight for each advertising word, and to determine the exact advertisement matching degree of the candidate keyword in the advertisement dictionary from the matched fixed weight;

a similarity computation subunit, configured to compute the similarity between the candidate keyword and the obtained advertisement context vector;

an advertising relevance synthesis subunit, configured to take the maximum of the exact advertisement matching degree computed by the matching computation subunit and the similarity computed by the similarity computation subunit as the advertising relevance result of the candidate keyword.

Preferably, the advertising relevance computing unit further comprises:

a fixed weight computation subunit, configured to determine and store the fixed weight of each advertising word in the advertisement dictionary according to the advertisement degree weight of the advertising word in the search engine and the similarity between the advertising word and the obtained advertisement context vector.

Preferably, the topical relevance computing unit specifically comprises:

an inverse document frequency obtaining subunit, configured to extract, in an inverse document frequency training stage, the frequency of occurrence and document frequency of the words produced by the word segmentation system from a corpus storing a large number of articles, to compute the inverse document frequency of those words, and to combine the results into an inverse document frequency dictionary;

an importance computation subunit, configured to compute the importance value of the candidate keyword in the text from the inverse document frequency obtained by the inverse document frequency obtaining subunit;

a topical relevance computation subunit, configured to determine the topical relevance result of the candidate keyword from the importance value computed by the importance computation subunit, the feature weight of the keyword itself, and the weight of the keyword's structural information in the text.

Preferably, the comprehensive processing module specifically comprises:

a comprehensive result computing unit, configured to compute the advertising-topic weight of the candidate keyword from the obtained advertising relevance and topical relevance results;

a ranking unit, configured to sort the candidate keywords in descending order of the advertising-topic weight computed by the comprehensive result computing unit.

Beneficial effect:

By balancing the advertising relevance and topical relevance of words, the technical scheme of the present invention screens the candidate keywords extracted from a text according to their advertising-topic weight, selects suitable candidate keywords from among them, and delivers advertisements matched to different web page content. Judging the advertising-topic weight of candidate keywords is foundational work in a content-based advertising system.

By computing an advertisement degree weight for the extracted candidate keywords, the technical scheme of the present invention determines whether a word is an advertising word, overcoming the prior-art problem of topic words that are not advertising words. Moreover, because the matching result of a candidate keyword is obtained through both exact matching and similarity calculation, the scheme does not, as in the prior art, rely solely on words matched directly in the advertisement dictionary and thereby produce a single, mechanical result; the diversity of advertising words is preserved.

Description of drawings

Fig. 1 is a schematic diagram of the general process of advertisement promotion based on Internet text content in the prior art;

Fig. 2 is a schematic diagram of the main technical parameters used in the prior art to judge the feasibility of a candidate keyword as a text keyword;

Fig. 3 is a schematic diagram of the "advertisement dictionary + topic-word weight" technical scheme in the prior art;

Fig. 4 is a flow chart of the method according to an embodiment of the present invention;

Fig. 5 is a structural diagram of the device according to an embodiment of the present invention;

Fig. 6 is a structural diagram of the vocabulary computing module in the device according to an embodiment of the present invention;

Fig. 7 is a structural diagram of the advertising relevance computing unit according to an embodiment of the present invention;

Fig. 8 is a structural diagram of the topical relevance computing unit according to an embodiment of the present invention;

Fig. 9 is a structural diagram of the comprehensive processing module according to an embodiment of the present invention.

Embodiment

The technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments.

In the technical scheme of the present invention, the exact advertisement matching weight of each extracted candidate keyword is first computed through the advertisement dictionary; that is, the exact advertisement matching weight of a candidate keyword is measured rigidly by the advertisement weights given in the advertisement dictionary. Then, using the advertisement context vector formed from the statistically gathered advertisement context words and their frequency information, the cosine between each candidate keyword and the advertisement context vector is computed and taken as the advertisement similarity between that candidate keyword and the advertisement context vector. The maximum of these two values is taken as the final advertising relevance result of the candidate keyword. A topical relevance calculation is also performed on all candidate keywords, and the final advertising-topic weight of each candidate keyword is obtained. The advertising-topic weight measures the advertising strength of a candidate advertising word: the higher the weight, the stronger the advertising relevance of the word.

As shown in Figure 4, a method for advertisement delivery based on candidate keywords according to an embodiment of the present invention comprises:

Step S101: compute the advertising relevance and topical relevance of at least one extracted candidate keyword;

Step S102: from the advertising relevance and topical relevance results obtained, determine the advertising-topic weight of the candidate keyword, so as to select candidate keywords for advertisement delivery.

Note: the advertising relevance weight and the topical relevance weight of a candidate keyword are computed separately; the two calculations are independent of each other and have no required order. The advertising-topic weight of each candidate keyword is computed by combining the two values, the candidate keywords are sorted by advertising-topic weight, and the top several are taken, as needed, as the final result for advertisement delivery.
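A minimal Python sketch of this combine-and-rank step follows. It assumes the product form Weight(w) = ADWeight(w) × TopicWeight(w) given in the summary and the maximum rule for the advertising result; the function names and all sample scores are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def ad_weight(exact_match: float, similarity: float) -> float:
    """Advertising result: the larger of the exact matching degree and the similarity."""
    return max(exact_match, similarity)


def rank_candidates(scores: Dict[str, Tuple[float, float, float]],
                    top_n: int) -> List[Tuple[str, float]]:
    """scores maps keyword -> (exact_match, similarity, topic_weight)."""
    combined = {
        word: ad_weight(exact, sim) * topic
        for word, (exact, sim, topic) in scores.items()
    }
    # Sort in descending order of the combined weight and keep the first top_n.
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up scores: (exact matching degree, similarity, topical weight).
scores = {
    "camera": (0.9, 0.4, 0.5),
    "weather": (0.0, 0.1, 0.8),
    "lens": (0.0, 0.7, 0.6),
}
top = rank_candidates(scores, top_n=2)
print(top)
```

In this made-up data, "weather" is the most topical word but has almost no advertising relevance, so it ranks below "camera" and "lens"; "lens" is absent from the dictionary yet still ranks well through its context similarity.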

Specifically, in step S101, the process of extracting candidate keywords in this embodiment may comprise:

After the word segmentation system segments the original text, semantic analysis is performed on the text content, and words with concrete meaning are extracted from the text as candidate keywords.

The extraction of words is divided into:

(1) selecting qualifying original words from the segmentation result;

(2) performing new-word discovery on the text to mine out-of-vocabulary entity words.

The candidate keywords therefore combine two kinds of results: first, words that the word segmentation system can recognize and that have a part of speech with substantive meaning (for example, nouns, verbs and adjectives); second, the results of new-word discovery. The two kinds of results are merged and, after spam filtering and redundancy filtering, the selected words serve as the candidate keywords.

Specifically, the process of computing the advertising relevance of the candidate keyword is as follows:

Specifically, the method also comprises a process of establishing the advertisement dictionary, which may comprise:

(1) Words from text information that users pay attention to are submitted to a search engine to search for advertising words, and the advertising words found are recorded together with their frequency of occurrence and grade, to obtain the advertising words of the advertisement dictionary.

In this embodiment, the advertisement dictionary is built mainly by analyzing users' information of interest and the promotions in the major search engines. Preferably, the method may be:

A portion of the query log (Query Log) is extracted from the network, and the words in the log are submitted to the major search engines for mining. Entries for which advertisements are found are recorded, together with data such as the frequency with which the advertising word appears or its rank. In applying this embodiment, the recorded advertising words and their frequencies and ranks are filtered again: single characters, symbols, junk words and overly general words are deleted, yielding the final advertising words. As long as the query log is large enough, almost all popular advertising words can be covered.

(2) The fixed weight of each advertising word in the advertisement dictionary is determined and stored according to the advertisement degree weight of the advertising word in the search engine and the similarity between the advertising word and the obtained advertisement context vector.

Here, the advertisement degree weight represents the degree of attention the advertising word receives in the search engine, and the similarity represents the degree of similarity between the context vector of the advertising word and the advertisement context vector obtained from a corpus storing a large number of articles.

Specifically, obtaining the advertisement degree weight may comprise:

The ratio of the computed advertisement degree value of the advertising word in the search engine to the largest computed advertisement degree value among the advertising words of the advertisement dictionary is taken as the advertisement degree weight of the advertising word. The advertisement degree value is computed from the frequency and grade of the advertisements placed for the advertising word in the search engine.

In the embodiment of the present invention, the advertisement degree value of an advertising word may be computed as follows:

Let the advertising word be w, let F(w) be the frequency with which advertisements for w appear in the search engine, and let D(w) be its grade. Then:

AdSEWeight(w) = log(F(w) + 1) × (α + βD(w))

where α and β are grade adjustment parameters: α adjusts the gap between the highest and lowest grades, and β adjusts the influence of the grade data on the search-engine advertisement degree value.

For example, suppose advertisements are divided into 7 grades (0 to 6). With equal frequencies, if α = 0.6 and β = 0.1, the grade factor α + βD(w) is 0.6 for the lowest grade and 1.2 for the highest, a ratio of two. Without α the factor would be zero at grade 0, making the difference between high and low grades excessive; the two parameters α and β are therefore added to the formula, and the gap between high and low grades can be tuned to requirements by adjusting their values.

Note: the advertisement degree value of every advertising word in the advertisement dictionary is computed separately, and the ratio of each word's advertisement degree value to the largest advertisement degree value computed is taken as that word's advertisement degree weight. The advertisement degree weight is thus the normalized advertisement degree value of the advertising word in the search engine. Because the frequency of an advertising word w can be very large, normalization is used for convenience of calculation, so that all advertisement degree weights fall in the interval [0, 1]. This value mainly expresses the promotion strength of each advertising word in the search engine and the attention users pay to it.
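The advertisement degree value and its normalization can be sketched in Python as follows. The parameter values α = 0.6 and β = 0.1 come from the example above; the natural logarithm, the function names and the sample frequencies and grades are assumptions made for illustration.

```python
import math
from typing import Dict


def ad_se_value(frequency: int, grade: int,
                alpha: float = 0.6, beta: float = 0.1) -> float:
    """Advertisement degree value: log(F(w) + 1) * (alpha + beta * D(w))."""
    return math.log(frequency + 1) * (alpha + beta * grade)


def normalize_by_max(values: Dict[str, float]) -> Dict[str, float]:
    """Divide every value by the largest one so that the weights fall in [0, 1]."""
    peak = max(values.values())
    if peak == 0.0:
        return {word: 0.0 for word in values}
    return {word: value / peak for word, value in values.items()}


# Hypothetical observations: advertising word -> (advertising frequency, grade 0..6).
observations = {"camera": (1200, 6), "tripod": (300, 2), "lens cap": (40, 0)}

raw = {word: ad_se_value(f, d) for word, (f, d) in observations.items()}
weights = normalize_by_max(raw)
print(weights)
```

With equal frequencies, the ratio between a grade-6 and a grade-0 word is (0.6 + 0.6) / 0.6 = 2, matching the example in the text.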

In the embodiment of the invention, the fixed weight value of an advertising word in the advertisement dictionary can be calculated by the following formula:

ADWordWeight(w) = m × AdSEWeight(w) + n × AdSimilarity(w)

where m and n are the proportions contributed by the advertisement degree weight value and the similarity of the advertising word, respectively, subject to m + n = 1 and m × AdSEWeight(w) = n × AdSimilarity(w); AdSEWeight(w) is the advertisement degree weight value, and AdSimilarity(w) is the similarity.

Because AdSEWeight(w) is a normalized value and AdSimilarity(w) is a cosine value, ADWordWeight(w) lies in the interval [0, 1], which is convenient for computation.
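As a rough sketch of the weighted combination, assuming both inputs already lie in [0, 1]: the function name `ad_word_weight` and the default m = 0.5 are illustrative assumptions, and only the constraint m + n = 1 is enforced here.

```python
def ad_word_weight(ad_se_weight, ad_similarity, m=0.5):
    """ADWordWeight(w) = m * AdSEWeight(w) + n * AdSimilarity(w), with n = 1 - m.

    The default m = 0.5 is an assumption; the source does not give a value.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie in [0, 1] so that m + n = 1 with n >= 0")
    n = 1.0 - m
    return m * ad_se_weight + n * ad_similarity


fixed_weight = ad_word_weight(0.8, 0.4, m=0.5)
```

Since the result is a convex combination of two values in [0, 1], it stays in [0, 1] as well.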

Specifically, in the embodiment of the invention, the accurate advertisement matching degree of the candidate keyword in the advertisement dictionary is determined as follows.

The accurate advertisement matching degree can be calculated by the following formula:

If the candidate keyword w can be decomposed into w_1 w_2 w_3 ... w_n, then

PreciseADWeight(w) = \sum_{i=1}^{n} \frac{AD(w_i) \times length(w_i)}{length(w)}

where AD(w_i) is the fixed weight value of the component word w_i in the advertisement dictionary, length(w_i) is the character length of each word obtained after the candidate keyword is split, and length(w) is the total character length of the keyword. The formula also applies when the candidate keyword itself can be matched directly in the advertisement dictionary. When the candidate keyword, or a component word obtained by decomposing it, has no matching fixed weight value in the advertisement dictionary, the fixed weight value of that word is zero.

Specifically, the advertisement context vector of the advertisement dictionary can be obtained as follows.

In the embodiment of the invention, the value corresponding to each word in the advertisement context vector can be calculated by the following formula:

If the advertisement context vector contains M words (v_1, v_2, ..., v_M) with frequencies (F_1, F_2, ..., F_M), then the value corresponding to word v_i in the advertisement context vector is:

NF(v_i) = \frac{\log(1 + F_i)}{\sqrt{\sum_{k=1}^{M} \log^{2}(1 + F_k)}}

Note: in this embodiment, for ease of calculation, the value corresponding to each advertisement context word in the advertisement context vector is a normalized value. Each advertisement context word of each advertising word therefore corresponds to a normalized value in the advertisement context vector.
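A minimal sketch of the NF normalization, assuming natural logarithms (the source does not state the base) and a plain dictionary of word frequencies; the function name and the sample frequencies are illustrative only.

```python
import math


def normalize_context_vector(frequencies):
    """Compute NF(v_i) = log(1 + F_i) / sqrt(sum_k log^2(1 + F_k)) for every word."""
    logs = {word: math.log(1 + freq) for word, freq in frequencies.items()}
    norm = math.sqrt(sum(value * value for value in logs.values()))
    if norm == 0:
        # All frequencies are zero, so there is nothing to normalize.
        return {word: 0.0 for word in logs}
    return {word: value / norm for word, value in logs.items()}


# Hypothetical context-word frequencies.
vector = normalize_context_vector({"loan": 3, "bank": 1, "rate": 0})
```

The resulting vector has unit Euclidean length, and the logarithm dampens the effect of very frequent context words.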

In this embodiment, the advertisement context vector is obtained mainly by training the advertising words of the advertisement dictionary on a database storing a large amount of article data. First, the advertising words in the advertisement dictionary are matched against the database; whenever a sentence containing an advertising word is matched, the 2N meaningful words nearest to the advertising word in that sentence (N preceding and N following) are recorded. After training, each advertising word in the advertisement dictionary has many context words; the meaningful context words are selected and their frequency information is recorded. These context words and their frequency information are combined into one large vector, the advertisement context vector, which represents the combined characteristics of all advertising words.
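The training pass described above might look like the following sketch. It assumes sentences are already tokenized, treats "meaningful" as "not in a stopword set", and assumes advertising words are never stopwords; these choices, along with the function name `collect_context`, are assumptions for illustration rather than details from the source.

```python
from collections import Counter


def collect_context(sentences, ad_words, n, stopwords=frozenset()):
    """Count the N meaningful words before and after each advertising word.

    sentences: iterable of token lists (one list per sentence)
    ad_words:  set of advertising words from the advertisement dictionary
    n:         window size on each side (2N words in total)
    """
    counts = Counter()
    for tokens in sentences:
        # Keep only meaningful tokens, so the window skips over stopwords.
        meaningful = [token for token in tokens if token not in stopwords]
        for i, token in enumerate(meaningful):
            if token in ad_words:
                counts.update(meaningful[max(0, i - n):i])   # N words before
                counts.update(meaningful[i + 1:i + 1 + n])   # N words after
    return counts


corpus = [["the", "bank", "offers", "cheap", "loan", "rates", "today"]]
context = collect_context(corpus, {"loan"}, n=2, stopwords={"the"})
```

The resulting counter holds the context words and their frequencies, which can then be normalized into the advertisement context vector.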

Specifically, in the embodiment of the invention, the advertising calculation result of the candidate keyword is obtained by taking the maximum of the accurate advertisement matching degree and the advertisement similarity, which can be calculated by the following formula:

ADWeight(w) = MAX(ADSimilarity(w), PreciseADWeight(w))

The advertisement degree weight value of each advertising word in the advertisement dictionary lies in [0, 1], so the accurate advertisement matching weight calculated from the fixed weight values that the candidate keyword matches in the advertisement dictionary also lies in [0, 1]. The advertising calculation result is therefore a value in [0, 1] that serves as an index of a word's advertising strength.

Further, the similarity is calculated as follows.

In the embodiment of the invention, the similarity between the context vector of each advertising word in the advertisement dictionary and the advertisement context vector can be calculated by the following formula:

If the advertising word is w, its context vector is (w_1, w_2, ..., w_s), and the corresponding frequency information is (F_{w_1}, F_{w_2}, ..., F_{w_s}), then

AdSimilarity(w) = \frac{\sum_{i=1}^{s} [ADVector(w_i) \times \log(1 + F_{w_i})]}{\|ADVector\| \times \|w_1, w_2, ..., w_s\|}

where ADVector(w_i) denotes the value in the advertisement context vector corresponding to the context word w_i of the advertising word w.

Note: the purpose of the similarity formula above is to add an advertisement similarity component to the advertisement degree weight value of the advertisement dictionary, so that the accurate advertisement matching degree and the advertisement similarity are comparable in the final MAX calculation.
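The similarity formula is a cosine between the advertisement context vector and the log-scaled context frequencies of the word. The sketch below assumes that the norm of (w_1, ..., w_s) is the Euclidean norm of the log(1 + F) values, which the source does not state explicitly; the function name and sample data are also assumptions.

```python
import math


def ad_similarity(word_context_freqs, ad_vector):
    """Cosine similarity between a word's context and the advertisement context vector.

    word_context_freqs: context word -> frequency F_{w_i} for the word w
    ad_vector:          context word -> value ADVector(w_i)
    """
    logs = {word: math.log(1 + freq) for word, freq in word_context_freqs.items()}
    dot = sum(ad_vector.get(word, 0.0) * value for word, value in logs.items())
    norm_ad = math.sqrt(sum(value * value for value in ad_vector.values()))
    norm_w = math.sqrt(sum(value * value for value in logs.values()))
    if norm_ad == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_ad * norm_w)


# Hypothetical advertisement context vector with unit length.
ad_vec = {"bank": 0.6, "rate": 0.8}
similar = ad_similarity({"bank": 1, "rate": 1}, ad_vec)
unrelated = ad_similarity({"weather": 5}, ad_vec)
```

Because every component is non-negative, the result lies in [0, 1]: a word whose context overlaps heavily with the advertisement context vector scores close to 1, and a word sharing no context words scores 0.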

Specifically, the thematic property of a word mainly quantifies the degree of correlation between the word and the theme of the article. In the embodiment of the invention, the calculation uses TF-IDF as the main algorithm, with some improvements. Because the weight calculation process only accepts the theme words selected as candidates by text analysis, all theme words have already passed junk filtering and carry a certain meaning. For such words, the word frequency (TF) in the text is often more important to the thematic calculation than the inverse document frequency (IDF). The embodiment therefore modifies how the IDF is obtained so that it better highlights the importance of a word, and the final TFIDF value of the word is then determined by the word frequency.

The thematic calculation on the candidate keyword can be performed as follows.

In the embodiment of the invention, the thematic calculation formula can be:

TopicWeight(w) = TFIDF(w) × IndepWeight(w) × StructWeight(w)

where IndepWeight(w) is the weight value of the candidate keyword's own characteristics, StructWeight(w) is the weight value of the candidate keyword's structural information in the text, and TFIDF(w) is the importance value of the candidate keyword in the text.

IndepWeight(w) weights the characteristics of the candidate keyword itself. For example, a noun receives a high weight (for a compound word this can be determined from the number of nouns it contains), a verb a lower one, and so on; within a certain length range, the longer the word, the higher the weight. StructWeight(w) weights the structural information of the candidate keyword in the text. For example, a word in the title has its weight raised significantly, a word in the first paragraph somewhat less, and the more evenly a word is distributed through the article, the higher its weight.

Specifically, the candidate keywords include words made up of several segmentation units, which are the result of new word discovery. The importance value of the candidate keyword in the text can be calculated as follows.

The formula can be TFIDF(w) = TF(w) × IDF(w).

For a compound word w = w_1 w_2 ... w_n, TFIDF(w) = TF(w) × AVEIDF(w_1, w_2, ..., w_n),

where AVEIDF(w_1, w_2, ..., w_n) is an overall estimate of the inverse document frequencies of all component words of w, typically an average or weighted average, used to approximate the inverse document frequency of the compound word w; IDF(w) is the inverse document frequency value; and TF(w) is the word frequency value within a single text.

Specifically, the inverse document frequency value can be obtained as follows.

In the embodiment of the invention, the inverse document frequency is calculated by the formula:

IDF(w) = \log[TF(w)] \times \log\left[\frac{DocumentNumber}{DF(w)}\right]

By this calculation, the inverse document frequencies of the words segmented by the word segmentation system are assembled into an inverse document frequency dictionary.
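The modified IDF and the compound-word TFIDF can be sketched as below. The sketch assumes that TF(w) inside the IDF formula is the word's frequency in the training corpus, that natural logarithms are used, and that AVEIDF is a plain average; none of these choices is fixed by the source, and the function names are illustrative.

```python
import math


def idf(corpus_tf, document_frequency, document_number):
    """IDF(w) = log(TF(w)) * log(DocumentNumber / DF(w)).

    corpus_tf is assumed to be the word's frequency in the training corpus.
    """
    if corpus_tf <= 0 or document_frequency <= 0:
        return 0.0
    return math.log(corpus_tf) * math.log(document_number / document_frequency)


def tfidf(tf_in_text, part_idfs):
    """TFIDF(w) = TF(w) * AVEIDF(w_1..w_n), using a plain average.

    For a single word, pass a one-element list and this reduces to TF * IDF.
    """
    if not part_idfs:
        return 0.0
    return tf_in_text * sum(part_idfs) / len(part_idfs)


single = idf(100, 10, 1000)          # log(100) * log(100)
compound = tfidf(3, [2.0, 4.0])      # 3 * average(2.0, 4.0)
```

A weighted average could replace the plain average in `tfidf` without changing the rest of the calculation.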

Specifically, the formula for determining the advertising theme weight value of the candidate keyword is:

Weight(w) = ADWeight(w) × TopicWeight(w)

The calculated combined weight values of all candidate keywords are sorted. Preferably, they are arranged by the size of the advertising theme weight value with the largest first; the nearer the front a candidate keyword is ranked, the more suitable it is for advertisement delivery.
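Putting the final two formulas together, the ranking step might be sketched as follows. The input layout (one tuple of precomputed scores per candidate keyword), the function name `rank_candidates`, and the sample scores are assumptions for illustration.

```python
def rank_candidates(candidates):
    """Rank candidate keywords by Weight(w) = ADWeight(w) * TopicWeight(w).

    candidates: word -> (ad_similarity, precise_ad_weight, topic_weight)
    ADWeight(w) is the maximum of the similarity and the accurate matching degree.
    """
    scored = []
    for word, (similarity, precise, topic) in candidates.items():
        ad_weight = max(similarity, precise)
        scored.append((word, ad_weight * topic))
    # Largest combined weight first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical precomputed scores.
ranked = rank_candidates({
    "car insurance": (0.4, 0.675, 2.0),
    "weather": (0.1, 0.0, 5.0),
    "loan": (0.9, 0.8, 1.0),
})
order = [word for word, _ in ranked]
```

In this example "weather" has the highest thematic weight but a low advertising weight, so it ranks last; multiplying the two scores is what lets an advertising-relevant keyword outrank a merely prominent one.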

As shown in Figure 5, the present invention also provides a device for advertisement delivery based on candidate keywords, comprising:

a vocabulary calculation module S11, configured to perform advertising and thematic calculations on at least one extracted candidate keyword; and

a comprehensive processing module S12, configured to determine by calculation, from the advertising and thematic calculation results obtained, the advertising theme weight value of the candidate keyword, so as to select candidate keywords for advertisement delivery.

As shown in Figure 6, in this embodiment the vocabulary calculation module may preferably further comprise:

an advertising calculation unit S111, configured to match the candidate keyword in the established advertisement dictionary, which stores the fixed weight value of each advertising word, and to determine by calculation, from the matched fixed weight value, the accurate advertisement matching degree of the candidate keyword in the advertisement dictionary; to calculate, from the obtained advertisement context vector, the similarity of the candidate keyword with respect to the advertisement context vector; and to obtain the advertising calculation result of the candidate keyword by taking the maximum of the accurate advertisement matching degree and the similarity; and

a thematic calculation unit S112, configured to determine by calculation the thematic calculation result of the candidate keyword from the calculated importance value of the candidate keyword in the text, the weight value of the keyword's own characteristics, and the weight value of the keyword's structural information in the text.

As shown in Figure 7, in this embodiment the advertising calculation unit may preferably comprise:

an advertisement dictionary establishing subunit S1111, configured to submit words from text information that users pay attention to into a search engine to search for advertising words, and to record the advertising words found together with their frequency of occurrence and grade, thereby obtaining the advertising words of the advertisement dictionary; and to determine, from the advertisement degree weight value of each advertising word in the search engine and the similarity between the advertising word and the obtained advertisement context vector, the fixed weight value of the advertising word in the advertisement dictionary, and to store it; and

an advertisement context obtaining subunit S1113, configured to match the advertising words obtained by the advertisement dictionary establishing subunit against a database storing a large amount of article data, and to record the meaningful context words nearest to each advertising word together with their frequency information; and to combine the context words and frequency information of the advertising words in the advertisement dictionary into the advertisement context vector, in which each advertisement context word corresponds to a value.

In this embodiment, the advertising calculation unit may preferably further comprise:

a matching calculation subunit S1112, configured to match the candidate keyword in the advertisement dictionary, which stores the fixed weight value of each advertising word, and to determine by calculation, from the matched fixed weight value, the accurate advertisement matching degree of the candidate keyword in the advertisement dictionary;

a similarity calculation subunit S1114, configured to calculate, from the obtained advertisement context vector, the similarity of the candidate keyword with respect to the advertisement context vector;

an advertising synthesis subunit S1115, configured to obtain the advertising calculation result of the candidate keyword by taking the maximum of the accurate advertisement matching degree calculated by the matching calculation subunit and the similarity calculated by the similarity calculation subunit; and

a fixed weight value calculation subunit S1116, configured to determine, from the advertisement degree value of the advertising word in the search engine and the similarity of the advertising word with respect to the obtained advertisement context vector, the fixed weight value of the advertising word in the advertisement dictionary, and to store it.

As shown in Figure 8, in this embodiment the thematic calculation unit may preferably comprise:

an inverse document frequency obtaining subunit S1121, configured, in the inverse document frequency training stage, to extract for each word produced by the word segmentation system its frequency of occurrence and its document frequency in a large corpus, to calculate the inverse document frequency of the word, and to assemble the results into an inverse document frequency dictionary;

an importance calculation subunit S1122, configured to calculate the importance value of the candidate keyword in the text from the inverse document frequency obtained by the inverse document frequency obtaining subunit; and

a thematic calculation subunit S1123, configured to determine by calculation the thematic calculation result of the candidate keyword from the importance value of the candidate keyword in the text calculated by the importance calculation subunit, the weight value of the keyword's own characteristics, and the weight value of the keyword's structural information in the text.

As shown in Figure 9, in this embodiment the comprehensive processing module S12 may preferably comprise:

a comprehensive result calculation unit S121, configured to calculate the advertising theme weight value of the candidate keyword from the advertising and thematic calculation results obtained; and

a sorting unit S122, configured to sort the candidate keywords in descending order of the advertising theme weight value calculated by the comprehensive result calculation unit.

The above are merely preferred embodiments of the present invention, but the scope of protection of the invention is not limited to them. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be defined by the claims.

Claims

1. A method for advertisement delivery based on candidate keywords, characterized by comprising:

2. The method according to claim 1, characterized in that the process of performing the advertising calculation on the candidate keyword comprises:

matching the candidate keyword in an established advertisement dictionary, the advertisement dictionary storing the fixed weight value of each advertising word, and determining by calculation, from the fixed weight value corresponding to the candidate keyword, the accurate advertisement matching degree of the candidate keyword in the advertisement dictionary; and calculating, from the context vector of the candidate keyword and the advertisement context vector obtained from the advertising words in the advertisement dictionary, the similarity between the candidate keyword and the advertisement context vector;

3. The method according to claim 2, characterized in that the method further comprises a process of establishing the advertisement dictionary, which comprises:

determining, from the advertisement degree weight value of the advertising word in a search engine and the similarity between the advertising word and the obtained advertisement context vector, the fixed weight value of the advertising word in the advertisement dictionary, and storing it;

wherein the advertisement degree weight value represents the degree of attention the advertising word receives in the search engine, and the similarity represents the degree of similarity between the context vector of the advertising word and the advertisement context vector obtained from a database storing a large amount of article data.

4. The method according to claim 3, characterized in that the process of obtaining the advertisement degree weight value comprises:

5. The method according to claim 2, characterized in that the process of determining the accurate advertisement matching degree of the candidate keyword in the advertisement dictionary comprises:

6. The method according to claim 2, characterized in that the process of obtaining the advertisement context vector comprises:

7. The method according to claim 2, characterized in that the process of calculating the similarity comprises:

8. The method according to claim 1, characterized in that the process of performing the thematic calculation on the candidate keyword comprises:

9. The method according to claim 8, characterized in that the process of calculating the importance value of the candidate keyword in the text comprises:

10. The method according to claim 9, characterized in that the process of obtaining the inverse document frequency value comprises:

11. The method according to claim 1, characterized in that the formula for determining the advertising theme weight value of the candidate keyword comprises:

Weight(w) = ADWeight(w) × TopicWeight(w)

12. A device for advertisement delivery based on candidate keywords, characterized by comprising:

13. The device according to claim 12, characterized in that the vocabulary calculation module comprises:

an advertising calculation unit, configured to match the candidate keyword in an established advertisement dictionary, the advertisement dictionary storing the fixed weight value of each advertising word, and to determine by calculation, from the matched fixed weight value, the accurate advertisement matching degree of the candidate keyword in the advertisement dictionary; to calculate, from the context vector of the candidate keyword and the advertisement context vector obtained from the advertising words in the advertisement dictionary, the similarity between the candidate keyword and the advertisement context vector; and to obtain the advertising calculation result of the candidate keyword by taking the maximum of the accurate advertisement matching degree and the similarity;

14. The device according to claim 13, characterized in that the advertising calculation unit comprises:

15. The device according to claim 13, characterized in that the advertising calculation unit further comprises:

a similarity calculation subunit, configured to calculate, from the obtained advertisement context vector, the similarity between the candidate keyword and the advertisement context vector;

16. The device according to claim 13, characterized in that the advertising calculation unit further comprises:

a fixed weight value calculation subunit, configured to determine, from the advertisement degree value of the advertising word in the search engine and the similarity between the advertising word and the obtained advertisement context vector, the fixed weight value of the advertising word in the advertisement dictionary, and to store it.

17. The device according to claim 13, characterized in that the thematic calculation unit comprises:

an inverse document frequency obtaining subunit, configured, in the inverse document frequency training stage, to extract for each word produced by the word segmentation system its frequency of occurrence and its document frequency in a database storing a large amount of article data, to calculate the inverse document frequency of the word, and to assemble the results into an inverse document frequency dictionary;

18. The device according to claim 12, characterized in that the comprehensive processing module comprises: