Summary of the invention
The technical problem to be solved by embodiments of the present invention is to provide a content-relevant advertisement identification method and a content-relevant advertising server that can improve the accuracy of content-relevant advertisement identification.
To solve the above technical problem, embodiments of the present invention provide the following technical solutions:
A content-relevant advertisement identification method comprises:
obtaining a feature word set of a target document;
classifying the target document according to the feature word set and the classification sample set of each category, to obtain the category to which the target document belongs;
according to the category to which the target document belongs, judging whether an advertisement satisfies the condition that it belongs to the same category as the target document and that its feature words match feature words in the feature word set of the target document; and if so, confirming that the advertisement is relevant to the target document.
A content-relevant advertising server comprises:
a feature word acquisition unit, configured to obtain and output a feature word set of a target document;
a classification unit, configured to classify the target document according to the feature word set and the classification sample set of each category, and to obtain and output the category to which the target document belongs;
a content-relevant advertisement identification unit, configured to judge, according to the category to which the target document belongs, whether an advertisement satisfies the condition that it belongs to the same category as the target document and that its feature words match feature words in the feature word set of the target document, and if so, to confirm that the advertisement is relevant to the target document.
As can be seen from the above technical solutions, in embodiments of the present invention, when identifying whether an advertisement is relevant to a target document, it is required not only that the keywords of the advertisement match those of the target document but also that the advertisement and the target document belong to the same category, which ensures that the topic of an identified advertisement is also well correlated with the topic of the target document. Where the keywords of an advertisement match those of the target document but the two topics differ, or even differ greatly, the advertisement is not identified as relevant to the target document, because an advertisement and a target document with unrelated topics usually belong to different categories. In summary, compared with the prior art, embodiments of the present invention can improve the accuracy of content-relevant advertisement identification.
Detailed description of the embodiments
Preferred embodiments of the content-relevant advertisement identification method and the content-relevant advertising server provided by the present invention are described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, which is a flowchart of embodiment one of the content-relevant advertisement identification method of the present invention, the method comprises the following steps:
A1. Obtain the feature word set of the target document.
In embodiments of the present invention, a feature word of a document is to be understood broadly as a character, a word, a phrase, a character string, or the like; it may be a keyword extracted from the document and/or a topic word that characterizes the topic of the document.
In embodiments of the present invention, the target document mainly refers to a document provided to a client, such as a web page.
In addition, the weight of each feature word in the feature word set may also be obtained. The weight of a feature word characterizes how relevant the feature word is to the topic of the document: the higher the weight, the better the feature word represents the topic. The weight may specifically be the frequency with which the feature word occurs in the document, or may be calculated from that frequency by a specific algorithm.
The feature word set of a document, and the weights of the feature words in the set, are mainly obtained by processing the document with techniques such as intelligent word segmentation and feature word extraction. In a specific implementation, these may be realized as required with reference to the relevant algorithms in those technical fields, which are not described in detail here.
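As an illustration of step A1 only, the following minimal Python sketch derives a weighted feature word set using relative frequency as the weight. The function name `extract_feature_words`, the `STOP_WORDS` list, the `top_k` parameter, and the regular-expression tokenizer are assumptions introduced here; the tokenizer merely stands in for a real word segmentation technique, which the description leaves open.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}

def extract_feature_words(text, top_k=5):
    """Return {feature_word: weight}, with weight = relative frequency."""
    # Naive tokenization stands in for intelligent word segmentation.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Keep only the top_k most frequent tokens as feature words.
    return {word: count / total for word, count in counts.most_common(top_k)}

weights = extract_feature_words("The football team won the football match", top_k=2)
```

A weight derived from frequency by another algorithm (for example an inverse-document-frequency scheme) could replace the relative frequency without changing the rest of the flow.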
A2. Classify the target document according to the obtained feature word set of the target document, to obtain the category to which the target document belongs.
The classification of the target document is mainly realized by automatic text classification. One optional classification method is as follows: obtain the similarity between the classification sample set of each category and the obtained feature word set of the target document, determine the classification sample set having the greatest similarity to the feature word set, and assign the target document to the category corresponding to the determined classification sample set. In a specific implementation, reference may be made to the relevant algorithms in the field of automatic classification.
Here, a classification sample set mainly refers to a set of feature words relevant to the topic of the corresponding category; a feature word may be a character, a word, a phrase, a character string, or the like.
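The optional classification method of step A2 can be sketched as follows. The description does not name a similarity measure, so cosine similarity over word weights is used here purely as an assumption; the function names and the sample data are likewise hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two {word: weight} mappings."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(doc_features, sample_sets):
    """Return the category whose classification sample set is most similar."""
    return max(sample_sets,
               key=lambda cat: cosine_similarity(doc_features, sample_sets[cat]))

# Hypothetical classification sample sets, one per category.
SAMPLE_SETS = {
    "sports": {"football": 1.0, "goal": 0.8, "match": 0.6},
    "finance": {"stock": 1.0, "bank": 0.9},
}
category = classify({"football": 0.4, "team": 0.2}, SAMPLE_SETS)
```

Because the document's feature word weights enter the dot product, this sketch also reflects the option of taking feature word weights into account during classification.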
The server side builds a category tree according to the requirements of the service environment. The category tree may have only one level or multiple levels. When the category tree has multiple levels, the target document may be classified to a level of the desired depth as required; the deeper the level, the finer the granularity of classification.
Fig. 2 shows an example of a category tree with a two-level structure. In the figure, finance, entertainment, and sports belong to the first level (hereinafter, major categories); football, basketball, and swimming are subcategories of sports and belong to the second level (hereinafter, subcategories).
When the target document is to be classified into a major category using the above example classification method, the categories for which similarity is obtained may be limited to the major categories; that is, only the classification sample sets of the major categories are processed.
When the target document is to be classified into a subcategory using the above example classification method, this may be done in either of two ways. In the first way, the categories for which similarity is obtained are limited to the subcategories; that is, only the classification sample sets of the subcategories are processed. This way is applicable where no subcategory is repeated among the major categories, and because the classification sample sets of all subcategories must be processed, the amount of computation is relatively large. In the second way, the classification sample sets of the major categories are processed first and the target document is classified into the corresponding major category; the classification sample sets of the subcategories contained in that major category are then processed, and the target document is classified into the corresponding subcategory.
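The two ways of classifying into a subcategory can be contrasted in a short sketch. The tree layout (`"samples"` and `"children"` keys), the sample words, and the weighted-overlap similarity are assumptions made for illustration; the category names follow the Fig. 2 example.

```python
def similarity(doc, samples):
    """Weighted overlap between document features and a sample set (assumed measure)."""
    return sum(w * samples[word] for word, w in doc.items() if word in samples)

def best_category(doc, sample_sets):
    return max(sample_sets, key=lambda name: similarity(doc, sample_sets[name]))

# Hypothetical two-level category tree.
CATEGORY_TREE = {
    "sports": {
        "samples": {"match": 1.0, "team": 0.8, "athlete": 0.6},
        "children": {
            "football": {"goal": 1.0, "striker": 0.8},
            "basketball": {"dunk": 1.0, "rebound": 0.8},
            "swimming": {"pool": 1.0, "freestyle": 0.8},
        },
    },
    "finance": {"samples": {"stock": 1.0, "bank": 0.9}, "children": {}},
    "entertainment": {"samples": {"film": 1.0, "concert": 0.8}, "children": {}},
}

def classify_flat(doc, tree):
    """First way: compare against the sample sets of all subcategories."""
    subs = {(major, sub): samples
            for major, node in tree.items()
            for sub, samples in node["children"].items()}
    return best_category(doc, subs)

def classify_top_down(doc, tree):
    """Second way: pick the major category first, then search only its subcategories."""
    major = best_category(doc, {m: node["samples"] for m, node in tree.items()})
    children = tree[major]["children"]
    sub = best_category(doc, children) if children else None
    return major, sub
```

In this sketch the second way compares against three major categories and then three subcategories, whereas the first way would compare against every subcategory in the tree, which grows with the total number of subcategories.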
When classifying the target document, reference may also be made to the weight of each feature word in the feature word set of the target document.
A3. According to the category to which the target document belongs, judge whether an advertisement satisfies the condition that it belongs to the same category as the target document and that its feature words match feature words in the feature word set of the target document; if so, confirm that the advertisement is relevant to the target document.
An advertisement has a category attribute and a corresponding feature word set. Usually, the category and the feature word set of an advertisement are determined from the advertisement registration information and from content such as relevant information on the website to which the advertisement links.
Here, that an advertisement belongs to the same category as the target document mainly means that the advertisement and the target document belong to the same category at a designated level and that their ancestor categories are also the same, which ensures that the topic of the advertisement is well correlated with the topic of the target document. Preferably, the designated level may be set to a relatively coarse granularity, that is, to a relatively small depth, thereby ensuring over a relatively wide range that an advertisement whose topic differs from that of the target document is not identified as an advertisement relevant to the target document.
Here, that the feature words of the advertisement match feature words in the feature word set of the target document may specifically mean that the advertisement and the target document have one or more matching feature words. The degree of matching between the advertisement and the target document may serve as one of the criteria for ordering advertisement delivery.
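The judgment in step A3 can be sketched as a filter followed by a ranking. Representing a category as a path from the root of the category tree makes the "same category at a designated level, with the same ancestors" test a prefix comparison. The record layout (`"id"`, `"path"`, `"words"`), the `min_matches` threshold, and the use of summed document weights as the degree of matching are assumptions for illustration.

```python
def same_category(ad_path, doc_path, level):
    """True if both paths agree at the designated level and at every level above it."""
    if len(ad_path) < level or len(doc_path) < level:
        return False
    return list(ad_path[:level]) == list(doc_path[:level])

def relevant_ads(ads, doc_path, doc_features, level=1, min_matches=1):
    """Return (ad_id, match_degree) pairs for relevant ads, best match first."""
    results = []
    for ad in ads:
        if not same_category(ad["path"], doc_path, level):
            continue  # keyword matches alone are not enough
        matched = set(ad["words"]) & set(doc_features)
        if len(matched) < min_matches:
            continue
        # Assumed matching degree: total document weight of the matched words.
        degree = sum(doc_features[w] for w in matched)
        results.append((ad["id"], degree))
    results.sort(key=lambda item: item[1], reverse=True)
    return results

# Hypothetical advertisements.
ADS = [
    {"id": "boots", "path": ["sports", "football"], "words": {"football", "boots"}},
    {"id": "loan", "path": ["finance"], "words": {"football", "loan"}},
    {"id": "tickets", "path": ["sports", "basketball"], "words": {"match", "football"}},
]
DOC_FEATURES = {"football": 0.5, "match": 0.25}
ranked = relevant_ads(ADS, ["sports", "football"], DOC_FEATURES, level=1)
```

In this example the "loan" advertisement shares the keyword "football" with the document but is rejected because its category differs. With `level=2` the "tickets" advertisement would be rejected as well, which illustrates why a coarser designated level admits a wider range of advertisements.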
In this method embodiment, when identifying whether an advertisement is relevant to a target document, it is required not only that the keywords of the advertisement match those of the target document but also that the advertisement and the target document belong to the same category, which ensures that the topic of an identified advertisement is also well correlated with the topic of the target document. Where the keywords of an advertisement match those of the target document but the two topics differ, or even differ greatly, the advertisement is not identified as relevant to the target document, because an advertisement and a target document with unrelated topics usually belong to different categories. In summary, compared with the prior art, this embodiment of the present invention can improve the accuracy of content-relevant advertisement identification.
Embodiment two of the content-relevant advertisement identification method of the present invention is substantially the same as embodiment one above; the main difference is that the following step is further included between steps A2 and A3:
A2'. Expand the obtained feature word set of the target document according to relevant information of the category to which the target document belongs.
Here, the relevant information may specifically be the classification sample set of the category to which the target document belongs, and/or topic information of that category, and the like.
Expanding the feature word set of the target document according to the classification sample set of the category to which the target document belongs may specifically be: adding sample words that satisfy a predetermined condition in that classification sample set to the feature word set of the target document. A sample word that satisfies the predetermined condition may specifically be a sample word that has a relatively large weight in the classification sample set of the category to which the target document belongs and that does not appear in the target document.
Expanding the feature word set of the target document according to the topic information of the category to which the target document belongs may specifically be: adding the topic words of that category to the feature word set of the target document.
In this method embodiment, the feature word set of the target document may preferably be expanded according to relevant information of the category to which the target document belongs at a relatively fine-grained level, so that the feature words in the expanded feature word set are more specific, thereby improving the coverage of the feature word set.
In this method embodiment, after the target document is classified, its feature word set is expanded according to relevant information of the category to which it belongs, so that the feature word set contains not only the feature words extracted from the document but also feature words corresponding to that category, which improves the coverage of the feature word set of the target document. Therefore, when the topic of an advertisement is relevant to that of the target document but their keywords cannot be matched, the likelihood that the advertisement is identified as relevant to the target document is increased, which further improves the accuracy of content-relevant advertisement identification.
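The expansion in step A2' can be sketched as follows. The threshold `min_weight` stands in for the "predetermined condition", and the weights assigned to the added words (the sample word's own weight, and `topic_weight` for topic words) are assumptions, since the description does not state what weight an added word receives.

```python
def expand_feature_words(doc_features, category_samples, topic_words=(),
                         min_weight=0.7, topic_weight=1.0):
    """Return a new {word: weight} mapping expanded with category information."""
    expanded = dict(doc_features)  # leave the caller's mapping untouched
    # Add high-weight sample words that do not already appear in the document.
    for word, weight in category_samples.items():
        if weight >= min_weight and word not in expanded:
            expanded[word] = weight
    # Add the topic words of the category to which the document belongs.
    for word in topic_words:
        expanded.setdefault(word, topic_weight)
    return expanded

expanded = expand_feature_words(
    {"goal": 0.5},
    {"goal": 1.0, "striker": 0.8, "referee": 0.3},
    topic_words=["football"],
)
```

After expansion, an advertisement whose only feature word is "football" could match this document even though the word never appears in the document text, which is the coverage gain this embodiment aims for.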
Referring to Fig. 3, which is a structural diagram of embodiment one of the content-relevant advertising server of the present invention, the content-relevant advertising server comprises a feature word acquisition unit 310, a classification unit 320, and a content-relevant advertisement identification unit 330:
The feature word acquisition unit 310 is configured to obtain and output the feature word set of the target document.
The classification unit 320 is configured to classify the target document according to the feature word set output by the feature word acquisition unit 310, and to obtain and output the category to which the target document belongs.
The content-relevant advertisement identification unit 330 is configured to judge, according to the category of the target document output by the classification unit 320, whether an advertisement satisfies the condition that it belongs to the same category as the target document and that its feature words match feature words in the feature word set of the target document output by the feature word acquisition unit 310, and if so, to confirm that the advertisement is relevant to the target document.
Embodiment one of the content-relevant advertising server may specifically be implemented using the method of embodiment one of the content-relevant advertisement identification method.
Referring to Fig. 4, which is a structural diagram of embodiment two of the content-relevant advertising server of the present invention, the content-relevant advertising server comprises a feature word acquisition unit 410, a classification unit 420, an expansion unit 430, and a content-relevant advertisement identification unit 440:
The feature word acquisition unit 410 is configured to obtain and output the feature word set of the target document.
The classification unit 420 is configured to classify the target document according to the feature word set output by the feature word acquisition unit 410, and to obtain and output the category to which the target document belongs.
The expansion unit 430 is configured to expand, according to relevant information of the category to which the target document belongs, the feature word set of the target document output by the feature word acquisition unit 410, and to output the expanded feature word set.
The content-relevant advertisement identification unit 440 is configured to judge, according to the category of the target document output by the classification unit 420, whether an advertisement satisfies the condition that it belongs to the same category as the target document and that its feature words match feature words in the feature word set of the target document output by the expansion unit 430, and if so, to confirm that the advertisement is relevant to the target document.
Embodiment two of the content-relevant advertising server may specifically be implemented using the method of embodiment two of the content-relevant advertisement identification method.
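The data flow between the units of the two server embodiments can be sketched as a small Python class. The class name, the method name, and the idea of passing each unit in as a plain callable are assumptions for illustration; the description defines the units only by their functions. Leaving out the expansion unit yields the structure of server embodiment one.

```python
class ContentRelevantAdServer:
    """Wires the units together; each unit is a plain callable (hypothetical interface)."""

    def __init__(self, acquire, classify, identify, expand=None):
        self.acquire = acquire    # feature word acquisition unit (310 / 410)
        self.classify = classify  # classification unit (320 / 420)
        self.identify = identify  # content-relevant advertisement identification unit (330 / 440)
        self.expand = expand      # expansion unit (430); None corresponds to embodiment one

    def relevant_ads(self, document, ads):
        features = self.acquire(document)
        category = self.classify(features)
        if self.expand is not None:
            # Embodiment two: identification uses the expanded feature word set.
            features = self.expand(features, category)
        return [ad for ad in ads if self.identify(ad, category, features)]

# Toy units, for demonstration only.
def acquire(doc):
    return set(doc.split())

def classify(features):
    return "sports" if "football" in features else "other"

def identify(ad, category, features):
    return ad["category"] == category and bool(ad["words"] & features)

def expand(features, category):
    return features | {category}
```

With these toy units, an advertisement in the "sports" category whose only feature word is "sports" is found for the document "football match" only when the expansion unit is present.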
The content-relevant advertisement identification method and the content-relevant advertising server provided by embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its idea. Meanwhile, a person of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.