CN103631874B - UGC label classification determining method and device for social platform - Google Patents

Publication number
CN103631874B
Authority
CN
China
Prior art keywords
label
seed
mark
classification
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310549750.XA
Other languages
Chinese (zh)
Other versions
CN103631874A (en)
Inventor
昝艳
张俊林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weimeng Chuangke Network Technology China Co Ltd
Original Assignee
Weimeng Chuangke Network Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention discloses a method and a device for determining the category of UGC labels on a social platform. The method comprises: for each category, calculating the PMI value between an unannotated label among the UGC labels and each seed label of the category; selecting the seed labels whose PMI value with the unannotated label is greater than a set value as related terms of the unannotated label; for each category, determining the number of related terms of the unannotated label that belong to the category as the relevance between the category and the unannotated label; and determining the category whose relevance to the unannotated label is greater than a threshold as the category to which the unannotated label belongs. The method reduces the complexity of classifying UGC labels.

Description

UGC label classification determining method and device for social platform
Technical field
The present invention relates to Internet technology, and in particular to a method and a device for determining the category of UGC labels on a social platform.
Background
With the development of Internet technology, sharing, spreading, and obtaining information through social platforms has become one of the main ways in which netizens socialize. For example, on social platforms such as microblogs or Twitter, users can build a personal community through various clients, post updates of about 140 characters, and share their latest activities and thoughts instantly. A UGC (user generated content) label is a label that users themselves create on a social platform to describe their identity, personality, interests, emotions, and the like.
Generally, UGC labels can be classified according to their content. For example, labels such as "shopping", "beauty treatment", "constellation intelligent", and "iphone" can be classified as interest labels, and labels such as "doctor" and "after 90s" can be classified as identity labels. On a social platform, UGC labels of different categories can serve as data sources for different application systems: interest labels can feed a recommendation system, identity labels can be used to partition users' social circles, and so on. Accurate classification of UGC labels is therefore important for building application systems on a social platform.
In the prior art, UGC labels are generally classified with an SVM (support vector machine) method based on text feature vectors. Because each UGC label is a short text, the labels must first be semantically expanded into corresponding text feature vectors before classification. The text feature vectors of the annotated labels (also called seed labels) are then used as the classification corpus to train the SVM and obtain a classification model. Finally, the model is used to calculate the similarity between an unannotated UGC label and the seed labels, and the category of the unannotated label is determined from the resulting relevance.
This SVM method has two drawbacks. First, it places high demands on the training corpus: each category typically needs several hundred or more training samples, which means that a large number of seed labels is required. UGC labels are in fact unsupervised data, so seed labels, and hence the classification corpus, can only be obtained through extensive manual annotation. Second, once the classification model is obtained, the relevance between an unannotated label and a seed label is represented by the relevance between their text feature vectors, and those vectors must themselves be obtained by semantic expansion. As a result, the SVM classification algorithm is complex and slow to execute.
In summary, the prior-art SVM classification method based on text feature vectors suffers from high complexity and slow execution when classifying UGC labels.
Summary of the invention
Embodiments of the present invention provide a method and a device for determining the category of UGC labels on a social platform, in order to reduce the complexity of UGC label classification.
According to one aspect of the present invention, a method for determining the category of UGC labels on a social platform is provided, comprising:
for each category, calculating the pointwise mutual information (PMI) value between an unannotated label among the user generated content (UGC) labels and each seed label of the category;
selecting the seed labels whose PMI value with the unannotated label is greater than a set value as related terms of the unannotated label;
for each category, determining the number of related terms of the unannotated label that belong to the category as the relevance between the category and the unannotated label; and
determining the category whose relevance to the unannotated label is greater than a set threshold as the category to which the unannotated label belongs.
Preferably, when calculating the PMI value between the unannotated label among the UGC labels and each seed label of the category, the method further comprises:
calculating the co-occurrence probability of the unannotated label and each seed label of the category; and
selecting the seed labels whose PMI value with the unannotated label is greater than the set value as related terms of the unannotated label specifically comprises:
selecting the seed labels whose PMI value with the unannotated label is greater than the set value as candidate related terms, and selecting, from the candidate related terms, the n seed labels with the highest co-occurrence probability with the unannotated label as the related terms of the unannotated label.
Preferably, after determining the categories whose relevance to the unannotated label is greater than the set threshold, the method further comprises:
taking the determined categories as candidate categories; and
for each candidate category, taking the sum of the PMI values between the unannotated label and those of its related terms that are seed labels of the candidate category as the confidence that the unannotated label belongs to the candidate category;
wherein the category to which the unannotated label belongs is specifically the candidate category with the highest confidence.
Preferably, after determining the category whose relevance to the unannotated label is greater than the set threshold as the category to which the unannotated label belongs, the method further comprises:
for the determined category, comparing the confidence that the unannotated label belongs to the category with a set high-confidence threshold; and if the confidence is greater than the set high-confidence threshold, taking the unannotated label as a seed label of the determined category.
Preferably, the seed labels of each category are determined in advance as follows:
for each category, according to the sentence patterns matched by the seed labels of the category, finding sentences in the social platform corpus that correspond to the sentence patterns, and extracting words from the found sentences as seed labels of the category;
wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in which several manually annotated seed labels of the category appear in the social platform corpus.
Preferably, calculating the PMI value between the unannotated label among the UGC labels and each seed label of the category specifically comprises:
for each seed label of the category, calculating the PMI value between the seed label c and the unannotated label t according to the following formula:
$$\mathrm{pmi}_{t,c} = \log \frac{f(t,c) \times g}{f(t) \times f(c)} \qquad \text{(Formula 1)}$$
In Formula 1, f(t) is the frequency with which t appears in the UGC labels of the users of the social platform; f(c) is the frequency with which c appears in the UGC labels of the users of the social platform; f(t, c) is the co-occurrence frequency with which t and c appear together in the UGC labels of one user; and g is the total number of users on the social platform who have UGC labels;
wherein f(t, c) is determined as the ratio of the pre-counted number of times t and c appear together in the UGC labels of one user to the total number of users on the social platform who have UGC labels.
According to another aspect of the present invention, a device for determining the category of UGC labels on a social platform is further provided, comprising:
a PMI calculation module, configured to calculate, for each category, the PMI value between an unannotated label among the UGC labels and each seed label of the category;
a related term selection module, configured to select, according to the PMI values calculated by the PMI calculation module, the seed labels whose PMI value with the unannotated label is greater than a set value as related terms of the unannotated label;
a relevance determination module, configured to determine, for each category and according to the related terms selected by the related term selection module, the number of related terms of the unannotated label that belong to the category as the relevance between the category and the unannotated label; and
a category determination module, configured to determine, according to the relevance of each category determined by the relevance determination module, the category whose relevance to the unannotated label is greater than a set threshold as the category to which the unannotated label belongs.
Preferably, the PMI calculation module is further configured to calculate, when calculating the PMI value between the unannotated label among the UGC labels and each seed label of the category, the co-occurrence probability of the unannotated label and each seed label of the category; and
the related term selection module is specifically configured to select the seed labels whose PMI value with the unannotated label is greater than the set value as candidate related terms, and to select, from the candidate related terms, the n seed labels with the highest co-occurrence probability with the unannotated label as the related terms of the unannotated label.
Preferably, the category determination module is specifically configured to: after determining the categories whose relevance to the unannotated label is greater than the set threshold, take the determined categories as candidate categories; for each candidate category, take the sum of the PMI values between the unannotated label and those of its related terms that are seed labels of the candidate category as the confidence that the unannotated label belongs to the candidate category; and select the candidate category with the highest confidence as the category to which the unannotated label belongs.
Preferably, the device for determining the category of UGC labels on a social platform further comprises:
a seed label determination module, configured to, for each category, find sentences in the social platform corpus that correspond to the sentence patterns matched by the seed labels of the category, and extract words from the found sentences as seed labels of the category; wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in which several manually annotated seed labels of the category appear in the social platform corpus; and
the seed label determination module is further configured to, for the category determined by the category determination module, compare the confidence that the unannotated label belongs to the category with a set high-confidence threshold, and, if the confidence is greater than the set high-confidence threshold, take the unannotated label as a seed label of the determined category.
In the technical solution of the embodiments of the present invention, for each category, related terms of the unannotated label are selected from the seed labels of the category according to the PMI values between the unannotated label and each seed label of the category, and the relevance between the category and the unannotated label is determined from them; then, according to the relevance determined for each category, the category whose relevance to the unannotated label is greater than the set threshold is taken as the category to which the unannotated label belongs. Compared with the existing SVM classification method based on text feature vectors, determining relevance from PMI values is simpler than determining relevance from text feature vectors, which reduces the complexity of UGC label classification and improves its speed.
Further, in the technical solution provided by the embodiments of the present invention, the seed labels of each category can be expanded automatically from a small number of manually annotated seed labels, which reduces the workload of manually annotating seed labels.
Brief description of the drawings
Fig. 1 is a flowchart of the seed label determination method according to an embodiment of the present invention;
Fig. 2 is a flowchart of the method for determining the category of UGC labels on a social platform according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the internal structure of the device for determining the category of UGC labels on a social platform according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that many of the details listed in the description are intended only to give the reader a thorough understanding of one or more aspects of the present invention, and these aspects can be implemented even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as, but not limited to, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable program, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device itself can be modules. One or more modules may reside within one process and/or thread of execution.
The inventors considered that, for each category, related terms of an unannotated label can be selected from the seed labels of the category according to the PMI (pointwise mutual information) values between the unannotated label and each seed label of the category, and the relevance between the category and the unannotated label can be determined from them; then, according to the relevance determined for each category, the category whose relevance to the unannotated label is greater than a set threshold is taken as the category to which the unannotated label belongs. Compared with the existing SVM classification method based on text feature vectors, this approach does not need to semantically expand UGC labels into text feature vectors, and determining relevance from PMI values is simpler than determining relevance from text feature vectors, so the complexity of UGC label classification is reduced and its speed is improved.
Further, the inventors also considered that the seed labels of each category can be expanded automatically from a small number of manually annotated seed labels, in order to reduce the workload of manually annotating seed labels.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings.
In the embodiments of the present invention, before the category of a UGC label on the social platform is determined, the seed labels of each category can be determined in advance by manual annotation.
As a preferred embodiment, a small number of manually annotated seed labels can also be expanded automatically to obtain more seed labels. The specific method is shown in Fig. 1 and comprises the following steps:
S101: for each category, determine the sentence patterns matched by the seed labels of the category from the sentences in which several manually annotated seed labels of the category appear in the social platform corpus.
The social platform in the embodiments of the present invention may specifically be a microblog service, Twitter, or the like. Accordingly, several seed labels, such as "shopping", "beauty treatment", and "constellation intelligent", can first be annotated manually. Preferably, to expand the seed labels more effectively, words that span the category as widely as possible and that differ in style are chosen as the manually annotated seed labels. Then, the Wu-Manber multi-pattern matching method can be used to match the manually annotated seed labels against the microblog corpus and find the sentences in which the seed labels occur. In practice, because one seed label may match many microblog posts, the several most frequent sentence patterns matched by the seed labels of the category can be extracted from those sentences, for example "who likes ..." and "... is a good choice".
S102: for each category, according to the sentence patterns matched by the seed labels of the category, find sentences in the social platform corpus that correspond to the sentence patterns, and extract words from the found sentences as seed labels of the category.
Specifically, according to the sentence patterns determined above, sentences corresponding to the patterns are found in the microblog corpus, and more interest words, such as "cuisines" and "fashion", are extracted from the found sentences as seed labels of the category.
In practice, after the extracted words are taken as seed labels, the above method can be applied iteratively to further expand the number of seed labels of the category. The number of iterations can be set by those skilled in the art according to actual needs and the observed effect. In this way, the seed labels of a category can be expanded from a small number of manually annotated seed labels, which reduces the workload of manually annotating seed labels.
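Steps S101 and S102 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the toy corpus, the function names (`learn_patterns`, `extract_candidates`, `expand_seeds`), and the `{}` slot notation are hypothetical, and a plain substring search stands in for the Wu-Manber multi-pattern matcher named in the text.

```python
import re
from collections import Counter


def learn_patterns(corpus, seeds, top_k=2):
    """S101: replace a seed occurrence with a slot and keep the most frequent patterns."""
    counts = Counter()
    for sentence in corpus:
        for seed in seeds:
            if seed in sentence:  # naive substring search, not Wu-Manber
                counts[sentence.replace(seed, "{}", 1)] += 1
    return [pattern for pattern, _ in counts.most_common(top_k)]


def extract_candidates(corpus, patterns):
    """S102: collect the single word that fills the slot of any learned pattern."""
    found = set()
    for pattern in patterns:
        prefix, suffix = pattern.split("{}", 1)
        regex = re.compile(re.escape(prefix) + r"(\w+)" + re.escape(suffix))
        for sentence in corpus:
            match = regex.fullmatch(sentence)
            if match:
                found.add(match.group(1))
    return found


def expand_seeds(corpus, seeds, rounds=1):
    """Iterate S101 and S102; the number of rounds is a tuning choice."""
    seeds = set(seeds)
    for _ in range(rounds):
        seeds |= extract_candidates(corpus, learn_patterns(corpus, seeds))
    return seeds


# Hypothetical toy corpus, for illustration only.
CORPUS = [
    "who likes shopping",
    "who likes music",
    "shopping is a good choice",
    "who likes food",
    "fashion is a good choice",
    "the weather is cold today",
]
EXPANDED = expand_seeds(CORPUS, {"shopping", "music"}, rounds=1)
```

With this toy corpus the two manually chosen seeds yield the patterns "who likes {}" and "{} is a good choice", which in turn pull in "food" and "fashion" as new seed labels. A real corpus would need filtering of noisy extractions, which the sketch omits.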
An embodiment of the present invention provides a method for determining the category of UGC labels on a social platform. The flow is shown in Fig. 2 and comprises the following steps:
S201: for each category, calculate the PMI value between the unannotated label and each seed label of the category.
For ease of description, an unannotated label here refers to a UGC label whose category has not yet been determined. In this step, for each category and for each seed label of the category, the PMI value between the seed label c and the unannotated label t can be calculated according to the following formula:
$$\mathrm{pmi}_{t,c} = \log \frac{f(t,c) \times g}{f(t) \times f(c)} \qquad \text{(Formula 1)}$$
In Formula 1, f(t) is the frequency with which t appears in the UGC labels of the users of the social platform; f(c) is the frequency with which c appears in the UGC labels of the users of the social platform; f(t, c) is the co-occurrence frequency with which t and c appear together in the UGC labels of one user; and g is the total number of users on the social platform who have UGC labels.
In practice, f(t) is determined as the ratio of the pre-counted number of times t appears in the UGC labels of the users of the social platform to the total number of users on the social platform who have UGC labels; f(c) is determined as the ratio of the pre-counted number of times c appears in the UGC labels of the users of the social platform to that total number of users; and f(t, c) is determined as the ratio of the pre-counted number of times t and c appear together in the UGC labels of one user to that total number of users.
More preferably, for each category, when the PMI value between the unannotated label and each seed label of the category is calculated, the co-occurrence probability of the unannotated label and each seed label of the category can also be calculated.
Specifically, the co-occurrence probability of the unannotated label and a seed label is determined as the ratio of the number of times the unannotated label and the seed label appear together in the UGC labels of one user to the total number of users on the social platform who have UGC labels.
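As a rough illustration of step S201, the sketch below computes Formula 1 and the co-occurrence probability from per-user label lists. It makes several assumptions that the text does not fix: f(t), f(c), and f(t, c) are treated as raw user counts (so that multiplying by g yields the usual PMI ratio), the logarithm is natural, and pairs that never co-occur are given a PMI of negative infinity. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def count_labels(user_labels):
    """Count, over users, how often each label and each label pair occurs."""
    single, pair = Counter(), Counter()
    for labels in user_labels:
        unique = sorted(set(labels))          # one vote per user per label
        single.update(unique)
        pair.update(combinations(unique, 2))  # keys are alphabetically ordered pairs
    return single, pair


def pmi(t, c, single, pair, g):
    """Formula 1 with raw counts: log(f(t, c) * g / (f(t) * f(c)))."""
    joint = pair[tuple(sorted((t, c)))]
    if joint == 0:
        return float("-inf")  # assumption: no co-occurrence means no association
    return math.log(joint * g / (single[t] * single[c]))


def cooccurrence_probability(t, c, pair, g):
    """Share of labelled users whose labels contain both t and c."""
    return pair[tuple(sorted((t, c)))] / g


# Hypothetical label lists of four users.
USERS = [
    ["fashion control", "shopping"],
    ["fashion control", "shopping"],
    ["shopping", "music"],
    ["music"],
]
SINGLE, PAIR = count_labels(USERS)
G = len(USERS)
PMI_EXAMPLE = pmi("fashion control", "shopping", SINGLE, PAIR, G)  # log(2 * 4 / (2 * 3))
```

Counting each label at most once per user keeps the counts aligned with the per-user definition of g; other counting conventions are possible.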
S202: select the seed labels whose PMI value with the unannotated label is greater than a set value as related terms of the unannotated label.
In this step, for each category, according to the calculated PMI values between the unannotated label and each seed label of the category, the seed labels whose PMI value with the unannotated label is greater than the set value are selected as related terms of the unannotated label.
In practice, if only the set value is used as a limit, then for some popular unannotated labels there may be many seed labels that are related to the unannotated label but have relatively low PMI values with it.
As a preferred embodiment, to reduce the complexity of subsequent operations while preserving the accuracy of the category determined for the unannotated label, for each category, after the PMI values between the unannotated label and each seed label of the category are calculated, the seed labels whose PMI value with the unannotated label is greater than the set value can be selected as candidate related terms; then, the n seed labels with the highest co-occurrence probability with the unannotated label are selected from the candidate related terms as the related terms of the unannotated label. The set value and the number n of related terms to select can be set by those skilled in the art after weighing the relevant factors, so as to avoid, as far as possible, selecting related terms whose PMI value with the unannotated label is too small.
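The two-stage selection in step S202 can be sketched as a filter followed by a ranking. The function name, the sample scores, and the choice to rank whatever pool of seed labels is passed in (one category or several pooled together) are assumptions made for illustration; the text does not spell out whether n applies per category or overall.

```python
def select_related_terms(pmi_by_seed, cooc_by_seed, min_pmi, n):
    """Keep seeds with PMI above min_pmi, then take the n with the highest co-occurrence."""
    candidates = [seed for seed, value in pmi_by_seed.items() if value > min_pmi]
    candidates.sort(key=lambda seed: cooc_by_seed.get(seed, 0.0), reverse=True)
    return candidates[:n]


# Hypothetical scores for one unannotated label.
PMI_BY_SEED = {"shopping": 4.3, "music": 3.9, "perfume": 3.0, "weather": 0.5}
COOC_BY_SEED = {"shopping": 0.020, "music": 0.030, "perfume": 0.005, "weather": 0.040}

RELATED = select_related_terms(PMI_BY_SEED, COOC_BY_SEED, min_pmi=2.0, n=2)
```

Here "weather" co-occurs most often but is dropped by the PMI filter, and "perfume" passes the filter but falls outside the top two by co-occurrence probability, leaving "music" and "shopping".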
S203: for each category, determine the number of related terms of the unannotated label that belong to the category as the relevance between the category and the unannotated label.
In this step, for each category, the related terms of the unannotated label determined in step S202 that belong to the category are counted, and the resulting count is taken as the relevance between the category and the unannotated label.
S204: determine the category whose relevance to the unannotated label is greater than a set threshold as the category to which the unannotated label belongs.
In this step, the relevance between each category and the unannotated label obtained in step S203 is compared with the set threshold, and the category whose relevance is greater than the set threshold is determined as the category to which the unannotated label belongs. The set threshold can be preset empirically by those skilled in the art.
For example, assume that n is 11 and that the 11 related terms determined for the unannotated label "fashion control", together with their PMI values, are as shown in Table 1 below:
Table 1
Related term                  PMI value
Trend                         6.55812357608
Beauty treatment              5.43060103888
Fashion world                 5.00216134394
Shopping                      4.32233188279
American-European model       4.29291474444
Fashion Magazines             4.17438123971
Music                         3.94937319753
Beauty                        3.8109474926
Residence female after 90s    3.01654271469
Perfume                       2.98635197297
Garment coordination          2.79629753825
Among the related terms of the unannotated label "fashion control" shown in Table 1, the labels belonging to the interest category are "beauty treatment", "shopping", "music", "Fashion Magazines", and "perfume", so the number of related terms of "fashion control" that belong to the interest category is 5; the labels belonging to the identity category are "fashion world", "beauty", "residence female after 90s", and "American-European model", so the number of related terms that belong to the identity category is 4; and the labels belonging to the popular category are "trend" and "garment coordination", so the number of related terms that belong to the popular category is 2.
Thus, the relevance between the interest category and "fashion control" is 5, the relevance between the identity category and "fashion control" is 4, and the relevance between the popular category and "fashion control" is 2. If the set threshold is 4, only the relevance of the interest category to "fashion control" is greater than the threshold, so the unannotated label "fashion control" can be assigned to the interest category.
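The counting in steps S203 and S204 can be reproduced for this example with a few lines of Python. The category assignments are copied from the example; the dictionary layout and the function names are illustrative assumptions.

```python
from collections import Counter

# Category of each related term of "fashion control", as listed in the example.
RELATED_TERM_CATEGORY = {
    "trend": "popular",
    "beauty treatment": "interest",
    "fashion world": "identity",
    "shopping": "interest",
    "American-European model": "identity",
    "Fashion Magazines": "interest",
    "music": "interest",
    "beauty": "identity",
    "residence female after 90s": "identity",
    "perfume": "interest",
    "garment coordination": "popular",
}


def relevance_by_category(term_category):
    """S203: relevance of a category is the number of related terms it contains."""
    return Counter(term_category.values())


def categories_above(relevance, threshold):
    """S204: keep categories whose relevance is strictly greater than the threshold."""
    return sorted(cat for cat, count in relevance.items() if count > threshold)


RELEVANCE = relevance_by_category(RELATED_TERM_CATEGORY)
ABOVE_4 = categories_above(RELEVANCE, 4)  # only the interest category
ABOVE_3 = categories_above(RELEVANCE, 3)  # interest and identity
```

The comparison is strict, which is why a threshold of 4 excludes the identity category even though its relevance equals 4.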
As a preferred embodiment, if more than one category is found whose relevance to the unannotated label is greater than the set threshold, the determined categories can be taken as candidate categories, and the candidate category with the highest confidence is selected as the category of the unannotated label. Specifically, for each candidate category, the sum of the PMI values between the unannotated label and those of its related terms that are seed labels of the candidate category is taken as the confidence that the unannotated label belongs to the candidate category; the candidate category with the highest confidence is then selected as the category to which the unannotated label belongs.
For example, still taking the label "fashion control" discussed above, if the set threshold is 3, the categories whose relevance to "fashion control" is greater than the set threshold are the interest category and the identity category, so both can be taken as candidate categories for the unannotated label "fashion control".
Adding up the PMI values in Table 1 between "fashion control" and its related terms that belong to the interest category gives a confidence of about 21 that "fashion control" belongs to the interest category; adding up the PMI values between "fashion control" and its related terms that belong to the identity category gives a confidence of about 16 that "fashion control" belongs to the identity category. Because the confidence for the interest category is greater than the confidence for the identity category, the unannotated label "fashion control" can be assigned to the interest category.
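The confidence calculation can be checked directly against Table 1. In the sketch below the PMI values and category assignments are taken from the example, while the data layout and function names are assumptions. Summing the values as printed in Table 1 gives roughly 20.9 for the interest category and roughly 16.1 for the identity category.

```python
# (category, PMI with "fashion control") for each related term in Table 1.
TABLE_1 = {
    "trend": ("popular", 6.55812357608),
    "beauty treatment": ("interest", 5.43060103888),
    "fashion world": ("identity", 5.00216134394),
    "shopping": ("interest", 4.32233188279),
    "American-European model": ("identity", 4.29291474444),
    "Fashion Magazines": ("interest", 4.17438123971),
    "music": ("interest", 3.94937319753),
    "beauty": ("identity", 3.8109474926),
    "residence female after 90s": ("identity", 3.01654271469),
    "perfume": ("interest", 2.98635197297),
    "garment coordination": ("popular", 2.79629753825),
}


def confidence(table, category):
    """Sum of PMI values of the related terms that belong to the given category."""
    return sum(value for cat, value in table.values() if cat == category)


def best_category(table, candidates):
    """Pick the candidate category with the highest confidence."""
    return max(candidates, key=lambda cat: confidence(table, cat))


INTEREST_CONFIDENCE = confidence(TABLE_1, "interest")
IDENTITY_CONFIDENCE = confidence(TABLE_1, "identity")
CHOSEN = best_category(TABLE_1, ["interest", "identity"])
```

Using the PMI sum rather than the raw count as a tie-breaker lets a category with fewer but more strongly associated related terms win over one with many weak ones.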
More preferably, once the category of the unannotated label has been determined, the credibility that the label belongs to that category may also be compared with a set high-credibility threshold; if the credibility is greater than the set high-credibility threshold, the unannotated label is taken as a seed label of the determined category.
Specifically, after the candidate category with the highest credibility has been determined as the category of the unannotated label, that credibility is compared with the set high-credibility threshold. If the credibility is less than or equal to the set high-credibility threshold, classification proceeds to the next unannotated label. If the credibility is greater than the set high-credibility threshold, the unannotated label is taken as a seed label of the determined category and is used in the category determination process for the next unannotated label. Once the label has been added as a seed label of the category, the number of seed labels available for classification has increased, so the set threshold and the high-credibility threshold may be adaptively increased when determining the category of the next unannotated label, which makes the classification of UGC labels more accurate.
For example, if the set high-credibility threshold is 20, then because the credibility that "fashion control" belongs to the interest category is greater than the high-credibility threshold, "fashion control" may be taken as a high-credibility seed label of the interest category and used in the category determination process for the next unannotated label.
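The promotion of a newly classified label to a seed label can be sketched as follows. The function name `maybe_promote` and the data layout (a dictionary mapping each category to a set of seed labels) are assumptions made for illustration; the adaptive increase of the thresholds is not shown, because the text does not say how the increase is computed.

```python
from typing import Dict, Set


def maybe_promote(
    label: str,
    category: str,
    credibility: float,
    seeds: Dict[str, Set[str]],
    high_threshold: float,
) -> bool:
    """Add the label to the category's seed labels if it is credible enough."""
    # Strictly greater: a credibility equal to the threshold is not promoted.
    if credibility > high_threshold:
        seeds.setdefault(category, set()).add(label)
        return True
    return False


# Hypothetical seed sets; the threshold of 20 follows the example above.
seeds = {"interest": {"clothing", "shopping"}, "identity": {"stylist"}}
promoted = maybe_promote("fashion control", "interest", 23.0, seeds, 20.0)
print(promoted, sorted(seeds["interest"]))
```

After promotion, the enlarged seed set is what the next unannotated label would be compared against.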
Based on the above method for determining the category of UGC labels on a social platform, an embodiment of the present invention further provides a device for determining the category of UGC labels on a social platform. As shown in Figure 3, the device comprises: a PMI computing module 301, a related term selection module 302, a degree of association determining module 303 and a category determination module 304.
The PMI computing module 301 is configured to calculate, for each category, the PMI value between an unannotated label among the UGC labels and each seed label of that category.
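As a rough illustration of what the PMI computing module calculates, the sketch below uses the standard definition of pointwise mutual information, log(p(t, c) / (p(t) p(c))), with each probability estimated as a user count divided by the total number of users who have annotated UGC labels. This is an assumption: the patent's own formula (1) is not reproduced in this text and may differ in detail, for example in the logarithm base. The function name `pmi` and its parameters are hypothetical.

```python
import math


def pmi(n_t: int, n_c: int, n_tc: int, total_users: int) -> float:
    """Standard PMI from user counts.

    n_t: users whose UGC labels contain the unannotated label t
    n_c: users whose UGC labels contain the seed label c
    n_tc: users whose UGC labels contain both t and c
    total_users: users who have annotated any UGC label
    """
    if n_t <= 0 or n_c <= 0 or total_users <= 0:
        raise ValueError("counts must be positive")
    if n_tc == 0:
        # The labels never co-occur, so the PMI is minus infinity.
        return float("-inf")
    # log(p(t,c) / (p(t) * p(c))) with p = count / total_users, simplified.
    return math.log(n_tc * total_users / (n_t * n_c))


print(pmi(10, 10, 10, 100))   # labels that always appear together
print(pmi(50, 20, 10, 100))   # statistically independent labels
```

A positive value means that the two labels appear together more often than independence would predict, which is what makes a seed label a plausible related term.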
The related term selection module 302 is configured to select, according to the PMI values calculated by the PMI computing module 301, the seed labels whose PMI value with the unannotated label is greater than a set value as the related terms of the unannotated label.
In practice, if the selection is limited only by the set value, then for some popular unannotated labels there may be many seed labels that are related to the unannotated label but whose PMI value with it is relatively low.
As a preferred embodiment, in order to reduce the complexity of subsequent operations, in the technical solution provided by the present invention the PMI computing module 301 is further configured, when calculating for each category the PMI value between the unannotated label and each seed label of that category, to also calculate the co-occurrence probability of the unannotated label with each seed label of that category.
Correspondingly, the related term selection module 302 is specifically configured to select the seed labels whose PMI value with the unannotated label is greater than the set value as candidate related terms, and to select from the candidate related terms the n seed labels with the highest co-occurrence probability with the unannotated label as the related terms of the unannotated label. The set value and the number n of related terms to be selected may be chosen by those skilled in the art after weighing the relevant considerations, so as to avoid, as far as possible, selecting related terms whose PMI value with the unannotated label is too small.
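The two-stage selection of related terms (PMI filter first, then the n highest co-occurrence probabilities) can be sketched as follows. The function name `select_related` and all values are hypothetical.

```python
from typing import Dict, List


def select_related(
    pmi_by_seed: Dict[str, float],
    cooc_by_seed: Dict[str, float],
    pmi_min: float,
    n: int,
) -> List[str]:
    """Keep seeds with PMI above pmi_min, then the n most co-occurring."""
    candidates = [s for s, value in pmi_by_seed.items() if value > pmi_min]
    # Rank the candidates by co-occurrence probability, highest first.
    candidates.sort(key=lambda s: cooc_by_seed.get(s, 0.0), reverse=True)
    return candidates[:n]


# Hypothetical PMI values and co-occurrence probabilities for four seeds.
pmi_by_seed = {"a": 5.0, "b": 1.0, "c": 4.0, "d": 6.0}
cooc_by_seed = {"a": 0.02, "b": 0.50, "c": 0.10, "d": 0.05}

related = select_related(pmi_by_seed, cooc_by_seed, pmi_min=3.0, n=2)
print(related)
```

Seed "b" co-occurs most often but is dropped by the PMI filter; among the remaining candidates, the two with the highest co-occurrence probability are kept.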
The degree of association determining module 303 is configured to determine, for each category and according to the related terms of the unannotated label selected by the related term selection module 302, the total number of those related terms that belong to the category as the degree of association between the category and the unannotated label.
The category determination module 304 is configured to determine, according to the degrees of association between each category and the unannotated label determined by the degree of association determining module 303, a category whose degree of association with the unannotated label is greater than the set threshold as the category of the unannotated label.
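The counting step performed by modules 303 and 304 can be sketched as follows. The function name `categories_above_threshold`, the seed names and the category names other than interest and identity are hypothetical; the threshold of 3 follows the "fashion control" example in the method description.

```python
from collections import Counter
from typing import Dict, List


def categories_above_threshold(
    related_terms: List[str],
    seed_category: Dict[str, str],
    threshold: int,
) -> Dict[str, int]:
    """Count related terms per category and keep counts above the threshold."""
    degree = Counter(
        seed_category[term] for term in related_terms if term in seed_category
    )
    return {cat: count for cat, count in degree.items() if count > threshold}


# Hypothetical related terms: 5 interest seeds, 4 identity seeds, 2 region seeds.
seed_category = {f"i{k}": "interest" for k in range(5)}
seed_category.update({f"d{k}": "identity" for k in range(4)})
seed_category.update({f"r{k}": "region" for k in range(2)})
related_terms = list(seed_category)

result = categories_above_threshold(related_terms, seed_category, threshold=3)
print(result)
```

When more than one category passes the threshold, as here, the preferred embodiment resolves the choice by comparing credibilities.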
As a preferred embodiment, the category determination module 304 may be configured, after determining the categories whose degree of association with the unannotated label is greater than the set threshold, to take the determined categories as candidate categories; for each candidate category, to take the sum of the PMI values between the unannotated label and those of its related terms that are seed labels of the candidate category as the credibility that the unannotated label belongs to the candidate category; and to select the candidate category with the highest credibility as the category of the unannotated label.
More preferably, the device for determining the category of UGC labels on a social platform according to the embodiment of the present invention may further comprise a seed label determining module 305.
The seed label determining module 305 is configured, for each category, to find in the social platform corpus the sentences that correspond to the sentence patterns matched by the seed labels of the category, and to extract words from the sentences found as seed labels of the category. The sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear.
Further, the seed label determining module 305 is also configured, for the category determined by the category determination module 304, to take the unannotated label as a seed label of the determined category if the credibility that the unannotated label belongs to the category is greater than the set high-credibility threshold.
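The sentence-pattern step of the seed label determining module could look roughly like the sketch below. The text does not specify how sentence patterns are represented or how words are extracted, so representing a pattern as a regular expression with one capture group is an assumption, and the English pattern and corpus are hypothetical.

```python
import re
from typing import Iterable, List, Pattern, Set


def expand_seeds(corpus: Iterable[str], patterns: List[Pattern[str]]) -> Set[str]:
    """Collect the words captured by any sentence pattern in the corpus."""
    found: Set[str] = set()
    for sentence in corpus:
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                found.add(match.group(1))
    return found


# Hypothetical pattern, imagined as derived from sentences in which manually
# annotated interest seeds appear, e.g. "I am a big fan of travel".
interest_patterns = [re.compile(r"I am a big fan of (\w+)")]

corpus = [
    "I am a big fan of photography.",
    "Today I am a big fan of hiking and tea.",
    "Nothing to see here.",
]
new_seeds = expand_seeds(corpus, interest_patterns)
print(sorted(new_seeds))
```

In this form, a handful of manually annotated seed labels yields patterns, and the patterns in turn yield further seed labels without additional manual annotation.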
In the technical solution of the present invention, the device for determining the category of UGC labels on a social platform calculates, for each category, the PMI values between an unannotated label among the UGC labels and each seed label of the category, in order to determine the related terms of the unannotated label that belong to the category and the degree of association between the category and the unannotated label; a category whose degree of association with the unannotated label is greater than the set threshold is then determined as the category of the unannotated label. Compared with the existing SVM classification technique based on text feature vectors, determining the degree of association from PMI values is simpler than determining it from text feature vectors, which reduces the complexity of UGC label classification and thereby increases its speed.
Further, in the technical solution provided by the present invention, the seed labels of each category can also be expanded automatically from a small number of manually annotated seed labels, which reduces the manual workload of annotating seed labels.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A method for determining the category of UGC labels on a social platform, characterized by comprising:
for each category, calculating the pointwise mutual information (PMI) value between an unannotated label among the user-generated content (UGC) labels and each seed label of the category;
selecting the seed labels whose PMI value with the unannotated label is greater than a set value as the related terms of the unannotated label;
for each category, determining the total number of related terms of the unannotated label that belong to the category as the degree of association between the category and the unannotated label; and
determining a category whose degree of association with the unannotated label is greater than a set threshold as the category of the unannotated label.
2. The method of claim 1, characterized in that, when calculating the pointwise mutual information (PMI) value between the unannotated label among the user-generated content (UGC) labels and each seed label of the category, the method further comprises:
calculating the co-occurrence probability of the unannotated label among the UGC labels with each seed label of the category; and
the selecting of the seed labels whose PMI value with the unannotated label is greater than the set value as the related terms of the unannotated label specifically comprises:
selecting the seed labels whose PMI value with the unannotated label is greater than the set value as candidate related terms; and selecting, from the candidate related terms, the n seed labels with the highest co-occurrence probability with the unannotated label as the related terms of the unannotated label.
3. The method of claim 1 or 2, characterized in that, after determining the category whose degree of association with the unannotated label is greater than the set threshold, the method further comprises:
taking the determined categories as candidate categories; and
for each candidate category, taking the sum of the PMI values between the unannotated label and those of its related terms that are seed labels of the candidate category as the credibility that the unannotated label belongs to the candidate category; and
the category of the unannotated label is specifically the candidate category with the highest credibility.
4. The method of claim 3, characterized in that, after determining the category whose degree of association with the unannotated label is greater than the set threshold as the category of the unannotated label, the method further comprises:
for the determined category of the unannotated label, comparing the credibility that the unannotated label belongs to the category with a set high-credibility threshold; and, if the credibility is greater than the set high-credibility threshold, taking the unannotated label as a seed label of the determined category.
5. The method of claim 4, characterized in that the seed labels of each category are predetermined as follows:
for each category, finding in the social platform corpus the sentences that correspond to the sentence patterns matched by the seed labels of the category, and extracting words from the sentences found as seed labels of the category;
wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear.
6. The method of claim 4, characterized in that calculating the pointwise mutual information (PMI) value between the unannotated label among the user-generated content (UGC) labels and each seed label of the category specifically comprises:
for each seed label of the category, calculating the PMI value between the seed label c and the unannotated label t according to the following formula:
in formula (1), f(t) is the frequency with which t occurs in the UGC labels of the users of the social platform; f(c) is the frequency with which c occurs in the UGC labels of the users of the social platform; f(t, c) is the co-occurrence frequency with which t and c appear together in the UGC labels of one user; and G is the total number of users who have annotated UGC labels on the social platform;
wherein f(t, c) is determined as the ratio of the pre-counted number of times t and c appear together in the UGC labels of one user to the total number of users who have annotated UGC labels on the social platform.
7. A device for determining the category of UGC labels on a social platform, characterized by comprising:
a PMI computing module, configured to calculate, for each category, the pointwise mutual information (PMI) value between an unannotated label among the user-generated content (UGC) labels and each seed label of the category;
a related term selection module, configured to select, according to the PMI values calculated by the PMI computing module, the seed labels whose PMI value with the unannotated label is greater than a set value as the related terms of the unannotated label;
a degree of association determining module, configured to determine, for each category and according to the related terms of the unannotated label selected by the related term selection module, the total number of related terms of the unannotated label that belong to the category as the degree of association between the category and the unannotated label; and
a category determination module, configured to determine, according to the degrees of association between each category and the unannotated label determined by the degree of association determining module, a category whose degree of association with the unannotated label is greater than a set threshold as the category of the unannotated label.
8. The device of claim 7, characterized in that:
the PMI computing module is further configured, when calculating the PMI value between the unannotated label among the UGC labels and each seed label of the category, to also calculate the co-occurrence probability of the unannotated label with each seed label of the category; and
the related term selection module is specifically configured to select the seed labels whose PMI value with the unannotated label is greater than the set value as candidate related terms, and to select, from the candidate related terms, the n seed labels with the highest co-occurrence probability with the unannotated label as the related terms of the unannotated label.
9. The device of claim 7 or 8, characterized in that:
the category determination module is specifically configured, after determining the categories whose degree of association with the unannotated label is greater than the set threshold, to take the determined categories as candidate categories; for each candidate category, to take the sum of the PMI values between the unannotated label and those of its related terms that are seed labels of the candidate category as the credibility that the unannotated label belongs to the candidate category; and to select the candidate category with the highest credibility as the category of the unannotated label.
10. The device of claim 9, characterized by further comprising:
a seed label determining module, configured, for each category, to find in the social platform corpus the sentences that correspond to the sentence patterns matched by the seed labels of the category, and to extract words from the sentences found as seed labels of the category; wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear; and
the seed label determining module is further configured, for the category determined by the category determination module, to compare the credibility that the unannotated label belongs to the category with a set high-credibility threshold and, if the credibility is greater than the set high-credibility threshold, to take the unannotated label as a seed label of the determined category.
CN201310549750.XA 2013-11-07 2013-11-07 UGC label classification determining method and device for social platform Active CN103631874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310549750.XA CN103631874B (en) 2013-11-07 2013-11-07 UGC label classification determining method and device for social platform

Publications (2)

Publication Number Publication Date
CN103631874A CN103631874A (en) 2014-03-12
CN103631874B true CN103631874B (en) 2017-01-18

Family

ID=50212916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310549750.XA Active CN103631874B (en) 2013-11-07 2013-11-07 UGC label classification determining method and device for social platform

Country Status (1)

Country Link
CN (1) CN103631874B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376041B (en) * 2014-10-11 2018-05-18 北京中搜网络技术股份有限公司 A kind of information extraction method based on microblogging classification
CN106033445B (en) * 2015-03-16 2019-10-25 北京国双科技有限公司 The method and apparatus for obtaining article degree of association data
CN105117449B (en) * 2015-08-14 2019-08-16 百度在线网络技术(北京)有限公司 A kind of method and apparatus for generating the label of content item
CN105809478B (en) * 2016-03-07 2020-02-18 优酷网络技术(北京)有限公司 Labeling method and system for advertisement label
CN107402932B (en) * 2016-05-20 2021-04-13 腾讯科技(深圳)有限公司 Expansion processing method of user tag, text recommendation method and text recommendation device
CN106446191B (en) * 2016-09-30 2019-11-05 浙江工业大学 A kind of multiple features network flow row label prediction technique returned based on Logistic
CN109857957B (en) * 2019-01-29 2021-06-15 掌阅科技股份有限公司 Method for establishing label library, electronic equipment and computer storage medium
CN113177102B (en) * 2021-06-30 2021-08-24 智者四海(北京)技术有限公司 Text classification method and device, computing equipment and computer readable medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609459A (en) * 2009-07-21 2009-12-23 北京大学 A kind of extraction system of affective characteristic words
CN103092956A (en) * 2013-01-17 2013-05-08 上海交通大学 Method and system for topic keyword self-adaptive expansion on social network platform
EP2581868A3 (en) * 2011-10-13 2013-07-24 Aol Llc Systems and methods for managing publication of online advertisements

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009039392A1 (en) * 2007-09-21 2009-03-26 The Board Of Trustees Of The University Of Illinois A system for entity search and a method for entity scoring in a linked document database
CN103309857B (en) * 2012-03-06 2018-11-09 深圳市世纪光速信息技术有限公司 A kind of taxonomy determines method and apparatus


Also Published As

Publication number Publication date
CN103631874A (en) 2014-03-12

Similar Documents

Publication Publication Date Title
CN103631874B (en) UGC label classification determining method and device for social platform
CN105808526B (en) Commodity short text core word extracting method and device
CN103646088B (en) Product comment fine-grained emotional element extraction method based on CRFs and SVM
CN108363725B (en) Method for extracting user comment opinions and generating opinion labels
CN109582949A (en) Event element abstracting method, calculates equipment and storage medium at device
CN108595519A (en) Focus incident sorting technique, device and storage medium
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN106570496A (en) Emotion recognition method and device and intelligent interaction method and device
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN104331506A (en) Multiclass emotion analyzing method and system facing bilingual microblog text
CN106970912A (en) Chinese sentence similarity calculating method, computing device and computer-readable storage medium
CN107273348B (en) Topic and emotion combined detection method and device for text
CN102929860B (en) Chinese clause emotion polarity distinguishing method based on context
CN109978020B (en) Social network account number vest identity identification method based on multi-dimensional features
CN105069041A (en) Video user gender classification based advertisement putting method
CN104951542A (en) Method and device for recognizing class of social contact short texts and method and device for training classification models
CN106919575A (en) application program searching method and device
CN110110035A (en) Data processing method and device and computer readable storage medium
CN110209810A (en) Similar Text recognition methods and device
CN107463703A (en) English social media account number classification method based on information gain
CN104090936A (en) News recommendation method based on hypergraph sequencing
CN108346067A (en) Social networks advertisement sending method based on natural language processing
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
Suh et al. Subgraph matching using compactness prior for robust feature correspondence
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant