CN103631874B - UGC label classification determining method and device for social platform - Google Patents
- Publication number
- CN103631874B (application CN201310549750.XA)
- Authority
- CN
- China
- Prior art keywords
- label
- seed
- mark
- classification
- category
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
The invention discloses a method and device for determining the category of UGC labels on a social platform. The method comprises: for each category, calculating the PMI value between an unannotated label among the UGC labels and each seed label of that category; selecting the seed labels whose PMI value with the unannotated label is greater than a set value as related words of the unannotated label; for each category, taking the number of related words of the unannotated label that belong to the category as the relevance between the category and the unannotated label; and determining each category whose relevance to the unannotated label is greater than a set threshold as a category to which the unannotated label belongs. Applying the method reduces the complexity of classifying UGC labels.
Description
Technical field
The present invention relates to Internet technology, and more particularly to a method and device for determining the category of UGC labels on a social platform.
Background art
With the development of Internet technology, sharing, spreading and obtaining information through social platforms has become one of the main ways in which many Internet users socialize. For example, on social platforms such as microblogs or Twitter, users can set up a personal community through various clients, post updates of about 140 characters, and share their latest news and thoughts instantly. A UGC (user generated content) label is a label generated by a user in a social platform environment that describes the user's identity, personality, interests, emotions and the like.
Generally, UGC labels can be classified according to their specific content. For example, labels such as "shopping", "beauty treatment", "constellation intelligent" and "iphone" can be classified as interest labels, and labels such as "doctor" and "after 90s" as identity labels. In social platform technology, UGC labels of different categories can serve as data sources for different application systems. For example, interest labels can be the data source of a recommendation system, and identity labels the data source for dividing users into social circles. Accurate classification of UGC labels therefore matters greatly to how well the application systems of a social platform can be built.
In the prior art, UGC labels are generally classified with an SVM (support vector machine) classification method based on text feature vectors. Specifically, because UGC labels are short texts, each label must first be semantically expanded before classification to form a corresponding text feature vector. The text feature vectors of the annotated labels among the UGC labels (also called seed labels) are then used as the classification corpus to train an SVM model, yielding a classification model. Finally, according to the classification model, the relevance between the unannotated labels and the seed labels is calculated, and the category of each unannotated label is determined from the resulting relevance.
However, on the one hand, this SVM classification method places high demands on the training corpus: each category generally needs more than several hundred corpus items, that is, a very large number of seed labels. In fact, UGC labels are unsupervised data, so a great deal of manual annotation is required to obtain the seed labels and hence the classification corpus. On the other hand, once the classification model has been obtained, the relevance between an unannotated label and a seed label has to be represented by the relevance between their respective text feature vectors, and those vectors must themselves be obtained through semantic expansion. This makes the SVM classification algorithm highly complex and slow to execute.
In summary, the prior-art SVM classification method based on text feature vectors suffers from high complexity and slow execution when classifying UGC labels.
Summary of the invention
Embodiments of the present invention provide a method and device for determining the category of UGC labels on a social platform, so as to reduce the complexity of UGC label classification.
According to one aspect of the present invention, a method for determining the category of UGC labels on a social platform is provided, comprising:
for each category, calculating the pointwise mutual information (PMI) value between an unannotated label among the user generated content (UGC) labels and each seed label of that category;
selecting the seed labels whose PMI value with the unannotated label is greater than a set value as related words of the unannotated label;
for each category, determining the number of related words of the unannotated label that belong to the category as the relevance between the category and the unannotated label;
determining the category whose relevance to the unannotated label is greater than a set threshold as the category to which the unannotated label belongs.
Preferably, calculating the PMI value between the unannotated label among the UGC labels and each seed label of the category further comprises:
calculating the co-occurrence probability of the unannotated label and each seed label of the category; and
selecting the seed labels whose PMI value with the unannotated label is greater than the set value as related words of the unannotated label specifically comprises:
selecting the seed labels whose PMI value with the unannotated label is greater than the set value as candidate related words; and selecting, from the candidate related words, the n seed labels with the highest co-occurrence probability with the unannotated label as the related words of the unannotated label.
Preferably, after determining the categories whose relevance to the unannotated label is greater than the set threshold, the method further comprises:
taking the determined categories as candidate categories; and
for each candidate category, taking the sum of the PMI values between the unannotated label and those of its related words that are seed labels of the candidate category as the credibility that the unannotated label belongs to the candidate category; and
the category to which the unannotated label belongs is specifically the candidate category with the highest credibility.
Preferably, after determining the category whose relevance to the unannotated label is greater than the set threshold as the category to which the unannotated label belongs, the method further comprises:
comparing the credibility that the unannotated label belongs to the determined category with a set high-credibility threshold; and, if the credibility is greater than the set high-credibility threshold, taking the unannotated label as a seed label of the determined category.
Preferably, the seed labels of each category are predetermined as follows:
for each category, according to the sentence patterns matched by the seed labels of the category, finding the sentences in the social platform corpus that correspond to those sentence patterns, and extracting words from the sentences found as seed labels of the category;
wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear.
Preferably, calculating the PMI value between the unannotated label among the UGC labels and each seed label of the category specifically comprises:
for each seed label of the category, calculating the PMI value between the seed label c and the unannotated label t according to the following formula (written here in the standard PMI form implied by the definitions that follow; the logarithm base is not stated):
$$\mathrm{PMI}(t, c) = \log \frac{f(t, c)}{f(t)\, f(c)} \qquad (1)$$
In formula (1), f(t) is the frequency with which t appears in the UGC labels of the users of the social platform; f(c) is the frequency with which c appears in the UGC labels of the users of the social platform; f(t, c) is the co-occurrence frequency with which t and c appear together in the UGC labels of one user; G is the total number of users on the social platform who have UGC labels;
wherein f(t, c) is determined as the ratio of the pre-counted number of times t and c appear together in the UGC labels of one user to the total number of users on the social platform who have UGC labels.
According to another aspect of the present invention, a device for determining the category of UGC labels on a social platform is also provided, comprising:
a PMI calculation module, configured to calculate, for each category, the PMI value between an unannotated label among the UGC labels and each seed label of that category;
a related word selection module, configured to select, according to the PMI values calculated by the PMI calculation module, the seed labels whose PMI value with the unannotated label is greater than a set value as related words of the unannotated label;
a relevance determination module, configured to determine, for each category and according to the related words selected by the related word selection module, the number of related words of the unannotated label that belong to the category as the relevance between the category and the unannotated label;
a category determination module, configured to determine, according to the relevance between each category and the unannotated label determined by the relevance determination module, the category whose relevance to the unannotated label is greater than a set threshold as the category to which the unannotated label belongs.
Preferably, the PMI calculation module is further configured to calculate, while calculating the PMI value between the unannotated label and each seed label of the category, the co-occurrence probability of the unannotated label and each seed label of the category; and
the related word selection module is specifically configured to select the seed labels whose PMI value with the unannotated label is greater than the set value as candidate related words, and to select, from the candidate related words, the n seed labels with the highest co-occurrence probability with the unannotated label as the related words of the unannotated label.
Preferably, the category determination module is specifically configured to: after determining the categories whose relevance to the unannotated label is greater than the set threshold, take the determined categories as candidate categories; for each candidate category, take the sum of the PMI values between the unannotated label and those of its related words that are seed labels of the candidate category as the credibility that the unannotated label belongs to the candidate category; and select the candidate category with the highest credibility as the category to which the unannotated label belongs.
Preferably, the device for determining the category of UGC labels on a social platform further comprises:
a seed label determination module, configured to find, for each category and according to the sentence patterns matched by the seed labels of the category, the sentences in the social platform corpus that correspond to those sentence patterns, and to extract words from the sentences found as seed labels of the category; wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear; and
the seed label determination module is further configured to compare, for the category determined by the category determination module, the credibility that the unannotated label belongs to that category with a set high-credibility threshold, and, if the credibility is greater than the set high-credibility threshold, to take the unannotated label as a seed label of the determined category.
In the technical solution of the embodiments of the present invention, for each category, related words of the unannotated label are selected from the seed labels of the category according to the PMI values between the unannotated label and each seed label of the category, and the relevance between the category and the unannotated label is determined; then, according to the relevance determined for each category, the category whose relevance to the unannotated label is greater than the set threshold is taken as the category to which the unannotated label belongs. Compared with the existing SVM classification method based on text feature vectors, determining relevance from PMI values is simpler than determining it from text feature vectors, which reduces the complexity of UGC label classification and thereby improves classification speed.
Further, the technical solution provided by the embodiments of the present invention can also automatically extend the seed labels of each category from a small number of manually annotated seed labels, reducing the manual workload of annotating seed labels.
Brief description of the drawings
Fig. 1 is a flowchart of the seed label determination method of an embodiment of the present invention;
Fig. 2 is a flowchart of the UGC label category determination method for a social platform of an embodiment of the present invention;
Fig. 3 is a schematic diagram of the internal structure of the UGC label category determination device for a social platform of an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that many of the details listed in the description are only intended to give the reader a thorough understanding of one or more aspects of the invention; these aspects can be realized even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to: a process running on a processor, a processor, an object, an executable program, a thread of execution, a program and/or a computer. For example, both an application running on a computing device and the computing device itself can be modules. One or more modules may reside within a process and/or thread of execution.
The inventors considered that, for each category, related words of an unannotated label can be selected from the seed labels of the category according to the PMI (pointwise mutual information) values between the unannotated label among the UGC labels and each seed label of the category, and the relevance between the category and the unannotated label can be determined; then, according to the relevance determined for each category, the category whose relevance to the unannotated label is greater than a set threshold is taken as the category to which the unannotated label belongs. Compared with the existing SVM classification method based on text feature vectors, this method does not need to semantically expand UGC labels to form text feature vectors, and determining relevance from PMI values is simpler than determining it from text feature vectors, so the complexity of UGC label classification can be reduced and classification speed improved.
Further, the inventors also considered that the seed labels of each category can be extended automatically from a small number of manually annotated seed labels, to reduce the manual workload of annotating seed labels.
The technical solution is described in detail below with reference to the accompanying drawings.
In the embodiments of the present invention, before the category of a UGC label on the social platform is determined, the seed labels of each category can be predetermined by manual annotation.
As a preferred embodiment, the technical solution can also automatically extend a small number of manually annotated seed labels to obtain more seed labels. The method is shown in Fig. 1 and comprises the following steps:
S101: for each category, determine the sentence patterns matched by the seed labels of the category from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear.
The social platform in the embodiments of the present invention may be a microblog, Twitter, or the like. Accordingly, several seed labels, such as "shopping", "beauty treatment" and "constellation intelligent", can first be annotated manually. Preferably, to extend the seed labels more effectively, words that span as wide a range of categories as possible and express different styles are selected as the manually annotated seed labels. Then, using the Wu-Manber multi-pattern matching method, the manually annotated seed labels are matched against the microblog corpus to find the sentences in which they appear. In practice, because one seed label may match many microblog entries, the several most frequent sentence patterns matched by the seed labels of the category can be extracted from those sentences, such as "who likes ..." and "... is a good choice".
S102: for each category, according to the sentence patterns matched by the seed labels of the category, find the sentences in the social platform corpus that correspond to those sentence patterns, and extract words from the sentences found as seed labels of the category.
Specifically, according to the sentence patterns thus determined, the sentences corresponding to those patterns are found in the microblog corpus, and more interest words, such as "cuisines" and "fashion", are extracted from the sentences found as seed labels of the category.
In practice, after the extracted words have been taken as seed labels, the above method can be applied iteratively to further extend the number of seed labels of the category. The number of iterations can be set by those skilled in the art according to actual needs and effects. In this way, the number of seed labels of a category can be extended from a small number of manually annotated seed labels, reducing the manual workload of annotating seed labels.
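Steps S101 and S102 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: plain substring search and regular expressions stand in for the Wu-Manber multi-pattern matching named above, a whole sentence with the seed replaced by a slot is treated as a "sentence pattern", and the function names `learn_patterns` and `extract_seeds` and the toy corpus are hypothetical.

```python
import re
from collections import Counter

def learn_patterns(corpus, seeds, top_k=2):
    """S101 (sketch): replace a seed occurrence with a slot and keep the
    most frequent resulting sentence patterns."""
    counts = Counter()
    for sentence in corpus:
        for seed in seeds:
            if seed in sentence:  # plain substring search, not Wu-Manber
                counts[sentence.replace(seed, "{}", 1)] += 1
    return [pattern for pattern, _ in counts.most_common(top_k)]

def extract_seeds(corpus, patterns):
    """S102 (sketch): find sentences that fit a pattern and extract the
    word that fills the slot."""
    found = set()
    for pattern in patterns:
        prefix, suffix = pattern.split("{}", 1)
        regex = re.compile(
            "^" + re.escape(prefix) + r"(\w+)" + re.escape(suffix) + "$"
        )
        for sentence in corpus:
            match = regex.match(sentence)
            if match:
                found.add(match.group(1))
    return found

# Toy corpus and manually annotated seeds (hypothetical data).
corpus = [
    "who likes shopping",
    "who likes beauty",
    "who likes food",
    "music is a good choice",
    "fashion is a good choice",
    "nice weather today",
]
seeds = {"shopping", "beauty", "music"}

patterns = learn_patterns(corpus, seeds)
new_seeds = extract_seeds(corpus, patterns) - seeds
print(patterns)
print(sorted(new_seeds))
```

Running `learn_patterns` and `extract_seeds` again with the enlarged seed set corresponds to the iteration described above.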
An embodiment of the present invention provides a method for determining the category of UGC labels on a social platform. The flow is shown in Fig. 2 and comprises the following steps:
S201: for each category, calculate the PMI value between the unannotated label and each seed label of the category.
For ease of description, the unannotated label mentioned here refers to a label among the UGC labels whose category has not yet been determined. In this step, for each category and for each seed label of the category, the PMI value between the seed label c and the unannotated label t can be calculated according to the following formula (written here in the standard PMI form implied by the definitions that follow; the logarithm base is not stated):
$$\mathrm{PMI}(t, c) = \log \frac{f(t, c)}{f(t)\, f(c)} \qquad (1)$$
In formula 1, f(t) is the frequency with which t appears in the UGC labels of the users of the social platform; f(c) is the frequency with which c appears in the UGC labels of the users of the social platform; f(t, c) is the co-occurrence frequency with which t and c appear together in the UGC labels of one user; G is the total number of users on the social platform who have UGC labels.
In practice, f(t) is determined as the ratio of the pre-counted number of times t appears in the UGC labels of the users of the social platform to the total number of users on the social platform who have UGC labels; f(c) is determined as the ratio of the pre-counted number of times c appears in the UGC labels of the users of the social platform to that total number of users; and f(t, c) is determined as the ratio of the pre-counted number of times t and c appear together in the UGC labels of one user to that total number of users.
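A minimal Python sketch of this calculation follows. It assumes the standard PMI form log(f(t, c) / (f(t) f(c))) with the natural logarithm (the base is not stated in the text), and the function name `pmi`, its parameter names and the toy counts are hypothetical.

```python
import math

def pmi(n_t: int, n_c: int, n_tc: int, total_users: int) -> float:
    """PMI between unannotated label t and seed label c.

    n_t, n_c    : number of users whose UGC labels contain t / c
    n_tc        : number of users whose UGC labels contain both t and c
    total_users : G, the number of users who have UGC labels
    """
    if n_tc == 0:
        return float("-inf")  # t and c never co-occur
    f_t = n_t / total_users
    f_c = n_c / total_users
    f_tc = n_tc / total_users
    return math.log(f_tc / (f_t * f_c))  # natural log is an assumption

# Toy counts (hypothetical): t on 50 users, c on 200, both on 20, G = 10000.
value = pmi(n_t=50, n_c=200, n_tc=20, total_users=10_000)
print(value)
```

Because every frequency is a count divided by G, the same value can be computed directly from counts as log(G * n_tc / (n_t * n_c)).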
More preferably, for each category, when the PMI value between the unannotated label and each seed label of the category is calculated, the co-occurrence probability of the unannotated label and each seed label of the category can also be calculated.
Specifically, the co-occurrence probability of an unannotated label and a seed label is determined as the ratio of the number of times the unannotated label and the seed label appear together in the UGC labels of one user to the total number of users on the social platform who have UGC labels.
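The counts behind these ratios can be gathered in one pass over the users' label sets. The following is a minimal sketch under assumptions: the input is a hypothetical mapping from user id to that user's UGC labels, and `count_labels` and `cooccurrence_probability` are illustrative names.

```python
from collections import Counter
from itertools import combinations

def count_labels(user_labels):
    """Return (per-label user counts, per-pair user counts, G)."""
    single, pair, users = Counter(), Counter(), 0
    for labels in user_labels.values():
        labels = set(labels)
        if not labels:
            continue  # users without UGC labels are not counted in G
        users += 1
        single.update(labels)
        pair.update(frozenset(p) for p in combinations(sorted(labels), 2))
    return single, pair, users

def cooccurrence_probability(pair, users, t, c):
    """Users carrying both t and c, divided by G."""
    return pair[frozenset((t, c))] / users

# Toy data (hypothetical).
user_labels = {
    "u1": ["fashion control", "shopping", "music"],
    "u2": ["fashion control", "shopping"],
    "u3": ["doctor"],
    "u4": [],
}
single, pair, users = count_labels(user_labels)
p = cooccurrence_probability(pair, users, "fashion control", "shopping")
print(users, single["fashion control"], p)
```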
S202: select the seed labels whose PMI value with the unannotated label is greater than the set value as related words of the unannotated label.
In this step, for each category, according to the calculated PMI values between the unannotated label and each seed label of the category, the seed labels whose PMI value with the unannotated label is greater than the set value are selected as related words of the unannotated label.
In practice, if only the set value is used as the limit, then for some popular unannotated labels there may be many seed labels that are related to the unannotated label but have a relatively low PMI value with it.
As a preferred embodiment, to reduce the complexity of subsequent operations while keeping the determined category accurate, for each category, after the PMI value between the unannotated label and each seed label of the category has been calculated, the seed labels whose PMI value with the unannotated label is greater than the set value can be selected as candidate related words; then the n seed labels with the highest co-occurrence probability with the unannotated label are selected from the candidate related words as the related words of the unannotated label. The set value and the number n of related words to select can be set by those skilled in the art after overall consideration, avoiding as far as possible related words whose PMI value with the unannotated label is too small.
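The two-stage selection can be sketched in a few lines of Python. This is a minimal illustration, assuming the PMI values and co-occurrence probabilities have already been computed per seed label; the function name `select_related_words` and the toy numbers are hypothetical.

```python
def select_related_words(pmi_by_seed, cooc_by_seed, set_value, n):
    """Keep seeds with PMI above set_value, then take the n seeds with the
    highest co-occurrence probability."""
    candidates = [seed for seed, v in pmi_by_seed.items() if v > set_value]
    candidates.sort(key=lambda seed: cooc_by_seed.get(seed, 0.0), reverse=True)
    return candidates[:n]

# Toy values (hypothetical).
pmi_by_seed = {"trend": 6.5, "shopping": 4.3, "music": 3.9, "perfume": 1.0}
cooc_by_seed = {"trend": 0.001, "shopping": 0.02, "music": 0.005, "perfume": 0.03}

related = select_related_words(pmi_by_seed, cooc_by_seed, set_value=2.0, n=2)
print(related)
```

In this toy run "perfume" has the highest co-occurrence probability but is filtered out first because its PMI value does not exceed the set value.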
S203: for each category, determine the number of related words of the unannotated label that belong to the category as the relevance between the category and the unannotated label.
In this step, for each category, the related words of the unannotated label determined in step S202 that belong to the category are counted, and the resulting count is taken as the relevance between the category and the unannotated label.
S204: determine the category whose relevance to the unannotated label is greater than the set threshold as the category to which the unannotated label belongs.
In this step, the relevance between each category and the unannotated label obtained in step S203 is compared with the set threshold, and a category whose relevance is greater than the set threshold is determined as the category to which the unannotated label belongs. The set threshold can be preset empirically by those skilled in the art.
For example, suppose n is 11, and the 11 related words determined for the unannotated label "fashion control" and their PMI values are as shown in Table 1 below:
Table 1
| Related word | PMI value |
| --- | --- |
| Trend | 6.55812357608 |
| Beauty treatment | 5.43060103888 |
| Fashion world | 5.00216134394 |
| Shopping | 4.32233188279 |
| American-European model | 4.29291474444 |
| Fashion Magazines | 4.17438123971 |
| Music | 3.94937319753 |
| Beauty | 3.8109474926 |
| Residence female after 90s | 3.01654271469 |
| Perfume | 2.98635197297 |
| Garment coordination | 2.79629753825 |
Among the related words of the unannotated label "fashion control" shown in Table 1, the labels belonging to the interest category are "beauty treatment", "shopping", "music", "Fashion Magazines" and "perfume", so the number of related words of "fashion control" that belong to the interest category is 5. The labels belonging to the identity category are "fashion world", "beauty", "residence female after 90s" and "American-European model", so the number belonging to the identity category is 4. The labels belonging to the popular category are "trend" and "garment coordination", so the number belonging to the popular category is 2.
Thus the relevance between the interest category and "fashion control" is 5, the relevance between the identity category and "fashion control" is 4, and the relevance between the popular category and "fashion control" is 2. If the set threshold is 4, only the relevance between the interest category and "fashion control" is greater than the set threshold, so the unannotated label "fashion control" can be assigned to the interest category.
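The counting in steps S203 and S204 can be reproduced for this example with a short Python sketch. The category assignments are taken from the paragraph above; the function names `relevance_by_category` and `categories_above` are hypothetical.

```python
from collections import Counter

# Related words of "fashion control" and the category each belongs to,
# as stated in the worked example.
RELATED_WORD_CATEGORY = {
    "beauty treatment": "interest",
    "shopping": "interest",
    "music": "interest",
    "Fashion Magazines": "interest",
    "perfume": "interest",
    "fashion world": "identity",
    "beauty": "identity",
    "residence female after 90s": "identity",
    "American-European model": "identity",
    "trend": "popular",
    "garment coordination": "popular",
}

def relevance_by_category(related_word_category):
    """S203: relevance = number of related words that belong to the category."""
    return Counter(related_word_category.values())

def categories_above(relevance, threshold):
    """S204: categories whose relevance is strictly greater than the threshold."""
    return sorted(c for c, r in relevance.items() if r > threshold)

relevance = relevance_by_category(RELATED_WORD_CATEGORY)
print(dict(relevance))
print(categories_above(relevance, 4))  # ['interest']
print(categories_above(relevance, 3))  # ['identity', 'interest']
```

With a threshold of 3 two categories pass, which is the situation that the credibility comparison described next is meant to resolve.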
As a preferred embodiment, after the categories whose relevance to the unannotated label is greater than the set threshold have been determined, if more than one category has been determined, these categories can be taken as candidate categories, and the one with the highest credibility is selected from them as the category of the unannotated label. For each candidate category, the sum of the PMI values between the unannotated label and those of its related words that are seed labels of the candidate category is taken as the credibility that the unannotated label belongs to the candidate category; the candidate category with the highest credibility is then selected as the category to which the unannotated label belongs.
For example, still taking "fashion control" above, if the set threshold is 3, the categories whose relevance to "fashion control" is greater than the set threshold include the interest category and the identity category, so the interest category and the identity category are taken as the candidate categories of the unannotated label "fashion control".
Adding up the PMI values between "fashion control" and each of its related words that belong to the interest category gives a credibility of about 23 that "fashion control" belongs to the interest category; adding up the PMI values between "fashion control" and each of its related words that belong to the identity category gives a credibility of about 16 that "fashion control" belongs to the identity category. Because the credibility that "fashion control" belongs to the interest category is greater than the credibility that it belongs to the identity category, the unannotated label "fashion control" can be assigned to the interest category.
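The credibility comparison can be checked against Table 1 with a minimal Python sketch; the names `RELATED_WORDS` and `credibility` are hypothetical. Summing the Table 1 values exactly as listed gives about 20.9 for the interest category (the text states about 23) and about 16.1 for the identity category, so the ranking is the same either way.

```python
# (category, PMI with "fashion control") for each related word in Table 1.
RELATED_WORDS = {
    "trend": ("popular", 6.55812357608),
    "beauty treatment": ("interest", 5.43060103888),
    "fashion world": ("identity", 5.00216134394),
    "shopping": ("interest", 4.32233188279),
    "American-European model": ("identity", 4.29291474444),
    "Fashion Magazines": ("interest", 4.17438123971),
    "music": ("interest", 3.94937319753),
    "beauty": ("identity", 3.8109474926),
    "residence female after 90s": ("identity", 3.01654271469),
    "perfume": ("interest", 2.98635197297),
    "garment coordination": ("popular", 2.79629753825),
}

def credibility(related_words, candidates):
    """Sum of PMI values of the related words in each candidate category."""
    totals = {category: 0.0 for category in candidates}
    for category, value in related_words.values():
        if category in totals:
            totals[category] += value
    return totals

cred = credibility(RELATED_WORDS, ["interest", "identity"])
best = max(cred, key=cred.get)
print(cred, best)
```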
More preferably, for the category determined for the unannotated label, the credibility that the unannotated label belongs to that category can also be compared with a set high-credibility threshold; if the credibility is greater than the set high-credibility threshold, the unannotated label is taken as a seed label of the determined category.
Specifically, after the candidate category with the highest credibility has been determined as the category of the unannotated label, the credibility is compared with the set high-credibility threshold. If the credibility is less than or equal to the set high-credibility threshold, classification proceeds to the next unannotated label. If the credibility is greater than the set high-credibility threshold, the unannotated label is taken as a seed label of the determined category and is used when determining the category of the next unannotated label. Once the unannotated label has become a seed label of the category, the number of seed labels available for classification has increased, so when the category of the next unannotated label is determined, the set threshold and the high-credibility threshold can be increased adaptively to make the classification of UGC labels more accurate.
For example, if the high-credibility threshold is set to 20, then because the credibility that "fashion control" belongs to the interest category is greater than the high-credibility threshold, "fashion control" can be taken as a high-credibility seed label of the interest category and used when determining the category of the next untagged label.
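The promotion step described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `promote_if_confident`, the dictionary layout of the seed sets, the example seed labels, and the credibility value 25.0 are all assumptions made for the example; only the threshold of 20 comes from the text.

```python
def promote_if_confident(label, category, credibility, seeds, high_threshold):
    """Add `label` to the seed set of `category` when its credibility
    is strictly greater than the high-credibility threshold.

    `seeds` maps a category name to a set of seed labels (assumed layout).
    Returns True if the label was promoted, False otherwise.
    """
    if credibility > high_threshold:
        seeds.setdefault(category, set()).add(label)
        return True
    return False


# Hypothetical seed sets and credibility value, for illustration only.
seeds = {"interest": {"shopping", "travel"}, "identity": {"student"}}
promoted = promote_if_confident(
    "fashion control", "interest", 25.0, seeds, high_threshold=20
)
```

After a promotion, the enlarged seed set is simply passed to the classification of the next untagged label; how far the thresholds are then raised is left open by the text and is not modeled here.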
Based on the above method for determining the category of UGC labels on a social platform, an embodiment of the present invention further provides a device for determining the category of UGC labels on a social platform. As shown in Figure 3, the device comprises a PMI computing module 301, a related-word selection module 302, a relevance determining module 303, and a category determining module 304.
The PMI computing module 301 is configured to calculate, for each category, the PMI value between an untagged label among the UGC labels and each seed label of that category.
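As a rough illustration of what such a module computes, the sketch below uses the textbook definition of pointwise mutual information, PMI(t, c) = log(p(t, c) / (p(t) p(c))), with probabilities estimated as fractions of users. The function names, the representation of each user's labels as a Python set, the natural logarithm, and the choice to return negative infinity for labels that never co-occur are assumptions of this sketch; the formula used by the invention may differ in detail.

```python
import math
from collections import Counter
from itertools import combinations


def count_labels(user_labels):
    """Count, over all users, how many users carry each label and each label pair.

    `user_labels` is a list with one set of UGC labels per user (assumed layout).
    """
    single = Counter()
    pair = Counter()
    for labels in user_labels:
        single.update(labels)
        # Store each unordered pair once, in sorted order.
        for a, b in combinations(sorted(labels), 2):
            pair[(a, b)] += 1
    return single, pair


def pmi(t, c, single, pair, num_users):
    """Textbook PMI between untagged label t and seed label c."""
    key = (t, c) if t < c else (c, t)
    joint = pair.get(key, 0)
    if joint == 0:
        return float("-inf")  # assumption: never co-occurring means no association
    p_t = single[t] / num_users
    p_c = single[c] / num_users
    p_tc = joint / num_users
    return math.log(p_tc / (p_t * p_c))


users = [{"a", "b"}, {"a", "b"}, {"c"}, {"d"}]
single, pair = count_labels(users)
value = pmi("a", "b", single, pair, num_users=len(users))
```

Counting the statistics once and reusing them for every (untagged label, seed label) pair keeps the per-pair cost to a few dictionary lookups.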
The related-word selection module 302 is configured to select, according to the PMI values calculated by the PMI computing module 301, the seed labels whose PMI value with the untagged label is greater than a set value as the related words of the untagged label.
In practice, if the set value is the only constraint, some popular untagged labels may end up with many seed labels that are related to the untagged label but have a relatively low PMI value with it.
As a preferred embodiment, in order to reduce the complexity of subsequent operations, in the technical solution provided by the present invention the PMI computing module 301 is further configured, when calculating for each category the PMI values between the untagged label and the seed labels of the category, to also calculate the co-occurrence probability of the untagged label and each seed label of the category.
Correspondingly, the related-word selection module 302 is specifically configured to select the seed labels whose PMI value with the untagged label is greater than the set value as candidate related words, and to select from the candidate related words the n seed labels with the greatest co-occurrence probability with the untagged label as the related words of the untagged label. The set value and the number n of related words to select can be chosen by a person skilled in the art after weighing the relevant factors, so as to avoid, as far as possible, selecting related words whose PMI value with the untagged label is small.
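The two-stage selection can be sketched as below: filter by PMI, then keep the n candidates with the greatest co-occurrence probability. The function name, the two input dictionaries keyed by seed label, and the sample values are assumptions made for illustration.

```python
def select_related_words(pmi_by_seed, cooccur_by_seed, set_value, n):
    """Pick related words for one untagged label.

    pmi_by_seed:     {seed label: PMI with the untagged label}
    cooccur_by_seed: {seed label: co-occurrence probability with the untagged label}
    """
    # Stage 1: candidate related words have a PMI strictly above the set value.
    candidates = [seed for seed, value in pmi_by_seed.items() if value > set_value]
    # Stage 2: keep the n candidates with the greatest co-occurrence probability.
    candidates.sort(key=lambda seed: cooccur_by_seed.get(seed, 0.0), reverse=True)
    return candidates[:n]


# Hypothetical values for four seed labels.
pmi_values = {"a": 3.0, "b": 0.5, "c": 2.0, "d": 4.0}
cooccurrence = {"a": 0.01, "b": 0.2, "c": 0.05, "d": 0.002}
related = select_related_words(pmi_values, cooccurrence, set_value=1.0, n=2)
```

Here "b" is dropped despite its high co-occurrence probability because its PMI is below the set value, and "d" is dropped despite its high PMI because it rarely co-occurs with the untagged label.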
The relevance determining module 303 is configured to determine, for each category and according to the related words of the untagged label selected by the related-word selection module 302, the total number of the untagged label's related words that belong to the category as the relevance between the category and the untagged label.
The category determining module 304 is configured to determine, according to the relevance between each category and the untagged label determined by the relevance determining module 303, a category whose relevance to the untagged label is greater than a set threshold as the category to which the untagged label belongs.
As a preferred embodiment, the category determining module 304 may be configured, after determining the categories whose relevance to the untagged label is greater than the set threshold, to take the determined categories as candidate categories; for each candidate category, to take the sum of the PMI values between the untagged label and those of its related words that are seed labels of the candidate category as the credibility that the untagged label belongs to the candidate category; and to select the candidate category with the greatest credibility as the category to which the untagged label belongs.
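The relevance count, the candidate filter, and the credibility comparison can be sketched together as follows. The function name, the data layout, the sample labels and PMI values, and the simplification that each seed label belongs to exactly one category are assumptions of this sketch.

```python
def classify(related_pmi, seed_category, set_threshold):
    """Determine the category of one untagged label.

    related_pmi:   {related word (a seed label): PMI with the untagged label}
    seed_category: {seed label: category}  (one category per seed, by assumption)
    Returns (category, credibility), or (None, 0.0) if no category qualifies.
    """
    relevance = {}    # category -> number of related words in that category
    credibility = {}  # category -> sum of PMI values of those related words
    for seed, value in related_pmi.items():
        category = seed_category[seed]
        relevance[category] = relevance.get(category, 0) + 1
        credibility[category] = credibility.get(category, 0.0) + value

    # Candidate categories: relevance strictly greater than the set threshold.
    candidates = [c for c, count in relevance.items() if count > set_threshold]
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda c: credibility[c])
    return best, credibility[best]


# Hypothetical related words for one untagged label.
seed_category = {
    "shopping": "interest", "travel": "interest", "beauty": "interest",
    "student": "identity", "designer": "identity", "editor": "identity",
}
related_pmi = {
    "shopping": 9.0, "travel": 8.0, "beauty": 7.5,
    "student": 8.0, "designer": 8.0, "editor": 1.0,
}
category, score = classify(related_pmi, seed_category, set_threshold=2)
```

Both categories pass the count threshold in this example, so the decision falls to the summed PMI values, which is the situation the credibility comparison is meant to resolve.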
More preferably, the device for determining the category of UGC labels on a social platform according to the embodiment of the present invention may further comprise a seed label determining module 305.
The seed label determining module 305 is configured, for each category, to find sentences in the social platform corpus that correspond to the sentence patterns matched by the seed labels of the category, and to extract words from the sentences found as seed labels of the category. The sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear.
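The pattern-matching half of this step can be sketched with regular expressions. The patterns, the English example sentences, and the function name below are hypothetical; the sketch assumes the sentence patterns for a category are already known and does not cover deriving them from the manually annotated seed labels.

```python
import re


def extract_seed_labels(corpus, patterns):
    """Extract candidate seed labels from sentences matching known sentence patterns.

    corpus:   iterable of sentences from the social platform
    patterns: regular expressions with one capturing group for the label word
    """
    found = set()
    for sentence in corpus:
        for pattern in patterns:
            for match in re.finditer(pattern, sentence):
                found.add(match.group(1).lower())
    return found


# Hypothetical sentence patterns for an "identity" category.
identity_patterns = [
    r"\bI work as an? (\w+)",
    r"\bI am an? (\w+) by profession",
]
corpus = [
    "I work as a designer in Beijing.",
    "I am an engineer by profession.",
    "Nice weather today.",
]
identity_seeds = extract_seed_labels(corpus, identity_patterns)
```

A real corpus would need word segmentation and far more patterns per category; the point here is only that a handful of patterns can turn a few manually annotated seeds into a larger seed set.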
Further, the seed label determining module 305 is also configured, for the category determined by the category determining module 304, to take the untagged label as a seed label of the determined category if the credibility that the untagged label belongs to the category is greater than the preset high-credibility threshold.
In the technical solution of the present invention, the device for determining the category of UGC labels on a social platform calculates, for each category, the PMI values between an untagged label among the UGC labels and each seed label of the category, in order to determine the related words of the untagged label that belong to the category and the relevance between the category and the untagged label. A category whose relevance to the untagged label is greater than the set threshold is then determined as the category to which the untagged label belongs. Compared with the existing SVM classification method based on text feature vectors, determining relevance from PMI values is simpler than determining relevance from text feature vectors. This reduces the complexity of UGC label classification and thereby increases its speed.
Further, in the technical solution provided by the present invention, the seed labels of each category can also be expanded automatically from a small number of manually annotated seed labels, which reduces the workload of manually annotating seed labels.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The above is only the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications should also be regarded as falling within the scope of protection of the present invention.
Claims (10)
1. A method for determining the category of UGC labels on a social platform, characterized by comprising:
for each category, calculating the pointwise mutual information (PMI) value between an untagged label among the user-generated content (UGC) labels and each seed label of the category;
selecting the seed labels whose PMI value with the untagged label is greater than a set value as the related words of the untagged label;
for each category, determining the total number of the untagged label's related words that belong to the category as the relevance between the category and the untagged label; and
determining a category whose relevance to the untagged label is greater than a set threshold as the category to which the untagged label belongs.
2. The method of claim 1, characterized in that calculating the PMI value between the untagged label among the UGC labels and each seed label of the category further comprises:
calculating the co-occurrence probability of the untagged label among the UGC labels and each seed label of the category; and
selecting the seed labels whose PMI value with the untagged label is greater than the set value as the related words of the untagged label specifically comprises:
selecting the seed labels whose PMI value with the untagged label is greater than the set value as candidate related words; and selecting from the candidate related words the n seed labels with the greatest co-occurrence probability with the untagged label as the related words of the untagged label.
3. The method of claim 1 or 2, characterized in that, after determining the categories whose relevance to the untagged label is greater than the set threshold, the method further comprises:
taking the determined categories as candidate categories; and
for each candidate category, taking the sum of the PMI values between the untagged label and those of its related words that are seed labels of the candidate category as the credibility that the untagged label belongs to the candidate category; and
the category to which the untagged label belongs is specifically the candidate category with the greatest credibility.
4. The method of claim 3, characterized in that, after determining the category whose relevance to the untagged label is greater than the set threshold as the category to which the untagged label belongs, the method further comprises:
for the determined category of the untagged label, comparing the credibility that the untagged label belongs to the category with a preset high-credibility threshold; and if the credibility is greater than the preset high-credibility threshold, taking the untagged label as a seed label of the determined category.
5. The method of claim 4, characterized in that the seed labels of each category are predetermined as follows:
for each category, finding sentences in the social platform corpus that correspond to the sentence patterns matched by the seed labels of the category, and extracting words from the sentences found as seed labels of the category;
wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear.
6. The method of claim 4, characterized in that calculating the PMI value between the untagged label among the UGC labels and each seed label of the category specifically comprises:
for each seed label of the category, calculating the PMI value between the seed label c and the untagged label t according to the following formula:
In formula (1), f(t) is the frequency with which t occurs in the UGC labels of the users of the social platform; f(c) is the frequency with which c occurs in the UGC labels of the users of the social platform; f(t, c) is the co-occurrence frequency with which t and c appear together in the UGC labels of one user; and G is the total number of users on the social platform who have annotated UGC labels;
wherein f(t, c) is determined as the ratio of the pre-counted number of times t and c appear together in the UGC labels of one user to the total number of users on the social platform who have annotated UGC labels.
7. A device for determining the category of UGC labels on a social platform, characterized by comprising:
a PMI computing module, configured to calculate, for each category, the pointwise mutual information (PMI) value between an untagged label among the user-generated content (UGC) labels and each seed label of the category;
a related-word selection module, configured to select, according to the PMI values calculated by the PMI computing module, the seed labels whose PMI value with the untagged label is greater than a set value as the related words of the untagged label;
a relevance determining module, configured to determine, for each category and according to the related words of the untagged label selected by the related-word selection module, the total number of the untagged label's related words that belong to the category as the relevance between the category and the untagged label; and
a category determining module, configured to determine, according to the relevance between each category and the untagged label determined by the relevance determining module, a category whose relevance to the untagged label is greater than a set threshold as the category to which the untagged label belongs.
8. The device of claim 7, characterized in that
the PMI computing module is further configured, when calculating the PMI values between the untagged label among the UGC labels and each seed label of the category, to also calculate the co-occurrence probability of the untagged label and each seed label of the category; and
the related-word selection module is specifically configured to select the seed labels whose PMI value with the untagged label is greater than the set value as candidate related words, and to select from the candidate related words the n seed labels with the greatest co-occurrence probability with the untagged label as the related words of the untagged label.
9. The device of claim 7 or 8, characterized in that
the category determining module is specifically configured, after determining the categories whose relevance to the untagged label is greater than the set threshold, to take the determined categories as candidate categories; for each candidate category, to take the sum of the PMI values between the untagged label and those of its related words that are seed labels of the candidate category as the credibility that the untagged label belongs to the candidate category; and to select from the candidate categories the one with the greatest credibility as the category to which the untagged label belongs.
10. The device of claim 9, characterized by further comprising:
a seed label determining module, configured, for each category, to find sentences in the social platform corpus that correspond to the sentence patterns matched by the seed labels of the category, and to extract words from the sentences found as seed labels of the category; wherein the sentence patterns matched by the seed labels of the category are determined from the sentences in the social platform corpus in which several manually annotated seed labels of the category appear; and
the seed label determining module is further configured, for the category determined by the category determining module, to compare the credibility that the untagged label belongs to the category with a preset high-credibility threshold, and, if the credibility is greater than the preset high-credibility threshold, to take the untagged label as a seed label of the determined category.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310549750.XA CN103631874B (en) | 2013-11-07 | 2013-11-07 | UGC label classification determining method and device for social platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103631874A CN103631874A (en) | 2014-03-12 |
CN103631874B true CN103631874B (en) | 2017-01-18 |
Family
ID=50212916
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310549750.XA Active CN103631874B (en) | 2013-11-07 | 2013-11-07 | UGC label classification determining method and device for social platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103631874B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376041B (en) * | 2014-10-11 | 2018-05-18 | 北京中搜网络技术股份有限公司 | A kind of information extraction method based on microblogging classification |
CN106033445B (en) * | 2015-03-16 | 2019-10-25 | 北京国双科技有限公司 | The method and apparatus for obtaining article degree of association data |
CN105117449B (en) * | 2015-08-14 | 2019-08-16 | 百度在线网络技术(北京)有限公司 | A kind of method and apparatus for generating the label of content item |
CN105809478B (en) * | 2016-03-07 | 2020-02-18 | 优酷网络技术(北京)有限公司 | Labeling method and system for advertisement label |
CN107402932B (en) * | 2016-05-20 | 2021-04-13 | 腾讯科技(深圳)有限公司 | Expansion processing method of user tag, text recommendation method and text recommendation device |
CN106446191B (en) * | 2016-09-30 | 2019-11-05 | 浙江工业大学 | A kind of multiple features network flow row label prediction technique returned based on Logistic |
CN109857957B (en) * | 2019-01-29 | 2021-06-15 | 掌阅科技股份有限公司 | Method for establishing label library, electronic equipment and computer storage medium |
CN113177102B (en) * | 2021-06-30 | 2021-08-24 | 智者四海(北京)技术有限公司 | Text classification method and device, computing equipment and computer readable medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609459A (en) * | 2009-07-21 | 2009-12-23 | 北京大学 | A kind of extraction system of affective characteristic words |
CN103092956A (en) * | 2013-01-17 | 2013-05-08 | 上海交通大学 | Method and system for topic keyword self-adaptive expansion on social network platform |
EP2581868A3 (en) * | 2011-10-13 | 2013-07-24 | Aol Llc | Systems and methods for managing publication of online advertisements |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009039392A1 (en) * | 2007-09-21 | 2009-03-26 | The Board Of Trustees Of The University Of Illinois | A system for entity search and a method for entity scoring in a linked document database |
CN103309857B (en) * | 2012-03-06 | 2018-11-09 | 深圳市世纪光速信息技术有限公司 | A kind of taxonomy determines method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |