CN105893478A - Tag extraction method and equipment

Info

Publication number: CN105893478A
Application number: CN201610186950.7A
Authority: CN
Inventors: 许志鹏
Original assignee: Guangzhou Huaduo Network Technology Co Ltd
Current assignee: Guangzhou Cubesili Information Technology Co Ltd
Priority date: 2016-03-29
Filing date: 2016-03-29
Publication date: 2016-08-24
Anticipated expiration: 2036-03-29
Also published as: CN105893478B

Abstract

An embodiment of the invention discloses a tag extraction method and device. The method comprises: obtaining user-generated content (UGC) associated with content provided by a content issuer; performing word segmentation on the UGC and using the resulting words as optional tags; calculating a weight value for each word in the optional tags; selecting words from the optional tags, in descending order of weight value, as candidate words; and using the candidate words as second tags. Because tags are extracted from the UGC, users do not need to enter tags specially; the UGC can come from many users; words are distinguished by their weight values; and tag extraction is completed automatically. The tag content is therefore enriched, and the tags become more diverse and accurate.

Description

Tag extraction method and device

Technical field

The present invention relates to the field of communication technology, and in particular to a tag extraction method and device.

Background art

In this application, a tag refers to an evaluation tag that a user gives to a content issuer.

Taking video websites as an example, the tags on videos at the major video websites are at present almost all applied by the video publisher or a website editor, so they inevitably carry subjectivity and one-sidedness. In other words, tag extraction is implemented by receiving tags from the video publisher or the website editor.

To remedy the shortcomings of tagging by video publishers and website editors, it is preferable to let users take part in tagging, which makes up for the limitations of tagging by the author alone. In that scheme, tag extraction is implemented by receiving tags from users.

However, user participation in tagging is low, so users provide few tags for content providers or for the content they provide, and such tags can even be hard to obtain at all.

Summary of the invention

Embodiments of the present invention provide a tag extraction method and device for extracting optional tags provided by users, enriching tag content, and making tags more diverse and accurate.

In a first aspect, an embodiment of the present invention provides a tag extraction method, including:

obtaining user-generated content (UGC) associated with content provided by a content issuer, performing word segmentation on the UGC, and using the words obtained by the segmentation as optional tags;

calculating a weight value for each word in the optional tags, and selecting words from the optional tags, in descending order of weight value, as candidate words;

using the candidate words as second tags.

In a possible implementation, before the weight value of each word in the optional tags is calculated, the method further includes:

selecting the words in the optional tags that are nouns and/or noun phrases, and removing semantically duplicate words and invalid words to obtain remaining words;

the calculating of a weight value for each word in the optional tags, and the selecting of words from the optional tags, in descending order of weight value, as candidate words, then include:

performing weight calculation on the remaining words to obtain a first weight value for each of the remaining words, and selecting words from the remaining words, in descending order of first weight value, as candidate words.

In a possible implementation, before words are selected from the remaining words, in descending order of first weight value, as candidate words, the method further includes:

obtaining first tags provided by the content issuer, and calculating the degree of association between the candidate words and the first tags to obtain second weight values;

after words are selected from the remaining words, in descending order of first weight value, as candidate words, the method further includes:

selecting words from the candidate words, in descending order of second weight value, as second tags; or calculating a comprehensive weight from the first weight value and the second weight value, and selecting words from the candidate words, in descending order of comprehensive weight, as second tags.

In a possible implementation, performing weight calculation on the remaining words to obtain the weight value of each of the remaining words includes:

counting the number of times each of the remaining words occurs in the UGC, and determining the weight value corresponding to each word's number of occurrences in the UGC.

In a possible implementation, performing word segmentation on the UGC includes:

obtaining a sentence of the UGC, applying forward longest matching from left to right and reverse longest matching from right to left to the sentence, and taking the result with fewer words as the segmentation result; when the two results have the same number of words, the result of the reverse longest matching is taken as the segmentation result.

In a second aspect, an embodiment of the present invention further provides a tag extraction device, including:

a content acquisition unit, configured to obtain user-generated content (UGC) associated with content provided by a content issuer;

a vocabulary acquisition unit, configured to perform word segmentation on the UGC and use the words obtained by the segmentation as optional tags;

a weight calculation unit, configured to calculate a weight value for each word in the optional tags;

a lexical choice unit, configured to select words from the optional tags, in descending order of weight value, as candidate words;

a tag determination unit, configured to use the candidate words as second tags.

In a possible implementation, the tag extraction device further includes:

a vocabulary screening unit, specifically configured to select the words in the optional tags that are nouns and/or noun phrases, and to remove semantically duplicate words and invalid words to obtain remaining words;

the weight calculation unit is specifically configured to perform weight calculation on the remaining words to obtain a first weight value for each of the remaining words;

the lexical choice unit is specifically configured to select words from the remaining words, in descending order of first weight value, as candidate words.

In a possible implementation, the tag extraction device further includes:

a tag acquisition unit, configured to obtain first tags provided by the content issuer;

the weight calculation unit is further configured to calculate the degree of association between the candidate words and the first tags to obtain second weight values; or to calculate the degree of association between the candidate words and the first tags to obtain second weight values and then calculate a comprehensive weight from the first weight value and the second weight value;

the tag determination unit is specifically configured to select words from the candidate words, in descending order of second weight value, as second tags, or to select words from the candidate words, in descending order of comprehensive weight, as second tags.

In a possible implementation, the weight calculation unit is specifically configured to count the number of times each of the remaining words occurs in the UGC, and to determine the weight value corresponding to each word's number of occurrences in the UGC.

In a possible implementation, the vocabulary acquisition unit is specifically configured to obtain a sentence of the UGC, apply forward longest matching from left to right and reverse longest matching from right to left to the sentence, and take the result with fewer words as the segmentation result; when the two results have the same number of words, the result of the reverse longest matching is taken as the segmentation result.

As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages: tags are obtained by extraction from user-generated content (UGC), so users do not need to enter tags specially; the UGC can come from many users; words are distinguished by their weight values; and tag extraction is completed automatically. The tag content can therefore be enriched, and the tags made more diverse and accurate.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a device according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a device according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a device according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a device according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a tag extraction method which, as shown in Fig. 1, includes:

101: Obtain user-generated content (UGC) associated with content provided by a content issuer, perform word segmentation on the UGC, and use the words obtained by the segmentation as optional tags.

In the field of communication technology, a content issuer is a publisher of network resources, for example the publisher of a live webcast. A user is a consumer of network resources, and user-generated content (UGC) is the opinion a user posts about a video. It is usually text, for example bullet-screen comments (danmaku) or ordinary comments. In principle audio could also be handled, but it would first require speech recognition, and the amount of data processing would be larger.

Word segmentation is performed on the text of the UGC. The specific segmentation algorithm can be any of the existing, relatively mature segmentation algorithms; the embodiments of the present invention do not restrict it to a single choice.

102: Calculate a weight value for each word in the optional tags, and select words from the optional tags, in descending order of weight value, as candidate words.

After the text of the UGC has been segmented, a number of optional words are obtained. These words could all be used as tags, but segmentation may produce a great many optional tags, and only some of them need to be selected as tags. Weight values can therefore be used to distinguish the words. How the weight value of a word is determined is not restricted by the embodiments of the present invention: it can be set from empirical values or determined from statistical counts.

103: Use the candidate words as second tags.

In the embodiments of the present invention, "first" and "second" serve only to distinguish types of tag and should not be understood as implying any other technical limitation. A first tag is a tag provided by the content issuer, and a second tag is a tag obtained by tag extraction according to the embodiments of the present invention.

In the embodiments of the present invention, tags are obtained by extraction from user-generated content (UGC), so users do not need to enter tags specially; the UGC can come from many users; words are distinguished by their weight values; and tag extraction is completed automatically. The tag content can therefore be enriched, and the tags made more diverse and accurate.
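Steps 101 to 103 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `extract_second_tags`, the `top_k` cut-off, and the use of occurrence counts as the weight are choices made for the example, and whitespace splitting merely stands in for a real word segmenter.

```python
from collections import Counter
from typing import Callable, Iterable, List


def extract_second_tags(
    ugc_texts: Iterable[str],
    segment: Callable[[str], List[str]],
    top_k: int = 3,
) -> List[str]:
    """Steps 101-103: segment UGC, weight each word, keep the top-weighted words."""
    # Step 101: every word produced by segmentation is an optional tag.
    optional_tags = [word for text in ugc_texts for word in segment(text)]
    # Step 102: weight each word (assumed here: occurrence count), rank high to low.
    weights = Counter(optional_tags)
    candidates = [word for word, _ in weights.most_common(top_k)]
    # Step 103: the candidate words become the second tags.
    return candidates


# Whitespace splitting stands in for a real word segmenter (an assumption).
comments = ["funny host", "funny game", "host sings well", "funny"]
tags = extract_second_tags(comments, str.split, top_k=2)
print(tags)  # ['funny', 'host']
```

Passing the segmenter in as a parameter mirrors the statement that the method is not tied to any single segmentation algorithm.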

Further, UGC comes from many sources. For example, many people may be posting bullet-screen comments at once, so a large number of words appear. Some of these words may be semantically duplicated, and some may be invalid words that cannot serve as tags. The embodiments of the present invention can remove such words to further improve the accuracy of tag extraction, as follows. Before the weight value of each word in the optional tags is calculated, the method further includes:

selecting the words in the optional tags that are nouns and/or noun phrases, and removing semantically duplicate words and invalid words to obtain remaining words;

the calculating of a weight value for each word in the optional tags, and the selecting of words from the optional tags, in descending order of weight value, as candidate words, then include:

performing weight calculation on the remaining words to obtain a first weight value for each of the remaining words, and selecting words from the remaining words, in descending order of first weight value, as candidate words.

Generally, nouns can serve as tags, whereas verbs, measure words, and the like are usually unsuitable. Nouns and noun phrases can therefore be extracted first, and semantically duplicate words removed afterwards. Removing semantic duplicates prevents words with similar meanings from all being used as tags, which would produce duplicated tags. In the embodiments of the present invention, "first" and "second" are also used to distinguish two different weight values: the first weight value is the weight value of a remaining word, and the second weight value is the weight value of a candidate word. They should not be understood as having any other technical meaning. An invalid word may be a sensitive word that is legally prohibited from appearing in a tag, or a word that is meaningless in itself; invalid words can be removed by means of a word list. In addition, a list of valid words can be established for nouns and noun phrases, in which case the nouns and noun phrases selected here must be words in that valid-word list.

Further, tag extraction from UGC alone is weakly directed, and the extracted tags may deviate from the intended direction. To reduce this deviation, the embodiments of the present invention also provide the following solution. Before words are selected from the remaining words, in descending order of first weight value, as candidate words, the method further includes:

obtaining first tags provided by the content issuer, and calculating the degree of association between the candidate words and the first tags to obtain second weight values;

after words are selected from the remaining words, in descending order of first weight value, as candidate words, the method further includes:

selecting words from the candidate words, in descending order of second weight value, as second tags; or calculating a comprehensive weight from the first weight value and the second weight value, and selecting words from the candidate words, in descending order of comprehensive weight, as second tags.

In this embodiment, the first tags given by the content provider serve as the direction against which the weight of each candidate word is assessed, so that the extracted tags stay close in direction to the thinking of the content issuer, enrich the content issuer's tags, and reflect other users' evaluations of the network resource. In this embodiment the second tags can be chosen with reference to the second weight alone. Using the comprehensive weight instead takes the direction of the content provider's tags into account while balancing it against the result of automatic tag extraction from the UGC, which reduces the occurrence of one-sided tags.
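The two selection options can be compared with a small Python sketch. The text does not say how the comprehensive weight is formed, so the linear mix controlled by `alpha` is purely an assumption for illustration, as are the function name `select_second_tags` and the sample weights, which are taken to be already scaled to the range 0 to 1.

```python
from typing import Dict, List


def select_second_tags(
    first_weight: Dict[str, float],
    second_weight: Dict[str, float],
    top_k: int,
    alpha: float = 0.5,
    use_combined: bool = True,
) -> List[str]:
    """Rank candidate words by the second weight alone or by a combined weight."""
    def score(word: str) -> float:
        y = second_weight.get(word, 0.0)
        if not use_combined:
            return y
        # Hypothetical comprehensive weight: a linear mix of the two weights.
        return alpha * first_weight[word] + (1.0 - alpha) * y

    ranked = sorted(first_weight, key=score, reverse=True)
    return ranked[:top_k]


first = {"funny": 1.0, "singing": 0.2, "lag": 0.8}   # frequency-based first weights
second = {"funny": 0.3, "singing": 0.9, "lag": 0.0}  # association with first tags
by_relevance = select_second_tags(first, second, top_k=2, use_combined=False)
by_combined = select_second_tags(first, second, top_k=2, alpha=0.5)
print(by_relevance)  # ['singing', 'funny']
print(by_combined)   # ['funny', 'singing']
```

With these sample values the second weight alone favours the word closest to the issuer's tags, while the combined weight lets a very frequent word overtake it, which is the balancing effect described above.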

Optionally, performing weight calculation on the remaining words to obtain the weight value of each of the remaining words includes:

counting the number of times each of the remaining words occurs in the UGC, and determining the weight value corresponding to each word's number of occurrences in the UGC.

Weight values can be calculated in many ways. The embodiments of the present invention determine the weight value from the number of occurrences as a statistical result, which is simple and reflects the evaluations of most users of the network resource. This meets the requirements for tags and lets the extracted tags reflect users' evaluations.

Optionally, performing word segmentation on the UGC includes:

obtaining a sentence of the UGC, applying forward longest matching from left to right and reverse longest matching from right to left to the sentence, and taking the result with fewer words as the segmentation result; when the two results have the same number of words, the result of the reverse longest matching is taken as the segmentation result.

In this embodiment, the amount of computation for word segmentation grows with the quantity of UGC. For the massive UGC that may arise, this step can use distributed computation to improve speed. This embodiment selects an algorithm whose computational cost is relatively small and which is well suited to tag extraction, so as to improve segmentation efficiency and obtain relatively accurate segmentation results.
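The bidirectional longest-matching rule can be illustrated with a short Python sketch. It assumes a plain set of dictionary words and falls back to single characters when nothing matches; the function names, the tiny dictionary, and the sample sentence are all made up for the example and are not taken from the embodiment.

```python
from typing import List, Set


def forward_max_match(sentence: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(sentence: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_match(sentence: str, vocab: Set[str]) -> List[str]:
    max_len = max((len(w) for w in vocab), default=1)
    forward = forward_max_match(sentence, vocab, max_len)
    backward = backward_max_match(sentence, vocab, max_len)
    # Fewer words wins; a tie goes to the reverse result, as the text specifies.
    return forward if len(forward) < len(backward) else backward


vocab = {"研究", "研究生", "生命", "命", "的", "起源"}  # illustrative dictionary
result = bidirectional_match("研究生命的起源", vocab)
print(result)  # ['研究', '生命', '的', '起源']
```

In this example both directions produce four words, so the tie-break selects the reverse result, which here is also the more natural reading of the sentence.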

The embodiments of the present invention mainly perform data mining on the UGC related to a video, such as comments and bullet-screen comments, to obtain valuable tags for use as video tags. On the product side, this makes up for the subjectivity and one-sidedness of tags applied unilaterally by the video publisher; at the same time, it is imperceptible to users and imposes no barrier, which neatly sidesteps the problem of how to guide and motivate users to add tags.

Fig. 2 shows the main framework of an embodiment of the present invention, which includes:

201: Obtain the publisher tags.

This step obtains the tags applied by the publisher.

202: Mine weighted user tags from the UGC content.

User tags are tags mined from the UGC provided by the viewers of the network resource.

203: Weight by the similarity between the publisher tags and the user tags.

204: Screen and denoise, and output the result.

Screening and denoising filters the tags obtained.

Each of the above steps is described in detail in the embodiments that follow.

One. Publisher tags:

First, the video author (or a website editor) is allowed to tag the video.

It should be noted that the scheme of the embodiments of the present invention can not only generate tags from users' UGC content, but can also introduce user tags on top of the tags applied by the video author, making the tags more diverse and accurate. Letting the author tag the video in advance, with those tags serving as one candidate factor for the final tags, gives the tags mined later a certain thematic tendency. This better guarantees the quality of the tags and keeps them in line with the video website's own planning.

Two. Mining weighted user tags from the UGC content.

1. Segment the UGC content:

Word segmentation is performed on UGC content such as the video's comments and bullet-screen comments. For Chinese word segmentation, an open-source segmentation library such as Stanford, IKAnalyzer, or Word can be chosen to provide the segmentation function. As for the segmentation algorithm, bidirectional maximum matching with minimum segmentation can be considered: a combination of longest matching from left to right and reverse longest matching from right to left, which takes whichever result contains fewer words and takes the reverse-matching result when the word counts are equal. It should be pointed out that Chinese word segmentation is often computationally intensive, so distributed computation, for example on a Spark cluster, may need to be considered to improve speed.

2. Extract keywords:

Keyword extraction can be divided into the following three steps:

A) Part-of-speech tagging and selection:

Because tags are mostly nouns, only the nouns or noun phrases need be extracted as candidate words. Current word segmentation tools come with part-of-speech tagging built in, so they can be used directly to extract all the nouns.

B) Invalid-word filtering:

Invalid-word filtering removes from the set of candidate words those words that have little value as video tags, for example vulgar words and sensitive words. It can be implemented by checking candidates against an invalid-word list. The list can be created manually and then continually supplemented and updated through the "screening and denoising" step of each tag generation run.

C) Semantic deduplication:

Because multiple tags with the same meaning add little value, the candidate words obtained after segmentation can be semantically deduplicated. Semantic deduplication can be implemented with near-synonyms. There is no need to design a near-synonym recognition algorithm; an existing Chinese near-synonym dictionary is sufficient. Words that are near-synonyms of one another are marked as belonging to the same class, and every word in a class is then replaced with the word in that class that occurs most often.
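Steps A) to C) can be strung together as in the following Python sketch. The input is assumed to be (word, part-of-speech) pairs as a segmentation tool might emit them; the tag names in `NOUN_TAGS`, the invalid-word set, and the `synonym_class` mapping are hypothetical placeholders for whatever tag set, word list, and near-synonym dictionary are actually used.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

NOUN_TAGS = {"n", "np"}  # hypothetical tag names; real tag sets differ by tool


def extract_keywords(
    tagged_words: Iterable[Tuple[str, str]],
    invalid_words: Set[str],
    synonym_class: Dict[str, str],
) -> List[str]:
    # A) keep only nouns and noun phrases
    nouns = [word for word, pos in tagged_words if pos in NOUN_TAGS]
    # B) drop words found in the invalid-word list
    valid = [word for word in nouns if word not in invalid_words]
    # C) replace every word in a synonym class with the class's most frequent member
    counts = Counter(valid)
    best: Dict[str, str] = {}
    for word in valid:
        cls = synonym_class.get(word, word)  # words without a class stand alone
        if cls not in best or counts[word] > counts[best[cls]]:
            best[cls] = word
    return [best[synonym_class.get(word, word)] for word in valid]


tagged = [("host", "n"), ("anchor", "n"), ("host", "n"),
          ("sing", "v"), ("spam", "n"), ("game", "n")]
invalid = {"spam"}
synonyms = {"host": "presenter", "anchor": "presenter"}
keywords = extract_keywords(tagged, invalid, synonyms)
print(keywords)  # ['host', 'host', 'host', 'game']
```

Replacing synonyms rather than deleting them keeps every occurrence in the list, so a later frequency count credits the whole synonym class to its most common member.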

3. Weight calculation:

After keyword extraction is complete, weights are calculated for the keywords, and the keywords with the highest weight values are finally extracted as candidate words. There are many common algorithms for keyword weighting, such as tf-idf. Because video tagging is not quite the same as data classification, and one video can carry tags of several kinds, the weight here can be calculated from tf alone, that is, term frequency. In other words, the weight of a keyword can be determined by counting its occurrences. For example, if word A appears 10 times in the keyword list and word B appears 2 times, the weight coefficient of word A is 10 and that of word B is 2. This weight is denoted x.
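The worked example with word A and word B can be reproduced directly. The helper names `term_frequency_weights` and `top_keywords` are made up for this sketch, and the cut-off `n` is an assumed parameter, since the text does not say how many top-weighted keywords are kept.

```python
from collections import Counter
from typing import Dict, Iterable, List


def term_frequency_weights(keywords: Iterable[str]) -> Dict[str, int]:
    """Weight x of each keyword = number of times it occurs in the list."""
    return dict(Counter(keywords))


def top_keywords(weights: Dict[str, int], n: int) -> List[str]:
    """Keywords ordered from highest to lowest weight, truncated to n."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


keyword_list = ["A"] * 10 + ["B"] * 2
x = term_frequency_weights(keyword_list)
candidates = top_keywords(x, n=1)
print(x)           # {'A': 10, 'B': 2}
print(candidates)  # ['A']
```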

3.1: Weighting by the similarity between publisher tags and user tags:

The steps above yield the author tags and the user tags. If some author tags and user tags are identical or similar, then on that point the author's judgment agrees with the users' judgment. Such a tag can be considered more accurate than other tags and should be given a higher weight. Similarity can be calculated using the dictionary definitions of the author tags and of the user tags as the corpus, giving a similarity weight y for each tag. The final weight w is then obtained by combining y with the term-frequency weight x of the user tag in a certain proportion.
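One way to picture the calculation of y and w is the Python sketch below. The text fixes neither the similarity measure nor the proportion, so both are assumptions here: similarity is taken as the Jaccard overlap between the word sets of two dictionary definitions, and the final weight is formed as w = x * (1 + gain * y). The definitions, tag names, and the `gain` parameter are invented for the example.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_weights(
    user_defs: Dict[str, str], author_defs: Dict[str, str]
) -> Dict[str, float]:
    """y for each user tag: its best definition overlap with any author tag."""
    author_tokens = [set(text.split()) for text in author_defs.values()]
    return {
        tag: max((jaccard(set(text.split()), tokens) for tokens in author_tokens),
                 default=0.0)
        for tag, text in user_defs.items()
    }


def final_weights(
    x: Dict[str, float], y: Dict[str, float], gain: float = 1.0
) -> Dict[str, float]:
    # Hypothetical proportion: boost the frequency weight x in line with similarity y.
    return {tag: x[tag] * (1.0 + gain * y.get(tag, 0.0)) for tag in x}


author_defs = {"comedy": "a funny show"}                      # made-up definitions
user_defs = {"humor": "a funny thing", "lag": "slow network delay"}
y = similarity_weights(user_defs, author_defs)
w = final_weights({"humor": 10, "lag": 2}, y)
print(y)  # {'humor': 0.5, 'lag': 0.0}
print(w)  # {'humor': 15.0, 'lag': 2.0}
```

A user tag whose definition overlaps an author tag is boosted, while an unrelated tag keeps its plain frequency weight, which matches the intent that agreement between author and users should raise a tag's weight.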

3.2: Screening and denoising:

The preceding processing yields author tags carrying a similarity gain weight and user tags carrying weights. To further guarantee the quality of the tags and keep them in line with the website's planning, the final tags can be selected manually from these candidate tags. During screening, appropriate semantic deduplication can be applied between the author tags and the user tags, and invalid tags can be considered for addition to the "invalid-word list". The principle of tag screening is to prefer content-provider tags with a high similarity gain and user tags with a high weight, because these tags are the most accurate. Tags that offer a distinctive, fresh perspective and are still representative can also be selected, which makes the tags more diverse.

Beneficial effects of the technical solutions of the embodiments of the present invention:

On top of traditional tagging by the video author (publisher or editor), the present invention adds user tags to make the tags more accurate and diverse. Measuring by weight quantifies the accuracy of the tags, and adding user tags broadens them so that they are richer and more varied.

An embodiment of the present invention further provides a tag extraction device which, as shown in Fig. 3, includes:

a content acquisition unit 301, configured to obtain user-generated content (UGC) associated with content provided by a content issuer;

a vocabulary acquisition unit 302, configured to perform word segmentation on the UGC and use the words obtained by the segmentation as optional tags;

a weight calculation unit 303, configured to calculate a weight value for each word in the optional tags;

a lexical choice unit 304, configured to select words from the optional tags, in descending order of weight value, as candidate words;

a tag determination unit 305, configured to use the candidate words as second tags.

Further, UGC comes from many sources. For example, many people may be posting bullet-screen comments at once, so a large number of words appear. Some of these words may be semantically duplicated, and some may be invalid words that cannot serve as tags. The embodiments of the present invention can remove such words to further improve the accuracy of tag extraction, as follows. As shown in Fig. 4, the tag extraction device further includes:

a vocabulary screening unit 401, specifically configured to select the words in the optional tags that are nouns and/or noun phrases, and to remove semantically duplicate words and invalid words to obtain remaining words;

the weight calculation unit 303 is specifically configured to perform weight calculation on the remaining words to obtain a first weight value for each of the remaining words;

the lexical choice unit 304 is specifically configured to select words from the remaining words, in descending order of first weight value, as candidate words.

Further, tag extraction from UGC alone is weakly directed, and the extracted tags may deviate from the intended direction. To reduce this deviation, the embodiments of the present invention also provide the following solution. As shown in Fig. 5, the tag extraction device further includes:

a tag acquisition unit 501, configured to obtain first tags provided by the content issuer;

the weight calculation unit 303 is further configured to calculate the degree of association between the candidate words and the first tags to obtain second weight values; or to calculate the degree of association between the candidate words and the first tags to obtain second weight values and then calculate a comprehensive weight from the first weight value and the second weight value;

the tag determination unit 305 is specifically configured to select words from the candidate words, in descending order of second weight value, as second tags, or to select words from the candidate words, in descending order of comprehensive weight, as second tags.

Optionally, the weight calculation unit 303 is specifically configured to count the number of times each of the remaining words occurs in the UGC, and to determine the weight value corresponding to each word's number of occurrences in the UGC.

Optionally, the vocabulary acquisition unit 302 is specifically configured to obtain a sentence of the UGC, apply forward longest matching from left to right and reverse longest matching from right to left to the sentence, and take the result with fewer words as the segmentation result; when the two results have the same number of words, the result of the reverse longest matching is taken as the segmentation result.

The embodiment of the present invention additionally provides a kind of tag extraction equipment, including: Fig. 6 is the embodiment of the present invention The server architecture schematic diagram provided, this server 600 can produce bigger because of configuration or performance difference Difference, one or more central processing units (central processing units, CPU) can be included 622 (such as, one or more processors) and memorizeies 632, one or more storages should With the storage medium 630 (such as one or more mass memory units) of program 642 or data 644. Wherein, memorizer 632 and storage medium 630 can be of short duration storage or persistently store.It is stored in storage The program of medium 630 can include one or more modules (diagram does not marks), and each module is permissible Including to a series of command operatings in server.Further, central processing unit 622 can be arranged For communicating with storage medium 630, server 600 performs a series of instructions in storage medium 630 Operation.

The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.

In the above embodiments, the method steps may be based on the server structure shown in Fig. 6.

An embodiment of the present invention further provides another tag extraction device, as shown in Fig. 7, including: a receiving device 701, a transmitting device 702, a processor 703, and a storage device 704;

wherein the processor 703 is configured to obtain user-generated content (UGC) associated with content provided by a content publisher; perform word segmentation on the UGC and use the words obtained by the segmentation as optional tags; calculate the weight value of each word in the optional tags, and select words from the optional tags as candidate words in descending order of their weight values; and use the candidate words as the second tag.

Further, UGC comes from a wide range of sources. For example, with the bullet-comment (barrage) function, many users may be posting bullet comments at once, so a large number of words will appear. Some of these words may repeat the same meaning, and some may be invalid words that cannot serve as tags. The embodiment of the present invention can remove these words to further improve the accuracy of tag extraction, as follows: the processor 703 is further configured to, before calculating the weight value of each word in the optional tags, select the words in the optional tags that are nouns and/or noun phrases, and remove semantically duplicate words and invalid words to obtain the remaining words.
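The filtering step above might look like the sketch below. It assumes the segmenter has already attached a part-of-speech tag to every word; the tag names `"n"` and `"np"`, the `synonyms` mapping, and the `invalid_words` set are hypothetical inputs chosen for illustration, and merging synonyms into one canonical word is only one possible way to remove semantic duplicates.

```python
NOUN_TAGS = {"n", "np"}  # assumed tag names for nouns and noun phrases


def filter_optional_tags(tagged_words, synonyms, invalid_words):
    """Return the remaining words after the noun filter and clean-up.

    tagged_words:  iterable of (word, part_of_speech) pairs (assumed input shape)
    synonyms:      hypothetical map from a word to its canonical equivalent
    invalid_words: hypothetical set of words that must never become tags
    """
    remaining = []
    for word, pos in tagged_words:
        if pos not in NOUN_TAGS:
            continue  # keep only nouns and noun phrases
        canonical = synonyms.get(word, word)  # merge words with the same meaning
        if word in invalid_words or canonical in invalid_words:
            continue  # drop words that cannot serve as tags
        remaining.append(canonical)
    return remaining
```

Mapping each synonym to a canonical form, instead of discarding it, keeps every occurrence in the list, so a later occurrence count still reflects how often the meaning appeared in the UGC.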

Further, tag extraction from UGC is only weakly directed, so the extracted tags may deviate from the intended direction. To reduce this deviation, the embodiment of the present invention further provides the following solution: the processor 703 is further configured to, before selecting words from the remaining words as candidate words in descending order of the first weight value, obtain the first tag provided by the content publisher, and calculate the degree of association between the candidate words and the first tag to obtain the second weight value;

and, after selecting words from the remaining words as candidate words in descending order of the first weight value, select words from the candidate words as the second tag in descending order of the second weight value; or calculate a comprehensive weight of the first weight value and the second weight value, and select words from the candidate words as the second tag in descending order of the comprehensive weight.
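The comprehensive-weight variant can be illustrated with the sketch below. The text here does not say how the degree of association or the comprehensive weight is computed, so both formulas are placeholders: `relevance` uses character-level Jaccard overlap with the publisher's tags, and the comprehensive weight is a linear blend controlled by a hypothetical parameter `alpha`.

```python
def relevance(word, first_tags):
    """Placeholder second weight: best character-level Jaccard overlap with any first tag."""
    best = 0.0
    for tag in first_tags:
        chars_word, chars_tag = set(word), set(tag)
        union = chars_word | chars_tag
        if union:
            best = max(best, len(chars_word & chars_tag) / len(union))
    return best


def select_second_tags(first_weights, first_tags, top_k, alpha=0.5):
    """Rank candidate words by an assumed linear blend of first and second weights.

    first_weights: map from candidate word to its first weight value
    first_tags:    tags supplied by the content publisher
    alpha:         hypothetical blend factor (1.0 uses only the first weight)
    """
    combined = {
        word: alpha * weight + (1 - alpha) * relevance(word, first_tags)
        for word, weight in first_weights.items()
    }
    # Descending comprehensive weight; ties are broken alphabetically for determinism.
    ranked = sorted(combined, key=lambda word: (-combined[word], word))
    return ranked[:top_k]
```

Setting `alpha` to 0 reduces the ranking to the second weight alone, which corresponds to the first alternative described above.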

Optionally, the processor 703 being configured to perform weight calculation on the remaining words to obtain the weight value of each of the remaining words includes: counting the number of occurrences in the UGC of each word among the remaining words, and determining the weight value corresponding to each word's number of occurrences in the UGC.
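A minimal sketch of occurrence-based weighting follows. The mapping from an occurrence count to a weight value is not specified here, so using the relative frequency is an assumption; the function names and the `top_k` cut-off are likewise illustrative.

```python
from collections import Counter


def frequency_weights(remaining_words):
    """Weight each word by its share of all occurrences (assumed mapping)."""
    counts = Counter(remaining_words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def select_candidates(weights, top_k):
    """Pick the top_k words in descending order of weight value."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]
```

Here `remaining_words` is expected to contain one entry per occurrence in the UGC, so a word that appears in many comments accumulates a larger weight.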

Optionally, the processor 703 being configured to perform word segmentation on the UGC includes:

It should be noted that, in the above device embodiments, the units are divided only according to functional logic; the division is not limited to the above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.

In addition, a person of ordinary skill in the art will understand that all or part of the steps in the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only preferred specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the embodiments of the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A tag extraction method, characterized by comprising:

using the candidate words as a second tag.

2. The method according to claim 1, characterized in that, before calculating the weight value of each word in the optional tags, the method further comprises:

3. The method according to claim 2, characterized in that, before selecting words from the remaining words as candidate words in descending order of the first weight value, the method further comprises:

4. The method according to claim 2 or 3, characterized in that performing weight calculation on the remaining words to obtain the weight value of each of the remaining words comprises:

5. The method according to any one of claims 1 to 3, characterized in that performing word segmentation on the UGC comprises:

6. A tag extraction device, characterized by comprising:

7. The tag extraction device according to claim 6, characterized in that the tag extraction device further comprises:

8. The tag extraction device according to claim 7, characterized in that the tag extraction device further comprises:

9. The tag extraction device according to claim 7 or 8, characterized in that

the weight calculation unit is specifically configured to count the number of occurrences in the UGC of each word among the remaining words, and to determine the weight value corresponding to each word's number of occurrences in the UGC.

10. The tag extraction device according to any one of claims 7 to 8, characterized in that

the bilingual lexicon acquisition unit is specifically configured to obtain the sentences of the UGC, perform forward longest matching from left to right and reverse longest matching from right to left on each sentence, and take the result with the smaller number of segments as the word segmentation result; when the numbers of segments are equal, the result of the reverse longest matching is taken as the word segmentation result.