CN108595660A

CN108595660A - Method, apparatus, storage medium, and device for generating label information for a multimedia resource

Info

Publication number: CN108595660A
Application number: CN201810400431.5A
Authority: CN
Inventors: 王聪
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-04-28
Filing date: 2018-04-28
Publication date: 2018-09-28

Abstract

The invention discloses a method, apparatus, storage medium, and device for generating label information for a multimedia resource, belonging to the field of Internet technology. The method includes: obtaining comment information of a target multimedia resource and performing word segmentation on the comment information; obtaining a word vector for each of the at least one word obtained by segmentation; clustering the word vectors of the at least one word to obtain multiple word categories, where different word categories have different topic information; extracting key words of the target multimedia resource from the at least one word obtained by segmentation; and generating label information for the target multimedia resource based on the key words and the topic information of the multiple word categories. The invention fully automates the generation of label information, so it does not consume large amounts of labor and time and is more intelligent. The generated label information is also more accurate, which improves the precision of subsequent multimedia resource recommendation.

Description

Method, apparatus, storage medium, and device for generating label information for a multimedia resource

Technical field

The present invention relates to the field of Internet technology, and in particular to a method, apparatus, storage medium, and device for generating label information for a multimedia resource.

Background

With the rapid development of Internet technology, major websites are now working on how to recommend multimedia resources to users efficiently and accurately, so as to improve user experience. The multimedia resources referred to here may include films, TV series, novels, articles, and the like. Typically, before multimedia resources are recommended, corresponding label information must first be generated for each multimedia resource, and the recommendation is then completed using the label information. Label information identifies a multimedia resource so that users can filter multimedia resources by genre, core theme, and so on.

As described above, the label information of a multimedia resource is essential to multimedia resource recommendation. For this reason, how to generate label information for a multimedia resource has become a focus of attention for those skilled in the art. In the related art, generating label information for a multimedia resource relies entirely on manual work. Taking a film as an example of a multimedia resource and referring to Figure 1A, if the film is "The Shawshank Redemption", a staff member might manually add label information such as "plot" and "crime" to it.

In the course of implementing the present invention, the related art was found to have at least the following problems:

Label information is generated manually, and the number of multimedia resources is massive, so this way of generating label information consumes a large amount of labor and time and is not intelligent enough. In addition, manually generated label information tends to be inaccurate, which substantially reduces precision when multimedia resources are subsequently recommended based on the label information.

Summary of the invention

Embodiments of the present invention provide a method, apparatus, storage medium, and device for generating label information for a multimedia resource. They address the problem in the related art that label generation is insufficiently intelligent and insufficiently accurate, which in turn substantially reduces precision when multimedia resources are recommended. The technical solution is as follows:

In one aspect, a method for generating label information for a multimedia resource is provided, the method including:

obtaining comment information of a target multimedia resource, and performing word segmentation on the comment information;

obtaining a word vector for each of the at least one word obtained by segmentation;

clustering the word vectors of the at least one word to obtain multiple word categories, where different word categories have different topic information;

extracting key words of the target multimedia resource from the at least one word obtained by segmentation; and

generating label information for the target multimedia resource based on the key words and the topic information of the multiple word categories.

In another aspect, an apparatus for generating label information for a multimedia resource is provided, the apparatus including:

a first obtaining module, configured to obtain comment information of a target multimedia resource and perform word segmentation on the comment information;

a second obtaining module, configured to obtain a word vector for each of the at least one word obtained by segmentation;

a clustering module, configured to cluster the word vectors of the at least one word to obtain multiple word categories, where different word categories have different topic information;

an extraction module, configured to extract key words of the target multimedia resource from the at least one word obtained by segmentation; and

a generation module, configured to generate label information for the target multimedia resource based on the key words and the topic information of the multiple word categories.

In another aspect, a storage medium is provided. At least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the above method for generating label information for a multimedia resource.

In another aspect, a device for generating label information is provided. The device includes a processor and a memory, at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the above method for generating label information for a multimedia resource.

The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows:

Label information for a multimedia resource is generated fully automatically. Because no labor has to be devoted to adding label information, large amounts of labor and time are not consumed, and the process is more intelligent. In addition, based on the comment information of a multimedia resource, the embodiments of the present invention obtain the topic information of multiple word categories for that multimedia resource and multiple key words used to comment on that multimedia resource, and use these to generate label information for it. This not only makes the generated label information more accurate, but also improves the precision of subsequent multimedia resource recommendation.

Description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Figure 1A is a schematic diagram of an interface for displaying label information, as provided in the background;

Figure 1B is an architecture diagram of the implementation environment involved in a method for generating label information for a multimedia resource provided by an embodiment of the present invention;

Fig. 2 is an overall processing flowchart of a method for generating label information for a multimedia resource provided by an embodiment of the present invention;

Fig. 3 is a flowchart of a method for generating label information for a multimedia resource provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of calculating the weight of label information provided by an embodiment of the present invention;

Fig. 5 is a flowchart of a method for generating label information for a multimedia resource provided by an embodiment of the present invention;

Fig. 6 is a schematic diagram of an interface for displaying label information provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of an apparatus for generating label information for a multimedia resource provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a device for generating label information provided by an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.

Before the embodiments of the present invention are explained in detail, some terms involved in the embodiments are explained first.

Multimedia resource: its forms include, but are not limited to, text, video, audio, and image. It may cover films, TV series, novels, articles, audio clips, variety-show videos, and the like, which is not specifically limited in the embodiments of the present invention.

A multimedia resource can be presented to the user through the visual user interface of an electronic device. The electronic device may be any device with a display screen, such as a smartphone, tablet computer, television, laptop, or desktop computer.

Label information: used to identify a multimedia resource so that users can filter multimedia resources by genre, core theme, and so on.

Taking films as an example, the label information of a film may include: plot, crime, action, romance, drama, adventure, war, thriller, suspense, horror, science fiction, musical, history, family, martial arts, ethics, documentary, biography, and so on.

As mentioned above, major websites are currently working on how to recommend multimedia resources to users efficiently and accurately, and the main prerequisite for multimedia resource recommendation is that the massive number of multimedia resources be classified precisely by label information. However, because the related art adds label information to multimedia resources manually, it usually leads to problems such as the following:

1) Because label information is added manually, it is difficult to control the standards and classification granularity with which different staff members define label information. In addition, the number of multimedia resources is massive, so manually adding label information consumes a large amount of labor and time and lacks intelligence.

2) Manually added label information is generally inaccurate, so multimedia resource recommendation based on such label information tends to perform poorly.

3) The scope defined by manually added label information is generally too broad. Taking films as an example, a great many films fall under "plot" and "crime", so when related films are recommended based on label information such as "plot, crime", the recommended films are not precise enough.

To solve the above problems, the embodiments of the present invention propose a method for automatically adding label information to a multimedia resource based on big data, and further implement recommendation of similar multimedia resources based on the generated label information. Here, big data refers to the comment information with which a massive number of users comment on multimedia resources.

Figure 1B is an architecture diagram of the implementation environment involved in a method for generating label information for a multimedia resource provided by an embodiment of the present invention.

Referring to Figure 1B, the implementation environment includes a terminal 101 and a server 102. The terminal 101 is used to display the label information of a multimedia resource and to display multimedia resources similar to that multimedia resource. Types of terminal 101 include, but are not limited to, smartphones, tablet computers, televisions, laptops, and desktop computers, which is not specifically limited in the embodiments of the present invention. The server 102 is used to automatically add label information to a multimedia resource and to determine, based on the added label information, other multimedia resources similar to that multimedia resource.

In another embodiment, based on big data on the Internet and using a machine learning method, the embodiments of the present invention add weighted label information to a multimedia resource, and the added label information reflects the genre or core theme of the multimedia resource well.

For example, for the film "The Shawshank Redemption", in addition to the two labels "plot" and "crime", the embodiments of the present invention can add label information such as [civil rights, 0.325], [prison, 0.212], [freedom, 0.23], [conviction, 0.14], and [life and death, 0.093]. Within each pair of square brackets, the first element is the specific label information and the second element is the weight corresponding to that label information.

In summary, the embodiments of the present invention achieve the following:

1) Label information is added fully automatically, with no manual labor required, which is more intelligent.

2) Big data on the Internet is used to add label information, so the added labels are more precise.

3) The added label information carries weights, and a better recommendation result can be obtained when multimedia resources are recommended based on the weighted label information.

In short, based on big data on the Internet, the embodiments of the present invention can use a machine learning method to automatically add weighted label information to a multimedia resource, and can further recommend similar multimedia resources based on the weighted label information.

In another embodiment, from the product perspective, the embodiments of the present invention are reflected mainly in two aspects. One is the display of label information. The other is the application of label information, that is, recommending similar multimedia resources by means of the added label information. Taking the label information "civil rights, prison, freedom, conviction, life and death" for "The Shawshank Redemption" as an example, the recommendation approach provided by the embodiments of the present invention can accurately recommend films such as "Once Upon a Time in America", "A Perfect World", "The Godfather Part III", and "Trainspotting" to the user, rather than a series of loosely related films that merely fall under "plot" and "crime".

It should be noted that the personalized recommendation approach proposed by the embodiments of the present invention can be widely applied to newly launched multimedia resources. For a newly launched multimedia resource, the number of users who have viewed it may be insufficient, so related multimedia resources cannot be recommended based on user behavior data; recommendation can instead be completed based on content similarity between multimedia resources. Of course, the personalized recommendation approach proposed by the embodiments of the present invention can also be applied in other scenarios, which is not specifically limited in the embodiments of the present invention.

In another embodiment, the overall processing flow of the embodiments of the present invention is first described briefly.

Referring to Fig. 2, the processing flow of the embodiments of the present invention is as follows:

a. Data collection.

This step collects big data on the Internet. Taking a film as an example of a multimedia resource, the collected big data is the film review information with which a massive number of users evaluate the film.

b. Data processing.

This step processes the collected data, for example by filtering out low-quality comment information and performing word segmentation on the comment information.

c. Word vector training.

This step trains word vectors for the at least one word obtained by segmentation. The training result represents each word as a vector of uniform dimension.

d. Word vector clustering and topic extraction.

This step clusters the word vectors obtained in step c and labels each word category obtained by clustering with topic information.

e. Key word extraction for the multimedia resource.

This step extracts some of the words from the comment information of the multimedia resource in a certain manner and takes the extracted words as the key words used to comment on the multimedia resource. The specific extraction method is described below.

f. Automatically adding label information to the multimedia resource.

This step automatically adds label information to the multimedia resource based on the results of step d and step e.

g. Recommending similar multimedia resources based on the label information generated for the multimedia resource.

Each of the steps described above is explained in detail in the specific embodiments below.

Fig. 3 is a flowchart of a method for generating label information for a multimedia resource provided by an embodiment of the present invention. Referring to Fig. 3, the method flow provided by the embodiment of the present invention includes:

301. The server obtains comment information of a target multimedia resource.

In the embodiments of the present invention, the multimedia resource to which label information is to be added is referred to as the target multimedia resource. Comment information usually goes by different names depending on the type of multimedia resource. If the multimedia resource is a film, the comment information is also called film review information; if the multimedia resource is a TV series, the comment information may also be called series review information. One item of comment information generally refers to a comment on one multimedia resource posted by one user.

The embodiments of the present invention may obtain the comment information of the target multimedia resource from data sources that hold a large amount of comment data, such as forums, websites, and communities, which is not specifically limited in the embodiments of the present invention. In addition, the open-source crawler Scrapy may be used to obtain the comment information.

Taking a film as an example of the target multimedia resource, suppose community A is dedicated to accumulating film review information for films, where the reviews describe different users' perceptions of the same film from different angles. For the target multimedia resource, the embodiments of the present invention can then use the open-source crawler Scrapy to crawl the film review information that a massive number of users have posted about it in community A.

It should be noted that, depending on the data source, some data sources record not only the comment information itself but also users' evaluations of each item of comment information, for example each user's rating of a film review, or votes by users on whether a film review is useful. When crawling comment information, the embodiments of the present invention may also crawl these evaluations so that they can be used in the subsequent step of processing the crawled data.

302. The server performs word segmentation on the obtained comment information.

In the embodiments of the present invention, if the crawled data includes evaluations of the comment information, the embodiments of the present invention also support filtering out low-quality comment information according to these evaluations, so as to clean the data.

Specifically, comment information whose score is not greater than a preset score, or whose number of "useless" votes exceeds its number of "useful" votes, may be filtered out, because such reviews are of low quality and may adversely affect the subsequent generation of label information. With a full score of 5, the preset score may be 1 or 2, which is not specifically limited in the embodiments of the present invention.
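The filtering rule above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: the `Review` record, its field names, and the `preset_score` parameter are hypothetical names introduced here, and the crawled data is assumed to carry a score and two vote counts for each review.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """Hypothetical record for one crawled review and its evaluations."""
    text: str
    score: float        # rating given to the review, full score assumed to be 5
    useful_votes: int   # number of "useful" votes
    useless_votes: int  # number of "useless" votes


def is_low_quality(review, preset_score=2):
    # Low quality if the score is not greater than the preset score,
    # or if "useless" votes outnumber "useful" votes.
    return (review.score <= preset_score
            or review.useless_votes > review.useful_votes)


def filter_reviews(reviews, preset_score=2):
    # Keep only the reviews that are not flagged as low quality.
    return [r for r in reviews if not is_low_quality(r, preset_score)]
```

Only the reviews that pass this filter would go on to word segmentation.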

Performing word segmentation on the comment information is part of the data processing in step b above. After filtering the crawled comment information, the server may use the open-source jieba word segmentation tool to segment the filtered comment information.

The first point to note is that the open-source jieba word segmentation tool mainly supports three segmentation modes. The first is precise mode, which tries to cut the sentence as accurately as possible and is mainly suited to text analysis. The second is full mode, which scans out every string in the sentence that can form a word; it is fast but cannot resolve ambiguity. The last is search engine mode, which cuts long words again on the basis of precise mode to improve recall. The embodiments of the present invention may segment the comment information using this last mode.

The second point to note is that the embodiments of the present invention want descriptive words that can describe and summarize a multimedia resource, so after the comment information is segmented, usually only words with a target part of speech are retained. The target parts of speech include, but are not limited to, nouns, adjectives, and verbs.
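The part-of-speech filter can be sketched as below. The sketch is library-free and makes assumptions the source does not state: the segmenter is assumed to have already produced (word, tag) pairs, and the tags are assumed to follow an ICTCLAS-style scheme in which noun tags begin with "n", adjective tags with "a", and verb tags with "v". The function name and sample tokens are illustrative only.

```python
# Tag prefixes for the target parts of speech: nouns, adjectives, verbs.
# Assumes ICTCLAS-style tags (e.g. "n", "nr", "a", "v", "vn").
TARGET_POS_PREFIXES = ("n", "a", "v")


def keep_target_pos(tagged_words, prefixes=TARGET_POS_PREFIXES):
    """Return the words whose part-of-speech tag starts with a target prefix.

    tagged_words: iterable of (word, tag) pairs from a segmenter.
    """
    return [word for word, tag in tagged_words if tag.startswith(prefixes)]


# Illustrative input; tokens are English glosses of segmented words.
sample = [("prison", "n"), ("of", "uj"), ("free", "a"),
          ("escape", "v"), ("finally", "d"), ("Andy", "nr")]
retained = keep_target_pos(sample)
```

Prefix matching is a deliberate simplification: it also admits sub-tags such as "ad" or "vn", which may or may not be wanted in a given deployment.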

For example, suppose a passage in the comment information reads: "I finally finished watching the whole film in the middle of the night and decided to go and buy the book. The film has no violent or bloody scenes at all, although its tone is gloomy throughout; after all, the main setting is Shawshank, a prison. Andy is imprisoned though innocent: convicted of murdering his wife and her lover, he is given two life sentences and locked up in Shawshank, a decaying, filthy prison full of male convicts. Andy's temperament is utterly unlike that of the other prisoners, and the treatment he received in his first few years is passed over in a few words, but the harm he suffered can be imagined. He reminds me of Sirius Black, who knew he was innocent; because that conviction was not a happy thought, the Dementors could not suck it away, so he stayed clear-headed and finally escaped from Azkaban. I think that if there is a conviction in one's life, a beam of light in the heart that never goes out, then perseverance will lead to brightness. There is a scene at dusk in which prisoners who have worked all day drink beer on an open lot, paired with Red's line: 'I think he just wanted to feel free again, if only for a moment.' It is beautiful; for just that instant they feel they have escaped their bonds and are enjoying freedom." The word segmentation result that the embodiments of the present invention produce for this passage is then:

"finally whole watch finish buy whole violent bloody plot tone always gloomy scene Shawshank prison Andy innocent imprisoned murder wife lover charge sentenced life-imprisonment locked Shawshank full male-prisoner decaying filthy prison temperament prisoner different Andy received treatment few suffered harm imagine makes me think innocent conviction happy Dementor suck-away clear-headed finally escape Azkaban I think life have conviction light never extinguish unyielding light have scene labor prisoner open-lot beer Red say I think just-want relive freedom just escape bonds enjoy freedom".

303. The server obtains a word vector for each of the at least one word obtained by segmentation.

Because the words obtained by segmentation carry no explicit association with one another, the embodiments of the present invention obtain the similarity between two words by computing the similarity between their word vectors. In other words, the question of whether two words are semantically similar is converted into the problem of computing the similarity between two word vectors.
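As an illustration of comparing two word vectors, the sketch below uses cosine similarity. The source does not say which similarity measure is used, so cosine similarity is an assumption made here; it is simply a common choice for word vectors. The function name is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector has zero length, to avoid dividing by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this measure, vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of vector length.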

In the embodiments of the present invention, the server uses the open-source word2vec (word to vector) tool to train word vectors for the at least one word obtained by segmentation, obtaining at least one word vector. The word2vec tool converts words into vectors and ensures that the relative similarity between vectors correlates with the semantic similarity between the corresponding words.

In other words, word2vec is an efficient algorithmic model that represents words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content, through training, to vector operations in a K-dimensional vector space, and similarity in the vector space can be used to represent semantic similarity between texts.

In the embodiments of the present invention, the result of word vector training is that each word is represented as a K-dimensional vector. The value of K may be 400, which is not specifically limited in the embodiments of the present invention.

In another embodiment, the training parameters of the word2vec tool may be as shown in Table 1:

Table 1

304. The server clusters the word vectors of the at least one word to obtain multiple word categories, where different word categories have different topic information.

In the embodiments of the present invention, after multiple word vectors are obtained in step 303 above, words with similar word vectors also need to be gathered into one set by clustering. The reason is as follows: when different users comment on a multimedia resource, the words they use differ, but the meanings those different words express may be semantically close. This step therefore groups semantically similar words together. Optionally, a topic may be manually labeled for each word category obtained by clustering, so that each word category corresponds to one topic word.

The topic word may be the word that occurs most frequently in a word category, or a word that summarizes the words in the word category, which is not specifically limited in the embodiments of the present invention.
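The first option above, taking the most frequent word in a category as its topic word, can be sketched as follows. The function name and arguments are illustrative; in particular, counting occurrences over the segmented comment tokens is an assumption, since the source does not specify where the frequencies come from.

```python
from collections import Counter


def most_frequent_topic_word(category_words, corpus_tokens):
    """Pick the category member that occurs most often in the corpus.

    category_words: words assigned to one cluster.
    corpus_tokens: all tokens obtained by segmenting the comments.
    Returns None if no category member occurs in the corpus.
    """
    members = set(category_words)
    counts = Counter(token for token in corpus_tokens if token in members)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The second option, a manually written summary word, needs no code: it amounts to a lookup table from cluster ID to a hand-picked topic.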

The embodiments of the present invention use the K-means algorithm to cluster the word vectors obtained in step 303 above. The clustering parameters may be as shown in Table 2:

Table 2

Parameters: n_clusters=200, max_iter=300, n_init=10
Parameter description: cluster into 200 clusters; at most 300 iterations; centroid initialization run 10 times

Here, centroid seeds are the centroids initialized before clustering starts; clustering into 200 clusters means that 200 centroids are initialized. As Table 2 shows, the embodiments of the present invention cluster the words obtained by segmentation into 200 clusters, that is, into 200 word categories.
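To make the three parameters concrete, the following is a small, dependency-free K-means sketch. It is not the tool used by the embodiments. The argument names mirror Table 2; the `seed` argument, the random choice of initial centroids from the data, and the use of the sum of squared distances to pick the best restart are assumptions made for illustration.

```python
import random


def _sq_dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    # Component-wise mean of a non-empty list of vectors.
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(vectors, n_clusters, max_iter=300, n_init=10, seed=0):
    """Return (labels, centroids) from the best of n_init random restarts."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        # Centroid seeds: n_clusters distinct data points chosen at random.
        centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
        labels = None
        for _ in range(max_iter):
            # Assignment step: each vector goes to its nearest centroid.
            new_labels = [
                min(range(n_clusters), key=lambda c: _sq_dist(v, centroids[c]))
                for v in vectors
            ]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            # Update step: move each centroid to the mean of its members.
            for c in range(n_clusters):
                members = [v for v, l in zip(vectors, labels) if l == c]
                if members:
                    centroids[c] = _mean(members)
        # Keep the restart with the smallest total squared distance.
        cost = sum(_sq_dist(v, centroids[l]) for v, l in zip(vectors, labels))
        if best is None or cost < best[0]:
            best = (cost, labels, centroids)
    return best[1], best[2]
```

With the settings in Table 2, the call would be `kmeans(word_vectors, n_clusters=200, max_iter=300, n_init=10)`, where `word_vectors` stands for the list of trained K-dimensional vectors; each resulting label then identifies one word category.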

This step is illustrated below using Table 3 as an example. Table 3 shows 6 word categories. Each word category contains multiple semantically similar words, and each word category has its own topic information, which differs between word categories. For example, the word category with cluster ID 1 and the word category with cluster ID 2 have different topic information: one is "national salvation" and the other is "spy warfare".

In addition, the topic information expresses the core idea and gist of a word category. Taking the word category with cluster ID 1 as an example, its topic information is "national salvation", and accordingly the words contained in that word category relate to national salvation, for example "rescue, brave dangers, take back, flee, recover, run away, escape, save".

Table 3

305, at least one vocabulary that server obtains after participle, the pass for commenting on destination multimedia resource is extracted Keyword converges.

For the step, the embodiment of the present invention uses TF-IDF (Term Frequency-Inverse Document Frequency, term frequency-inverse document frequency) technology closed at least one vocabulary for commenting on destination multimedia resource Key word retrieval.

In the specific implementation, at least one vocabulary is integrated into a document by the embodiment of the present invention first, and TF is for counting The frequency that some vocabulary occurs, i.e. TF include for characterizing in the number and the document that a vocabulary occurs in the document The ratio of total word number；IDF is inverse document word frequency, the significance level for characterizing a vocabulary.

Taking the case where the first probability score is the TF as an example, for each vocabulary in the at least one vocabulary, the first probability score of the vocabulary is calculated as follows:

First, the number of occurrences of the vocabulary in the at least one vocabulary is obtained; then, the first probability score of the vocabulary is obtained based on its number of occurrences in the at least one vocabulary and the quantity of vocabulary that the at least one vocabulary contains.

Put another way, TF = (number of times a vocabulary occurs) / (total number of vocabulary).
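As a minimal sketch of the TF calculation described above, the following Python snippet counts the occurrences of each vocabulary in a segmented word list and divides by the total number of words. The function name `term_frequency` and the sample words are illustrative assumptions and are not taken from the embodiment.

```python
from collections import Counter


def term_frequency(words):
    """Return {word: occurrences / total number of words} for one document.

    `words` is assumed to be the list of vocabulary obtained after word
    segmentation of one resource's comments (hypothetical input format).
    """
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


# Illustrative data only: four segmented words from the comments of one film.
sample_words = ["eavesdropping", "monitoring", "eavesdropping", "secret police"]
tf_scores = term_frequency(sample_words)
```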

Taking the case where the second probability score is the IDF as an example, for each vocabulary in the at least one vocabulary, the second probability score of the vocabulary is calculated as follows:

For each vocabulary in the at least one vocabulary, the server first determines, among all documents stored in the database, the at least one document that includes the vocabulary; then, the second probability score is obtained based on the quantity of the at least one document and the quantity of all documents stored in the database.

Put another way, IDF = log(total number of documents / (number of documents that include the vocabulary + 1)).
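A minimal sketch of the IDF expression above follows. It assumes that each stored document is represented as a set of its vocabulary and that the logarithm is the natural logarithm, since the text does not state the base. The name `inverse_document_frequency` and the sample documents are hypothetical.

```python
import math


def inverse_document_frequency(word, documents):
    """IDF = log(total documents / (documents containing the word + 1)).

    `documents` is assumed to be a non-empty list of sets of words, one set
    per stored multimedia resource (hypothetical representation).
    """
    if not documents:
        raise ValueError("at least one stored document is required")
    containing = sum(1 for document in documents if word in document)
    # Natural logarithm is an assumption; the source only writes "log".
    return math.log(len(documents) / (containing + 1))


# Illustrative data only: four stored documents.
documents = [{"spy", "war"}, {"spy"}, {"love"}, {"war", "love"}]
idf_spy = inverse_document_frequency("spy", documents)
idf_unseen = inverse_document_frequency("unseen", documents)
```

With the +1 in the denominator, a vocabulary that appears in no stored document does not cause a division by zero, and a vocabulary that appears in every document receives a slightly negative score.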

It should be noted that the embodiment of the present invention integrates the comment information of each stored multimedia resource into one document for storage. That is, one document corresponds to one multimedia resource.

In summary, for a given vocabulary, the probability total score of the vocabulary is obtained from the first probability score and the second probability score of the vocabulary, i.e. TF-IDF = TF * IDF.

In another embodiment, after obtaining the probability total score of each vocabulary in the at least one vocabulary, the embodiment of the present invention may sort the probability total scores of the vocabulary in descending order; then, the vocabulary whose probability total scores rank in the top preset number of positions are taken as the key vocabularies of the destination multimedia resource.

The preset number may be, for example, 10 or 20; the embodiment of the present invention does not specifically limit this.
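The ranking step can be sketched as follows, assuming the TF and IDF scores have already been computed and are held in two dictionaries keyed by vocabulary. The function name `select_key_words`, the parameter `preset_number` and the sample scores are illustrative assumptions.

```python
def select_key_words(tf, idf, preset_number=10):
    """Rank vocabulary by TF * IDF and keep the top `preset_number` entries.

    `tf` and `idf` are assumed to be dicts that map each vocabulary to its
    first and second probability score respectively (hypothetical format).
    """
    totals = {word: tf[word] * idf[word] for word in tf}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:preset_number]


# Illustrative scores only.
tf = {"spy": 0.4, "the": 0.5, "rescue": 0.1}
idf = {"spy": 1.0, "the": 0.0, "rescue": 2.0}
key_words = select_key_words(tf, idf, preset_number=2)
```

In this sample the very frequent word "the" drops out because its IDF is 0, which is the effect the product TF * IDF is meant to achieve.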

306, based on the key vocabularies of the destination multimedia resource and the subject information of the multiple classified vocabularies, the server generates label information for the destination multimedia resource.

After the subject information of the multiple classified vocabularies has been obtained through step 304 above and the key vocabularies of the destination multimedia resource have been obtained through step 305 above, this step maps the key vocabularies onto the subject information: the key vocabularies are used to look up the corresponding subject information, and the subject information found is used as the label information of the destination multimedia resource.

That is, when generating label information for the destination multimedia resource based on the key vocabularies and the subject information of the multiple classified vocabularies, the embodiment of the present invention proceeds as follows: first, the subject information corresponding to the key vocabularies of the destination multimedia resource is determined among the subject information of the multiple classified vocabularies; then, that subject information is used as the label information of the destination multimedia resource.

Determining the subject information corresponding to the key vocabularies among the subject information of the multiple classified vocabularies can in turn be subdivided into the following steps:

a, for any one key vocabulary, search whether the multiple classified vocabularies include the key vocabulary;

b, if a classified vocabulary includes the key vocabulary, determine the subject information of that classified vocabulary as the subject information corresponding to the key vocabulary.
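Steps a and b can be sketched as a simple membership lookup. The sketch assumes the classified vocabularies are held as a dictionary from subject information to a set of words; the function name `find_subjects` and the cluster contents are hypothetical. How a key vocabulary that belongs to no classified vocabulary is handled is not stated in the text; here it is simply skipped.

```python
def find_subjects(key_words, classified_vocabularies):
    """Map each key vocabulary to the subject of the cluster that contains it.

    `classified_vocabularies` is assumed to be {subject information: set of
    words in that classified vocabulary} (hypothetical representation).
    """
    subject_of = {}
    for key_word in key_words:
        for subject, words in classified_vocabularies.items():
            if key_word in words:  # step a: does this cluster include the word?
                subject_of[key_word] = subject  # step b: take its subject
                break
    return subject_of


# Illustrative clusters only.
classified_vocabularies = {
    "eavesdropping": {"eavesdropping", "monitoring", "secret police"},
    "spy warfare": {"spy", "agent"},
}
subject_of = find_subjects(["monitoring", "spy", "popcorn"], classified_vocabularies)
```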

In another embodiment, the embodiment of the present invention may also set a weight for each item of label information generated. The weight is derived as follows: for each item of generated label information, the probability total score of the key vocabulary corresponding to the label information is used as the weighted value of the label information. Specifically, if at least two key vocabularies correspond to the label information, the sum of the probability total scores of the key vocabularies corresponding to the label information is used as the weighted value of the label information.

The generation of the above label information and the setting of its weight are illustrated below, taking Fig. 4 as an example.

Taking the film "eavesdropping storm" as an example, 4 key vocabularies of the film, "eavesdropping, monitoring, secret police, monitoring", correspond to the subject information "eavesdropping", so one item of label information of the film is "eavesdropping", and the weight of this label information is 0.149 + 0.131 + 0.129 + 0.052 = 0.461. Here, 0.149 is the probability total score of the key vocabulary "eavesdropping", 0.131 is that of the key vocabulary "monitoring", 0.129 is that of the key vocabulary "secret police", and 0.052 is that of the key vocabulary "monitoring".
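The weight calculation in this example can be reproduced with a short sketch. The four scores are the ones quoted above; the function name `label_weights` and the data layout (a list of pairs plus a lookup dictionary) are illustrative assumptions. A list of pairs is used because two of the translated key vocabularies share the same English rendering.

```python
from collections import defaultdict


def label_weights(scored_key_words, subject_of):
    """Sum the probability total scores of the key vocabularies per label.

    `scored_key_words` is assumed to be a list of (key vocabulary, score)
    pairs and `subject_of` a dict from key vocabulary to subject information.
    """
    weights = defaultdict(float)
    for key_word, total_score in scored_key_words:
        subject = subject_of.get(key_word)
        if subject is not None:
            weights[subject] += total_score
    return dict(weights)


# Scores quoted in the example for the film "eavesdropping storm".
scored_key_words = [
    ("eavesdropping", 0.149),
    ("monitoring", 0.131),
    ("secret police", 0.129),
    ("monitoring", 0.052),
]
subject_of = {
    "eavesdropping": "eavesdropping",
    "monitoring": "eavesdropping",
    "secret police": "eavesdropping",
}
weights = label_weights(scored_key_words, subject_of)
```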

It should be noted that, by repeating steps 301 to 306 above, the server can automatically add label information to every multimedia resource stored in the database. Once the label information of the multimedia resources has been extracted, one effective application is the recommendation of similar multimedia resources.

In another embodiment, referring to Fig. 5, the multimedia resource recommendation method provided in an embodiment of the present invention includes the following steps:

501, the server obtains the primary vector information of the destination multimedia resource.

502, the server obtains the secondary vector information of other multimedia resources.

Here, the other multimedia resources are the resources stored in the database other than the destination multimedia resource.

In embodiments of the present invention, in order to calculate the similarity between one multimedia resource and the other multimedia resources in the database, each multimedia resource must first be vectorized. The vectorization process includes:

(1), for any one multimedia resource, term vector training is performed on each item of label information of the multimedia resource to obtain the term vector of each item of label information.

For any label information W, the term vector of W can be represented as [W1v1, W1v2 ... W1v400]. That is, each term vector can be represented by a 1*400 matrix.

(2), for each item of label information, the product of the term vector of the label information and the weighted value of the label information is obtained; the sum of these products over all items of label information is used as the vector information of the multimedia resource.

Assume the label information of a certain film is "eavesdropping, secret service, performance, human nature, life, politics and law, artist, freedom and history"; then the vector of the film = term vector of eavesdropping * weight + term vector of secret service * weight + ... + term vector of history * weight.

If each term vector is represented by a 1*400 matrix, the vector of the film is likewise of size 1*400.
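The weighted sum in step (2) can be sketched in plain Python as follows. The sketch uses 3-dimensional vectors instead of 400-dimensional ones to keep the example readable; the function name `resource_vector`, the weights and the term vectors are all illustrative assumptions, and the term vectors are assumed to have been trained elsewhere.

```python
def resource_vector(label_weights, term_vectors, dimension):
    """Return the sum over labels of (label weight * label term vector).

    `label_weights` is assumed to be {label: weight} and `term_vectors`
    {label: list of floats of length `dimension`} (hypothetical format).
    """
    vector = [0.0] * dimension
    for label, weight in label_weights.items():
        for i, component in enumerate(term_vectors[label]):
            vector[i] += weight * component
    return vector


# Illustrative data only: two labels with 3-dimensional term vectors.
label_weights = {"eavesdropping": 0.5, "history": 0.25}
term_vectors = {
    "eavesdropping": [1.0, 0.0, 2.0],
    "history": [0.0, 4.0, 4.0],
}
film_vector = resource_vector(label_weights, term_vectors, dimension=3)
```

The result has the same dimension as each term vector, which matches the statement that a film vector built from 1*400 term vectors is itself of size 1*400.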

503, based on the primary vector information of the destination multimedia resource and the secondary vector information of the other multimedia resources, the server calculates the similarity between the destination multimedia resource and the other multimedia resources.

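Taking the destination multimedia resource as film A and any other multimedia resource as film B, the embodiment of the present invention may use the cosine similarity algorithm to calculate the similarity between film A and film B:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

Here, i and n are positive integers, A_i and B_i are the i-th components of the vector information of film A and film B respectively, and n is the dimension of the vector information of the two films; for example, n is 400.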
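A minimal sketch of the cosine similarity calculation between two vectors of equal dimension is given here. The function name `cosine_similarity` is illustrative, and returning 0.0 when either vector has zero length is an assumption of this sketch, since the text does not cover that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a resource without any weighted label is treated as
        # dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative 2-dimensional vectors for film A and film B.
similarity_ab = cosine_similarity([3.0, 4.0], [4.0, 3.0])
```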

504, the server selects, among the other multimedia resources, the specified multimedia resources whose similarity is greater than a predetermined threshold.

Assume the destination multimedia resource is "The Shawshank Redemption"; the similarity between this film and each of the other films stored in the database is then calculated. The predetermined threshold may be, for example, 0.8 or 0.9; the embodiment of the present invention does not specifically limit this. Continuing with "The Shawshank Redemption" as an example, as shown in Table 4, the similarities between this film and the other films can be sorted by numerical value.

Table 4

Film title	Similarity
Once Upon a Time in America	0.816
The perfect world	0.811
Godfather 3	0.805
Trainspotting	0.802
It collides	0.802
You shut up at bifurcation!	0.742
Aerial prison	0.724
21 grams	0.723
11 arhats	0.723
This killer is not too cold	0.720

505, the server recommends the specified multimedia resources as resources similar to the destination multimedia resource.

Assuming the predetermined threshold is 0.8, the films "Once Upon a Time in America", "The perfect world", "Godfather 3", "Trainspotting" and "It collides" are recommended as films similar to "The Shawshank Redemption".
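The threshold selection of steps 504 and 505 can be sketched with the values of Table 4. The function name `recommend_similar` and the dictionary layout are illustrative assumptions; the titles and similarity values are those listed in Table 4.

```python
def recommend_similar(similarities, threshold=0.8):
    """Keep resources whose similarity exceeds the threshold, highest first.

    `similarities` is assumed to be {title: similarity to the destination
    multimedia resource} (hypothetical representation).
    """
    selected = [
        (title, score) for title, score in similarities.items() if score > threshold
    ]
    # sorted() is stable, so titles with equal similarity keep their input order.
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Similarities to "The Shawshank Redemption" as listed in Table 4.
table_4 = {
    "Once Upon a Time in America": 0.816,
    "The perfect world": 0.811,
    "Godfather 3": 0.805,
    "Trainspotting": 0.802,
    "It collides": 0.802,
    "You shut up at bifurcation!": 0.742,
    "Aerial prison": 0.724,
    "21 grams": 0.723,
    "11 arhats": 0.723,
    "This killer is not too cold": 0.720,
}
recommended = recommend_similar(table_4, threshold=0.8)
```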

506, when displaying the label information of the destination multimedia resource, the terminal simultaneously displays the resources similar to the destination multimedia resource.

Continuing with "The Shawshank Redemption" as the destination multimedia resource, the terminal may display the label information of the film and the resources similar to it in the manner shown in Fig. 6; the embodiment of the present invention does not specifically limit this.

In conclusion method provided in an embodiment of the present invention has the advantages that：

1) full automation, is realized when adding label for multimedia resource, due to being not necessarily to put into manpower into row label The addition of information, so without consuming a large amount of manpower and time, it is intelligent preferable.

2) big data on internet, has been crawled, the comment information of multimedia resource has been obtained with this, and also complete Second-rate comment information has been filtered out in the comment information in portion, and is based further on filtered comment information to generate mark Information is signed, so the label information generated is more accurate, and then subsequently carries out multimedia resource in the label information based on generation When recommendation, recommendation effect is more preferably.

3), the label information generated carries weight, is recommending similar multimedia resource based on the label information with weight When, it is ensured that good recommendation effect.

Fig. 7 is a structural schematic diagram of a label information generating device for multimedia resources provided in an embodiment of the present invention. Referring to Fig. 7, the device includes:

First acquisition module 701, configured to obtain the comment information of the destination multimedia resource and perform word segmentation processing on the comment information;

Second acquisition module 702, configured to obtain the term vector of the at least one vocabulary obtained after word segmentation;

Cluster module 703, configured to cluster the term vectors of the at least one vocabulary to obtain multiple classified vocabularies, where different classified vocabularies have different subject information;

Extraction module 704, configured to extract the key vocabularies of the destination multimedia resource from the at least one vocabulary obtained after word segmentation;

Generation module 705, configured to generate label information for the destination multimedia resource based on the key vocabularies and the subject information of the multiple classified vocabularies.

The device provided in an embodiment of the present invention generates label information for multimedia resources fully automatically; since no manpower needs to be invested in adding label information, large amounts of manpower and time are not consumed, and the degree of intelligence is high. Moreover, based on the comment information of a multimedia resource, the embodiment of the present invention obtains the subject information of multiple classified vocabularies and multiple key vocabularies used to comment on the multimedia resource, and generates label information for the multimedia resource from them. This not only makes the generated label information more accurate, but also improves the precision of subsequent multimedia resource recommendation.

In another embodiment, the extraction module is further configured to: for each vocabulary in the at least one vocabulary, obtain the first probability score and the second probability score of the vocabulary, where the first probability score characterizes the frequency of occurrence of the vocabulary and the second probability score characterizes the significance level of the vocabulary; obtain the probability total score of the vocabulary based on the first probability score and the second probability score; and, in descending order, take the vocabulary whose probability total scores rank in the top preset number of positions as the key vocabularies.

In another embodiment, the extraction module is further configured to: integrate the at least one vocabulary into one document; for each vocabulary in the at least one vocabulary, determine, among all documents stored in the database, the at least one document that includes the vocabulary; and obtain the second probability score of the vocabulary based on the quantity of the at least one document and the quantity of all documents stored in the database.

In another embodiment, the generation module is further configured to: determine, among the subject information of the multiple classified vocabularies, the subject information corresponding to the key vocabularies; and use the subject information corresponding to the key vocabularies as the label information of the destination multimedia resource.

In another embodiment, each classified vocabulary includes at least one semantically similar vocabulary; the generation module is further configured to: for any one key vocabulary, search whether the multiple classified vocabularies include the key vocabulary; and, if a classified vocabulary includes the key vocabulary, determine the subject information of that classified vocabulary as the subject information corresponding to that key vocabulary.

In another embodiment, the device further includes:

Setup module, configured to, for each item of label information generated for the destination multimedia resource, use the probability total score of the key vocabulary corresponding to the label information as the weighted value of the label information.

In another embodiment, the setup module is further configured to, if at least two key vocabularies correspond to the label information, use the sum of the probability total scores of the key vocabularies corresponding to the label information as the weighted value of the label information.

In another embodiment, the device further includes:

Recommending module, configured to: obtain the primary vector information of the destination multimedia resource; obtain the secondary vector information of other multimedia resources, the other multimedia resources being the resources stored in the database other than the destination multimedia resource; obtain the similarity between the destination multimedia resource and the other multimedia resources based on the primary vector information and the secondary vector information; and recommend resources similar to the destination multimedia resource according to the obtained similarity.

In another embodiment, the recommending module is further configured to: for any one multimedia resource, obtain the term vector of each item of label information of the multimedia resource; and obtain the vector information of the multimedia resource based on the term vector and weighted value of each item of label information.

All of the optional technical solutions above may be combined in any manner to form alternative embodiments of the present disclosure, which are not described one by one here.

It should be noted that when the label information generating device for multimedia resources provided by the above embodiment generates label information, the division into the functional modules above is only an example. In practical applications, the functions above may be assigned to different functional modules as needed, i.e. the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the label information generating device for multimedia resources provided by the above embodiment and the embodiments of the label information generation method for multimedia resources belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.

Fig. 8 is a structural schematic diagram of an equipment for generating label information provided in an embodiment of the present invention. The equipment 800 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 801 and one or more memories 802, where at least one instruction is stored in the memory 802, and the at least one instruction is loaded and executed by the processor 801 to implement the label information generation method for multimedia resources provided by each of the method embodiments above. Of course, the equipment may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may also include other components for implementing the functions of the equipment, which are not described here.

In an exemplary embodiment, a computer readable storage medium is also provided, for example a memory including instructions, where the instructions can be executed by a processor in a terminal to carry out the label information generation method for multimedia resources in the above embodiments. For example, the computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and so on.

One of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A label information generation method for multimedia resources, characterized in that the method includes:

obtaining the comment information of a destination multimedia resource, and performing word segmentation processing on the comment information;

obtaining the term vector of at least one vocabulary obtained after word segmentation;

clustering the term vectors of the at least one vocabulary to obtain multiple classified vocabularies, where different classified vocabularies have different subject information;

extracting the key vocabularies of the destination multimedia resource from the at least one vocabulary obtained after word segmentation;

generating label information for the destination multimedia resource based on the key vocabularies and the subject information of the multiple classified vocabularies.

2. The method according to claim 1, characterized in that extracting the key vocabularies of the destination multimedia resource from the at least one vocabulary obtained after word segmentation includes:

for each vocabulary in the at least one vocabulary, obtaining the first probability score and the second probability score of the vocabulary, where the first probability score characterizes the frequency of occurrence of the vocabulary and the second probability score characterizes the significance level of the vocabulary;

obtaining the probability total score of the vocabulary based on the first probability score and the second probability score;

in descending order, taking the vocabulary whose probability total scores rank in the top preset number of positions as the key vocabularies.

3. The method according to claim 2, characterized in that the process of obtaining the second probability score includes:

integrating the at least one vocabulary into one document;

for each vocabulary in the at least one vocabulary, determining, among all documents stored in a database, at least one document that includes the vocabulary;

obtaining the second probability score of the vocabulary based on the quantity of the at least one document and the quantity of all documents stored in the database.

4. The method according to claim 1, characterized in that generating label information for the destination multimedia resource based on the key vocabularies and the subject information of the multiple classified vocabularies includes:

determining, among the subject information of the multiple classified vocabularies, the subject information corresponding to the key vocabularies;

using the subject information corresponding to the key vocabularies as the label information of the destination multimedia resource.

5. The method according to claim 4, characterized in that each classified vocabulary includes at least one semantically similar vocabulary;

determining, among the subject information of the multiple classified vocabularies, the subject information corresponding to the key vocabularies includes:

for any one key vocabulary, searching whether the multiple classified vocabularies include the key vocabulary;

if a classified vocabulary includes the key vocabulary, determining the subject information of that classified vocabulary as the subject information corresponding to that key vocabulary.

6. The method according to claim 1, characterized in that the method further includes:

for each item of label information generated for the destination multimedia resource, using the probability total score of the key vocabulary corresponding to the label information as the weighted value of the label information.

7. The method according to claim 6, characterized in that using the probability total score of the key vocabulary corresponding to the label information as the weighted value of the label information includes:

if at least two key vocabularies correspond to the label information, using the sum of the probability total scores of the key vocabularies corresponding to the label information as the weighted value of the label information.

8. The method according to any one of claims 1 to 7, characterized in that the method further includes:

obtaining the primary vector information of the destination multimedia resource;

obtaining the secondary vector information of other multimedia resources, the other multimedia resources being the resources stored in a database other than the destination multimedia resource;

obtaining the similarity between the destination multimedia resource and the other multimedia resources based on the primary vector information and the secondary vector information;

recommending resources similar to the destination multimedia resource according to the obtained similarity.

9. The method according to claim 8, characterized in that the process of obtaining the vector information of any one multimedia resource includes:

for any one multimedia resource, obtaining the term vector of each item of label information of the multimedia resource;

obtaining the vector information of the multimedia resource based on the term vector and weighted value of each item of label information.

10. A label information generating device for multimedia resources, characterized in that the device includes:

a first acquisition module, configured to obtain the comment information of a destination multimedia resource and perform word segmentation processing on the comment information;

a second acquisition module, configured to obtain the term vector of at least one vocabulary obtained after word segmentation;

a cluster module, configured to cluster the term vectors of the at least one vocabulary to obtain multiple classified vocabularies, where different classified vocabularies have different subject information;

an extraction module, configured to extract the key vocabularies of the destination multimedia resource from the at least one vocabulary obtained after word segmentation;

a generation module, configured to generate label information for the destination multimedia resource based on the key vocabularies and the subject information of the multiple classified vocabularies.

11. A storage medium, characterized in that at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the label information generation method for multimedia resources according to any one of claims 1 to 9.

12. An equipment for generating label information, characterized in that the equipment includes a processor and a memory, at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the label information generation method for multimedia resources according to any one of claims 1 to 9.