CN104915447A

CN104915447A - Method and device for tracing hot topics and confirming keywords

Info

Publication number: CN104915447A
Application number: CN201510372462.0A
Authority: CN
Inventors: 乔奇
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2015-06-30
Filing date: 2015-06-30
Publication date: 2015-09-16
Anticipated expiration: 2035-06-30
Also published as: CN104915447B

Abstract

The present invention discloses a method for tracking hot topics and determining their keywords. Hot data is obtained from preset websites and processed by word segmentation and clustering, so that the keywords of hot topics are determined in a timely manner. In the hot topic tracking method, the video files corresponding to each hot topic are obtained from the text similarity between the topic's keywords and the description information of the video files. With the technical solution provided by the embodiments of the invention, both the determination of hot topic keywords and the tracking of hot topics are completed automatically by a server, which effectively avoids the delay caused by identifying hot topics manually. Even when a hot topic consists of multiple phased events, the server can execute the technical solution repeatedly to periodically update the video files corresponding to the topic, which reduces manual operation cost.

Description

Method and device for hot topic tracking and keyword determination

Technical Field

The present invention relates to the field of Internet technology, and in particular to a method and device for hot topic tracking and keyword determination.

Background

A hot topic is an issue that the public cares about most within a certain period and scope. On video websites, most UGC (User Generated Content) videos are published by users as soon as a hot event or topic occurs. Such videos are time-sensitive and attract a high degree of user attention. However, because UGC videos on a video website are massive in number, disorganized, and highly repetitive, it is difficult for users to obtain important information in time. Many video websites therefore discover and track hot topics for such videos, so that videos on the same hot topic are aggregated together for users to browse.

At present, video websites rely on operations staff to discover and track hot topics. The operations staff analyze description information of video files, such as titles and synopses, to determine the current hot topics, and then determine the videos corresponding to each hot topic.

Hot topics determined by operations staff through analysis of video description information often lag behind events. In addition, some hot topics consist of many phased events and span a long time, so operations staff must follow and analyze them continuously, and the manual operation cost is high.

Summary of the Invention

To solve the above problems, the embodiments of the present invention disclose a method and device for hot topic tracking and keyword determination. The technical solution is as follows:

A hot topic keyword determination method, applied to a server, the method comprising:

Performing word segmentation on the text information of each hot data item obtained from preset websites, to obtain a set of base words for each hot data item;

For each hot data item, determining, in the set of base words of that item and according to the frequency with which the base words whose attribute is named entity occur in the text information of that item, the named-entity base words to be used for building the text model of that item;

Building the text model of each hot data item from the determined named-entity base words;

Clustering all obtained hot data items according to the text similarity between the text models of every two hot data items, to obtain at least one cluster;

For each cluster, determining the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hot data items contained in the cluster, and taking the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.

In an embodiment of the present invention, after obtaining the set of base words of each hot data item and before determining the named-entity base words used for building the text model of that item, the method further comprises:

Performing stop-word filtering on the base words in the set of base words of each hot data item.

In an embodiment of the present invention, the method further comprises:

For each cluster, finding, among the determined keywords of the cluster, at least one keyword with the highest frequency;

Selecting one title from the titles of the hot data items that contain the at least one keyword found, as the hot topic corresponding to the cluster.

A hot topic tracking method based on the above hot topic keyword determination method, applied to a server, the method comprising:

For each hot topic, determining the text similarity between the keywords of the hot topic and the description information of each video file;

Tracking the video files corresponding to the hot topic according to the determined text similarity.

In an embodiment of the present invention, tracking the video files corresponding to the hot topic according to the determined text similarity comprises:

Determining a candidate video set for the hot topic according to whether the determined text similarity is greater than a preset first threshold;

Performing video deduplication on the candidate video set of the hot topic;

Determining the video files corresponding to the hot topic according to the deduplication result.

In an embodiment of the present invention, after determining the video files corresponding to the hot topic according to the deduplication result, the method further comprises:

Judging whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold;

If so, performing hierarchical clustering on the determined video files according to the publication-time intervals between videos whose publication times are adjacent, until the number of resulting categories is not greater than the preset second threshold;

Determining a representative video for each category according to the quality of the videos in that category;

Taking the representative video of each category as a related video of the hot topic.
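The time-based grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the hierarchical clustering merges categories or how video quality is measured, so the sketch merges the two neighbouring groups with the smallest publication gap first, and the `published` and `quality` fields are hypothetical.

```python
def cluster_by_time(videos, max_clusters):
    """Merge time-adjacent groups until at most max_clusters remain.

    Assumption: the two neighbouring groups separated by the smallest
    publication gap are merged first (the source does not specify this).
    """
    ordered = sorted(videos, key=lambda v: v["published"])
    clusters = [[v] for v in ordered]
    while len(clusters) > max_clusters and len(clusters) > 1:
        # Gap between the last video of one group and the first of the next.
        gaps = [
            clusters[i + 1][0]["published"] - clusters[i][-1]["published"]
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def representatives(clusters):
    """Pick the video with the highest (hypothetical) quality score per group."""
    return [max(c, key=lambda v: v["quality"]) for c in clusters]
```

Merging by smallest gap keeps bursts of uploads around the same phase of an event together, so each representative video stands for one phase of the topic.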

In an embodiment of the present invention, performing video deduplication on the candidate video set of the hot topic comprises:

Sorting the videos in the candidate video set from earliest to latest according to the publication times of the video files in the candidate video set of the hot topic;

Judging in turn whether two video files with adjacent publication times are duplicate videos and, if so, retaining in the candidate video set of the hot topic the video file published earlier and deleting the video file published later.

In an embodiment of the present invention, judging whether two video files with adjacent publication times are duplicate videos comprises:

Calculating the text similarity between the description information of the two video files with adjacent publication times, and determining from the calculated text similarity whether the two video files are duplicate videos;

Or,

Calculating the visual feature similarity of the two video files with adjacent publication times, and determining from the calculated visual feature similarity whether the two video files are duplicate videos;

Or,

Calculating both the text similarity between the description information of the two video files with adjacent publication times and the visual feature similarity of the two video files, and determining from the calculated text similarity and visual feature similarity whether the two video files are duplicate videos.

A hot topic keyword determination device, applied to a server, the device comprising:

A base word set obtaining module, configured to perform word segmentation on the text information of each hot data item obtained from preset websites, to obtain a set of base words for each hot data item;

A named-entity base word determination module, configured to, for each hot data item, determine, in the set of base words of that item and according to the frequency with which the base words whose attribute is named entity occur in the text information of that item, the named-entity base words to be used for building the text model of that item;

A text model building module, configured to build the text model of each hot data item from the determined named-entity base words;

A hot data clustering module, configured to cluster all obtained hot data items according to the text similarity between the text models of every two hot data items, to obtain at least one cluster;

A hot topic keyword determination module, configured to, for each cluster, determine the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hot data items contained in the cluster, and to take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.

In an embodiment of the present invention, the base word set obtaining module is further configured to:

Perform stop-word filtering on the base words in the set of base words of each hot data item.

In an embodiment of the present invention, the device further comprises:

A hot topic title determination module, configured to, for each cluster, find among the determined keywords of the cluster at least one keyword with the highest frequency, and to select one title from the titles of the hot data items that contain the at least one keyword found, as the hot topic corresponding to the cluster.

A hot topic tracking device based on the above hot topic keyword determination device, applied to a server, the device comprising:

A text similarity determination module, configured to, for each hot topic, determine the text similarity between the keywords of the hot topic and the description information of each video file;

A video file tracking module, configured to track the video files corresponding to the hot topic according to the determined text similarity.

In an embodiment of the present invention, the video file tracking module comprises:

A candidate video set determination submodule, configured to determine the candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold;

A deduplication submodule, configured to perform video deduplication on the candidate video set of the hot topic;

A video file determination submodule, configured to determine the video files corresponding to the hot topic according to the deduplication result.

In an embodiment of the present invention, the device further comprises:

A judging submodule, configured to judge whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold and, if so, to trigger a clustering submodule;

The clustering submodule, configured to perform hierarchical clustering on the determined video files corresponding to the hot topic according to the publication-time intervals between videos whose publication times are adjacent, until the number of resulting categories is not greater than the preset second threshold;

A representative video determination submodule, configured to determine a representative video for each category according to the quality of the videos in that category;

A related video determination submodule, configured to take the representative video of each category as a related video of the hot topic.

In an embodiment of the present invention, the deduplication submodule comprises:

A video sorting unit, configured to sort the videos in the candidate video set from earliest to latest according to the publication times of the video files in the candidate video set of the hot topic;

A duplicate video judging unit, configured to judge in turn whether two video files with adjacent publication times are duplicate videos and, if so, to trigger a deduplication unit;

The deduplication unit, configured to retain in the candidate video set of the hot topic the video file published earlier and delete the video file published later.

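In an embodiment of the present invention, the duplicate video judging unit is specifically configured to:

Calculate the text similarity between the description information of the two video files with adjacent publication times, and determine from the calculated text similarity whether the two video files are duplicate videos;

Or,

Calculate the visual feature similarity of the two video files with adjacent publication times, and determine from the calculated visual feature similarity whether the two video files are duplicate videos;

Or,

Calculate both the text similarity between the description information of the two video files with adjacent publication times and the visual feature similarity of the two video files, and determine from the calculated text similarity and visual feature similarity whether the two video files are duplicate videos.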

With the technical solution provided by the embodiments of the present invention, hot data is obtained from preset websites and processed by word segmentation, clustering, and similar steps, so that hot topics and their keywords are determined in a timely manner. For each hot topic, the corresponding video files are tracked through the text similarity between the topic's keywords and the description information of the video files. The determination of hot topic keywords and the tracking of hot topics are thus completed automatically by the server, which effectively avoids the lag that arises when hot topics are determined by operations staff. Even if a hot topic consists of many phased events, the server can execute the technical solution repeatedly to periodically update the video files corresponding to that topic, which saves manual operation cost.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of an implementation of the hot topic keyword determination method in an embodiment of the present invention;

Fig. 2 is a flowchart of an implementation of the hot topic tracking method in an embodiment of the present invention;

Fig. 3 is a flowchart of another implementation of the hot topic tracking method in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the hot topic keyword determination device in an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the hot topic tracking device in an embodiment of the present invention.

Detailed Description

To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, these solutions are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, which is a flowchart of an implementation of a hot topic keyword determination method provided by an embodiment of the present invention, the method is applied to a server and may comprise the following steps:

S110: Perform word segmentation on the text information of each hot data item obtained from preset websites, to obtain a set of base words for each hot data item.

In the Internet era, information spreads rapidly, and within a given period people's attention tends to be concentrated. Websites such as microblog topic charts and search trending lists aggregate the hot data that people are following, and reflect in concentrated form people's views on a topic or event in the current period.

In practice, the same topic may be expressed in different wording while the underlying subject is the same. Hot data may be obtained, regularly or irregularly, from preset websites (such as the microblog topic charts and search trending lists mentioned above) by a web crawler or another information acquisition method. Each hot data item obtained may include text information such as its title, summary, detailed description, and/or linked content.

It will be understood that different websites may represent hot data differently. The hot data obtained can therefore first be converted, according to a preset format, into a unified representation, for example: title, description, time, related text information. Because each hot data item contains a large amount of text, word segmentation is performed on the text information of each item to obtain its set of base words.

In an embodiment of the present invention, stop-word filtering may also be performed on the base words in the set of base words of each hot data item. In practice, a stop-word dictionary can be set up in advance. It may contain function words such as "的", "地", and "得", and may also contain words designated by operations staff, such as "high definition" and "low definition".
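A minimal sketch of the unified record, segmentation, and stop-word filtering described for S110 follows. The `HotItem` fields mirror the example format in the text; the `segment` function is a placeholder assumption (whitespace splitting), since a real system would use a Chinese word segmenter, and the stop-word list is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class HotItem:
    """Unified representation: title, description, time, related text."""
    title: str
    description: str
    time: str
    related_text: str


# Hypothetical stop-word dictionary: function words plus operator-chosen terms.
STOP_WORDS = frozenset({"the", "of", "a", "hd", "sd"})


def segment(text):
    # Placeholder segmentation (assumption): lowercase and split on whitespace.
    return text.lower().split()


def base_word_counts(item, stop_words=STOP_WORDS):
    """Return {base word: frequency} for one hot data item, stop words removed."""
    text = " ".join([item.title, item.description, item.related_text])
    return Counter(w for w in segment(text) if w not in stop_words)
```

Keeping the frequency alongside each base word means the later steps can rank words without rescanning the original text.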

S120: For each hot data item, determine, in the set of base words of that item and according to the frequency with which the base words whose attribute is named entity occur in the text information of that item, the named-entity base words to be used for building the text model of that item.

In the embodiments of the present invention, named entity recognition can be performed to determine the base words whose attribute is named entity. Such a base word may be a base word that carries semantic meaning, such as a person name, place name, or building name, and may also be a base word with a verb or noun part of speech. The set of base words obtained for each hot data item in step S110 may include the frequency with which each base word occurs in the text information of that item, so the named-entity base words used for building the item's text model can be determined from those frequencies. Alternatively, the text information of the hot data item can be searched to obtain the frequency with which each named-entity base word occurs in it, and the named-entity base words used for building the text model can then be determined from that frequency information. It should be noted that methods for recognizing base words with the named-entity attribute are prior art and are not described further here.

In practice, the named-entity base words that occur most frequently in the text information of a hot data item can be selected from its base word set to characterize the item. For each hot data item, the named-entity base words can be sorted by the frequency with which they occur in the item's text information, and either the top N or the top x% can be taken as the named-entity base words used for building the item's text model. N and x can be set and adjusted according to actual conditions; the embodiments of the present invention place no limit on them.
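The top-N or top-x% selection can be sketched as below. Named entity recognition itself is treated as given: the `entity_words` argument is an assumed input listing which base words were tagged as named entities, and the function name and tie-breaking rule are illustrative choices, not taken from the source.

```python
import math


def select_model_words(counts, entity_words, top_n=None, top_percent=None):
    """Keep the most frequent named-entity base words of one hot data item.

    counts: {base word: frequency}; entity_words: words tagged as named
    entities (assumed to come from a separate recognition step).
    """
    ranked = [(w, c) for w, c in counts.items() if w in entity_words]
    # Highest frequency first; ties broken alphabetically for determinism.
    ranked.sort(key=lambda wc: (-wc[1], wc[0]))
    if top_n is not None:
        k = top_n
    elif top_percent is not None:
        k = math.ceil(len(ranked) * top_percent / 100)
    else:
        k = len(ranked)
    return dict(ranked[:k])
```

The returned mapping of word to frequency is already in the shape needed for a frequency-based vector space model.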

S130: Build the text model of each hot data item from the determined named-entity base words.

Step S120 determines the named-entity base words used for building the text model of each hot data item, and the text model of each item can be built from them. Specifically, the text model of each hot data item can be expressed as a vector space model (VSM) that records the frequency with which each determined named-entity base word occurs in the text information of the corresponding item. The vector space model is prior art and is not described further here.

S140: Cluster all obtained hot data items according to the text similarity between the text models of every two hot data items, to obtain at least one cluster.

With the text model of each hot data item built in step S130, the text similarity between the text models of every two items can be calculated using prior-art measures, such as the commonly used cosine similarity or Jaccard distance; this is not described further here.
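As an illustration of the two measures named above, the sketch below represents each text model as a sparse frequency vector (a dict from word to frequency, which is an assumed representation) and computes cosine similarity and Jaccard similarity on it. Jaccard distance is one minus the Jaccard similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Shared words divided by all words, ignoring frequencies."""
    words_a, words_b = set(a), set(b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

Because only the selected named-entity base words are stored, the vectors stay short, which is what keeps pairwise comparison cheap.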

All obtained hot data items are clustered according to the calculated text similarity between the text models of every two items, yielding at least one cluster. For example, for a given hot data item, the items whose text-model similarity to it exceeds a preset threshold are grouped into one cluster. Computing text similarity on the text models, rather than on the full text information of the hot data, requires less computation and is more efficient.

For ease of understanding, consider a simple example. There are four hot data items whose text models are A, B, C, and D, and the text similarities among A, B, C, and D are shown in Table 1:

Table 1

Text model	A	B	C	D
A	1	0.8	0.3	0.5
B	0.8	1	0.3	0.2
C	0.3	0.3	1	0.5
D	0.5	0.2	0.5	1

Suppose that two hot data items are grouped into one cluster when the text similarity of their text models is higher than 0.5. Under this condition, the hot data items corresponding to A and B form one cluster (similarity 0.8), the item corresponding to C forms a cluster by itself, and the item corresponding to D forms a cluster by itself, because D's similarities to A and C are exactly 0.5 and therefore not higher than the threshold.
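The Table 1 example can be reproduced with a short sketch. The source does not name a specific clustering algorithm, so this version assumes transitive grouping: any two items whose similarity is strictly above the threshold end up in the same cluster, implemented with a small union-find. The function name and the similarity lookup keyed by `frozenset` pairs are illustrative choices.

```python
def cluster_by_threshold(names, sim, threshold):
    """Group items whose pairwise similarity is strictly above threshold."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if sim[frozenset((a, b))] > threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())


# Pairwise similarities from Table 1.
TABLE_1 = {
    frozenset("AB"): 0.8, frozenset("AC"): 0.3, frozenset("AD"): 0.5,
    frozenset("BC"): 0.3, frozenset("BD"): 0.2, frozenset("CD"): 0.5,
}
clusters = cluster_by_threshold(["A", "B", "C", "D"], TABLE_1, 0.5)
```

With the threshold at 0.5 the result matches the worked example: A and B together, C alone, D alone. Lowering the threshold below 0.5 would pull D into a cluster with A and C under this transitive rule.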

S150: For each cluster, determine the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hot data items contained in the cluster, and take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.

In step S140, all obtained hot data items are clustered into at least one cluster, and the hot data items in each cluster characterize one hot topic. For each cluster, the base words that occur with high frequency (that is, meet a preset threshold condition) in the text information of the cluster's hot data items can serve as the keywords of the cluster, and these keywords can then be taken as the keywords of the hot topic corresponding to the cluster.

In an embodiment of the present invention, for each cluster, at least one keyword with the highest frequency can be found among the determined keywords of the cluster, and one title can be selected from the titles of the hot data items that contain the keyword or keywords found, as the hot topic corresponding to the cluster.
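A sketch of keyword and title selection for one cluster is given below. Several details are assumptions: the threshold condition is taken to be a minimum total frequency (`min_count`), each item is a dict with hypothetical `title` and `entity_counts` fields, and where several titles qualify the first one is chosen, since the text does not say how the title is picked among candidates.

```python
from collections import Counter


def cluster_keywords(cluster_items, min_count):
    """Sum named-entity word frequencies over the cluster; keep frequent ones."""
    total = Counter()
    for item in cluster_items:
        total.update(item["entity_counts"])
    return {w: c for w, c in total.items() if c >= min_count}


def pick_topic_title(cluster_items, keywords):
    """Return the title of the first item containing a top-frequency keyword."""
    if not keywords:
        return None
    top = max(keywords.values())
    top_words = {w for w, c in keywords.items() if c == top}
    for item in cluster_items:
        if top_words & set(item["entity_counts"]):
            return item["title"]
    return None
```

Using an existing title rather than a generated one keeps the topic label readable without any extra text generation step.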

With the technical solution provided by the embodiments of the present invention, the server obtains hot data from preset websites and processes it by word segmentation, clustering, and similar steps, so that hot topics and their keywords are determined in a timely manner. This effectively avoids the lag that arises when hot topics are determined by operations staff.

Based on the above hot topic keyword determination method, the embodiments of the present invention also provide a hot topic tracking method. Referring to Fig. 2, the method is applied to a server and may comprise the following steps:

S210: For each hot topic, determine the text similarity between the keywords of the hot topic and the description information of each video file;

The current hot topics and the keywords of each hot topic are determined by the above hot topic keyword determination method. The description information of a video file on a video website may be its title or synopsis. For a given hot topic, the higher the text similarity between the topic's keywords and a video file's description information, the more closely the video file relates to the topic. Those skilled in the art can calculate the text similarity between the keywords of each hot topic and the description information of each video file using prior art, which is not described further here.

S220: Track the video files corresponding to the hot topic according to the determined text similarity.

Step S210 determines the text similarity between the keywords of each hot topic and the description information of each video file. According to the determined text similarity, the video files corresponding to each hot topic can be tracked and obtained.

In an embodiment of the present invention, step S220 may comprise the following steps:

S221: Determine the candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold.

It will be understood that the higher the text similarity between the keywords of a hot topic and the description information of a video file, the more closely the video file relates to the topic. In practice, if the determined text similarity is greater than the preset first threshold, the corresponding video file is added to the candidate video set of the hot topic.
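The candidate selection in S221 can be sketched as a simple filter. The similarity measure here is an assumption chosen for brevity (the fraction of topic keywords that appear in the description after whitespace splitting); the source leaves the measure to prior art, and the `description` field name is hypothetical.

```python
def keyword_overlap(keywords, description):
    """Fraction of topic keywords found in the description (assumed measure)."""
    if not keywords:
        return 0.0
    words = set(description.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)


def candidate_videos(keywords, videos, first_threshold):
    """Keep videos whose similarity to the topic exceeds the first threshold."""
    return [
        v for v in videos
        if keyword_overlap(keywords, v["description"]) > first_threshold
    ]
```

The comparison is strict, matching the "greater than" wording of the step, so a video sitting exactly on the threshold is left out.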

S222: Perform video deduplication on the candidate video set of the hot topic;

It will be understood that video files on a video website, especially UGC (User Generated Content) video files, are mainly uploaded by users, and different users may describe videos of the same content differently, so a video website contains many duplicate or near-duplicate video files. In practice, deduplication can be performed on the videos in the candidate video set of each hot topic.

In an embodiment of the present invention, step S222 may comprise the following steps:

Step 1: Sort the videos in the candidate video set from earliest to latest according to the publication times of the video files in the candidate video set of the hot topic;

Step 2: Judge in turn whether two video files with adjacent publication times are duplicate videos and, if so, retain in the candidate video set of the hot topic the video file published earlier and delete the video file published later.

For ease of description, the two steps above are described together.

In the embodiments of the present invention, the time at which a user uploads a video file is the publication time of that file. To protect the rights and interests of the user who uploaded a video first, for duplicate videos the video file published earlier is preferentially retained and the one published later is deleted.

In practice, the videos in the candidate video set can be sorted from earliest to latest by publication time, and deduplication is carried out by judging whether two video files with adjacent publication times are duplicate videos.
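The sort-then-compare deduplication can be sketched as below. The duplicate test is passed in as a function so that any of the judging methods can be plugged in. One detail is an assumption: after a later video is dropped, the next video is compared against the last retained one, which the text does not spell out. The `published` field name is hypothetical.

```python
def deduplicate(videos, is_duplicate):
    """Sort by publication time and drop later copies of duplicate videos.

    is_duplicate(earlier, later) -> bool is supplied by the caller.
    """
    ordered = sorted(videos, key=lambda v: v["published"])
    kept = []
    for video in ordered:
        # Compare with the most recent retained video (assumed behaviour).
        if kept and is_duplicate(kept[-1], video):
            continue  # Keep the earlier upload, skip the later one.
        kept.append(video)
    return kept
```

Comparing only neighbours in time keeps the pass linear in the number of candidates, at the cost of missing duplicates that are separated by an unrelated upload.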

Whether two video files with adjacent publication times are duplicate videos can be judged in any of the following ways:

First: calculate the text similarity between the description information of the two video files with adjacent publication times, and determine from the calculated text similarity whether the two video files are duplicate videos;

As noted above, the text similarity between the description information of two video files can be calculated using prior art. If the text similarity between the description information of two video files with adjacent publication times is higher than a preset threshold, the two video files can be determined to be duplicate videos.

Second: calculate the visual feature similarity of the two video files with adjacent publication times, and determine from the calculated visual feature similarity whether the two video files are duplicate videos;

On video websites, video files are generally presented to users as thumbnails so that users can view them at a glance. Video files with the same content may have different description information while the visual features of their thumbnails, such as color, texture, and shape, are identical or similar. The visual feature similarity of the thumbnails of two video files with adjacent publication times can therefore be calculated and used to determine whether the two files are duplicate videos. Specifically, two video files whose visual feature similarity is higher than a preset threshold can be determined to be duplicate videos. The visual feature similarity of the thumbnails of different video files can be calculated according to prior art, for example by comparing color histograms.
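One way to compare color histograms is sketched below in plain Python. It is an illustration under assumptions: thumbnails are given as lists of `(r, g, b)` tuples with 0 to 255 channels, colors are quantized into a coarse joint histogram, and similarity is measured by histogram intersection. The source only mentions color histogram comparison in general and does not prescribe this measure.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram with bins**3 cells."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        # Clamp so values near 255 stay in the last bin for any bin count.
        ri = min(r // step, bins - 1)
        gi = min(g // step, bins - 1)
        bi = min(b // step, bins - 1)
        hist[ri * bins * bins + gi * bins + bi] += 1
    n = len(pixels)
    return [h / n for h in hist] if n else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Two thumbnails would then be treated as duplicates when the intersection exceeds a preset threshold. A histogram ignores where colors appear in the image, so it tolerates small crops or shifts but can confuse different scenes with similar palettes.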

In practice, key frames can also be extracted from each of the two video files, and the visual feature similarity of the two video files can be calculated by comparing the visual features of their key frames, so that whether the two video files are duplicate videos can be determined from the result. For example, let video file A and video file B be two video files with adjacent publication times. First, M key frames are extracted from video file A and N key frames from video file B, where M and N may be the same or different. The visual features of each key frame are then extracted, and each key frame is represented by a high-dimensional feature vector. The degree of matching between the key frames' feature vectors is calculated under temporal and spatial constraints to determine whether video file A and video file B are duplicate videos. For instance, taking the key frames of video file A as the reference, the key frames of video file A are compared in order with the key frames of video file B. If the visual feature similarity between the i-th key frame of video file A and the j-th key frame of video file B is greater than a preset threshold, (i, j) is taken as an initial matching point. Starting from the initial matching point, the visual feature similarity of successive key frame pairs of video file A and video file B is calculated in order until one of the sequences ends or a pair fails to match, which yields the matching picture pairs starting from (i, j). These steps are repeated to find the globally largest set of matching picture pairs. If the number of pairs in this set, as a proportion of the total number of pictures, is higher than another preset matching threshold, the two video files can be determined to be duplicate videos.
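The key frame matching procedure above can be sketched as a search for the longest run of consecutive matching key frame pairs. Several details are assumptions rather than part of the text: cosine similarity is used as the per-frame similarity, the threshold values are arbitrary, and the "total number of pictures" is taken to be the key frame count of the shorter video.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def longest_matching_run(frames_a: Sequence[Vector], frames_b: Sequence[Vector],
                         frame_threshold: float = 0.9) -> int:
    """Length of the longest in-order run of matching key frame pairs over all starts (i, j)."""
    best = 0
    for i in range(len(frames_a)):
        for j in range(len(frames_b)):
            length = 0
            # Extend the run from (i, j) until a sequence ends or a pair fails to match.
            while (i + length < len(frames_a) and j + length < len(frames_b)
                   and cosine(frames_a[i + length], frames_b[j + length]) > frame_threshold):
                length += 1
            best = max(best, length)
    return best


def is_duplicate_by_keyframes(frames_a: Sequence[Vector], frames_b: Sequence[Vector],
                              frame_threshold: float = 0.9,
                              match_ratio_threshold: float = 0.6) -> bool:
    """Duplicate if the longest run covers a large enough share of the shorter video."""
    if not frames_a or not frames_b:
        return False
    run = longest_matching_run(frames_a, frames_b, frame_threshold)
    # Assumption: the ratio is taken against the shorter key frame sequence.
    return run / min(len(frames_a), len(frames_b)) > match_ratio_threshold
```

The exhaustive search is cubic in the number of key frames in the worst case, which is acceptable only because key frame counts per video are small in this sketch.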

Of course, those skilled in the art may also use other video frames of the video files to calculate the visual feature similarity according to existing techniques; the embodiments of the present invention do not elaborate on this.

Third: calculate both the text similarity between the descriptions of the two video files with adjacent publication times and the visual feature similarity between the two video files, and determine from the calculated text similarity and visual feature similarity whether the two video files are duplicate videos.

In practice, considering both the text similarity between the descriptions of the two video files and their visual feature similarity makes it possible to determine more accurately whether the two video files are duplicate videos, which improves deduplication precision. Specifically, a preset threshold can be set for the text similarity and for the visual feature similarity respectively, and the two video files are determined to be duplicate videos when both similarities are higher than their corresponding thresholds. Alternatively, weights can be assigned to the text similarity and the visual feature similarity respectively, and the two video files are determined to be duplicate videos when the weighted sum of the two is higher than a preset threshold.
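The two combination rules described above can be written down directly. The threshold and weight values below are placeholders chosen for illustration; the text leaves them to be set according to actual conditions.

```python
def is_duplicate_both(text_sim: float, visual_sim: float,
                      text_threshold: float = 0.8,
                      visual_threshold: float = 0.8) -> bool:
    """Rule 1: each similarity must be higher than its own preset threshold."""
    return text_sim > text_threshold and visual_sim > visual_threshold


def is_duplicate_weighted(text_sim: float, visual_sim: float,
                          text_weight: float = 0.5, visual_weight: float = 0.5,
                          threshold: float = 0.8) -> bool:
    """Rule 2: the weighted sum of the two similarities must be higher than one threshold."""
    return text_weight * text_sim + visual_weight * visual_sim > threshold
```

The first rule is stricter, since one low similarity vetoes the match; the weighted rule lets a very high similarity on one side compensate for a moderate one on the other.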

It should be noted that any of the above three methods for judging whether two video files are duplicate videos can be chosen according to actual conditions, and the thresholds involved can likewise be set according to actual conditions.

S223: Determine the video files corresponding to the hot topic according to the deduplication result.

The video files corresponding to the hot topic can be determined from the deduplication result of step S222.

In practice, the number of video files determined for a hot topic may still be large after deduplication, which indicates that the hot topic is relatively broad and may extend over a period of time. Presenting all of them as the hot topic's related videos would inconvenience users, because they would have to search through many video files to find the ones they want to watch.

Based on this, in one embodiment of the present invention, the following steps may also be included after step S223:

First step: judge whether the number of video files determined for the hot topic is greater than a preset second threshold, and if so, perform the second step.

In practice, a maximum number of related videos, that is, the second threshold, can be preset for each hot topic. If the number of video files determined for the hot topic in step S223 is greater than the preset second threshold, the second step is performed. The second threshold can be set and adjusted according to actual conditions.

Second step: perform hierarchical clustering on the determined video files corresponding to the hot topic, according to the publication time intervals between videos with adjacent publication times, until the number of categories obtained is no greater than the preset second threshold.

It can be understood that the video files in the deduplicated candidate video set may have the same or different publication times. If a hot topic extends over a period of time, the publication time intervals between its corresponding video files may be large. Hierarchical clustering can be performed on the determined video files corresponding to the hot topic according to the publication time intervals between videos with adjacent publication times. Each category can be regarded as an important stage in the evolution of the hot topic.

For example, suppose that for a certain hot topic the number of video files in its candidate video set still exceeds the preset second threshold after deduplication. These video files can be arranged in order of publication time as video file 1, video file 2, video file 3, video file 4, and video file 5, where the publication time interval between video file 1 and video file 2 is 1 hour, that between video file 2 and video file 3 is 3 days, that between video file 3 and video file 4 is 5 days, and that between video file 4 and video file 5 is 2 hours.

Suppose the clustering condition is "no more than 1 day", that is, video files whose publication time interval is no more than 1 day are grouped into one category. The result is:

{video file 1, video file 2}, {video file 3}, {video file 4, video file 5}, three categories in total.

If the number of categories obtained, 3, is still greater than the preset second threshold, the clustering condition can be revised to "no more than 3 days", that is, video files whose publication time interval is no more than 3 days are grouped into one category. The result is:

{video file 1, video file 2, video file 3}, {video file 4, video file 5}, two categories in total.

It should be noted that, after hierarchical clustering of these video files, the number of categories finally obtained must be no greater than the preset second threshold.
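The worked example above can be reproduced with a short sketch that groups sorted publication times by a maximum gap and relaxes the gap until the number of categories is within the limit. The function names and the `gap_schedule` parameter (the sequence of progressively looser clustering conditions) are hypothetical; the text does not say how the revised condition is chosen.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def cluster_by_gap(times: Sequence[datetime], max_gap: timedelta) -> List[List[datetime]]:
    """Group sorted publication times; a new category starts when the gap exceeds max_gap."""
    clusters: List[List[datetime]] = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


def cluster_until_within_limit(times: Sequence[datetime], max_clusters: int,
                               gap_schedule: Sequence[timedelta]) -> List[List[datetime]]:
    """Try each gap in turn and stop once the category count is no greater than max_clusters."""
    clusters = [[t] for t in sorted(times)]
    for max_gap in gap_schedule:
        clusters = cluster_by_gap(times, max_gap)
        if len(clusters) <= max_clusters:
            break
    return clusters  # may still exceed the limit if the schedule is exhausted


# The five video files from the example: gaps of 1 hour, 3 days, 5 days, 2 hours.
t1 = datetime(2015, 1, 1)
t2 = t1 + timedelta(hours=1)
t3 = t2 + timedelta(days=3)
t4 = t3 + timedelta(days=5)
t5 = t4 + timedelta(hours=2)
example = [t1, t2, t3, t4, t5]
schedule = [timedelta(days=1), timedelta(days=3)]
result = cluster_until_within_limit(example, max_clusters=2, gap_schedule=schedule)
```

With a second threshold of 2, the 1-day condition yields three categories, so the 3-day condition is applied and `result` holds the two categories from the example.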

Third step: determine the representative video of each category according to the quality of the videos in that category.

In practice, after hierarchical clustering of the video files in the deduplicated candidate video set, each category may contain multiple video files. It can be understood that video files differ in quality, their uploaders may have different identities, viewers may like them to different degrees, and so on. These factors can be considered together to select one representative video from the video files in each category.
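One simple way to weigh these factors together is a weighted score per video, with the highest-scoring video chosen as the representative. This is only a sketch: the field names (`quality`, `uploader_verified`, `like_ratio`), the assumption that each is already normalized to the range 0 to 1, and the weights are all hypothetical, since the text does not specify how the factors are measured or combined.

```python
from typing import Dict, List, Optional

# Hypothetical weights for hypothetical, pre-normalized factors.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "quality": 0.5,            # e.g. a normalized video quality measure
    "uploader_verified": 0.2,  # True/False flag for the uploader's identity
    "like_ratio": 0.3,         # share of viewers who liked the video
}


def score(video: Dict[str, object], weights: Dict[str, float]) -> float:
    """Weighted sum of the factors; missing factors count as 0."""
    return sum(weight * float(video.get(field, 0.0)) for field, weight in weights.items())


def pick_representative(category: List[Dict[str, object]],
                        weights: Optional[Dict[str, float]] = None) -> Dict[str, object]:
    """Return the highest-scoring video of a non-empty category."""
    if not category:
        raise ValueError("category must contain at least one video")
    chosen_weights = weights or DEFAULT_WEIGHTS
    return max(category, key=lambda video: score(video, chosen_weights))
```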

Fourth step: determine the representative video of each category as a related video of the hot topic.

The representative video of each category is determined as a related video of the hot topic. When there is a display requirement, the related videos of the hot topic are presented to the user.

With the technical solution provided by the embodiments of the present invention, for each hot topic, the video files corresponding to the hot topic are traced through the text similarity between the hot topic's keywords and the descriptions of video files, and the process is completed automatically by the server. Even if a hot topic consists of many staged events, the server can execute this technical solution repeatedly to update the video files corresponding to the hot topic periodically, which saves manual operation cost.

Corresponding to the method embodiment shown in Fig. 1, an embodiment of the present invention also provides a hot topic keyword determination device, applied to a server. As shown in Fig. 4, the device may comprise the following modules:

Basic word set obtaining module 310, configured to perform word segmentation on the text information of each piece of hot data obtained from a set website, to obtain the set of basic words of each piece of hot data;

Named-entity basic word determination module 320, configured to, for each piece of hot data, determine, from the set of basic words of that hot data, the basic words whose attribute is named entity that are used to establish the text model of that hot data, according to the frequency with which basic words whose attribute is named entity occur in the text information of that hot data;

Text model establishment module 330, configured to establish the text model of each piece of hot data according to the determined basic words whose attribute is named entity;

Hot data clustering module 340, configured to cluster all the obtained hot data according to the text similarity between the text models of every two pieces of hot data, to obtain at least one cluster;

Hot topic keyword determination module 350, configured to, for each cluster, determine the keywords of the cluster according to the frequency with which the basic words whose attribute is named entity in the cluster occur in the text information of the hot data contained in the cluster, and to determine the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.

In one embodiment of the present invention, the basic word set obtaining module 310 may also be configured to:

In one embodiment of the present invention, the device may also comprise the following modules:

With the device provided by the embodiments of the present invention, the server obtains hot data from a set website and processes the hot data by word segmentation, clustering, and so on, so that hot topics and their keywords are determined in time, which effectively avoids the delay caused by determining hot topics through manual operation.

Corresponding to the method embodiment shown in Fig. 2, an embodiment of the present invention also provides a hot topic tracing device, applied to a server. As shown in Fig. 5, the device may comprise the following modules:

Text similarity determination module 410, configured to, for each hot topic, determine the text similarity between the keywords of the hot topic and the description of each video file;

Video file tracing module 420, configured to trace the video files corresponding to the hot topic according to the determined text similarity.

In a specific embodiment of the present invention, the video file tracing module 420 may comprise the following submodules:

In a specific embodiment of the present invention, the device may also comprise the following submodules:

In a specific embodiment of the present invention, the deduplication submodule may comprise the following units:

Or,

Or;

With the device provided by the embodiments of the present invention, for each hot topic, the video files corresponding to the hot topic are traced through the text similarity between the hot topic's keywords and the descriptions of video files, and the process is completed automatically by the server. Even if a hot topic consists of many staged events, the server can execute this technical solution repeatedly to update the video files corresponding to the hot topic periodically, which saves manual operation cost.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that comprises the element.

The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.

Those of ordinary skill in the art will understand that all or part of the steps in the above method embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A hot topic keyword determination method, characterized in that it is applied to a server, the method comprising:

2. The method according to claim 1, characterized in that, after obtaining the set of basic words of each piece of hot data and before determining the basic words whose attribute is named entity that are used to establish the text model of that hot data, the method further comprises:

3. The method according to claim 1 or 2, characterized in that it further comprises:

4. A hot topic tracing method based on the hot topic keyword determination method according to any one of claims 1 to 3, characterized in that it is applied to a server, the method comprising:

5. The method according to claim 4, characterized in that tracing the video files corresponding to the hot topic according to the determined text similarity comprises:

6. The method according to claim 5, characterized in that, after determining the video files corresponding to the hot topic according to the deduplication result, the method further comprises:

7. The method according to claim 5 or 6, characterized in that performing video deduplication on the candidate video set of the hot topic comprises:

8. The method according to claim 7, characterized in that judging whether two video files with adjacent publication times are duplicate videos comprises:

Or,

Or;

9. A hot topic keyword determination device, characterized in that it is applied to a server, the device comprising:

10. The device according to claim 9, characterized in that the basic word set obtaining module is further configured to:

11. The device according to claim 9 or 10, characterized in that it further comprises:

12. A hot topic tracing device based on the hot topic keyword determination device according to any one of claim 9 or 11, characterized in that it is applied to a server, the device comprising:

13. The device according to claim 12, characterized in that the video file tracing module comprises:

14. The device according to claim 13, characterized in that it further comprises:

15. The device according to claim 13 or 14, characterized in that the deduplication submodule comprises:

16. The device according to claim 15, characterized in that the duplicate video judging unit is specifically configured to:

Or,

Or;