CN104915447B

CN104915447B - Method and device for hot topic tracking and keyword determination

Info

Publication number: CN104915447B
Application number: CN201510372462.0A
Authority: CN
Inventors: 乔奇
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2015-06-30
Filing date: 2015-06-30
Publication date: 2018-04-20
Anticipated expiration: 2035-06-30
Also published as: CN104915447A

Abstract

Embodiments of the invention disclose a hot topic tracking method, a hot topic keyword determination method, and corresponding devices. The hot topic keyword determination method obtains hotspot data from preset websites and processes that data by word segmentation, clustering, and similar steps, so that hot topic keywords are determined in a timely manner. The hot topic tracking method, for each hot topic, tracks the video files corresponding to that topic according to the text similarity between the topic's keywords and the description information of video files. With the technical solution provided by the embodiments, both keyword determination and topic tracking are completed automatically by a server, which effectively avoids the lag that arises when hot topics are determined by manual operation. Even when a hot topic consists of many staged events, the server can execute the solution repeatedly to update the video files corresponding to that topic at regular intervals, saving manual operation cost.

Description

Method and device for hot topic tracking and keyword determination

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and device for hot topic tracking and keyword determination.

Background art

A hot topic is an issue that the public cares about most within a certain period and a certain scope. On video websites, UGC (User Generated Content) videos are mostly published by users as soon as a hot event or topic occurs. They are timely and attract a high degree of user attention. However, because UGC videos on a video website are massive in number, disorganized, and highly repetitive, it is inconvenient for users to obtain important information promptly. Many video websites therefore discover and track hot topics for this kind of video and aggregate the videos of the same hot topic so that users can view them conveniently.

At present, video websites rely on operations staff to discover and track hot topics. The staff determine the current hot topics by analyzing description information such as the titles and synopses of video files, and then determine the videos corresponding to each hot topic.

Hot topics determined by operations staff analyzing the description information of video files often lag behind events. Moreover, some hot topics consist of many staged events and span a long period, so the staff must follow and analyze them continuously, and the manual operation cost is high.

Summary of the invention

To solve the above problems, embodiments of the invention disclose a method and device for hot topic tracking and keyword determination. The technical solution is as follows:

A hot topic keyword determination method, applied to a server, the method comprising:

segmenting the text information of each hotspot data item obtained from preset websites into words, to obtain a set of base words for each hotspot data item;

for each hotspot data item, determining, according to the frequency with which the base words whose attribute is named entity in the item's base word set occur in the item's text information, the named-entity base words used to build the text model of that item;

building the text model of each hotspot data item from the determined named-entity base words;

clustering all obtained hotspot data items according to the text similarity between the text models of every two hotspot data items, to obtain at least one cluster;

for each cluster, determining the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hotspot data items contained in the cluster, and taking the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.

In a specific embodiment of the invention, after the set of base words of each hotspot data item is obtained and before the named-entity base words used to build the text model of the item are determined, the method further comprises:

performing stop-word filtering on the base words in the base word set of each hotspot data item.

In a specific embodiment of the invention, the method further comprises:

for each cluster, finding at least one keyword with the highest frequency among the determined keywords of the cluster;

selecting one title from the titles of the hotspot data items in which the found keyword or keywords appear, as the hot topic corresponding to the cluster.

A hot topic tracking method based on the above hot topic keyword determination method, applied to a server, the method comprising:

for each hot topic, determining the text similarity between the keywords of the hot topic and the description information of each video file;

tracking the video files corresponding to the hot topic according to the determined text similarity.

In a specific embodiment of the invention, tracking the video files corresponding to the hot topic according to the determined text similarity comprises:

determining a candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold;

performing video deduplication within the candidate video set of the hot topic;

determining the video files corresponding to the hot topic according to the deduplication result.

In a specific embodiment of the invention, after the video files corresponding to the hot topic are determined according to the deduplication result, the method further comprises:

judging whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold;

if so, performing hierarchical clustering on the determined video files according to the publication time intervals between videos with adjacent publication times, until the number of resulting categories is not greater than the preset second threshold;

determining a representative video for each category according to the quality of the videos in that category;

taking the representative video of each category as an associated video of the hot topic.
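The hierarchical clustering by publication time interval described above can be sketched roughly as follows. This is only an illustrative reading, not the patented implementation: the merge rule (repeatedly merge the two time-adjacent groups separated by the smallest gap), the field names `published` and `quality`, and the numeric quality score are all assumptions introduced for the example.

```python
def cluster_by_time(videos, second_threshold):
    """Merge time-adjacent groups until at most `second_threshold` groups remain."""
    ordered = sorted(videos, key=lambda v: v["published"])
    groups = [[v] for v in ordered]  # start with one group per video
    while len(groups) > max(second_threshold, 1):
        # Gap between the last video of one group and the first of the next.
        gaps = [groups[k + 1][0]["published"] - groups[k][-1]["published"]
                for k in range(len(groups) - 1)]
        k = gaps.index(min(gaps))  # closest pair of adjacent groups
        groups[k:k + 2] = [groups[k] + groups[k + 1]]
    return groups

def representatives(groups):
    """Pick the highest-quality video of each group (hypothetical quality score)."""
    return [max(group, key=lambda v: v["quality"]) for group in groups]

videos = [
    {"id": "a", "published": 1, "quality": 0.2},
    {"id": "b", "published": 2, "quality": 0.9},
    {"id": "c", "published": 10, "quality": 0.5},
    {"id": "d", "published": 11, "quality": 0.4},
    {"id": "e", "published": 30, "quality": 0.7},
]
groups = cluster_by_time(videos, second_threshold=3)
associated = [v["id"] for v in representatives(groups)]
```

With these sample values, the videos published at times 1 and 2 merge first, then those at 10 and 11, leaving three categories whose representatives are `b`, `c`, and `e`.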

In a specific embodiment of the invention, performing video deduplication within the candidate video set of the hot topic comprises:

sorting the videos in the candidate video set from earliest to latest according to the publication times of the video files in the set;

judging in turn whether two video files with adjacent publication times are duplicate videos, and if so, retaining in the candidate video set the video file published earlier and deleting the video file published later.

In a specific embodiment of the invention, judging whether two video files with adjacent publication times are duplicate videos comprises:

calculating the text similarity of the description information of the two video files, and determining from the calculated text similarity whether the two video files are duplicate videos;

or,

calculating the visual feature similarity of the two video files, and determining from the calculated visual feature similarity whether the two video files are duplicate videos;

or,

calculating both the text similarity of the description information of the two video files and their visual feature similarity, and determining from the calculated text similarity and visual feature similarity whether the two video files are duplicate videos.

A hot topic keyword determination device, applied to a server, the device comprising:

a base word set obtaining module, configured to segment the text information of each hotspot data item obtained from preset websites into words, to obtain a set of base words for each hotspot data item;

a named-entity base word determining module, configured to determine, for each hotspot data item, according to the frequency with which the base words whose attribute is named entity in the item's base word set occur in the item's text information, the named-entity base words used to build the text model of that item;

a text model building module, configured to build the text model of each hotspot data item from the determined named-entity base words;

a hotspot data clustering module, configured to cluster all obtained hotspot data items according to the text similarity between the text models of every two hotspot data items, to obtain at least one cluster;

a hot topic keyword determining module, configured to determine, for each cluster, the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hotspot data items contained in the cluster, and to take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.

In a specific embodiment of the invention, the base word set obtaining module is further configured to:

In a specific embodiment of the invention, the device further comprises:

a hot topic title determining module, configured to find, for each cluster, at least one keyword with the highest frequency among the determined keywords of the cluster, and to select one title from the titles of the hotspot data items in which the found keyword or keywords appear, as the hot topic corresponding to the cluster.

A hot topic tracking device based on the above hot topic keyword determination device, applied to a server, the device comprising:

a text similarity determining module, configured to determine, for each hot topic, the text similarity between the keywords of the hot topic and the description information of each video file;

a video file tracking module, configured to track the video files corresponding to the hot topic according to the determined text similarity.

In a specific embodiment of the invention, the video file tracking module comprises:

a candidate video set determining submodule, configured to determine a candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold;

a deduplication submodule, configured to perform video deduplication within the candidate video set of the hot topic;

a video file determining submodule, configured to determine the video files corresponding to the hot topic according to the deduplication result.

In a specific embodiment of the invention, the device further comprises:

a judging submodule, configured to judge whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold, and if so, to trigger a clustering submodule;

the clustering submodule, configured to perform hierarchical clustering on the determined video files corresponding to the hot topic according to the publication time intervals between videos with adjacent publication times, until the number of resulting categories is not greater than the preset second threshold;

a representative video determining submodule, configured to determine a representative video for each category according to the quality of the videos in that category;

an associated video determining submodule, configured to take the representative video of each category as an associated video of the hot topic.

In a specific embodiment of the invention, the deduplication submodule comprises:

a video sorting unit, configured to sort the videos in the candidate video set from earliest to latest according to the publication times of the video files in the candidate video set of the hot topic;

a duplicate video judging unit, configured to judge in turn whether two video files with adjacent publication times are duplicate videos, and if so, to trigger a deduplication unit;

the deduplication unit, configured to retain in the candidate video set of the hot topic the video file published earlier and to delete the video file published later.

In a specific embodiment of the invention, the duplicate video judging unit is specifically configured to:

or,

or,

With the technical solution provided by the embodiments of the invention, hotspot data is obtained from preset websites and processed by word segmentation, clustering, and similar steps, so that hot topics and their keywords are determined in a timely manner. For each hot topic, the video files corresponding to the topic are tracked according to the text similarity between the topic's keywords and the description information of video files. Keyword determination and topic tracking are thus completed automatically by the server, which effectively avoids the lag caused by determining hot topics through manual operation. Even when a hot topic consists of many staged events, the server can execute the solution repeatedly to update the video files corresponding to the topic at regular intervals, saving manual operation cost.

Brief description of the drawings

To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of an implementation of the hot topic keyword determination method in an embodiment of the invention;

Fig. 2 is a flowchart of an implementation of the hot topic tracking method in an embodiment of the invention;

Fig. 3 is a flowchart of another implementation of the hot topic tracking method in an embodiment of the invention;

Fig. 4 is a schematic structural diagram of the hot topic keyword determination device in an embodiment of the invention;

Fig. 5 is a schematic structural diagram of the hot topic tracking device in an embodiment of the invention.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solutions in the embodiments of the invention, those solutions are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the invention.

Fig. 1 shows a flowchart of an implementation of the hot topic keyword determination method provided by an embodiment of the invention. The method is applied to a server and may include the following steps:

S110: Segment the text information of each hotspot data item obtained from preset websites into words, to obtain a set of base words for each hotspot data item.

In the Internet era, information spreads quickly, and within a given period people's attention tends to be concentrated. Websites such as microblog topic lists and search trend lists gather the hotspot data that people are following, and reflect in concentrated form people's views on a topic or event during the current period.

In practice, the same topic may be expressed in different wording while the underlying subject is the same. Hotspot data can be obtained from the preset websites (such as the microblog topic lists and search trend lists mentioned above), periodically or aperiodically, by a web crawler or another information collection method. Each obtained hotspot data item may include text information such as its title, summary, detailed description, and/or linked content.

It will be appreciated that different websites may represent hotspot data differently, so the obtained hotspot data can first be converted into a unified representation according to a preset format, for example: title, description, time, related text information. The text information of each hotspot data item is relatively large, so it is segmented into words to obtain the set of base words for the item.

In a specific embodiment of the invention, stop-word filtering may also be performed on the base words in the base word set of each hotspot data item. In practice, a stop-word dictionary can be set up in advance. The dictionary may contain function words such as the Chinese particles "的", "地", and "得", and may also contain words chosen by operations staff, such as "high definition" and "low definition".
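Step S110 and the optional stop-word filtering can be sketched as below. This is a minimal illustration only: the regular-expression tokenizer stands in for a real word segmenter (Chinese text would need a dedicated one), and the contents of `STOP_WORDS` are hypothetical.

```python
import re

# Hypothetical stop-word dictionary; in practice it would be maintained by operations staff.
STOP_WORDS = {"the", "a", "of", "hd", "sd"}

def segment(text):
    """Split text into lowercase word tokens (a stand-in for a real word segmenter)."""
    return re.findall(r"\w+", text.lower())

def base_words(text, stop_words=STOP_WORDS):
    """Return the base words of one hotspot data item after stop-word filtering."""
    return [word for word in segment(text) if word not in stop_words]

words = base_words("The launch of the rocket HD: rocket launch delayed")
```

Repeated words are kept in the result on purpose, because later steps rely on how often each base word occurs.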

S120: For each hotspot data item, determine, according to the frequency with which the base words whose attribute is named entity in the item's base word set occur in the item's text information, the named-entity base words used to build the text model of that item.

In the embodiments of the invention, named entity recognition can be performed to determine the base words whose attribute is named entity. Such a base word may be a base word that carries semantic meaning, such as a person name, place name, or building name, or may be a base word with verb or noun part of speech. The base word set obtained for each hotspot data item in step S110 may include the frequency with which each base word occurs in the item's text information, so the named-entity base words used to build the item's text model can be determined from those frequencies. Alternatively, the item's text information can be scanned to obtain the frequency with which each named-entity base word occurs, and the named-entity base words used to build the text model can be determined from that frequency information. Methods for recognizing base words with the named-entity attribute are prior art and are not described further here.

In practice, the named-entity base words that occur most frequently in an item's text information can be selected from the item's base word set to characterize the item. For each hotspot data item, the named-entity base words can be sorted by their frequency of occurrence in the item's text information, and either the top N or the top x% can be taken as the named-entity base words used to build the item's text model. N and x can be set and adjusted according to actual conditions; the embodiments of the invention place no limit on them.
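The top-N selection in step S120 might look like the following sketch. Named entity recognition is assumed to have been done elsewhere; the `entities` set, the function name, and the sample words are illustrative assumptions.

```python
from collections import Counter

def select_model_words(base_words, named_entities, top_n=3):
    """Keep the top_n most frequent base words that are tagged as named entities."""
    counts = Counter(word for word in base_words if word in named_entities)
    return dict(counts.most_common(top_n))

words = ["rocket", "launch", "rocket", "delayed", "florida", "rocket", "florida", "nasa"]
entities = {"rocket", "florida", "nasa", "launch"}  # hypothetical NER output
model_words = select_model_words(words, entities, top_n=2)
```

A top-x% variant would compute `top_n` from the number of distinct named-entity base words instead of taking it as a fixed parameter.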

S130: Build the text model of each hotspot data item from the determined named-entity base words.

Step S120 determines the named-entity base words used to build the text model of each hotspot data item, and the text model of each item can be built from them. Specifically, the text model can be expressed as a vector space model (VSM) that records the frequency with which each determined named-entity base word occurs in the text information of the corresponding item. The vector space model is prior art and is not described further here.
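As a rough illustration of the vector space representation, two text models stored as word-to-frequency mappings can be projected onto a shared vocabulary. The mapping layout and the function name are assumptions made for this sketch.

```python
def to_vectors(model_a, model_b):
    """Project two word->frequency text models onto a shared, sorted vocabulary."""
    vocabulary = sorted(set(model_a) | set(model_b))
    vector_a = [model_a.get(term, 0) for term in vocabulary]
    vector_b = [model_b.get(term, 0) for term in vocabulary]
    return vocabulary, vector_a, vector_b

vocabulary, vector_a, vector_b = to_vectors(
    {"rocket": 3, "florida": 2},
    {"rocket": 1, "nasa": 4},
)
```

Each position in a vector corresponds to one named-entity base word, and a word absent from an item contributes a frequency of zero.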

S140: Cluster all obtained hotspot data items according to the text similarity between the text models of every two hotspot data items, to obtain at least one cluster.

With the text model of each hotspot data item built in step S130, the text similarity between the text models of every two items can be computed with prior-art measures such as cosine similarity or the Jaccard distance, which are not described further here.
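For reference, the two measures named above can be written as follows for text models stored as word-to-frequency mappings (that storage format is an assumption of this sketch). The Jaccard function returns a similarity; the corresponding distance is one minus that value.

```python
import math

def cosine_similarity(model_a, model_b):
    """Cosine of the angle between two word->frequency vectors."""
    dot = sum(freq * model_b.get(term, 0) for term, freq in model_a.items())
    norm_a = math.sqrt(sum(freq * freq for freq in model_a.values()))
    norm_b = math.sqrt(sum(freq * freq for freq in model_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty model is treated as similar to nothing
    return dot / (norm_a * norm_b)

def jaccard_similarity(model_a, model_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    terms_a, terms_b = set(model_a), set(model_b)
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0

a = {"rocket": 3, "florida": 2}
b = {"rocket": 1, "nasa": 4}
same = cosine_similarity(a, a)
overlap = jaccard_similarity(a, b)
```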

All obtained hotspot data items are clustered according to the computed text similarity between the text models of every two items, yielding at least one cluster. For example, for a given hotspot data item, the items whose text-model similarity to it is higher than a preset threshold are placed in the same cluster. Computing text similarity on the text models requires less computation and is more efficient than computing it on the full text information of the hotspot data.

For ease of understanding, consider a simple example. There are four hotspot data items whose text models are A, B, C, and D, and the text similarities between them are shown in Table 1:

Table 1

Text model	A	B	C	D
A	1	0.8	0.3	0.5
B	0.8	1	0.3	0.2
C	0.3	0.3	1	0.5
D	0.5	0.2	0.5	1

Assume that two hotspot data items are placed in the same cluster if the text similarity of their text models is higher than 0.5. Under this condition, the items corresponding to A and B form one cluster (similarity 0.8), while the item corresponding to C and the item corresponding to D each form a cluster of their own, because no other pair has a similarity strictly higher than 0.5.
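The worked example can be reproduced with a short sketch. The text does not fix a particular clustering algorithm, so the grouping rule used here (items linked by any similarity strictly above the threshold end up in the same cluster, implemented with a small union-find) is an assumption chosen for illustration.

```python
def cluster(items, similarity, threshold):
    """Group items that are linked by a pairwise similarity strictly above threshold."""
    parent = {item: item for item in items}

    def find(item):
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for index, first in enumerate(items):
        for second in items[index + 1:]:
            if similarity[frozenset((first, second))] > threshold:
                parent[find(second)] = find(first)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())

# Pairwise similarities taken from Table 1.
TABLE_1 = {
    frozenset("AB"): 0.8, frozenset("AC"): 0.3, frozenset("AD"): 0.5,
    frozenset("BC"): 0.3, frozenset("BD"): 0.2, frozenset("CD"): 0.5,
}
clusters = cluster(["A", "B", "C", "D"], TABLE_1, threshold=0.5)
```

Because the comparison is strict, the pairs A-D and C-D (both exactly 0.5) are not linked, which matches the three clusters described above.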

S150: For each cluster, determine the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hotspot data items contained in the cluster, and take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.

Step S140 clusters all obtained hotspot data items into at least one cluster, and the hotspot data in each cluster can characterize one hot topic. For each cluster, the base words that occur most frequently in the text information of the cluster's hotspot data items (those meeting a preset threshold condition) can serve as the keywords of the cluster, and those keywords are taken as the keywords of the hot topic corresponding to the cluster.

In an embodiment of the invention, for each cluster, at least one keyword with the highest frequency can be found among the determined keywords of the cluster, and one title can be selected from the titles of the hotspot data items in which the found keyword or keywords appear, as the hot topic corresponding to the cluster.
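Step S150 and the title selection can be sketched as follows. The record layout (`title`, `entity_words`), the minimum-count threshold condition, and the rule of taking the first matching title are assumptions; the text leaves both the threshold condition and the title selection rule open.

```python
from collections import Counter

def cluster_keywords(cluster_items, min_count=2):
    """Named-entity base words occurring at least min_count times across the cluster."""
    counts = Counter()
    for item in cluster_items:
        counts.update(item["entity_words"])
    return {word: count for word, count in counts.items() if count >= min_count}

def topic_title(cluster_items, keywords):
    """Title of the first item containing the most frequent keyword (assumed rule)."""
    if not keywords:
        return None
    top_keyword = max(keywords, key=keywords.get)
    for item in cluster_items:
        if top_keyword in item["entity_words"]:
            return item["title"]
    return None

items = [
    {"title": "Rocket launch delayed", "entity_words": ["rocket", "launch", "rocket"]},
    {"title": "Florida rocket scrubbed", "entity_words": ["florida", "rocket", "florida"]},
]
keywords = cluster_keywords(items)
title = topic_title(items, keywords)
```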

With the technical solution provided by the embodiments of the invention, the server obtains hotspot data from preset websites and processes it by word segmentation, clustering, and similar steps, so that hot topics and their keywords are determined in a timely manner. This effectively avoids the lag caused by determining hot topics through manual operation.

Based on the above hot topic keyword determination method, an embodiment of the invention further provides a hot topic tracking method, shown in Fig. 2. The method is applied to a server and may include the following steps:

S210: For each hot topic, determine the text similarity between the keywords of the hot topic and the description information of each video file.

The above hot topic keyword determination method yields the current hot topics and the keywords of each hot topic. The description information of a video file on a video website may be the title or the synopsis of the video file. For a given hot topic, the higher the text similarity between the topic's keywords and the description information of a video file, the more closely the video file is related to the topic. Those skilled in the art can compute the text similarity between the keywords of each hot topic and the description information of each video file using prior-art techniques, which are not described further here.

S220: Track the video files corresponding to the hot topic according to the determined text similarity.

Step S210 determines the text similarity between the keywords of each hot topic and the description information of each video file. According to that similarity, the video files corresponding to each hot topic can be tracked.

In a specific embodiment of the invention, step S220 may include the following steps:

S221: Determine the candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold.

It will be appreciated that the higher the text similarity between the keywords of a hot topic and the description information of a video file, the more closely the video file is related to the topic. In practice, if the determined text similarity is greater than the preset first threshold, the corresponding video file can be added to the candidate video set of the hot topic.
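Step S221 amounts to a threshold filter, sketched below. The similarity measure (Jaccard overlap between the keyword set and the description tokens), the field names, and the threshold value are illustrative assumptions; any prior-art text similarity could be substituted.

```python
import re

def keyword_similarity(keywords, description):
    """Jaccard overlap between topic keywords and description tokens (assumed measure)."""
    keyword_set = {keyword.lower() for keyword in keywords}
    description_tokens = set(re.findall(r"\w+", description.lower()))
    union = keyword_set | description_tokens
    return len(keyword_set & description_tokens) / len(union) if union else 0.0

def candidate_videos(keywords, videos, first_threshold):
    """Videos whose description similarity to the topic exceeds the first threshold."""
    return [video for video in videos
            if keyword_similarity(keywords, video["description"]) > first_threshold]

videos = [
    {"id": "v1", "description": "Rocket launch live"},
    {"id": "v2", "description": "Cooking pasta at home"},
    {"id": "v3", "description": "Rocket engine test"},
]
candidates = candidate_videos({"rocket", "launch"}, videos, first_threshold=0.3)
candidate_ids = [video["id"] for video in candidates]
```

Here `v1` scores 2/3 and is kept, while `v3` scores 1/4 and `v2` scores 0, so both fall below the threshold.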

S222: Perform video deduplication within the candidate video set of the hot topic.

It will be appreciated that video files on a video website, especially UGC (User Generated Content) video files, are mainly uploaded by users, and different users may describe videos with the same content differently. A video website therefore contains many duplicate or similar video files. In practice, deduplication can be performed on the videos in the candidate video set of each hot topic.

In a specific embodiment of the invention, step S222 may include the following steps:

Step 1: Sort the videos in the candidate video set from earliest to latest according to the publication times of the video files in the candidate video set of the hot topic.

Step 2: Judge in turn whether two video files with adjacent publication times are duplicate videos. If so, retain in the candidate video set the video file published earlier and delete the video file published later.

For convenience, the two steps are described together.

In the embodiments of the invention, the time at which a user uploads a video file is the publication time of that file. To protect the rights of the user who uploaded a video first, the earlier-published file is retained and the later-published file is deleted when two videos are duplicates.

In practice, the videos in the candidate video set can be sorted from earliest to latest by publication time, and deduplication is performed by judging whether two video files with adjacent publication times are duplicate videos.
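The two deduplication steps can be sketched as follows. The duplicate test is passed in as a function because the text allows several ways of judging duplicates; comparing each video with the most recently retained one is one possible reading of "adjacent" after a deletion, and the field names are assumptions.

```python
def deduplicate(videos, is_duplicate):
    """Sort by publication time and drop any video that duplicates the previous kept one."""
    ordered = sorted(videos, key=lambda video: video["published"])
    kept = []
    for video in ordered:
        if kept and is_duplicate(kept[-1], video):
            continue  # the earlier-published copy is already retained
        kept.append(video)
    return kept

def same_description(first, second):
    """Toy duplicate test: identical descriptions, ignoring case."""
    return first["description"].lower() == second["description"].lower()

videos = [
    {"id": "b", "published": 2, "description": "Rocket launch"},
    {"id": "a", "published": 1, "description": "rocket launch"},
    {"id": "c", "published": 3, "description": "Crowd reaction"},
]
kept_ids = [video["id"] for video in deduplicate(videos, same_description)]
```

Video `b` is dropped because it duplicates the earlier video `a`, leaving `a` and `c`.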

Judge to issue moment adjacent two video files whether be palinopsia frequency determination methods can have it is following several：

The first：The text similarity of the description information of two adjacent video files of moment is issued in calculating, and according to meter Calculation obtains text similarity, determines whether the two video files are palinopsia frequency；

It can be counted as previously mentioned, for the text similarity of the description information of two video files by the prior art Calculate.If the text similarity of the description information of issue moment adjacent two video files is higher than predetermined threshold value, can be true The two fixed video files are palinopsia frequency.

Second: calculate the visual feature similarity of the two video files with adjacent publication times, and determine, according to the calculated visual feature similarity, whether the two video files are duplicate videos;

On a video website, to let users browse video files intuitively, video files are generally presented to users as thumbnails. Video files with the same content may have different description information, but the visual features of their thumbnails may be identical or similar; the visual features of a thumbnail include, for example, its color, texture, and shape. Therefore, the visual feature similarity of the thumbnails of two video files with adjacent publication times can be calculated, and whether the two video files are duplicate videos can be determined according to the calculated visual feature similarity. Specifically, two video files whose visual feature similarity is higher than a certain preset threshold may be determined to be duplicate videos. The visual feature similarity of the thumbnails of different video files can be calculated according to the prior art, for example by comparing color histograms.
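
The color histogram comparison mentioned above could look roughly like the following sketch. It assumes that a thumbnail has already been decoded into a list of (R, G, B) tuples with 8-bit channels, and it uses histogram intersection as the similarity; the function names and the bin count are illustrative assumptions, not details of the embodiments.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Bucket = Tuple[int, int, int]


def color_histogram(pixels: List[Pixel], bins_per_channel: int = 4) -> Dict[Bucket, float]:
    """Normalized histogram over coarse RGB buckets."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}


def histogram_intersection(hist_a: Dict[Bucket, float], hist_b: Dict[Bucket, float]) -> float:
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(share, hist_b.get(bucket, 0.0)) for bucket, share in hist_a.items())


red = [(255, 0, 0)] * 4
mixed = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2
print(histogram_intersection(color_histogram(red), color_histogram(mixed)))
```

Two thumbnails would then be treated as belonging to duplicate videos when this similarity is higher than the preset threshold.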

In practical applications, key frame pictures may also be extracted from the two video files respectively, and the visual feature similarity of the two video files calculated by comparing the visual features of their key frame pictures, so that whether the two video files are duplicate videos can be determined from the calculation result. For example, suppose video file A and video file B are two video files with adjacent publication times. First, M key frames are extracted from video file A and N key frames from video file B, where M and N may be the same or different. The visual features of each key frame are then extracted, each key frame is represented by a high-dimensional feature vector, and the degree of matching between key frame feature vectors is calculated under temporal and spatial constraints to determine whether video file A and video file B are duplicate videos. For instance, taking the key frames of video file A as the reference, the key frames of video file A are compared in order with the key frames of video file B. If the visual feature similarity between the i-th key frame of video file A and the j-th key frame of video file B is greater than a preset threshold, an initial matching point pair (i, j) is considered found. Starting from the initial matching point, the visual feature similarity of successive key frame pairs of video file A and video file B is calculated in order until the end is reached or a mismatch occurs, yielding the matching picture pairs that start at (i, j). These steps are repeated to find the global maximum set of matching picture pairs. If the ratio of the number of picture pairs in this global maximum set to the total number of pictures is higher than another preset matching threshold, the two video files can be determined to be duplicate videos.
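
The key frame matching procedure can be illustrated with the sketch below. It rests on several assumptions that the embodiments leave open: key frame features are given as plain numeric vectors, cosine similarity is used as the visual feature similarity, and the "total number of pictures" is taken to be the larger of M and N. All function names and threshold values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def longest_matching_run(frames_a: List[Vector], frames_b: List[Vector],
                         frame_threshold: float = 0.9) -> int:
    """Length of the longest run of consecutive matching key frame pairs."""
    best = 0
    for i in range(len(frames_a)):
        for j in range(len(frames_b)):
            length = 0
            # Extend from the initial matching point (i, j) until the end or a mismatch.
            while (i + length < len(frames_a) and j + length < len(frames_b)
                   and cosine(frames_a[i + length], frames_b[j + length]) > frame_threshold):
                length += 1
            best = max(best, length)
    return best


def is_keyframe_duplicate(frames_a: List[Vector], frames_b: List[Vector],
                          frame_threshold: float = 0.9, match_ratio: float = 0.6) -> bool:
    if not frames_a or not frames_b:
        return False
    run = longest_matching_run(frames_a, frames_b, frame_threshold)
    # Assumption: the ratio is taken against the larger key frame count.
    return run / max(len(frames_a), len(frames_b)) > match_ratio


a = [(1, 0), (0, 1), (1, 1)]
b = [(0, 1), (1, 1), (1, 0)]
print(longest_matching_run(a, b), is_keyframe_duplicate(a, b))
```

The exhaustive search is kept for clarity; it is quadratic in the number of key frames and would likely need optimizing for long videos.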

Of course, those skilled in the art may also use other video frames of the video files to calculate the visual feature similarity according to the prior art, which is not described further in the embodiments of the present invention.

Third: calculate both the text similarity of the description information of the two video files with adjacent publication times and the visual feature similarity of the two video files, and determine, according to the calculated text similarity and visual feature similarity, whether the two video files are duplicate videos.

In practical applications, considering the text similarity of the description information of two video files together with their visual feature similarity makes it possible to determine more accurately whether the two video files are duplicate videos, improving the precision of duplicate removal. Specifically, preset thresholds may be set for the text similarity and the visual feature similarity respectively, and the two video files are determined to be duplicate videos when both similarities are higher than their corresponding preset thresholds. Alternatively, the text similarity and the visual feature similarity may each be assigned a certain weight, and the two video files are determined to be duplicate videos when the weighted sum of the two is higher than a certain preset threshold.
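
The two decision rules of the third method can be written down directly. In the sketch below the function names, the weights, and the threshold values are illustrative assumptions; in practice they would be set according to the actual situation.

```python
def duplicate_by_both_thresholds(text_sim: float, visual_sim: float,
                                 text_threshold: float = 0.8,
                                 visual_threshold: float = 0.8) -> bool:
    """Rule 1: both similarities must exceed their own preset thresholds."""
    return text_sim > text_threshold and visual_sim > visual_threshold


def duplicate_by_weighted_sum(text_sim: float, visual_sim: float,
                              text_weight: float = 0.5, visual_weight: float = 0.5,
                              threshold: float = 0.8) -> bool:
    """Rule 2: the weighted sum of the two similarities must exceed one threshold."""
    return text_weight * text_sim + visual_weight * visual_sim > threshold


print(duplicate_by_both_thresholds(0.9, 0.7))
print(duplicate_by_weighted_sum(0.95, 0.7))
```

The two rules can disagree: with the values above, a pair with strong text similarity and moderate visual similarity fails the first rule but passes the second.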

It should be noted that any of the above three methods for judging whether two video files are duplicate videos can be selected according to the actual situation, and the thresholds can likewise be set according to the actual situation.

S223: According to the duplicate removal result, determine the video files corresponding to the much-talked-about topic.

According to the duplicate removal result of step S222, the video files corresponding to the much-talked-about topic can be determined.

In practical applications, after duplicate removal, the number of video files determined for the much-talked-about topic may still be large, which indicates that the topic is relatively broad and may extend over a period of time. If all of them were displayed as associated videos of the topic, browsing would be inconvenient for users, because a user would have to keep searching to find the video file he or she wants to view.

Based on this, in one embodiment of the present invention, the following steps may also be included after step S223:

First step: judge whether the number of video files determined for the much-talked-about topic is greater than a preset second threshold; if so, perform the second step.

In practical applications, a maximum number of associated videos, that is, the second threshold, can be set in advance for each much-talked-about topic. If the number of video files determined for the topic in step S223 is greater than the preset second threshold, the second step is performed. The second threshold can be set and adjusted according to the actual situation.

Second step: perform hierarchical clustering on the video files determined for the much-talked-about topic, successively according to the intervals between the publication times of videos with adjacent publication times, until the number of classes obtained is not greater than the preset second threshold.

It can be understood that the publication times of different video files in the candidate video set after duplicate removal may be the same or different. If a much-talked-about topic extends over a period of time, the intervals between the publication times of its video files may be large. The video files determined for the topic can be hierarchically clustered successively according to the intervals between the publication times of videos with adjacent publication times. Each class can be regarded as an important stage in the evolution of the topic.

For example, for a certain much-talked-about topic, suppose the number of video files in its candidate video set still exceeds the preset second threshold after duplicate removal. These video files can be arranged in order of publication time, from earliest to latest, as video file 1, video file 2, video file 3, video file 4, and video file 5, where the interval between the publication times of video file 1 and video file 2 is 1 hour, that between video file 2 and video file 3 is 3 days, that between video file 3 and video file 4 is 5 days, and that between video file 4 and video file 5 is 2 hours.

Assume the clustering condition is: no more than 1 day, that is, video files whose publication times are no more than 1 day apart are placed in one class. The result is:

{video file 1, video file 2}, {video file 3}, {video file 4, video file 5}, three classes in total.

If the number of classes obtained, 3, is still greater than the preset second threshold, the clustering condition can be changed to: no more than 3 days, that is, video files whose publication times are no more than 3 days apart are placed in one class. The result is:

{video file 1, video file 2, video file 3}, {video file 4, video file 5}, two classes in total.

It should be noted that, after hierarchical clustering of these video files, the number of classes finally obtained must be not greater than the preset second threshold.
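
The worked example above can be reproduced with a short sketch. Publication times are expressed in hours from the first video (0, 1, 73, 193, 195), which matches the stated intervals of 1 hour, 3 days, 5 days, and 2 hours. The function names and the idea of passing a schedule of progressively looser gap limits are assumptions made for illustration.

```python
from typing import List, Sequence


def cluster_by_gap(times: Sequence[float], max_gap: float) -> List[List[float]]:
    """Group sorted times; a new class starts when the gap exceeds max_gap."""
    clusters: List[List[float]] = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


def cluster_until_fits(times: Sequence[float], gap_schedule: Sequence[float],
                       max_classes: int) -> List[List[float]]:
    """Relax the gap condition step by step until the class count fits.

    If no gap in the schedule is loose enough, the last clustering is returned.
    """
    clusters = [[t] for t in sorted(times)]
    for max_gap in gap_schedule:
        clusters = cluster_by_gap(times, max_gap)
        if len(clusters) <= max_classes:
            break
    return clusters


times_in_hours = [0, 1, 73, 193, 195]
print(cluster_by_gap(times_in_hours, 24))               # condition: no more than 1 day
print(cluster_until_fits(times_in_hours, [24, 72], 2))  # second threshold assumed to be 2
```

With a second threshold of 2, the 1-day condition yields three classes, so the sketch moves on to the 3-day condition and stops at two classes.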

Third step: according to the quality of the videos in each class, determine the representative video of each class.

In practical applications, after the video files in the candidate video set that has undergone duplicate removal are hierarchically clustered, each class may contain multiple video files. It can be understood that different video files vary in quality, their uploaders may have different identities, and viewers may like them to different degrees, among other differences. These factors can be considered together to select one representative video from the video files contained in each class.
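
One hypothetical way to weigh these factors together is a weighted score, as sketched below. The embodiments do not define a concrete quality measure; the `RatedVideo` fields, the weights, and the function names are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RatedVideo:
    # Hypothetical quality signals, each normalized to the range [0, 1].
    video_id: str
    resolution_score: float   # stands in for the quality of the video itself
    uploader_verified: bool   # stands in for the identity of the uploader
    like_ratio: float         # stands in for how much viewers like the video


def quality_score(video: RatedVideo) -> float:
    """Weighted combination of the signals; the weights are assumed values."""
    verified = 1.0 if video.uploader_verified else 0.0
    return 0.4 * video.resolution_score + 0.2 * verified + 0.4 * video.like_ratio


def pick_representatives(classes: List[List[RatedVideo]]) -> List[RatedVideo]:
    """Select the highest-scoring video of each non-empty class."""
    return [max(group, key=quality_score) for group in classes if group]


classes = [
    [RatedVideo("v1", 0.5, False, 0.5), RatedVideo("v2", 0.9, True, 0.8)],
    [RatedVideo("v3", 0.7, False, 0.6)],
]
print([v.video_id for v in pick_representatives(classes)])
```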

Fourth step: determine the representative video of each class as an associated video of the much-talked-about topic.

The representative video of each class is determined as an associated video of the much-talked-about topic. When there is a display demand, the associated videos of the topic are presented to the user.

With the technical solution provided by the embodiments of the present invention, for each much-talked-about topic, the video files corresponding to the topic are tracked by means of the text similarity between the keywords of the topic and the description information of video files. The tracking is completed automatically by the server. Even if a much-talked-about topic consists of many staged events, the server can execute this technical solution repeatedly to update the video files corresponding to the topic regularly, saving manual operation costs.

Corresponding to the method embodiment shown in Fig. 1, an embodiment of the present invention further provides a much-talked-about topic keyword determining device, applied to a server. Referring to Fig. 4, the device may include the following modules:

Basic word set obtaining module 310, configured to segment the text information of each piece of hot spot data obtained from a designated website into words, to obtain the set of basic words of each piece of hot spot data;

Named-entity-attribute basic word determining module 320, configured to, for each piece of hot spot data, determine, within the set of basic words of that piece of hot spot data and according to the frequency with which basic words whose attribute is named entity occur in its text information, the named-entity basic words used to establish the text model of that piece of hot spot data;

Text model establishing module 330, configured to establish the text model of each piece of hot spot data according to the determined named-entity basic words;

Hot spot data clustering module 340, configured to cluster all the obtained hot spot data according to the text similarity of the text models of every two pieces of hot spot data, to obtain at least one cluster;

Much-talked-about topic keyword determining module 350, configured to, for each cluster, determine the keywords of the cluster according to the frequency with which the named-entity basic words in the cluster occur in the text information of the hot spot data contained in the cluster, and determine the keywords of the cluster as the keywords of the much-talked-about topic corresponding to the cluster.
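
The frequency-based selections performed by modules 320 and 350 can be sketched as follows. The sketch assumes that word segmentation and named-entity tagging have already produced (word, is_named_entity) pairs for each piece of hot spot data; the function names, the top-N value, and the minimum-count form of the threshold condition are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple

TaggedWord = Tuple[str, bool]  # (basic word, whether its attribute is named entity)


def select_model_words(tagged_words: List[TaggedWord], top_n: int = 3) -> List[str]:
    """Module 320 (sketch): top-N named-entity words of one piece of hot spot data."""
    counts = Counter(word for word, is_entity in tagged_words if is_entity)
    return [word for word, _ in counts.most_common(top_n)]


def cluster_keywords(cluster_items: List[List[TaggedWord]], min_count: int = 2) -> List[str]:
    """Module 350 (sketch): named-entity words frequent enough across one cluster."""
    counts = Counter(
        word for item in cluster_items for word, is_entity in item if is_entity
    )
    return sorted(word for word, n in counts.items() if n >= min_count)


item1 = [("Messi", True), ("scores", False), ("Messi", True), ("Barcelona", True)]
item2 = [("Barcelona", True), ("wins", False), ("Messi", True)]
print(select_model_words(item1, top_n=1))
print(cluster_keywords([item1, item2], min_count=2))
```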

In one embodiment of the present invention, the basic word set obtaining module 310 may also be configured to:

In one embodiment of the present invention, the device may also include the following modules:

With the device provided by the embodiments of the present invention, the server obtains hot spot data from a designated website and, by processing the hot spot data through word segmentation, clustering, and the like, determines much-talked-about topics and their keywords in a timely manner, effectively avoiding the lag caused by determining much-talked-about topics through manual operation.

Corresponding to the method embodiment shown in Fig. 2, an embodiment of the present invention further provides a much-talked-about topic tracking device, applied to a server. Referring to Fig. 5, the device may include the following modules:

Text similarity determining module 410, configured to, for each much-talked-about topic, determine the text similarity between the keywords of the topic and the description information of each video file;

Video file tracking module 420, configured to track the video files corresponding to the much-talked-about topic according to the determined text similarity.
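
How modules 410 and 420 might cooperate is sketched below. The similarity used here, the share of topic keywords that appear in a description, is only an assumed stand-in for whatever text similarity measure is actually adopted, and the function names and threshold value are hypothetical.

```python
from typing import Dict, List


def keyword_similarity(keywords: List[str], description: str) -> float:
    """Module 410 (sketch): share of topic keywords found in the description."""
    if not keywords:
        return 0.0
    text = description.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def candidate_videos(keywords: List[str], descriptions: Dict[str, str],
                     threshold: float = 0.5) -> List[str]:
    """Module 420 (sketch): keep videos whose similarity exceeds a preset threshold."""
    return [
        video_id
        for video_id, desc in descriptions.items()
        if keyword_similarity(keywords, desc) > threshold
    ]


descriptions = {
    "v1": "Messi scores twice for Barcelona",
    "v2": "Cooking pasta at home",
    "v3": "Barcelona training",
}
print(candidate_videos(["Messi", "Barcelona"], descriptions))
```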

In a specific embodiment of the present invention, the video file tracking module 420 may include the following submodules:

In a specific embodiment of the present invention, the device may also include the following submodules:

In a specific embodiment of the present invention, the duplicate removal submodule may include the following units:

Alternatively,

Or；

With the device provided by the embodiments of the present invention, for each much-talked-about topic, the video files corresponding to the topic are tracked by means of the text similarity between the keywords of the topic and the description information of video files. The tracking is completed automatically by the server. Even if a much-talked-about topic consists of many staged events, the server can execute this technical solution repeatedly to update the video files corresponding to the topic regularly, saving manual operation costs.

It should be noted that herein, relational terms such as first and second and the like are used merely to a reality Body or operation are distinguished with another entity or operation, are deposited without necessarily requiring or implying between these entities or operation In any this actual relation or order.Moreover, term " comprising ", "comprising" or its any other variant are intended to Non-exclusive inclusion, so that process, method, article or equipment including a series of elements not only will including those Element, but also including other elements that are not explicitly listed, or further include as this process, method, article or equipment Intrinsic key element.In the absence of more restrictions, the key element limited by sentence "including a ...", it is not excluded that Also there are other identical element in process, method, article or equipment including the key element.

Each embodiment in this specification is described using relevant mode, identical similar portion between each embodiment Divide mutually referring to what each embodiment stressed is the difference with other embodiment.It is real especially for device For applying example, since it is substantially similar to embodiment of the method, so description is fairly simple, related part is referring to embodiment of the method Part explanation.

Can one of ordinary skill in the art will appreciate that realizing that all or part of step in above method embodiment is To instruct relevant hardware to complete by program, the program can be stored in computer read/write memory medium, The storage medium designated herein obtained, such as：ROM/RAM, magnetic disc, CD etc..

The foregoing is merely illustrative of the preferred embodiments of the present invention, is not intended to limit the scope of the present invention.It is all Any modification, equivalent replacement, improvement and so within the spirit and principles in the present invention, are all contained in protection scope of the present invention It is interior.

Claims

1. A method for determining much-talked-about topic keywords, characterized in that it is applied to a server, the method comprising:

segmenting the text information of each piece of hot spot data obtained from a designated website into words, to obtain the set of basic words of each piece of hot spot data;

for each piece of hot spot data, determining, within the set of basic words of that piece of hot spot data and according to the frequency with which basic words whose attribute is named entity occur in the text information of that piece of hot spot data, the named-entity basic words used to establish the text model of that piece of hot spot data;

wherein determining, according to the frequency with which basic words whose attribute is named entity occur in the text information of that piece of hot spot data, the named-entity basic words used to establish the text model of that piece of hot spot data comprises:

sorting the basic words having the named-entity attribute in that piece of hot spot data from highest to lowest by the frequency with which each occurs in the text information of that piece of hot spot data, and determining the top N, or the top x%, as the named-entity basic words used to establish the text model of that piece of hot spot data;

clustering all the obtained hot spot data according to the text similarity of the text models of every two pieces of hot spot data, to obtain at least one cluster;

for each cluster, determining the keywords of the cluster according to the frequency with which the named-entity basic words in the cluster occur in the text information of the hot spot data contained in the cluster, and determining the keywords of the cluster as the keywords of the much-talked-about topic corresponding to the cluster;

wherein determining the keywords of the cluster according to the frequency with which the named-entity basic words in the cluster occur in the text information of the hot spot data contained in the cluster comprises:

taking, as the keywords of the cluster, the basic words whose frequency of occurrence in the text information of the hot spot data in the cluster meets a preset threshold condition.

2. The method according to claim 1, characterized in that, after obtaining the set of basic words of each piece of hot spot data and before determining the named-entity basic words used to establish the text model of that piece of hot spot data, the method further comprises:

3. The method according to claim 1, characterized in that it further comprises:

selecting one title from the titles of the hot spot data in which the at least one found keyword occurs, as the much-talked-about topic corresponding to the cluster.

4. A much-talked-about topic tracking method based on the much-talked-about topic keyword determining method according to any one of claims 1 to 3, characterized in that it is applied to a server, the method comprising:

for each much-talked-about topic, determining the text similarity between the keywords of the topic and the description information of each video file;

5. The method according to claim 4, characterized in that tracking the video files corresponding to the much-talked-about topic according to the determined text similarity comprises:

determining the candidate video set of the much-talked-about topic according to whether the determined text similarity is greater than a preset first threshold;

6. The method according to claim 5, characterized in that, after determining the video files corresponding to the much-talked-about topic according to the duplicate removal result, the method further comprises:

if so, performing hierarchical clustering on the video files determined for the much-talked-about topic, successively according to the intervals between the publication times of videos with adjacent publication times, until the number of classes obtained is not greater than the preset second threshold;

7. The method according to claim 5 or 6, characterized in that performing video duplicate removal on the candidate video set of the much-talked-about topic comprises:

sorting the videos in the candidate video set of the much-talked-about topic by the publication times of the video files, from earliest to latest;

judging in turn whether two video files with adjacent publication times are duplicate videos, and if so, retaining the video file with the earlier publication time in the candidate video set of the much-talked-about topic and deleting the video file with the later publication time.

8. The method according to claim 7, characterized in that judging whether two video files with adjacent publication times are duplicate videos comprises:

calculating the text similarity of the description information of the two video files with adjacent publication times, and determining, according to the calculated text similarity, whether the two video files are duplicate videos;

or,

calculating the visual feature similarity of the two video files with adjacent publication times, and determining, according to the calculated visual feature similarity, whether the two video files are duplicate videos;

or,

calculating the text similarity of the description information of the two video files with adjacent publication times and the visual feature similarity of the two video files, and determining, according to the calculated text similarity and visual feature similarity, whether the two video files are duplicate videos.

9. A much-talked-about topic keyword determining device, characterized in that it is applied to a server, the device comprising:

a basic word set obtaining module, configured to segment the text information of each piece of hot spot data obtained from a designated website into words, to obtain the set of basic words of each piece of hot spot data;

a named-entity-attribute basic word determining module, configured to, for each piece of hot spot data, determine, within the set of basic words of that piece of hot spot data and according to the frequency with which basic words whose attribute is named entity occur in the text information of that piece of hot spot data, the named-entity basic words used to establish the text model of that piece of hot spot data;

the named-entity-attribute basic word determining module being specifically configured to: sort the basic words having the named-entity attribute in that piece of hot spot data from highest to lowest by the frequency with which each occurs in the text information of that piece of hot spot data, and determine the top N, or the top x%, as the named-entity basic words used to establish the text model of that piece of hot spot data;

a text model establishing module, configured to establish the text model of each piece of hot spot data according to the determined named-entity basic words;

a hot spot data clustering module, configured to cluster all the obtained hot spot data according to the text similarity of the text models of every two pieces of hot spot data, to obtain at least one cluster;

a much-talked-about topic keyword determining module, configured to, for each cluster, determine the keywords of the cluster according to the frequency with which the named-entity basic words in the cluster occur in the text information of the hot spot data contained in the cluster, and determine the keywords of the cluster as the keywords of the much-talked-about topic corresponding to the cluster;

the much-talked-about topic keyword determining module being specifically configured to: take, as the keywords of the cluster, the basic words whose frequency of occurrence in the text information of the hot spot data in the cluster meets a preset threshold condition.

10. The device according to claim 9, characterized in that the basic word set obtaining module is further configured to:

11. The device according to claim 9 or 10, characterized in that it further comprises:

a much-talked-about topic title determining module, configured to, for each cluster, search the determined keywords of the cluster for at least one keyword with the highest frequency, and select one title from the titles of the hot spot data in which the at least one found keyword occurs, as the much-talked-about topic corresponding to the cluster.

12. A much-talked-about topic tracking device based on the much-talked-about topic keyword determining device according to claim 9, characterized in that it is applied to a server, the device comprising:

a text similarity determining module, configured to, for each much-talked-about topic, determine the text similarity between the keywords of the topic and the description information of each video file;

a video file tracking module, configured to track the video files corresponding to the much-talked-about topic according to the determined text similarity.

13. The device according to claim 12, characterized in that the video file tracking module comprises:

a candidate video set determining submodule, configured to determine the candidate video set of the much-talked-about topic according to whether the determined text similarity is greater than a preset first threshold;

14. The device according to claim 13, characterized in that it further comprises:

a judging submodule, configured to judge whether the number of video files determined for the much-talked-about topic is greater than a preset second threshold, and if so, trigger the clustering submodule;

the clustering submodule, configured to perform hierarchical clustering on the video files determined for the much-talked-about topic, successively according to the intervals between the publication times of videos with adjacent publication times, until the number of classes obtained is not greater than the preset second threshold;

a representative video determining submodule, configured to determine the representative video of each class according to the quality of the videos in each class;

an associated video determining submodule, configured to determine the representative video of each class as an associated video of the much-talked-about topic.

15. The device according to claim 13 or 14, characterized in that the duplicate removal submodule comprises:

a video sorting unit, configured to sort the videos in the candidate video set of the much-talked-about topic by the publication times of the video files, from earliest to latest;

a duplicate video judging unit, configured to judge in turn whether two video files with adjacent publication times are duplicate videos, and if so, trigger the duplicate removal unit;

the duplicate removal unit, configured to retain the video file with the earlier publication time in the candidate video set of the much-talked-about topic and delete the video file with the later publication time.

16. The device according to claim 15, characterized in that the duplicate video judging unit is specifically configured to:

Alternatively,

Or；