CN104915447A - Method and device for tracing hot topics and confirming keywords - Google Patents

Method and device for tracing hot topics and confirming keywords

Info

Publication number
CN104915447A
Authority
CN
China
Prior art keywords
video
topic
talked
much
hot spot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510372462.0A
Other languages
Chinese (zh)
Other versions
CN104915447B (en)
Inventor
乔奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201510372462.0A
Publication of CN104915447A
Application granted
Publication of CN104915447B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

The present invention discloses a method for tracing hot topics and confirming keywords. The method comprises that hot data is obtained through a set website, and the hot data is classified and clustered so as to confirm the keywords of the hot topics in time. According to the method for tracing hot topics, a video file corresponding to each hot topic can be obtained through the text similarity of the keywords of the hot topics and the description of the video file. When the technical scheme adopted by the method provided by the embodiment of the invention is applied, the confirmation of the keywords of each hot topic and the tracing of the hot topics are automatically completed by a server, so that the delay caused by confirming the hot topics through manpower is effectively avoided; even though some hot topic can be composed of a plurality of phased events, the server can repeatedly implement the technical scheme so as to periodically update the video file corresponding to each hot topic, so that the labor operation cost is reduced.

Description

Method and device for hot topic tracking and keyword determination
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for hot topic tracking and keyword determination.
Background
A hot topic is the issue that the public cares about most within a certain period of time and a certain scope. On video websites, UGC (User Generated Content) videos are mostly published by users as soon as a hot event or topic emerges; they are time-sensitive and attract a high degree of user attention. However, because UGC videos on a video website are massive in volume, disorderly, and highly repetitive, it is inconvenient for users to obtain important information in time. Many video websites therefore discover and track hot topics for such videos, so that videos on the same hot topic are aggregated for users to view conveniently.
At present, video websites rely on operations staff to discover and track hot topics. Operations staff analyze the description information of video files, such as titles and synopses, to determine the current hot topics, and then determine the videos corresponding to each hot topic.
Hot topics determined by operations staff through analysis of video description information often lag behind events. In addition, some hot topics consist of multiple staged events and span a long time, requiring continuous attention and analysis by operations staff, so the manual operation cost is high.
Summary of the invention
To solve the above problems, embodiments of the present invention disclose a method and device for hot topic tracking and keyword determination. The technical solution is as follows:
A hot topic keyword determination method, applied to a server, the method comprising:
Performing word segmentation on the text information of each item of hot data obtained from a preset website, to obtain a set of base words for each item of hot data;
For each item of hot data, determining, in the set of base words of that item and according to the frequency with which base words whose attribute is named entity occur in the text information of that item, the named-entity base words used to build the text model of that item;
Building the text model of each item of hot data from the determined named-entity base words;
Clustering all obtained hot data according to the text similarity between the text models of every two items of hot data, to obtain at least one cluster;
For each cluster, determining the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hot data contained in the cluster, and taking the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.
In an embodiment of the present invention, after obtaining the set of base words of each item of hot data and before determining the named-entity base words used to build the text model of that item, the method further comprises:
Performing stop-word filtering on the base words in the set of base words of each item of hot data.
In an embodiment of the present invention, the method further comprises:
For each cluster, finding at least one keyword with the highest frequency among the determined keywords of the cluster;
Selecting one title from the titles of the hot data containing the at least one keyword found, as the hot topic corresponding to the cluster.
A hot topic tracking method based on the above hot topic keyword determination method, applied to a server, the method comprising:
For each hot topic, determining the text similarity between the keywords of the hot topic and the description information of each video file;
Tracking the video files corresponding to the hot topic according to the determined text similarity.
In an embodiment of the present invention, tracking the video files corresponding to the hot topic according to the determined text similarity comprises:
Determining a candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold;
Performing video deduplication on the candidate video set of the hot topic;
Determining the video files corresponding to the hot topic according to the deduplication result.
In an embodiment of the present invention, after determining the video files corresponding to the hot topic according to the deduplication result, the method further comprises:
Judging whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold;
If so, performing hierarchical clustering on the determined video files corresponding to the hot topic, successively according to the interval between the publication times of videos adjacent in publication time, until the number of categories obtained is not greater than the preset second threshold;
Determining a representative video for each category according to the quality of the videos in the category;
Taking the representative video of each category as a related video of the hot topic.
In an embodiment of the present invention, performing video deduplication on the candidate video set of the hot topic comprises:
Sorting the videos in the candidate video set of the hot topic by publication time, from earliest to latest;
Judging in turn whether two video files adjacent in publication time are duplicate videos, and if so, retaining the earlier-published video file in the candidate video set of the hot topic and deleting the later-published video file.
In an embodiment of the present invention, judging whether two video files adjacent in publication time are duplicate videos comprises:
Calculating the text similarity between the description information of the two video files adjacent in publication time, and determining whether the two video files are duplicate videos according to the calculated text similarity;
Or,
Calculating the visual feature similarity of the two video files adjacent in publication time, and determining whether the two video files are duplicate videos according to the calculated visual feature similarity;
Or,
Calculating both the text similarity between the description information of the two video files adjacent in publication time and the visual feature similarity of the two video files, and determining whether the two video files are duplicate videos according to the calculated text similarity and visual feature similarity.
A hot topic keyword determination device, applied to a server, the device comprising:
A base word set obtaining module, configured to perform word segmentation on the text information of each item of hot data obtained from a preset website, to obtain a set of base words for each item of hot data;
A named-entity base word determination module, configured to determine, for each item of hot data, in the set of base words of that item and according to the frequency with which base words whose attribute is named entity occur in the text information of that item, the named-entity base words used to build the text model of that item;
A text model building module, configured to build the text model of each item of hot data from the determined named-entity base words;
A hot data clustering module, configured to cluster all obtained hot data according to the text similarity between the text models of every two items of hot data, to obtain at least one cluster;
A hot topic keyword determination module, configured to determine, for each cluster, the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hot data contained in the cluster, and to take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.
In an embodiment of the present invention, the base word set obtaining module is further configured to:
Perform stop-word filtering on the base words in the set of base words of each item of hot data.
In an embodiment of the present invention, the device further comprises:
A hot topic title determination module, configured to find, for each cluster, at least one keyword with the highest frequency among the determined keywords of the cluster, and to select one title from the titles of the hot data containing the at least one keyword found, as the hot topic corresponding to the cluster.
A hot topic tracking device based on the above hot topic keyword determination device, applied to a server, the device comprising:
A text similarity determination module, configured to determine, for each hot topic, the text similarity between the keywords of the hot topic and the description information of each video file;
A video file tracking module, configured to track the video files corresponding to the hot topic according to the determined text similarity.
In an embodiment of the present invention, the video file tracking module comprises:
A candidate video set determination submodule, configured to determine a candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold;
A deduplication submodule, configured to perform video deduplication on the candidate video set of the hot topic;
A video file determination submodule, configured to determine the video files corresponding to the hot topic according to the deduplication result.
In an embodiment of the present invention, the device further comprises:
A judging submodule, configured to judge whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold, and if so, to trigger a clustering submodule;
The clustering submodule, configured to perform hierarchical clustering on the determined video files corresponding to the hot topic, successively according to the interval between the publication times of videos adjacent in publication time, until the number of categories obtained is not greater than the preset second threshold;
A representative video determination submodule, configured to determine a representative video for each category according to the quality of the videos in the category;
A related video determination submodule, configured to take the representative video of each category as a related video of the hot topic.
In an embodiment of the present invention, the deduplication submodule comprises:
A video sorting unit, configured to sort the videos in the candidate video set of the hot topic by publication time, from earliest to latest;
A duplicate video judging unit, configured to judge in turn whether two video files adjacent in publication time are duplicate videos, and if so, to trigger a deduplication unit;
The deduplication unit, configured to retain the earlier-published video file in the candidate video set of the hot topic and delete the later-published video file.
In an embodiment of the present invention, the duplicate video judging unit is specifically configured to:
Calculate the text similarity between the description information of the two video files adjacent in publication time, and determine whether the two video files are duplicate videos according to the calculated text similarity;
Or,
Calculate the visual feature similarity of the two video files adjacent in publication time, and determine whether the two video files are duplicate videos according to the calculated visual feature similarity;
Or,
Calculate both the text similarity between the description information of the two video files adjacent in publication time and the visual feature similarity of the two video files, and determine whether the two video files are duplicate videos according to the calculated text similarity and visual feature similarity.
With the technical solution provided by the embodiments of the present invention, hot data is obtained from a preset website and processed by word segmentation, clustering, and similar steps, so that hot topics and their keywords are determined in time. For each hot topic, the video files corresponding to the topic are tracked through the text similarity between the keywords of the topic and the description information of the video files. The determination of hot topic keywords and the tracking of hot topics are thus completed automatically by the server, which avoids the lag caused by determining hot topics through manual operation. Even if a hot topic consists of multiple staged events, the server can execute the technical solution repeatedly to update the video files corresponding to the topic periodically, which saves manual operation cost.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an implementation of the hot topic keyword determination method in an embodiment of the present invention;
Fig. 2 is a flowchart of an implementation of the hot topic tracking method in an embodiment of the present invention;
Fig. 3 is a flowchart of another implementation of the hot topic tracking method in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the hot topic keyword determination device in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the hot topic tracking device in an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, these solutions are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 shows a flowchart of an implementation of a hot topic keyword determination method provided by an embodiment of the present invention. The method is applied to a server and may comprise the following steps:
S110: Perform word segmentation on the text information of each item of hot data obtained from a preset website, to obtain a set of base words for each item of hot data.
In the Internet era, information spreads quickly, and within a given period people's attention tends to be concentrated. Websites such as microblog topic lists and search trending lists aggregate the hot data that people follow, and reflect in concentrated form people's views on a topic or event in the current period.
In practice, the same topic may be expressed in different words while its subject stays the same. Hot data can be obtained, regularly or irregularly, from preset websites (such as the microblog topic lists and search trending lists mentioned above) by a web crawler or another information acquisition method. Each item of hot data obtained may include text information such as the title, abstract, detailed description, and/or linked content of the item.
It will be understood that different websites may represent hot data differently, so the hot data obtained can first be converted into a unified data representation according to a preset format, for example: title, description, time, related text information. Because each item of hot data contains a large amount of text, word segmentation is performed on the text information of each item to obtain its set of base words.
In an embodiment of the present invention, stop-word filtering can also be performed on the base words in the set of base words of each item of hot data. In practice, a stop-word dictionary can be set up in advance. The dictionary may include function words such as "的", "地", and "得", and may also include words chosen by operations staff, such as "high definition" and "low definition".
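Step S110 and the optional stop-word filtering can be sketched as follows. This is a minimal illustration, not the patented implementation: the regex splitter stands in for a real word segmenter (Chinese text would need a dedicated one), and `STOP_WORDS`, `segment`, and `base_words` are hypothetical names introduced here.

```python
import re

# Hypothetical stop-word dictionary: function words plus terms an operator
# might add (the description mentions words such as "high definition").
STOP_WORDS = {"the", "of", "a", "hd", "sd"}

def segment(text):
    """Stand-in segmenter: lowercase the text and split on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def base_words(text, stop_words=STOP_WORDS):
    """Return the base words of one hot-data item after stop-word filtering."""
    return [tok for tok in segment(text) if tok not in stop_words]
```

The result keeps repeated words, so later steps can still count how often each base word occurs in the item.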
S120: For each item of hot data, determine, in the set of base words of that item and according to the frequency with which base words whose attribute is named entity occur in the text information of that item, the named-entity base words used to build the text model of that item.
In an embodiment of the present invention, named entity recognition can be performed to determine the base words whose attribute is named entity. Such a base word may be a word that carries semantic meaning, such as a person name, place name, or building name, or it may be a base word whose part of speech is verb or noun. The set of base words obtained for each item of hot data in step S110 may include the frequency with which each base word occurs in the text information of that item, so the named-entity base words used to build the text model of the item can be determined from those frequencies. Alternatively, the text information of the item can be searched to obtain the frequency with which each named-entity base word occurs, and the named-entity base words used to build the text model can then be determined from that frequency information. Note that methods for recognizing named-entity base words are prior art and are not described further here.
In practice, the named-entity base words that occur most frequently in the text information of an item of hot data can be selected from the item's set of base words to characterize the item. For each item of hot data, the named-entity base words of the item can be sorted by the frequency with which they occur in the item's text information, and either the top N or the top x% can be taken as the named-entity base words used to build the text model of the item. Note that N and x can be set and adjusted according to actual conditions, and the embodiments of the present invention place no limit on them.
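The top-N or top-x% selection in step S120 can be sketched as below. It assumes named entity recognition has already produced a set of named-entity words (the description treats that as prior art); `select_model_words` and its parameters are hypothetical names.

```python
import math
from collections import Counter

def select_model_words(base_words, named_entities, top_n=None, top_percent=None):
    """Rank named-entity base words by frequency and keep the top N or top x%."""
    counts = Counter(w for w in base_words if w in named_entities)
    ranked = [word for word, _ in counts.most_common()]
    if top_n is not None:
        return ranked[:top_n]
    if top_percent is not None and ranked:
        # Rounding up so that at least one word survives is an assumption.
        keep = max(1, math.ceil(len(ranked) * top_percent / 100))
        return ranked[:keep]
    return ranked
```

Words with equal frequency keep their first-seen order here; the description does not say how ties should be broken.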
S130: Build the text model of each item of hot data from the determined named-entity base words.
Step S120 determines the named-entity base words used to build the text model of each item of hot data, and the text model of each item can be built from them. Specifically, the text model of each item can be expressed as a vector space model (VSM) that records the frequency with which each determined named-entity base word occurs in the text information of the corresponding item. Note that the vector space model is prior art and is not described further here.
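One simple way to hold such a text model is a sparse term-frequency mapping. The sketch below assumes raw counts as the vector weights; the description only says the model records frequencies, so other weightings are possible, and `build_text_model` is a hypothetical name.

```python
from collections import Counter

def build_text_model(base_words, model_words):
    """Sparse VSM vector: each selected named-entity word mapped to its frequency."""
    counts = Counter(base_words)
    return {word: counts[word] for word in model_words}
```

Storing only the selected words keeps each vector small, which matches the efficiency argument the description makes for comparing text models instead of full texts.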
S140: Cluster all obtained hot data according to the text similarity between the text models of every two items of hot data, to obtain at least one cluster.
Step S130 builds the text model of each item of hot data. The text similarity between the text models of every two items can be calculated using prior-art measures, such as the commonly used cosine similarity or the Jaccard distance, which are not described further here.
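Both measures named above can be sketched on sparse frequency mappings. This assumes each text model is a dict from word to frequency, which is an illustrative representation rather than one the source prescribes; the Jaccard function returns a similarity, and the corresponding distance is one minus that value.

```python
import math

def cosine_similarity(model_a, model_b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(freq * model_b.get(word, 0) for word, freq in model_a.items())
    norm_a = math.sqrt(sum(f * f for f in model_a.values()))
    norm_b = math.sqrt(sum(f * f for f in model_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(model_a, model_b):
    """Shared words divided by all words; Jaccard distance is 1 minus this."""
    words_a, words_b = set(model_a), set(model_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

Cosine similarity uses the recorded frequencies, while this Jaccard variant only looks at which words are present.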
All obtained hot data are then clustered according to the calculated text similarity between every two text models, to obtain at least one cluster. For example, for a given item of hot data, the items whose text models have a text similarity to its text model above a preset threshold are grouped into one cluster. Computing text similarity on the text models rather than on the full text information of the hot data reduces the amount of computation and improves efficiency.
For ease of understanding, a simple example follows. There are four items of hot data whose text models are A, B, C, and D, and the text similarities between them are shown in Table 1:
Table 1
Text model A B C D
A 1 0.8 0.3 0.5
B 0.8 1 0.3 0.2
C 0.3 0.3 1 0.5
D 0.5 0.2 0.5 1
Suppose that two items of hot data are grouped into one cluster when the text similarity of their text models is higher than 0.5. Under this condition, the hot data corresponding to A and B form one cluster (similarity 0.8), while the hot data corresponding to C and to D each form a cluster of their own, because none of their similarities to other items exceeds 0.5.
S150: For each cluster, determine the keywords of the cluster according to the frequency with which the named-entity base words in the cluster occur in the text information of the hot data contained in the cluster, and take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.
Step S140 clusters all obtained hot data into at least one cluster, and the hot data in each cluster can characterize one hot topic. For each cluster, base words that occur with a high frequency (meeting a preset threshold condition) in the text information of the hot data in the cluster can serve as the keywords of the cluster, and these keywords can be taken as the keywords of the corresponding hot topic.
In an embodiment of the present invention, for each cluster, at least one keyword with the highest frequency can be found among the determined keywords of the cluster, and one title can be selected from the titles of the hot data containing the at least one keyword found, as the hot topic corresponding to the cluster.
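Step S150 and the title selection can be sketched as follows. Each hot-data item is assumed to be a dict with a `title` and its list of base `words`; `min_frequency` stands in for the preset threshold condition, and choosing the first matching title is an assumption, since the source only says that one title is selected.

```python
from collections import Counter

def cluster_keywords(cluster_items, named_entities, min_frequency):
    """Named-entity words whose total frequency in the cluster meets the threshold."""
    counts = Counter()
    for item in cluster_items:
        counts.update(w for w in item["words"] if w in named_entities)
    return {word: freq for word, freq in counts.items() if freq >= min_frequency}

def pick_topic_title(cluster_items, keywords):
    """Title of the first item containing the most frequent keyword, or None."""
    if not keywords:
        return None
    top_keyword = max(keywords, key=keywords.get)
    for item in cluster_items:
        if top_keyword in item["words"]:
            return item["title"]
    return None
```

A production system might instead pick the most recent or the shortest matching title; the description leaves that choice open.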
With the technical solution provided by the embodiments of the present invention, the server obtains hot data from a preset website and processes it by word segmentation, clustering, and similar steps, so that hot topics and their keywords are determined in time, which avoids the lag caused by determining hot topics through manual operation.
Based on the above hot topic keyword determination method, an embodiment of the present invention further provides a hot topic tracking method. Referring to Fig. 2, the method is applied to a server and may comprise the following steps:
S210: For each hot topic, determine the text similarity between the keywords of the hot topic and the description information of each video file.
The current hot topics and the keywords of each hot topic are determined by the above hot topic keyword determination method. The description information of a video file on a video website can be the title or synopsis of the video file. For a given hot topic, the higher the text similarity between the keywords of the topic and the description information of a video file, the closer the video file is to the topic. Note that those skilled in the art can calculate the text similarity between the keywords of each hot topic and the description information of each video file using the prior art, which is not described further here.
S220: Track the video files corresponding to the hot topic according to the determined text similarity.
Step S210 determines the text similarity between the keywords of each hot topic and the description information of each video file. According to the determined text similarity, the video files corresponding to each hot topic can be tracked and obtained.
In a kind of embodiment of the present invention, step S220 can comprise the following steps:
S221: whether be greater than default first threshold according to determined text similarity, determines the candidate video collection of this much-talked-about topic.
Be understandable that, the numerical value of the text similarity of the keyword of much-talked-about topic and the descriptor of video file is higher, represent this video file and this much-talked-about topic more close.In actual applications, if determined text similarity is greater than default first threshold, then the candidate video that corresponding video file can be belonged to this much-talked-about topic is concentrated.
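Steps S210 and S221 can be illustrated with a minimal sketch. The patent leaves the text similarity measure to the prior art, so the Jaccard overlap used here is only an assumed stand-in, and the names `jaccard_similarity`, `candidate_video_set` and `first_threshold` are hypothetical. Whitespace tokenization is also an assumption; Chinese text would need a word segmenter first.

```python
from typing import Dict, List, Set


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two token sets (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_video_set(
    topic_keywords: List[str],
    descriptions: Dict[str, str],
    first_threshold: float,
) -> List[str]:
    """Return ids of videos whose description is similar enough to the topic."""
    keywords = {k.lower() for k in topic_keywords}
    candidates = []
    for video_id, text in descriptions.items():
        # Hypothetical tokenization: split the description on whitespace.
        tokens = set(text.lower().split())
        if jaccard_similarity(keywords, tokens) > first_threshold:
            candidates.append(video_id)
    return candidates
```

Any similarity measure with a comparable range, such as cosine similarity over term vectors, could be substituted without changing the thresholding step.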
S222: perform video deduplication on the candidate video set of the hot topic;
It can be understood that video files on a video website, especially UGC (User Generated Content) video files, are mainly uploaded by users, and different users may describe videos of the same content differently. As a result, a video website contains many duplicate or near-duplicate video files. In practical applications, the videos in the candidate video set of each hot topic can be deduplicated.
In one specific implementation of the present invention, step S222 may comprise the following steps:
Step 1: according to the publication times of the video files in the candidate video set of the hot topic, sort the videos in the candidate video set in order of publication time from earliest to latest;
Step 2: judge in turn whether two video files with adjacent publication times are duplicate videos; if so, keep the video file with the earlier publication time in the candidate video set of the hot topic and delete the video file with the later publication time.
For convenience of description, the above two steps are described together.
In the embodiments of the present invention, the time at which a user uploads a video file is the publication time of the video file. To protect the rights and interests of the user who uploaded the video file first, for duplicate videos the video file with the earlier publication time may be retained preferentially and the one with the later publication time deleted.
In practical applications, the videos in the candidate video set of the hot topic may be sorted by publication time from earliest to latest, and deduplication is performed by judging whether two video files with adjacent publication times are duplicate videos.
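The two deduplication steps can be sketched as follows. The `Video` record and the `is_duplicate` callback are hypothetical names introduced for illustration. The sketch compares each video with the most recently retained one, which is one possible reading of "adjacent" once a later duplicate has been deleted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Video:
    video_id: str
    published_at: float  # publication time, e.g. a Unix timestamp
    description: str


def deduplicate(
    candidates: List[Video],
    is_duplicate: Callable[[Video, Video], bool],
) -> List[Video]:
    """Sort by publication time and drop later duplicates of a kept neighbor."""
    ordered = sorted(candidates, key=lambda v: v.published_at)
    kept: List[Video] = []
    for video in ordered:
        # Keep the earlier upload; skip a later video judged to be its duplicate.
        if kept and is_duplicate(kept[-1], video):
            continue
        kept.append(video)
    return kept
```

Passing the duplicate test as a callback keeps the sorting logic independent of which of the three judging methods described next is chosen.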
Whether two video files with adjacent publication times are duplicate videos may be judged in any of the following ways:
First: compute the text similarity of the description information of the two video files with adjacent publication times, and determine from the computed text similarity whether the two video files are duplicate videos;
As stated above, the text similarity of the description information of two video files can be computed using the prior art. If the text similarity of the description information of two video files with adjacent publication times is higher than a preset threshold, the two video files can be determined to be duplicate videos.
Second: compute the visual feature similarity of the two video files with adjacent publication times, and determine from the computed visual feature similarity whether the two video files are duplicate videos;
On a video website, video files are generally presented to users as thumbnails so that users can browse them intuitively. Video files with the same content may have different description information, but the visual features of their thumbnails, such as color, texture and shape, may be identical or similar. Therefore, the visual feature similarity of the thumbnails of two video files with adjacent publication times can be computed, and whether the two video files are duplicate videos is determined from the computed visual feature similarity. Specifically, two video files whose visual feature similarity is higher than a preset threshold can be determined to be duplicate videos. The visual feature similarity of the thumbnails of different video files can be computed according to the prior art, for example by comparing color histograms.
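As an illustration of the color histogram comparison mentioned above, the following sketch computes a histogram intersection between two thumbnails. It assumes each thumbnail is already decoded into a list of RGB tuples; the function names and the choice of 4 bins per channel are assumptions, not details from the patent.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Bin = Tuple[int, int, int]


def color_histogram(pixels: List[Pixel], bins_per_channel: int = 4) -> Dict[Bin, float]:
    """Normalized histogram over quantized RGB values."""
    if not pixels:
        return {}
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_similarity(a: List[Pixel], b: List[Pixel], bins_per_channel: int = 4) -> float:
    """Histogram intersection: 1.0 for identical color distributions, 0.0 for disjoint ones."""
    hist_a = color_histogram(a, bins_per_channel)
    hist_b = color_histogram(b, bins_per_channel)
    return sum(min(p, hist_b.get(bin_, 0.0)) for bin_, p in hist_a.items())
```

The resulting value would then be compared with the preset threshold to decide whether the two thumbnails belong to duplicate videos.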
In practical applications, key frames may also be extracted from each of the two video files, and the visual feature similarity of the two video files computed by comparing the visual features of their key frames, so that whether the two video files are duplicate videos can be determined from the result. For example, let video file A and video file B be two video files with adjacent publication times. First, M key frames are extracted from video file A and N key frames from video file B, where M and N may be the same or different. Then the visual features of each key frame are extracted, each key frame being represented by a high-dimensional feature vector, and the degree of matching between the key-frame feature vectors is computed under temporal and spatial constraints to determine whether video file A and video file B are duplicate videos. For instance, taking the key frames of video file A as the reference, the key frames of video file A are compared in order with those of video file B. If the visual feature similarity between the i-th key frame of video file A and the j-th key frame of video file B is greater than a preset threshold, (i, j) is taken as an initial matching point. Starting from the initial matching point, the visual feature similarity of successive key-frame pairs of video file A and video file B is computed in order until the frames run out or a pair fails to match, yielding the matched frame pairs that start at (i, j). These steps are repeated to find the globally largest set of matched frame pairs. If the ratio of the number of frame pairs in that set to the total number of frames is higher than another preset matching threshold, the two video files can be determined to be duplicate videos.
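The key-frame matching procedure can be sketched as a search for the longest run of consecutive matching frame pairs. This is a simplified illustration: cosine similarity stands in for the unspecified visual feature similarity, and the "total number of frames" is assumed here to be the smaller of M and N. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def longest_matched_run(
    frames_a: List[Sequence[float]],
    frames_b: List[Sequence[float]],
    frame_threshold: float,
) -> int:
    """Length of the longest run of in-order matching key-frame pairs."""
    best = 0
    for i in range(len(frames_a)):
        for j in range(len(frames_b)):
            length = 0
            # Extend from the initial matching point (i, j) until a mismatch
            # or until either frame sequence is exhausted.
            while (
                i + length < len(frames_a)
                and j + length < len(frames_b)
                and cosine(frames_a[i + length], frames_b[j + length]) > frame_threshold
            ):
                length += 1
            best = max(best, length)
    return best


def is_duplicate_by_key_frames(
    frames_a: List[Sequence[float]],
    frames_b: List[Sequence[float]],
    frame_threshold: float,
    match_ratio_threshold: float,
) -> bool:
    """Judge duplication from the ratio of matched pairs to min(M, N)."""
    if not frames_a or not frames_b:
        return False
    run = longest_matched_run(frames_a, frames_b, frame_threshold)
    return run / min(len(frames_a), len(frames_b)) > match_ratio_threshold
```

The exhaustive search is quadratic in the number of key frames, which is acceptable for a sketch; a production system would likely index the feature vectors instead.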
Of course, those skilled in the art may also, according to the prior art, use other video frames of the video files to compute the visual feature similarity, which is not repeated here.
Third: compute both the text similarity of the description information of the two video files with adjacent publication times and the visual feature similarity of the two video files, and determine from the computed text similarity and visual feature similarity whether the two video files are duplicate videos.
In practical applications, considering both the text similarity of the description information of two video files and their visual feature similarity allows duplicate videos to be identified more accurately, improving deduplication precision. Specifically, a threshold may be preset for each of the text similarity and the visual feature similarity, and the two video files are determined to be duplicate videos when both similarities are higher than their respective thresholds. Alternatively, weights may be assigned to the text similarity and the visual feature similarity, and the two video files are determined to be duplicate videos when the weighted sum of the two is higher than a preset threshold.
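The two combination rules described for the third method can be written as a pair of small predicates. The function and parameter names are hypothetical, and the weights and thresholds are left to the caller, as the text says they are set according to the actual situation.

```python
def is_duplicate_both(
    text_sim: float,
    visual_sim: float,
    text_threshold: float,
    visual_threshold: float,
) -> bool:
    """Duplicate only if each similarity exceeds its own preset threshold."""
    return text_sim > text_threshold and visual_sim > visual_threshold


def is_duplicate_weighted(
    text_sim: float,
    visual_sim: float,
    text_weight: float,
    visual_weight: float,
    combined_threshold: float,
) -> bool:
    """Duplicate if the weighted sum of the similarities exceeds one threshold."""
    return text_weight * text_sim + visual_weight * visual_sim > combined_threshold
```

The first rule is stricter because a low score on either signal vetoes the match, while the weighted rule lets a strong signal compensate for a weak one.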
It should be noted that any of the above three methods for judging whether two video files are duplicate videos may be chosen according to the actual situation, and the thresholds may likewise be set according to the actual situation.
S223: determine the video files corresponding to the hot topic according to the deduplication result.
According to the deduplication result of step S222, the video files corresponding to the hot topic can be determined.
In practical applications, the number of video files determined for a hot topic may still be large after deduplication, indicating that the hot topic is relatively broad and may extend over time. If all of them were presented as related videos of the hot topic, viewing would be inconvenient, because the user would need to keep browsing to find the video file they want to watch.
On this basis, in one embodiment of the invention, the following steps may further be included after step S223:
First step: judge whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold; if so, perform the second step.
In practical applications, a maximum number of related videos, namely the second threshold, can be preset for each hot topic. If the number of video files corresponding to the hot topic determined in step S223 is greater than the preset second threshold, the second step is performed. The second threshold can be set and adjusted according to the actual situation.
Second step: perform hierarchical clustering on the determined video files corresponding to the hot topic, successively according to the publication time interval between videos with adjacent publication times, until the number of categories obtained is not greater than the preset second threshold.
It can be understood that the publication times of different video files in the deduplicated candidate video set may be the same or different. If a hot topic extends over time, the publication time intervals between its video files may be large. Hierarchical clustering can be performed on the determined video files successively according to the publication time interval between videos with adjacent publication times. Each category can be regarded as an important stage in the evolution of the hot topic.
For example, suppose that for a hot topic the number of video files in its candidate video set still exceeds the preset second threshold after deduplication. The video files can be laid out in order of publication time as video file 1, video file 2, video file 3, video file 4 and video file 5, where the publication time interval between video file 1 and video file 2 is 1 hour, between video file 2 and video file 3 is 3 days, between video file 3 and video file 4 is 5 days, and between video file 4 and video file 5 is 2 hours.
Suppose the clustering condition is "not greater than 1 day", so that video files whose publication time interval is not greater than 1 day are grouped into one category. The result is:
{video file 1, video file 2}, {video file 3}, {video file 4, video file 5}, three categories in total.
If the number of categories obtained, 3, is still greater than the preset second threshold, the clustering condition can be changed to "not greater than 3 days", so that video files whose publication time interval is not greater than 3 days are grouped into one category. The result is:
{video file 1, video file 2, video file 3}, {video file 4, video file 5}, two categories in total.
It should be noted that, after hierarchical clustering of these video files, the number of categories finally obtained must not be greater than the preset second threshold.
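The worked example above can be reproduced with a short sketch. Publication times are expressed in hours, and the list of successively relaxed gap limits (`gap_limits`) is an assumption about how the clustering condition is revised; the function names are hypothetical. If every limit is exhausted without reaching the second threshold, the sketch simply returns the last grouping.

```python
from typing import List


def split_by_gap(sorted_times: List[float], max_gap: float) -> List[List[float]]:
    """Group consecutive times whose gap to the previous time is <= max_gap."""
    groups = [[sorted_times[0]]]
    for prev, cur in zip(sorted_times, sorted_times[1:]):
        if cur - prev <= max_gap:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def cluster_by_time_gap(
    times: List[float],
    gap_limits: List[float],
    second_threshold: int,
) -> List[List[float]]:
    """Relax the gap limit step by step until the category count fits."""
    ordered = sorted(times)
    if not ordered:
        return []
    groups = [[t] for t in ordered]
    for max_gap in gap_limits:
        groups = split_by_gap(ordered, max_gap)
        if len(groups) <= second_threshold:
            break
    return groups


# Video files 1 to 5 from the example, in hours: gaps of 1 h, 3 d, 5 d, 2 h.
example_times = [0, 1, 73, 193, 195]
three = cluster_by_time_gap(example_times, [24, 72], second_threshold=3)
two = cluster_by_time_gap(example_times, [24, 72], second_threshold=2)
```

With a second threshold of 3 the one-day condition already suffices and `three` holds three categories; with a second threshold of 2 the three-day condition is needed and `two` holds two categories, matching the example.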
Third step: determine the representative video of each category according to the quality of the videos in that category.
In practical applications, after hierarchical clustering of the video files in the deduplicated candidate video set, each category may contain multiple video files. It can be understood that the quality of different video files varies, the identities of the uploaders may differ, viewers may like them to different degrees, and so on. These factors can be weighed together to select one representative video from the video files in each category.
Fourth step: determine the representative video of each category as a related video of the hot topic.
The representative video of each category is determined as a related video of the hot topic. When there is a display requirement, the related videos of the hot topic are presented to the user.
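The third and fourth steps can be sketched as picking the highest-scoring video in each category. The patent does not define how quality is measured, so the fields `resolution_height`, `uploader_verified` and `like_ratio`, as well as the weights in `quality_score`, are purely hypothetical placeholders for the factors it mentions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateVideo:
    video_id: str
    resolution_height: int   # hypothetical proxy for video quality
    uploader_verified: bool  # hypothetical proxy for uploader identity
    like_ratio: float        # hypothetical proxy for viewer preference, 0..1


def quality_score(video: CandidateVideo) -> float:
    """Assumed weighted combination of the factors; the weights are illustrative."""
    resolution = min(video.resolution_height / 1080, 1.0)
    verified = 1.0 if video.uploader_verified else 0.0
    return 0.4 * resolution + 0.2 * verified + 0.4 * video.like_ratio


def related_videos(categories: List[List[CandidateVideo]]) -> List[str]:
    """Pick one representative per category; together they form the related videos."""
    return [max(group, key=quality_score).video_id for group in categories if group]
```

Because each category corresponds to one stage of the topic, the returned list contains at most as many videos as the second threshold allows.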
By applying the technical solution provided by the embodiments of the present invention, for each hot topic, the video files corresponding to the hot topic are tracked according to the text similarity between the keywords of the hot topic and the description information of the video files, and this is done automatically by the server. Even if a hot topic consists of many staged events, the server can execute this technical solution repeatedly to update the video files corresponding to the hot topic at regular intervals, saving manual operation costs.
Corresponding to the method embodiment shown in Fig. 1, an embodiment of the present invention further provides a hot topic keyword determination device applied to a server. Referring to Figure 4, the device may comprise the following modules:
Basic word set obtaining module 310, configured to perform word segmentation on the text information of each piece of hotspot data obtained from preset websites, to obtain the set of basic words of each piece of hotspot data;
Named-entity basic word determination module 320, configured to, for each piece of hotspot data, determine, in the set of basic words of that hotspot data and according to the frequency with which basic words whose attribute is named entity occur in the text information of that hotspot data, the named-entity basic words used to build the text model of that hotspot data;
Text model building module 330, configured to build the text model of each piece of hotspot data according to the determined basic words whose attribute is named entity;
Hotspot data clustering module 340, configured to cluster all the obtained hotspot data according to the text similarity between the text models of every two pieces of hotspot data, to obtain at least one cluster;
Hot topic keyword determination module 350, configured to, for each cluster, determine the keywords of the cluster according to the frequency with which the basic words whose attribute is named entity in the cluster occur in the text information of the hotspot data contained in the cluster, and to take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.
In one embodiment of the invention, the basic word set obtaining module 310 may further be configured to:
perform stop word filtering on the basic words in the set of basic words of each piece of hotspot data.
In one embodiment of the invention, the device may further comprise the following module:
Hot topic title determination module, configured to, for each cluster, look up at least one keyword with the highest frequency among the determined keywords of the cluster, and select one title from the titles of the hotspot data containing the found keyword(s) as the hot topic corresponding to the cluster.
By applying the device provided by the embodiments of the present invention, the server obtains hotspot data from preset websites and, by performing word segmentation, clustering and other processing on the hotspot data, determines the hot topics and their keywords in a timely manner, effectively avoiding the lag that arises when hot topics are determined through manual operation.
Corresponding to the method embodiment shown in Fig. 2, an embodiment of the present invention further provides a hot topic tracking device applied to a server. Referring to Figure 5, the device may comprise the following modules:
Text similarity determination module 410, configured to, for each hot topic, determine the text similarity between the keywords of the hot topic and the description information of each video file;
Video file tracking module 420, configured to track the video files corresponding to the hot topic according to the determined text similarity.
In one specific implementation of the present invention, the video file tracking module 420 may comprise the following submodules:
Candidate video set determination submodule, configured to determine the candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold;
Deduplication submodule, configured to perform video deduplication on the candidate video set of the hot topic;
Video file determination submodule, configured to determine the video files corresponding to the hot topic according to the deduplication result.
In one specific implementation of the present invention, the device may further comprise the following submodules:
Judging submodule, configured to judge whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold, and if so, to trigger the clustering submodule;
The clustering submodule, configured to perform hierarchical clustering on the determined video files corresponding to the hot topic, successively according to the publication time interval between videos with adjacent publication times, until the number of categories obtained is not greater than the preset second threshold;
Representative video determination submodule, configured to determine the representative video of each category according to the quality of the videos in that category;
Related video determination submodule, configured to determine the representative video of each category as a related video of the hot topic.
In one specific implementation of the present invention, the deduplication submodule may comprise the following units:
Video sorting unit, configured to sort the videos in the candidate video set of the hot topic in order of publication time from earliest to latest, according to the publication times of the video files in the candidate video set;
Duplicate video judging unit, configured to judge in turn whether two video files with adjacent publication times are duplicate videos, and if so, to trigger the deduplication unit;
The deduplication unit, configured to keep the video file with the earlier publication time in the candidate video set of the hot topic and delete the video file with the later publication time.
In one specific implementation of the present invention, the duplicate video judging unit is specifically configured to:
compute the text similarity of the description information of two video files with adjacent publication times, and determine from the computed text similarity whether the two video files are duplicate videos;
Or,
compute the visual feature similarity of two video files with adjacent publication times, and determine from the computed visual feature similarity whether the two video files are duplicate videos;
Or,
compute both the text similarity of the description information of two video files with adjacent publication times and the visual feature similarity of the two video files, and determine from the computed text similarity and visual feature similarity whether the two video files are duplicate videos.
By applying the device provided by the embodiments of the present invention, for each hot topic, the video files corresponding to the hot topic are tracked according to the text similarity between the keywords of the hot topic and the description information of the video files, and this is done automatically by the server. Even if a hot topic consists of many staged events, the server can execute this technical solution repeatedly to update the video files corresponding to the hot topic at regular intervals, saving manual operation costs.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements comprises not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of additional identical elements in the process, method, article or device that comprises the element.
The embodiments in this specification are described in a related manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
Those of ordinary skill in the art will understand that all or part of the steps in the above method embodiments may be carried out by hardware instructed by a program, and the program may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims (16)

1. A hot topic keyword determination method, characterized in that it is applied to a server, the method comprising:
performing word segmentation on the text information of each piece of hotspot data obtained from preset websites, to obtain the set of basic words of each piece of hotspot data;
for each piece of hotspot data, determining, in the set of basic words of that hotspot data and according to the frequency with which basic words whose attribute is named entity occur in the text information of that hotspot data, the basic words whose attribute is named entity used to build the text model of that hotspot data;
building the text model of each piece of hotspot data according to the determined basic words whose attribute is named entity;
clustering all the obtained hotspot data according to the text similarity between the text models of every two pieces of hotspot data, to obtain at least one cluster;
for each cluster, determining the keywords of the cluster according to the frequency with which the basic words whose attribute is named entity in the cluster occur in the text information of the hotspot data contained in the cluster, and taking the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.
2. The method according to claim 1, characterized in that, after obtaining the set of basic words of each piece of hotspot data and before determining the basic words whose attribute is named entity used to build the text model of the hotspot data, the method further comprises:
performing stop word filtering on the basic words in the set of basic words of each piece of hotspot data.
3. The method according to claim 1 or 2, characterized in that it further comprises:
for each cluster, looking up at least one keyword with the highest frequency among the determined keywords of the cluster;
selecting one title from the titles of the hotspot data containing the found keyword(s) as the hot topic corresponding to the cluster.
4. A hot topic tracking method based on the hot topic keyword determination method according to any one of claims 1 to 3, characterized in that it is applied to a server, the method comprising:
for each hot topic, determining the text similarity between the keywords of the hot topic and the description information of each video file;
tracking the video files corresponding to the hot topic according to the determined text similarity.
5. The method according to claim 4, characterized in that tracking the video files corresponding to the hot topic according to the determined text similarity comprises:
determining the candidate video set of the hot topic according to whether the determined text similarity is greater than a preset first threshold;
performing video deduplication on the candidate video set of the hot topic;
determining the video files corresponding to the hot topic according to the deduplication result.
6. The method according to claim 5, characterized in that, after determining the video files corresponding to the hot topic according to the deduplication result, the method further comprises:
judging whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold;
if so, performing hierarchical clustering on the determined video files corresponding to the hot topic, successively according to the publication time interval between videos with adjacent publication times, until the number of categories obtained is not greater than the preset second threshold;
determining the representative video of each category according to the quality of the videos in that category;
determining the representative video of each category as a related video of the hot topic.
7. The method according to claim 5 or 6, characterized in that performing video deduplication on the candidate video set of the hot topic comprises:
sorting the videos in the candidate video set in order of publication time from earliest to latest, according to the publication times of the video files in the candidate video set of the hot topic;
judging in turn whether two video files with adjacent publication times are duplicate videos, and if so, keeping the video file with the earlier publication time in the candidate video set of the hot topic and deleting the video file with the later publication time.
8. The method according to claim 7, characterized in that judging whether two video files with adjacent publication times are duplicate videos comprises:
computing the text similarity of the description information of the two video files with adjacent publication times, and determining from the computed text similarity whether the two video files are duplicate videos;
Or,
computing the visual feature similarity of the two video files with adjacent publication times, and determining from the computed visual feature similarity whether the two video files are duplicate videos;
Or,
computing both the text similarity of the description information of the two video files with adjacent publication times and the visual feature similarity of the two video files, and determining from the computed text similarity and visual feature similarity whether the two video files are duplicate videos.
9. A hot topic keyword determination device, characterized in that it is applied to a server, the device comprising:
a basic word set obtaining module, configured to perform word segmentation on the text information of each piece of hotspot data obtained from preset websites, to obtain the set of basic words of each piece of hotspot data;
a named-entity basic word determination module, configured to, for each piece of hotspot data, determine, in the set of basic words of that hotspot data and according to the frequency with which basic words whose attribute is named entity occur in the text information of that hotspot data, the basic words whose attribute is named entity used to build the text model of that hotspot data;
a text model building module, configured to build the text model of each piece of hotspot data according to the determined basic words whose attribute is named entity;
a hotspot data clustering module, configured to cluster all the obtained hotspot data according to the text similarity between the text models of every two pieces of hotspot data, to obtain at least one cluster;
a hot topic keyword determination module, configured to, for each cluster, determine the keywords of the cluster according to the frequency with which the basic words whose attribute is named entity in the cluster occur in the text information of the hotspot data contained in the cluster, and to take the keywords of the cluster as the keywords of the hot topic corresponding to the cluster.
10. The device according to claim 9, characterized in that the basic word set obtaining module is further configured to:
perform stop word filtering on the basic words in the set of basic words of each piece of hotspot data.
11. The device according to claim 9 or 10, characterized in that it further comprises:
a hot topic title determination module, configured to, for each cluster, look up at least one keyword with the highest frequency among the determined keywords of the cluster, and select one title from the titles of the hotspot data containing the found keyword(s) as the hot topic corresponding to the cluster.
12. A hot topic tracking device based on the hot topic keyword determination device according to any one of claims 9 to 11, characterized in that it is applied to a server, the device comprising:
Text similarity determination module, for for each much-talked-about topic, determines the text similarity of the keyword of this much-talked-about topic and the descriptor of each video file;
Video file tracing module, for according to determined text similarity, follows the trail of the video file that this much-talked-about topic is corresponding.
13. devices according to claim 12, is characterized in that, described video file tracing module, comprising:
Candidate video collection determination submodule, for whether being greater than default first threshold according to determined text similarity, determines the candidate video collection of this much-talked-about topic;
Duplicate removal process submodule, concentrates for the candidate video in this much-talked-about topic, carries out the process of video duplicate removal;
Video file determination submodule, for according to duplicate removal result, determines the video file that this much-talked-about topic is corresponding.
14. The device according to claim 13, characterized by further comprising:
A judging submodule, configured to judge whether the number of determined video files corresponding to the hot topic is greater than a preset second threshold, and if so, to trigger a clustering submodule;
The clustering submodule, configured to perform hierarchical clustering on the determined video files corresponding to the hot topic, successively according to the publication time interval between videos adjacent in publication time, until the number of categories obtained is not greater than the preset second threshold;
A representative video determination submodule, configured to determine the representative video of each category according to the quality of the videos in that category;
An associated video determination submodule, configured to take the representative video of each category as an associated video of the hot topic.
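The clustering and representative selection of claim 14 might look like the Python sketch below. It assumes that clusters are merged bottom-up by always joining the two time-adjacent clusters with the smallest publication gap, and that "quality" is a single numeric score; the field names `published` and `quality` and both function names are hypothetical.

```python
def cluster_by_publish_gap(videos, max_clusters):
    """Merge time-adjacent videos until at most max_clusters clusters remain.

    videos: list of dicts with "id", "published" (numeric timestamp) and
            "quality" (numeric score); this layout is an assumption.
    """
    ordered = sorted(videos, key=lambda v: v["published"])
    clusters = [[v] for v in ordered]  # start with one cluster per video
    while len(clusters) > max(max_clusters, 1):
        # Gap between the last video of cluster i and the first of cluster i+1.
        gaps = [clusters[i + 1][0]["published"] - clusters[i][-1]["published"]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]  # merge closest pair
    return clusters

def representative_videos(videos, max_clusters):
    """Return the id of the highest-quality video of each cluster."""
    return [max(c, key=lambda v: v["quality"])["id"]
            for c in cluster_by_publish_gap(videos, max_clusters)]
```

When the number of videos does not exceed `max_clusters`, the loop body never runs, which corresponds to the judging submodule not triggering the clustering submodule.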
15. The device according to claim 13 or 14, characterized in that the deduplication submodule comprises:
A video sorting unit, configured to sort the videos in the candidate video set of the hot topic in order of publication time, from earliest to latest;
A duplicate video judging unit, configured to judge successively whether two video files adjacent in publication time are duplicate videos, and if so, to trigger a deduplication processing unit;
The deduplication processing unit, configured to retain, in the candidate video set of the hot topic, the video file with the earlier publication time and to delete the video file with the later publication time.
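The sort, compare, and delete sequence of claim 15 can be sketched in Python as below. The duplicate test is passed in as a predicate because the claim leaves it to the judging unit. Comparing each video with the most recently retained one, rather than with its original neighbour, is an assumption, as is the `published` field name.

```python
def deduplicate(candidates, is_duplicate):
    """Drop later-published duplicates from a candidate video set.

    candidates:   list of dicts with a numeric "published" timestamp
                  (hypothetical layout)
    is_duplicate: callable(earlier_video, later_video) -> bool
    """
    # Sort by publication time, earliest first.
    ordered = sorted(candidates, key=lambda v: v["published"])
    kept = []
    for video in ordered:
        # Compare with the most recently retained video (an assumed reading
        # of "adjacent in publication time" once deletions have happened).
        if kept and is_duplicate(kept[-1], video):
            continue  # the later-published duplicate is discarded
        kept.append(video)
    return kept
```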
16. The device according to claim 15, characterized in that the duplicate video judging unit is specifically configured to:
Calculate the text similarity between the description information of two video files adjacent in publication time, and determine, according to the calculated text similarity, whether the two video files are duplicate videos;
Or,
Calculate the visual feature similarity of two video files adjacent in publication time, and determine, according to the calculated visual feature similarity, whether the two video files are duplicate videos;
Or,
Calculate both the text similarity between the description information of two video files adjacent in publication time and the visual feature similarity of the two video files, and determine, according to the calculated text similarity and visual feature similarity, whether the two video files are duplicate videos.
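The three alternatives of claim 16 can be illustrated by one Python function with a `mode` switch. Everything concrete here is an assumption: Jaccard overlap as the text similarity, cosine similarity over a precomputed feature vector as the visual similarity, the two threshold values, the field names `words` and `features`, and the rule that the combined mode requires both similarities to reach their thresholds.

```python
import math

def jaccard(words_a, words_b):
    """Jaccard overlap of two word lists (assumed text similarity measure)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def is_duplicate(video_a, video_b, mode="both", text_min=0.8, visual_min=0.9):
    """Judge whether two videos are duplicates.

    mode: "text", "visual", or "both" (both similarities must pass).
    Each video is a dict with "words" and "features" (hypothetical layout).
    """
    text_ok = jaccard(video_a["words"], video_b["words"]) >= text_min
    visual_ok = cosine(video_a["features"], video_b["features"]) >= visual_min
    if mode == "text":
        return text_ok
    if mode == "visual":
        return visual_ok
    return text_ok and visual_ok
```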
CN201510372462.0A 2015-06-30 2015-06-30 Method and device for tracing hot topics and confirming keywords Active CN104915447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510372462.0A CN104915447B (en) 2015-06-30 2015-06-30 Method and device for tracing hot topics and confirming keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510372462.0A CN104915447B (en) 2015-06-30 2015-06-30 Method and device for tracing hot topics and confirming keywords

Publications (2)

Publication Number Publication Date
CN104915447A true CN104915447A (en) 2015-09-16
CN104915447B CN104915447B (en) 2018-04-20

Family

ID=54084510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510372462.0A Active CN104915447B (en) 2015-06-30 2015-06-30 Method and device for tracing hot topics and confirming keywords

Country Status (1)

Country Link
CN (1) CN104915447B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
CN101923544A (en) * 2009-06-15 2010-12-22 北京百分通联传媒技术有限公司 Method for monitoring and displaying Internet hot spots
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN102945290A (en) * 2012-12-03 2013-02-27 北京奇虎科技有限公司 Hot microblog topic digging device and method
CN103577593A (en) * 2013-11-14 2014-02-12 中国科学院声学研究所 Method and system for video aggregation based on microblog hot topics

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843798A (en) * 2016-04-05 2016-08-10 江苏鼎中智能科技有限公司 Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages
CN107273389A (en) * 2016-04-08 2017-10-20 北京国双科技有限公司 The querying method and device of trial video
CN106202293A (en) * 2016-06-30 2016-12-07 北京奇艺世纪科技有限公司 The update method of a kind of accident corpus and device
CN106202293B (en) * 2016-06-30 2019-05-10 北京奇艺世纪科技有限公司 A kind of update method and device of emergency event corpus
CN106503064A (en) * 2016-09-29 2017-03-15 中国国防科技信息中心 A kind of generation method of self adaptation microblog topic summary
CN106503064B (en) * 2016-09-29 2019-07-02 中国国防科技信息中心 A kind of generation method of adaptive microblog topic abstract
CN107066633A (en) * 2017-06-15 2017-08-18 厦门创材健康科技有限公司 Deep learning method and apparatus based on human-computer interaction
CN110020421A (en) * 2018-01-10 2019-07-16 北京京东尚科信息技术有限公司 The session information method of abstracting and system of communication software, equipment and storage medium
CN108446296A (en) * 2018-01-24 2018-08-24 北京奇艺世纪科技有限公司 A kind of information processing method and device
CN108509517A (en) * 2018-03-09 2018-09-07 东南大学 A kind of streaming topic evolution tracking towards real-time news content
CN108509517B (en) * 2018-03-09 2021-05-11 东南大学 Streaming topic evolution tracking method for real-time news content
CN109271509B (en) * 2018-08-23 2021-05-28 武汉斗鱼网络科技有限公司 Live broadcast room topic generation method and device, computer equipment and storage medium
CN109271509A (en) * 2018-08-23 2019-01-25 武汉斗鱼网络科技有限公司 Generation method, device, computer equipment and the storage medium of direct broadcasting room topic
CN110876070A (en) * 2018-08-29 2020-03-10 中国电信股份有限公司 Content distribution system, processing method, and storage medium
CN109284286A (en) * 2018-09-12 2019-01-29 贵州省赤水市气象局 A method of it is concentrated from initial data and extracts validity feature
CN109284286B (en) * 2018-09-12 2021-04-06 贵州省赤水市气象局 Method for extracting effective characteristics from original data set
CN111309999A (en) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 Method and device for generating interactive scene content
CN111309999B (en) * 2018-12-11 2023-05-16 阿里巴巴集团控股有限公司 Method and device for generating interactive scene content
WO2020155496A1 (en) * 2019-01-31 2020-08-06 平安科技(深圳)有限公司 Public opinion tracking method and device for combined video-text data, and computer apparatus
CN111666467A (en) * 2019-03-07 2020-09-15 上海博泰悦臻网络技术服务有限公司 Vehicle, vehicle equipment and vehicle equipment news tracking reporting method thereof
CN110414232A (en) * 2019-06-26 2019-11-05 腾讯科技(深圳)有限公司 Rogue program method for early warning, device, computer equipment and storage medium
CN111027282A (en) * 2019-11-21 2020-04-17 精硕科技(北京)股份有限公司 Text duplicate removal method and device, electronic equipment and computer readable storage medium
CN111159551B (en) * 2019-12-30 2023-11-03 汉海信息技术(上海)有限公司 User-generated content display method and device and computer equipment
CN111159551A (en) * 2019-12-30 2020-05-15 汉海信息技术(上海)有限公司 Display method and device of user-generated content and computer equipment
CN111581493A (en) * 2020-04-07 2020-08-25 苏宁云计算有限公司 Video pushing method and device, computer equipment and storage medium
CN111881275A (en) * 2020-07-24 2020-11-03 新华智云科技有限公司 Efficient hotspot identification and matching method
CN111881275B (en) * 2020-07-24 2024-02-13 新华智云科技有限公司 Efficient hot spot identification and matching method
CN114938477A (en) * 2022-06-23 2022-08-23 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN114938477B (en) * 2022-06-23 2024-05-03 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN115858787A (en) * 2022-12-12 2023-03-28 交通运输部公路科学研究所 Hot spot extraction and mining method based on problem appeal information in road transportation
CN115858787B (en) * 2022-12-12 2023-08-01 交通运输部公路科学研究所 Hot spot extraction and mining method based on problem appeal information in road transportation
CN116561401A (en) * 2023-05-26 2023-08-08 北京国新汇金股份有限公司 Information hotspot refining method and system based on big data analysis
CN116561401B (en) * 2023-05-26 2024-03-15 北京国新汇金股份有限公司 Information hotspot refining method and system based on big data analysis

Also Published As

Publication number Publication date
CN104915447B (en) 2018-04-20

Similar Documents

Publication Publication Date Title
CN104915447A (en) Method and device for tracing hot topics and confirming keywords
US11580104B2 (en) Method, apparatus, device, and storage medium for intention recommendation
Weismayer et al. Identifying emerging research fields: a longitudinal latent semantic keyword analysis
US10572565B2 (en) User behavior models based on source domain
CN107220365B (en) Accurate recommendation system and method based on collaborative filtering and association rule parallel processing
US9317613B2 (en) Large scale entity-specific resource classification
AU2022201654A1 (en) System and engine for seeded clustering of news events
CN105022827B (en) A kind of Web news dynamic aggregation method of domain-oriented theme
CN104573130B (en) The entity resolution method and device calculated based on colony
CN111460252B (en) Automatic search engine method and system based on network public opinion analysis
US20150261773A1 (en) System and Method for Automatic Generation of Information-Rich Content from Multiple Microblogs, Each Microblog Containing Only Sparse Information
US20140207782A1 (en) System and method for computerized semantic processing of electronic documents including themes
CN110110225B (en) Online education recommendation model based on user behavior data analysis and construction method
CN105095434A (en) Recognition method and device for timeliness requirement
CN110888990A (en) Text recommendation method, device, equipment and medium
Theisen et al. Automatic discovery of political meme genres with diverse appearances
CN102428467A (en) Similarity-Based Feature Set Supplementation For Classification
Kawai et al. ChronoSeeker: Search engine for future and past events
CN112632405A (en) Recommendation method, device, equipment and storage medium
CN107341199A (en) A kind of recommendation method based on documentation & info general model
CN111932308A (en) Data recommendation method, device and equipment
CN110795613A (en) Commodity searching method, device and system and electronic equipment
Java et al. Detecting commmunities via simultaneous clustering of graphs and folksonomies
Chen et al. Data analysis and knowledge discovery in web recruitment—based on big data related jobs
CN105159898A (en) Searching method and searching device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant