CN108829699B - Hot event aggregation method and device - Google Patents
- Publication number
- CN108829699B (CN201810354569.6A)
- Authority
- CN
- China
- Prior art keywords
- report
- similarity
- seed
- title
- event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention provides a method and a device for aggregating hotspot events, wherein the method comprises the following steps: obtaining an original report based on the title of the hotspot event; determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report; generating a hotspot event cluster by using the seed report; calculating the similarity of each non-seed report with the title of the hotspot event and each report in the hotspot event cluster; obtaining the non-seed report with the highest similarity; judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold; and if so, storing the non-seed report with the highest similarity to the hotspot event cluster. According to the embodiment of the invention, the seed report and the similarity between reports and the hotspot event are introduced in the aggregation process, so that the clustering algorithm stays centered on the event, the similarity of texts can be measured more accurately, and a better aggregation effect can be obtained.
Description
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a method and an apparatus for aggregating hot events.
Background
Hot event aggregation is an important basic technology in natural language processing (NLP), and plays an important role in recommendation, search, and bubble services.
Reports related to a hot event are aggregated, and at present a clustering method based on TF-IDF word weights is mostly adopted, which captures the similarity between related reports to a certain extent. After the text is segmented into words, the TF-IDF of each word is calculated and used as the weight of that word. After a word vector is generated, the similarity is calculated from the cosine distance, and the corresponding reports are then aggregated by a clustering algorithm according to the similarity between the texts.
However, because TF-IDF does not consider the influence of the context of the text, it has some shortcomings in expressing similarity, which bring the following disadvantages when it is applied to aggregating hotspot events:
1. The calculation of TF-IDF depends on the size and the quality of the corpus. The larger and better the corpus, the more accurate the calculated TF-IDF, but the higher the cost of preparing the corpus.
2. TF-IDF is calculated under the assumption that words are independent, so the resulting word weights are also independent of one another. In an actual text, however, the words are closely related, which directly affects the subsequent calculation of text similarity.
3. When text similarity is calculated, the difference in information distribution between texts expresses their similarity more accurately, and in this respect relative entropy is superior to similarity calculated from TF-IDF word vectors.
In addition, because TF-IDF treats words independently, the emphasis of the text is ignored when the similarity between a report and an event is evaluated, so the similarity is judged inaccurately. For example, the weight of "World Cup" is high in both "2018 World Cup draw begins" and "2017 World Cup opening is about to begin", so the calculated similarity is high, but as hot events the two should not be grouped into one class.
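The World Cup example above can be sketched as follows. This is a minimal illustration, not the prior-art system itself: the tokenization of the two titles and the weight of 3.0 assigned to "world cup" are hypothetical stand-ins for what a real TF-IDF computation would produce.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b, weights=None):
    """Cosine similarity between two token lists using weighted term counts."""
    weights = weights or {}
    vec_a = {w: c * weights.get(w, 1.0) for w, c in Counter(tokens_a).items()}
    vec_b = {w: c * weights.get(w, 1.0) for w, c in Counter(tokens_b).items()}
    dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical tokenization of the two titles from the example.
title_1 = ["2018", "world cup", "draw", "begins"]
title_2 = ["2017", "world cup", "opening", "begins"]

# Hypothetical weights: "world cup" dominates, every other word defaults to 1.0.
weights = {"world cup": 3.0}

score = cosine_similarity(title_1, title_2, weights)
print(f"cosine similarity: {score:.3f}")
```

Because the heavily weighted shared term dominates both vectors, the score is high even though the two titles describe different events, which is the weakness the passage describes.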
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention provide a method and a device for aggregating hotspot events.
In order to solve the above problem, an embodiment of the present invention discloses a method for aggregating hotspot events, including:
obtaining an original report based on the title of the hotspot event;
determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
generating a hot event cluster by using the seed report;
calculating the similarity of each non-seed report with the title of the hotspot event and each report in the hotspot event cluster;
obtaining a non-seed report with the highest similarity;
judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;
and if so, storing the non-seed report with the highest similarity to the hotspot event cluster.
Preferably, the step of obtaining the original report based on the title of the hot spot event includes:
acquiring a title of a first report;
determining semantic similarity between the title of the first report and the title of the hotspot event;
and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.
Preferably, the original report includes one or more reports;
the step of determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report comprises:
performing word segmentation processing on the title of the hot event and the title of each original report;
calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
applying relative entropy to the weighted word frequencies to obtain the similarity of each original report and the title of the hotspot event;
obtaining an original report with the highest similarity;
and taking the original report with the highest similarity as a seed report, and taking the original reports except the seed report as non-seed reports.
Preferably, the step of generating the hot spot event cluster by using the seed report includes:
generating a hot event cluster;
storing the seed report to the hotspot event cluster.
Preferably, the step of calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster comprises:
respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;
respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;
respectively calculating a second similarity of each non-seed report and the title of the hotspot event;
and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.
Preferably, the method further comprises the following steps:
and carrying out deduplication on the reports in the hotspot event cluster.
Correspondingly, the embodiment of the invention discloses a hot event aggregation device, which comprises:
the original report acquisition module is used for acquiring an original report based on the title of the hotspot event;
a determining module for determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
the generating module is used for generating the hot event cluster by adopting the seed report;
the calculation module is used for calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster;
the non-seed report acquisition module is used for acquiring a non-seed report with the highest similarity;
the judging module is used for judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;
and the storage module is used for storing the non-seed reports with the highest similarity to the hotspot event cluster.
Preferably, the original report acquiring module comprises:
the title acquisition submodule is used for acquiring the title of the first report;
the similarity judgment submodule is used for judging the semantic similarity between the title of the first report and the title of the hotspot event;
an original report determining submodule, configured to determine that the first report is an original report when semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.
Preferably, the original report includes one or more reports;
the determining module comprises:
the word segmentation submodule is used for carrying out word segmentation on the title of the hot event and the title of each original report;
the word frequency calculation submodule is used for calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
the similarity obtaining submodule is used for applying relative entropy to the weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;
an original report acquisition submodule for acquiring an original report with the highest similarity;
and a seed report determination submodule for taking the original report with the highest similarity as a seed report and taking the original reports except the seed report as non-seed reports.
Preferably, the generating module includes:
the hot event cluster generating sub-module is used for generating a hot event cluster;
and the storage submodule is used for storing the seed report to the hot spot event cluster.
Preferably, the calculation module includes:
a first calculation submodule, configured to calculate similarity between each non-seed report and each report in the hot spot event cluster;
a second calculating submodule, configured to calculate a first similarity average between each non-seed report and each report in the hot spot event cluster;
a third calculation submodule for calculating a second similarity of each non-seed report to the title of the hotspot event;
and the fourth calculation submodule is used for respectively calculating the third similarity of each non-seed report by adopting the first similarity average value and the second similarity.
Preferably, the device further comprises:
and the duplication removing module is used for carrying out duplication removal on the reports in the hot spot event cluster.
The embodiment of the invention has the following advantages:
In the embodiment of the invention, an original report is obtained based on the title of the hot event, then one seed report and a plurality of non-seed reports are determined based on the title of the hot event and the original report, then a hot event cluster is generated by using the seed report, the similarity between each non-seed report and the title of the hot event and each report in the hot event cluster is calculated, then the non-seed report with the highest similarity is obtained, whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold is judged, and if so, the non-seed report with the highest similarity is stored in the hot event cluster. In this way, the seed report and the similarity between reports and the hotspot event are introduced in the aggregation process, so that the clustering algorithm stays centered on the event, the similarity of texts can be measured more accurately, and a better aggregation effect can be obtained.
Drawings
FIG. 1 is a flowchart illustrating the steps of an embodiment of a method for aggregating hotspot events according to the present invention;
FIG. 2 is a block diagram of an embodiment of an aggregation apparatus for hot spot events according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a method for aggregating hotspot events in the present invention is shown, which may specifically include the following steps:
Step 101, obtaining an original report based on the title of the hotspot event;
In practical applications, the search engine may acquire the hotspot event from a hot search board, or may extract the hotspot event from data whose query click volume rises steeply. The hotspot event may of course also be determined in other ways, which is not limited in this embodiment of the present invention.
In the embodiment of the present invention, there may be a single hotspot event, and accordingly a single hotspot event title.
In a preferred embodiment of the present invention, the step of obtaining the original report based on the title of the hot spot event includes:
acquiring a title of a first report;
determining semantic similarity between the title of the first report and the title of the hotspot event;
and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.
Specifically, the first report refers to one or more reports collected by the search engine according to the title of the hot event. The collection rule may be whether the title of a report is related to the title of the hot event, for example whether certain words and phrases are the same, or whether certain words and phrases in the content of the report are the same as those in the title of the hot event. If so, the report may be determined to be semantically similar to the title of the hot event, and is thereby determined to be an original report.
In practical applications, this step is allowed to collect some reports that are not related to the title of the hotspot event, so the semantic similarity threshold may be set relatively low in order to collect more original reports related to the title of the hotspot event.
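As a minimal sketch of this collection step, the snippet below keeps every candidate whose title shares enough words with the hotspot event title. The patent does not fix the semantic similarity measure, so the Jaccard word overlap, the function names, and the threshold of 0.2 are all assumptions made for illustration.

```python
def title_overlap(title_a, title_b):
    """Jaccard overlap between two segmented titles (an assumed stand-in
    for the unspecified semantic similarity measure)."""
    words_a, words_b = set(title_a), set(title_b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def collect_original_reports(event_title, candidate_titles, threshold=0.2):
    """Keep candidates whose overlap with the event title exceeds the threshold.

    The threshold is deliberately low so that more related reports are kept;
    unrelated ones are filtered out later during aggregation.
    """
    return [t for t in candidate_titles if title_overlap(event_title, t) > threshold]


event_title = ["woman", "denies", "pregnancy"]
candidates = [
    ["woman", "denies", "pregnancy", "netizens", "think", "woman", "lies"],
    ["stock", "market", "closes", "higher"],
]
originals = collect_original_reports(event_title, candidates)
print(len(originals), "original report(s) collected")
```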
Step 102, determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
After a plurality of original reports are obtained, the report most similar to the title of the hotspot event can be further selected from them to serve as the seed report. The seed report is the report with the highest semantic similarity to the title of the hot event, and it provides a good-quality basis when the reports of the hot event are aggregated, so that the quality of the subsequently aggregated reports can be guaranteed. Otherwise, if the quality of the seed report is poor (its semantic similarity to the title of the hot event is low), a knock-on effect may result, and the quality of the reports subsequently aggregated into the hot event will also be poor (their semantic similarity to the title of the hot event is low).
Of course, in practical applications, the following may also occur: assuming that only one original report is obtained under a certain condition, and the similarity of the original report and the hotspot event is very high, for example, the similarity is 99.5%, the report can be regarded as a seed report, and no non-seed report exists.
In a preferred embodiment of the present invention, the original report comprises one or more reports;
the step of determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report comprises:
performing word segmentation processing on the title of the hot event and the title of each original report;
calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
applying relative entropy to the weighted word frequencies to obtain the similarity of each original report and the title of the hotspot event;
obtaining an original report with the highest similarity;
and taking the original report with the highest similarity as a seed report, and taking the original reports except the seed report as non-seed reports.
Specifically, word segmentation is performed on the title of the hot event and the titles of all original reports, the word-weighted word frequency of each word in the title of the hot event and in each original report title is calculated, the similarity between each original report and the title of the hot event is calculated with the relative entropy formula, and finally the original report with the highest similarity is used as the seed report. The word-weighted word frequency combines the word frequency with the word weight; in the prior art, the similarity is calculated from the word frequency alone, so the obtained similarity is low.
The word weight is obtained by firstly segmenting the hot event title and then inputting each word obtained by segmenting into a preset word weight calculation model for calculation.
For example, suppose the title of the hot event is "a woman denies pregnancy" and the title of an original report is "a woman denies pregnancy, netizens think the woman lies". Segmenting the title of the hot event and the title of the original report yields six words, "woman", "denies", "pregnancy", "netizens", "think", and "lies", whose word frequencies are [3, 2, 2, 1, 1, 1]. These are normalized to [3/10, 2/10, 2/10, 1/10, 1/10, 1/10]. Finally, each normalized frequency is multiplied by the corresponding word weight, and the relative entropy formula is applied to the weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event. The formula of the relative entropy is:
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
where $P$ and $Q$ are the two weighted word-frequency distributions being compared and $x$ ranges over the words.
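A minimal sketch of seed selection with relative entropy is given below. Several details are assumptions rather than the patent's method: each title is turned into its own weighted, normalized distribution over the shared vocabulary, a small smoothing constant `eps` avoids division by zero, the event title is taken as the first argument of the divergence, and the lowest divergence is treated as the highest similarity. The names `weighted_distribution` and `pick_seed` are hypothetical.

```python
import math
from collections import Counter


def weighted_distribution(tokens, vocab, weights, eps=1e-6):
    """Weighted, smoothed, normalized word-frequency distribution over vocab."""
    counts = Counter(tokens)
    raw = [counts[w] * weights.get(w, 1.0) + eps for w in vocab]
    total = sum(raw)
    return [x / total for x in raw]


def kl_divergence(p, q):
    """Relative entropy D_KL(P || Q) for two aligned distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pick_seed(event_tokens, report_titles, weights):
    """Return (index, divergence) of the title closest to the event title."""
    best_index, best_div = None, float("inf")
    for i, tokens in enumerate(report_titles):
        vocab = sorted(set(event_tokens) | set(tokens))
        p = weighted_distribution(event_tokens, vocab, weights)
        q = weighted_distribution(tokens, vocab, weights)
        div = kl_divergence(p, q)
        if div < best_div:  # lower divergence is treated as higher similarity
            best_index, best_div = i, div
    return best_index, best_div


event = ["woman", "denies", "pregnancy"]
reports = [
    ["woman", "denies", "pregnancy", "netizens", "think", "woman", "lies"],
    ["woman", "wins", "award"],
]
weights = {}  # hypothetical: every word weight defaults to 1.0
seed_index, seed_div = pick_seed(event, reports, weights)
print("seed report index:", seed_index)
```

In a real system the word weights would come from the preset word weight calculation model mentioned above; here they are left at 1.0 so the effect of the divergence itself is visible.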
Step 103, generating a hot event cluster by using the seed report;
In a preferred embodiment of the present invention, the step of generating the hot spot event cluster by using the seed report includes:
generating a hot event cluster;
storing the seed report to the hotspot event cluster.
Specifically, after the seed report is determined, a hot spot event cluster can be generated, and then the seed report is stored as the first report of the hot spot event cluster in the hot spot event cluster.
Step 104, calculating the similarity of each non-seed report with the title of the hotspot event and each report in the hotspot event cluster;
Generally speaking, the search engine collects more than one original report but there is only one seed report, so all original reports other than the seed report are non-seed reports. Among the non-seed reports, some may have a relatively low correlation with the title of the hot event, and those reports can be filtered out, while others, whose correlation with the title is relatively high though lower than that of the seed report, may also be stored in the hot event cluster. Therefore, the non-seed reports need to be screened further.
In a preferred embodiment of the present invention, the step of calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster comprises:
respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;
respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;
respectively calculating a second similarity of each non-seed report and the title of the hotspot event;
and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.
For example, if there are 5 original reports A, B, C, D, and E, where B is the seed report, then the non-seed reports are A, C, D, and E.
1) Select any one of the non-seed reports, for example A, calculate the JS distance (Jensen-Shannon divergence) between A and each report in the hot spot event cluster, and average all the obtained JS distances to obtain the first similarity average value between A and the reports in the hot spot event cluster. Here the JS distance serves as the similarity measure: the smaller the JS distance, the higher the similarity. The JS distance formula is:
$$JS(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q)$$
where $D_{KL}$ is the relative entropy and $P$ and $Q$ are the word distributions of the two texts being compared.
2) Calculate the JS distance between A and the title of the hot event to obtain the second similarity;
3) Calculate the average of the first similarity average value and the second similarity to obtain the third similarity of A;
4) Calculate the third similarity of C, D, and E in turn.
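Steps 1) to 4) can be sketched as follows, assuming every text has already been converted into a probability distribution over one shared vocabulary. The function names are hypothetical, and the result is a distance, so a smaller value means a higher similarity.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D_KL(P || Q); terms with P(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon divergence between two aligned distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def third_similarity(candidate, cluster, event_title):
    """Mean of (average JS distance to cluster members) and (JS distance to title)."""
    first_average = sum(js_distance(candidate, member) for member in cluster) / len(cluster)
    second = js_distance(candidate, event_title)
    return (first_average + second) / 2


# Toy distributions over a three-word vocabulary (hypothetical values).
title = [0.4, 0.4, 0.2]
cluster = [[0.5, 0.3, 0.2]]           # contains only the seed report B
non_seed = {
    "A": [0.1, 0.1, 0.8],
    "C": [0.3, 0.3, 0.4],
    "D": [0.45, 0.35, 0.2],
    "E": [0.0, 0.2, 0.8],
}

scores = {name: third_similarity(dist, cluster, title) for name, dist in non_seed.items()}
closest = min(scores, key=scores.get)  # smallest distance = highest similarity
print("closest non-seed report:", closest)
```

The mixture distribution M is non-zero wherever either input is non-zero, so the JS distance stays finite even when a word is absent from one of the texts, which is not true of relative entropy alone.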
Step 105, acquiring the non-seed report with the highest similarity;
Continuing with the above example, after the third similarities of A, C, D, and E are calculated in turn, the report with the smallest JS distance, i.e., the highest similarity, is selected, for example D.
Step 106, judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold;
Continuing with the above example, it is determined whether the third similarity of D is greater than the similarity threshold.
Step 107, if so, storing the non-seed report with the highest similarity to the hotspot event cluster.
Continuing with the above example, if yes, D is stored in the hot spot event cluster. Thus, there are B, D two reports in the hotspot event cluster.
It should be noted that, after D is stored in the hot spot event cluster, the flow of 1) to 4) needs to be performed again. At this point only A, C, and E remain as non-seed reports and the cluster contains both B and D, so the first similarity average value changes and the third similarity changes with it; if it is not recalculated, the final third similarity will be biased.
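The repeated flow can be sketched as a greedy loop. This is an illustrative sketch under stated assumptions: `similarity` is any function where a higher value means more similar (a toy Jaccard overlap is used here rather than the JS distance), the threshold of 0.3 is hypothetical, and stopping as soon as the best remaining report fails the threshold is an assumption, since the text does not state a stopping rule.

```python
def aggregate(event_title, seed, non_seed, similarity, threshold):
    """Greedily grow the hotspot event cluster, rescoring after every addition."""
    cluster = [seed]
    remaining = list(non_seed)
    while remaining:
        scored = []
        for report in remaining:
            # First similarity average: mean similarity to current cluster members.
            first_average = sum(similarity(report, m) for m in cluster) / len(cluster)
            # Second similarity: similarity to the hotspot event title.
            second = similarity(report, event_title)
            # Third similarity: mean of the two values above.
            scored.append(((first_average + second) / 2, report))
        best_score, best_report = max(scored, key=lambda item: item[0])
        if best_score <= threshold:
            break  # assumed stopping rule: nothing left passes the threshold
        cluster.append(best_report)
        remaining.remove(best_report)
    return cluster


def jaccard(a, b):
    """Toy similarity between two segmented titles."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


event = ("a", "b", "c")
seed = ("a", "b", "c", "d")
others = [("a", "b", "x"), ("y", "z"), ("a", "b", "c")]
cluster = aggregate(event, seed, others, jaccard, threshold=0.3)
print(cluster)
```

Rescoring inside the loop is the point of the paragraph above: each time a report joins the cluster, the first similarity average of every remaining report is computed against the enlarged cluster.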
In a preferred embodiment of the present invention, the method further comprises: and carrying out deduplication on the reports in the hotspot event cluster.
In practical applications, some non-seed reports have very high similarity to the title of the hot event without being the seed report, and these reports are stored in the hot event cluster. The reports therefore need to be deduplicated so that duplicate reports are not stored.
A comparison between the algorithm of the present application and the prior art over 90 events is shown in Table 1:
TABLE 1
As can be seen from the F1-score evaluation index, the aggregation effect of the method is greatly improved compared with the prior art.
In the embodiment of the invention, an original report is obtained based on a title of the hot event, then one seed report and a plurality of non-seed reports are determined based on the title of the hot event and the original report, then a hot event cluster is generated by adopting the seed reports, the similarity between each non-seed report and the title of the hot event and each report in the hot event cluster is calculated, then a non-seed report with the highest similarity is obtained, whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value is judged, and if yes, the non-seed report with the highest similarity is stored in the hot event cluster. Therefore, reports, hot spot events and similarity among the reports are introduced in the aggregation process, so that the clustering algorithm can be centered around the events, the similarity of texts can be measured more accurately, and a better aggregation effect can be obtained.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 2, a block diagram of an embodiment of an aggregation apparatus for hotspot events in the present invention is shown, which may specifically include the following modules:
an original report obtaining module 201, configured to obtain an original report based on a title of the hotspot event;
a determining module 202, configured to determine a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
a generating module 203, configured to generate a hot event cluster by using the seed report;
a calculating module 204, configured to calculate similarity between each non-seed report and the title of the hotspot event and each report in the hotspot event cluster;
a non-seed report acquisition module 205, configured to acquire a non-seed report with a highest similarity;
a judging module 206, configured to judge whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold;
a storage module 207, configured to store the non-seed report with the highest similarity to the hotspot event cluster.
In a preferred embodiment of the present invention, the original report acquiring module comprises:
the title acquisition submodule is used for acquiring the title of the first report;
the similarity judgment submodule is used for judging the semantic similarity between the title of the first report and the title of the hotspot event;
an original report determining submodule, configured to determine that the first report is an original report when semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.
In a preferred embodiment of the present invention, the original report comprises one or more reports;
the determining module comprises:
the word segmentation submodule is used for carrying out word segmentation on the title of the hot event and the title of each original report;
the word frequency calculation submodule is used for calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
the similarity obtaining submodule is used for applying relative entropy to the weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;
an original report acquisition submodule for acquiring an original report with the highest similarity;
and a seed report determination submodule for taking the original report with the highest similarity as a seed report and taking the original reports except the seed report as non-seed reports.
In a preferred embodiment of the present invention, the generating module includes:
the hot event cluster generating sub-module is used for generating a hot event cluster;
and the storage submodule is used for storing the seed report to the hot spot event cluster.
In a preferred embodiment of the present invention, the calculation module includes:
a first calculation submodule, configured to calculate similarity between each non-seed report and each report in the hot spot event cluster;
a second calculating submodule, configured to calculate a first similarity average between each non-seed report and each report in the hot spot event cluster;
a third calculation submodule for calculating a second similarity of each non-seed report to the title of the hotspot event;
and the fourth calculation submodule is used for respectively calculating the third similarity of each non-seed report by adopting the first similarity average value and the second similarity.
In a preferred embodiment of the present invention, the device further comprises:
and the duplication removing module is used for carrying out duplication removal on the reports in the hot spot event cluster.
Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for the relevant points, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The hotspot event aggregation method and the hotspot event aggregation device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to aid understanding of the method and its core idea. A person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A method for aggregating hotspot events, comprising:
obtaining an original report based on the title of the hotspot event;
determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
generating a hotspot event cluster by using the seed report;
calculating the similarity of each non-seed report with the title of the hotspot event and each report in the hotspot event cluster;
obtaining a non-seed report with the highest similarity;
judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;
if yes, storing the non-seed report with the highest similarity to the hotspot event cluster;
wherein the original report comprises one or more reports;
the step of determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report comprises:
performing word segmentation processing on the title of the hotspot event and the title of each original report;
calculating the word frequency of each word in the title of the hotspot event and the attached word weight of each word in the title of each original report;
applying relative entropy to the word frequencies to obtain the similarity of each original report and the title of the hotspot event;
obtaining an original report with the highest similarity;
and taking the original report with the highest similarity as a seed report, and taking the original reports except the seed report as non-seed reports.
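As a non-limiting illustration, the seed selection recited in claim 1 might be sketched as follows. The claim does not fix the segmenter, the smoothing, or how relative entropy is turned into a similarity, so the whitespace `tokenize`, the `eps` smoothing, and the `1 / (1 + D)` mapping are all assumptions introduced here, and every function name is hypothetical.

```python
import math
from collections import Counter


def tokenize(title):
    # Stand-in for word segmentation: lowercase whitespace split (assumption).
    return title.lower().split()


def word_distribution(tokens, vocab, eps=1e-6):
    # Smoothed relative word frequencies over a shared vocabulary (assumption).
    counts = Counter(tokens)
    total = len(tokens) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}


def kl_divergence(p, q):
    # Relative entropy D(p || q); both distributions share the same keys.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def title_similarity(event_title, report_title):
    # Map divergence into (0, 1]; the 1 / (1 + D) mapping is an assumption.
    a, b = tokenize(event_title), tokenize(report_title)
    vocab = set(a) | set(b)
    p = word_distribution(a, vocab)
    q = word_distribution(b, vocab)
    return 1.0 / (1.0 + kl_divergence(p, q))


def split_seed(event_title, report_titles):
    # The report whose title is most similar to the event title becomes the seed.
    scores = [title_similarity(event_title, t) for t in report_titles]
    seed_index = max(range(len(report_titles)), key=scores.__getitem__)
    non_seeds = [t for i, t in enumerate(report_titles) if i != seed_index]
    return report_titles[seed_index], non_seeds
```

Smoothing keeps the divergence finite when a word appears in only one of the two titles; a production system for Chinese titles would replace `tokenize` with a real word segmenter.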
2. The method of claim 1, wherein the step of obtaining the original report based on the title of the hotspot event comprises:
acquiring a title of a first report;
determining semantic similarity between the title of the first report and the title of the hotspot event;
and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.
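As a non-limiting illustration, the threshold test of claim 2 might be sketched as follows. The claim does not say how semantic similarity is computed, so the similarity function is passed in as a parameter; the `jaccard` token overlap used as the default is only a placeholder and is not a semantic measure. All names are hypothetical.

```python
def jaccard(a, b):
    # Placeholder for semantic similarity: token-set overlap (assumption).
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_original_reports(event_title, candidate_titles, threshold, similarity=jaccard):
    # A candidate is kept only when its similarity is strictly greater
    # than the threshold, matching the "greater than" wording of the claim.
    return [t for t in candidate_titles if similarity(t, event_title) > threshold]
```

Passing the similarity function as an argument lets an embedding-based or other semantic measure be substituted without changing the filtering step.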
3. The method of claim 1, wherein the step of generating a hotspot event cluster using the seed report comprises:
generating a hotspot event cluster;
storing the seed report to the hotspot event cluster.
4. The method of claim 1 or 3, wherein the step of calculating the similarity between each non-seed report and the title of the hotspot event and each report in the hotspot event cluster comprises:
respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;
respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;
respectively calculating a second similarity of each non-seed report and the title of the hotspot event;
and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.
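As a non-limiting illustration, the similarity computation of claim 4 and the selection steps of claim 1 might be combined as sketched below. The claims do not state how the first similarity average and the second similarity are combined, nor whether the selection is repeated; the weighted sum with `alpha`, the repetition until no remaining report exceeds the threshold, the `token_overlap` measure, and the use of plain title strings as reports are all assumptions, and every name is hypothetical.

```python
def token_overlap(a, b):
    # Stand-in text similarity: token-set overlap (assumption).
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def third_similarity(report, cluster, event_title, sim, alpha=0.5):
    # First similarity average: mean similarity to reports already in the cluster.
    first_avg = sum(sim(report, member) for member in cluster) / len(cluster)
    # Second similarity: similarity to the hotspot event title.
    second = sim(report, event_title)
    # A weighted sum is an assumed way of combining the two values.
    return alpha * first_avg + (1 - alpha) * second


def aggregate(seed, non_seeds, event_title, sim, threshold, alpha=0.5):
    cluster = [seed]  # the cluster starts with the seed report only
    remaining = list(non_seeds)
    while remaining:
        scored = [(third_similarity(r, cluster, event_title, sim, alpha), r)
                  for r in remaining]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score <= threshold:
            break  # the best candidate fails the threshold, so stop
        cluster.append(best)
        remaining.remove(best)
    return cluster
```

Because the average is recomputed against the growing cluster on each pass, a report's score can change as new reports are admitted.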
5. The method of claim 1, further comprising:
and carrying out deduplication on the reports in the hotspot event cluster.
6. An aggregation apparatus of hotspot events, comprising:
the original report acquisition module is used for acquiring an original report based on the title of the hotspot event;
a determining module for determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
the generating module is used for generating the hotspot event cluster by adopting the seed report;
the calculation module is used for calculating the similarity between each non-seed report and the title of the hotspot event and each report in the hotspot event cluster;
the non-seed report acquisition module is used for acquiring a non-seed report with the highest similarity;
the judging module is used for judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;
the storage module is used for storing the non-seed reports with the highest similarity to the hotspot event cluster;
wherein the original report comprises one or more reports;
the determining module comprises:
the word segmentation submodule is used for carrying out word segmentation on the title of the hotspot event and the title of each original report;
the word frequency calculation submodule is used for calculating the word frequency of each word in the title of the hotspot event and the attached word weight of each word in the title of each original report;
the similarity obtaining submodule is used for applying relative entropy to the word frequencies to obtain the similarity between each original report and the title of the hotspot event;
an original report acquisition submodule for acquiring an original report with the highest similarity;
and a seed report determination submodule for taking the original report with the highest similarity as a seed report and taking the original reports except the seed report as non-seed reports.
7. The apparatus of claim 6, wherein the original report acquisition module comprises:
the title acquisition submodule is used for acquiring the title of the first report;
the similarity judgment submodule is used for judging the semantic similarity between the title of the first report and the title of the hotspot event;
an original report determining submodule, configured to determine that the first report is an original report when semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.
8. The apparatus of claim 6, wherein the generating module comprises:
the hotspot event cluster generating sub-module is used for generating a hotspot event cluster;
and the storage submodule is used for storing the seed report to the hotspot event cluster.
9. The apparatus of claim 6 or 8, wherein the computing module comprises:
a first calculation submodule, configured to calculate similarity between each non-seed report and each report in the hotspot event cluster;
a second calculating submodule, configured to calculate a first similarity average between each non-seed report and each report in the hotspot event cluster;
a third calculation submodule for calculating a second similarity of each non-seed report to the title of the hotspot event;
and the fourth calculation submodule is used for respectively calculating the third similarity of each non-seed report by adopting the first similarity average value and the second similarity.
10. The apparatus of claim 6, further comprising:
and the deduplication module is used for carrying out deduplication on the reports in the hotspot event cluster.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810354569.6A CN108829699B (en) | 2018-04-19 | 2018-04-19 | Hot event aggregation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810354569.6A CN108829699B (en) | 2018-04-19 | 2018-04-19 | Hot event aggregation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108829699A (en) | 2018-11-16 |
CN108829699B (en) | 2021-05-25 |
Family
ID=64154461
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810354569.6A Active CN108829699B (en) | 2018-04-19 | 2018-04-19 | Hot event aggregation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108829699B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134876B (en) * | 2019-01-29 | 2021-10-26 | 国家计算机网络与信息安全管理中心 | Network space population event sensing and detecting method based on crowd sensing sensor |
CN111460289B (en) * | 2020-03-27 | 2024-03-29 | 北京百度网讯科技有限公司 | News information pushing method and device |
CN113569563A (en) * | 2021-06-25 | 2021-10-29 | 北京房江湖科技有限公司 | Method and device for identifying hot friend circle text |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298635A (en) * | 2011-09-13 | 2011-12-28 | 苏州大学 | Method and system for fusing event information |
CN103955489A (en) * | 2014-04-15 | 2014-07-30 | 华南理工大学 | Distributed mass short text KNN (K Nearest Neighbor) classification algorithm and distributed mass short text KNN classification system based on information entropy feature weight quantification |
CN104915446A (en) * | 2015-06-29 | 2015-09-16 | 华南理工大学 | Automatic extracting method and system of event evolving relationship based on news |
CN106951554A (en) * | 2017-03-29 | 2017-07-14 | 浙江大学 | A kind of stratification hot news and its excavation and the method for visualizing of evolution |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103164427B (en) * | 2011-12-13 | 2016-03-02 | 中国移动通信集团公司 | News Aggreagation method and device |
CN102929977B (en) * | 2012-10-16 | 2015-07-22 | 浙江大学 | Event tracing method aiming at news website |
US9965459B2 (en) * | 2014-08-07 | 2018-05-08 | Accenture Global Services Limited | Providing contextual information associated with a source document using information from external reference documents |
KR101764696B1 (en) * | 2015-09-25 | 2017-08-04 | 충북대학교 산학협력단 | Method and System for determination of social network hot topic in consideration of user’s influence and time |
CN106021418B (en) * | 2016-05-13 | 2019-09-06 | 北京奇虎科技有限公司 | The clustering method and device of media event |
CN107679144B (en) * | 2017-09-25 | 2021-07-16 | 平安科技(深圳)有限公司 | News sentence clustering method and device based on semantic similarity and storage medium |
Non-Patent Citations (1)
Title |
---|
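Urban hotspot event sensing method based on mobile crowd-sensing data; Zhang Jiafan et al.; Computer Science (《计算机科学》); 2015-06-30; Vol. 42 (No. 6A); pp. 5-9, 37 * |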
Also Published As
Publication number | Publication date |
---|---|
CN108829699A (en) | 2018-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102092691B1 (en) | Web page training methods and devices, and search intention identification methods and devices | |
CN108829699B (en) | Hot event aggregation method and device | |
JP5984917B2 (en) | Method and apparatus for providing suggested words | |
CN107180093B (en) | Information searching method and device and timeliness query word identification method and device | |
US10565253B2 (en) | Model generation method, word weighting method, device, apparatus, and computer storage medium | |
EP2774061A1 (en) | Method and apparatus of ranking search results, and search method and apparatus | |
CN110019668A (en) | A kind of text searching method and device | |
CN106598999B (en) | Method and device for calculating text theme attribution degree | |
CN110322897B (en) | Audio retrieval identification method and device | |
CN107293308B (en) | A kind of audio-frequency processing method and device | |
CN106503175A (en) | The inquiry of Similar Text, problem extended method, device and robot | |
CN111090771B (en) | Song searching method, device and computer storage medium | |
US20160275355A1 (en) | Video Classification Method and Apparatus | |
CN107688563B (en) | Synonym recognition method and recognition device | |
CN110928986A (en) | Legal evidence sorting and recommending method, device, equipment and storage medium | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
CN106294358A (en) | The search method of a kind of information and system | |
CN107092679B (en) | Feature word vector obtaining method and text classification method and device | |
CN106997340B (en) | Word stock generation method and device and document classification method and device using word stock | |
CN110019806A (en) | A kind of document clustering method and equipment | |
US10776420B2 (en) | Fingerprint clustering for content-based audio recognition | |
CN113656575B (en) | Training data generation method and device, electronic equipment and readable medium | |
US20140136565A1 (en) | Similar contents searching apparatus based on user preference and similar contents searching method thereof | |
CN108460131B (en) | Classification label processing method and device | |
CN107423294A (en) | A kind of community image search method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||