CN108829699B - Hot event aggregation method and device - Google Patents

Info

Publication number
CN108829699B
Authority
CN
China
Prior art keywords
report
similarity
seed
title
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810354569.6A
Other languages
Chinese (zh)
Other versions
CN108829699A (en
Inventor
张轩玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810354569.6A priority Critical patent/CN108829699B/en
Publication of CN108829699A publication Critical patent/CN108829699A/en
Application granted granted Critical
Publication of CN108829699B publication Critical patent/CN108829699B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a method and a device for aggregating hot events. The method comprises the following steps: obtaining original reports based on the title of the hotspot event; determining one seed report and a plurality of non-seed reports based on the title of the hotspot event and the original reports; generating a hot event cluster by using the seed report; calculating the similarity of each non-seed report to the title of the hotspot event and to each report in the hotspot event cluster; obtaining the non-seed report with the highest similarity; judging whether the similarity of that non-seed report is greater than a similarity threshold; and if so, storing it in the hotspot event cluster. Because the seed report, the hotspot event, and the similarity between reports are all introduced into the aggregation process, the clustering algorithm stays more closely centered on the event, the similarity of texts is measured more accurately, and a better aggregation effect is obtained.

Description

Hot event aggregation method and device
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a method and an apparatus for aggregating hot events.
Background
Hot event aggregation is an important basic technology in NLP (natural language processing) and plays an important role in recommendation, search, and bubble services.
Reports related to a hot event are currently aggregated mostly by TF-IDF word-weight clustering, which captures the similarity between related reports to some extent. After the text is segmented into words, the TF-IDF of each word is calculated and used as that word's weight; a word vector is generated, the similarity is calculated from the cosine distance, and the corresponding reports are then aggregated by a clustering algorithm according to the similarity between texts.
However, because TF-IDF does not consider the influence of the text's context, it can express similarity poorly, which brings the following disadvantages when it is used to aggregate hotspot events:
1. The calculation of TF-IDF depends on the size and quality of the corpus: the larger and better the corpus, the more accurate the calculated TF-IDF, but the higher the cost of preparing the corpus.
2. TF-IDF is calculated under the assumption that words are independent, so the resulting word weights are also independent of one another. In actual text, however, the words are closely related, and this directly affects the subsequent calculation of text similarity.
3. When text similarity is calculated, the difference in information distribution between texts expresses their similarity more accurately; in this respect, relative entropy is superior to similarity computed from TF-IDF word vectors.
In addition, because TF-IDF treats words independently, the emphasis of a text is ignored when the similarity between a report and an event is evaluated, so the similarity is judged inaccurately. For example, "World Cup" has a high weight in both "2018 World Cup draw starts" and "2017 World Cup opening ceremony will start", so the calculated similarity is high, yet as hot events the two should not be grouped together.
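The TF-IDF and cosine baseline described above can be sketched as follows. This is a minimal illustration and not the prior-art implementation itself: the whitespace tokenization, the smoothed idf variant, the third filler title, and the function names `tfidf_vectors` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed idf (an assumption) so that shared words keep a nonzero weight.
            w: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# The two example titles plus one unrelated, hypothetical filler title.
titles = [
    "2018 world cup draw starts".split(),
    "2017 world cup opening ceremony will start".split(),
    "stock market closes higher".split(),
]
vecs = tfidf_vectors(titles)
sim_cups = cosine(vecs[0], vecs[1])   # the titles share "world" and "cup"
sim_other = cosine(vecs[0], vecs[2])  # no shared words
```

In this sketch the two World Cup titles receive a nonzero similarity purely from the shared words, although they describe different events, which is the weakness the paragraph above points out.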
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention provide a method and a device for aggregating hotspot events.
In order to solve the above problem, an embodiment of the present invention discloses a method for aggregating hotspot events, including:
obtaining an original report based on the title of the hotspot event;
determining a seed story and a plurality of non-seed stories based on the title of the hotspot event and the original story;
generating a hot event cluster by using the seed report;
calculating the similarity of each non-seed report with the title of the hotspot event and each report in the hotspot event cluster;
obtaining a non-seed report with the highest similarity;
judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;
and if so, storing the non-seed report with the highest similarity to the hotspot event cluster.
Preferably, the step of obtaining the original report based on the title of the hot spot event includes:
acquiring a title of a first report;
determining semantic similarity between the title of the first story and the title of the hotspot event;
and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.
Preferably, the original report includes one or more reports;
the step of determining a seed story and a plurality of non-seed stories based on the title of the hotspot event and the original story comprises:
performing word segmentation processing on the title of the hot event and the title of each original report;
calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
applying relative entropy to the word-weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;
obtaining an original report with the highest similarity;
and taking the original report with the highest similarity as a seed report, and taking the original reports except the seed report as non-seed reports.
Preferably, the step of generating the hot spot event cluster by using the seed report includes:
generating a hot event cluster;
storing the seed report to the hotspot event cluster.
Preferably, the step of calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster comprises:
respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;
respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;
respectively calculating a second similarity of each non-seed report and the title of the hotspot event;
and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.
Preferably, the method further comprises the following steps:
and carrying out deduplication on the reports in the hotspot event cluster.
Correspondingly, the embodiment of the invention discloses a hot event aggregation device, which comprises:
the original report acquisition module is used for acquiring an original report based on the title of the hotspot event;
a determining module for determining a seed story and a plurality of non-seed stories based on the title of the hotspot event and the original story;
the generating module is used for generating the hot event cluster by adopting the seed report;
the calculation module is used for calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster;
the non-seed report acquisition module is used for acquiring a non-seed report with the highest similarity;
the judging module is used for judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;
and the storage module is used for storing the non-seed reports with the highest similarity to the hotspot event cluster.
Preferably, the original report acquiring module comprises:
the title acquisition submodule is used for acquiring the title of the first report;
the similarity judgment submodule is used for judging the semantic similarity between the title of the first report and the title of the hotspot event;
an original report determining submodule, configured to determine that the first report is an original report when semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.
Preferably, the original report includes one or more reports;
the determining module comprises:
the word segmentation submodule is used for carrying out word segmentation on the title of the hot event and the title of each original report;
the word frequency calculation submodule is used for calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
the similarity obtaining submodule is used for applying relative entropy to the word-weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;
an original report acquisition submodule for acquiring an original report with the highest similarity;
and a seed report determination submodule for taking the original report with the highest similarity as a seed report and taking the original reports except the seed report as non-seed reports.
Preferably, the generating module includes:
the hot event cluster generating sub-module is used for generating a hot event cluster;
and the storage submodule is used for storing the seed report to the hot spot event cluster.
Preferably, the calculation module includes:
a first calculation submodule, configured to calculate similarity between each non-seed report and each report in the hot spot event cluster;
a second calculating submodule, configured to calculate a first similarity average between each non-seed report and each report in the hot spot event cluster;
a third calculation submodule for calculating a second similarity of each non-seed report to the title of the hotspot event;
and the fourth calculation submodule is used for respectively calculating the third similarity of each non-seed report by adopting the first similarity average value and the second similarity.
Preferably, the device further comprises:
and the duplication removing module is used for carrying out duplication removal on the reports in the hot spot event cluster.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, original reports are obtained based on the title of the hot event; one seed report and a plurality of non-seed reports are determined based on the title of the hot event and the original reports; a hot event cluster is generated by using the seed report; the similarity of each non-seed report to the title of the hot event and to each report in the hot event cluster is calculated; the non-seed report with the highest similarity is obtained; whether its similarity is greater than a similarity threshold is judged; and if so, that non-seed report is stored in the hot event cluster. Because the seed report, the hot event, and the similarity between reports are introduced into the aggregation process, the clustering algorithm stays centered on the event, the similarity of texts is measured more accurately, and a better aggregation effect is obtained.
Drawings
FIG. 1 is a flowchart illustrating the steps of an embodiment of a method for aggregating hotspot events according to the present invention;
FIG. 2 is a block diagram of an embodiment of an aggregation apparatus for hot spot events according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a method for aggregating hotspot events in the present invention is shown, which may specifically include the following steps:
step 101, obtaining an original report based on the title of the hotspot event;
in practical applications, the search engine may acquire the hotspot event from a trending search list, or extract it from data in which the number of query clicks rises sharply; of course, the hotspot event may also be determined in other ways, which is not limited in this embodiment of the present invention.
In the embodiment of the present invention, there may be a single hotspot event, and hence a single hotspot event title.
In a preferred embodiment of the present invention, the step of obtaining the original report based on the title of the hot spot event includes:
acquiring a title of a first report;
determining semantic similarity between the title of the first story and the title of the hotspot event;
and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.
Specifically, the first report refers to one or more reports collected by the search engine according to the title of the hot event. The collection rule may be whether the title of a report is related to the title of the hot event, for example whether certain words or phrases are the same, or whether certain words or phrases in the content of the report match the title of the hot event. If so, the report is determined to be semantically similar to the title of the hot event and is therefore determined to be an original report.
In practical applications, this step is allowed to collect some reports that are unrelated to the title of the hotspot event, so the semantic similarity threshold can be set relatively low in order to collect more original reports related to the title of the hotspot event.
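The collection step can be illustrated with a short sketch. The source does not specify how the semantic similarity is computed, so the word-overlap measure `title_overlap`, the threshold value 0.3, and the candidate titles below are hypothetical stand-ins chosen only to show a deliberately low threshold in use.

```python
def title_overlap(report_title, event_title):
    """Hypothetical similarity: share of event-title words found in the report title."""
    event_words = set(event_title.lower().split())
    report_words = set(report_title.lower().split())
    if not event_words:
        return 0.0
    return len(event_words & report_words) / len(event_words)

def collect_original_reports(candidate_titles, event_title, threshold=0.3):
    """Keep every candidate whose similarity to the event title exceeds the threshold."""
    return [t for t in candidate_titles if title_overlap(t, event_title) > threshold]

event = "woman denies pregnancy"
candidates = [
    "woman denies pregnancy rumor",               # overlap 3/3
    "netizens think woman lies about pregnancy",  # overlap 2/3
    "stock market closes higher",                 # overlap 0/3
]
originals = collect_original_reports(candidates, event)
```

With a low threshold, loosely related candidates are retained at this stage and left for the later seed selection and screening steps to sort out.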
Step 102, determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
after a plurality of original reports are obtained, the report most similar to the title of the hotspot event can be selected from them as the seed report. The seed report is the report with the highest semantic similarity to the title of the hot event. It provides a good-quality basis for aggregating the reports of the hot event and thereby safeguards the quality of the reports aggregated afterwards. If the quality of the seed report is poor (its semantic similarity to the title of the hot event is low), a knock-on effect may follow, and the reports subsequently aggregated for the hot event will also be of poor quality (low semantic similarity to the title of the hot event).
Of course, the following may also occur in practice: if only one original report is obtained and its similarity to the hotspot event is very high, for example 99.5%, that report can be taken as the seed report, and there are no non-seed reports.
In a preferred embodiment of the present invention, the original report comprises one or more reports;
the step of determining a seed story and a plurality of non-seed stories based on the title of the hotspot event and the original story comprises:
performing word segmentation processing on the title of the hot event and the title of each original report;
calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
applying relative entropy to the word-weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;
obtaining an original report with the highest similarity;
and taking the original report with the highest similarity as a seed report, and taking the original reports except the seed report as non-seed reports.
Specifically, word segmentation is performed on the title of the hot event and the title of each original report; the word-weighted word frequency of each word in these titles is then calculated; the similarity between each original report and the title of the hot event is calculated with the relative entropy formula; and finally the original report with the highest similarity is taken as the seed report. The word-weighted word frequency is the word frequency multiplied by the word weight. In the prior art, the similarity is calculated from the word frequency alone, so the similarity obtained is low.
The word weight is obtained by first segmenting the title of the hot event and then inputting each resulting word into a preset word weight calculation model.
For example, suppose the title of the hot event is "a woman denies pregnancy" and the title of an original report is "a woman denies pregnancy, a netizen thinks a woman lies". Segmenting the two titles yields 6 words: "woman", "denies", "pregnancy", "netizen", "thinks", and "lies". Their word frequencies are [3, 2, 2, 1, 1, 1], which normalize to [3/10, 2/10, 2/10, 1/10, 1/10, 1/10]. Each normalized frequency is then multiplied by the weight of the corresponding word, and the relative entropy formula is applied to the result to obtain the similarity between each original report and the title of the hotspot event. The formula of the relative entropy is as follows:
[Figure BDA0001634301880000071: formula of the relative entropy]
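As an illustration of seed selection, the sketch below uses the standard definition of relative entropy, D(P || Q) = sum over i of P(i) * log(P(i) / Q(i)), which may differ in notation from the formula in the original figure. Several details are assumptions not stated in the source: the distributions are built over the union vocabulary of the two titles, a small `eps` smoothing avoids division by zero, the smallest divergence is treated as the highest similarity, and the word weights are made-up values standing in for the preset word weight calculation model.

```python
import math
from collections import Counter

def weighted_distribution(words, vocab, weights, eps=1e-6):
    """Word frequency times word weight, smoothed and normalized over vocab."""
    counts = Counter(words)
    raw = [counts[w] * weights.get(w, 1.0) + eps for w in vocab]
    total = sum(raw)
    return [x / total for x in raw]

def kl_divergence(p, q):
    """Relative entropy D(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pick_seed(event_words, reports, weights):
    """Return (index of the least divergent report, list of all divergences)."""
    divergences = []
    for report_words in reports:
        vocab = sorted(set(event_words) | set(report_words))
        p = weighted_distribution(event_words, vocab, weights)
        q = weighted_distribution(report_words, vocab, weights)
        divergences.append(kl_divergence(p, q))
    best = min(range(len(reports)), key=divergences.__getitem__)
    return best, divergences

event = ["woman", "denies", "pregnancy"]
reports = [
    ["woman", "denies", "pregnancy", "netizen", "thinks", "woman", "lies"],
    ["woman", "denies", "pregnancy"],
    ["netizen", "thinks", "weather", "nice"],
]
# Hypothetical word weights; unlisted words default to 1.0.
weights = {"woman": 1.0, "denies": 2.0, "pregnancy": 2.0}
seed_index, divergences = pick_seed(event, reports, weights)
```

Here the second report, whose title matches the event title word for word, has zero divergence and is chosen as the seed; the remaining reports would become the non-seed reports.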
Step 103, generating a hot event cluster by using the seed report;
in a preferred embodiment of the present invention, the step of generating the hot spot event cluster by using the seed report includes:
generating a hot event cluster;
storing the seed report to the hotspot event cluster.
Specifically, after the seed report is determined, a hot spot event cluster can be generated, and the seed report is then stored in it as its first report.
Step 104, calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster;
generally speaking, the search engine collects more than one original report but there is only one seed report, so all original reports other than the seed report are non-seed reports. Among the non-seed reports, some have a relatively low correlation with the title of the hot event and can be filtered out, while others have a relatively high correlation with the title, though lower than that of the seed report, and can also be stored in the hot event cluster. The non-seed reports therefore need further screening.
In a preferred embodiment of the present invention, the step of calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster comprises:
respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;
respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;
respectively calculating a second similarity of each non-seed report and the title of the hotspot event;
and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.
For example, if there are 5 original reports A, B, C, D, and E, and B is the seed report, then the non-seed reports are A, C, D, and E.
1) selecting any one of the non-seed reports, for example A, calculating the JS distance (Jensen-Shannon divergence) between A and each report in the hot spot event cluster, and averaging all the JS distances obtained to get the first similarity average value between A and the reports in the hot spot event cluster. Here the JS distance serves as the similarity measure: the smaller the JS distance, the higher the similarity. The JS distance formula is:
[Figures BDA0001634301880000081 and BDA0001634301880000091: JS distance formula]
2) calculating the JS distance between A and the title of the hot event to obtain the second similarity;
3) calculating the average of the first similarity average value and the second similarity to obtain the third similarity of A;
4) calculating the third similarity of C, D, and E in turn.
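For reference, steps 1) to 4) can be written out using the standard definition of the Jensen-Shannon divergence. The notation is an assumption and may differ from the original formula images: P and Q are the word distributions of two texts, C is the current hot event cluster, T is the title of the hot event, and d_1, d_2, d_3 denote the first similarity average value, the second similarity, and the third similarity of report A.

```latex
\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q),
\qquad D_{\mathrm{KL}}(P \,\|\, M) = \sum_i P(i) \log \frac{P(i)}{M(i)}

\bar{d}_1(A) = \frac{1}{|C|} \sum_{r \in C} \mathrm{JS}(A, r),
\qquad d_2(A) = \mathrm{JS}(A, T),
\qquad d_3(A) = \frac{\bar{d}_1(A) + d_2(A)}{2}
```

Because these quantities are distances, the report with the smallest d_3 is the one with the highest similarity.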
Step 105, acquiring a non-seed report with the highest similarity;
continuing with the above example, after sequentially calculating the third similarity of A, C, D, E, a report with the smallest JS distance, i.e., the highest similarity, such as D, is selected.
Step 106, judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;
continuing with the above example, it is determined whether the third similarity of D is greater than the similarity threshold.
Step 107, if so, storing the non-seed report with the highest similarity to the hotspot event cluster.
Continuing with the above example, if so, D is stored in the hot spot event cluster, which then contains two reports, B and D.
It should be noted that after D is stored in the hot spot event cluster, the flow of 1) to 4) must be performed again: only A, C, and E remain as non-seed reports and the cluster now contains D, so the first similarity average value changes and the third similarity changes with it. Without recalculation, the final third similarity would be biased.
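Steps 104 to 107, including the recalculation after each addition, can be sketched as a greedy loop. This is a simplified illustration under stated assumptions rather than the patented implementation: word distributions are unweighted and smoothed with `eps`, the similarity threshold is expressed as a maximum JS distance `max_distance` (since a smaller distance means a higher similarity), and the loop stops as soon as the best remaining report fails the threshold. All function names and sample titles are hypothetical.

```python
import math
from collections import Counter

def distribution(words, vocab, eps=1e-6):
    """Smoothed, normalized word-frequency distribution over vocab."""
    counts = Counter(words)
    raw = [counts[w] + eps for w in vocab]
    total = sum(raw)
    return [x / total for x in raw]

def kl(p, q):
    """Relative entropy D(P || Q) with natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_distance(a_words, b_words):
    """Jensen-Shannon divergence between two bags of words."""
    vocab = sorted(set(a_words) | set(b_words))
    p, q = distribution(a_words, vocab), distribution(b_words, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def third_distance(report, cluster, event_title):
    """Average of the mean distance to the cluster and the distance to the title."""
    first_avg = sum(js_distance(report, r) for r in cluster) / len(cluster)
    second = js_distance(report, event_title)
    return (first_avg + second) / 2

def aggregate(event_title, seed, non_seeds, max_distance=0.3):
    """Greedily grow the cluster from the seed, re-scoring after every addition."""
    cluster, remaining = [seed], list(non_seeds)
    while remaining:
        # Scores are recomputed each round because the cluster has changed.
        best = min(remaining, key=lambda r: third_distance(r, cluster, event_title))
        if third_distance(best, cluster, event_title) >= max_distance:
            break  # assumed stopping rule: best candidate fails the threshold
        cluster.append(best)
        remaining.remove(best)
    return cluster

event = ["woman", "denies", "pregnancy"]
seed = ["woman", "denies", "pregnancy"]
non_seeds = [
    ["woman", "denies", "pregnancy", "rumor"],
    ["stock", "market", "closes", "higher"],
]
cluster = aggregate(event, seed, non_seeds)
```

In this example the closely related report joins the cluster, while the unrelated one has a distance near ln 2 (the maximum of the JS divergence with natural logarithms) and is left out.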
In a preferred embodiment of the present invention, the method further comprises: and carrying out deduplication on the reports in the hotspot event cluster.
In practical applications, some non-seed reports are highly similar to the title of the hot event without being the seed report, and these reports are stored in the hot event cluster. The reports in the cluster therefore need to be deduplicated so that duplicate reports are not stored.
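The source does not say how duplicates are detected. As one hypothetical possibility, the sketch below treats two reports as duplicates when their titles are identical after lowercasing and whitespace normalization; a real system might instead compare content or use a similarity cutoff.

```python
def deduplicate(cluster_titles):
    """Keep the first occurrence of each title, ignoring case and extra whitespace."""
    seen, unique = set(), []
    for title in cluster_titles:
        key = " ".join(title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique

cluster_titles = [
    "Woman denies pregnancy",
    "woman  denies pregnancy",
    "Netizens think woman lies",
]
deduped = deduplicate(cluster_titles)
```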
A comparison between the algorithm of the present application and the prior art over 90 events is shown in Table 1:
[Figures BDA0001634301880000092 and BDA0001634301880000101: table images]
TABLE 1
As can be seen from the F1-score evaluation index, the aggregation effect of the method is greatly improved compared with the prior art.
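For reference, the F1-score is conventionally defined as the harmonic mean of precision P and recall R. How precision and recall were measured for the aggregated clusters is not stated in the source, so the symbols here are generic.

```latex
F_1 = \frac{2 \, P \, R}{P + R}
```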
In the embodiment of the invention, original reports are obtained based on the title of the hot event; one seed report and a plurality of non-seed reports are determined based on the title of the hot event and the original reports; a hot event cluster is generated by using the seed report; the similarity of each non-seed report to the title of the hot event and to each report in the hot event cluster is calculated; the non-seed report with the highest similarity is obtained; whether its similarity is greater than a similarity threshold is judged; and if so, that non-seed report is stored in the hot event cluster. Because the seed report, the hot event, and the similarity between reports are introduced into the aggregation process, the clustering algorithm stays centered on the event, the similarity of texts is measured more accurately, and a better aggregation effect is obtained.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or a combination of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily required by the invention.
Referring to fig. 2, a block diagram of an embodiment of an aggregation apparatus for hotspot events in the present invention is shown, which may specifically include the following modules:
an original report obtaining module 201, configured to obtain an original report based on a title of the hotspot event;
a determining module 202, configured to determine a seed story and a plurality of non-seed stories based on the title of the hotspot event and the original story;
a generating module 203, configured to generate a hot event cluster by using the seed report;
a calculating module 204, configured to calculate similarity between each non-seed report and the title of the hotspot event and each report in the hotspot event cluster;
a non-seed report acquisition module 205, configured to acquire a non-seed report with a highest similarity;
a judging module 206, configured to judge whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold;
a storage module 207, configured to store the non-seed report with the highest similarity to the hotspot event cluster.
In a preferred embodiment of the present invention, the original report acquiring module comprises:
the title acquisition submodule is used for acquiring the title of the first report;
the similarity judgment submodule is used for judging the semantic similarity between the title of the first report and the title of the hotspot event;
an original report determining submodule, configured to determine that the first report is an original report when semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.
In a preferred embodiment of the present invention, the original report comprises one or more reports;
the determining module comprises:
the word segmentation submodule is used for carrying out word segmentation on the title of the hot event and the title of each original report;
the word frequency calculation submodule is used for calculating the word-weighted word frequency of each word in the title of the hot event and in the title of each original report;
the similarity obtaining submodule is used for applying relative entropy to the word-weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;
an original report acquisition submodule for acquiring an original report with the highest similarity;
and a seed report determination submodule for taking the original report with the highest similarity as a seed report and taking the original reports except the seed report as non-seed reports.
In a preferred embodiment of the present invention, the generating module includes:
the hot event cluster generating sub-module is used for generating a hot event cluster;
and the storage submodule is used for storing the seed report to the hot spot event cluster.
In a preferred embodiment of the present invention, the calculation module includes:
a first calculation submodule, configured to calculate similarity between each non-seed report and each report in the hot spot event cluster;
a second calculating submodule, configured to calculate a first similarity average between each non-seed report and each report in the hot spot event cluster;
a third calculation submodule for calculating a second similarity of each non-seed report to the title of the hotspot event;
and the fourth calculation submodule is used for respectively calculating the third similarity of each non-seed report by adopting the first similarity average value and the second similarity.
In a preferred embodiment of the present invention, the device further comprises:
and the duplication removing module is used for carrying out duplication removal on the reports in the hot spot event cluster.
Since the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or terminal that comprises the element.
The hotspot event aggregation method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to aid understanding of the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for aggregating hotspot events, comprising:
obtaining an original report based on the title of a hotspot event;
determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
generating a hotspot event cluster using the seed report;
calculating the similarity of each non-seed report to the title of the hotspot event and to each report in the hotspot event cluster;
obtaining the non-seed report with the highest similarity;
judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold;
if so, storing the non-seed report with the highest similarity in the hotspot event cluster;
wherein the original report comprises one or more reports;
the step of determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report comprises:
performing word segmentation on the title of the hotspot event and the title of each original report;
calculating the word frequency of each word in the title of the hotspot event and the attached word weight of each word in the title of each original report;
applying relative entropy to the word frequencies to obtain the similarity between each original report and the title of the hotspot event;
obtaining the original report with the highest similarity;
and taking the original report with the highest similarity as the seed report, and taking the original reports other than the seed report as non-seed reports.
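The seed selection recited in claim 1 can be sketched in Python as follows. The claim names relative entropy but does not give the exact formula, so several details here are assumptions: whitespace splitting stands in for a real word segmenter, additive smoothing (`eps`) keeps the logarithm finite, and the divergence is mapped to a similarity as 1 / (1 + KL). All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def word_freq(tokens: Sequence[str], vocab: Set[str], eps: float = 1e-6) -> Dict[str, float]:
    """Smoothed relative word frequencies over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[w] + eps for w in vocab)
    return {w: (counts[w] + eps) / total for w in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Relative entropy D(p || q); both distributions share the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def title_similarity(event_title: str, report_title: str) -> float:
    # Whitespace splitting is a stand-in for word segmentation.
    a, b = event_title.split(), report_title.split()
    vocab = set(a) | set(b)
    p, q = word_freq(a, vocab), word_freq(b, vocab)
    # Assumed mapping: lower divergence gives higher similarity, in (0, 1].
    return 1.0 / (1.0 + kl_divergence(p, q))


def split_seed(event_title: str, report_titles: List[str]) -> Tuple[str, List[str]]:
    """Return (seed title, non-seed titles) for a non-empty list of titles."""
    scores = [title_similarity(event_title, t) for t in report_titles]
    seed_idx = max(range(len(report_titles)), key=scores.__getitem__)
    non_seeds = [t for i, t in enumerate(report_titles) if i != seed_idx]
    return report_titles[seed_idx], non_seeds
```

For Chinese titles, a dedicated segmenter would replace `str.split`; the rest of the sketch is independent of how the tokens are produced.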
2. The method of claim 1, wherein the step of obtaining an original report based on the title of the hotspot event comprises:
acquiring the title of a first report;
determining the semantic similarity between the title of the first report and the title of the hotspot event;
and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.
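The threshold test of claim 2 amounts to a filter over candidate reports, sketched below in Python. The semantic similarity model is not specified in the claim, so `bow_cosine` (bag-of-words cosine similarity) is only a hypothetical stand-in that captures lexical rather than true semantic similarity; any callable returning a score could be passed instead. The threshold value in the usage is likewise arbitrary.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def bow_cosine(a: str, b: str) -> float:
    """Stand-in similarity: cosine of bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def select_original_reports(
    candidate_titles: Iterable[str],
    event_title: str,
    threshold: float,
    semantic_sim: Callable[[str, str], float] = bow_cosine,
) -> List[str]:
    # A candidate becomes an original report only when its title is
    # strictly more similar to the event title than the threshold.
    return [t for t in candidate_titles if semantic_sim(t, event_title) > threshold]
```

For example, `select_original_reports(titles, "earthquake hits city", threshold=0.5)` keeps only those titles whose score against the event title exceeds 0.5.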
3. The method of claim 1, wherein the step of generating a hotspot event cluster using the seed report comprises:
generating a hotspot event cluster;
storing the seed report in the hotspot event cluster.
4. The method of claim 1 or 3, wherein the step of calculating the similarity between each non-seed report and the title of the hotspot event and each report in the hotspot event cluster comprises:
respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;
respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;
respectively calculating a second similarity of each non-seed report and the title of the hotspot event;
and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.
5. The method of claim 1, further comprising:
deduplicating the reports in the hotspot event cluster.
6. An apparatus for aggregating hotspot events, comprising:
an original report acquisition module, configured to obtain an original report based on the title of a hotspot event;
a determining module, configured to determine a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;
a generating module, configured to generate a hotspot event cluster using the seed report;
a calculation module, configured to calculate the similarity of each non-seed report to the title of the hotspot event and to each report in the hotspot event cluster;
a non-seed report acquisition module, configured to obtain the non-seed report with the highest similarity;
a judging module, configured to judge whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold;
a storage module, configured to store the non-seed report with the highest similarity in the hotspot event cluster;
wherein the original report comprises one or more reports;
the determining module comprises:
a word segmentation submodule, configured to perform word segmentation on the title of the hotspot event and the title of each original report;
a word frequency calculation submodule, configured to calculate the word frequency of each word in the title of the hotspot event and the attached word weight of each word in the title of each original report;
a similarity obtaining submodule, configured to apply relative entropy to the word frequencies to obtain the similarity between each original report and the title of the hotspot event;
an original report acquisition submodule, configured to obtain the original report with the highest similarity;
and a seed report determination submodule, configured to take the original report with the highest similarity as the seed report and to take the original reports other than the seed report as non-seed reports.
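The cooperation of the calculation, acquisition, judging, and storage modules can be sketched as a greedy loop in Python. Two points are assumptions rather than claim language: the loop repeats until no remaining non-seed report exceeds the threshold, and the combined score is a weighted sum with weight `alpha`. `jaccard` is a hypothetical stand-in similarity, and `aggregate` is a hypothetical name.

```python
from statistics import mean
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Hypothetical stand-in similarity: token-set overlap of two texts."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def aggregate(
    event_title: str,
    seed: str,
    non_seeds: Sequence[str],
    threshold: float,
    sim: Callable[[str, str], float] = jaccard,
    alpha: float = 0.5,  # assumed weight, not taken from the source
) -> List[str]:
    cluster = [seed]              # generating module: cluster holds the seed
    remaining = list(non_seeds)

    def score(report: str) -> float:
        # Calculation module: average similarity to the cluster members,
        # combined with the similarity to the event title.
        avg = mean(sim(report, member) for member in cluster)
        return alpha * avg + (1.0 - alpha) * sim(report, event_title)

    while remaining:
        # Acquisition module: pick the non-seed report with the top score.
        best_score, best_idx = max((score(r), i) for i, r in enumerate(remaining))
        # Judging module: stop once the best candidate fails the threshold.
        if best_score <= threshold:
            break
        # Storage module: move the accepted report into the cluster.
        cluster.append(remaining.pop(best_idx))
    return cluster
```

Because scores are recomputed against the growing cluster on every pass, a report that was rejected early could in principle qualify later; the early `break` simply ends the search when the current best candidate does not pass.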
7. The apparatus of claim 6, wherein the original report acquisition module comprises:
a title acquisition submodule, configured to acquire the title of a first report;
a similarity judgment submodule, configured to determine the semantic similarity between the title of the first report and the title of the hotspot event;
an original report determining submodule, configured to determine that the first report is an original report when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.
8. The apparatus of claim 6, wherein the generating module comprises:
a hotspot event cluster generating submodule, configured to generate a hotspot event cluster;
and a storage submodule, configured to store the seed report in the hotspot event cluster.
9. The apparatus of claim 6 or 8, wherein the calculation module comprises:
a first calculation submodule, configured to calculate the similarity between each non-seed report and each report in the hotspot event cluster;
a second calculation submodule, configured to calculate a first similarity average of each non-seed report with respect to the reports in the hotspot event cluster;
a third calculation submodule, configured to calculate a second similarity between each non-seed report and the title of the hotspot event;
and a fourth calculation submodule, configured to calculate a third similarity of each non-seed report from the first similarity average and the second similarity.
10. The apparatus of claim 6, further comprising:
a deduplication module, configured to deduplicate the reports in the hotspot event cluster.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810354569.6A CN108829699B (en) 2018-04-19 2018-04-19 Hot event aggregation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810354569.6A CN108829699B (en) 2018-04-19 2018-04-19 Hot event aggregation method and device

Publications (2)

Publication Number Publication Date
CN108829699A (en) 2018-11-16
CN108829699B (en) 2021-05-25

Family

ID=64154461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810354569.6A Active CN108829699B (en) 2018-04-19 2018-04-19 Hot event aggregation method and device

Country Status (1)

Country Link
CN (1) CN108829699B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134876B (en) * 2019-01-29 2021-10-26 国家计算机网络与信息安全管理中心 Network space population event sensing and detecting method based on crowd sensing sensor
CN111460289B (en) * 2020-03-27 2024-03-29 北京百度网讯科技有限公司 News information pushing method and device
CN113569563A (en) * 2021-06-25 2021-10-29 北京房江湖科技有限公司 Method and device for identifying hot friend circle text

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298635A (en) * 2011-09-13 2011-12-28 苏州大学 Method and system for fusing event information
CN103955489A (en) * 2014-04-15 2014-07-30 华南理工大学 Distributed mass short text KNN (K Nearest Neighbor) classification algorithm and distributed mass short text KNN classification system based on information entropy feature weight quantification
CN104915446A (en) * 2015-06-29 2015-09-16 华南理工大学 Automatic extracting method and system of event evolving relationship based on news
CN106951554A (en) * 2017-03-29 2017-07-14 浙江大学 A kind of stratification hot news and its excavation and the method for visualizing of evolution

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164427B (en) * 2011-12-13 2016-03-02 中国移动通信集团公司 News Aggreagation method and device
CN102929977B (en) * 2012-10-16 2015-07-22 浙江大学 Event tracing method aiming at news website
US9965459B2 (en) * 2014-08-07 2018-05-08 Accenture Global Services Limited Providing contextual information associated with a source document using information from external reference documents
KR101764696B1 (en) * 2015-09-25 2017-08-04 충북대학교 산학협력단 Method and System for determination of social network hot topic in consideration of user’s influence and time
CN106021418B (en) * 2016-05-13 2019-09-06 北京奇虎科技有限公司 The clustering method and device of media event
CN107679144B (en) * 2017-09-25 2021-07-16 平安科技(深圳)有限公司 News sentence clustering method and device based on semantic similarity and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
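Urban hotspot event sensing method based on mobile crowd-sensing data (基于移动群智数据的城市热点事件感知方法); 张佳凡 et al.; 《计算机科学》 (Computer Science); 2015-06-30; Vol. 42, No. 6A; pp. 5-9, 37 *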

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant