CN108829699B

CN108829699B - Hot event aggregation method and device

Info

Publication number: CN108829699B
Application number: CN201810354569.6A
Authority: CN
Inventors: 张轩玮
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2018-04-19
Filing date: 2018-04-19
Publication date: 2021-05-25
Anticipated expiration: 2038-04-19
Also published as: CN108829699A

Abstract

An embodiment of the invention provides a method and a device for aggregating hot events. The method comprises: obtaining original reports based on the title of a hotspot event; determining one seed report and a plurality of non-seed reports based on the title of the hotspot event and the original reports; generating a hot event cluster from the seed report; calculating the similarity of each non-seed report to the title of the hotspot event and to each report in the hotspot event cluster; obtaining the non-seed report with the highest similarity; judging whether that similarity is greater than a similarity threshold; and if so, storing the non-seed report with the highest similarity in the hotspot event cluster. Because the seed report, the hotspot event, and the similarity between reports are all taken into account during aggregation, the clustering algorithm centers more closely on the event, text similarity is measured more accurately, and a better aggregation effect is obtained.

Description

Hot event aggregation method and device

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a method and an apparatus for aggregating hot events.

Background

Hot event aggregation is an important basic technology in NLP (natural language processing) and plays an important role in recommendation, search, and bubble services.

Reports related to a hot event are currently aggregated mostly by clustering on TF-IDF word weights, which captures the similarity between related reports to a certain extent. The text is first segmented into words, and the TF-IDF of each word is calculated to serve as its weight. A word vector is then generated for each text, the similarity between texts is calculated from the cosine distance, and the corresponding reports are aggregated by a clustering algorithm according to that similarity.

However, because TF-IDF does not take the context of the text into account, it expresses similarity imperfectly, which brings the following disadvantages when it is applied to aggregating hotspot events:

1. The calculation of TF-IDF depends on the size and quality of the corpus: the larger and better the corpus, the more accurate the calculated TF-IDF, but the higher the cost of preparing the corpus.

2. TF-IDF is calculated on the assumption that words are independent, so the resulting word weights are also independent of one another. In real text, however, the words are closely related, and ignoring this directly affects the subsequent calculation of text similarity.

3. When text similarity is calculated, the difference in information distribution between texts expresses their similarity more accurately; in this respect, relative entropy is superior to similarity calculated from TF-IDF word vectors.

In addition, because TF-IDF treats words as independent, the emphasis of the text is ignored when the similarity between a report and an event is evaluated, so the similarity is judged inaccurately. For example, "world cup" carries a high weight in both "2018 world cup drawing start" and "2017 world cup open-curtain will start", so the calculated similarity is high; as hot events, however, the two should not be grouped together.
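The prior-art pipeline and the weakness just described can be illustrated with a minimal sketch. It is not taken from the patent: the function names `tfidf_vectors` and `cosine`, the smoothed IDF `log(n / df) + 1`, and the pre-tokenized English sample titles are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: TF-IDF weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed smoothing: the +1 keeps words that occur in every document.
            word: (count / len(doc)) * (math.log(n / df[word]) + 1.0)
            for word, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


titles = [
    ["2018", "world", "cup", "draw", "starts"],
    ["2017", "world", "cup", "opening", "starts"],
    ["woman", "denies", "pregnancy"],
]
v = tfidf_vectors(titles)
print(cosine(v[0], v[1]))  # two different World Cup events still share most weighted words
print(cosine(v[0], v[2]))  # no shared words
```

Because every word is weighted independently, the shared words "world", "cup" and "starts" pull the first two titles together even though they describe different events.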

Disclosure of Invention

In view of the foregoing problems, embodiments of the present invention provide a method and a device for aggregating hotspot events.

In order to solve the above problem, an embodiment of the present invention discloses a method for aggregating hotspot events, including:

obtaining an original report based on the title of the hotspot event;

determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;

generating a hot event cluster by using the seed report;

calculating the similarity of each non-seed report with the title of the hotspot event and each report in the hotspot event cluster;

obtaining a non-seed report with the highest similarity;

judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;

and if so, storing the non-seed report with the highest similarity to the hotspot event cluster.

Preferably, the step of obtaining the original report based on the title of the hot spot event includes:

acquiring a title of a first report;

determining semantic similarity between the title of the first report and the title of the hotspot event;

and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.

Preferably, the original report includes one or more reports;

the step of determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report comprises:

performing word segmentation processing on the title of the hot event and the title of each original report;

calculating the word frequency with attached word weight of each word in the title of the hot event and in the title of each original report;

applying relative entropy to the weighted word frequencies to obtain the similarity of each original report and the title of the hotspot event;

obtaining an original report with the highest similarity;

and taking the original report with the highest similarity as a seed report, and taking the original reports except the seed report as non-seed reports.

Preferably, the step of generating the hot spot event cluster by using the seed report includes:

generating a hot event cluster;

storing the seed report to the hotspot event cluster.

Preferably, the step of calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster comprises:

respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;

respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;

respectively calculating a second similarity of each non-seed report and the title of the hotspot event;

and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.

Preferably, the method further comprises the following steps:

and carrying out deduplication on the reports in the hotspot event cluster.

Correspondingly, the embodiment of the invention discloses a hot event aggregation device, which comprises:

the original report acquisition module is used for acquiring an original report based on the title of the hotspot event;

a determining module for determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;

the generating module is used for generating the hot event cluster by adopting the seed report;

the calculation module is used for calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster;

the non-seed report acquisition module is used for acquiring a non-seed report with the highest similarity;

the judging module is used for judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;

and the storage module is used for storing the non-seed reports with the highest similarity to the hotspot event cluster.

Preferably, the original report acquiring module comprises:

the title acquisition submodule is used for acquiring the title of the first report;

the similarity judgment submodule is used for judging the semantic similarity between the title of the first report and the title of the hotspot event;

an original report determining submodule, configured to determine that the first report is an original report when semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.

Preferably, the original report includes one or more reports;

the determining module comprises:

the word segmentation submodule is used for carrying out word segmentation on the title of the hot event and the title of each original report;

the word frequency calculation submodule is used for calculating the word frequency with attached word weight of each word in the title of the hot event and in the title of each original report;

the similarity obtaining submodule is used for applying relative entropy to the weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;

an original report acquisition submodule for acquiring an original report with the highest similarity;

and a seed report determination submodule for taking the original report with the highest similarity as a seed report and taking the original reports except the seed report as non-seed reports.

Preferably, the generating module includes:

the hot event cluster generating sub-module is used for generating a hot event cluster;

and the storage submodule is used for storing the seed report to the hot spot event cluster.

Preferably, the calculation module includes:

a first calculation submodule, configured to calculate similarity between each non-seed report and each report in the hot spot event cluster;

a second calculating submodule, configured to calculate a first similarity average between each non-seed report and each report in the hot spot event cluster;

a third calculation submodule for calculating a second similarity of each non-seed report to the title of the hotspot event;

and the fourth calculation submodule is used for respectively calculating the third similarity of each non-seed report by adopting the first similarity average value and the second similarity.

Preferably, the device further comprises:

and the duplication removing module is used for carrying out duplication removal on the reports in the hot spot event cluster.

The embodiment of the invention has the following advantages:

In the embodiment of the invention, original reports are obtained based on the title of the hot event; one seed report and a plurality of non-seed reports are determined based on the title of the hot event and the original reports; a hot event cluster is generated from the seed report; the similarity of each non-seed report to the title of the hot event and to each report in the hot event cluster is calculated; the non-seed report with the highest similarity is obtained; and it is judged whether that similarity is greater than a similarity threshold, in which case the non-seed report with the highest similarity is stored in the hot event cluster. Because the seed report, the hotspot event, and the similarity between reports are all taken into account during aggregation, the clustering algorithm centers on the event, text similarity is measured more accurately, and a better aggregation effect is obtained.

Drawings

FIG. 1 is a flowchart illustrating the steps of an embodiment of a method for aggregating hotspot events according to the present invention;

FIG. 2 is a block diagram of an embodiment of an aggregation apparatus for hot spot events according to the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Referring to fig. 1, a flowchart illustrating steps of an embodiment of a method for aggregating hotspot events in the present invention is shown, which may specifically include the following steps:

Step 101, obtaining an original report based on the title of the hotspot event;

In practical applications, the search engine may obtain the hotspot event from a hot-search list, or extract it from data whose query click volume rises steeply; the hotspot event may of course also be determined in other ways, which is not limited in this embodiment of the present invention.

In the embodiment of the present invention, there may be a single hotspot event and, accordingly, a single hotspot event title.

In a preferred embodiment of the present invention, the step of obtaining the original report based on the title of the hot spot event includes:

acquiring a title of a first report;

determining semantic similarity between the title of the first report and the title of the hotspot event;

and when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold, determining that the first report is an original report.

Specifically, the first report refers to one or more reports collected by the search engine according to the title of the hot event. The collection rule may be whether the title of a report is related to the title of the hot event, for example whether certain words and phrases are the same, or whether certain words and phrases in the content of the report match the title of the hot event. If so, the report is determined to be semantically similar to the title of the hot event and is therefore determined to be an original report.

In practical applications, this step may also collect some reports that are unrelated to the title of the hotspot event; the semantic similarity threshold may therefore be set relatively low at this stage, so that more original reports related to the title of the hotspot event are collected.
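As a rough illustration of this collection step, the sketch below keeps every candidate title whose similarity to the event title exceeds a deliberately low threshold. The text does not fix how semantic similarity is measured, so the word-overlap (Jaccard) score, the whitespace tokenization, the names `title_similarity` and `collect_original_reports`, and the threshold value 0.2 are all assumptions.

```python
def title_similarity(a_tokens, b_tokens):
    """Assumed stand-in for semantic similarity: Jaccard overlap of the word sets."""
    a, b = set(a_tokens), set(b_tokens)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def collect_original_reports(event_title, first_report_titles, threshold=0.2):
    """Keep the first reports whose title similarity exceeds the threshold."""
    event_tokens = event_title.lower().split()
    return [
        title
        for title in first_report_titles
        if title_similarity(event_tokens, title.lower().split()) > threshold
    ]


candidates = [
    "woman denies pregnancy rumor",
    "stock market falls",
    "netizens think woman lies about pregnancy",
]
originals = collect_original_reports("woman denies pregnancy", candidates)
print(originals)
```

A low threshold lets loosely related titles through at this stage; the later seed selection and cluster-based screening are what tighten the result.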

Step 102, determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;

After a plurality of original reports are obtained, the report most similar to the title of the hotspot event can be selected from them to serve as the seed report. The seed report is the report with the highest semantic similarity to the title of the hot event, and it provides a good-quality basis for aggregating the reports of the hot event, so that the quality of the reports aggregated subsequently can be guaranteed. Conversely, if the quality of the seed report is poor (its semantic similarity to the title of the hot event is low), a knock-on effect may follow, and the reports subsequently aggregated into the hot event will also be of poor quality (low semantic similarity to the title of the hot event).

Of course, in practical applications, the following may also occur: assuming that only one original report is obtained under a certain condition, and the similarity of the original report and the hotspot event is very high, for example, the similarity is 99.5%, the report can be regarded as a seed report, and no non-seed report exists.

In a preferred embodiment of the present invention, the original report comprises one or more reports;

the step of determining a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report comprises:

performing word segmentation processing on the title of the hot event and the title of each original report;

calculating the word frequency with attached word weight of each word in the title of the hot event and in the title of each original report;

applying relative entropy to the weighted word frequencies to obtain the similarity of each original report and the title of the hotspot event;

obtaining an original report with the highest similarity;

and taking the original report with the highest similarity as a seed report, and taking the original reports except the seed report as non-seed reports.

Specifically, word segmentation is performed on the title of the hot event and the titles of all original reports; the word frequency with attached word weight is then calculated for each word in the title of the hot event and in the titles of the original reports; the similarity between each original report and the title of the hot event is calculated with the relative entropy formula; and finally the original report with the highest similarity is used as the seed report. The word frequency with attached word weight is the word frequency multiplied by the word weight. The prior art calculates the similarity from the word frequency alone, so the similarity it obtains is low.

The word weight is obtained by first segmenting the hot event title and then inputting each word obtained by the segmentation into a preset word weight calculation model.

For example, suppose the title of the hot event is "a woman denies pregnancy" and the title of an original report is "a woman denies pregnancy, a net friend deems a woman lies". Segmenting the two titles yields 6 words: "woman", "denies", "pregnancy", "net friend", "deem" and "lie". The word frequencies of the 6 words are [3, 2, 2, 1, 1, 1], which are normalized to [3/10, 2/10, 2/10, 1/10, 1/10, 1/10]. Finally, each normalized frequency is multiplied by the weight of the corresponding word, and the relative entropy formula is applied to the weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event. The relative entropy takes its standard form:

$$D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

where P and Q denote the two weighted word-frequency distributions being compared and i ranges over the words.
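The seed selection described above can be sketched as follows. The text does not say how the two distributions are aligned or how zero frequencies are handled, so this sketch assumes a shared vocabulary per title pair, a small additive constant `eps` to avoid division by zero, and a plain dict `weights` standing in for the preset word weight calculation model; the names `weighted_distribution`, `relative_entropy` and `pick_seed` are likewise hypothetical. A smaller relative entropy is read as a higher similarity.

```python
import math
from collections import Counter


def weighted_distribution(tokens, vocab, weights, eps=1e-9):
    """Word frequency times word weight over a shared vocabulary, normalized to sum to 1."""
    counts = Counter(tokens)
    # Assumption: words missing from `weights` get weight 1.0; eps avoids zero entries.
    raw = [counts[word] * weights.get(word, 1.0) + eps for word in vocab]
    total = sum(raw)
    return [value / total for value in raw]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(P || Q) for two aligned distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pick_seed(event_tokens, reports, weights):
    """reports maps report id -> title tokens. Returns (seed id, non-seed ids)."""
    scores = {}
    for report_id, tokens in reports.items():
        vocab = sorted(set(event_tokens) | set(tokens))
        p = weighted_distribution(event_tokens, vocab, weights)
        q = weighted_distribution(tokens, vocab, weights)
        scores[report_id] = relative_entropy(p, q)
    # Lowest divergence from the event title means highest similarity.
    seed = min(scores, key=scores.get)
    return seed, [report_id for report_id in reports if report_id != seed]


event = ["woman", "denies", "pregnancy"]
reports = {
    "A": ["woman", "denies", "pregnancy", "netizens", "think", "woman", "lies"],
    "B": ["woman", "denies", "pregnancy"],
    "C": ["stock", "market", "falls"],
}
seed, non_seeds = pick_seed(event, reports, weights={"pregnancy": 2.0})
print(seed, non_seeds)
```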

Step 103, generating a hot event cluster by adopting the seed report;

in a preferred embodiment of the present invention, the step of generating the hot spot event cluster by using the seed report includes:

generating a hot event cluster;

storing the seed report to the hotspot event cluster.

Specifically, after the seed report is determined, a hot spot event cluster can be generated, and the seed report is then stored in it as the first report of the cluster.

Step 104, calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster;

Generally speaking, the search engine collects more than one original report but there is only one seed report, so all original reports other than the seed report are non-seed reports. Among the non-seed reports, some have a relatively low correlation with the title of the hot event and can be filtered out, while others have a relatively high correlation with the title, though lower than that of the seed report, and can be stored in the hot event cluster. The non-seed reports therefore need further screening.

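In a preferred embodiment of the present invention, the step of calculating the similarity between each non-seed report and the title of the hot spot event and each report in the hot spot event cluster comprises:

respectively calculating the similarity of each non-seed report and each report in the hotspot event cluster;

respectively calculating a first similarity average value of each non-seed report and each report in the hotspot event cluster;

respectively calculating a second similarity of each non-seed report and the title of the hotspot event;

and respectively calculating the third similarity of each non-seed report by using the first similarity average value and the second similarity.

For example, if there are 5 original reports A, B, C, D and E, of which B is the seed report, then the non-seed reports are A, C, D and E.

1) Select any one of the non-seed reports, say A, calculate the JS distance (Jensen-Shannon divergence) between A and each report in the hot spot event cluster, and average the JS distances obtained to get the first similarity average value between A and the reports in the hot spot event cluster. Here the JS distance serves as the similarity measure: the smaller the JS distance, the higher the similarity. The JS distance takes its standard form:

$$JS(P \parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M), \qquad M = \frac{1}{2}(P + Q)$$

2) calculate the JS distance between A and the title of the hot event to obtain the second similarity;

3) average the first similarity average value and the second similarity to obtain the third similarity of A;

4) calculate the third similarity of C, D and E in turn in the same way.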
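Steps 1) to 4) can be sketched as below, working directly on word distributions. This is an illustration under assumptions: the distributions are taken to be already aligned on a shared vocabulary, the third similarity is taken to be the plain mean of the first similarity average value and the second similarity, and the names `kl`, `js_distance` and `third_similarity` as well as the sample numbers are invented for the example.

```python
import math


def kl(p, q):
    """Relative entropy D(P || Q); terms with P(i) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon divergence; smaller values mean more similar distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def third_similarity(report, cluster, title):
    # Step 1: average JS distance to every report already in the cluster.
    first_average = sum(js_distance(report, member) for member in cluster) / len(cluster)
    # Step 2: JS distance to the hot event title.
    second = js_distance(report, title)
    # Step 3: combine the two (assumed here to be a plain mean).
    return (first_average + second) / 2


# Distributions over an assumed shared vocabulary of three words.
title = [0.5, 0.5, 0.0]
seed_b = [0.5, 0.5, 0.0]
report_a = [0.4, 0.4, 0.2]
report_c = [0.0, 0.0, 1.0]
print(third_similarity(report_a, [seed_b], title))
print(third_similarity(report_c, [seed_b], title))
```

Unlike the relative entropy, the JS distance is symmetric and stays finite when a word occurs in only one of the two texts, so no smoothing is needed in this sketch.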

Step 105, acquiring a non-seed report with the highest similarity;

Continuing with the above example, after the third similarities of A, C, D and E have been calculated in turn, the report with the smallest JS distance, i.e., the highest similarity, is selected, for example D.

Step 106, judging whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold value;

continuing with the above example, it is determined whether the third similarity of D is greater than the similarity threshold.

Step 107, if so, storing the non-seed report with the highest similarity to the hotspot event cluster.

Continuing with the above example, if yes, D is stored in the hot spot event cluster. Thus, there are B, D two reports in the hotspot event cluster.

It should be noted that, after D is stored in the hot spot event cluster, the flow of 1) to 4) must be performed again for the remaining non-seed reports A, C and E. The cluster now contains B and D, so the first similarity average value changes and the third similarity changes with it; without recalculation, the final third similarity would be biased.
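Steps 103 to 107, together with the recalculation just described, amount to a greedy loop, sketched below. Several points are assumptions rather than statements from the text: word distributions are plain normalized frequencies without word weights, "similarity greater than the threshold" is expressed as "JS distance below `max_distance`", the loop stops as soon as the best remaining report fails the test, and the name `aggregate` and the sample titles are invented.

```python
import math
from collections import Counter


def js_distance(a_tokens, b_tokens):
    """Jensen-Shannon divergence between the word distributions of two token lists."""
    vocab = sorted(set(a_tokens) | set(b_tokens))
    a_counts, b_counts = Counter(a_tokens), Counter(b_tokens)
    p = [a_counts[word] / len(a_tokens) for word in vocab]
    q = [b_counts[word] / len(b_tokens) for word in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def aggregate(event_tokens, reports, seed_id, max_distance):
    """Grow the hot event cluster from the seed, one best non-seed report per round."""
    cluster = [seed_id]
    remaining = [report_id for report_id in reports if report_id != seed_id]
    while remaining:
        scores = {}
        for report_id in remaining:
            tokens = reports[report_id]
            # Recomputed every round because the cluster membership has changed.
            first_average = sum(
                js_distance(tokens, reports[member]) for member in cluster
            ) / len(cluster)
            second = js_distance(tokens, event_tokens)
            scores[report_id] = (first_average + second) / 2
        best = min(scores, key=scores.get)
        if scores[best] >= max_distance:
            break  # assumed stopping rule: the best candidate is not similar enough
        cluster.append(best)
        remaining.remove(best)
    return cluster


event = ["woman", "denies", "pregnancy"]
reports = {
    "A": ["netizens", "think", "woman", "lies", "about", "pregnancy"],
    "B": ["woman", "denies", "pregnancy"],
    "C": ["stock", "market", "falls"],
    "D": ["woman", "denies", "pregnancy", "rumor"],
}
print(aggregate(event, reports, seed_id="B", max_distance=0.3))
```

With these sample titles and threshold, the first round admits D and the second round rejects A, which mirrors the example in which the cluster ends up holding B and D.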

In a preferred embodiment of the present invention, the method further comprises: and carrying out deduplication on the reports in the hotspot event cluster.

In practical applications, some non-seed reports have a very high similarity to the title of the hot event without being the seed report, and these reports are stored in the hot event cluster. The reports in the cluster therefore need to be deduplicated so that duplicate reports are not stored.
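A minimal sketch of such deduplication is given below. The text does not state what counts as a duplicate, so the sketch assumes that two reports are duplicates when their titles are identical after lowercasing and whitespace normalization; the name `deduplicate` is hypothetical.

```python
def deduplicate(titles):
    """Keep the first occurrence of each title, comparing normalized text."""
    seen = set()
    unique = []
    for title in titles:
        # Assumed duplicate test: same words, ignoring case and extra spaces.
        key = " ".join(title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique


cluster = [
    "Woman denies pregnancy",
    "woman  denies pregnancy",
    "Netizens think woman lies",
]
print(deduplicate(cluster))
```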

Across 90 events, a comparison of the algorithm of the present application with the prior art is shown in Table 1:

TABLE 1

As can be seen from the F1-score evaluation index, the aggregation effect of the method is greatly improved compared with the prior art.
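For reference, the F1-score is conventionally defined as the harmonic mean of precision and recall. How precision and recall were computed over the 90 events is not stated here, so the symbols P and R below are generic:

```latex
F_1 = \frac{2 \, P \, R}{P + R}
```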

In the embodiment of the invention, original reports are obtained based on the title of the hot event; one seed report and a plurality of non-seed reports are determined based on the title of the hot event and the original reports; a hot event cluster is generated from the seed report; the similarity of each non-seed report to the title of the hot event and to each report in the hot event cluster is calculated; the non-seed report with the highest similarity is obtained; and it is judged whether that similarity is greater than a similarity threshold, in which case the non-seed report with the highest similarity is stored in the hot event cluster. Because the seed report, the hotspot event, and the similarity between reports are all taken into account during aggregation, the clustering algorithm centers on the event, text similarity is measured more accurately, and a better aggregation effect is obtained.

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Referring to fig. 2, a block diagram of an embodiment of an aggregation apparatus for hotspot events in the present invention is shown, which may specifically include the following modules:

an original report obtaining module 201, configured to obtain an original report based on a title of the hotspot event;

a determining module 202, configured to determine a seed report and a plurality of non-seed reports based on the title of the hotspot event and the original report;

a generating module 203, configured to generate a hot event cluster by using the seed report;

a calculating module 204, configured to calculate similarity between each non-seed report and the title of the hotspot event and each report in the hotspot event cluster;

a non-seed report acquisition module 205, configured to acquire a non-seed report with a highest similarity;

a judging module 206, configured to judge whether the similarity of the non-seed report with the highest similarity is greater than a similarity threshold;

a storage module 207, configured to store the non-seed report with the highest similarity to the hotspot event cluster.

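In a preferred embodiment of the present invention, the original report acquiring module comprises:

a title acquisition submodule, configured to acquire the title of the first report;

a similarity judgment submodule, configured to judge the semantic similarity between the title of the first report and the title of the hotspot event;

an original report determining submodule, configured to determine that the first report is an original report when the semantic similarity between the title of the first report and the title of the hotspot event is greater than a semantic similarity threshold.

In a preferred embodiment of the present invention, the original report comprises one or more reports;

the determining module comprises:

a word segmentation submodule, configured to perform word segmentation on the title of the hot event and the title of each original report;

a word frequency calculation submodule, configured to calculate the word frequency with attached word weight of each word in the title of the hot event and in the title of each original report;

a similarity obtaining submodule, configured to apply relative entropy to the weighted word frequencies to obtain the similarity between each original report and the title of the hotspot event;

an original report acquisition submodule, configured to acquire an original report with the highest similarity;

a seed report determination submodule, configured to take the original report with the highest similarity as a seed report and take the original reports except the seed report as non-seed reports.

In a preferred embodiment of the present invention, the generating module includes:

a hot event cluster generating submodule, configured to generate a hot event cluster;

a storage submodule, configured to store the seed report to the hot spot event cluster.

In a preferred embodiment of the present invention, the calculation module includes:

a first calculation submodule, configured to calculate the similarity between each non-seed report and each report in the hot spot event cluster;

a second calculation submodule, configured to calculate a first similarity average value between each non-seed report and each report in the hot spot event cluster;

a third calculation submodule, configured to calculate a second similarity of each non-seed report to the title of the hotspot event;

a fourth calculation submodule, configured to calculate the third similarity of each non-seed report by using the first similarity average value and the second similarity.

In a preferred embodiment of the present invention, the device further comprises:

a deduplication module, configured to deduplicate the reports in the hot spot event cluster.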

Since the device embodiment is basically similar to the method embodiment, it is described only briefly; for the relevant details, refer to the corresponding parts of the description of the method embodiment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The hot spot event aggregation method and the hot spot event aggregation device provided by the present invention are described in detail above, and a specific example is applied in the text to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for aggregating hotspot events, comprising:

obtaining an original report based on the title of the hotspot event;

generating a hot event cluster by using the seed report;

obtaining a non-seed report with the highest similarity;

if yes, storing the non-seed report with the highest similarity to the hotspot event cluster;

wherein the original report comprises one or more reports;

obtaining an original report with the highest similarity;

2. The method of claim 1, wherein the step of obtaining the original report based on the title of the hotspot event comprises:

acquiring a title of a first report;

3. The method of claim 1, wherein the step of generating a cluster of hotspot events using the seed report comprises:

generating a hot event cluster;

storing the seed report to the hotspot event cluster.

4. The method of claim 1 or 3, wherein the step of calculating the similarity between each non-seed report and the title of the hotspot event and each report in the hotspot event cluster comprises:

5. The method of claim 1, further comprising:

and carrying out deduplication on the reports in the hotspot event cluster.

6. An aggregation apparatus of hotspot events, comprising:

the storage module is used for storing the non-seed reports with the highest similarity to the hotspot event cluster;

wherein the original report comprises one or more reports;

the determining module comprises:

7. The apparatus of claim 6, wherein the original report acquisition module comprises:

8. The apparatus of claim 6, wherein the generating module comprises:

9. The apparatus of claim 6 or 8, wherein the computing module comprises:

10. The apparatus of claim 6, further comprising: