WO2022267325A1

WO2022267325A1 - News popularity calculation method, device and storage medium

Info

Publication number: WO2022267325A1
Application number: PCT/CN2021/132517
Authority: WO
Inventors: 计明杰; 薛晓舟; 蔡承蒙; 陈邦忠
Original assignee: 完美世界控股集团有限公司
Priority date: 2021-06-25
Filing date: 2021-11-23
Publication date: 2022-12-29
Also published as: CN113449077A; CN113449077B

Abstract

Embodiments of the present application provide a news popularity calculation method, a device, and a storage medium. In the news popularity calculation method, after a news set of an event is obtained, multiple pieces of news having a release interval and a release duration that satisfy set requirements are selected from the news set, and popularity weights of the news are determined according to release agencies corresponding to the news. The release interval, the release duration, and the release agencies of the news are comprehensively considered, multi-dimensional information of the news can be fully utilized, and then accurate news popularity is obtained by means of calculation.

Description

News popularity calculation method, equipment and storage medium

Cross reference

This application claims priority to the Chinese patent application filed on June 25, 2021, with the application number "20210711197.X" and the invention title "News popularity calculation method, equipment and storage medium", the entire content of which is incorporated by reference in this application.

Technical field

The present application relates to the technical field of the Internet, and in particular to a news popularity calculation method, device and storage medium.

Background technique

In the information age, information of all kinds has grown explosively, and news is no exception. Analyzing and filtering large volumes of news to identify hot news, and recommending that hot news to users, helps users keep abreast of hot topics and improves news reading efficiency.

Existing methods for calculating news popularity usually rely on the number of clicks, comments, etc. on news by users. This method relies more on user behavior and cannot obtain accurate news popularity analysis results. Therefore, a new solution remains to be proposed.

Contents of the invention

The present invention proposes the following technical solutions to overcome, or at least partially solve or alleviate, the above-mentioned problems:

According to one aspect of the present invention, a method for calculating news popularity is provided, including: obtaining a news set corresponding to an event; determining, from the news set, a plurality of news items whose release interval and release time duration meet set conditions; determining the respective popularity weights of the plurality of news items according to the publishing organizations corresponding to them; and calculating the news popularity corresponding to the event according to the respective popularity weights of the plurality of news items, the release interval, and the release time duration.

According to yet another aspect of the present invention, a computer device/equipment/system is provided, including a memory, a processor, and computer programs/instructions stored on the memory, wherein the steps of the above news popularity calculation method are implemented when the processor executes the computer programs/instructions.

According to another aspect of the present invention, a computer-readable medium is provided, on which computer programs/instructions are stored, and when the computer programs/instructions are executed by a processor, the steps of the above-mentioned method for calculating news popularity are realized.

According to still another aspect of the present invention, a computer program product is provided, including computer programs/instructions, and when the computer programs/instructions are executed by a processor, the steps of the above-mentioned method for calculating news popularity are realized.

The beneficial effect of the present invention is: after obtaining the news set of the event, select a plurality of news whose release interval and release time duration meet the set requirements, and determine the popularity weight of the news according to the release organization corresponding to the news. By comprehensively considering the news release interval, the duration of the release time, and the news release organization, the multi-dimensional information of the news can be fully utilized to calculate and obtain accurate news popularity.

Description of drawings

These and various other advantages and benefits of the present invention will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. In the accompanying drawings:

FIG. 1 is a schematic flowchart of a method for calculating news popularity provided by an exemplary embodiment of the present application;

FIG. 2 is a schematic flowchart of a method for calculating news popularity provided by another exemplary embodiment of the present application;

FIG. 3 is a schematic flowchart of a method for identifying similar news provided by an exemplary embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application;

Figure 5 schematically shows a block diagram of a computer device/equipment/system for implementing the method according to the present invention; and

Fig. 6 schematically shows a block diagram of a computer program product implementing the method according to the invention.

Detailed description

The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments. The following description is only to illustrate the basic principle of the present invention and not to limit it.

Analyzing and filtering large volumes of news to identify hot news, and recommending that hot news to users, helps users keep abreast of hot topics and improves news reading efficiency.

Existing methods for calculating news popularity usually rely on the number of clicks, comments, etc. on news by users. This method relies more on user behavior and cannot obtain accurate news popularity analysis results.

Aiming at the above technical problems, some embodiments of the present application provide a solution. The technical solutions provided by each embodiment of the present application will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a method for calculating news popularity provided by an exemplary embodiment of the present application. As shown in Fig. 1, the method includes:

Step 101: acquire a news set corresponding to an event.

Step 102: from the news set, determine a plurality of news items whose release interval and release time duration meet the set conditions.

Step 103: determine the popularity weight of each of the plurality of news items according to the publishing organizations corresponding to them.

Step 104: calculate the news popularity corresponding to the event according to the popularity weights of the plurality of news items, the release interval, and the release time duration.

Wherein, an event refers to an object of a news report. When a new event occurs in society, there will be multiple news organizations to report on the event. When the importance of the event is high or the topic persistence is high, the number of news reporting the event is also large. Analyzing the news popularity of events is helpful for identifying hot topics and recommending hot topics.

Wherein, the news set corresponding to the event includes a plurality of news articles reporting the event. Before analyzing news popularity, a large amount of news data can be analyzed based on news classification and aggregation to obtain news sets corresponding to different events.

When calculating the popularity of news, if the release time or quantity of the news is considered alone, it is impossible to get an accurate calculation result of the popularity. For example, if the number of news about a certain event is large, but the time interval between the news is relatively large, it can be considered that the news is less popular. Similarly, if there are multiple news articles about a certain event in a very short time interval, but there is no other news that continuously reports on the event, it can be considered that the popularity of the news is low. In addition, if most of the news about a certain event comes from some less authoritative news organizations, it can be considered that the news is also less popular.

In order to obtain accurate heat calculation results, this embodiment comprehensively considers news release intervals, release time durations, and news release agencies when calculating news heat, so as to make full use of multi-dimensional information of news.

Here, the release interval of news refers to the time difference between two adjacent news releases and indicates the frequency of news releases; the release time duration indicates how persistently the event is reported over time. News publishers refer to sources of news, such as portal websites, magazines, newspapers, and so on.

Generally, the more authoritative the news release agency, the higher the contribution to the popularity of the news. Based on this, in this embodiment, the popularity weight of the news can be determined according to the publishing agency corresponding to the news, and then the influence of the publishing agency on the news popularity can be considered to improve the accuracy of the news popularity calculation result.

In some optional embodiments, before calculating the popularity of news, a large number of news can be classified and aggregated to obtain the correspondence between news and events. That is, take the event as the dimension to filter out the news corresponding to the event. An optional implementation manner of filtering news corresponding to an event will be exemplarily described below by taking any event as an example.

Optionally, news data may be collected, and the news data may include news from multiple different news release organizations. When classifying and aggregating news in news data, it can be judged whether any two news are used to report the same event.

As shown in Figure 2, taking the first news text and the second news text in the incoming news data as an example: the text similarity between the first news text and the second news text can be calculated; the overlap between the news elements of the first news text and the news elements of the second news text can be analyzed to obtain the news element overlap; and if the text similarity and the element overlap meet the set conditions, the two news texts are assigned to the news set of the same event.

Optionally, when calculating the text similarity between the first news text and the second news text, the first title and the first text contained in the first news text, and the second title and the second text contained in the second news text, can be determined; the title similarity between the first title and the second title is calculated from their corresponding texts; the text similarity between the first text and the second text is calculated from their corresponding texts; and the title similarity and the text similarity are fused to obtain the similarity between the first news text and the second news text. For optional implementations of calculating the title similarity and the text similarity, reference may be made to the descriptions in subsequent embodiments, and details are not described here.

Among them, the elements of news are the basic components of news. The six elements of news are commonly used, referring to: time, place, person, cause, process, and result of an event. In this embodiment, in order to analyze the popularity of the news, the main entities in the news can be extracted, the main entities include: time, place, person (or organization, etc.), etc., as shown in FIG. 2 .

Optionally, when analyzing the overlap between the news elements of the first news text and those of the second news text, time elements, location elements and subject elements can be extracted from the first news text and the second news text respectively; the overlap of the time elements, the overlap of the location elements and the overlap of the subject elements of the two news texts are calculated; and the sum of these three overlaps is taken as the news element overlap. The subject elements include the people, objects, organizations and so on described in the news.

For example, when comparing news A and news B, the three elements of news A are (time 1, location 1, person 1) and the three elements of news B are (time 1, location 1, person 2). The time element and the location element of news A are the same as those of news B, while the subject elements differ, so the news element overlap can be taken as 2/3.
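The 2/3 figure in this example can be reproduced with a short sketch. It assumes that each of the three elements contributes 1/3 when it matches exactly; the `Elements` type and the `element_overlap` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Elements(NamedTuple):
    """The three extracted news elements (hypothetical container)."""
    time: str
    location: str
    subject: str


def element_overlap(a: Elements, b: Elements) -> float:
    """Fraction of the three elements that are identical in both news items.

    Assumption: each element is compared by exact equality and
    contributes equally (1/3) to the overall overlap.
    """
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


news_a = Elements("time 1", "location 1", "person 1")
news_b = Elements("time 1", "location 1", "person 2")
overlap = element_overlap(news_a, news_b)  # time and location match, subject differs
```

In practice the per-element comparison could be softer than exact equality (for example, tolerating small time differences); the source does not specify this.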

The text similarity and the element overlap meeting the set conditions may include: the text similarity is greater than a set first similarity threshold, and the element overlap is greater than a set second similarity threshold. The first similarity threshold and the second similarity threshold can be set according to actual needs. For example, the first similarity threshold can be 80% or 90%, and the second similarity threshold can be 2/3 or 1, which is not limited in this embodiment.

After the above calculation is completed for any two news in the news data, the news in the news data can be divided into news sets of different events. For example, news A and news B can be divided into the news set of event 1, and news C, news D, news E, and news F can be divided into the news set of event 2.
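One way to turn the pairwise decisions into event-level news sets is sketched below. It assumes that membership is transitive (if A matches B and B matches C, all three share an event), which the description does not state explicitly, and it uses a small union-find structure. The function name, the input format and the threshold values in the usage lines are illustrative assumptions.

```python
def group_by_event(num_news, pair_scores, sim_threshold, overlap_threshold):
    """Group news indices 0..num_news-1 into event sets.

    pair_scores: iterable of (i, j, text_similarity, element_overlap).
    Two news items are linked when both scores exceed their thresholds.
    Assumption: links are merged transitively via union-find.
    """
    parent = list(range(num_news))

    def find(x):
        # Walk up to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, text_sim, overlap in pair_scores:
        if text_sim > sim_threshold and overlap > overlap_threshold:
            parent[find(i)] = find(j)

    groups = {}
    for idx in range(num_news):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Indices 0..5 stand for news A..F; the scores are made-up illustration data.
pairs = [
    (0, 1, 0.90, 1.0),
    (2, 3, 0.95, 1.0),
    (3, 4, 0.85, 2 / 3),
    (4, 5, 0.90, 1.0),
    (0, 2, 0.10, 0.0),
]
event_sets = group_by_event(6, pairs, sim_threshold=0.8, overlap_threshold=0.5)
```

With this made-up data, A and B end up in one set and C, D, E, F in another, mirroring the example above.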

In some optional embodiments, when determining from the news set multiple news items whose release interval and release time duration meet the set conditions, n news items can be intercepted from the news set and then analyzed specifically.

Optionally, the release time of the news in the news collection may be sorted first according to chronological order to obtain a release time sequence.

Next, a sliding window is determined, and the length of the sliding window is the time span. For ease of description and distinction, this sliding window is described as the first sliding window. For example, the length of the first sliding window is 1 hour, 2 hours, 24 hours and so on.

Next, the first sliding window is used to slide on the release time series to obtain multiple time windows. Wherein, multiple time windows have the same time span, and each time window contains one or more news. From the plurality of time windows obtained by sliding, the time window in which the quantity of news meets the set quantity requirement can be determined as the target time window. Optionally, the set quantity requirement may be: the maximum quantity, or a quantity greater than a certain quantity threshold, which is not limited in this embodiment. For example, in some embodiments, from the multiple time windows obtained by sliding, the time window with the largest number of news is determined as the target time window.

In some embodiments, when several of the time windows obtained by sliding contain the same number of news items, the average time interval of the news in each such window can be calculated, and the time window with the smaller average time interval is taken as the target time window; details are not repeated here.
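The selection of the target time window can be sketched as follows. The sketch assumes that each candidate window starts at a release time and covers everything published within the span after it, that "meets the quantity requirement" means "has the most news", and that ties are broken by the smaller average interval. The function names and the sample timestamps are illustrative.

```python
from datetime import datetime, timedelta


def average_interval(times):
    """Mean gap in seconds between adjacent release times (inf if fewer than two)."""
    if len(times) < 2:
        return float("inf")
    # The sum of adjacent gaps telescopes to last - first.
    return (times[-1] - times[0]).total_seconds() / (len(times) - 1)


def target_time_window(release_times, span):
    """Return the release times inside the densest window of length `span`."""
    times = sorted(release_times)
    best = []
    end = 0
    for start in range(len(times)):
        # Extend the window to include every item within `span` of its start.
        while end < len(times) and times[end] - times[start] <= span:
            end += 1
        window = times[start:end]
        more_news = len(window) > len(best)
        tie_but_denser = (len(window) == len(best)
                          and average_interval(window) < average_interval(best))
        if more_news or tie_but_denser:
            best = window
    return best


day = datetime(2021, 6, 25)  # arbitrary date for the illustration
stamps = [(10, 0), (10, 6), (10, 7), (10, 9), (10, 30), (13, 0)]
times = [day.replace(hour=h, minute=m) for h, m in stamps]
window = target_time_window(times, timedelta(hours=1))
```

Here the one-hour window starting at 10:00 holds five items, more than any other position, so it becomes the target time window.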

After the target time window is determined, multiple news items whose release interval meets the set interval requirement can be intercepted from the target time window. Optionally, the set interval requirement may be: the average time interval is the smallest, or the average time interval is smaller than a certain time threshold, which is not limited in this embodiment.

Optionally, a sliding window may be determined whose length is a set number of news items. For ease of distinction, this sliding window is referred to as the second sliding window.

Next, the second sliding window may be slid within the target time window to obtain multiple sub-windows. Each sub-window contains the same number of news items, but the time intervals between them may differ. For example, assuming that the window length of the second sliding window is m and the target time window contains n news items, the second sliding window can be slid over the n news items, selecting m of the n news items at each position; the m news items have different release times.

Next, the average interval length of the news contained in the plurality of sub-windows may be calculated, and the target sub-window is determined from the plurality of sub-windows according to the average interval length. Wherein, the average interval time of news contained in the target sub-window satisfies the set interval requirement. For example, the average interval time of news in the target sub-window is the smallest.
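A minimal sketch of this second pass, assuming release times are given as sorted minute offsets and that the interval requirement is "smallest average interval". Because the average gap of m consecutive items equals (last - first) / (m - 1), minimizing the span of the sub-window is equivalent. The function name and sample data are illustrative.

```python
def target_sub_window(times, m):
    """Return (start_index, items) of the m consecutive items with the smallest average gap.

    times: release times sorted ascending (any numeric unit).
    """
    if m < 2 or m > len(times):
        raise ValueError("m must be between 2 and the number of news items")
    # For a fixed m, the average gap is span / (m - 1), so compare spans directly.
    best_start = min(range(len(times) - m + 1),
                     key=lambda s: times[s + m - 1] - times[s])
    return best_start, times[best_start:best_start + m]


# Made-up minute offsets inside a target time window.
start, sub_window = target_sub_window([0, 10, 12, 13, 40], m=3)
```

For this data the spans of the three candidate sub-windows are 12, 3 and 28 minutes, so the sub-window starting at index 1 is selected.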

Based on the above embodiments, when calculating the popularity of news corresponding to an event, the time interval between each news in the target sub-window and the adjacent previous news may be determined.

Next, taking the time interval as the exponent of a specified base, the exponential term of each news item is calculated, and the exponential terms are weighted according to the respective popularity weights of the news items to obtain a weighted score. After the weighted score is obtained, the ratio of the weighted score to the length of the second sliding window may be calculated as the news popularity corresponding to the event. The specified base may be any constant, such as 2, 3, 4, etc., which is not limited in this embodiment.

In some embodiments, the specified base may be e (approximately 2.7182818284). The calculation process of the above-mentioned news popularity H can refer to the following formula:

In Formula 1, n represents the number of news items in the target time window, m represents the news set in the target sub-window, and |m| represents the length of the second sliding window. i denotes the i-th news item in the target sub-window, and α_i denotes the popularity weight of the i-th news item. Usually, news published by mainstream news websites has a higher weight. Interval_i denotes the time interval between the i-th news item and the previous one, Interval_i = T_i - T_{i-1}, where T_i denotes the release time of the i-th news item and i = 2, 3, ..., |m|; that is, the first news item in the set m does not participate in the calculation.
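Formula 1 itself is not reproduced in this text. One form consistent with the verbal description (exponential terms with base e, weighted by the popularity weights, divided by the length of the second sliding window) would be the following. It assumes that the exponent is the negated interval, so that shorter intervals contribute more, and it writes the interval as Interval_i = T_i - T_{i-1}. How n enters the original formula cannot be determined from the description, so n does not appear in this sketch.

```latex
H = \frac{1}{|m|} \sum_{i=2}^{|m|} \alpha_i \, e^{-\mathrm{Interval}_i},
\qquad \mathrm{Interval}_i = T_i - T_{i-1}
```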

The above method of calculating news popularity will be further exemplified below with a specific example.

Assume that the target time window obtained by sliding the first sliding window contains 5 news items, a1, a2, a3, a4, a5, whose release times are 10:00, 10:06, 10:07, 10:09 and 10:30, and whose source institutions have weights 1, 2, 3, 4 and 5 respectively. Assume that the length of the second sliding window is 3, i.e. |m|=3. Slide the second sliding window within the target time window, calculate the time intervals of the news in each sub-window, and select the sub-window with the smallest average time interval: m={a2, a3, a4}, since the average time interval of the three news items a2, a3 and a4 is the smallest. Then the news popularity is:
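The arithmetic for this example can be sketched in code under explicit assumptions, since the formula is not reproduced here: the base is e, the exponent is the negated interval (shorter gaps score higher), intervals are measured in minutes, and the weighted sum is divided by |m|. The function name `news_popularity` is hypothetical.

```python
import math


def news_popularity(minutes, weights):
    """Popularity of an event from the news in the target sub-window.

    minutes: release times as sorted minute offsets.
    weights: popularity weight of each news item, in the same order.
    Assumptions: exponent is the negated interval in minutes, base e,
    and the first item is skipped because it has no predecessor.
    """
    total = 0.0
    for i in range(1, len(minutes)):
        interval = minutes[i] - minutes[i - 1]
        total += weights[i] * math.exp(-interval)
    return total / len(minutes)


# a2, a3, a4 were released at 10:06, 10:07, 10:09 with weights 2, 3, 4.
heat = news_popularity([6, 7, 9], [2, 3, 4])
```

Under these assumptions the intervals are 1 and 2 minutes, so the result equals (3·e^-1 + 4·e^-2) / 3; a different time unit or sign convention would change the value.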

Based on the above calculation method of news popularity, the news release interval, the duration of the release time and the news publishing organization can be considered comprehensively, and the multi-dimensional information of the news can be fully utilized to calculate the accurate news popularity. After the news popularity is calculated, the news events may be sorted according to the news popularity, or news events with high news popularity may be recommended to the user, which is not limited in this embodiment.

The aforementioned embodiments describe the implementation of calculating the similarity between the first news text and the second news text according to the similarity of the text and the similarity of the title, and this implementation will be further described in detail below.

News texts refer to texts that report or comment on events, and news texts are usually published in magazines, newspapers, and various websites. When there are a large number of news texts, similarity recognition can be performed on the massive news texts, and similar news texts can be classified or deduplicated, etc. Among them, when performing similarity identification on a large amount of news texts, the similarity between any two news texts can be calculated.

In each embodiment of the present application, for the convenience of description and distinction, any two news texts to be identified by similarity are described as a first news text and a second news text.

News texts have certain data characteristics. Generally, news texts include at least two parts, ie, a title part and a body part. The title of the news is a general summary or evaluation of the text. Therefore, whether it is a newsletter or a long-form news, when the same content is reported, the similarity between the two titles is usually high. In this embodiment, in order to reduce the impact of text length differences on the similarity, the similarity between news is divided into two parts, that is, the similarity between titles and the similarity between texts.

In this embodiment, for ease of description and distinction, the title and text of the first news text are described as the first title and the first text, and the title and text of the second news text are described as the second title and the second text.

Based on the corresponding texts of the first title and the second title, the similarity between the first title and the second title can be calculated, and based on the corresponding texts of the first text and the second text, the similarity between the first text and the second text can be calculated. When calculating the similarity based on the text, the literal similarity of the text may be calculated, which will be described in detail in subsequent embodiments and will not be repeated here. For the convenience of description and distinction, the similarity between titles is described as title similarity, and the similarity between texts is described as text similarity.

After the title similarity and the text similarity are obtained, they are fused to obtain the similarity between the first news text and the second news text. The title similarity and the text similarity may be fused by arithmetic calculation. For example, the average of the title similarity and the text similarity can be taken as the similarity between the two news texts; as another example, the product of the title similarity and the text similarity can be taken as the similarity between the two news texts; as yet another example, the sum of the title similarity and the text similarity can be taken as the similarity between the two news texts.

In some exemplary embodiments, considering the contribution of titles and texts to news content, preset weight coefficients can be set for titles and texts respectively, and the title similarity and the text similarity can be weighted and summed according to the preset weight coefficients to obtain the similarity between the first news text and the second news text. Assuming that the weight coefficient of the title is w1, the weight coefficient of the text is w2, the title similarity is S1, and the text similarity is S2, the similarity between the first news text and the second news text is S=w1*S1+w2*S2, where the values of w1 and w2 may be empirical values, which are not limited in this embodiment.
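The weighted fusion S = w1*S1 + w2*S2 is a one-line computation. The sketch below uses equal default weights purely as a placeholder, since the source leaves w1 and w2 as empirical values; the function name is illustrative.

```python
def fuse_similarity(title_sim, text_sim, w1=0.5, w2=0.5):
    """Weighted sum of title similarity (S1) and text similarity (S2).

    The default weights are placeholders, not values from the source.
    """
    return w1 * title_sim + w2 * text_sim


# Example with made-up similarities and weights.
similarity = fuse_similarity(0.8, 0.6, w1=0.4, w2=0.6)
```

Choosing w1 + w2 = 1 keeps the fused score in the same range as its inputs, which makes a single threshold such as 80% easier to interpret.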

In this embodiment, when calculating the similarity of the news, the title and the text in the news are processed separately: the title similarity is calculated from the text corresponding to the titles, and the text similarity is calculated from the text corresponding to the bodies. To a certain extent this reduces the impact of text length differences on the similarity, which is conducive to calculating a more accurate similarity. At the same time, the similarity of the news is obtained by fusing the title similarity and the text similarity, which makes it possible to obtain the similarity calculation result of the news texts quickly, reduces the time cost and calculation cost required to identify similar news, and improves the identification efficiency of similar news.

In the above-mentioned embodiment, the implementation manner of separately processing the title and text of the news is described, and the optional implementation manner of calculating the similarity of the title and the similarity of the text will be further described below.

Optionally, as shown in Figure 3, after the first news and the second news are taken as input data, it can first be detected whether the input text is a title. If it is a title, the title processing branch is entered, that is, Embodiment 1 is executed; if the input text is not a title, the text processing branch is entered, that is, Embodiment 2 is executed.

Embodiment 1: Calculate the title similarity between the first title and the second title according to the corresponding texts of the first title and the second title.

Optionally, a keyword extraction operation may be performed on the first title and the second title to obtain a set of keywords contained in the first title and a set of keywords contained in the second title. Wherein, the set of keywords contained in the first title can be described as a set of first title entries; the set of keywords contained in the second title can be described as a set of second title entries.

The keyword extraction operation may include: extracting entries corresponding to entities, entries whose part of speech is a noun, and/or entries whose part of speech is a verb. That is, the entries corresponding to entities, the entries whose part of speech is a noun, and/or the entries whose part of speech is a verb are extracted from the first title to obtain the first title entry set, and the same kinds of entries are extracted from the second title to obtain the second title entry set.

Here, an entity refers to a thing that actually exists and appears in the text corpus. An entity is a specific thing, which can be a single thing or a collection of multiple things, such as person names, places, organizations and other entities.
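The keyword extraction step can be sketched on tokens that have already been segmented and part-of-speech tagged by some upstream tool. The sketch assumes a tagset in which noun tags begin with "n" and verb tags begin with "v", and it takes recognized entities as a separate set; both the function name and these conventions are illustrative assumptions rather than details from the source.

```python
def title_entries(tagged_tokens, entities=frozenset()):
    """Build a title entry set from (word, pos_tag) pairs.

    Keeps words that are recognized entities, nouns, or verbs.
    Assumption: noun tags start with "n" and verb tags start with "v".
    """
    keep = set()
    for word, pos in tagged_tokens:
        if word in entities or pos.startswith("n") or pos.startswith("v"):
            keep.add(word)
    return keep


# Made-up tagged title; "a" marks an adjective and "t" a time word here.
tagged = [("Acme", "nt"), ("releases", "v"), ("new", "a"), ("game", "n"), ("today", "t")]
entries = title_entries(tagged, entities={"Acme"})
```

Returning a set means a word that appears several times in a title is kept only once.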

Next, the number of same title entries in the first title entry set and the second title entry set can be calculated, where a same title entry is an entry that belongs to both the first title entry set and the second title entry set. When a same title entry appears multiple times in a title, it is counted only once, regardless of how often it is repeated.

Next, the title similarity can be determined according to the ratio of the number of same title entries to the total number of entries included in the first title entry set and the second title entry set. Consistent with the symbol definitions that follow, the calculation can be written as:

$$S_1 = \frac{2\sum_{i \in A} f(i, B)}{|A| + |B|}, \qquad f(i, B) = \begin{cases} 1, & i \in B \\ 0, & i \notin B \end{cases} \qquad \text{(Formula 2)}$$

Here, A represents the first title entry set, and |A| represents the size of the set A, that is, the number of elements in the set A; B represents the second title entry set, and |B| represents the size of the set B, that is, the number of elements in the set B. i represents the i-th entry in the set A. When the i-th entry in the set A also belongs to the set B, f(i, B)=1, that is, the i-th entry is a same title entry of the sets A and B. When the i-th entry in the set A does not belong to the set B, f(i, B)=0, that is, the i-th entry is a different title entry of the sets A and B. The coefficient 2 in the numerator ensures that the maximum value of the similarity calculation result S1 is 1. Based on Formula 2, the title similarity of two news items can be calculated.
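The title similarity described here, twice the number of shared entries divided by the total size of the two entry sets, can be sketched as below. The function name is illustrative, and returning 0 for two empty sets is an added convention, not something the source specifies.

```python
def title_similarity(first_entries, second_entries):
    """Twice the number of shared entries over the total number of entries.

    Inputs are converted to sets, so repeated entries count once.
    Convention (assumption): two empty sets give a similarity of 0.
    """
    a, b = set(first_entries), set(second_entries)
    if not a and not b:
        return 0.0
    shared = sum(1 for entry in a if entry in b)  # sum of f(i, B) over A
    return 2 * shared / (len(a) + len(b))


# Made-up entry sets for two titles.
sim = title_similarity({"Acme", "releases", "game"},
                       {"Acme", "launches", "game", "today"})
```

Two of the entries are shared and the sets hold 3 and 4 entries, giving 4/7; identical sets give exactly 1, which is the role of the factor 2.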

Embodiment 2: Calculate the text similarity between the first text and the second text according to the corresponding texts of the first text and the second text.

Optionally, word segmentation processing may be performed on the first text and the second text to obtain the entry sets corresponding to the first text and the second text. The entry set corresponding to the first text may be described as the first text entry set, and the entry set corresponding to the second text may be described as the second text entry set.

Among them, word segmentation processing refers to segmenting sentences and paragraphs to obtain entries, words, etc. contained in sentences. In some embodiments, in order to save data space and improve subsequent processing efficiency, a stop word removal operation may be performed on the result obtained from word segmentation processing, as shown in FIG. 3 . Among them, stop words refer to function words without practical meaning, such as "的", "在", "是" and so on.
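Stop word removal after segmentation can be sketched as a simple filter. The sketch assumes the text has already been split into tokens by some segmenter, which is not shown, and uses only the three stop words quoted above as an illustrative list; the names `STOP_WORDS` and `text_entries` are hypothetical.

```python
# Illustrative stop word list taken from the examples in the description.
STOP_WORDS = frozenset({"的", "在", "是"})


def text_entries(tokens, stop_words=STOP_WORDS):
    """Drop stop words from segmented tokens, preserving order and repeats.

    Repeats are kept because later steps count how often entries occur.
    """
    return [token for token in tokens if token not in stop_words]


# Made-up output of a segmenter for a short sentence.
segmented = ["完美世界", "的", "新", "游戏", "在", "今天", "发布"]
entries = text_entries(segmented)
```

A production stop word list would be much longer; only the filtering step is shown here.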

After the first text entry set and the second text entry set are obtained, the same text entries and the different text entries in the two sets can be obtained. The intersection of the first text entry set and the second text entry set can be determined to obtain the same text entries; after the same text entries are obtained, the entries in the first text entry set and the second text entry set other than the same text entries are regarded as different text entries.

For the same text entries, their frequency of occurrence in the first text entry set can be calculated to obtain the first frequency of occurrence, and their frequency of occurrence in the second text entry set can be calculated to obtain the second frequency of occurrence. When the same text entries comprise multiple entries, the frequencies of occurrence of these entries in the first text entry set can be summed to obtain the first frequency of occurrence, and their frequencies of occurrence in the second text entry set can be summed to obtain the second frequency of occurrence.

For the different text entries, their frequency of occurrence in the first text entry set can be calculated to obtain the third frequency of occurrence, and their frequency of occurrence in the second text entry set can be calculated to obtain the fourth frequency of occurrence. When the different text entries comprise multiple entries, the frequencies of occurrence of these entries in the first text entry set can be summed to obtain the third frequency of occurrence, and their frequencies of occurrence in the second text entry set can be summed to obtain the fourth frequency of occurrence.
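The four frequencies can be computed with counters, as in the sketch below. It interprets "frequency of occurrence" as a raw occurrence count, which is an assumption, and the function name is illustrative.

```python
from collections import Counter


def occurrence_frequencies(first_entries, second_entries):
    """Return the first to fourth frequencies of occurrence as a tuple.

    first/second: summed counts of the shared entries in text 1 / text 2.
    third/fourth: summed counts of the non-shared entries in text 1 / text 2.
    Assumption: "frequency" means a raw occurrence count.
    """
    c1, c2 = Counter(first_entries), Counter(second_entries)
    same = set(c1) & set(c2)
    different = (set(c1) | set(c2)) - same
    first = sum(c1[e] for e in same)
    second = sum(c2[e] for e in same)
    third = sum(c1[e] for e in different)   # Counter returns 0 for absent entries
    fourth = sum(c2[e] for e in different)
    return first, second, third, fourth


# Made-up segmented texts.
freqs = occurrence_frequencies(["game", "launch", "game", "studio"],
                               ["game", "launch", "delay"])
```

In this data "game" and "launch" are shared, while "studio" and "delay" each appear in only one text.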

Usually, if two news articles report the same event, their texts are very likely to be highly similar. However, if two such similar news articles differ in length, the calculated similarity will be small, which does not reflect the actual situation.

To reduce the influence of text length on similarity, in some exemplary embodiments, a similarity penalty item associated with text length may be added when calculating the text similarity. The similarity penalty item may be calculated according to the respective text lengths of the first text and the second text.

In some optional embodiments, when calculating the similarity penalty item, the absolute value of the text length difference between the first text and the second text can be calculated. If the absolute value of the text length difference is greater than or equal to a set threshold, the product of the absolute value of the text length difference and a set coefficient α can be used as the similarity penalty item. If the absolute value of the text length difference is smaller than the set threshold, a small fixed value, which may be 0, can be used as the similarity penalty item. The calculation of the similarity penalty item can be expressed by the following formula:

$$F = \begin{cases} \alpha \cdot |L_a - L_b|, & |L_a - L_b| \ge \gamma \\ 0, & |L_a - L_b| < \gamma \end{cases} \qquad \text{(Formula 3)}$$

In Formula 3, La represents the text length of the first text, Lb represents the text length of the second text, and γ is a preset threshold. La may be represented by the number of elements contained in the first text entry set, and Lb by the number of elements contained in the second text entry set. α represents the coefficient of the penalty item, and the values of α and γ are empirical values. The value of α is positively correlated with the absolute value of the text length difference: the larger the text length difference, the larger the value of α. In this way, the influence of text length on the similarity calculation result can be mitigated.
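A minimal Python sketch of the similarity penalty item described for Formula 3 is given below. The function name `similarity_penalty` and the default values of `alpha` and `gamma` are assumptions chosen for illustration, and the fixed value used below the threshold is taken to be 0, which the description gives as one option.

```python
def similarity_penalty(len_a, len_b, alpha=0.01, gamma=100):
    """Similarity penalty item F for two texts of lengths len_a and len_b.

    alpha: penalty coefficient (empirical value; 0.01 is only an example).
    gamma: threshold on the absolute length difference (empirical value).
    """
    diff = abs(len_a - len_b)
    if diff >= gamma:
        return alpha * diff
    # Below the threshold a small fixed value is used; 0 is assumed here.
    return 0.0


if __name__ == "__main__":
    # Length difference of 1800 entries with a threshold of 1000.
    print(similarity_penalty(2000, 200, alpha=0.01, gamma=1000))
    # Length difference of 50 entries stays below the threshold.
    print(similarity_penalty(250, 200, alpha=0.01, gamma=1000))
```

The lengths passed in would typically be the numbers of elements in the first and second text entry sets, as stated above.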

Among the massive news texts in the database, when the text length difference between the longest news text and the shortest news text is on the order of hundreds of words, the value of γ can be on the order of hundreds; when that difference is on the order of thousands of words, the value of γ can be on the order of thousands. For example, if the shortest news in the database has only 200 characters and the longest has 2000 characters, γ can be on the order of thousands.

α can be determined according to the actual text length difference. If the text length difference is large, a larger value can be selected for α; if it is small, a smaller value can be selected, so as to minimize the impact of the text length difference on the similarity calculation. For example, the value of α may be 0.01, 0.05, 0.1, etc., which will not be repeated here.

Next, the text similarity can be calculated according to the first frequency of occurrence, the second frequency of occurrence, the third frequency of occurrence, the fourth frequency of occurrence and the similarity penalty item.

In some exemplary embodiments, the smaller of the first frequency of occurrence and the second frequency of occurrence may be determined, and the first, second, third and fourth frequencies of occurrence may be summed to obtain the total frequency.

The similarity penalty item can then be added to the total frequency to update the total frequency. The text similarity can be obtained from the ratio of the smaller frequency to the updated total frequency.

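The above calculation process can be expressed by the following formula:

$$S_2 = \frac{2 \sum_{i \in N} \min\bigl(\mathrm{count}(i, a),\ \mathrm{count}(i, b)\bigr)}{\sum_{i \in N} \bigl(\mathrm{count}(i, a) + \mathrm{count}(i, b)\bigr) + \sum_{j \in M} \bigl(\mathrm{count}(j, a) + \mathrm{count}(j, b)\bigr) + F} \qquad \text{(Formula 4)}$$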

In Formula 4, N represents the set of same text entries and i represents the i-th same text entry; M represents the set of different text entries and j represents the j-th different text entry; a represents the first text entry set and b represents the second text entry set. min() is the function that takes the minimum value, and count() is the function that counts the frequency of an entry. F represents the similarity penalty item. The coefficient 2 in the numerator ensures that the maximum value of the similarity calculation result S2 is 1, and min() reduces the influence on the similarity of terms that appear frequently in long texts.
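The calculation described for Formula 4 can be sketched in Python as follows. This is one possible reading under stated assumptions, not a definitive implementation: the function name `text_similarity` is hypothetical, the minimum is taken per same text entry (consistent with the remark that min() limits the influence of terms that are frequent in long texts), and the similarity penalty item F is passed in as a precomputed number.

```python
from collections import Counter


def text_similarity(entries_a, entries_b, penalty=0.0):
    """Text similarity S2 of two segmented texts.

    entries_a / entries_b: entry lists of the first and second text.
    penalty: similarity penalty item F (assumed to be computed elsewhere).
    """
    count_a, count_b = Counter(entries_a), Counter(entries_b)
    same = set(count_a) & set(count_b)            # set N
    different = (set(count_a) | set(count_b)) - same  # set M

    # Numerator: per-entry minimum over the same text entries (assumption).
    smaller = sum(min(count_a[e], count_b[e]) for e in same)

    # Denominator: sum of all four frequencies of occurrence plus F.
    total = sum(count_a[e] + count_b[e] for e in same)
    total += sum(count_a[e] + count_b[e] for e in different)
    total += penalty

    return 2 * smaller / total if total else 0.0


if __name__ == "__main__":
    a = ["x", "x", "y", "z"]
    b = ["x", "y", "y", "w"]
    print(text_similarity(a, b))               # no length penalty
    print(text_similarity(a, b, penalty=2.0))  # penalty lowers the score
```

For two identical texts with no penalty, the numerator equals the denominator, so the result is 1, which matches the stated role of the coefficient 2.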

Based on the above implementations, when calculating the similarity of news, the titles and the texts of the news are processed separately, which can reduce the influence of text length differences on the similarity to a certain extent and helps to obtain a more accurate similarity. In addition, a penalty item related to text length is added when calculating the similarity. When the lengths of the two news articles to be recognized differ greatly, the influence of text length on the similarity calculation can be further reduced and the accuracy of the literal similarity calculation can be improved.

It should be noted that the subject of execution of each step of the method provided in the foregoing embodiments may be the same device, or the method may also be executed by different devices. For example, the execution subject of steps 201 to 204 may be device A; for another example, the execution subject of steps 201 and 202 may be device A, and the execution subject of step 203 may be device B; and so on.

In addition, some of the processes described in the above embodiments and accompanying drawings include multiple operations that appear in a specific order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein or executed in parallel. The serial numbers of the operations, such as 201 and 202, are only used to distinguish different operations and do not themselves represent any execution order. Additionally, these processes can include more or fewer operations, and these operations can be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not limit "first" and "second" to being of different types.

Fig. 4 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application, and the electronic device is suitable for executing the method for calculating news popularity provided by the foregoing embodiments. As shown in FIG. 4 , the electronic device includes: a memory 401 , a processor 402 and a communication component 403 .

The memory 401 is used to store computer programs, and can be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.

The memory 401 can be realized by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.

The processor 402, coupled with the memory 401, is used to execute the computer program in the memory 401, so as to: obtain, through the communication component 403, a news set corresponding to an event; determine, from the news set, a plurality of news whose release interval and release time duration meet set conditions; determine respective popularity weights of the plurality of news according to the publishing organizations corresponding to the plurality of news; and calculate the news popularity corresponding to the event according to the respective popularity weights of the plurality of news, the release interval, and the release time duration.

Further optionally, when the processor 402 determines, from the news set, a plurality of news whose release interval and release time duration meet the set conditions, it is specifically configured to: sort the release times of the news in the news set in chronological order to obtain a release time sequence; slide a first sliding window over the release time sequence to obtain multiple time windows, the window length of the first sliding window being a set time span; determine, from the multiple time windows, a time window in which the number of news meets a set quantity requirement as the target time window; and intercept, from the target time window, the plurality of news whose release interval meets a set interval requirement.
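The first sliding window step can be illustrated with the following Python sketch. Several details are assumptions made only for the example: the function name `find_target_time_window` is hypothetical, release times are plain numbers (for example hours since some reference point), a window is anchored at each release time, the quantity requirement is read as "at least `min_count` news", and the first qualifying window is returned.

```python
from bisect import bisect_right


def find_target_time_window(release_times, span, min_count):
    """Return the release times inside the first window that qualifies.

    release_times: release times of the news in the news set.
    span: window length of the first sliding window (set time span).
    min_count: set quantity requirement (assumed to be a lower bound).
    """
    times = sorted(release_times)  # release time sequence
    for start_idx, start in enumerate(times):
        # All news released within [start, start + span] fall in this window.
        end_idx = bisect_right(times, start + span)
        window = times[start_idx:end_idx]
        if len(window) >= min_count:
            return window
    return []  # no window meets the quantity requirement


if __name__ == "__main__":
    print(find_target_time_window([10, 0, 1, 2, 30], span=3, min_count=3))
```

Because the sequence is sorted once, each window boundary is found with a binary search instead of a rescan of the whole sequence.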

Further optionally, when the processor 402 intercepts, from the target time window, the plurality of news whose release interval meets the set interval requirement, it is specifically configured to: slide a second sliding window within the target time window to obtain a plurality of sub-windows, the length of the second sliding window being a set quantity length; calculate the average interval duration of the news contained in each of the plurality of sub-windows; and determine, according to the average interval duration, a target sub-window from the plurality of sub-windows, the average interval duration of the news contained in the target sub-window meeting the set interval requirement.
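A possible Python sketch of the second sliding window step is shown below. It rests on assumptions that the description does not fix: the function name `find_target_sub_window` is hypothetical, the window length is read as a number of news items (`size`), the interval requirement is read as "average interval no greater than `max_avg_interval`", and among qualifying sub-windows the one with the smallest average interval is chosen.

```python
def find_target_sub_window(times, size, max_avg_interval):
    """Return the sub-window of `size` news with the smallest average interval.

    times: sorted release times inside the target time window.
    size: length of the second sliding window, counted in news items (>= 2).
    max_avg_interval: set interval requirement (assumed upper bound).
    """
    if size < 2:
        raise ValueError("size must be at least 2 to form an interval")

    best_avg, best_sub = None, []
    for i in range(len(times) - size + 1):
        sub = times[i:i + size]
        # For sorted times, the mean gap between neighbours is span / (n - 1).
        avg = (sub[-1] - sub[0]) / (size - 1)
        if avg <= max_avg_interval and (best_avg is None or avg < best_avg):
            best_avg, best_sub = avg, sub
    return best_sub


if __name__ == "__main__":
    print(find_target_sub_window([0, 5, 6, 7, 20], size=3, max_avg_interval=2))
```

In the example, the sub-windows have average intervals of 3, 1 and 7, so only the middle one satisfies the requirement.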

Further optionally, when the processor 402 calculates the news popularity corresponding to the event according to the respective popularity weights of the plurality of news and the time intervals between the plurality of news, it is specifically configured to: determine the time interval of each news in the target sub-window relative to the adjacent previous news; calculate an exponential term for each news, using the time interval as the exponent of a specified base; weight the exponential terms of the news according to the respective popularity weights of the news to obtain a weighted score; and calculate the ratio of the weighted score to the length of the second sliding window as the news popularity corresponding to the event.
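The popularity calculation above can be sketched in Python as follows. The concrete choices are assumptions for illustration only: the function name `news_popularity` is hypothetical, the specified base is taken as 0.5 so that shorter intervals yield larger exponential terms, the first news (which has no previous news) is given an interval of 0, and the length of the second sliding window is taken to be the number of news passed in.

```python
def news_popularity(release_times, weights, base=0.5):
    """News popularity of an event from the news in the target sub-window.

    release_times: sorted release times of the news in the sub-window.
    weights: popularity weight of each news (e.g. by publishing organization).
    base: specified base of the exponential term (0.5 is an assumption).
    """
    if not release_times or len(release_times) != len(weights):
        raise ValueError("release_times and weights must be non-empty and equal in length")

    # Interval of each news relative to the adjacent previous news;
    # the first news is assumed to have an interval of 0.
    intervals = [0.0] + [
        later - earlier
        for earlier, later in zip(release_times, release_times[1:])
    ]

    # Exponential term per news, weighted by its popularity weight.
    weighted_score = sum(w * base ** dt for w, dt in zip(weights, intervals))

    # Ratio of the weighted score to the window length (number of news).
    return weighted_score / len(release_times)


if __name__ == "__main__":
    print(news_popularity([0.0, 1.0, 2.0], [1.0, 1.0, 1.0]))
```

With a base between 0 and 1, news released in quick succession by highly weighted organizations raises the score, while long gaps contribute little.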

Further optionally, when acquiring the news set corresponding to the event, the processor 402 is specifically configured to: collect news data; for a first news text and a second news text in the news data, calculate the text similarity between the first news text and the second news text; analyze the coincidence degree of the news elements of the first news text and the news elements of the second news text to obtain a news element coincidence degree; and if the text similarity and the news element coincidence degree meet set conditions, classify the first news text and the second news text into the news set of the same event.

Further optionally, when calculating the text similarity between the first news text and the second news text, the processor 402 is specifically configured to: determine a first title and a first text included in the first news text, and a second title and a second text included in the second news text; calculate the title similarity between the first title and the second title according to the respective texts corresponding to the first title and the second title; calculate the text similarity between the first text and the second text according to the respective texts corresponding to the first text and the second text; and fuse the title similarity and the text similarity to obtain the similarity between the first news text and the second news text.

Further optionally, when calculating the text similarity between the first text and the second text according to the respective texts corresponding to the first text and the second text, the processor 402 is specifically configured to: perform word segmentation on the first text and the second text to obtain a first text entry set and a second text entry set; determine the intersection of the first text entry set and the second text entry set to obtain the same text entries; determine the entries in the first text entry set and the second text entry set other than the same text entries as the different text entries; calculate the frequencies of occurrence of the same text entries in the first text entry set and the second text entry set, respectively, to obtain the first frequency of occurrence and the second frequency of occurrence; calculate the frequencies of occurrence of the different text entries in the first text entry set and the second text entry set, respectively, to obtain the third frequency of occurrence and the fourth frequency of occurrence; calculate a similarity penalty item according to the respective text lengths of the first text and the second text; and calculate the text similarity according to the first frequency of occurrence, the second frequency of occurrence, the third frequency of occurrence, the fourth frequency of occurrence, and the similarity penalty item.

Further optionally, when the processor 402 analyzes the coincidence degree of the news elements of the first news text and the news elements of the second news text to obtain the news element coincidence degree, it is specifically configured to: extract time elements, location elements and subject elements from the first news text and the second news text, respectively; calculate the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements of the first news text and the second news text; and take the sum of the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements as the news element coincidence degree.
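The summation of element coincidence degrees can be illustrated with the Python sketch below. How each individual coincidence degree is measured is not specified in this passage, so the sketch assumes a Jaccard overlap of the extracted element sets; the function names `jaccard` and `news_element_coincidence` and the dictionary layout are likewise hypothetical, and element extraction is assumed to have been done already.

```python
ELEMENT_TYPES = ("time", "location", "subject")


def jaccard(items_a, items_b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def news_element_coincidence(elements_a, elements_b):
    """Sum of the time, location and subject coincidence degrees.

    elements_a / elements_b: dicts mapping each element type to the
    elements extracted from the first and second news text.
    """
    return sum(
        jaccard(elements_a.get(kind, ()), elements_b.get(kind, ()))
        for kind in ELEMENT_TYPES
    )


if __name__ == "__main__":
    first = {"time": ["2021-06-25"], "location": ["Beijing"], "subject": ["A", "B"]}
    second = {"time": ["2021-06-25"], "location": ["Shanghai"], "subject": ["A"]}
    print(news_element_coincidence(first, second))
```

Under this assumption each per-element degree lies between 0 and 1, so the summed news element coincidence degree lies between 0 and 3.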

The embodiment of the present application also provides an electronic device, including: a memory and a processor; the memory is used to store one or more computer instructions; the processor is used to execute the one or more computer instructions to perform the steps in the method provided in the embodiment of the present application.

The embodiment of the present application also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the steps in the method provided in the embodiment of the present application can be implemented.

In the method for calculating news popularity provided by the embodiment of the present application, after the news set of the event is obtained, a plurality of news whose release interval and release time duration meet the set requirements are selected, and the popularity weights of the news are determined according to the publishing organizations corresponding to the news. By comprehensively considering the release interval, the release time duration, and the publishing organizations of the news, the multi-dimensional information of the news can be fully utilized to calculate an accurate news popularity.

Further, as shown in FIG. 4 , the electronic device further includes: a display component 404 , a power supply component 405 , an audio component 406 and other components. FIG. 4 only schematically shows some components, which does not mean that the electronic device only includes the components shown in FIG. 4 .

The communication component 403 is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G or 5G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

Wherein, the display component 404 includes a screen, and the screen may include a liquid crystal display component (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.

Wherein, the power supply component 405 provides power for various components of the device where the power supply component is located. A power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which the power supply component resides.

Wherein, the audio component 406 may be configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC), which is configured to receive an external audio signal when the device on which the audio component is located is in an operation mode, such as a calling mode, a recording mode, and a speech recognition mode. The received audio signal may be further stored in a memory or sent via a communication component. In some embodiments, the audio component further includes a speaker for outputting audio signals.

In this embodiment, after the event news set is obtained, a plurality of news whose release interval and release time duration meet the set requirements are selected, and the popularity weight of the news is determined according to the release organization corresponding to the news. By comprehensively considering the news release interval, the duration of the release time, and the news release organization, the multi-dimensional information of the news can be fully utilized to calculate and obtain accurate news popularity.

Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed, the steps that can be executed by the electronic device in the above method embodiments can be implemented.

The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components in the electronic device according to the embodiments of the present invention. The present invention can also be implemented as programs/instructions (e.g., computer programs/instructions and computer program products) of devices or means for performing part or all of the methods described herein. Such programs/instructions for implementing the present invention may be stored on a computer-readable medium, or may exist in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can be implemented by any method or technology for storage of information. Information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic cassettes, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by computing devices.

FIG. 5 schematically shows a computer device/equipment/system that can implement the method for calculating news popularity according to the present invention. The computer device/equipment/system includes a processor 510 and a computer-readable medium in the form of a memory 520 . Memory 520 is one example of a computer readable medium having storage space 530 for storing computer programs/instructions 531 . When the computer program/instruction 531 is executed by the processor 510, various steps in the news popularity calculation method described above can be realized.

Fig. 6 schematically shows a block diagram of a computer program product implementing the method according to the invention. The computer program product includes a computer program/instruction 610. When the computer program/instruction 610 is executed by a processor such as the processor 510 shown in FIG. 5, each step of the news popularity calculation method described above can be implemented.

The foregoing describes certain embodiments of the specification which, together with other embodiments, are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily follow the particular order shown, or sequential order, to achieve desirable results. Multitasking and parallel processing are also possible or advantageous in certain embodiments.

It should also be noted that the terms "comprises", "includes" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising said element.

It should be understood that the above-mentioned embodiments are only for the purpose of illustrating the present invention rather than limiting the present invention. Without departing from the basic spirit and characteristics of the present invention, those skilled in the art can implement the present invention in other ways. The scope of the present invention shall be based on the appended claims, and any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of one or more embodiments of the present specification shall be covered therein.

Claims

1. A method for calculating news popularity, characterized by comprising:

obtaining a news set corresponding to an event;

determining, from the news set, a plurality of news whose release interval and release time duration meet set conditions;

determining respective popularity weights of the plurality of news according to the publishing organizations corresponding to the plurality of news; and

calculating the news popularity corresponding to the event according to the respective popularity weights of the plurality of news, the release interval, and the release time duration.
2. The method according to claim 1, characterized in that determining, from the news set, a plurality of news whose release interval and release time duration meet the set conditions comprises:

sorting the release times of the news in the news set in chronological order to obtain a release time sequence;

sliding a first sliding window over the release time sequence to obtain multiple time windows, the window length of the first sliding window being a set time span;

determining, from the multiple time windows, a time window in which the number of news meets a set quantity requirement as the target time window; and

intercepting, from the target time window, the plurality of news whose release interval meets a set interval requirement.
3. The method according to claim 2, wherein intercepting, from the target time window, the plurality of news whose release interval meets the set interval requirement comprises:

sliding a second sliding window within the target time window to obtain a plurality of sub-windows, the length of the second sliding window being a set quantity length;

calculating the average interval duration of the news contained in each of the plurality of sub-windows; and

determining, according to the average interval duration, a target sub-window from the plurality of sub-windows, the average interval duration of the news contained in the target sub-window meeting the set interval requirement.
4. The method according to claim 3, wherein calculating the news popularity corresponding to the event according to the respective popularity weights of the plurality of news and the time intervals between the plurality of news comprises:

determining the time interval of each news in the target sub-window relative to the adjacent previous news;

calculating an exponential term for each news, using the time interval as the exponent of a specified base;

weighting the exponential terms of the news according to the respective popularity weights of the news to obtain a weighted score; and

calculating the ratio of the weighted score to the length of the second sliding window as the news popularity corresponding to the event.
5. The method according to any one of claims 1-4, wherein obtaining a news set corresponding to an event comprises:

collecting news data;

for a first news text and a second news text in the news data, calculating the text similarity between the first news text and the second news text;

analyzing the coincidence degree of the news elements of the first news text and the news elements of the second news text to obtain a news element coincidence degree; and

if the text similarity and the news element coincidence degree meet set conditions, classifying the first news text and the second news text into the news set of the same event.
6. The method according to claim 5, wherein calculating the text similarity between the first news text and the second news text comprises:

determining a first title and a first text included in the first news text, and a second title and a second text included in the second news text;

calculating the title similarity between the first title and the second title according to the respective texts corresponding to the first title and the second title;

calculating the text similarity between the first text and the second text according to the respective texts corresponding to the first text and the second text; and

fusing the title similarity and the text similarity to obtain the similarity between the first news text and the second news text.
7. The method according to claim 6, wherein calculating the text similarity between the first text and the second text according to the respective texts corresponding to the first text and the second text comprises:

performing word segmentation on the first text and the second text to obtain a first text entry set and a second text entry set;

determining the intersection of the first text entry set and the second text entry set to obtain the same text entries;

determining the entries in the first text entry set and the second text entry set other than the same text entries as the different text entries;

calculating the frequencies of occurrence of the same text entries in the first text entry set and the second text entry set, respectively, to obtain the first frequency of occurrence and the second frequency of occurrence;

calculating the frequencies of occurrence of the different text entries in the first text entry set and the second text entry set, respectively, to obtain the third frequency of occurrence and the fourth frequency of occurrence;

calculating a similarity penalty item according to the respective text lengths of the first text and the second text; and

calculating the text similarity according to the first frequency of occurrence, the second frequency of occurrence, the third frequency of occurrence, the fourth frequency of occurrence and the similarity penalty item.
8. The method according to claim 5, characterized in that analyzing the coincidence degree of the news elements of the first news text and the news elements of the second news text to obtain the news element coincidence degree comprises:

extracting time elements, location elements, and subject elements from the first news text and the second news text, respectively;

calculating the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements of the first news text and the second news text; and

taking the sum of the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements as the news element coincidence degree.
9. An electronic device, characterized in that it includes: a memory and a processor;

the memory is used to store one or more computer instructions;

the processor is configured to execute the one or more computer instructions to perform the steps in the method of any one of claims 1-8.
10. A computer device/equipment/system, comprising a memory, a processor, and computer programs/instructions stored on the memory, wherein when the processor executes the computer programs/instructions, the steps of the method for calculating news popularity according to any one of claims 1-8 are implemented.
11. A computer-readable medium, on which computer programs/instructions are stored, wherein when the computer programs/instructions are executed by a processor, the steps of the method for calculating news popularity according to any one of claims 1-8 are implemented.
12. A computer program product, including computer programs/instructions, wherein when the computer programs/instructions are executed by a processor, the steps of the method for calculating news popularity according to any one of claims 1-8 are implemented.