WO2022267325A1 - News popularity calculation method, device and storage medium - Google Patents

News popularity calculation method, device and storage medium

Info

Publication number
WO2022267325A1
Authority
WO
WIPO (PCT)
Prior art keywords
news
text
similarity
time
popularity
Prior art date
Application number
PCT/CN2021/132517
Other languages
French (fr)
Chinese (zh)
Inventor
计明杰
薛晓舟
蔡承蒙
陈邦忠
Original Assignee
完美世界控股集团有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 完美世界控股集团有限公司
Publication of WO2022267325A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Definitions

  • the present application relates to the technical field of the Internet, and in particular to a news popularity calculation method, device and storage medium.
  • a method for calculating news popularity, including: obtaining a news set corresponding to an event; determining, from the news set, a plurality of news items whose release interval and release time duration meet set conditions; determining the respective popularity weights of the plurality of news items according to their corresponding publishing organizations; and calculating the news popularity corresponding to the event according to the respective popularity weights of the plurality of news items, the release interval and the release time duration.
  • a computer device/equipment/system including a memory, a processor, and computer programs/instructions stored on the memory, wherein the steps of the above news popularity calculation method are realized when the processor executes the computer programs/instructions.
  • a computer-readable medium on which computer programs/instructions are stored, and when the computer programs/instructions are executed by a processor, the steps of the above-mentioned method for calculating news popularity are realized.
  • a computer program product including computer programs/instructions, and when the computer programs/instructions are executed by a processor, the steps of the above-mentioned method for calculating news popularity are realized.
  • the beneficial effect of the present invention is: after obtaining the news set of the event, select a plurality of news whose release interval and release time duration meet the set requirements, and determine the popularity weight of the news according to the release organization corresponding to the news.
  • the multi-dimensional information of the news can be fully utilized to calculate and obtain accurate news popularity.
  • FIG. 1 is a schematic flowchart of a method for calculating news popularity provided by an exemplary embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for calculating news popularity provided by another exemplary embodiment of the present application
  • FIG. 3 is a schematic flowchart of a method for identifying similar news provided by an exemplary embodiment of the present application
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application.
  • Figure 5 schematically shows a block diagram of a computer device/equipment/system for implementing the method according to the present invention.
  • Fig. 6 schematically shows a block diagram of a computer program product implementing the method according to the invention.
  • Fig. 1 is a schematic flowchart of a method for calculating news popularity provided by an exemplary embodiment of the present application. As shown in Fig. 1, the method includes:
  • Step 101 acquiring a news set corresponding to an event.
  • Step 102 from the news collection, determine a plurality of news whose release interval and release time duration meet the set conditions.
  • Step 103 according to the publishing organizations corresponding to the multiple news, determine the popularity weight of each of the multiple news.
  • Step 104 Calculate the news popularity corresponding to the event according to the popularity weights of the plurality of news, the release interval, and the release time duration.
  • an event refers to an object of a news report.
  • If the importance of an event or the persistence of its topic is high, the number of news items reporting the event is also large. Analyzing the news popularity of events is helpful for identifying and recommending hot topics.
  • the news set corresponding to the event includes a plurality of news articles reporting the event.
  • a large amount of news data can be analyzed based on news classification and aggregation to obtain news sets corresponding to different events.
  • If the release time or the quantity of the news is considered alone, an accurate popularity result cannot be obtained. For example, if the number of news items about a certain event is large but the time intervals between them are relatively large, the news can be considered less popular. Similarly, if multiple news articles about a certain event appear within a very short time interval but no other news continues to report on the event, the popularity of the news can also be considered low. In addition, if most of the news about a certain event comes from less authoritative news organizations, the news can likewise be considered less popular.
  • this embodiment comprehensively considers news release intervals, release time durations, and news release agencies when calculating news heat, so as to make full use of multi-dimensional information of news.
  • the release interval of news refers to the time difference between two adjacent news releases and indicates the frequency of news releases; the release time duration indicates the persistence, in the time dimension, of news reports about the event.
  • News publishers refer to sources of news, such as portal websites, magazines, newspapers, and so on.
  • the popularity weight of the news can be determined according to the publishing agency corresponding to the news, and then the influence of the publishing agency on the news popularity can be considered to improve the accuracy of the news popularity calculation result.
  • a large number of news can be classified and aggregated to obtain the correspondence between news and events. That is, take the event as the dimension to filter out the news corresponding to the event.
  • An optional implementation manner of filtering news corresponding to an event will be exemplarily described below by taking any event as an example.
  • news data may be collected, and the news data may include news from multiple different news release organizations.
  • the text similarity between the first news text and the second news text can be calculated; the coincidence between the news elements of the first news text and the news elements of the second news text is analyzed to obtain the coincidence degree of news elements; if the text similarity and the coincidence degree of news elements meet the set conditions, the two news texts are assigned to the news set of the same event.
  • the first title and the first text contained in the first news text, and the second title and the second text contained in the second news text, can be determined; the title similarity between the first title and the second title is calculated according to the corresponding texts of the first title and the second title; the text similarity between the first text and the second text is calculated according to the corresponding texts of the first text and the second text; the title similarity and the text similarity are fused to obtain the similarity between the first news text and the second news text.
  • the elements of news are the basic components of news.
  • the six elements of news are commonly used, referring to: time, place, person, cause, process, and result of an event.
  • the main entities in the news can be extracted, the main entities include: time, place, person (or organization, etc.), etc., as shown in FIG. 2 .
  • time elements, location elements and subject elements can be extracted from the first news text and the second news text respectively; the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements of the first news text and the second news text are calculated; the sum of the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements is taken as the coincidence degree of news elements.
  • the subject elements include the people, objects, organizations and so on described in the news.
  • for example, the three elements of news A are (time 1, location 1, person 1), and the three elements of news B are (time 1, location 1, person 2); that is, the time element of news A is the same as that of news B, and the location element of news A is the same as that of news B, while the subject elements differ, so the overlap degree of news elements can be considered as 2/3.
  • the text similarity and the element overlap meet the set conditions, which may include: the text similarity is greater than a set first similarity threshold, and the element overlap is greater than a set second similarity threshold.
  • the first similarity threshold and the second similarity threshold can be set according to actual needs. For example, the first similarity threshold can be 80% or 90%, and the second similarity threshold can be 2/3 or 1, which is not limited in this embodiment.
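The element comparison and the two-threshold test can be sketched as follows. This is a minimal illustration, not the application's implementation: the `NewsElements` container, the function names and the exact-match comparison of elements are assumptions, and the overlap is computed as the fraction of matching elements so that it agrees with the 2/3 example above.

```python
from typing import NamedTuple


class NewsElements(NamedTuple):
    """Hypothetical container for the three elements extracted from one news text."""
    time: str
    location: str
    subject: str


def element_overlap(a: NewsElements, b: NewsElements) -> float:
    """Fraction of the three elements (time, location, subject) that coincide."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def same_event(text_sim: float, overlap: float,
               sim_threshold: float, overlap_threshold: float) -> bool:
    """Both scores must exceed their thresholds for the texts to share an event."""
    return text_sim > sim_threshold and overlap > overlap_threshold


news_a = NewsElements("time 1", "location 1", "person 1")
news_b = NewsElements("time 1", "location 1", "person 2")
overlap = element_overlap(news_a, news_b)  # two of the three elements match
```

Because the comparison is a strict "greater than", a second threshold of exactly 2/3 would reject the pair above; whether the boundary should be inclusive is left open here.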
  • the news in the news data can be divided into news sets of different events.
  • news A and news B can be divided into the news set of event 1
  • news C, news D, news E, and news F can be divided into the news set of event 2.
  • n pieces of news can be intercepted from the news collection, and the specific analysis is then performed on these n pieces of news.
  • the release time of the news in the news collection may be sorted first according to chronological order to obtain a release time sequence.
  • a sliding window is determined, and the length of the sliding window is the time span.
  • this sliding window is described as the first sliding window.
  • the length of the first sliding window is 1 hour, 2 hours, 24 hours and so on.
  • the first sliding window is used to slide on the release time series to obtain multiple time windows.
  • multiple time windows have the same time span, and each time window contains one or more news.
  • the time window in which the quantity of news meets the set quantity requirement can be determined as the target time window.
  • the set quantity requirement may be: the maximum quantity, or a quantity greater than a certain quantity threshold, which is not limited in this embodiment.
  • the time window with the largest number of news is determined as the target time window.
  • the average time interval of the news in each time window can be calculated, and the time window with a smaller average time interval is taken as the target time window, which will not be described in detail here.
  • the set interval requirement may be: the average time interval is the smallest, or the average time interval is smaller than a certain time threshold, which is not limited in this embodiment.
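The first sliding window can be sketched as below, assuming the "maximum quantity" requirement and assuming that candidate windows are anchored at news release times. The function name and the calendar date are illustrative only; the release times reuse the 10:00 to 10:30 values that appear later in the description.

```python
from datetime import datetime, timedelta


def target_time_window(release_times: list[datetime], span: timedelta) -> list[datetime]:
    """Return the release times inside the densest window of length `span`."""
    times = sorted(release_times)  # the release time sequence
    best: list[datetime] = []
    start = 0
    for end in range(len(times)):
        # Advance the left edge until the window fits inside the time span.
        while times[end] - times[start] > span:
            start += 1
        if end - start + 1 > len(best):
            best = times[start:end + 1]
    return best


day = datetime(2021, 1, 1)  # arbitrary date, for illustration only
times = [day.replace(hour=10, minute=m) for m in (0, 6, 7, 9, 30)]
window = target_time_window(times, timedelta(minutes=10))
```

With a 10-minute span the first four items fall into one window and the 10:30 item is left out.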
  • a sliding window may be determined, and the length of the sliding window is a set number of lengths.
  • the sliding window may be referred to as the second sliding window.
  • the second sliding window may be used to slide in the target time window to obtain multiple sub-windows.
  • each sub-window has the same number of news, but may have different time intervals.
  • if the window length of the second sliding window is m, the second sliding window can be slid over the n news items in the target time window, selecting m of the n news items at each position; the m news items have different release times.
  • the average interval length of the news contained in the plurality of sub-windows may be calculated, and the target sub-window is determined from the plurality of sub-windows according to the average interval length.
  • the average interval time of news contained in the target sub-window satisfies the set interval requirement. For example, the average interval time of news in the target sub-window is the smallest.
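A minimal sketch of the second sliding window, assuming the "smallest average interval" requirement. The helper names are hypothetical, and the date is arbitrary.

```python
from datetime import datetime, timedelta


def average_interval(times: list[datetime]) -> timedelta:
    """Mean gap between consecutive release times in one sub-window."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def target_sub_window(times: list[datetime], m: int) -> list[datetime]:
    """Slide a window of m news items and keep the one with the smallest mean gap."""
    if not 2 <= m <= len(times):
        raise ValueError("m must be between 2 and the number of news items")
    ordered = sorted(times)
    sub_windows = [ordered[s:s + m] for s in range(len(ordered) - m + 1)]
    return min(sub_windows, key=average_interval)


day = datetime(2021, 1, 1)  # arbitrary date, for illustration only
times = [day.replace(hour=10, minute=m) for m in (0, 6, 7, 9, 30)]
sub = target_sub_window(times, 3)
```

For these five release times and m = 3, the sub-window 10:06, 10:07, 10:09 has the smallest average interval.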
  • the time interval between each news in the target sub-window and the adjacent previous news may be determined.
  • using the time interval as the exponent of a specified base, the exponential term of each news item is calculated, and the exponential terms are weighted according to the respective popularity weights of the news items to obtain a weighted score.
  • the ratio of the weighted score to the length of the second sliding window may be calculated as the news popularity corresponding to the event.
  • the designated base number may be any constant, such as 2, 3, 4, etc., which is not limited in this embodiment.
  • the specified base may be e (approximately 2.7182818284).
  • the calculation process of the above-mentioned news popularity H can refer to the following formula:
  • n: the number of news in the target time window
  • m: the news set in the target sub-window
  • |m|: the length of the second sliding window
  • i: the i-th news in the target sub-window, which has its own popularity weight
  • Interver_i: the time interval between the i-th news and the previous news
  • for example, the target time window contains 5 news items, a1, a2, a3, a4 and a5, whose release times are 10:00, 10:06, 10:07, 10:09 and 10:30, and the weights of their source institutions are 1, 2, 3, 4 and 5 respectively.
  • the length of the second sliding window is 3.
  • the news popularity is:
  • the news release interval, the duration of the release time and the news publishing organization can be considered comprehensively, and the multi-dimensional information of the news can be fully utilized to calculate the accurate news popularity.
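The scoring step described above (an exponential term per news item, weighted by the popularity weight and divided by the length of the second sliding window) might look like the sketch below. Several points are assumptions rather than statements of the application: the exponent is taken as the negative interval so that shorter intervals raise the score, intervals are measured in minutes, the base defaults to e, and the target sub-window is taken to be a2, a3, a4 with intervals of 6, 1 and 2 minutes to the preceding news item.

```python
import math


def news_popularity(weights: list[float], intervals: list[float],
                    base: float = math.e) -> float:
    """Weighted sum of base ** (-interval), divided by the sub-window length.

    The negative sign of the exponent is an assumption of this sketch.
    """
    if not weights or len(weights) != len(intervals):
        raise ValueError("weights and intervals must be non-empty and equally long")
    weighted_score = sum(w * base ** (-dt) for w, dt in zip(weights, intervals))
    return weighted_score / len(weights)


# Assumed target sub-window a2, a3, a4: weights 2, 3, 4; intervals 6, 1, 2 minutes.
h = news_popularity([2, 3, 4], [6, 1, 2])
```

Under these assumptions, densely spaced reports from heavily weighted publishers yield a higher score than the same reports spread further apart.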
  • the news events may be sorted according to the news popularity, or news events with high news popularity may be recommended to the user, which is not limited in this embodiment.
  • News texts refer to texts that report or comment on events, and news texts are usually published in magazines, newspapers, and various websites.
  • similarity recognition can be performed on the massive news texts, and similar news texts can be classified or deduplicated, etc.
  • when similarity identification is performed on a large number of news texts, the similarity between any two news texts can be calculated.
  • any two news texts to be identified by similarity are described as a first news text and a second news text.
  • News texts have certain data characteristics. Generally, news texts include at least two parts, ie, a title part and a body part. The title of the news is a general summary or evaluation of the text. Therefore, whether it is a newsletter or a long-form news, when the same content is reported, the similarity between the two titles is usually high. In this embodiment, in order to reduce the impact of text length differences on the similarity, the similarity between news is divided into two parts, that is, the similarity between titles and the similarity between texts.
  • the title and text of the first news text are described as the first title and the first text, and the title and text of the second news text are described as the second title and the second text.
  • the similarity between the first title and the second title can be calculated based on the corresponding texts of the first title and the second title, and the similarity between the first text and the second text can be calculated based on the corresponding texts of the first text and the second text.
  • the literal similarity of the text may be calculated, which will be described in detail in subsequent embodiments and will not be repeated here.
  • the similarity between titles is described as title similarity
  • the similarity between texts is described as text similarity.
  • the title similarity and the text similarity are fused to obtain the similarity between the first news text and the second news text.
  • the title similarity and the text similarity may be fused in an arithmetic calculation manner.
  • for example, the average of the title similarity and the text similarity can be calculated as the similarity between two news texts; as another example, the product of the title similarity and the text similarity can be calculated as the similarity between two news texts;
  • as yet another example, the sum of the title similarity and the text similarity can be used as the similarity between two news texts.
  • preset weight coefficients can be set for titles and texts respectively, and a weighted summation of the title similarity and the text similarity is performed according to the preset weight coefficients to obtain the similarity between the first news text and the second news text.
  • for example, if the weight coefficient of the title is w1, the weight coefficient of the text is w2, the similarity of the title is S1 and the similarity of the text is S2, then the similarity between the first news text and the second news text is $w_1 S_1 + w_2 S_2$.
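The weighted fusion is a one-line computation; the sketch below only makes the roles of w1, w2, S1 and S2 explicit. The default weights of 0.5 are an assumption chosen for illustration, not values given in the description.

```python
def fuse_similarity(title_sim: float, text_sim: float,
                    w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted sum of the title similarity S1 and the text similarity S2."""
    return w1 * title_sim + w2 * text_sim


news_similarity = fuse_similarity(0.8, 0.6)           # equal weights
title_heavy = fuse_similarity(0.8, 0.6, w1=0.7, w2=0.3)
```

With equal weights of 0.5 the fusion reduces to the arithmetic mean mentioned earlier.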
  • the title and the text in the news are processed separately: the title similarity is calculated from the texts corresponding to the titles, and the text similarity is calculated from the texts corresponding to the bodies. To a certain extent, this reduces the impact of text length differences on the similarity and helps to calculate a more accurate similarity.
  • the similarity of the news is obtained by fusing the similarity of the title and the similarity of the text, which can quickly obtain the similarity calculation result of the news text, reduce the time cost and calculation cost required to identify similar news, and improve the identification efficiency of similar news .
  • Embodiment 1 Calculate the title similarity between the first title and the second title according to the corresponding texts of the first title and the second title.
  • a keyword extraction operation may be performed on the first title and the second title to obtain a set of keywords contained in the first title and a set of keywords contained in the second title.
  • the set of keywords contained in the first title can be described as a set of first title entries; the set of keywords contained in the second title can be described as a set of second title entries.
  • the keyword extraction operation may include: extracting entries corresponding to entities, entries whose part of speech is a noun, and/or entries whose part of speech is a verb. That is, extract the entry corresponding to the entity, the entry whose part of speech is a noun, and/or the entry whose part of speech is a verb in the first title to obtain a set of entries in the first title; extract the words corresponding to the entity in the second title Items, entries whose part of speech is a noun, and/or entries whose part of speech is a verb, obtain the second title entry set.
  • Entity refers to the things that actually exist in nature that appear in the text corpus.
  • An entity is a specific thing, which can be one thing or a collection of multiple things, such as person names, place names, organization names and other entities.
  • the number of same title entries in the first set of title entries and the second set of title entries can be calculated; here, a same title entry refers to an entry that belongs to both the first set of title entries and the second set of title entries.
  • the title similarity can be determined according to the ratio of the number of the same title terms to the total number of terms included in the first set of title terms and the second set of title terms.
  • the above calculation process can be expressed by the following formula: $S_1 = \dfrac{2\sum_{i \in A} f(i, B)}{|A| + |B|}$
  • A represents the first title entry set, and $|A|$ represents the modulus of the set A, that is, the number of elements in the set A
  • B represents the second title entry set, and $|B|$ represents the modulus of the set B, that is, the number of elements in the set B.
  • i represents the i-th entry in the set A.
  • f(i, B) = 1 if the i-th entry also belongs to the set B, that is, the i-th entry is a same title entry of the A set and the B set; otherwise f(i, B) = 0, that is, the i-th entry is a different title entry of the A set and the B set.
  • the coefficient 2 in the numerator ensures that the maximum value of the similarity calculation result S1 is 1.
  • in this way, the similarity between the titles of two news items, that is, the title similarity, can be calculated.
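As a rough illustration of the title similarity, the sketch below takes two already extracted keyword sets and applies the ratio described above: twice the number of shared entries divided by the total number of entries in both sets. Keyword extraction itself (entities, nouns, verbs) is out of scope here, and the function name and sample keywords are hypothetical.

```python
def title_similarity(entries_a: set[str], entries_b: set[str]) -> float:
    """2 * (number of shared title entries) / (|A| + |B|)."""
    total = len(entries_a) + len(entries_b)
    if total == 0:
        return 0.0  # convention chosen for this sketch when both sets are empty
    shared = sum(1 for entry in entries_a if entry in entries_b)
    return 2 * shared / total


first_title = {"flood", "city", "rescue"}
second_title = {"flood", "city", "relief"}
s1 = title_similarity(first_title, second_title)  # two shared entries out of 3 + 3
```

Identical keyword sets give a similarity of 1, which is the purpose of the coefficient 2 in the numerator.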
  • Embodiment 2 Calculate the text similarity between the first text and the second text according to the corresponding texts of the first text and the second text.
  • word segmentation processing may be performed on the first text and the second text to obtain a set of lexical entries corresponding to the first text and the second text.
  • the entry set corresponding to the first text may be described as a first text entry set
  • the entry set corresponding to the second text may be described as a second text entry set.
  • word segmentation processing refers to segmenting sentences and paragraphs to obtain entries, words, etc. contained in sentences.
  • a stop word removal operation may be performed on the result obtained from word segmentation processing, as shown in FIG. 3 .
  • stop words refer to function words without practical meaning.
  • the same text entries and different text entries in the first set of text entries and the second set of text entries can be obtained.
  • the intersection of the first text entry set and the second text entry set can be determined to obtain the same text entries; after the same text entries are obtained, the entries in the first text entry set and the second text entry set other than the same text entries are regarded as different text entries.
  • the frequency of occurrence of the same text entries in the first text entry set can be calculated to obtain the first frequency of occurrence, and the frequency of occurrence of the same text entries in the second text entry set can be calculated to obtain the second frequency of occurrence.
  • if there are multiple same text entries, their frequencies of occurrence in the first text entry set can be summed to obtain the first frequency of occurrence, and their frequencies of occurrence in the second text entry set can be summed to obtain the second frequency of occurrence.
  • the frequency of occurrence of the different text entries in the first text entry set can be calculated to obtain the third frequency of occurrence, and the frequency of occurrence of the different text entries in the second text entry set can be calculated to obtain the fourth frequency of occurrence.
  • if there are multiple different text entries, their frequencies of occurrence in the first text entry set can be summed to obtain the third frequency of occurrence, and their frequencies of occurrence in the second text entry set can be summed to obtain the fourth frequency of occurrence.
  • a similarity penalty item associated with text length may be further increased during the process of calculating text similarity.
  • the similarity penalty item may be calculated according to the respective text lengths of the first text and the second text.
  • the absolute value of the text length difference between the first text and the second text can be calculated; if the absolute value of the text length difference is greater than or equal to the set threshold, the product of the absolute value of the text length difference and the set coefficient can be used as the similarity penalty item. If the absolute value of the text length difference is smaller than the set threshold, a smaller fixed value may be set as the similarity penalty item, and the fixed value may be 0.
  • the calculation process of the above similarity penalty item can be expressed as: $F = \text{coefficient} \cdot |L_a - L_b|$ if $|L_a - L_b| \ge \text{threshold}$, and $F = 0$ otherwise.
  • La represents the text length of the first text, and Lb represents the text length of the second text; La may be represented by the number of elements contained in the first text entry set, and Lb may be represented by the number of elements contained in the second text entry set.
  • the threshold and the coefficient of the penalty item are empirical values; the threshold is positively correlated with the absolute value of the difference in text length, and the larger the difference in text length, the larger the threshold. In this way, the handling of the influence of the text length on the similarity calculation result can be improved.
  • for example, when the text length difference between the longest news text and the shortest news text is several hundred words, the threshold can be in units of hundreds; when the text length difference between the longest news text and the shortest news text is thousands of words, the threshold can be in units of thousands. For example, if the shortest news in the database is only 200 characters and the longest news is 2000 characters, the threshold can be in units of thousands.
  • the coefficient can be determined according to the actual text length difference: if the text length difference is large, a larger coefficient can be selected; if the text length difference is small, a smaller coefficient can be selected, to minimize the impact of the text length difference on the similarity calculation.
  • the value of the coefficient may be 0.01, 0.05, 0.1, etc., which will not be repeated here.
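The penalty item can be sketched as a small piecewise function. The parameter names `threshold`, `coefficient` and `fixed_value` are placeholders chosen for this sketch; the sample values (a threshold of 1000 and a coefficient of 0.01) are picked from the ranges mentioned above purely for illustration.

```python
def length_penalty(len_a: int, len_b: int, threshold: float,
                   coefficient: float, fixed_value: float = 0.0) -> float:
    """Similarity penalty item F derived from the two text lengths.

    Returns coefficient * |La - Lb| once the length gap reaches the threshold,
    and a small fixed value (0 by default) otherwise.
    """
    gap = abs(len_a - len_b)
    if gap >= threshold:
        return coefficient * gap
    return fixed_value


# Shortest news 200 characters, longest 2000 characters, threshold in thousands.
penalty_large_gap = length_penalty(200, 2000, threshold=1000, coefficient=0.01)
penalty_small_gap = length_penalty(900, 1000, threshold=1000, coefficient=0.01)
```

Only the pair whose lengths differ by at least the threshold receives a non-zero penalty.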
  • the text similarity can be calculated according to the first frequency of occurrence, the second frequency of occurrence, the third frequency of occurrence, the fourth frequency of occurrence and the similarity penalty item.
  • the smaller of the first frequency of occurrence and the second frequency of occurrence may be determined; the first frequency of occurrence, the second frequency of occurrence, the third frequency of occurrence and the fourth frequency of occurrence are summed to obtain the total frequency.
  • the similarity penalty item can be added to the total frequency to update the total frequency. The text similarity can then be obtained from the ratio of the smaller frequency to the updated total frequency.
  • min() represents the function of taking the minimum value
  • count() represents the function of counting the frequency of entries.
  • F represents the similarity penalty item.
  • the coefficient 2 on the numerator is used to ensure that the maximum value of the similarity calculation result S2 is 1, and min() is used to reduce the influence of certain terms that appear frequently in long texts on the similarity.
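Putting the four frequencies and the penalty item together, the text similarity might be computed as in the sketch below. It follows the prose description (twice the smaller of the first and second frequencies, divided by the sum of all four frequencies plus F); the function name and the token lists are illustrative, and word segmentation and stop-word removal are assumed to have been done already.

```python
from collections import Counter


def text_similarity(entries_a: list[str], entries_b: list[str],
                    penalty: float = 0.0) -> float:
    """2 * min(f1, f2) / (f1 + f2 + f3 + f4 + F) over two segmented texts."""
    count_a, count_b = Counter(entries_a), Counter(entries_b)
    same = count_a.keys() & count_b.keys()                   # same text entries
    f1 = sum(count_a[e] for e in same)                       # first frequency
    f2 = sum(count_b[e] for e in same)                       # second frequency
    f3 = sum(n for e, n in count_a.items() if e not in same)  # third frequency
    f4 = sum(n for e, n in count_b.items() if e not in same)  # fourth frequency
    total = f1 + f2 + f3 + f4 + penalty
    return 2 * min(f1, f2) / total if total else 0.0


first_text = ["flood", "city", "flood", "rescue"]
second_text = ["flood", "city", "relief"]
s2 = text_similarity(first_text, second_text)                  # f1=3, f2=2, f3=1, f4=1
s2_penalised = text_similarity(first_text, second_text, penalty=1.0)
```

Taking the minimum in the numerator means that an entry repeated many times in the longer text does not inflate the score, and any positive penalty lowers it further.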
  • the headlines and texts in the news are processed separately, which can reduce the influence of text length differences on the similarity to a certain extent, and is conducive to calculating more accurate similarity.
  • a penalty item related to the length of the text is further added.
  • the execution subject of each step of the method may be the same device, or the method may also be executed by different devices.
  • the execution subject of steps 201 to 204 may be device A; for another example, the execution subject of steps 201 and 202 may be device A, and the execution subject of step 203 may be device B; and so on.
  • Fig. 4 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application, and the electronic device is suitable for executing the method for calculating news popularity provided by the foregoing embodiments.
  • the electronic device includes: a memory 401 , a processor 402 and a communication component 403 .
  • the memory 401 is used to store computer programs, and can be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.
  • the memory 401 can be realized by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the processor 402, coupled with the memory 401, is used to execute the computer program in the memory 401, so as to: obtain the news set corresponding to the event through the communication component 403; determine, from the news set, a plurality of news items whose release interval and release time duration meet set conditions; determine the respective popularity weights of the plurality of news items according to their corresponding publishing organizations; and calculate the news popularity corresponding to the event according to the respective popularity weights of the plurality of news items, the release interval and the release time duration.
  • when the processor 402 determines, from the news set, a plurality of news items whose release interval and release time duration meet the set conditions, it is specifically configured to: sort the release times of the news items in the news set in chronological order to obtain a release time sequence; slide a first sliding window over the release time sequence to obtain multiple time windows, the window length of the first sliding window being a set time span; determine, among the multiple time windows, the time window in which the number of news items meets a set quantity requirement as the target time window; and intercept, from the target time window, the plurality of news items whose release interval meets a set interval requirement.
  • when the processor 402 intercepts, from the target time window, the plurality of news items whose release interval meets the set interval requirement, it is specifically configured to: slide a second sliding window within the target time window to obtain a plurality of sub-windows, the length of the second sliding window being a set quantity length; calculate the average interval duration of the news items contained in each of the plurality of sub-windows; and determine, according to the average interval duration, a target sub-window from the plurality of sub-windows, where the average interval duration of the news items contained in the target sub-window meets the set interval requirement.
  • when the processor 402 calculates the news popularity corresponding to the event according to the respective popularity weights of the plurality of news items and the time intervals between them, it is specifically configured to: determine the time interval of each news item in the target window relative to the adjacent previous news item; calculate an exponential term for each news item, using the time interval as the exponent of a specified base; weight the exponential terms by the respective popularity weights to obtain a weighted score; and calculate the ratio of the weighted score to the length of the second sliding window as the news popularity corresponding to the event.
  • when acquiring the news set corresponding to the event, the processor 402 is specifically configured to: collect news data; calculate the text similarity between a first news text and a second news text in the news data; analyze the coincidence degree of the news elements of the first news text and the news elements of the second news text to obtain a news element coincidence degree; and, if the text similarity and the element coincidence degree satisfy set conditions, classify the first news text and the second news text into the news set of the same event.
  • when calculating the text similarity between the first news text and the second news text, the processor 402 is specifically configured to: determine the first headline and the first body text contained in the first news text, and the second headline and the second body text contained in the second news text; calculate the headline similarity between the first headline and the second headline according to their respective texts; calculate the body similarity between the first body text and the second body text according to their respective texts; and fuse the headline similarity and the body similarity to obtain the similarity between the first news text and the second news text.
  • the processor 402 is specifically configured to: perform word segmentation on the first text and the second text to obtain a first text entry set and a second text entry set; determine the intersection of the first text entry set and the second text entry set to obtain the same text entries; determine the entries in the first text entry set and the second text entry set other than the same text entries as different text entries; calculate the frequencies of occurrence of the same text entries in the first text entry set and in the second text entry set, respectively, to obtain a first frequency of occurrence and a second frequency of occurrence; calculate the frequencies of occurrence of the different text entries in the first text entry set and in the second text entry set, respectively, to obtain a third frequency of occurrence and a fourth frequency of occurrence; calculate a similarity penalty term according to the respective text lengths of the first text and the second text; calculate the text similarity according to the first frequency of occurrence, the second frequency of occurrence
  • when the processor 402 analyzes the coincidence degree of the news elements of the first news text and the news elements of the second news text to obtain the news element coincidence degree, it is specifically configured to: extract time elements, location elements and subject elements from the first news text and the second news text, respectively; calculate the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements of the first news text and the second news text; and take the sum of the coincidence degree of the time elements, the coincidence degree of the location elements and the coincidence degree of the subject elements as the news element coincidence degree.
  • the embodiment of the present application also provides an electronic device, including: a memory and a processor; the memory is used to store one or more computer instructions; the processor is used to execute the one or more computer instructions for: executing Steps in the method provided in the embodiment of the present application.
  • the embodiment of the present application also provides a computer-readable storage medium storing a computer program.
  • When the computer program is executed by a processor, the steps in the method provided in the embodiment of the present application can be implemented.
  • in the method for calculating news popularity provided by the embodiment of the present application, after the news set of the event is obtained, a plurality of news items whose release interval and release time duration meet the set requirements are selected, and the popularity weights of the news items are determined according to their corresponding publishing organizations.
  • the multi-dimensional information of the news can be fully utilized to calculate and obtain accurate news popularity.
  • the electronic device further includes: a display component 404 , a power supply component 405 , an audio component 406 and other components.
  • FIG. 4 only schematically shows some components, which does not mean that the electronic device only includes the components shown in FIG. 4 .
  • the communication component 403 is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices.
  • the device where the communication component is located can access a wireless network based on communication standards, such as WiFi, 2G, 3G, 4G or 5G, or a combination thereof.
  • the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the display component 404 includes a screen, and the screen may include a liquid crystal display component (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
  • the power supply component 405 provides power for various components of the device where the power supply component is located.
  • a power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which the power supply component resides.
  • the audio component 406 may be configured to output and/or input audio signals.
  • the audio component includes a microphone (MIC), which is configured to receive an external audio signal when the device on which the audio component is located is in an operation mode, such as a calling mode, a recording mode, and a speech recognition mode.
  • the received audio signal may be further stored in a memory or sent via a communication component.
  • the audio component further includes a speaker for outputting audio signals.
  • in this embodiment, after the news set of the event is obtained, a plurality of news items whose release interval and release time duration meet the set requirements are selected, and the popularity weights of the news items are determined according to their corresponding publishing organizations.
  • the multi-dimensional information of the news can be fully utilized to calculate and obtain accurate news popularity.
  • an embodiment of the present application further provides a computer-readable storage medium storing a computer program.
  • the computer program When the computer program is executed, the steps that can be executed by the electronic device in the above method embodiments can be implemented.
  • the various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components in the electronic device according to the embodiments of the present invention.
  • the present invention can also be implemented as programs/instructions (eg, computer programs/instructions and computer program products) of devices or means for performing part or all of the methods described herein.
  • Such programs/instructions for implementing the present invention may be stored on a computer-readable medium, or may exist in the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information by any method or technology.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic cassettes, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by computing devices.
  • FIG. 5 schematically shows a computer device/equipment/system that can implement the method for calculating news popularity according to the present invention.
  • the computer device/equipment/system includes a processor 510 and a computer-readable medium in the form of a memory 520 .
  • Memory 520 is one example of a computer readable medium having storage space 530 for storing computer programs/instructions 531 .
  • When the computer program/instruction 531 is executed by the processor 510, the steps of the news popularity calculation method described above can be implemented.
  • Fig. 6 schematically shows a block diagram of a computer program product implementing the method according to the invention.
  • the computer program product includes a computer program/instruction 610.
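  • When the computer program/instruction 610 is executed by a processor such as the processor 510 shown in FIG. 5, each step of the news popularity calculation method described above can be implemented.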

Abstract

Embodiments of the present application provide a news popularity calculation method, a device, and a storage medium. In the news popularity calculation method, after a news set of an event is obtained, multiple pieces of news having a release interval and a release duration that satisfy set requirements are selected from the news set, and popularity weights of the news are determined according to release agencies corresponding to the news. The release interval, the release duration, and the release agencies of the news are comprehensively considered, multi-dimensional information of the news can be fully utilized, and then accurate news popularity is obtained by means of calculation.

Description

News popularity calculation method, device and storage medium
Cross reference
This application claims priority to the Chinese patent application filed on June 25, 2021, with application number "20210711197.X" and titled "News popularity calculation method, device and storage medium", the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the field of Internet technology, and in particular to a news popularity calculation method, device and storage medium.
Background
In the information age, all kinds of information are growing explosively, and news is no exception. Analyzing and filtering large volumes of news to identify hot news, and recommending that news to users, helps users keep up with hot topics and read news more efficiently.
Existing methods for calculating news popularity usually rely on user click counts, comment counts, and similar signals. Such methods depend heavily on user behavior and cannot produce accurate news popularity analysis results. A new solution is therefore needed.
Summary of the invention
The present invention proposes the following technical solutions to overcome, or at least partially solve or mitigate, the above problems:
According to one aspect of the present invention, a news popularity calculation method is provided, including: obtaining a news set corresponding to an event; determining, from the news set, a plurality of news items whose release interval and release time duration meet set conditions; determining the respective popularity weights of the plurality of news items according to the publishing organizations corresponding to them; and calculating the news popularity corresponding to the event according to the respective popularity weights of the plurality of news items, the release interval, and the release time duration.
According to yet another aspect of the present invention, a computer device/equipment/system is provided, including a memory, a processor, and computer programs/instructions stored on the memory, where the processor implements the steps of the above news popularity calculation method when executing the computer programs/instructions.
According to another aspect of the present invention, a computer-readable medium is provided, on which computer programs/instructions are stored, where the computer programs/instructions implement the steps of the above news popularity calculation method when executed by a processor.
According to still another aspect of the present invention, a computer program product is provided, including computer programs/instructions, where the computer programs/instructions implement the steps of the above news popularity calculation method when executed by a processor.
The beneficial effects of the present invention are as follows: after the news set of an event is obtained, a plurality of news items whose release interval and release time duration meet the set requirements are selected from it, and the popularity weights of the news items are determined according to their corresponding publishing organizations. By jointly considering the release interval, the release time duration, and the publishing organization of the news, the multi-dimensional information of the news can be fully utilized, so that an accurate news popularity is calculated.
Description of drawings
The above and various other advantages and benefits of the present invention will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. In the drawings:
FIG. 1 is a schematic flowchart of a news popularity calculation method provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic flowchart of a news popularity calculation method provided by another exemplary embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for identifying similar news provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application;
FIG. 5 schematically shows a block diagram of a computer device/equipment/system for implementing the method according to the present invention; and
FIG. 6 schematically shows a block diagram of a computer program product implementing the method according to the present invention.
Detailed description
The present invention is further described below in conjunction with the accompanying drawings and specific embodiments. The following description only illustrates the basic principle of the present invention and does not limit it.
Analyzing and filtering large volumes of news to identify hot news, and recommending that news to users, helps users keep up with hot topics and read news more efficiently.
Existing methods for calculating news popularity usually rely on user click counts, comment counts, and similar signals. Such methods depend heavily on user behavior and cannot produce accurate news popularity analysis results.
To address the above technical problem, some embodiments of the present application provide a solution. The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a news popularity calculation method provided by an exemplary embodiment of the present application. As shown in FIG. 1, the method includes:
Step 101: Obtain the news set corresponding to an event.
Step 102: Determine, from the news set, a plurality of news items whose release interval and release time duration meet set conditions.
Step 103: Determine the respective popularity weights of the plurality of news items according to the publishing organizations corresponding to them.
Step 104: Calculate the news popularity corresponding to the event according to the respective popularity weights of the plurality of news items, the release interval, and the release time duration.
Here, an event refers to the subject of news reports. When a new event occurs in society, multiple news organizations report on it. When the event is highly important or the topic is long-lived, the number of news items reporting the event is also large. Analyzing the news popularity of events helps identify hot topics and recommend them.
The news set corresponding to an event contains multiple news items reporting that event. Before news popularity is analyzed, a large amount of news data can be analyzed by classifying and aggregating the news to obtain the news sets corresponding to different events.
When calculating news popularity, considering only the release time or only the number of news items does not yield an accurate result. For example, if an event has many news items but the time intervals between them are large, the event can be considered less popular. Likewise, if multiple news items about an event appear within a very short time but no further news follows up on the event, the event can also be considered less popular. In addition, if most of the news about an event comes from less authoritative news organizations, the event can be considered less popular as well.
To obtain an accurate popularity result, this embodiment jointly considers the release interval, the release time duration, and the publishing organization of the news when calculating news popularity, so as to make full use of the multi-dimensional information of the news.
The release interval of news refers to the difference between the release times of two adjacent news items and indicates the release frequency of the news. The release time duration indicates how persistently news reports about the event keep appearing over time. The publishing organization of a news item refers to its source, such as a portal website, a magazine, or a newspaper.
Generally, the more authoritative the publishing organization, the more the news item contributes to popularity. Accordingly, in this embodiment, the popularity weight of a news item can be determined according to its publishing organization. Taking the influence of the publishing organization into account improves the accuracy of the news popularity calculation.
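The following is a minimal sketch of how publisher-based weights could feed a popularity score. It follows the calculation described for the electronic device embodiment (an exponential term per news item with the time interval as the exponent of a specified base, weighted by the popularity weight, divided by the number of news items in the selected window). The weight table, the default weight, the base of 0.5, the use of hours as the interval unit, and a zero interval for the first item are all assumptions made for illustration and do not come from the application.

```python
from datetime import datetime

# Hypothetical weight table: the text only states that more authoritative
# publishers should contribute more, not what the values are.
PUBLISHER_WEIGHTS = {"national_agency": 1.0, "portal": 0.6, "self_media": 0.3}
DEFAULT_WEIGHT = 0.1  # assumed weight for unknown publishers


def news_popularity(news, base=0.5):
    """Score a list of (publish_time, publisher) tuples sorted by time.

    Computes sum(weight_i * base ** gap_i) / len(news), where gap_i is the
    gap in hours to the previous item (0 for the first item, an assumption).
    With a base below 1, larger gaps contribute less to the score.
    """
    if not news:
        return 0.0
    score = 0.0
    prev_time = news[0][0]
    for publish_time, publisher in news:
        gap_hours = (publish_time - prev_time).total_seconds() / 3600.0
        weight = PUBLISHER_WEIGHTS.get(publisher, DEFAULT_WEIGHT)
        score += weight * base ** gap_hours
        prev_time = publish_time
    return score / len(news)


sample = [
    (datetime(2021, 6, 25, 8, 0), "national_agency"),
    (datetime(2021, 6, 25, 9, 0), "portal"),
    (datetime(2021, 6, 25, 11, 0), "self_media"),
]
print(news_popularity(sample))
```

In this sample the three exponential terms are 1.0, 0.6 × 0.5 and 0.3 × 0.25, so closely spaced reports from authoritative sources dominate the score.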
In some optional embodiments, before news popularity is calculated, a large number of news items can be classified and aggregated to obtain the correspondence between news and events. That is, the news corresponding to each event is selected with the event as the dimension. Taking an arbitrary event as an example, an optional implementation for selecting the news corresponding to an event is described below.
Optionally, news data can be collected, and the news data may include news from multiple different publishing organizations. When the news in the news data is classified and aggregated, it can be determined whether any two news items report the same event.
As shown in FIG. 2, taking a first news text and a second news text in the incoming news data as an example, the text similarity between the first news text and the second news text can be calculated. In addition, the coincidence degree of the news elements of the first news text and the news elements of the second news text can be analyzed to obtain a news element coincidence degree. If the text similarity and the element coincidence degree satisfy set conditions, the first news text and the second news text are classified into the news set of the same event.
Optionally, when the text similarity between the first news text and the second news text is calculated, the first headline and the first body text contained in the first news text, and the second headline and the second body text contained in the second news text, can be determined. The headline similarity between the first headline and the second headline is calculated according to their respective texts, and the body similarity between the first body text and the second body text is calculated according to their respective texts. The headline similarity and the body similarity are then fused to obtain the similarity between the first news text and the second news text. Optional implementations for calculating the headline similarity and the body similarity are described in subsequent embodiments and are not detailed here.
News elements are the basic components of news. The six elements of news are commonly used: time, place, person, and the cause, course, and result of the event. In this embodiment, to analyze news popularity, the main entities in the news can be extracted. The main entities include time, place, and person (or organization, etc.), as shown in FIG. 2.
Optionally, when the coincidence degree of the news elements of the first news text and the news elements of the second news text is analyzed, time elements, location elements, and subject elements can be extracted from the first news text and the second news text, respectively. The coincidence degree of the time elements, the coincidence degree of the location elements, and the coincidence degree of the subject elements of the two news texts are calculated, and their sum is taken as the news element coincidence degree. The subject elements include the people, things, organizations, and so on described in the news.
For example, when news A and news B are compared, the three elements of news A are (time 1, place 1, person 1) and the three elements of news B are (time 1, place 1, person 2). That is, the time element of news A is the same as that of news B, and the location element of news A is the same as that of news B, so the news element coincidence degree can be considered to be 2/3.
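The example above can be reproduced with a short sketch. It reads the per-element coincidence degree as 1 for an exact match and 0 otherwise, and divides the sum by the number of elements so that the result matches the 2/3 in the example. Both choices, as well as the dictionary keys, are assumptions for illustration; the text does not specify how a partial match would be scored.

```python
ELEMENT_KEYS = ("time", "place", "subject")  # hypothetical field names


def element_coincidence(elements_a, elements_b):
    """Fraction of the time/place/subject elements that are identical."""
    matches = sum(
        1
        for key in ELEMENT_KEYS
        if elements_a.get(key) is not None
        and elements_a.get(key) == elements_b.get(key)
    )
    return matches / len(ELEMENT_KEYS)


news_a = {"time": "time 1", "place": "place 1", "subject": "person 1"}
news_b = {"time": "time 1", "place": "place 1", "subject": "person 2"}
print(element_coincidence(news_a, news_b))
```

Here the time and place match while the subject differs, which gives two matches out of three elements.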
The text similarity and the element coincidence degree satisfying the set conditions may include: the text similarity is greater than a set first similarity threshold, and the element coincidence degree is greater than a set second similarity threshold. The first similarity threshold and the second similarity threshold can be set according to actual needs. For example, the first similarity threshold may be 80% or 90%, and the second similarity threshold may be 2/3 or 1. This embodiment does not limit these values.
After the above calculation has been completed for any two news items in the news data, the news in the news data can be divided into the news sets of different events. For example, news A and news B can be assigned to the news set of event 1, and news C, news D, news E, and news F can be assigned to the news set of event 2.
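One way to turn pairwise decisions into event news sets is sketched below. It applies the two thresholds to each pair and merges matching pairs with a union-find structure. The text does not say how pairwise results are combined, so the transitive merging is an assumption. The sketch also uses a non-strict comparison for the coincidence threshold, because the example values 2/3 and 1 could never be exceeded strictly; that is likewise an assumption. The pair scores are made-up illustration data.

```python
from itertools import combinations


def is_same_event(text_similarity, coincidence,
                  sim_threshold=0.8, coincidence_threshold=2 / 3):
    """Apply the two set conditions to one pair of news items."""
    return text_similarity > sim_threshold and coincidence >= coincidence_threshold


def group_by_event(news_ids, same_event):
    """Group news IDs; same_event(a, b) -> bool is supplied by the caller."""
    parent = {n: n for n in news_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(news_ids, 2):
        if same_event(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for n in news_ids:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


# Made-up (text similarity, element coincidence) scores for some pairs.
pair_scores = {
    ("A", "B"): (0.90, 2 / 3),
    ("C", "D"): (0.95, 1.0),
    ("D", "E"): (0.85, 1.0),
    ("E", "F"): (0.90, 2 / 3),
}
event_sets = group_by_event(
    ["A", "B", "C", "D", "E", "F"],
    lambda a, b: is_same_event(*pair_scores.get((a, b), (0.1, 0.0))),
)
print(event_sets)
```

With these scores, news A and B end up in one set and news C, D, E, and F in another, mirroring the example in the text.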
在一些可选的实施例中,从新闻集合中,确定发布间隔以及发布时间持续性满足设定条件的多个新闻时,可从新闻集合中截取得到n条新闻,并结合n条新闻进行具体分析。In some optional embodiments, from the news collection, when determining multiple news whose release interval and release time duration meet the set conditions, n pieces of news can be intercepted from the news collection, and combined with the n pieces of news for specific analyze.
可选地,可首先按照时间先后顺序,对新闻集合中的新闻的发布时间进行排序,得到发布时间序列。Optionally, the release time of the news in the news collection may be sorted first according to chronological order to obtain a release time sequence.
接下来,确定一滑动窗口,该滑动窗口的长度为时间跨度。为便于描述和区分,将该滑动窗口描述为第一滑动窗口。例如,第一滑动窗口的长度为1小时、2小时、24小时等等。Next, a sliding window is determined, and the length of the sliding window is the time span. For ease of description and distinction, this sliding window is described as the first sliding window. For example, the length of the first sliding window is 1 hour, 2 hours, 24 hours and so on.
接下来,采用第一滑动窗口,在该发布时间序列上滑动,得到多个时间窗口。其中,多个时间窗口具有相同的时间跨度,每个时间窗口内包含一个或者多个新闻。从滑动得到的多个时间窗口中,可确定新闻数量满足设定数量要求的时间窗口,作为目标时间窗口。可选地,该设定数量要求,可以为:数量最多,或者数量大于某一数量阈值,本实施例不做限制。例如,在一些实施例中,从滑动得到的多个时间窗口中,确定新闻数量最多的时间窗口,作为目标时间窗口。Next, the first sliding window is used to slide on the release time series to obtain multiple time windows. Wherein, multiple time windows have the same time span, and each time window contains one or more news. From the plurality of time windows obtained by sliding, the time window in which the quantity of news meets the set quantity requirement can be determined as the target time window. Optionally, the set quantity requirement may be: the maximum quantity, or a quantity greater than a certain quantity threshold, which is not limited in this embodiment. For example, in some embodiments, from the multiple time windows obtained by sliding, the time window with the largest number of news is determined as the target time window.
在一些实施例中,当滑动得到的时间窗口中,存在多个时间窗口具有相同的新闻数量时,可计算每个窗口中的新闻的平均时间间隔,并取平均时间间隔较小的时间窗口作为目标时间窗口,不再赘述。In some embodiments, when there are multiple time windows with the same number of news in the time window obtained by sliding, the average time interval of the news in each window can be calculated, and the time window with a smaller average time interval is taken as The target time window will not be repeated here.
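The selection of the target time window described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `mean_gap` and `target_time_window` are hypothetical, windows are assumed to start at each news item's release time, the window bound is assumed to be inclusive, and the "largest count" variant of the quantity requirement is used.

```python
from datetime import datetime, timedelta


def mean_gap(times):
    """Average gap in seconds between consecutive, already sorted timestamps."""
    if len(times) < 2:
        return float("inf")  # assumption: a window with fewer than 2 items never wins a tie
    return (times[-1] - times[0]).total_seconds() / (len(times) - 1)


def target_time_window(publish_times, span):
    """Return the release times inside the densest window of length `span`.

    Assumption: one candidate window starts at every release time.
    Ties on the count are broken by the smaller average interval.
    """
    times = sorted(publish_times)
    best = []
    right = 0
    for left in range(len(times)):
        right = max(right, left)
        # Extend the right edge while the item still falls inside the window.
        while right < len(times) and times[right] - times[left] <= span:
            right += 1
        window = times[left:right]
        if len(window) > len(best) or (
            len(window) == len(best) and mean_gap(window) < mean_gap(best)
        ):
            best = window
    return best


# Illustrative data only: six release times on an arbitrary day.
day = datetime(2021, 11, 23)
times = [day + timedelta(hours=10, minutes=m) for m in (0, 6, 7, 9, 30, 120)]
window = target_time_window(times, timedelta(hours=1))
```

With a one-hour span, the window starting at the first item covers the five items released between 10:00 and 10:30, while the item released two hours later falls outside it.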
After the target time window is determined, multiple news items whose release interval satisfies a set interval requirement can be extracted from the target time window. Optionally, the set interval requirement may be that the average time interval is the smallest, or that the average time interval is less than a certain time threshold; this embodiment does not limit this.
Optionally, a sliding window whose length is a set number of news items may be determined. For ease of distinction, it is referred to as the second sliding window.
Next, the second sliding window may be slid within the target time window to obtain multiple sub-windows. Each sub-window contains the same number of news items, but the time intervals may differ. For example, assuming the window length of the second sliding window is m and the target time window contains n news items, the second sliding window can be slid over the n news items; at each position it selects m of the n news items, and the m news items have different release times.
Next, the average interval between the news items in each sub-window can be calculated, and a target sub-window determined from the sub-windows according to that average interval. The average interval of the news items in the target sub-window satisfies the set interval requirement; for example, it is the smallest among the sub-windows.
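The second sliding window can be sketched as below. This is a hedged sketch: the name `target_sub_window` is hypothetical, release times are given as minutes after an arbitrary reference time, and the "smallest average interval" variant of the interval requirement is assumed. Because every sub-window holds the same number of items, the sub-window with the smallest average interval is also the one with the smallest span between its first and last item.

```python
def target_sub_window(times, m):
    """Return the m consecutive release times with the smallest average gap.

    `times` may be numbers (e.g. minutes) or datetimes; `m` is the length of
    the second sliding window, counted in news items.
    """
    times = sorted(times)
    if m < 2 or m > len(times):
        # Assumption: at least two items are needed to define an interval.
        raise ValueError("m must be between 2 and the number of news items")
    # Average gap = (last - first) / (m - 1); m is fixed, so compare spans.
    start = min(
        range(len(times) - m + 1),
        key=lambda s: times[s + m - 1] - times[s],
    )
    return times[start:start + m]


# Release times in minutes after 10:00, matching the five-item example
# used later in this description (10:00, 10:06, 10:07, 10:09, 10:30).
sub_window = target_sub_window([0, 6, 7, 9, 30], 3)
```

For these five items and m = 3, the candidate sub-windows span 7, 3, and 23 minutes, so the sub-window at 10:06, 10:07, and 10:09 is selected.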
Based on the above embodiments, when calculating the news popularity of an event, the time interval between each news item in the target sub-window and the adjacent preceding news item may be determined.
Next, the time interval is used as the exponent of a specified base to calculate an exponential term for each news item, and the exponential terms are weighted by the respective popularity weights of the news items to obtain a weighted score. The ratio of the weighted score to the length of the second sliding window is then calculated as the news popularity of the event. The specified base may be any constant, such as 2, 3, or 4; this embodiment does not limit this.
In some embodiments, the specified base may be e (approximately 2.7182818284). The calculation of the news popularity H may follow the formula below:
Figure PCTCN2021132517-appb-000001
In Formula 1, n denotes the number of news items in the target time window, m denotes the set of news items in the target sub-window, and |m| denotes the length of the second sliding window. i denotes the i-th news item in the target sub-window, and α_i denotes the popularity weight of the i-th news item; news published by mainstream news websites usually has a higher weight. Interver_i denotes the time interval between the i-th news item and the preceding one, Interver_i = T_i - T_{i-1}, where T_i is the release time of the i-th news item and i = 2, 3, …, |m|; that is, the first news item in set m does not take part in the calculation.
The above way of calculating news popularity is illustrated below with a concrete example.
Suppose the target time window obtained with the first sliding window contains five news items a1, a2, a3, a4, and a5, released at 10:00, 10:06, 10:07, 10:09, and 10:30 respectively, with source-organization weights 1, 2, 3, 4, and 5 respectively. Suppose the length of the second sliding window is 3, that is, |m| = 3. Sliding the second sliding window within the target time window, calculating the time intervals of the news items in each sub-window, and selecting the sub-window with the smallest average time interval gives m = {a2, a3, a4}, since these three news items have the smallest average time interval. The news popularity is then:
Figure PCTCN2021132517-appb-000002
The above news popularity calculation method jointly considers the release interval of the news, the continuity of the release times, and the publishing organization, making full use of the multi-dimensional information of the news to calculate an accurate news popularity. After the news popularity is calculated, news events may be ranked by popularity, or news events with higher popularity may be recommended to users; this embodiment does not limit this.
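The popularity calculation described above can be sketched as follows. The formula images are not reproduced in this text, so several choices here are assumptions rather than the published formula: the exponent is taken as the negative of the interval (so that shorter intervals give higher popularity), intervals are measured in minutes, and the function name `news_heat` and its `sign` parameter are hypothetical.

```python
import math


def news_heat(times_min, weights, base=math.e, sign=-1.0):
    """Weighted exponential score divided by the sub-window length.

    times_min: release times of the news in the target sub-window, in minutes.
    weights:   popularity weight of each news item, in the same order.
    sign:      assumption; -1.0 makes short intervals contribute more.
    """
    if len(times_min) != len(weights) or len(times_min) < 2:
        raise ValueError("need matching lists with at least two news items")
    score = 0.0
    # The first item has no predecessor, so it does not enter the sum.
    for i in range(1, len(times_min)):
        interval = times_min[i] - times_min[i - 1]
        score += weights[i] * base ** (sign * interval)
    return score / len(times_min)


# Sub-window {a2, a3, a4}: released at 10:06, 10:07, 10:09, weights 2, 3, 4.
heat = news_heat([6, 7, 9], [2, 3, 4])
```

Under these assumptions, a3 contributes 3·e^(-1) and a4 contributes 4·e^(-2), and the sum is divided by |m| = 3. Passing `sign=1.0` gives the literal "interval as exponent" reading instead.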
The foregoing embodiments describe calculating the similarity between the first news text and the second news text from the body similarity and the title similarity. This implementation is described in further detail below.
A news text is a text that reports or comments on an event, and is usually published in magazines, newspapers, and on websites. When there is a massive number of news texts, similarity recognition can be performed on them, and similar news texts can be classified, deduplicated, and so on. When performing similarity recognition on massive news texts, the similarity between any two news texts can be calculated.
In the embodiments of the present application, for ease of description and distinction, any two news texts to be compared are referred to as the first news text and the second news text.
News texts have certain data characteristics. A news text usually includes at least two parts, a title and a body. The title is a general summary or evaluation of the body, so whether an item is a brief or a long-form article, two items reporting the same content usually have highly similar titles. In this embodiment, to reduce the influence of text length differences on the similarity, the similarity between news items is split into two parts: the similarity between titles and the similarity between bodies.
In this embodiment, for ease of description and distinction, the title and body of the first news text are referred to as the first title and the first body, and the title and body of the second news text as the second title and the second body.
Based on the text of the first title and the second title, the similarity between the two titles can be calculated; based on the text of the first body and the second body, the similarity between the two bodies can be calculated. When calculating similarity from text, the literal similarity of the text may be calculated; this is described in detail in subsequent embodiments and is not repeated here. For ease of description and distinction, the similarity between titles is referred to as the title similarity, and the similarity between bodies as the body similarity.
After the title similarity and the body similarity are obtained, they are fused to obtain the similarity between the first news text and the second news text. The fusion may be performed arithmetically. For example, the average of the title similarity and the body similarity may be used as the similarity between the two news texts; as another example, their product may be used; as yet another example, their sum may be used.
In some exemplary embodiments, considering how much the title and the body each contribute to the news content, preset weight coefficients may be set for the title and the body, and the title similarity and body similarity are summed with these weights to obtain the similarity between the first news text and the second news text. Assuming the weight coefficient of the title is w1, the weight coefficient of the body is w2, the title similarity is S1, and the body similarity is S2, the similarity between the first news text and the second news text is S = w1*S1 + w2*S2, where the values of w1 and w2 may be empirical values; this embodiment does not limit this.
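The weighted fusion S = w1*S1 + w2*S2 can be illustrated with a short sketch. The function name `fuse_similarity` and the default weights of 0.5 are placeholders chosen for illustration; the description only says that w1 and w2 are empirical values.

```python
def fuse_similarity(title_sim, body_sim, w1=0.5, w2=0.5):
    """Weighted sum of title similarity S1 and body similarity S2.

    w1 and w2 are illustrative defaults, not values from the description.
    """
    return w1 * title_sim + w2 * body_sim


# Example: a title weighted slightly more heavily than the body.
similarity = fuse_similarity(0.8, 0.4, w1=0.6, w2=0.4)
```

With equal weights of 0.5 the weighted sum reduces to the plain average mentioned as one of the arithmetic fusion options.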
In this embodiment, when calculating the similarity of news items, the title and the body are processed separately: the title similarity is calculated from the text of the titles, and the body similarity from the text of the bodies. This reduces, to a certain extent, the influence of text length differences on the similarity and helps produce a more accurate result. In addition, fusing the title similarity and the body similarity yields the news similarity quickly, reducing the time and computation needed to identify similar news and improving identification efficiency.
The above embodiments describe processing the title and the body of the news separately. Optional implementations for calculating the title similarity and the body similarity are described further below.
Optionally, as shown in Figure 3, after the first news item and the second news item are taken as input data, it may first be detected whether the input text is a title. If it is, the title processing branch is entered, that is, Embodiment 1 is executed; if it is not, the body processing branch is entered, that is, Embodiment 2 is executed.
Embodiment 1: calculate the title similarity between the first title and the second title according to their respective texts.
Optionally, a keyword extraction operation may be performed on the first title and the second title to obtain the set of keywords contained in the first title and the set of keywords contained in the second title. The former may be referred to as the first title term set, and the latter as the second title term set.
The keyword extraction operation may include extracting terms corresponding to entities, terms whose part of speech is noun, and/or terms whose part of speech is verb. That is, such terms are extracted from the first title to obtain the first title term set, and from the second title to obtain the second title term set.
An entity is a real-world thing that appears in the text corpus. An entity is a concrete thing, and may be a single thing or a collection of things, for example a person's name, a place, or an organization.
Next, the number of shared title terms in the first title term set and the second title term set can be calculated. A shared title term is a term that appears in both the first title term set and the second title term set. When a shared title term appears several times in a title, it is counted only once, regardless of how often it is repeated.
Next, the title similarity can be determined from the ratio of the number of shared title terms to the total number of terms contained in the first title term set and the second title term set. The calculation may follow the formula below:
Figure PCTCN2021132517-appb-000003
Here, A denotes the first title term set and |A| its cardinality, that is, the number of elements in set A; B denotes the second title term set and |B| its cardinality, that is, the number of elements in set B. i denotes the i-th term in set A. When the i-th term in set A also belongs to set B, f(i, B) = 1, meaning the i-th term is a shared title term of sets A and B. When the i-th term in set A does not belong to set B, f(i, B) = 0, meaning it is not a shared title term. The coefficient 2 in the numerator ensures that the maximum value of the similarity result S1 is 1. The title similarity of two news items can thus be calculated from Formula 2.
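The title similarity described above, twice the number of shared terms divided by |A| + |B|, can be sketched as below. The function name `title_similarity` is hypothetical, the keyword extraction step is assumed to have been done already, and returning 0.0 for two empty term sets is an assumption the description does not cover.

```python
def title_similarity(terms_a, terms_b):
    """2 * |A ∩ B| / (|A| + |B|) over the two title term sets.

    Converting to sets means a repeated term is counted only once.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # assumption: no keywords at all means no evidence of similarity
    return 2 * len(a & b) / (len(a) + len(b))


# Illustrative keyword lists (already extracted from two titles).
s1 = title_similarity(
    ["bank", "cuts", "rates"],
    ["bank", "cuts", "markets", "react"],
)
```

Here two terms are shared, |A| = 3 and |B| = 4, so the result is 4/7; two identical term sets give exactly 1.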
Embodiment 2: calculate the body similarity between the first body and the second body according to their respective texts.
Optionally, word segmentation may be performed on the first body and the second body to obtain their respective term sets. The term set of the first body may be referred to as the first body term set, and the term set of the second body as the second body term set.
Word segmentation refers to splitting sentences and paragraphs into the terms, single characters, and so on that they contain. In some embodiments, to save storage space and improve subsequent processing efficiency, stop words may be removed from the segmentation result, as shown in Figure 3. Stop words are function words without substantive meaning, such as "的", "在", and "是".
After the first body term set and the second body term set are obtained, the shared body terms and the differing body terms of the two sets can be obtained. The intersection of the first body term set and the second body term set gives the shared body terms; the remaining terms in the first body term set and the second body term set are the differing body terms.
For the shared body terms, their frequency of occurrence in the first body term set is calculated to obtain a first frequency, and their frequency of occurrence in the second body term set is calculated to obtain a second frequency. When there are several shared body terms, their frequencies in the first body term set are summed to obtain the first frequency, and their frequencies in the second body term set are summed to obtain the second frequency.
For the differing body terms, their frequency of occurrence in the first body term set is calculated to obtain a third frequency, and their frequency of occurrence in the second body term set is calculated to obtain a fourth frequency. When there are several differing body terms, their frequencies in the first body term set are summed to obtain the third frequency, and their frequencies in the second body term set are summed to obtain the fourth frequency.
In general, if two news items report the same event, their contents are very likely to be highly similar. However, if two similar news items differ in length, the calculated similarity comes out low, which does not match the actual situation.
To reduce the influence of text length on the similarity, in some exemplary embodiments a similarity penalty term associated with text length may be added when calculating the body similarity. The similarity penalty term may be calculated from the respective text lengths of the first body and the second body.
In some optional embodiments, when calculating the similarity penalty term, the absolute value of the text length difference between the first body and the second body is calculated. If this absolute value is greater than or equal to a set threshold, the product of the absolute value and a set coefficient α is used as the similarity penalty term. If the absolute value is less than the set threshold, a small fixed value, which may be 0, is used as the similarity penalty term. The calculation may follow the formula below:
Figure PCTCN2021132517-appb-000004
In Formula 3, La denotes the text length of the first body, Lb denotes the text length of the second body, and γ is the preset threshold. La may be represented by the number of elements in the first body term set, and Lb by the number of elements in the second body term set. α denotes the coefficient of the penalty term, and the values of α and γ are empirical. The value of α is positively correlated with the absolute value of the text length difference: the larger the difference in text length, the larger the value of α, which increases the effect of text length on the similarity result.
When, among the massive news texts in the database, the text length difference between the longest and the shortest news text is in the hundreds of characters, γ may take a value on the order of hundreds; when the difference is in the thousands of characters, γ may take a value on the order of thousands. For example, if the shortest news item in the database has only 200 characters and the longest has 2000 characters, γ may be on the order of thousands.
α may be determined from the actual text length difference. If the difference is large, a larger value may be chosen for α; if the difference is small, a smaller value may be chosen, so as to reduce as far as possible the effect of text length differences on the similarity calculation. For example, α may be 0.01, 0.05, 0.1, and so on.
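The piecewise penalty described above can be sketched as follows. The function name `length_penalty` is hypothetical, and the defaults (α = 0.05, γ = 1000, fixed value 0) are only examples taken from the ranges the description mentions, not prescribed settings.

```python
def length_penalty(len_a, len_b, alpha=0.05, gamma=1000, small=0.0):
    """Similarity penalty term F from the two body lengths.

    Returns alpha * |La - Lb| when the difference reaches the threshold
    gamma, otherwise the small fixed value (0 by default).
    """
    diff = abs(len_a - len_b)
    return alpha * diff if diff >= gamma else small


# Shortest and longest items from the 200 vs. 2000 character example.
penalty = length_penalty(200, 2000)
```

For lengths 200 and 2000 the difference of 1800 exceeds the threshold, so the penalty is 0.05 × 1800; for lengths 500 and 600 the difference stays below the threshold and the penalty is the fixed value 0.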
Next, the body similarity can be calculated from the first frequency, the second frequency, the third frequency, the fourth frequency, and the similarity penalty term.
In some exemplary embodiments, the smaller of the first frequency and the second frequency is determined, and the first, second, third, and fourth frequencies are summed to obtain a total frequency.
The similarity penalty term may be added to the total frequency to update it. The body similarity is then obtained from the ratio of the smaller frequency to the updated total frequency.
The calculation may follow the formula below:
Figure PCTCN2021132517-appb-000005
In Formula 4, N denotes the set of shared body terms and i the i-th shared body term; M denotes the set of differing body terms and j the j-th differing body term; a denotes the first body term set and b the second body term set. min() is the function that takes the minimum, and count() is the function that counts term frequency. F denotes the similarity penalty term. The coefficient 2 in the numerator ensures that the maximum value of the similarity result S2 is 1, and min() reduces the effect on the similarity of terms that occur frequently in long texts.
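The body similarity can be sketched as below, under stated assumptions: the formula image is not reproduced in this text, so the sketch takes min() per shared term and sums the results, and it adds the penalty F to the total frequency in the denominator. The function name `body_similarity` is hypothetical, and the inputs are assumed to be token lists after segmentation and stop-word removal.

```python
from collections import Counter


def body_similarity(tokens_a, tokens_b, penalty=0.0):
    """2 * sum of per-term min counts / (total term count + penalty F).

    Assumption: min() is applied to each shared term separately.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    shared = count_a.keys() & count_b.keys()
    overlap = sum(min(count_a[t], count_b[t]) for t in shared)
    # First + second + third + fourth frequency: every occurrence of every
    # term, shared or differing, in both bodies.
    total = sum(count_a.values()) + sum(count_b.values())
    denominator = total + penalty
    return 2 * overlap / denominator if denominator > 0 else 0.0


# Illustrative token lists; "x" is shared, "y" and "z" are differing terms.
s2 = body_similarity(["x", "x", "y"], ["x", "z"])
s2_penalized = body_similarity(["x", "x", "y"], ["x", "z"], penalty=5.0)
```

In this example the shared term contributes min(2, 1) = 1 and the total frequency is 5, so the unpenalized result is 2/5; a penalty of 5 raises the denominator to 10 and halves the result. Two identical token lists with no penalty give exactly 1.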
Based on the above implementations, when calculating the similarity of news items, processing the title and the body separately reduces, to a certain extent, the influence of text length differences on the similarity and helps produce a more accurate result. In addition, a penalty term related to text length is added to the similarity calculation; when the two news items being compared differ greatly in length, this further reduces the influence of text length on the similarity calculation and improves the accuracy of the literal similarity.
It should be noted that the steps of the method provided in the above embodiments may all be executed by the same device, or the method may be executed by different devices. For example, steps 201 to 204 may be executed by device A; as another example, steps 201 and 202 may be executed by device A and step 203 by device B; and so on.
In addition, some of the flows described in the above embodiments and drawings include multiple operations that appear in a specific order. It should be clearly understood that these operations may be executed out of the order in which they appear herein, or in parallel. Operation numbers such as 201 and 202 are used only to distinguish the operations from one another and do not themselves represent any execution order. These flows may also include more or fewer operations, which may be executed sequentially or in parallel. It should be noted that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules, and so on; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.
Figure 4 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application. The electronic device is suitable for executing the news popularity calculation method provided by the foregoing embodiments. As shown in Figure 4, the electronic device includes a memory 401, a processor 402, and a communication component 403.
The memory 401 is used to store a computer program and may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so on.
The memory 401 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 402 is coupled to the memory 401 and executes the computer program in the memory 401 in order to: obtain, through the communication component 403, the news set corresponding to an event; determine, from the news set, multiple news items whose release interval and release-time continuity satisfy the set conditions; determine the respective popularity weights of the multiple news items according to their publishing organizations; and calculate the news popularity of the event according to the respective popularity weights of the multiple news items, the release interval, and the release-time continuity.
进一步可选地,处理器402在从所述新闻集合中,确定发布间隔以及发布时间持续性满足设定条件的多个新闻时,具体用于:按照时间先后顺序,对所述新闻集合中的新闻的发布时间进行排序,得到发布时间序列;采用第一滑动窗口,在所述发布时间序列上滑动,得到多个时间窗口;所述第一滑动窗口的窗口长度为设定的时间跨度;从所述多个时间窗口中,确定新闻数量满足设定数量要求的时间窗口,作为目标时间窗口;从所述目标时间标窗口中,截取发布间隔满足设定间隔要求的所述多个新闻。Further optionally, when the processor 402 determines from the news collection a plurality of news whose publishing interval and publishing time duration meet the set conditions, it is specifically configured to: sort the news in the news collection in chronological order The release time of the news is sorted to obtain the release time sequence; the first sliding window is used to slide on the release time sequence to obtain multiple time windows; the window length of the first sliding window is a set time span; from Among the multiple time windows, determine the time window in which the number of news meets the set quantity requirement as the target time window; from the target time stamp window, intercept the multiple news whose release interval meets the set interval requirement.
Further optionally, when extracting, from the target time window, the plurality of news items whose release interval satisfies the set interval requirement, the processor 402 is specifically configured to: slide a second sliding window within the target time window to obtain a plurality of sub-windows, the length of the second sliding window being a set number of news items; calculate the average interval between the news items contained in each of the plurality of sub-windows; and determine, according to the average interval, a target sub-window from the plurality of sub-windows, the average interval of the news items contained in the target sub-window satisfying the set interval requirement.
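A minimal sketch of the second, count-based window follows. It assumes release times are given as numbers (for example, hours since some reference point) and that the "set interval requirement" means an average gap no larger than a threshold; when several sub-windows qualify, the sketch keeps the densest one. The function name and both of those choices are assumptions, not details taken from the text.

```python
def find_target_subwindow(times, size, max_avg_gap):
    """Slide a window of `size` items over sorted times (size must be >= 2).

    Returns (start_index, average_gap) for the qualifying sub-window with the
    smallest average gap, or None if no sub-window meets the requirement.
    """
    best = None
    for start in range(len(times) - size + 1):
        sub = times[start:start + size]
        # The mean of consecutive gaps equals (last - first) / (count - 1).
        avg_gap = (sub[-1] - sub[0]) / (size - 1)
        if avg_gap <= max_avg_gap and (best is None or avg_gap < best[1]):
            best = (start, avg_gap)
    return best


hours = [0, 1, 2, 10, 11]  # hypothetical release times in hours
print(find_target_subwindow(hours, size=3, max_avg_gap=2))  # (0, 1.0)
```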
Further optionally, when calculating the news popularity of the event according to the popularity weights of the plurality of news items and the time intervals between them, the processor 402 is specifically configured to: determine the time interval between each news item in the target sub-window and the adjacent preceding news item; calculate an exponential term for each news item by using that time interval as the exponent of a specified base; weight the exponential term of each news item by its popularity weight to obtain a weighted score; and calculate the ratio of the weighted score to the length of the second sliding window as the news popularity of the event.
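The calculation above can be written as a short function. The sketch below is hedged on several points that this passage leaves open: the base of 0.5 is a hypothetical value (a base between 0 and 1 makes closely spaced reports contribute more), time is measured in hours, the first item in the sub-window is given an interval of 0, and the weighted score is taken to be the sum of the weighted exponential terms.

```python
def event_popularity(times, weights, base=0.5):
    """times: sorted release times (hours) of the news in the target sub-window.
    weights: popularity weight of each news item, e.g. derived from its publisher.
    """
    if not times or len(times) != len(weights):
        raise ValueError("times and weights must be non-empty and of equal length")
    score = 0.0
    previous = times[0]
    for t, w in zip(times, weights):
        gap = t - previous          # interval to the preceding item (0 for the first)
        score += w * base ** gap    # weighted exponential term
        previous = t
    # The sub-window length is the number of items it contains.
    return score / len(times)


# Hypothetical weights: 1.0 for a major outlet, 0.5 for a smaller one.
print(event_popularity([0, 1, 2], [1.0, 0.5, 1.0]))
```

With these inputs the three terms are 1.0, 0.25, and 0.5, so the result is 1.75 / 3.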
Further optionally, when obtaining the news set corresponding to the event, the processor 402 is specifically configured to: collect news data; for a first news text and a second news text in the news data, calculate the text similarity between the first news text and the second news text; analyze the overlap between the news elements of the first news text and the news elements of the second news text to obtain a news element overlap; and, if the text similarity and the news element overlap satisfy set conditions, assign the first news text and the second news text to the news set of the same event.
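As a rough illustration of the grouping rule, the sketch below treats the "set conditions" as two lower thresholds and assigns each text to the first existing group whose first member it matches. The threshold values, the greedy strategy, and the `score_pair` callback (which stands in for the similarity and overlap computations) are all assumptions made for the example.

```python
def same_event(text_similarity, element_overlap, min_similarity=0.6, min_overlap=1.5):
    """Hypothetical condition: both scores must reach their thresholds."""
    return text_similarity >= min_similarity and element_overlap >= min_overlap


def group_by_event(texts, score_pair):
    """score_pair(a, b) returns (text_similarity, element_overlap) for two texts."""
    groups = []
    for text in texts:
        for group in groups:
            similarity, overlap = score_pair(group[0], text)
            if same_event(similarity, overlap):
                group.append(text)
                break
        else:
            # No existing group matched, so this text starts a new event.
            groups.append([text])
    return groups


# Toy scorer: texts sharing a first letter are treated as near-duplicates.
def toy_scores(a, b):
    return (0.9, 2.0) if a[0] == b[0] else (0.1, 0.0)


print(group_by_event(["q1", "q2", "f1"], toy_scores))  # [['q1', 'q2'], ['f1']]
```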
Further optionally, when calculating the text similarity between the first news and the second news, the processor 402 is specifically configured to: determine a first title and a first body contained in the first news text, and a second title and a second body contained in the second news text; calculate the title similarity between the first title and the second title according to their respective texts; calculate the body similarity between the first body and the second body according to their respective texts; and fuse the title similarity and the body similarity to obtain the similarity between the first news text and the second news text.
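One simple way to fuse the two scores is a weighted average. The sketch below assumes that form; the weight of 0.4 on the title is a hypothetical value, and the Jaccard function is included only as a stand-in for a title similarity measure, which this passage does not specify.

```python
def jaccard(tokens_a, tokens_b):
    """Stand-in similarity: shared tokens divided by all distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuse_similarity(title_similarity, body_similarity, title_weight=0.4):
    """Weighted average of title and body similarity (assumed fusion rule)."""
    return title_weight * title_similarity + (1 - title_weight) * body_similarity


title_sim = jaccard(["quake", "hits", "city"], ["quake", "strikes", "city"])
print(title_sim)                         # 0.5
print(fuse_similarity(title_sim, 0.8))
```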
Further optionally, when calculating the body similarity between the first body and the second body according to their respective texts, the processor 402 is specifically configured to: perform word segmentation on the first body and the second body to obtain a first body term set and a second body term set; determine the intersection of the first body term set and the second body term set to obtain shared body terms; determine the terms in the first body term set and the second body term set other than the shared body terms as differing body terms; calculate the occurrence frequencies of the shared body terms in the first body term set and in the second body term set to obtain a first occurrence frequency and a second occurrence frequency; calculate the occurrence frequencies of the differing body terms in the first body term set and in the second body term set to obtain a third occurrence frequency and a fourth occurrence frequency; calculate a similarity penalty term according to the text lengths of the first body and the second body; and calculate the body similarity according to the first occurrence frequency, the second occurrence frequency, the third occurrence frequency, the fourth occurrence frequency, and the similarity penalty term.
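This passage names the inputs to the body similarity but not the formula that combines them, so the following is only one plausible combination, not the formula of the application: the share of term occurrences that belong to shared terms, scaled by a length-ratio penalty. The sketch also assumes the bodies have already been segmented into token lists (Chinese text would need a word segmenter first).

```python
from collections import Counter


def body_similarity(tokens_a, tokens_b):
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    shared = set(count_a) & set(count_b)
    f1 = sum(count_a[t] for t in shared)                        # shared terms, first body
    f2 = sum(count_b[t] for t in shared)                        # shared terms, second body
    f3 = sum(c for t, c in count_a.items() if t not in shared)  # differing terms, first body
    f4 = sum(c for t, c in count_b.items() if t not in shared)  # differing terms, second body
    total = f1 + f2 + f3 + f4
    if total == 0:
        return 0.0
    # Assumed penalty: ratio of the shorter length to the longer length.
    penalty = min(len(tokens_a), len(tokens_b)) / max(len(tokens_a), len(tokens_b))
    return (f1 + f2) / total * penalty


a = ["quake", "hits", "city", "quake"]
b = ["quake", "city", "rescue"]
print(body_similarity(a, b))
```

For these inputs the four frequencies are 3, 2, 1, and 1 and the penalty is 3/4, so the result is 5/7 × 0.75.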
Further optionally, when analyzing the overlap between the news elements of the first news text and the news elements of the second news text to obtain the news element overlap, the processor 402 is specifically configured to: extract time elements, location elements, and subject elements from the first news text and from the second news text; calculate the overlap of the time elements, the overlap of the location elements, and the overlap of the subject elements between the first news text and the second news text; and take the sum of the time element overlap, the location element overlap, and the subject element overlap as the news element overlap.
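The summation can be sketched as follows, assuming the elements have already been extracted into sets and that each per-element overlap is a Jaccard ratio. The measure for the individual overlaps is not stated in this passage, so that choice and the dictionary keys are assumptions.

```python
def element_overlap(elements_a, elements_b):
    """Sum of the time, place, and subject overlaps between two news texts."""
    total = 0.0
    for key in ("time", "place", "subject"):
        a = set(elements_a.get(key, ()))
        b = set(elements_b.get(key, ()))
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total


# Hypothetical extracted elements for two reports.
first = {"time": {"2021-06-25"}, "place": {"Beijing"}, "subject": {"A", "B"}}
second = {"time": {"2021-06-25"}, "place": {"Shanghai"}, "subject": {"A"}}
print(element_overlap(first, second))  # 1.5
```

Here the time elements match fully (1.0), the places do not match (0.0), and half of the subjects are shared (0.5).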
An embodiment of the present application further provides an electronic device comprising a memory and a processor. The memory is configured to store one or more computer instructions, and the processor is configured to execute the one or more computer instructions in order to perform the steps of the method provided by the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method provided by the embodiments of the present application.
In the news popularity calculation method provided by the embodiments of the present application, after the news set of an event is obtained, a plurality of news items whose release interval and release-time persistence satisfy the set requirements are selected from it, and the popularity weight of each news item is determined according to its publishing organization. Considering the release interval, the release-time persistence, and the publishing organization together makes full use of the multi-dimensional information of the news and thereby yields an accurate news popularity.
Further, as shown in FIG. 4, the electronic device also includes other components such as a display component 404, a power supply component 405, and an audio component 406. FIG. 4 shows only some of the components schematically, which does not mean that the electronic device includes only the components shown in FIG. 4.
The communication component 403 is configured to facilitate wired or wireless communication between the device in which it is located and other devices. That device may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component may be implemented based on near field communication (NFC) technology, radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display component 404 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.
The power supply component 405 provides power to the various components of the device in which it is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for that device.
The audio component 406 may be configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC) that is configured to receive external audio signals when the device in which the audio component is located is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory or sent via the communication component. In some embodiments, the audio component also includes a speaker for outputting audio signals.
In this embodiment, after the news set of an event is obtained, a plurality of news items whose release interval and release-time persistence satisfy the set requirements are selected from it, and the popularity weight of each news item is determined according to its publishing organization. Considering the release interval, the release-time persistence, and the publishing organization together makes full use of the multi-dimensional information of the news and thereby yields an accurate news popularity.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed, implements the steps of the above method embodiments that can be performed by the electronic device.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the electronic device according to embodiments of the present invention. The present invention may also be implemented as programs/instructions of a device or apparatus (for example, computer programs/instructions and computer program products) for performing part or all of the methods described herein. Such programs/instructions implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
FIG. 5 schematically shows a computer apparatus/device/system that can implement the news popularity calculation method according to the present invention. The computer apparatus/device/system includes a processor 510 and a computer-readable medium in the form of a memory 520. The memory 520 is one example of a computer-readable medium and has a storage space 530 for storing computer programs/instructions 531. When the computer programs/instructions 531 are executed by the processor 510, the steps of the news popularity calculation method described above can be implemented.
FIG. 6 schematically shows a block diagram of a computer program product implementing the method according to the present invention. The computer program product includes computer programs/instructions 610. When the computer programs/instructions 610 are executed by a processor such as the processor 510 shown in FIG. 5, the steps of the news popularity calculation method described above can be implemented.
Specific embodiments of this specification have been described above; they, together with other embodiments, fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
It should be understood that the embodiments described above are only intended to illustrate the present invention and not to limit it. Those skilled in the art may implement the present invention in other ways without departing from its basic spirit and characteristics. The scope of the present invention is defined by the appended claims, and any modification, equivalent replacement, or improvement made within the spirit and principles of one or more embodiments of this specification shall fall within that scope.

Claims (12)

  1. A news popularity calculation method, characterized by comprising:
    obtaining a news set corresponding to an event;
    determining, from the news set, a plurality of news items whose release interval and release-time persistence satisfy set conditions;
    determining a popularity weight for each of the plurality of news items according to the publishing organization corresponding to the plurality of news items;
    calculating the news popularity corresponding to the event according to the popularity weights of the plurality of news items, the release interval, and the release-time persistence.
  2. The method according to claim 1, characterized in that determining, from the news set, a plurality of news items whose release interval and release-time persistence satisfy set conditions comprises:
    sorting the release times of the news items in the news set in chronological order to obtain a release time sequence;
    sliding a first sliding window over the release time sequence to obtain a plurality of time windows, the window length of the first sliding window being a set time span;
    determining, from the plurality of time windows, a time window whose number of news items satisfies a set quantity requirement as a target time window;
    extracting, from the target time window, the plurality of news items whose release interval satisfies a set interval requirement.
  3. The method according to claim 2, characterized in that extracting, from the target time window, the plurality of news items whose release interval satisfies the set interval requirement comprises:
    sliding a second sliding window within the target time window to obtain a plurality of sub-windows, the length of the second sliding window being a set number of news items;
    calculating the average interval between the news items contained in each of the plurality of sub-windows;
    determining, according to the average interval, a target sub-window from the plurality of sub-windows, the average interval of the news items contained in the target sub-window satisfying the set interval requirement.
  4. The method according to claim 3, characterized in that calculating the news popularity corresponding to the event according to the popularity weights of the plurality of news items and the time intervals between the plurality of news items comprises:
    determining the time interval between each news item in the target sub-window and the adjacent preceding news item;
    calculating an exponential term for each news item by using the time interval as the exponent of a specified base;
    weighting the exponential term of each news item by its popularity weight to obtain a weighted score;
    calculating the ratio of the weighted score to the length of the second sliding window as the news popularity corresponding to the event.
  5. The method according to any one of claims 1-4, characterized in that obtaining the news set corresponding to the event comprises:
    collecting news data;
    for a first news text and a second news text in the news data, calculating the text similarity between the first news text and the second news text;
    analyzing the overlap between the news elements of the first news text and the news elements of the second news text to obtain a news element overlap;
    if the text similarity and the news element overlap satisfy set conditions, assigning the first news text and the second news text to the news set of the same event.
  6. The method according to claim 5, characterized in that calculating the text similarity between the first news and the second news comprises:
    determining a first title and a first body contained in the first news text, and a second title and a second body contained in the second news text;
    calculating the title similarity between the first title and the second title according to the texts respectively corresponding to the first title and the second title;
    calculating the body similarity between the first body and the second body according to the texts respectively corresponding to the first body and the second body;
    fusing the title similarity and the body similarity to obtain the similarity between the first news text and the second news text.
  7. The method according to claim 6, characterized in that calculating the body similarity between the first body and the second body according to the texts respectively corresponding to the first body and the second body comprises:
    performing word segmentation on the first body and the second body to obtain a first body term set and a second body term set;
    determining the intersection of the first body term set and the second body term set to obtain shared body terms;
    determining the terms in the first body term set and the second body term set other than the shared body terms as differing body terms;
    calculating the occurrence frequencies of the shared body terms in the first body term set and in the second body term set, respectively, to obtain a first occurrence frequency and a second occurrence frequency;
    calculating the occurrence frequencies of the differing body terms in the first body term set and in the second body term set, respectively, to obtain a third occurrence frequency and a fourth occurrence frequency;
    calculating a similarity penalty term according to the respective text lengths of the first body and the second body;
    calculating the body similarity according to the first occurrence frequency, the second occurrence frequency, the third occurrence frequency, the fourth occurrence frequency, and the similarity penalty term.
  8. The method according to claim 5, characterized in that analyzing the overlap between the news elements of the first news text and the news elements of the second news text to obtain the news element overlap comprises:
    extracting time elements, location elements, and subject elements from the first news text and from the second news text, respectively;
    calculating the overlap of the time elements, the overlap of the location elements, and the overlap of the subject elements between the first news text and the second news text;
    taking the sum of the overlap of the time elements, the overlap of the location elements, and the overlap of the subject elements as the news element overlap.
  9. An electronic device, characterized by comprising: a memory and a processor;
    the memory being configured to store one or more computer instructions;
    the processor being configured to execute the one or more computer instructions in order to perform the steps of the method according to any one of claims 1-8.
  10. A computer apparatus/device/system, comprising a memory, a processor, and computer programs/instructions stored on the memory, wherein the processor, when executing the computer programs/instructions, implements the steps of the news popularity calculation method according to any one of claims 1-8.
  11. A computer-readable medium on which computer programs/instructions are stored, wherein the computer programs/instructions, when executed by a processor, implement the steps of the news popularity calculation method according to any one of claims 1-8.
  12. A computer program product, comprising computer programs/instructions, wherein the computer programs/instructions, when executed by a processor, implement the steps of the news popularity calculation method according to any one of claims 1-8.
PCT/CN2021/132517 2021-06-25 2021-11-23 News popularity calculation method, device and storage medium WO2022267325A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110711197.XA CN113449077B (en) 2021-06-25 2021-06-25 News heat calculation method, device and storage medium
CN202110711197.X 2021-06-25

Publications (1)

Publication Number Publication Date
WO2022267325A1 true WO2022267325A1 (en) 2022-12-29

Family

ID=77812889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132517 WO2022267325A1 (en) 2021-06-25 2021-11-23 News popularity calculation method, device and storage medium

Country Status (2)

Country Link
CN (1) CN113449077B (en)
WO (1) WO2022267325A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449077B (en) * 2021-06-25 2024-04-05 完美世界控股集团有限公司 News heat calculation method, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150317648A1 (en) * 2012-02-24 2015-11-05 Strategic Communication Advisors, Llc System and method for assessing and ranking newsworthiness
CN107644010A (en) * 2016-07-20 2018-01-30 阿里巴巴集团控股有限公司 A kind of Text similarity computing method and device
CN107784010A (en) * 2016-08-29 2018-03-09 上海掌门科技有限公司 A kind of method and apparatus for being used to determine the temperature information of theme of news
CN109344316A (en) * 2018-08-14 2019-02-15 优视科技(中国)有限公司 News temperature calculates method and device
CN111461542A (en) * 2020-03-31 2020-07-28 支付宝(杭州)信息技术有限公司 Event statistical method and device
CN113449077A (en) * 2021-06-25 2021-09-28 完美世界控股集团有限公司 News popularity calculation method, equipment and storage medium

Also Published As

Publication number Publication date
CN113449077A (en) 2021-09-28
CN113449077B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
KR100462292B1 (en) A method for providing search results list based on importance information and a system thereof
KR100544514B1 (en) Method and system for determining relation between search terms in the internet search system
US20130290232A1 (en) Identifying news events that cause a shift in sentiment
US9798831B2 (en) Processing data in a MapReduce framework
CN108733816B (en) Microblog emergency detection method
US20070198459A1 (en) System and method for online information analysis
US20140214835A1 (en) System and method for automatically classifying documents
US11176586B2 (en) Data analysis method and system thereof
US9344507B2 (en) Method of processing web access information and server implementing same
US11609959B2 (en) System and methods for generating an enhanced output of relevant content to facilitate content analysis
TW201923629A (en) Data processing method and apparatus
Rahnama Distributed real-time sentiment analysis for big data social streams
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
WO2022267325A1 (en) News popularity calculation method, device and storage medium
JP2008310626A (en) Automatic tag impartment device, automatic tag impartment method, automatic tag impartment program and recording medium recording the program
KR102126911B1 (en) Key player detection method in social media using KeyplayerRank
US9336280B2 (en) Method for entity-driven alerts based on disambiguated features
US20220343353A1 (en) Identifying Competitors of Companies
WO2016027364A1 (en) Topic cluster selection device, and search method
CN113449078A (en) Similar news identification method, equipment, system and storage medium
JP5955817B2 (en) Extraction apparatus, extraction method and program
US10643227B1 (en) Business lines
CN111461542A (en) Event statistical method and device
WO2018210045A1 (en) Method and device for identifying native object
KR100994326B1 (en) A method for providing search results list based on importance information and a system thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946803

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE